summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/html/populate.html
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--doc/src/sgml/html/populate.html206
1 files changed, 206 insertions, 0 deletions
diff --git a/doc/src/sgml/html/populate.html b/doc/src/sgml/html/populate.html
new file mode 100644
index 0000000..b49980b
--- /dev/null
+++ b/doc/src/sgml/html/populate.html
@@ -0,0 +1,206 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>14.4. Populating a Database</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses" /><link rel="next" href="non-durability.html" title="14.5. Non-Durable Settings" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">14.4. Populating a Database</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><th width="60%" align="center">Chapter 14. Performance Tips</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 16.2 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="non-durability.html" title="14.5. Non-Durable Settings">Next</a></td></tr></table><hr /></div><div class="sect1" id="POPULATE"><div class="titlepage"><div><div><h2 class="title" style="clear: both">14.4. Populating a Database <a href="#POPULATE" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="populate.html#DISABLE-AUTOCOMMIT">14.4.1. Disable Autocommit</a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-COPY-FROM">14.4.2. Use <code class="command">COPY</code></a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-RM-INDEXES">14.4.3. Remove Indexes</a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-RM-FKEYS">14.4.4. Remove Foreign Key Constraints</a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-WORK-MEM">14.4.5. Increase <code class="varname">maintenance_work_mem</code></a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-MAX-WAL-SIZE">14.4.6. Increase <code class="varname">max_wal_size</code></a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-PITR">14.4.7. Disable WAL Archival and Streaming Replication</a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-ANALYZE">14.4.8. Run <code class="command">ANALYZE</code> Afterwards</a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-PG-DUMP">14.4.9. Some Notes about <span class="application">pg_dump</span></a></span></dt></dl></div><p>
+ One might need to insert a large amount of data when first populating
+ a database. This section contains some suggestions on how to make
+ this process as efficient as possible.
+ </p><div class="sect2" id="DISABLE-AUTOCOMMIT"><div class="titlepage"><div><div><h3 class="title">14.4.1. Disable Autocommit <a href="#DISABLE-AUTOCOMMIT" class="id_link">#</a></h3></div></div></div><a id="id-1.5.13.7.3.2" class="indexterm"></a><p>
+ When using multiple <code class="command">INSERT</code>s, turn off autocommit and just do
+ one commit at the end. (In plain
+ SQL, this means issuing <code class="command">BEGIN</code> at the start and
+ <code class="command">COMMIT</code> at the end. Some client libraries might
+ do this behind your back, in which case you need to make sure the
+ library does it when you want it done.) If you allow each
+ insertion to be committed separately,
+ <span class="productname">PostgreSQL</span> is doing a lot of work for
+ each row that is added. An additional benefit of doing all
+ insertions in one transaction is that if the insertion of one row
+ were to fail then the insertion of all rows inserted up to that
+ point would be rolled back, so you won't be stuck with partially
+ loaded data.
+ </p></div><div class="sect2" id="POPULATE-COPY-FROM"><div class="titlepage"><div><div><h3 class="title">14.4.2. Use <code class="command">COPY</code> <a href="#POPULATE-COPY-FROM" class="id_link">#</a></h3></div></div></div><p>
+ Use <a class="link" href="sql-copy.html" title="COPY"><code class="command">COPY</code></a> to load
+ all the rows in one command, instead of using a series of
+ <code class="command">INSERT</code> commands. The <code class="command">COPY</code>
+ command is optimized for loading large numbers of rows; it is less
+ flexible than <code class="command">INSERT</code>, but incurs significantly
+ less overhead for large data loads. Since <code class="command">COPY</code>
+ is a single command, there is no need to disable autocommit if you
+ use this method to populate a table.
+ </p><p>
+ If you cannot use <code class="command">COPY</code>, it might help to use <a class="link" href="sql-prepare.html" title="PREPARE"><code class="command">PREPARE</code></a> to create a
+ prepared <code class="command">INSERT</code> statement, and then use
+ <code class="command">EXECUTE</code> as many times as required. This avoids
+ some of the overhead of repeatedly parsing and planning
+ <code class="command">INSERT</code>. Different interfaces provide this facility
+ in different ways; look for <span class="quote">“<span class="quote">prepared statements</span>”</span> in the interface
+ documentation.
+ </p><p>
+ Note that loading a large number of rows using
+ <code class="command">COPY</code> is almost always faster than using
+ <code class="command">INSERT</code>, even if <code class="command">PREPARE</code> is used and
+ multiple insertions are batched into a single transaction.
+ </p><p>
+ <code class="command">COPY</code> is fastest when used within the same
+ transaction as an earlier <code class="command">CREATE TABLE</code> or
+ <code class="command">TRUNCATE</code> command. In such cases no WAL
+ needs to be written, because in case of an error, the files
+ containing the newly loaded data will be removed anyway.
+ However, this consideration only applies when
+ <a class="xref" href="runtime-config-wal.html#GUC-WAL-LEVEL">wal_level</a> is <code class="literal">minimal</code>
+ as all commands must write WAL otherwise.
+ </p></div><div class="sect2" id="POPULATE-RM-INDEXES"><div class="titlepage"><div><div><h3 class="title">14.4.3. Remove Indexes <a href="#POPULATE-RM-INDEXES" class="id_link">#</a></h3></div></div></div><p>
+ If you are loading a freshly created table, the fastest method is to
+ create the table, bulk load the table's data using
+ <code class="command">COPY</code>, then create any indexes needed for the
+ table. Creating an index on pre-existing data is quicker than
+ updating it incrementally as each row is loaded.
+ </p><p>
+ If you are adding large amounts of data to an existing table,
+ it might be a win to drop the indexes,
+ load the table, and then recreate the indexes. Of course, the
+ database performance for other users might suffer
+ during the time the indexes are missing. One should also think
+ twice before dropping a unique index, since the error checking
+ afforded by the unique constraint will be lost while the index is
+ missing.
+ </p></div><div class="sect2" id="POPULATE-RM-FKEYS"><div class="titlepage"><div><div><h3 class="title">14.4.4. Remove Foreign Key Constraints <a href="#POPULATE-RM-FKEYS" class="id_link">#</a></h3></div></div></div><p>
+ Just as with indexes, a foreign key constraint can be checked
+ <span class="quote">“<span class="quote">in bulk</span>”</span> more efficiently than row-by-row. So it might be
+ useful to drop foreign key constraints, load data, and re-create
+ the constraints. Again, there is a trade-off between data load
+ speed and loss of error checking while the constraint is missing.
+ </p><p>
+ What's more, when you load data into a table with existing foreign key
+ constraints, each new row requires an entry in the server's list of
+ pending trigger events (since it is the firing of a trigger that checks
+ the row's foreign key constraint). Loading many millions of rows can
+ cause the trigger event queue to overflow available memory, leading to
+ intolerable swapping or even outright failure of the command. Therefore
+ it may be <span class="emphasis"><em>necessary</em></span>, not just desirable, to drop and re-apply
+ foreign keys when loading large amounts of data. If temporarily removing
+ the constraint isn't acceptable, the only other recourse may be to split
+ up the load operation into smaller transactions.
+ </p></div><div class="sect2" id="POPULATE-WORK-MEM"><div class="titlepage"><div><div><h3 class="title">14.4.5. Increase <code class="varname">maintenance_work_mem</code> <a href="#POPULATE-WORK-MEM" class="id_link">#</a></h3></div></div></div><p>
+ Temporarily increasing the <a class="xref" href="runtime-config-resource.html#GUC-MAINTENANCE-WORK-MEM">maintenance_work_mem</a>
+ configuration variable when loading large amounts of data can
+ lead to improved performance. This will help to speed up <code class="command">CREATE
+ INDEX</code> commands and <code class="command">ALTER TABLE ADD FOREIGN KEY</code> commands.
+ It won't do much for <code class="command">COPY</code> itself, so this advice is
+ only useful when you are using one or both of the above techniques.
+ </p></div><div class="sect2" id="POPULATE-MAX-WAL-SIZE"><div class="titlepage"><div><div><h3 class="title">14.4.6. Increase <code class="varname">max_wal_size</code> <a href="#POPULATE-MAX-WAL-SIZE" class="id_link">#</a></h3></div></div></div><p>
+ Temporarily increasing the <a class="xref" href="runtime-config-wal.html#GUC-MAX-WAL-SIZE">max_wal_size</a>
+ configuration variable can also
+ make large data loads faster. This is because loading a large
+ amount of data into <span class="productname">PostgreSQL</span> will
+ cause checkpoints to occur more often than the normal checkpoint
+ frequency (specified by the <code class="varname">checkpoint_timeout</code>
+ configuration variable). Whenever a checkpoint occurs, all dirty
+ pages must be flushed to disk. By increasing
+ <code class="varname">max_wal_size</code> temporarily during bulk
+ data loads, the number of checkpoints that are required can be
+ reduced.
+ </p></div><div class="sect2" id="POPULATE-PITR"><div class="titlepage"><div><div><h3 class="title">14.4.7. Disable WAL Archival and Streaming Replication <a href="#POPULATE-PITR" class="id_link">#</a></h3></div></div></div><p>
+ When loading large amounts of data into an installation that uses
+ WAL archiving or streaming replication, it might be faster to take a
+ new base backup after the load has completed than to process a large
+ amount of incremental WAL data. To prevent incremental WAL logging
+ while loading, disable archiving and streaming replication, by setting
+ <a class="xref" href="runtime-config-wal.html#GUC-WAL-LEVEL">wal_level</a> to <code class="literal">minimal</code>,
+ <a class="xref" href="runtime-config-wal.html#GUC-ARCHIVE-MODE">archive_mode</a> to <code class="literal">off</code>, and
+ <a class="xref" href="runtime-config-replication.html#GUC-MAX-WAL-SENDERS">max_wal_senders</a> to zero.
+ But note that changing these settings requires a server restart,
+ and makes any base backups taken before unavailable for archive
+ recovery and standby server, which may lead to data loss.
+ </p><p>
+ Aside from avoiding the time for the archiver or WAL sender to process the
+ WAL data, doing this will actually make certain commands faster, because
+ they do not to write WAL at all if <code class="varname">wal_level</code>
+ is <code class="literal">minimal</code> and the current subtransaction (or top-level
+ transaction) created or truncated the table or index they change. (They
+ can guarantee crash safety more cheaply by doing
+ an <code class="function">fsync</code> at the end than by writing WAL.)
+ </p></div><div class="sect2" id="POPULATE-ANALYZE"><div class="titlepage"><div><div><h3 class="title">14.4.8. Run <code class="command">ANALYZE</code> Afterwards <a href="#POPULATE-ANALYZE" class="id_link">#</a></h3></div></div></div><p>
+ Whenever you have significantly altered the distribution of data
+ within a table, running <a class="link" href="sql-analyze.html" title="ANALYZE"><code class="command">ANALYZE</code></a> is strongly recommended. This
+ includes bulk loading large amounts of data into the table. Running
+ <code class="command">ANALYZE</code> (or <code class="command">VACUUM ANALYZE</code>)
+ ensures that the planner has up-to-date statistics about the
+ table. With no statistics or obsolete statistics, the planner might
+ make poor decisions during query planning, leading to poor
+ performance on any tables with inaccurate or nonexistent
+ statistics. Note that if the autovacuum daemon is enabled, it might
+ run <code class="command">ANALYZE</code> automatically; see
+ <a class="xref" href="routine-vacuuming.html#VACUUM-FOR-STATISTICS" title="25.1.3. Updating Planner Statistics">Section 25.1.3</a>
+ and <a class="xref" href="routine-vacuuming.html#AUTOVACUUM" title="25.1.6. The Autovacuum Daemon">Section 25.1.6</a> for more information.
+ </p></div><div class="sect2" id="POPULATE-PG-DUMP"><div class="titlepage"><div><div><h3 class="title">14.4.9. Some Notes about <span class="application">pg_dump</span> <a href="#POPULATE-PG-DUMP" class="id_link">#</a></h3></div></div></div><p>
+ Dump scripts generated by <span class="application">pg_dump</span> automatically apply
+ several, but not all, of the above guidelines. To restore a
+ <span class="application">pg_dump</span> dump as quickly as possible, you need to
+ do a few extra things manually. (Note that these points apply while
+ <span class="emphasis"><em>restoring</em></span> a dump, not while <span class="emphasis"><em>creating</em></span> it.
+ The same points apply whether loading a text dump with
+ <span class="application">psql</span> or using <span class="application">pg_restore</span> to load
+ from a <span class="application">pg_dump</span> archive file.)
+ </p><p>
+ By default, <span class="application">pg_dump</span> uses <code class="command">COPY</code>, and when
+ it is generating a complete schema-and-data dump, it is careful to
+ load data before creating indexes and foreign keys. So in this case
+ several guidelines are handled automatically. What is left
+ for you to do is to:
+ </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
+ Set appropriate (i.e., larger than normal) values for
+ <code class="varname">maintenance_work_mem</code> and
+ <code class="varname">max_wal_size</code>.
+ </p></li><li class="listitem"><p>
+ If using WAL archiving or streaming replication, consider disabling
+ them during the restore. To do that, set <code class="varname">archive_mode</code>
+ to <code class="literal">off</code>,
+ <code class="varname">wal_level</code> to <code class="literal">minimal</code>, and
+ <code class="varname">max_wal_senders</code> to zero before loading the dump.
+ Afterwards, set them back to the right values and take a fresh
+ base backup.
+ </p></li><li class="listitem"><p>
+ Experiment with the parallel dump and restore modes of both
+ <span class="application">pg_dump</span> and <span class="application">pg_restore</span> and find the
+ optimal number of concurrent jobs to use. Dumping and restoring in
+ parallel by means of the <code class="option">-j</code> option should give you a
+ significantly higher performance over the serial mode.
+ </p></li><li class="listitem"><p>
+ Consider whether the whole dump should be restored as a single
+ transaction. To do that, pass the <code class="option">-1</code> or
+ <code class="option">--single-transaction</code> command-line option to
+ <span class="application">psql</span> or <span class="application">pg_restore</span>. When using this
+ mode, even the smallest of errors will rollback the entire restore,
+ possibly discarding many hours of processing. Depending on how
+ interrelated the data is, that might seem preferable to manual cleanup,
+ or not. <code class="command">COPY</code> commands will run fastest if you use a single
+ transaction and have WAL archiving turned off.
+ </p></li><li class="listitem"><p>
+ If multiple CPUs are available in the database server, consider using
+ <span class="application">pg_restore</span>'s <code class="option">--jobs</code> option. This
+ allows concurrent data loading and index creation.
+ </p></li><li class="listitem"><p>
+ Run <code class="command">ANALYZE</code> afterwards.
+ </p></li></ul></div><p>
+ </p><p>
+ A data-only dump will still use <code class="command">COPY</code>, but it does not
+ drop or recreate indexes, and it does not normally touch foreign
+ keys.
+
+ <a href="#ftn.id-1.5.13.7.11.4.2" class="footnote"><sup class="footnote" id="id-1.5.13.7.11.4.2">[14]</sup></a>
+
+ So when loading a data-only dump, it is up to you to drop and recreate
+ indexes and foreign keys if you wish to use those techniques.
+ It's still useful to increase <code class="varname">max_wal_size</code>
+ while loading the data, but don't bother increasing
+ <code class="varname">maintenance_work_mem</code>; rather, you'd do that while
+ manually recreating indexes and foreign keys afterwards.
+ And don't forget to <code class="command">ANALYZE</code> when you're done; see
+ <a class="xref" href="routine-vacuuming.html#VACUUM-FOR-STATISTICS" title="25.1.3. Updating Planner Statistics">Section 25.1.3</a>
+ and <a class="xref" href="routine-vacuuming.html#AUTOVACUUM" title="25.1.6. The Autovacuum Daemon">Section 25.1.6</a> for more information.
+ </p></div><div class="footnotes"><br /><hr style="width:100; text-align:left;margin-left: 0" /><div id="ftn.id-1.5.13.7.11.4.2" class="footnote"><p><a href="#id-1.5.13.7.11.4.2" class="para"><sup class="para">[14] </sup></a>
+ You can get the effect of disabling foreign keys by using
+ the <code class="option">--disable-triggers</code> option — but realize that
+ that eliminates, rather than just postpones, foreign key
+ validation, and so it is possible to insert bad data if you use it.
+ </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="non-durability.html" title="14.5. Non-Durable Settings">Next</a></td></tr><tr><td width="40%" align="left" valign="top">14.3. Controlling the Planner with Explicit <code class="literal">JOIN</code> Clauses </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 16.2 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 14.5. Non-Durable Settings</td></tr></table></div></body></html> \ No newline at end of file