diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-13 13:44:03 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-13 13:44:03 +0000 |
commit | 293913568e6a7a86fd1479e1cff8e2ecb58d6568 (patch) | |
tree | fc3b469a3ec5ab71b36ea97cc7aaddb838423a0c /doc/src/sgml/html/populate.html | |
parent | Initial commit. (diff) | |
download | postgresql-16-293913568e6a7a86fd1479e1cff8e2ecb58d6568.tar.xz postgresql-16-293913568e6a7a86fd1479e1cff8e2ecb58d6568.zip |
Adding upstream version 16.2.upstream/16.2
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r-- | doc/src/sgml/html/populate.html | 206 |
1 files changed, 206 insertions, 0 deletions
diff --git a/doc/src/sgml/html/populate.html b/doc/src/sgml/html/populate.html new file mode 100644 index 0000000..b49980b --- /dev/null +++ b/doc/src/sgml/html/populate.html @@ -0,0 +1,206 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>14.4. Populating a Database</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses" /><link rel="next" href="non-durability.html" title="14.5. Non-Durable Settings" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">14.4. Populating a Database</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><th width="60%" align="center">Chapter 14. Performance Tips</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 16.2 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="non-durability.html" title="14.5. Non-Durable Settings">Next</a></td></tr></table><hr /></div><div class="sect1" id="POPULATE"><div class="titlepage"><div><div><h2 class="title" style="clear: both">14.4. Populating a Database <a href="#POPULATE" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="populate.html#DISABLE-AUTOCOMMIT">14.4.1. Disable Autocommit</a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-COPY-FROM">14.4.2. Use <code class="command">COPY</code></a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-RM-INDEXES">14.4.3. Remove Indexes</a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-RM-FKEYS">14.4.4. Remove Foreign Key Constraints</a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-WORK-MEM">14.4.5. Increase <code class="varname">maintenance_work_mem</code></a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-MAX-WAL-SIZE">14.4.6. Increase <code class="varname">max_wal_size</code></a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-PITR">14.4.7. Disable WAL Archival and Streaming Replication</a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-ANALYZE">14.4.8. Run <code class="command">ANALYZE</code> Afterwards</a></span></dt><dt><span class="sect2"><a href="populate.html#POPULATE-PG-DUMP">14.4.9. Some Notes about <span class="application">pg_dump</span></a></span></dt></dl></div><p> + One might need to insert a large amount of data when first populating + a database. This section contains some suggestions on how to make + this process as efficient as possible. + </p><div class="sect2" id="DISABLE-AUTOCOMMIT"><div class="titlepage"><div><div><h3 class="title">14.4.1. Disable Autocommit <a href="#DISABLE-AUTOCOMMIT" class="id_link">#</a></h3></div></div></div><a id="id-1.5.13.7.3.2" class="indexterm"></a><p> + When using multiple <code class="command">INSERT</code>s, turn off autocommit and just do + one commit at the end. (In plain + SQL, this means issuing <code class="command">BEGIN</code> at the start and + <code class="command">COMMIT</code> at the end. Some client libraries might + do this behind your back, in which case you need to make sure the + library does it when you want it done.) If you allow each + insertion to be committed separately, + <span class="productname">PostgreSQL</span> is doing a lot of work for + each row that is added. An additional benefit of doing all + insertions in one transaction is that if the insertion of one row + were to fail then the insertion of all rows inserted up to that + point would be rolled back, so you won't be stuck with partially + loaded data. + </p></div><div class="sect2" id="POPULATE-COPY-FROM"><div class="titlepage"><div><div><h3 class="title">14.4.2. Use <code class="command">COPY</code> <a href="#POPULATE-COPY-FROM" class="id_link">#</a></h3></div></div></div><p> + Use <a class="link" href="sql-copy.html" title="COPY"><code class="command">COPY</code></a> to load + all the rows in one command, instead of using a series of + <code class="command">INSERT</code> commands. The <code class="command">COPY</code> + command is optimized for loading large numbers of rows; it is less + flexible than <code class="command">INSERT</code>, but incurs significantly + less overhead for large data loads. Since <code class="command">COPY</code> + is a single command, there is no need to disable autocommit if you + use this method to populate a table. + </p><p> + If you cannot use <code class="command">COPY</code>, it might help to use <a class="link" href="sql-prepare.html" title="PREPARE"><code class="command">PREPARE</code></a> to create a + prepared <code class="command">INSERT</code> statement, and then use + <code class="command">EXECUTE</code> as many times as required. This avoids + some of the overhead of repeatedly parsing and planning + <code class="command">INSERT</code>. Different interfaces provide this facility + in different ways; look for <span class="quote">“<span class="quote">prepared statements</span>”</span> in the interface + documentation. + </p><p> + Note that loading a large number of rows using + <code class="command">COPY</code> is almost always faster than using + <code class="command">INSERT</code>, even if <code class="command">PREPARE</code> is used and + multiple insertions are batched into a single transaction. + </p><p> + <code class="command">COPY</code> is fastest when used within the same + transaction as an earlier <code class="command">CREATE TABLE</code> or + <code class="command">TRUNCATE</code> command. In such cases no WAL + needs to be written, because in case of an error, the files + containing the newly loaded data will be removed anyway. + However, this consideration only applies when + <a class="xref" href="runtime-config-wal.html#GUC-WAL-LEVEL">wal_level</a> is <code class="literal">minimal</code> + as all commands must write WAL otherwise. + </p></div><div class="sect2" id="POPULATE-RM-INDEXES"><div class="titlepage"><div><div><h3 class="title">14.4.3. Remove Indexes <a href="#POPULATE-RM-INDEXES" class="id_link">#</a></h3></div></div></div><p> + If you are loading a freshly created table, the fastest method is to + create the table, bulk load the table's data using + <code class="command">COPY</code>, then create any indexes needed for the + table. Creating an index on pre-existing data is quicker than + updating it incrementally as each row is loaded. + </p><p> + If you are adding large amounts of data to an existing table, + it might be a win to drop the indexes, + load the table, and then recreate the indexes. Of course, the + database performance for other users might suffer + during the time the indexes are missing. One should also think + twice before dropping a unique index, since the error checking + afforded by the unique constraint will be lost while the index is + missing. + </p></div><div class="sect2" id="POPULATE-RM-FKEYS"><div class="titlepage"><div><div><h3 class="title">14.4.4. Remove Foreign Key Constraints <a href="#POPULATE-RM-FKEYS" class="id_link">#</a></h3></div></div></div><p> + Just as with indexes, a foreign key constraint can be checked + <span class="quote">“<span class="quote">in bulk</span>”</span> more efficiently than row-by-row. So it might be + useful to drop foreign key constraints, load data, and re-create + the constraints. Again, there is a trade-off between data load + speed and loss of error checking while the constraint is missing. + </p><p> + What's more, when you load data into a table with existing foreign key + constraints, each new row requires an entry in the server's list of + pending trigger events (since it is the firing of a trigger that checks + the row's foreign key constraint). Loading many millions of rows can + cause the trigger event queue to overflow available memory, leading to + intolerable swapping or even outright failure of the command. Therefore + it may be <span class="emphasis"><em>necessary</em></span>, not just desirable, to drop and re-apply + foreign keys when loading large amounts of data. If temporarily removing + the constraint isn't acceptable, the only other recourse may be to split + up the load operation into smaller transactions. + </p></div><div class="sect2" id="POPULATE-WORK-MEM"><div class="titlepage"><div><div><h3 class="title">14.4.5. Increase <code class="varname">maintenance_work_mem</code> <a href="#POPULATE-WORK-MEM" class="id_link">#</a></h3></div></div></div><p> + Temporarily increasing the <a class="xref" href="runtime-config-resource.html#GUC-MAINTENANCE-WORK-MEM">maintenance_work_mem</a> + configuration variable when loading large amounts of data can + lead to improved performance. This will help to speed up <code class="command">CREATE + INDEX</code> commands and <code class="command">ALTER TABLE ADD FOREIGN KEY</code> commands. + It won't do much for <code class="command">COPY</code> itself, so this advice is + only useful when you are using one or both of the above techniques. + </p></div><div class="sect2" id="POPULATE-MAX-WAL-SIZE"><div class="titlepage"><div><div><h3 class="title">14.4.6. Increase <code class="varname">max_wal_size</code> <a href="#POPULATE-MAX-WAL-SIZE" class="id_link">#</a></h3></div></div></div><p> + Temporarily increasing the <a class="xref" href="runtime-config-wal.html#GUC-MAX-WAL-SIZE">max_wal_size</a> + configuration variable can also + make large data loads faster. This is because loading a large + amount of data into <span class="productname">PostgreSQL</span> will + cause checkpoints to occur more often than the normal checkpoint + frequency (specified by the <code class="varname">checkpoint_timeout</code> + configuration variable). Whenever a checkpoint occurs, all dirty + pages must be flushed to disk. By increasing + <code class="varname">max_wal_size</code> temporarily during bulk + data loads, the number of checkpoints that are required can be + reduced. + </p></div><div class="sect2" id="POPULATE-PITR"><div class="titlepage"><div><div><h3 class="title">14.4.7. Disable WAL Archival and Streaming Replication <a href="#POPULATE-PITR" class="id_link">#</a></h3></div></div></div><p> + When loading large amounts of data into an installation that uses + WAL archiving or streaming replication, it might be faster to take a + new base backup after the load has completed than to process a large + amount of incremental WAL data. To prevent incremental WAL logging + while loading, disable archiving and streaming replication, by setting + <a class="xref" href="runtime-config-wal.html#GUC-WAL-LEVEL">wal_level</a> to <code class="literal">minimal</code>, + <a class="xref" href="runtime-config-wal.html#GUC-ARCHIVE-MODE">archive_mode</a> to <code class="literal">off</code>, and + <a class="xref" href="runtime-config-replication.html#GUC-MAX-WAL-SENDERS">max_wal_senders</a> to zero. + But note that changing these settings requires a server restart, + and makes any base backups taken before unavailable for archive + recovery and standby server, which may lead to data loss. + </p><p> + Aside from avoiding the time for the archiver or WAL sender to process the + WAL data, doing this will actually make certain commands faster, because + they do not to write WAL at all if <code class="varname">wal_level</code> + is <code class="literal">minimal</code> and the current subtransaction (or top-level + transaction) created or truncated the table or index they change. (They + can guarantee crash safety more cheaply by doing + an <code class="function">fsync</code> at the end than by writing WAL.) + </p></div><div class="sect2" id="POPULATE-ANALYZE"><div class="titlepage"><div><div><h3 class="title">14.4.8. Run <code class="command">ANALYZE</code> Afterwards <a href="#POPULATE-ANALYZE" class="id_link">#</a></h3></div></div></div><p> + Whenever you have significantly altered the distribution of data + within a table, running <a class="link" href="sql-analyze.html" title="ANALYZE"><code class="command">ANALYZE</code></a> is strongly recommended. This + includes bulk loading large amounts of data into the table. Running + <code class="command">ANALYZE</code> (or <code class="command">VACUUM ANALYZE</code>) + ensures that the planner has up-to-date statistics about the + table. With no statistics or obsolete statistics, the planner might + make poor decisions during query planning, leading to poor + performance on any tables with inaccurate or nonexistent + statistics. Note that if the autovacuum daemon is enabled, it might + run <code class="command">ANALYZE</code> automatically; see + <a class="xref" href="routine-vacuuming.html#VACUUM-FOR-STATISTICS" title="25.1.3. Updating Planner Statistics">Section 25.1.3</a> + and <a class="xref" href="routine-vacuuming.html#AUTOVACUUM" title="25.1.6. The Autovacuum Daemon">Section 25.1.6</a> for more information. + </p></div><div class="sect2" id="POPULATE-PG-DUMP"><div class="titlepage"><div><div><h3 class="title">14.4.9. Some Notes about <span class="application">pg_dump</span> <a href="#POPULATE-PG-DUMP" class="id_link">#</a></h3></div></div></div><p> + Dump scripts generated by <span class="application">pg_dump</span> automatically apply + several, but not all, of the above guidelines. To restore a + <span class="application">pg_dump</span> dump as quickly as possible, you need to + do a few extra things manually. (Note that these points apply while + <span class="emphasis"><em>restoring</em></span> a dump, not while <span class="emphasis"><em>creating</em></span> it. + The same points apply whether loading a text dump with + <span class="application">psql</span> or using <span class="application">pg_restore</span> to load + from a <span class="application">pg_dump</span> archive file.) + </p><p> + By default, <span class="application">pg_dump</span> uses <code class="command">COPY</code>, and when + it is generating a complete schema-and-data dump, it is careful to + load data before creating indexes and foreign keys. So in this case + several guidelines are handled automatically. What is left + for you to do is to: + </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p> + Set appropriate (i.e., larger than normal) values for + <code class="varname">maintenance_work_mem</code> and + <code class="varname">max_wal_size</code>. + </p></li><li class="listitem"><p> + If using WAL archiving or streaming replication, consider disabling + them during the restore. To do that, set <code class="varname">archive_mode</code> + to <code class="literal">off</code>, + <code class="varname">wal_level</code> to <code class="literal">minimal</code>, and + <code class="varname">max_wal_senders</code> to zero before loading the dump. + Afterwards, set them back to the right values and take a fresh + base backup. + </p></li><li class="listitem"><p> + Experiment with the parallel dump and restore modes of both + <span class="application">pg_dump</span> and <span class="application">pg_restore</span> and find the + optimal number of concurrent jobs to use. Dumping and restoring in + parallel by means of the <code class="option">-j</code> option should give you a + significantly higher performance over the serial mode. + </p></li><li class="listitem"><p> + Consider whether the whole dump should be restored as a single + transaction. To do that, pass the <code class="option">-1</code> or + <code class="option">--single-transaction</code> command-line option to + <span class="application">psql</span> or <span class="application">pg_restore</span>. When using this + mode, even the smallest of errors will rollback the entire restore, + possibly discarding many hours of processing. Depending on how + interrelated the data is, that might seem preferable to manual cleanup, + or not. <code class="command">COPY</code> commands will run fastest if you use a single + transaction and have WAL archiving turned off. + </p></li><li class="listitem"><p> + If multiple CPUs are available in the database server, consider using + <span class="application">pg_restore</span>'s <code class="option">--jobs</code> option. This + allows concurrent data loading and index creation. + </p></li><li class="listitem"><p> + Run <code class="command">ANALYZE</code> afterwards. + </p></li></ul></div><p> + </p><p> + A data-only dump will still use <code class="command">COPY</code>, but it does not + drop or recreate indexes, and it does not normally touch foreign + keys. + + <a href="#ftn.id-1.5.13.7.11.4.2" class="footnote"><sup class="footnote" id="id-1.5.13.7.11.4.2">[14]</sup></a> + + So when loading a data-only dump, it is up to you to drop and recreate + indexes and foreign keys if you wish to use those techniques. + It's still useful to increase <code class="varname">max_wal_size</code> + while loading the data, but don't bother increasing + <code class="varname">maintenance_work_mem</code>; rather, you'd do that while + manually recreating indexes and foreign keys afterwards. + And don't forget to <code class="command">ANALYZE</code> when you're done; see + <a class="xref" href="routine-vacuuming.html#VACUUM-FOR-STATISTICS" title="25.1.3. Updating Planner Statistics">Section 25.1.3</a> + and <a class="xref" href="routine-vacuuming.html#AUTOVACUUM" title="25.1.6. The Autovacuum Daemon">Section 25.1.6</a> for more information. + </p></div><div class="footnotes"><br /><hr style="width:100; text-align:left;margin-left: 0" /><div id="ftn.id-1.5.13.7.11.4.2" class="footnote"><p><a href="#id-1.5.13.7.11.4.2" class="para"><sup class="para">[14] </sup></a> + You can get the effect of disabling foreign keys by using + the <code class="option">--disable-triggers</code> option — but realize that + that eliminates, rather than just postpones, foreign key + validation, and so it is possible to insert bad data if you use it. + </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="non-durability.html" title="14.5. Non-Durable Settings">Next</a></td></tr><tr><td width="40%" align="left" valign="top">14.3. Controlling the Planner with Explicit <code class="literal">JOIN</code> Clauses </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 16.2 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 14.5. Non-Durable Settings</td></tr></table></div></body></html>
\ No newline at end of file |