diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-16 19:46:48 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-16 19:46:48 +0000 |
commit | 311bcfc6b3acdd6fd152798c7f287ddf74fa2a98 (patch) | |
tree | 0ec307299b1dada3701e42f4ca6eda57d708261e /doc/src/sgml/html/wal-configuration.html | |
parent | Initial commit. (diff) | |
download | postgresql-15-311bcfc6b3acdd6fd152798c7f287ddf74fa2a98.tar.xz postgresql-15-311bcfc6b3acdd6fd152798c7f287ddf74fa2a98.zip |
Adding upstream version 15.4.upstream/15.4upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/src/sgml/html/wal-configuration.html')
-rw-r--r-- | doc/src/sgml/html/wal-configuration.html | 295 |
1 files changed, 295 insertions, 0 deletions
diff --git a/doc/src/sgml/html/wal-configuration.html b/doc/src/sgml/html/wal-configuration.html new file mode 100644 index 0000000..e683789 --- /dev/null +++ b/doc/src/sgml/html/wal-configuration.html @@ -0,0 +1,295 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>30.5. WAL Configuration</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="wal-async-commit.html" title="30.4. Asynchronous Commit" /><link rel="next" href="wal-internals.html" title="30.6. WAL Internals" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">30.5. <acronym class="acronym">WAL</acronym> Configuration</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="wal-async-commit.html" title="30.4. Asynchronous Commit">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="wal.html" title="Chapter 30. Reliability and the Write-Ahead Log">Up</a></td><th width="60%" align="center">Chapter 30. Reliability and the Write-Ahead Log</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="wal-internals.html" title="30.6. WAL Internals">Next</a></td></tr></table><hr /></div><div class="sect1" id="WAL-CONFIGURATION"><div class="titlepage"><div><div><h2 class="title" style="clear: both">30.5. <acronym class="acronym">WAL</acronym> Configuration</h2></div></div></div><p> + There are several <acronym class="acronym">WAL</acronym>-related configuration parameters that + affect database performance. This section explains their use. + Consult <a class="xref" href="runtime-config.html" title="Chapter 20. Server Configuration">Chapter 20</a> for general information about + setting server configuration parameters. + </p><p> + <em class="firstterm">Checkpoints</em><a id="id-1.6.17.7.3.2" class="indexterm"></a> + are points in the sequence of transactions at which it is guaranteed + that the heap and index data files have been updated with all + information written before that checkpoint. At checkpoint time, all + dirty data pages are flushed to disk and a special checkpoint record is + written to the log file. (The change records were previously flushed + to the <acronym class="acronym">WAL</acronym> files.) + In the event of a crash, the crash recovery procedure looks at the latest + checkpoint record to determine the point in the log (known as the redo + record) from which it should start the REDO operation. Any changes made to + data files before that point are guaranteed to be already on disk. + Hence, after a checkpoint, log segments preceding the one containing + the redo record are no longer needed and can be recycled or removed. (When + <acronym class="acronym">WAL</acronym> archiving is being done, the log segments must be + archived before being recycled or removed.) + </p><p> + The checkpoint requirement of flushing all dirty data pages to disk + can cause a significant I/O load. For this reason, checkpoint + activity is throttled so that I/O begins at checkpoint start and completes + before the next checkpoint is due to start; this minimizes performance + degradation during checkpoints. + </p><p> + The server's checkpointer process automatically performs + a checkpoint every so often. A checkpoint is begun every <a class="xref" href="runtime-config-wal.html#GUC-CHECKPOINT-TIMEOUT">checkpoint_timeout</a> seconds, or if + <a class="xref" href="runtime-config-wal.html#GUC-MAX-WAL-SIZE">max_wal_size</a> is about to be exceeded, + whichever comes first. + The default settings are 5 minutes and 1 GB, respectively. + If no WAL has been written since the previous checkpoint, new checkpoints + will be skipped even if <code class="varname">checkpoint_timeout</code> has passed. + (If WAL archiving is being used and you want to put a lower limit on how + often files are archived in order to bound potential data loss, you should + adjust the <a class="xref" href="runtime-config-wal.html#GUC-ARCHIVE-TIMEOUT">archive_timeout</a> parameter rather than the + checkpoint parameters.) + It is also possible to force a checkpoint by using the SQL + command <code class="command">CHECKPOINT</code>. + </p><p> + Reducing <code class="varname">checkpoint_timeout</code> and/or + <code class="varname">max_wal_size</code> causes checkpoints to occur + more often. This allows faster after-crash recovery, since less work + will need to be redone. However, one must balance this against the + increased cost of flushing dirty data pages more often. If + <a class="xref" href="runtime-config-wal.html#GUC-FULL-PAGE-WRITES">full_page_writes</a> is set (as is the default), there is + another factor to consider. To ensure data page consistency, + the first modification of a data page after each checkpoint results in + logging the entire page content. In that case, + a smaller checkpoint interval increases the volume of output to the WAL log, + partially negating the goal of using a smaller interval, + and in any case causing more disk I/O. + </p><p> + Checkpoints are fairly expensive, first because they require writing + out all currently dirty buffers, and second because they result in + extra subsequent WAL traffic as discussed above. It is therefore + wise to set the checkpointing parameters high enough so that checkpoints + don't happen too often. As a simple sanity check on your checkpointing + parameters, you can set the <a class="xref" href="runtime-config-wal.html#GUC-CHECKPOINT-WARNING">checkpoint_warning</a> + parameter. If checkpoints happen closer together than + <code class="varname">checkpoint_warning</code> seconds, + a message will be output to the server log recommending increasing + <code class="varname">max_wal_size</code>. Occasional appearance of such + a message is not cause for alarm, but if it appears often then the + checkpoint control parameters should be increased. Bulk operations such + as large <code class="command">COPY</code> transfers might cause a number of such warnings + to appear if you have not set <code class="varname">max_wal_size</code> high + enough. + </p><p> + To avoid flooding the I/O system with a burst of page writes, + writing dirty buffers during a checkpoint is spread over a period of time. + That period is controlled by + <a class="xref" href="runtime-config-wal.html#GUC-CHECKPOINT-COMPLETION-TARGET">checkpoint_completion_target</a>, which is + given as a fraction of the checkpoint interval (configured by using + <code class="varname">checkpoint_timeout</code>). + The I/O rate is adjusted so that the checkpoint finishes when the + given fraction of + <code class="varname">checkpoint_timeout</code> seconds have elapsed, or before + <code class="varname">max_wal_size</code> is exceeded, whichever is sooner. + With the default value of 0.9, + <span class="productname">PostgreSQL</span> can be expected to complete each checkpoint + a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's + duration). This spreads out the I/O as much as possible so that the checkpoint + I/O load is consistent throughout the checkpoint interval. The disadvantage of + this is that prolonging checkpoints affects recovery time, because more WAL + segments will need to be kept around for possible use in recovery. A user + concerned about the amount of time required to recover might wish to reduce + <code class="varname">checkpoint_timeout</code> so that checkpoints occur more frequently + but still spread the I/O across the checkpoint interval. Alternatively, + <code class="varname">checkpoint_completion_target</code> could be reduced, but this would + result in times of more intense I/O (during the checkpoint) and times of less I/O + (after the checkpoint completed but before the next scheduled checkpoint) and + therefore is not recommended. + Although <code class="varname">checkpoint_completion_target</code> could be set as high as + 1.0, it is typically recommended to set it to no higher than 0.9 (the default) + since checkpoints include some other activities besides writing dirty buffers. + A setting of 1.0 is quite likely to result in checkpoints not being + completed on time, which would result in performance loss due to + unexpected variation in the number of WAL segments needed. + </p><p> + On Linux and POSIX platforms <a class="xref" href="runtime-config-wal.html#GUC-CHECKPOINT-FLUSH-AFTER">checkpoint_flush_after</a> + allows to force the OS that pages written by the checkpoint should be + flushed to disk after a configurable number of bytes. Otherwise, these + pages may be kept in the OS's page cache, inducing a stall when + <code class="literal">fsync</code> is issued at the end of a checkpoint. This setting will + often help to reduce transaction latency, but it also can have an adverse + effect on performance; particularly for workloads that are bigger than + <a class="xref" href="runtime-config-resource.html#GUC-SHARED-BUFFERS">shared_buffers</a>, but smaller than the OS's page cache. + </p><p> + The number of WAL segment files in <code class="filename">pg_wal</code> directory depends on + <code class="varname">min_wal_size</code>, <code class="varname">max_wal_size</code> and + the amount of WAL generated in previous checkpoint cycles. When old log + segment files are no longer needed, they are removed or recycled (that is, + renamed to become future segments in the numbered sequence). If, due to a + short-term peak of log output rate, <code class="varname">max_wal_size</code> is + exceeded, the unneeded segment files will be removed until the system + gets back under this limit. Below that limit, the system recycles enough + WAL files to cover the estimated need until the next checkpoint, and + removes the rest. The estimate is based on a moving average of the number + of WAL files used in previous checkpoint cycles. The moving average + is increased immediately if the actual usage exceeds the estimate, so it + accommodates peak usage rather than average usage to some extent. + <code class="varname">min_wal_size</code> puts a minimum on the amount of WAL files + recycled for future usage; that much WAL is always recycled for future use, + even if the system is idle and the WAL usage estimate suggests that little + WAL is needed. + </p><p> + Independently of <code class="varname">max_wal_size</code>, + the most recent <a class="xref" href="runtime-config-replication.html#GUC-WAL-KEEP-SIZE">wal_keep_size</a> megabytes of + WAL files plus one additional WAL file are + kept at all times. Also, if WAL archiving is used, old segments cannot be + removed or recycled until they are archived. If WAL archiving cannot keep up + with the pace that WAL is generated, or if <code class="varname">archive_command</code> + or <code class="varname">archive_library</code> + fails repeatedly, old WAL files will accumulate in <code class="filename">pg_wal</code> + until the situation is resolved. A slow or failed standby server that + uses a replication slot will have the same effect (see + <a class="xref" href="warm-standby.html#STREAMING-REPLICATION-SLOTS" title="27.2.6. Replication Slots">Section 27.2.6</a>). + </p><p> + In archive recovery or standby mode, the server periodically performs + <em class="firstterm">restartpoints</em>,<a id="id-1.6.17.7.12.2" class="indexterm"></a> + which are similar to checkpoints in normal operation: the server forces + all its state to disk, updates the <code class="filename">pg_control</code> file to + indicate that the already-processed WAL data need not be scanned again, + and then recycles any old log segment files in the <code class="filename">pg_wal</code> + directory. + Restartpoints can't be performed more frequently than checkpoints on the + primary because restartpoints can only be performed at checkpoint records. + A restartpoint is triggered when a checkpoint record is reached if at + least <code class="varname">checkpoint_timeout</code> seconds have passed since the last + restartpoint, or if WAL size is about to exceed + <code class="varname">max_wal_size</code>. However, because of limitations on when a + restartpoint can be performed, <code class="varname">max_wal_size</code> is often exceeded + during recovery, by up to one checkpoint cycle's worth of WAL. + (<code class="varname">max_wal_size</code> is never a hard limit anyway, so you should + always leave plenty of headroom to avoid running out of disk space.) + </p><p> + There are two commonly used internal <acronym class="acronym">WAL</acronym> functions: + <code class="function">XLogInsertRecord</code> and <code class="function">XLogFlush</code>. + <code class="function">XLogInsertRecord</code> is used to place a new record into + the <acronym class="acronym">WAL</acronym> buffers in shared memory. If there is no + space for the new record, <code class="function">XLogInsertRecord</code> will have + to write (move to kernel cache) a few filled <acronym class="acronym">WAL</acronym> + buffers. This is undesirable because <code class="function">XLogInsertRecord</code> + is used on every database low level modification (for example, row + insertion) at a time when an exclusive lock is held on affected + data pages, so the operation needs to be as fast as possible. What + is worse, writing <acronym class="acronym">WAL</acronym> buffers might also force the + creation of a new log segment, which takes even more + time. Normally, <acronym class="acronym">WAL</acronym> buffers should be written + and flushed by an <code class="function">XLogFlush</code> request, which is + made, for the most part, at transaction commit time to ensure that + transaction records are flushed to permanent storage. On systems + with high log output, <code class="function">XLogFlush</code> requests might + not occur often enough to prevent <code class="function">XLogInsertRecord</code> + from having to do writes. On such systems + one should increase the number of <acronym class="acronym">WAL</acronym> buffers by + modifying the <a class="xref" href="runtime-config-wal.html#GUC-WAL-BUFFERS">wal_buffers</a> parameter. When + <a class="xref" href="runtime-config-wal.html#GUC-FULL-PAGE-WRITES">full_page_writes</a> is set and the system is very busy, + setting <code class="varname">wal_buffers</code> higher will help smooth response times + during the period immediately following each checkpoint. + </p><p> + The <a class="xref" href="runtime-config-wal.html#GUC-COMMIT-DELAY">commit_delay</a> parameter defines for how many + microseconds a group commit leader process will sleep after acquiring a + lock within <code class="function">XLogFlush</code>, while group commit + followers queue up behind the leader. This delay allows other server + processes to add their commit records to the WAL buffers so that all of + them will be flushed by the leader's eventual sync operation. No sleep + will occur if <a class="xref" href="runtime-config-wal.html#GUC-FSYNC">fsync</a> is not enabled, or if fewer + than <a class="xref" href="runtime-config-wal.html#GUC-COMMIT-SIBLINGS">commit_siblings</a> other sessions are currently + in active transactions; this avoids sleeping when it's unlikely that + any other session will commit soon. Note that on some platforms, the + resolution of a sleep request is ten milliseconds, so that any nonzero + <code class="varname">commit_delay</code> setting between 1 and 10000 + microseconds would have the same effect. Note also that on some + platforms, sleep operations may take slightly longer than requested by + the parameter. + </p><p> + Since the purpose of <code class="varname">commit_delay</code> is to allow the + cost of each flush operation to be amortized across concurrently + committing transactions (potentially at the expense of transaction + latency), it is necessary to quantify that cost before the setting can + be chosen intelligently. The higher that cost is, the more effective + <code class="varname">commit_delay</code> is expected to be in increasing + transaction throughput, up to a point. The <a class="xref" href="pgtestfsync.html" title="pg_test_fsync"><span class="refentrytitle"><span class="application">pg_test_fsync</span></span></a> program can be used to measure the average time + in microseconds that a single WAL flush operation takes. A value of + half of the average time the program reports it takes to flush after a + single 8kB write operation is often the most effective setting for + <code class="varname">commit_delay</code>, so this value is recommended as the + starting point to use when optimizing for a particular workload. While + tuning <code class="varname">commit_delay</code> is particularly useful when the + WAL log is stored on high-latency rotating disks, benefits can be + significant even on storage media with very fast sync times, such as + solid-state drives or RAID arrays with a battery-backed write cache; + but this should definitely be tested against a representative workload. + Higher values of <code class="varname">commit_siblings</code> should be used in + such cases, whereas smaller <code class="varname">commit_siblings</code> values + are often helpful on higher latency media. Note that it is quite + possible that a setting of <code class="varname">commit_delay</code> that is too + high can increase transaction latency by so much that total transaction + throughput suffers. + </p><p> + When <code class="varname">commit_delay</code> is set to zero (the default), it + is still possible for a form of group commit to occur, but each group + will consist only of sessions that reach the point where they need to + flush their commit records during the window in which the previous + flush operation (if any) is occurring. At higher client counts a + <span class="quote">“<span class="quote">gangway effect</span>”</span> tends to occur, so that the effects of group + commit become significant even when <code class="varname">commit_delay</code> is + zero, and thus explicitly setting <code class="varname">commit_delay</code> tends + to help less. Setting <code class="varname">commit_delay</code> can only help + when (1) there are some concurrently committing transactions, and (2) + throughput is limited to some degree by commit rate; but with high + rotational latency this setting can be effective in increasing + transaction throughput with as few as two clients (that is, a single + committing client with one sibling transaction). + </p><p> + The <a class="xref" href="runtime-config-wal.html#GUC-WAL-SYNC-METHOD">wal_sync_method</a> parameter determines how + <span class="productname">PostgreSQL</span> will ask the kernel to force + <acronym class="acronym">WAL</acronym> updates out to disk. + All the options should be the same in terms of reliability, with + the exception of <code class="literal">fsync_writethrough</code>, which can sometimes + force a flush of the disk cache even when other options do not do so. + However, it's quite platform-specific which one will be the fastest. + You can test the speeds of different options using the <a class="xref" href="pgtestfsync.html" title="pg_test_fsync"><span class="refentrytitle"><span class="application">pg_test_fsync</span></span></a> program. + Note that this parameter is irrelevant if <code class="varname">fsync</code> + has been turned off. + </p><p> + Enabling the <a class="xref" href="runtime-config-developer.html#GUC-WAL-DEBUG">wal_debug</a> configuration parameter + (provided that <span class="productname">PostgreSQL</span> has been + compiled with support for it) will result in each + <code class="function">XLogInsertRecord</code> and <code class="function">XLogFlush</code> + <acronym class="acronym">WAL</acronym> call being logged to the server log. This + option might be replaced by a more general mechanism in the future. + </p><p> + There are two internal functions to write WAL data to disk: + <code class="function">XLogWrite</code> and <code class="function">issue_xlog_fsync</code>. + When <a class="xref" href="runtime-config-statistics.html#GUC-TRACK-WAL-IO-TIMING">track_wal_io_timing</a> is enabled, the total + amounts of time <code class="function">XLogWrite</code> writes and + <code class="function">issue_xlog_fsync</code> syncs WAL data to disk are counted as + <code class="literal">wal_write_time</code> and <code class="literal">wal_sync_time</code> in + <a class="xref" href="monitoring-stats.html#PG-STAT-WAL-VIEW" title="Table 28.24. pg_stat_wal View">pg_stat_wal</a>, respectively. + <code class="function">XLogWrite</code> is normally called by + <code class="function">XLogInsertRecord</code> (when there is no space for the new + record in WAL buffers), <code class="function">XLogFlush</code> and the WAL writer, + to write WAL buffers to disk and call <code class="function">issue_xlog_fsync</code>. + <code class="function">issue_xlog_fsync</code> is normally called by + <code class="function">XLogWrite</code> to sync WAL files to disk. + If <code class="varname">wal_sync_method</code> is either + <code class="literal">open_datasync</code> or <code class="literal">open_sync</code>, + a write operation in <code class="function">XLogWrite</code> guarantees to sync written + WAL data to disk and <code class="function">issue_xlog_fsync</code> does nothing. + If <code class="varname">wal_sync_method</code> is either <code class="literal">fdatasync</code>, + <code class="literal">fsync</code>, or <code class="literal">fsync_writethrough</code>, + the write operation moves WAL buffers to kernel cache and + <code class="function">issue_xlog_fsync</code> syncs them to disk. Regardless + of the setting of <code class="varname">track_wal_io_timing</code>, the number + of times <code class="function">XLogWrite</code> writes and + <code class="function">issue_xlog_fsync</code> syncs WAL data to disk are also + counted as <code class="literal">wal_write</code> and <code class="literal">wal_sync</code> + in <code class="structname">pg_stat_wal</code>, respectively. + </p><p> + The <a class="xref" href="runtime-config-wal.html#GUC-RECOVERY-PREFETCH">recovery_prefetch</a> parameter can be used to reduce + I/O wait times during recovery by instructing the kernel to initiate reads + of disk blocks that will soon be needed but are not currently in + <span class="productname">PostgreSQL</span>'s buffer pool. + The <a class="xref" href="runtime-config-resource.html#GUC-MAINTENANCE-IO-CONCURRENCY">maintenance_io_concurrency</a> and + <a class="xref" href="runtime-config-wal.html#GUC-WAL-DECODE-BUFFER-SIZE">wal_decode_buffer_size</a> settings limit prefetching + concurrency and distance, respectively. By default, it is set to + <code class="literal">try</code>, which enables the feature on systems where + <code class="function">posix_fadvise</code> is available. + </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="wal-async-commit.html" title="30.4. Asynchronous Commit">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="wal.html" title="Chapter 30. Reliability and the Write-Ahead Log">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="wal-internals.html" title="30.6. WAL Internals">Next</a></td></tr><tr><td width="40%" align="left" valign="top">30.4. Asynchronous Commit </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 30.6. WAL Internals</td></tr></table></div></body></html>
\ No newline at end of file |