summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/html/wal-configuration.html
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src/sgml/html/wal-configuration.html')
-rw-r--r--doc/src/sgml/html/wal-configuration.html295
1 files changed, 295 insertions, 0 deletions
diff --git a/doc/src/sgml/html/wal-configuration.html b/doc/src/sgml/html/wal-configuration.html
new file mode 100644
index 0000000..45d4f2f
--- /dev/null
+++ b/doc/src/sgml/html/wal-configuration.html
@@ -0,0 +1,295 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>30.5. WAL Configuration</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="wal-async-commit.html" title="30.4. Asynchronous Commit" /><link rel="next" href="wal-internals.html" title="30.6. WAL Internals" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">30.5. <acronym class="acronym">WAL</acronym> Configuration</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="wal-async-commit.html" title="30.4. Asynchronous Commit">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="wal.html" title="Chapter 30. Reliability and the Write-Ahead Log">Up</a></td><th width="60%" align="center">Chapter 30. Reliability and the Write-Ahead Log</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.5 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="wal-internals.html" title="30.6. WAL Internals">Next</a></td></tr></table><hr /></div><div class="sect1" id="WAL-CONFIGURATION"><div class="titlepage"><div><div><h2 class="title" style="clear: both">30.5. <acronym class="acronym">WAL</acronym> Configuration</h2></div></div></div><p>
+ There are several <acronym class="acronym">WAL</acronym>-related configuration parameters that
+ affect database performance. This section explains their use.
+ Consult <a class="xref" href="runtime-config.html" title="Chapter 20. Server Configuration">Chapter 20</a> for general information about
+ setting server configuration parameters.
+ </p><p>
+ <em class="firstterm">Checkpoints</em><a id="id-1.6.17.7.3.2" class="indexterm"></a>
+ are points in the sequence of transactions at which it is guaranteed
+ that the heap and index data files have been updated with all
+ information written before that checkpoint. At checkpoint time, all
+ dirty data pages are flushed to disk and a special checkpoint record is
+ written to the log file. (The change records were previously flushed
+ to the <acronym class="acronym">WAL</acronym> files.)
+ In the event of a crash, the crash recovery procedure looks at the latest
+ checkpoint record to determine the point in the log (known as the redo
+ record) from which it should start the REDO operation. Any changes made to
+ data files before that point are guaranteed to be already on disk.
+ Hence, after a checkpoint, log segments preceding the one containing
+ the redo record are no longer needed and can be recycled or removed. (When
+ <acronym class="acronym">WAL</acronym> archiving is being done, the log segments must be
+ archived before being recycled or removed.)
+ </p><p>
+ The checkpoint requirement of flushing all dirty data pages to disk
+ can cause a significant I/O load. For this reason, checkpoint
+ activity is throttled so that I/O begins at checkpoint start and completes
+ before the next checkpoint is due to start; this minimizes performance
+ degradation during checkpoints.
+ </p><p>
+ The server's checkpointer process automatically performs
+ a checkpoint every so often. A checkpoint is begun every <a class="xref" href="runtime-config-wal.html#GUC-CHECKPOINT-TIMEOUT">checkpoint_timeout</a> seconds, or if
+ <a class="xref" href="runtime-config-wal.html#GUC-MAX-WAL-SIZE">max_wal_size</a> is about to be exceeded,
+ whichever comes first.
+ The default settings are 5 minutes and 1 GB, respectively.
+ If no WAL has been written since the previous checkpoint, new checkpoints
+ will be skipped even if <code class="varname">checkpoint_timeout</code> has passed.
+ (If WAL archiving is being used and you want to put a lower limit on how
+ often files are archived in order to bound potential data loss, you should
+ adjust the <a class="xref" href="runtime-config-wal.html#GUC-ARCHIVE-TIMEOUT">archive_timeout</a> parameter rather than the
+ checkpoint parameters.)
+ It is also possible to force a checkpoint by using the SQL
+ command <code class="command">CHECKPOINT</code>.
+ </p><p>
+ Reducing <code class="varname">checkpoint_timeout</code> and/or
+ <code class="varname">max_wal_size</code> causes checkpoints to occur
+ more often. This allows faster after-crash recovery, since less work
+ will need to be redone. However, one must balance this against the
+ increased cost of flushing dirty data pages more often. If
+ <a class="xref" href="runtime-config-wal.html#GUC-FULL-PAGE-WRITES">full_page_writes</a> is set (as is the default), there is
+ another factor to consider. To ensure data page consistency,
+ the first modification of a data page after each checkpoint results in
+ logging the entire page content. In that case,
+ a smaller checkpoint interval increases the volume of output to the WAL log,
+ partially negating the goal of using a smaller interval,
+ and in any case causing more disk I/O.
+ </p><p>
+ Checkpoints are fairly expensive, first because they require writing
+ out all currently dirty buffers, and second because they result in
+ extra subsequent WAL traffic as discussed above. It is therefore
+ wise to set the checkpointing parameters high enough so that checkpoints
+ don't happen too often. As a simple sanity check on your checkpointing
+ parameters, you can set the <a class="xref" href="runtime-config-wal.html#GUC-CHECKPOINT-WARNING">checkpoint_warning</a>
+ parameter. If checkpoints happen closer together than
+ <code class="varname">checkpoint_warning</code> seconds,
+ a message will be output to the server log recommending increasing
+ <code class="varname">max_wal_size</code>. Occasional appearance of such
+ a message is not cause for alarm, but if it appears often then the
+ checkpoint control parameters should be increased. Bulk operations such
+ as large <code class="command">COPY</code> transfers might cause a number of such warnings
+ to appear if you have not set <code class="varname">max_wal_size</code> high
+ enough.
+ </p><p>
+ To avoid flooding the I/O system with a burst of page writes,
+ writing dirty buffers during a checkpoint is spread over a period of time.
+ That period is controlled by
+ <a class="xref" href="runtime-config-wal.html#GUC-CHECKPOINT-COMPLETION-TARGET">checkpoint_completion_target</a>, which is
+ given as a fraction of the checkpoint interval (configured by using
+ <code class="varname">checkpoint_timeout</code>).
+ The I/O rate is adjusted so that the checkpoint finishes when the
+ given fraction of
+ <code class="varname">checkpoint_timeout</code> seconds have elapsed, or before
+ <code class="varname">max_wal_size</code> is exceeded, whichever is sooner.
+ With the default value of 0.9,
+ <span class="productname">PostgreSQL</span> can be expected to complete each checkpoint
+ a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's
+ duration). This spreads out the I/O as much as possible so that the checkpoint
+ I/O load is consistent throughout the checkpoint interval. The disadvantage of
+ this is that prolonging checkpoints affects recovery time, because more WAL
+ segments will need to be kept around for possible use in recovery. A user
+ concerned about the amount of time required to recover might wish to reduce
+ <code class="varname">checkpoint_timeout</code> so that checkpoints occur more frequently
+ but still spread the I/O across the checkpoint interval. Alternatively,
+ <code class="varname">checkpoint_completion_target</code> could be reduced, but this would
+ result in times of more intense I/O (during the checkpoint) and times of less I/O
+ (after the checkpoint completed but before the next scheduled checkpoint) and
+ therefore is not recommended.
+ Although <code class="varname">checkpoint_completion_target</code> could be set as high as
+ 1.0, it is typically recommended to set it to no higher than 0.9 (the default)
+ since checkpoints include some other activities besides writing dirty buffers.
+ A setting of 1.0 is quite likely to result in checkpoints not being
+ completed on time, which would result in performance loss due to
+ unexpected variation in the number of WAL segments needed.
+ </p><p>
+ On Linux and POSIX platforms <a class="xref" href="runtime-config-wal.html#GUC-CHECKPOINT-FLUSH-AFTER">checkpoint_flush_after</a>
+ allows to force the OS that pages written by the checkpoint should be
+ flushed to disk after a configurable number of bytes. Otherwise, these
+ pages may be kept in the OS's page cache, inducing a stall when
+ <code class="literal">fsync</code> is issued at the end of a checkpoint. This setting will
+ often help to reduce transaction latency, but it also can have an adverse
+ effect on performance; particularly for workloads that are bigger than
+ <a class="xref" href="runtime-config-resource.html#GUC-SHARED-BUFFERS">shared_buffers</a>, but smaller than the OS's page cache.
+ </p><p>
+ The number of WAL segment files in <code class="filename">pg_wal</code> directory depends on
+ <code class="varname">min_wal_size</code>, <code class="varname">max_wal_size</code> and
+ the amount of WAL generated in previous checkpoint cycles. When old log
+ segment files are no longer needed, they are removed or recycled (that is,
+ renamed to become future segments in the numbered sequence). If, due to a
+ short-term peak of log output rate, <code class="varname">max_wal_size</code> is
+ exceeded, the unneeded segment files will be removed until the system
+ gets back under this limit. Below that limit, the system recycles enough
+ WAL files to cover the estimated need until the next checkpoint, and
+ removes the rest. The estimate is based on a moving average of the number
+ of WAL files used in previous checkpoint cycles. The moving average
+ is increased immediately if the actual usage exceeds the estimate, so it
+ accommodates peak usage rather than average usage to some extent.
+ <code class="varname">min_wal_size</code> puts a minimum on the amount of WAL files
+ recycled for future usage; that much WAL is always recycled for future use,
+ even if the system is idle and the WAL usage estimate suggests that little
+ WAL is needed.
+ </p><p>
+ Independently of <code class="varname">max_wal_size</code>,
+ the most recent <a class="xref" href="runtime-config-replication.html#GUC-WAL-KEEP-SIZE">wal_keep_size</a> megabytes of
+ WAL files plus one additional WAL file are
+ kept at all times. Also, if WAL archiving is used, old segments cannot be
+ removed or recycled until they are archived. If WAL archiving cannot keep up
+ with the pace that WAL is generated, or if <code class="varname">archive_command</code>
+ or <code class="varname">archive_library</code>
+ fails repeatedly, old WAL files will accumulate in <code class="filename">pg_wal</code>
+ until the situation is resolved. A slow or failed standby server that
+ uses a replication slot will have the same effect (see
+ <a class="xref" href="warm-standby.html#STREAMING-REPLICATION-SLOTS" title="27.2.6. Replication Slots">Section 27.2.6</a>).
+ </p><p>
+ In archive recovery or standby mode, the server periodically performs
+ <em class="firstterm">restartpoints</em>,<a id="id-1.6.17.7.12.2" class="indexterm"></a>
+ which are similar to checkpoints in normal operation: the server forces
+ all its state to disk, updates the <code class="filename">pg_control</code> file to
+ indicate that the already-processed WAL data need not be scanned again,
+ and then recycles any old log segment files in the <code class="filename">pg_wal</code>
+ directory.
+ Restartpoints can't be performed more frequently than checkpoints on the
+ primary because restartpoints can only be performed at checkpoint records.
+ A restartpoint is triggered when a checkpoint record is reached if at
+ least <code class="varname">checkpoint_timeout</code> seconds have passed since the last
+ restartpoint, or if WAL size is about to exceed
+ <code class="varname">max_wal_size</code>. However, because of limitations on when a
+ restartpoint can be performed, <code class="varname">max_wal_size</code> is often exceeded
+ during recovery, by up to one checkpoint cycle's worth of WAL.
+ (<code class="varname">max_wal_size</code> is never a hard limit anyway, so you should
+ always leave plenty of headroom to avoid running out of disk space.)
+ </p><p>
+ There are two commonly used internal <acronym class="acronym">WAL</acronym> functions:
+ <code class="function">XLogInsertRecord</code> and <code class="function">XLogFlush</code>.
+ <code class="function">XLogInsertRecord</code> is used to place a new record into
+ the <acronym class="acronym">WAL</acronym> buffers in shared memory. If there is no
+ space for the new record, <code class="function">XLogInsertRecord</code> will have
+ to write (move to kernel cache) a few filled <acronym class="acronym">WAL</acronym>
+ buffers. This is undesirable because <code class="function">XLogInsertRecord</code>
+ is used on every database low level modification (for example, row
+ insertion) at a time when an exclusive lock is held on affected
+ data pages, so the operation needs to be as fast as possible. What
+ is worse, writing <acronym class="acronym">WAL</acronym> buffers might also force the
+ creation of a new log segment, which takes even more
+ time. Normally, <acronym class="acronym">WAL</acronym> buffers should be written
+ and flushed by an <code class="function">XLogFlush</code> request, which is
+ made, for the most part, at transaction commit time to ensure that
+ transaction records are flushed to permanent storage. On systems
+ with high log output, <code class="function">XLogFlush</code> requests might
+ not occur often enough to prevent <code class="function">XLogInsertRecord</code>
+ from having to do writes. On such systems
+ one should increase the number of <acronym class="acronym">WAL</acronym> buffers by
+ modifying the <a class="xref" href="runtime-config-wal.html#GUC-WAL-BUFFERS">wal_buffers</a> parameter. When
+ <a class="xref" href="runtime-config-wal.html#GUC-FULL-PAGE-WRITES">full_page_writes</a> is set and the system is very busy,
+ setting <code class="varname">wal_buffers</code> higher will help smooth response times
+ during the period immediately following each checkpoint.
+ </p><p>
+ The <a class="xref" href="runtime-config-wal.html#GUC-COMMIT-DELAY">commit_delay</a> parameter defines for how many
+ microseconds a group commit leader process will sleep after acquiring a
+ lock within <code class="function">XLogFlush</code>, while group commit
+ followers queue up behind the leader. This delay allows other server
+ processes to add their commit records to the WAL buffers so that all of
+ them will be flushed by the leader's eventual sync operation. No sleep
+ will occur if <a class="xref" href="runtime-config-wal.html#GUC-FSYNC">fsync</a> is not enabled, or if fewer
+ than <a class="xref" href="runtime-config-wal.html#GUC-COMMIT-SIBLINGS">commit_siblings</a> other sessions are currently
+ in active transactions; this avoids sleeping when it's unlikely that
+ any other session will commit soon. Note that on some platforms, the
+ resolution of a sleep request is ten milliseconds, so that any nonzero
+ <code class="varname">commit_delay</code> setting between 1 and 10000
+ microseconds would have the same effect. Note also that on some
+ platforms, sleep operations may take slightly longer than requested by
+ the parameter.
+ </p><p>
+ Since the purpose of <code class="varname">commit_delay</code> is to allow the
+ cost of each flush operation to be amortized across concurrently
+ committing transactions (potentially at the expense of transaction
+ latency), it is necessary to quantify that cost before the setting can
+ be chosen intelligently. The higher that cost is, the more effective
+ <code class="varname">commit_delay</code> is expected to be in increasing
+ transaction throughput, up to a point. The <a class="xref" href="pgtestfsync.html" title="pg_test_fsync"><span class="refentrytitle"><span class="application">pg_test_fsync</span></span></a> program can be used to measure the average time
+ in microseconds that a single WAL flush operation takes. A value of
+ half of the average time the program reports it takes to flush after a
+ single 8kB write operation is often the most effective setting for
+ <code class="varname">commit_delay</code>, so this value is recommended as the
+ starting point to use when optimizing for a particular workload. While
+ tuning <code class="varname">commit_delay</code> is particularly useful when the
+ WAL log is stored on high-latency rotating disks, benefits can be
+ significant even on storage media with very fast sync times, such as
+ solid-state drives or RAID arrays with a battery-backed write cache;
+ but this should definitely be tested against a representative workload.
+ Higher values of <code class="varname">commit_siblings</code> should be used in
+ such cases, whereas smaller <code class="varname">commit_siblings</code> values
+ are often helpful on higher latency media. Note that it is quite
+ possible that a setting of <code class="varname">commit_delay</code> that is too
+ high can increase transaction latency by so much that total transaction
+ throughput suffers.
+ </p><p>
+ When <code class="varname">commit_delay</code> is set to zero (the default), it
+ is still possible for a form of group commit to occur, but each group
+ will consist only of sessions that reach the point where they need to
+ flush their commit records during the window in which the previous
+ flush operation (if any) is occurring. At higher client counts a
+ <span class="quote">“<span class="quote">gangway effect</span>”</span> tends to occur, so that the effects of group
+ commit become significant even when <code class="varname">commit_delay</code> is
+ zero, and thus explicitly setting <code class="varname">commit_delay</code> tends
+ to help less. Setting <code class="varname">commit_delay</code> can only help
+ when (1) there are some concurrently committing transactions, and (2)
+ throughput is limited to some degree by commit rate; but with high
+ rotational latency this setting can be effective in increasing
+ transaction throughput with as few as two clients (that is, a single
+ committing client with one sibling transaction).
+ </p><p>
+ The <a class="xref" href="runtime-config-wal.html#GUC-WAL-SYNC-METHOD">wal_sync_method</a> parameter determines how
+ <span class="productname">PostgreSQL</span> will ask the kernel to force
+ <acronym class="acronym">WAL</acronym> updates out to disk.
+ All the options should be the same in terms of reliability, with
+ the exception of <code class="literal">fsync_writethrough</code>, which can sometimes
+ force a flush of the disk cache even when other options do not do so.
+ However, it's quite platform-specific which one will be the fastest.
+ You can test the speeds of different options using the <a class="xref" href="pgtestfsync.html" title="pg_test_fsync"><span class="refentrytitle"><span class="application">pg_test_fsync</span></span></a> program.
+ Note that this parameter is irrelevant if <code class="varname">fsync</code>
+ has been turned off.
+ </p><p>
+ Enabling the <a class="xref" href="runtime-config-developer.html#GUC-WAL-DEBUG">wal_debug</a> configuration parameter
+ (provided that <span class="productname">PostgreSQL</span> has been
+ compiled with support for it) will result in each
+ <code class="function">XLogInsertRecord</code> and <code class="function">XLogFlush</code>
+ <acronym class="acronym">WAL</acronym> call being logged to the server log. This
+ option might be replaced by a more general mechanism in the future.
+ </p><p>
+ There are two internal functions to write WAL data to disk:
+ <code class="function">XLogWrite</code> and <code class="function">issue_xlog_fsync</code>.
+ When <a class="xref" href="runtime-config-statistics.html#GUC-TRACK-WAL-IO-TIMING">track_wal_io_timing</a> is enabled, the total
+ amounts of time <code class="function">XLogWrite</code> writes and
+ <code class="function">issue_xlog_fsync</code> syncs WAL data to disk are counted as
+ <code class="literal">wal_write_time</code> and <code class="literal">wal_sync_time</code> in
+ <a class="xref" href="monitoring-stats.html#PG-STAT-WAL-VIEW" title="Table 28.24. pg_stat_wal View">pg_stat_wal</a>, respectively.
+ <code class="function">XLogWrite</code> is normally called by
+ <code class="function">XLogInsertRecord</code> (when there is no space for the new
+ record in WAL buffers), <code class="function">XLogFlush</code> and the WAL writer,
+ to write WAL buffers to disk and call <code class="function">issue_xlog_fsync</code>.
+ <code class="function">issue_xlog_fsync</code> is normally called by
+ <code class="function">XLogWrite</code> to sync WAL files to disk.
+ If <code class="varname">wal_sync_method</code> is either
+ <code class="literal">open_datasync</code> or <code class="literal">open_sync</code>,
+ a write operation in <code class="function">XLogWrite</code> guarantees to sync written
+ WAL data to disk and <code class="function">issue_xlog_fsync</code> does nothing.
+ If <code class="varname">wal_sync_method</code> is either <code class="literal">fdatasync</code>,
+ <code class="literal">fsync</code>, or <code class="literal">fsync_writethrough</code>,
+ the write operation moves WAL buffers to kernel cache and
+ <code class="function">issue_xlog_fsync</code> syncs them to disk. Regardless
+ of the setting of <code class="varname">track_wal_io_timing</code>, the number
+ of times <code class="function">XLogWrite</code> writes and
+ <code class="function">issue_xlog_fsync</code> syncs WAL data to disk are also
+ counted as <code class="literal">wal_write</code> and <code class="literal">wal_sync</code>
+ in <code class="structname">pg_stat_wal</code>, respectively.
+ </p><p>
+ The <a class="xref" href="runtime-config-wal.html#GUC-RECOVERY-PREFETCH">recovery_prefetch</a> parameter can be used to reduce
+ I/O wait times during recovery by instructing the kernel to initiate reads
+ of disk blocks that will soon be needed but are not currently in
+ <span class="productname">PostgreSQL</span>'s buffer pool.
+ The <a class="xref" href="runtime-config-resource.html#GUC-MAINTENANCE-IO-CONCURRENCY">maintenance_io_concurrency</a> and
+ <a class="xref" href="runtime-config-wal.html#GUC-WAL-DECODE-BUFFER-SIZE">wal_decode_buffer_size</a> settings limit prefetching
+ concurrency and distance, respectively. By default, it is set to
+ <code class="literal">try</code>, which enables the feature on systems where
+ <code class="function">posix_fadvise</code> is available.
+ </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="wal-async-commit.html" title="30.4. Asynchronous Commit">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="wal.html" title="Chapter 30. Reliability and the Write-Ahead Log">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="wal-internals.html" title="30.6. WAL Internals">Next</a></td></tr><tr><td width="40%" align="left" valign="top">30.4. Asynchronous Commit </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.5 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 30.6. WAL Internals</td></tr></table></div></body></html> \ No newline at end of file