summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/wal.sgml
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--doc/src/sgml/wal.sgml898
1 files changed, 898 insertions, 0 deletions
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
new file mode 100644
index 0000000..24e1c89
--- /dev/null
+++ b/doc/src/sgml/wal.sgml
@@ -0,0 +1,898 @@
+<!-- doc/src/sgml/wal.sgml -->
+
+<chapter id="wal">
+ <title>Reliability and the Write-Ahead Log</title>
+
+ <para>
+ This chapter explains how the Write-Ahead Log is used to obtain
+ efficient, reliable operation.
+ </para>
+
+ <sect1 id="wal-reliability">
+ <title>Reliability</title>
+
+ <para>
+ Reliability is an important property of any serious database
+ system, and <productname>PostgreSQL</productname> does everything possible to
+ guarantee reliable operation. One aspect of reliable operation is
+ that all data recorded by a committed transaction should be stored
+ in a nonvolatile area that is safe from power loss, operating
+ system failure, and hardware failure (except failure of the
+ nonvolatile area itself, of course). Successfully writing the data
+ to the computer's permanent storage (disk drive or equivalent)
+ ordinarily meets this requirement. In fact, even if a computer is
+ fatally damaged, if the disk drives survive they can be moved to
+ another computer with similar hardware and all committed
+ transactions will remain intact.
+ </para>
+
+ <para>
+ While forcing data to the disk platters periodically might seem like
+ a simple operation, it is not. Because disk drives are dramatically
+ slower than main memory and CPUs, several layers of caching exist
+ between the computer's main memory and the disk platters.
+ First, there is the operating system's buffer cache, which caches
+ frequently requested disk blocks and combines disk writes. Fortunately,
+ all operating systems give applications a way to force writes from
+ the buffer cache to disk, and <productname>PostgreSQL</productname> uses those
+ features. (See the <xref linkend="guc-wal-sync-method"/> parameter
+ to adjust how this is done.)
+ </para>
+
+ <para>
+ Next, there might be a cache in the disk drive controller; this is
+ particularly common on <acronym>RAID</acronym> controller cards. Some of
+ these caches are <firstterm>write-through</firstterm>, meaning writes are sent
+ to the drive as soon as they arrive. Others are
+ <firstterm>write-back</firstterm>, meaning data is sent to the drive at
+ some later time. Such caches can be a reliability hazard because the
+ memory in the disk controller cache is volatile, and will lose its
+ contents in a power failure. Better controller cards have
+ <firstterm>battery-backup units</firstterm> (<acronym>BBU</acronym>s), meaning
+ the card has a battery that
+ maintains power to the cache in case of system power loss. After power
+ is restored the data will be written to the disk drives.
+ </para>
+
+ <para>
+ And finally, most disk drives have caches. Some are write-through
+ while some are write-back, and the same concerns about data loss
+ exist for write-back drive caches as for disk controller
+ caches. Consumer-grade IDE and SATA drives are particularly likely
+ to have write-back caches that will not survive a power failure. Many
+ solid-state drives (SSD) also have volatile write-back caches.
+ </para>
+
+ <para>
+ These caches can typically be disabled; however, the method for doing
+ this varies by operating system and drive type:
+ </para>
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ On <productname>Linux</productname>, IDE and SATA drives can be queried using
+ <command>hdparm -I</command>; write caching is enabled if there is
+ a <literal>*</literal> next to <literal>Write cache</literal>. <command>hdparm -W 0</command>
+ can be used to turn off write caching. SCSI drives can be queried
+ using <ulink url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</application></ulink>.
+ Use <command>sdparm --get=WCE</command> to check
+ whether the write cache is enabled and <command>sdparm --clear=WCE</command>
+ to disable it.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ On <productname>FreeBSD</productname>, IDE drives can be queried using
+ <command>atacontrol</command> and write caching turned off using
+ <literal>hw.ata.wc=0</literal> in <filename>/boot/loader.conf</filename>;
+ SCSI drives can be queried using <command>camcontrol identify</command>,
+ and the write cache both queried and changed using
+ <command>sdparm</command> when available.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ On <productname>Solaris</productname>, the disk write cache is controlled by
+ <command>format -e</command>.
+ (The Solaris <acronym>ZFS</acronym> file system is safe with disk write-cache
+ enabled because it issues its own disk cache flush commands.)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ On <productname>Windows</productname>, if <varname>wal_sync_method</varname> is
+ <literal>open_datasync</literal> (the default), write caching can be disabled
+ by unchecking <literal>My Computer\Open\<replaceable>disk drive</replaceable>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</literal>.
+ Alternatively, set <varname>wal_sync_method</varname> to
+ <literal>fsync</literal> or <literal>fsync_writethrough</literal>, which prevent
+ write caching.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ On <productname>macOS</productname>, write caching can be prevented by
+ setting <varname>wal_sync_method</varname> to <literal>fsync_writethrough</literal>.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ Recent SATA drives (those following <acronym>ATAPI-6</acronym> or later)
+ offer a drive cache flush command (<command>FLUSH CACHE EXT</command>),
+ while SCSI drives have long supported a similar command
+ <command>SYNCHRONIZE CACHE</command>. These commands are not directly
+ accessible to <productname>PostgreSQL</productname>, but some file systems
+ (e.g., <acronym>ZFS</acronym>, <acronym>ext4</acronym>) can use them to flush
+ data to the platters on write-back-enabled drives. Unfortunately, such
+ file systems behave suboptimally when combined with battery-backup unit
+ (<acronym>BBU</acronym>) disk controllers. In such setups, the synchronize
+ command forces all data from the controller cache to the disks,
+ eliminating much of the benefit of the BBU. You can run the
+ <xref linkend="pgtestfsync"/> program to see
+ if you are affected. If you are affected, the performance benefits
+ of the BBU can be regained by turning off write barriers in
+ the file system or reconfiguring the disk controller, if that is
+ an option. If write barriers are turned off, make sure the battery
+ remains functional; a faulty battery can potentially lead to data loss.
+ Hopefully file system and disk controller designers will eventually
+ address this suboptimal behavior.
+ </para>
+
+ <para>
+ When the operating system sends a write request to the storage hardware,
+ there is little it can do to make sure the data has arrived at a truly
+ non-volatile storage area. Rather, it is the
+ administrator's responsibility to make certain that all storage components
+ ensure integrity for both data and file-system metadata.
+ Avoid disk controllers that have non-battery-backed write caches.
+ At the drive level, disable write-back caching if the
+ drive cannot guarantee the data will be written before shutdown.
+ If you use SSDs, be aware that many of these do not honor cache flush
+ commands by default.
+ You can test for reliable I/O subsystem behavior using <ulink
+ url="https://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>.
+ </para>
+
+ <para>
+ Another risk of data loss is posed by the disk platter write
+ operations themselves. Disk platters are divided into sectors,
+ commonly 512 bytes each. Every physical read or write operation
+ processes a whole sector.
+ When a write request arrives at the drive, it might be for some multiple
+ of 512 bytes (<productname>PostgreSQL</productname> typically writes 8192 bytes, or
+ 16 sectors, at a time), and the process of writing could fail due
+ to power loss at any time, meaning some of the 512-byte sectors were
+ written while others were not. To guard against such failures,
+ <productname>PostgreSQL</productname> periodically writes full page images to
+ permanent WAL storage <emphasis>before</emphasis> modifying the actual page on
+ disk. By doing this, during crash recovery <productname>PostgreSQL</productname> can
+ restore partially-written pages from WAL. If you have file-system software
+ that prevents partial page writes (e.g., ZFS), you can turn off
+ this page imaging by turning off the <xref
+ linkend="guc-full-page-writes"/> parameter. Battery-Backed Unit
+ (BBU) disk controllers do not prevent partial page writes unless
+ they guarantee that data is written to the BBU as full (8kB) pages.
+ </para>
+ <para>
+ <productname>PostgreSQL</productname> also protects against some kinds of data corruption
+ on storage devices that may occur because of hardware errors or media failure over time,
+ such as reading/writing garbage data.
+ <itemizedlist>
+ <listitem>
+ <para>
+ Each individual record in a WAL file is protected by a CRC-32 (32-bit) check
+ that allows us to tell if record contents are correct. The CRC value
+ is set when we write each WAL record and checked during crash recovery,
+ archive recovery and replication.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Data pages are not currently checksummed by default, though full page images
+ recorded in WAL records will be protected; see <link
+ linkend="app-initdb-data-checksums"><application>initdb</application></link>
+ for details about enabling data checksums.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Internal data structures such as <filename>pg_xact</filename>, <filename>pg_subtrans</filename>, <filename>pg_multixact</filename>,
+ <filename>pg_serial</filename>, <filename>pg_notify</filename>, <filename>pg_stat</filename>, <filename>pg_snapshots</filename> are not directly
+ checksummed, nor are pages protected by full page writes. However, where
+ such data structures are persistent, WAL records are written that allow
+ recent changes to be accurately rebuilt at crash recovery and those
+ WAL records are protected as discussed above.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Individual state files in <filename>pg_twophase</filename> are protected by CRC-32.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Temporary data files used in larger SQL queries for sorts,
+ materializations and intermediate results are not currently checksummed,
+ nor will WAL records be written for changes to those files.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ <productname>PostgreSQL</productname> does not protect against correctable memory errors
+ and it is assumed you will operate using RAM that uses industry standard
+ Error Correcting Codes (ECC) or better protection.
+ </para>
+ </sect1>
+
+ <sect1 id="checksums">
+ <title>Data Checksums</title>
+ <indexterm>
+ <primary>checksums</primary>
+ </indexterm>
+
+ <para>
+ By default, data pages are not protected by checksums, but this can
+ optionally be enabled for a cluster. When enabled, each data page includes
+ a checksum that is updated when the page is written and verified each time
+ the page is read. Only data pages are protected by checksums; internal data
+ structures and temporary files are not.
+ </para>
+
+ <para>
+ Checksums are normally enabled when the cluster is initialized using <link
+ linkend="app-initdb-data-checksums"><application>initdb</application></link>.
+ They can also be enabled or disabled at a later time as an offline
+ operation. Data checksums are enabled or disabled at the full cluster
+ level, and cannot be specified individually for databases or tables.
+ </para>
+
+ <para>
+ The current state of checksums in the cluster can be verified by viewing the
+ value of the read-only configuration variable <xref
+ linkend="guc-data-checksums" /> by issuing the command <command>SHOW
+ data_checksums</command>.
+ </para>
+
+ <para>
+ When attempting to recover from page corruptions, it may be necessary to
+ bypass the checksum protection. To do this, temporarily set the
+ configuration parameter <xref linkend="guc-ignore-checksum-failure" />.
+ </para>
+
+ <sect2 id="checksums-offline-enable-disable">
+ <title>Off-line Enabling of Checksums</title>
+
+ <para>
+ The <link linkend="app-pgchecksums"><application>pg_checksums</application></link>
+ application can be used to enable or disable data checksums, as well as
+ verify checksums, on an offline cluster.
+ </para>
+
+ </sect2>
+ </sect1>
+
+ <sect1 id="wal-intro">
+ <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
+
+ <indexterm zone="wal">
+ <primary>WAL</primary>
+ </indexterm>
+
+ <indexterm>
+ <primary>transaction log</primary>
+ <see>WAL</see>
+ </indexterm>
+
+ <para>
+ <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
+ is a standard method for ensuring data integrity. A detailed
+ description can be found in most (if not all) books about
+ transaction processing. Briefly, <acronym>WAL</acronym>'s central
+ concept is that changes to data files (where tables and indexes
+ reside) must be written only after those changes have been logged,
+ that is, after log records describing the changes have been flushed
+ to permanent storage. If we follow this procedure, we do not need
+ to flush data pages to disk on every transaction commit, because we
+ know that in the event of a crash we will be able to recover the
+ database using the log: any changes that have not been applied to
+ the data pages can be redone from the log records. (This is
+ roll-forward recovery, also known as REDO.)
+ </para>
+
+ <tip>
+ <para>
+ Because <acronym>WAL</acronym> restores database file
+ contents after a crash, journaled file systems are not necessary for
+ reliable storage of the data files or WAL files. In fact, journaling
+ overhead can reduce performance, especially if journaling
+ causes file system <emphasis>data</emphasis> to be flushed
+ to disk. Fortunately, data flushing during journaling can
+ often be disabled with a file system mount option, e.g.,
+ <literal>data=writeback</literal> on a Linux ext3 file system.
+ Journaled file systems do improve boot speed after a crash.
+ </para>
+ </tip>
+
+
+ <para>
+ Using <acronym>WAL</acronym> results in a
+ significantly reduced number of disk writes, because only the log
+ file needs to be flushed to disk to guarantee that a transaction is
+ committed, rather than every data file changed by the transaction.
+ The log file is written sequentially,
+ and so the cost of syncing the log is much less than the cost of
+ flushing the data pages. This is especially true for servers
+ handling many small transactions touching different parts of the data
+ store. Furthermore, when the server is processing many small concurrent
+ transactions, one <function>fsync</function> of the log file may
+ suffice to commit many transactions.
+ </para>
+
+ <para>
+ <acronym>WAL</acronym> also makes it possible to support on-line
+ backup and point-in-time recovery, as described in <xref
+ linkend="continuous-archiving"/>. By archiving the WAL data we can support
+ reverting to any time instant covered by the available WAL data:
+ we simply install a prior physical backup of the database, and
+ replay the WAL log just as far as the desired time. What's more,
+ the physical backup doesn't have to be an instantaneous snapshot
+ of the database state &mdash; if it is made over some period of time,
+ then replaying the WAL log for that period will fix any internal
+ inconsistencies.
+ </para>
+ </sect1>
+
+ <sect1 id="wal-async-commit">
+ <title>Asynchronous Commit</title>
+
+ <indexterm>
+ <primary>synchronous commit</primary>
+ </indexterm>
+
+ <indexterm>
+ <primary>asynchronous commit</primary>
+ </indexterm>
+
+ <para>
+ <firstterm>Asynchronous commit</firstterm> is an option that allows transactions
+ to complete more quickly, at the cost that the most recent transactions may
+ be lost if the database should crash. In many applications this is an
+ acceptable trade-off.
+ </para>
+
+ <para>
+ As described in the previous section, transaction commit is normally
+ <firstterm>synchronous</firstterm>: the server waits for the transaction's
+ <acronym>WAL</acronym> records to be flushed to permanent storage
+ before returning a success indication to the client. The client is
+ therefore guaranteed that a transaction reported to be committed will
+ be preserved, even in the event of a server crash immediately after.
+ However, for short transactions this delay is a major component of the
+ total transaction time. Selecting asynchronous commit mode means that
+ the server returns success as soon as the transaction is logically
+ completed, before the <acronym>WAL</acronym> records it generated have
+ actually made their way to disk. This can provide a significant boost
+ in throughput for small transactions.
+ </para>
+
+ <para>
+ Asynchronous commit introduces the risk of data loss. There is a short
+ time window between the report of transaction completion to the client
+ and the time that the transaction is truly committed (that is, it is
+ guaranteed not to be lost if the server crashes). Thus asynchronous
+ commit should not be used if the client will take external actions
+ relying on the assumption that the transaction will be remembered.
+ As an example, a bank would certainly not use asynchronous commit for
+ a transaction recording an ATM's dispensing of cash. But in many
+ scenarios, such as event logging, there is no need for a strong
+ guarantee of this kind.
+ </para>
+
+ <para>
+ The risk that is taken by using asynchronous commit is of data loss,
+ not data corruption. If the database should crash, it will recover
+ by replaying <acronym>WAL</acronym> up to the last record that was
+ flushed. The database will therefore be restored to a self-consistent
+ state, but any transactions that were not yet flushed to disk will
+ not be reflected in that state. The net effect is therefore loss of
+ the last few transactions. Because the transactions are replayed in
+ commit order, no inconsistency can be introduced &mdash; for example,
+ if transaction B made changes relying on the effects of a previous
+ transaction A, it is not possible for A's effects to be lost while B's
+ effects are preserved.
+ </para>
+
+ <para>
+ The user can select the commit mode of each transaction, so that
+ it is possible to have both synchronous and asynchronous commit
+ transactions running concurrently. This allows flexible trade-offs
+ between performance and certainty of transaction durability.
+ The commit mode is controlled by the user-settable parameter
+ <xref linkend="guc-synchronous-commit"/>, which can be changed in any of
+ the ways that a configuration parameter can be set. The mode used for
+ any one transaction depends on the value of
+ <varname>synchronous_commit</varname> when transaction commit begins.
+ </para>
+
+ <para>
+ Certain utility commands, for instance <command>DROP TABLE</command>, are
+ forced to commit synchronously regardless of the setting of
+ <varname>synchronous_commit</varname>. This is to ensure consistency
+ between the server's file system and the logical state of the database.
+ The commands supporting two-phase commit, such as <command>PREPARE
+ TRANSACTION</command>, are also always synchronous.
+ </para>
+
+ <para>
+ If the database crashes during the risk window between an
+ asynchronous commit and the writing of the transaction's
+ <acronym>WAL</acronym> records,
+ then changes made during that transaction <emphasis>will</emphasis> be lost.
+ The duration of the
+ risk window is limited because a background process (the <quote>WAL
+ writer</quote>) flushes unwritten <acronym>WAL</acronym> records to disk
+ every <xref linkend="guc-wal-writer-delay"/> milliseconds.
+ The actual maximum duration of the risk window is three times
+ <varname>wal_writer_delay</varname> because the WAL writer is
+ designed to favor writing whole pages at a time during busy periods.
+ </para>
+
+ <caution>
+ <para>
+ An immediate-mode shutdown is equivalent to a server crash, and will
+ therefore cause loss of any unflushed asynchronous commits.
+ </para>
+ </caution>
+
+ <para>
+ Asynchronous commit provides behavior different from setting
+ <xref linkend="guc-fsync"/> = off.
+ <varname>fsync</varname> is a server-wide
+ setting that will alter the behavior of all transactions. It disables
+ all logic within <productname>PostgreSQL</productname> that attempts to synchronize
+ writes to different portions of the database, and therefore a system
+ crash (that is, a hardware or operating system crash, not a failure of
+ <productname>PostgreSQL</productname> itself) could result in arbitrarily bad
+ corruption of the database state. In many scenarios, asynchronous
+ commit provides most of the performance improvement that could be
+ obtained by turning off <varname>fsync</varname>, but without the risk
+ of data corruption.
+ </para>
+
+ <para>
+ <xref linkend="guc-commit-delay"/> also sounds very similar to
+ asynchronous commit, but it is actually a synchronous commit method
+ (in fact, <varname>commit_delay</varname> is ignored during an
+ asynchronous commit). <varname>commit_delay</varname> causes a delay
+ just before a transaction flushes <acronym>WAL</acronym> to disk, in
+ the hope that a single flush executed by one such transaction can also
+ serve other transactions committing at about the same time. The
+ setting can be thought of as a way of increasing the time window in
+ which transactions can join a group about to participate in a single
+ flush, to amortize the cost of the flush among multiple transactions.
+ </para>
+
+ </sect1>
+
+ <sect1 id="wal-configuration">
+ <title><acronym>WAL</acronym> Configuration</title>
+
+ <para>
+ There are several <acronym>WAL</acronym>-related configuration parameters that
+ affect database performance. This section explains their use.
+ Consult <xref linkend="runtime-config"/> for general information about
+ setting server configuration parameters.
+ </para>
+
+ <para>
+ <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</primary></indexterm>
+ are points in the sequence of transactions at which it is guaranteed
+ that the heap and index data files have been updated with all
+ information written before that checkpoint. At checkpoint time, all
+ dirty data pages are flushed to disk and a special checkpoint record is
+ written to the log file. (The change records were previously flushed
+ to the <acronym>WAL</acronym> files.)
+ In the event of a crash, the crash recovery procedure looks at the latest
+ checkpoint record to determine the point in the log (known as the redo
+ record) from which it should start the REDO operation. Any changes made to
+ data files before that point are guaranteed to be already on disk.
+ Hence, after a checkpoint, log segments preceding the one containing
+ the redo record are no longer needed and can be recycled or removed. (When
+ <acronym>WAL</acronym> archiving is being done, the log segments must be
+ archived before being recycled or removed.)
+ </para>
+
+ <para>
+ The checkpoint requirement of flushing all dirty data pages to disk
+ can cause a significant I/O load. For this reason, checkpoint
+ activity is throttled so that I/O begins at checkpoint start and completes
+ before the next checkpoint is due to start; this minimizes performance
+ degradation during checkpoints.
+ </para>
+
+ <para>
+ The server's checkpointer process automatically performs
+ a checkpoint every so often. A checkpoint is begun every <xref
+ linkend="guc-checkpoint-timeout"/> seconds, or if
+ <xref linkend="guc-max-wal-size"/> is about to be exceeded,
+ whichever comes first.
+ The default settings are 5 minutes and 1 GB, respectively.
+ If no WAL has been written since the previous checkpoint, new checkpoints
+ will be skipped even if <varname>checkpoint_timeout</varname> has passed.
+ (If WAL archiving is being used and you want to put a lower limit on how
+ often files are archived in order to bound potential data loss, you should
+ adjust the <xref linkend="guc-archive-timeout"/> parameter rather than the
+ checkpoint parameters.)
+ It is also possible to force a checkpoint by using the SQL
+ command <command>CHECKPOINT</command>.
+ </para>
+
+ <para>
+ Reducing <varname>checkpoint_timeout</varname> and/or
+ <varname>max_wal_size</varname> causes checkpoints to occur
+ more often. This allows faster after-crash recovery, since less work
+ will need to be redone. However, one must balance this against the
+ increased cost of flushing dirty data pages more often. If
+ <xref linkend="guc-full-page-writes"/> is set (as is the default), there is
+ another factor to consider. To ensure data page consistency,
+ the first modification of a data page after each checkpoint results in
+ logging the entire page content. In that case,
+ a smaller checkpoint interval increases the volume of output to the WAL log,
+ partially negating the goal of using a smaller interval,
+ and in any case causing more disk I/O.
+ </para>
+
+ <para>
+ Checkpoints are fairly expensive, first because they require writing
+ out all currently dirty buffers, and second because they result in
+ extra subsequent WAL traffic as discussed above. It is therefore
+ wise to set the checkpointing parameters high enough so that checkpoints
+ don't happen too often. As a simple sanity check on your checkpointing
+ parameters, you can set the <xref linkend="guc-checkpoint-warning"/>
+ parameter. If checkpoints happen closer together than
+ <varname>checkpoint_warning</varname> seconds,
+ a message will be output to the server log recommending increasing
+ <varname>max_wal_size</varname>. Occasional appearance of such
+ a message is not cause for alarm, but if it appears often then the
+ checkpoint control parameters should be increased. Bulk operations such
+ as large <command>COPY</command> transfers might cause a number of such warnings
+ to appear if you have not set <varname>max_wal_size</varname> high
+ enough.
+ </para>
+
+ <para>
+ To avoid flooding the I/O system with a burst of page writes,
+ writing dirty buffers during a checkpoint is spread over a period of time.
+ That period is controlled by
+ <xref linkend="guc-checkpoint-completion-target"/>, which is
+ given as a fraction of the checkpoint interval (configured by using
+ <varname>checkpoint_timeout</varname>).
+ The I/O rate is adjusted so that the checkpoint finishes when the
+ given fraction of
+ <varname>checkpoint_timeout</varname> seconds have elapsed, or before
+ <varname>max_wal_size</varname> is exceeded, whichever is sooner.
+ With the default value of 0.9,
+ <productname>PostgreSQL</productname> can be expected to complete each checkpoint
+ a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's
+ duration). This spreads out the I/O as much as possible so that the checkpoint
+ I/O load is consistent throughout the checkpoint interval. The disadvantage of
+ this is that prolonging checkpoints affects recovery time, because more WAL
+ segments will need to be kept around for possible use in recovery. A user
+ concerned about the amount of time required to recover might wish to reduce
+ <varname>checkpoint_timeout</varname> so that checkpoints occur more frequently
+ but still spread the I/O across the checkpoint interval. Alternatively,
+ <varname>checkpoint_completion_target</varname> could be reduced, but this would
+ result in times of more intense I/O (during the checkpoint) and times of less I/O
+ (after the checkpoint completed but before the next scheduled checkpoint) and
+ therefore is not recommended.
+ Although <varname>checkpoint_completion_target</varname> could be set as high as
+ 1.0, it is typically recommended to set it to no higher than 0.9 (the default)
+ since checkpoints include some other activities besides writing dirty buffers.
+ A setting of 1.0 is quite likely to result in checkpoints not being
+ completed on time, which would result in performance loss due to
+ unexpected variation in the number of WAL segments needed.
+ </para>
+
+ <para>
+ On Linux and POSIX platforms <xref linkend="guc-checkpoint-flush-after"/>
+ allows to force the OS that pages written by the checkpoint should be
+ flushed to disk after a configurable number of bytes. Otherwise, these
+ pages may be kept in the OS's page cache, inducing a stall when
+ <literal>fsync</literal> is issued at the end of a checkpoint. This setting will
+ often help to reduce transaction latency, but it also can have an adverse
+ effect on performance; particularly for workloads that are bigger than
+ <xref linkend="guc-shared-buffers"/>, but smaller than the OS's page cache.
+ </para>
+
+ <para>
+ The number of WAL segment files in <filename>pg_wal</filename> directory depends on
+ <varname>min_wal_size</varname>, <varname>max_wal_size</varname> and
+ the amount of WAL generated in previous checkpoint cycles. When old log
+ segment files are no longer needed, they are removed or recycled (that is,
+ renamed to become future segments in the numbered sequence). If, due to a
+ short-term peak of log output rate, <varname>max_wal_size</varname> is
+ exceeded, the unneeded segment files will be removed until the system
+ gets back under this limit. Below that limit, the system recycles enough
+ WAL files to cover the estimated need until the next checkpoint, and
+ removes the rest. The estimate is based on a moving average of the number
+ of WAL files used in previous checkpoint cycles. The moving average
+ is increased immediately if the actual usage exceeds the estimate, so it
+ accommodates peak usage rather than average usage to some extent.
+ <varname>min_wal_size</varname> puts a minimum on the amount of WAL files
+ recycled for future usage; that much WAL is always recycled for future use,
+ even if the system is idle and the WAL usage estimate suggests that little
+ WAL is needed.
+ </para>
+
+ <para>
+ Independently of <varname>max_wal_size</varname>,
+ the most recent <xref linkend="guc-wal-keep-size"/> megabytes of
+ WAL files plus one additional WAL file are
+ kept at all times. Also, if WAL archiving is used, old segments cannot be
+ removed or recycled until they are archived. If WAL archiving cannot keep up
+ with the pace that WAL is generated, or if <varname>archive_command</varname>
+ fails repeatedly, old WAL files will accumulate in <filename>pg_wal</filename>
+ until the situation is resolved. A slow or failed standby server that
+ uses a replication slot will have the same effect (see
+ <xref linkend="streaming-replication-slots"/>).
+ </para>
+
+ <para>
+ In archive recovery or standby mode, the server periodically performs
+ <firstterm>restartpoints</firstterm>,<indexterm><primary>restartpoint</primary></indexterm>
+ which are similar to checkpoints in normal operation: the server forces
+ all its state to disk, updates the <filename>pg_control</filename> file to
+ indicate that the already-processed WAL data need not be scanned again,
+ and then recycles any old log segment files in the <filename>pg_wal</filename>
+ directory.
+ Restartpoints can't be performed more frequently than checkpoints on the
+ primary because restartpoints can only be performed at checkpoint records.
+ A restartpoint is triggered when a checkpoint record is reached if at
+ least <varname>checkpoint_timeout</varname> seconds have passed since the last
+ restartpoint, or if WAL size is about to exceed
+ <varname>max_wal_size</varname>. However, because of limitations on when a
+ restartpoint can be performed, <varname>max_wal_size</varname> is often exceeded
+ during recovery, by up to one checkpoint cycle's worth of WAL.
+ (<varname>max_wal_size</varname> is never a hard limit anyway, so you should
+ always leave plenty of headroom to avoid running out of disk space.)
+ </para>
+
+ <para>
+ There are two commonly used internal <acronym>WAL</acronym> functions:
+ <function>XLogInsertRecord</function> and <function>XLogFlush</function>.
+ <function>XLogInsertRecord</function> is used to place a new record into
+ the <acronym>WAL</acronym> buffers in shared memory. If there is no
+ space for the new record, <function>XLogInsertRecord</function> will have
+ to write (move to kernel cache) a few filled <acronym>WAL</acronym>
+ buffers. This is undesirable because <function>XLogInsertRecord</function>
+ is used on every database low level modification (for example, row
+ insertion) at a time when an exclusive lock is held on affected
+ data pages, so the operation needs to be as fast as possible. What
+ is worse, writing <acronym>WAL</acronym> buffers might also force the
+ creation of a new log segment, which takes even more
+ time. Normally, <acronym>WAL</acronym> buffers should be written
+ and flushed by an <function>XLogFlush</function> request, which is
+ made, for the most part, at transaction commit time to ensure that
+ transaction records are flushed to permanent storage. On systems
+ with high log output, <function>XLogFlush</function> requests might
+ not occur often enough to prevent <function>XLogInsertRecord</function>
+ from having to do writes. On such systems
+ one should increase the number of <acronym>WAL</acronym> buffers by
+ modifying the <xref linkend="guc-wal-buffers"/> parameter. When
+ <xref linkend="guc-full-page-writes"/> is set and the system is very busy,
+ setting <varname>wal_buffers</varname> higher will help smooth response times
+ during the period immediately following each checkpoint.
+ </para>
+
+ <para>
+ The <xref linkend="guc-commit-delay"/> parameter defines for how many
+ microseconds a group commit leader process will sleep after acquiring a
+ lock within <function>XLogFlush</function>, while group commit
+ followers queue up behind the leader. This delay allows other server
+ processes to add their commit records to the WAL buffers so that all of
+ them will be flushed by the leader's eventual sync operation. No sleep
+ will occur if <xref linkend="guc-fsync"/> is not enabled, or if fewer
+ than <xref linkend="guc-commit-siblings"/> other sessions are currently
+ in active transactions; this avoids sleeping when it's unlikely that
+ any other session will commit soon. Note that on some platforms, the
+ resolution of a sleep request is ten milliseconds, so that any nonzero
+ <varname>commit_delay</varname> setting between 1 and 10000
+ microseconds would have the same effect. Note also that on some
+ platforms, sleep operations may take slightly longer than requested by
+ the parameter.
+ </para>
+
+ <para>
+ Since the purpose of <varname>commit_delay</varname> is to allow the
+ cost of each flush operation to be amortized across concurrently
+ committing transactions (potentially at the expense of transaction
+ latency), it is necessary to quantify that cost before the setting can
+ be chosen intelligently. The higher that cost is, the more effective
+ <varname>commit_delay</varname> is expected to be in increasing
+ transaction throughput, up to a point. The <xref
+ linkend="pgtestfsync"/> program can be used to measure the average time
+ in microseconds that a single WAL flush operation takes. A value of
+ half of the average time the program reports it takes to flush after a
+ single 8kB write operation is often the most effective setting for
+ <varname>commit_delay</varname>, so this value is recommended as the
+ starting point to use when optimizing for a particular workload. While
+ tuning <varname>commit_delay</varname> is particularly useful when the
+ WAL log is stored on high-latency rotating disks, benefits can be
+ significant even on storage media with very fast sync times, such as
+ solid-state drives or RAID arrays with a battery-backed write cache;
+ but this should definitely be tested against a representative workload.
+ Higher values of <varname>commit_siblings</varname> should be used in
+ such cases, whereas smaller <varname>commit_siblings</varname> values
+ are often helpful on higher latency media. Note that it is quite
+ possible that a setting of <varname>commit_delay</varname> that is too
+ high can increase transaction latency by so much that total transaction
+ throughput suffers.
+ </para>
+
+ <para>
+ When <varname>commit_delay</varname> is set to zero (the default), it
+ is still possible for a form of group commit to occur, but each group
+ will consist only of sessions that reach the point where they need to
+ flush their commit records during the window in which the previous
+ flush operation (if any) is occurring. At higher client counts a
+ <quote>gangway effect</quote> tends to occur, so that the effects of group
+ commit become significant even when <varname>commit_delay</varname> is
+ zero, and thus explicitly setting <varname>commit_delay</varname> tends
+ to help less. Setting <varname>commit_delay</varname> can only help
+ when (1) there are some concurrently committing transactions, and (2)
+ throughput is limited to some degree by commit rate; but with high
+ rotational latency this setting can be effective in increasing
+ transaction throughput with as few as two clients (that is, a single
+ committing client with one sibling transaction).
+ </para>
+
+ <para>
+ The <xref linkend="guc-wal-sync-method"/> parameter determines how
+ <productname>PostgreSQL</productname> will ask the kernel to force
+ <acronym>WAL</acronym> updates out to disk.
+ All the options should be the same in terms of reliability, with
+ the exception of <literal>fsync_writethrough</literal>, which can sometimes
+ force a flush of the disk cache even when other options do not do so.
+ However, it's quite platform-specific which one will be the fastest.
+ You can test the speeds of different options using the <xref
+ linkend="pgtestfsync"/> program.
+ Note that this parameter is irrelevant if <varname>fsync</varname>
+ has been turned off.
+ </para>
+
+ <para>
+ Enabling the <xref linkend="guc-wal-debug"/> configuration parameter
+ (provided that <productname>PostgreSQL</productname> has been
+ compiled with support for it) will result in each
+ <function>XLogInsertRecord</function> and <function>XLogFlush</function>
+ <acronym>WAL</acronym> call being logged to the server log. This
+ option might be replaced by a more general mechanism in the future.
+ </para>
+
+ <para>
+ There are two internal functions to write WAL data to disk:
+ <function>XLogWrite</function> and <function>issue_xlog_fsync</function>.
+ When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total
+ amounts of time <function>XLogWrite</function> writes and
+ <function>issue_xlog_fsync</function> syncs WAL data to disk are counted as
+ <literal>wal_write_time</literal> and <literal>wal_sync_time</literal> in
+ <xref linkend="pg-stat-wal-view"/>, respectively.
+ <function>XLogWrite</function> is normally called by
+ <function>XLogInsertRecord</function> (when there is no space for the new
+ record in WAL buffers), <function>XLogFlush</function> and the WAL writer,
+ to write WAL buffers to disk and call <function>issue_xlog_fsync</function>.
+ <function>issue_xlog_fsync</function> is normally called by
+ <function>XLogWrite</function> to sync WAL files to disk.
+ If <varname>wal_sync_method</varname> is either
+ <literal>open_datasync</literal> or <literal>open_sync</literal>,
+ a write operation in <function>XLogWrite</function> guarantees to sync written
+ WAL data to disk and <function>issue_xlog_fsync</function> does nothing.
+ If <varname>wal_sync_method</varname> is either <literal>fdatasync</literal>,
+ <literal>fsync</literal>, or <literal>fsync_writethrough</literal>,
+ the write operation moves WAL buffers to kernel cache and
+ <function>issue_xlog_fsync</function> syncs them to disk. Regardless
+ of the setting of <varname>track_wal_io_timing</varname>, the number
+ of times <function>XLogWrite</function> writes and
+ <function>issue_xlog_fsync</function> syncs WAL data to disk are also
+ counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
+ in <structname>pg_stat_wal</structname>, respectively.
+ </para>
+ </sect1>
+
+ <sect1 id="wal-internals">
+ <title>WAL Internals</title>
+
+ <indexterm zone="wal-internals">
+ <primary>LSN</primary>
+ </indexterm>
+
+ <para>
+ <acronym>WAL</acronym> is automatically enabled; no action is
+ required from the administrator except ensuring that the
+ disk-space requirements for the <acronym>WAL</acronym> logs are met,
+ and that any necessary tuning is done (see <xref
+ linkend="wal-configuration"/>).
+ </para>
+
+ <para>
+ <acronym>WAL</acronym> records are appended to the <acronym>WAL</acronym>
+ logs as each new record is written. The insert position is described by
+ a Log Sequence Number (<acronym>LSN</acronym>) that is a byte offset into
+ the logs, increasing monotonically with each new record.
+ <acronym>LSN</acronym> values are returned as the datatype
+ <link linkend="datatype-pg-lsn"><type>pg_lsn</type></link>. Values can be
+ compared to calculate the volume of <acronym>WAL</acronym> data that
+ separates them, so they are used to measure the progress of replication
+ and recovery.
+ </para>
+
+ <para>
+ <acronym>WAL</acronym> logs are stored in the directory
+ <filename>pg_wal</filename> under the data directory, as a set of
+ segment files, normally each 16 MB in size (but the size can be changed
+ by altering the <option>--wal-segsize</option> <application>initdb</application> option). Each segment is
+ divided into pages, normally 8 kB each (this size can be changed via the
+ <option>--with-wal-blocksize</option> configure option). The log record headers
+ are described in <filename>access/xlogrecord.h</filename>; the record
+ content is dependent on the type of event that is being logged. Segment
+ files are given ever-increasing numbers as names, starting at
+ <filename>000000010000000000000001</filename>. The numbers do not wrap,
+ but it will take a very, very long time to exhaust the
+ available stock of numbers.
+ </para>
+
+ <para>
+ It is advantageous if the log is located on a different disk from the
+ main database files. This can be achieved by moving the
+ <filename>pg_wal</filename> directory to another location (while the server
+ is shut down, of course) and creating a symbolic link from the
+ original location in the main data directory to the new location.
+ </para>
+
+ <para>
+ The aim of <acronym>WAL</acronym> is to ensure that the log is
+ written before database records are altered, but this can be subverted by
+ disk drives<indexterm><primary>disk drive</primary></indexterm> that falsely report a
+ successful write to the kernel,
+ when in fact they have only cached the data and not yet stored it
+ on the disk. A power failure in such a situation might lead to
+ irrecoverable data corruption. Administrators should try to ensure
+ that disks holding <productname>PostgreSQL</productname>'s
+ <acronym>WAL</acronym> log files do not make such false reports.
+ (See <xref linkend="wal-reliability"/>.)
+ </para>
+
+ <para>
+ After a checkpoint has been made and the log flushed, the
+ checkpoint's position is saved in the file
+ <filename>pg_control</filename>. Therefore, at the start of recovery,
+ the server first reads <filename>pg_control</filename> and
+ then the checkpoint record; then it performs the REDO operation by
+ scanning forward from the log location indicated in the checkpoint
+ record. Because the entire content of data pages is saved in the
+ log on the first page modification after a checkpoint (assuming
+ <xref linkend="guc-full-page-writes"/> is not disabled), all pages
+ changed since the checkpoint will be restored to a consistent
+ state.
+ </para>
+
+ <para>
+ To deal with the case where <filename>pg_control</filename> is
+ corrupt, we should support the possibility of scanning existing log
+ segments in reverse order &mdash; newest to oldest &mdash; in order to find the
+ latest checkpoint. This has not been implemented yet.
+ <filename>pg_control</filename> is small enough (less than one disk page)
+ that it is not subject to partial-write problems, and as of this writing
+ there have been no reports of database failures due solely to the inability
+ to read <filename>pg_control</filename> itself. So while it is
+ theoretically a weak spot, <filename>pg_control</filename> does not
+ seem to be a problem in practice.
+ </para>
+ </sect1>
+</chapter>