1 files changed, 911 insertions, 0 deletions
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
new file mode 100644
index 0000000..27fb020
--- /dev/null
+++ b/doc/src/sgml/wal.sgml
@@ -0,0 +1,911 @@
+<!-- doc/src/sgml/wal.sgml -->
+
+<chapter id="wal">
+ <title>Reliability and the Write-Ahead Log</title>
+
+ <para>
+  This chapter explains how the Write-Ahead Log is used to obtain
+  efficient, reliable operation.
+ </para>
+
+ <sect1 id="wal-reliability">
+  <title>Reliability</title>
+
+  <para>
+   Reliability is an important property of any serious database
+   system, and <productname>PostgreSQL</productname> does everything possible to
+   guarantee reliable operation. One aspect of reliable operation is
+   that all data recorded by a committed transaction should be stored
+   in a nonvolatile area that is safe from power loss, operating
+   system failure, and hardware failure (except failure of the
+   nonvolatile area itself, of course).  Successfully writing the data
+   to the computer's permanent storage (disk drive or equivalent)
+   ordinarily meets this requirement.  In fact, even if a computer is
+   fatally damaged, if the disk drives survive they can be moved to
+   another computer with similar hardware and all committed
+   transactions will remain intact.
+  </para>
+
+  <para>
+   While forcing data to the disk platters periodically might seem like
+   a simple operation, it is not. Because disk drives are dramatically
+   slower than main memory and CPUs, several layers of caching exist
+   between the computer's main memory and the disk platters.
+   First, there is the operating system's buffer cache, which caches
+   frequently requested disk blocks and combines disk writes. Fortunately,
+   all operating systems give applications a way to force writes from
+   the buffer cache to disk, and <productname>PostgreSQL</productname> uses those
+   features.  (See the <xref linkend="guc-wal-sync-method"/> parameter
+   to adjust how this is done.)
+  </para>
+
+  <para>
+   Next, there might be a cache in the disk drive controller; this is
+   particularly common on <acronym>RAID</acronym> controller cards. Some of
+   these caches are <firstterm>write-through</firstterm>, meaning writes are sent
+   to the drive as soon as they arrive. Others are
+   <firstterm>write-back</firstterm>, meaning data is sent to the drive at
+   some later time. Such caches can be a reliability hazard because the
+   memory in the disk controller cache is volatile, and will lose its
+   contents in a power failure.  Better controller cards have
+   <firstterm>battery-backup units</firstterm> (<acronym>BBU</acronym>s), meaning
+   the card has a battery that
+   maintains power to the cache in case of system power loss.  After power
+   is restored the data will be written to the disk drives.
+  </para>
+
+  <para>
+   And finally, most disk drives have caches. Some are write-through
+   while some are write-back, and the same concerns about data loss
+   exist for write-back drive caches as for disk controller
+   caches.  Consumer-grade IDE and SATA drives are particularly likely
+   to have write-back caches that will not survive a power failure.  Many
+   solid-state drives (SSD) also have volatile write-back caches.
+  </para>
+
+  <para>
+   These caches can typically be disabled; however, the method for doing
+   this varies by operating system and drive type:
+  </para>
+
+  <itemizedlist>
+    <listitem>
+      <para>
+        On <productname>Linux</productname>, IDE and SATA drives can be queried using
+        <command>hdparm -I</command>; write caching is enabled if there is
+        a <literal>*</literal> next to <literal>Write cache</literal>.  <command>hdparm -W 0</command>
+        can be used to turn off write caching.  SCSI drives can be queried
+        using <ulink url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</application></ulink>.
+        Use <command>sdparm --get=WCE</command> to check
+        whether the write cache is enabled and <command>sdparm --clear=WCE</command>
+        to disable it.
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        On <productname>FreeBSD</productname>, IDE drives can be queried using
+        <command>atacontrol</command> and write caching turned off using
+        <literal>hw.ata.wc=0</literal> in <filename>/boot/loader.conf</filename>;
+        SCSI drives can be queried using <command>camcontrol identify</command>,
+        and the write cache both queried and changed using
+        <command>sdparm</command> when available.
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        On <productname>Solaris</productname>, the disk write cache is controlled by
+        <command>format -e</command>.
+        (The Solaris <acronym>ZFS</acronym> file system is safe with disk write-cache
+        enabled because it issues its own disk cache flush commands.)
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        On <productname>Windows</productname>, if <varname>wal_sync_method</varname> is
+        <literal>open_datasync</literal> (the default), write caching can be disabled
+        by unchecking <literal>My Computer\Open\<replaceable>disk drive</replaceable>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</literal>.
+        Alternatively, set <varname>wal_sync_method</varname> to
+        <literal>fsync</literal> or <literal>fsync_writethrough</literal>, which prevent
+        write caching.
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        On <productname>macOS</productname>, write caching can be prevented by
+        setting <varname>wal_sync_method</varname> to <literal>fsync_writethrough</literal>.
+      </para>
+    </listitem>
+  </itemizedlist>
+
+  <para>
+   Recent SATA drives (those following <acronym>ATAPI-6</acronym> or later)
+   offer a drive cache flush command (<command>FLUSH CACHE EXT</command>),
+   while SCSI drives have long supported a similar command
+   <command>SYNCHRONIZE CACHE</command>.  These commands are not directly
+   accessible to <productname>PostgreSQL</productname>, but some file systems
+   (e.g., <acronym>ZFS</acronym>, <acronym>ext4</acronym>) can use them to flush
+   data to the platters on write-back-enabled drives.  Unfortunately, such
+   file systems behave suboptimally when combined with battery-backup unit
+   (<acronym>BBU</acronym>) disk controllers.  In such setups, the synchronize
+   command forces all data from the controller cache to the disks,
+   eliminating much of the benefit of the BBU.  You can run the
+   <xref linkend="pgtestfsync"/> program to see
+   if you are affected.  If you are affected, the performance benefits
+   of the BBU can be regained by turning off write barriers in
+   the file system or reconfiguring the disk controller, if that is
+   an option.  If write barriers are turned off, make sure the battery
+   remains functional; a faulty battery can potentially lead to data loss.
+   Hopefully file system and disk controller designers will eventually
+   address this suboptimal behavior.
+  </para>
+
+  <para>
+   When the operating system sends a write request to the storage hardware,
+   there is little it can do to make sure the data has arrived at a truly
+   non-volatile storage area. Rather, it is the
+   administrator's responsibility to make certain that all storage components
+   ensure integrity for both data and file-system metadata.
+   Avoid disk controllers that have non-battery-backed write caches.
+   At the drive level, disable write-back caching if the
+   drive cannot guarantee the data will be written before shutdown.
+   If you use SSDs, be aware that many of these do not honor cache flush
+   commands by default.
+   You can test for reliable I/O subsystem behavior using <ulink
+   url="https://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>.
+  </para>
+
+  <para>
+   Another risk of data loss is posed by the disk platter write
+   operations themselves. Disk platters are divided into sectors,
+   commonly 512 bytes each.  Every physical read or write operation
+   processes a whole sector.
+   When a write request arrives at the drive, it might be for some multiple
+   of 512 bytes (<productname>PostgreSQL</productname> typically writes 8192 bytes, or
+   16 sectors, at a time), and the process of writing could fail due
+   to power loss at any time, meaning some of the 512-byte sectors were
+   written while others were not.  To guard against such failures,
+   <productname>PostgreSQL</productname> periodically writes full page images to
+   permanent WAL storage <emphasis>before</emphasis> modifying the actual page on
+   disk. By doing this, during crash recovery <productname>PostgreSQL</productname> can
+   restore partially-written pages from WAL.  If you have file-system software
+   that prevents partial page writes (e.g., ZFS),  you can turn off
+   this page imaging by turning off the <xref
+   linkend="guc-full-page-writes"/> parameter. Battery-Backed Unit
+   (BBU) disk controllers do not prevent partial page writes unless
+   they guarantee that data is written to the BBU as full (8kB) pages.
+  </para>
+  <para>
+   <productname>PostgreSQL</productname> also protects against some kinds of data corruption
+   on storage devices that may occur because of hardware errors or media failure over time,
+   such as reading/writing garbage data.
+   <itemizedlist>
+    <listitem>
+     <para>
+      Each individual record in a WAL file is protected by a CRC-32 (32-bit) check
+      that allows us to tell if record contents are correct. The CRC value
+      is set when we write each WAL record and checked during crash recovery,
+      archive recovery and replication.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Data pages are not currently checksummed by default, though full page images
+      recorded in WAL records will be protected; see <link
+      linkend="app-initdb-data-checksums"><application>initdb</application></link>
+      for details about enabling data checksums.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Internal data structures such as <filename>pg_xact</filename>, <filename>pg_subtrans</filename>, <filename>pg_multixact</filename>,
+      <filename>pg_serial</filename>, <filename>pg_notify</filename>, <filename>pg_stat</filename>, <filename>pg_snapshots</filename> are not directly
+      checksummed, nor are pages protected by full page writes. However, where
+      such data structures are persistent, WAL records are written that allow
+      recent changes to be accurately rebuilt at crash recovery and those
+      WAL records are protected as discussed above.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Individual state files in <filename>pg_twophase</filename> are protected by CRC-32.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Temporary data files used in larger SQL queries for sorts,
+      materializations and intermediate results are not currently checksummed,
+      nor will WAL records be written for changes to those files.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+  <para>
+   <productname>PostgreSQL</productname> does not protect against correctable memory errors
+   and it is assumed you will operate using RAM that uses industry standard
+   Error Correcting Codes (ECC) or better protection.
+  </para>
+ </sect1>
+
+ <sect1 id="checksums">
+  <title>Data Checksums</title>
+  <indexterm>
+   <primary>checksums</primary>
+  </indexterm>
+
+  <para>
+   By default, data pages are not protected by checksums, but this can
+   optionally be enabled for a cluster. When enabled, each data page includes
+   a checksum that is updated when the page is written and verified each time
+   the page is read. Only data pages are protected by checksums; internal data
+   structures and temporary files are not.
+  </para>
+
+  <para>
+   Checksums are normally enabled when the cluster is initialized using <link
+   linkend="app-initdb-data-checksums"><application>initdb</application></link>.
+   They can also be enabled or disabled at a later time as an offline
+   operation. Data checksums are enabled or disabled at the full cluster
+   level, and cannot be specified individually for databases or tables.
+  </para>
+
+  <para>
+   The current state of checksums in the cluster can be verified by viewing the
+   value of the read-only configuration variable <xref
+   linkend="guc-data-checksums" /> by issuing the command <command>SHOW
+   data_checksums</command>.
+  </para>
+
+  <para>
+   When attempting to recover from page corruptions, it may be necessary to
+   bypass the checksum protection. To do this, temporarily set the
+   configuration parameter <xref linkend="guc-ignore-checksum-failure" />.
+  </para>
+
+  <sect2 id="checksums-offline-enable-disable">
+   <title>Off-line Enabling of Checksums</title>
+
+   <para>
+    The <link linkend="app-pgchecksums"><application>pg_checksums</application></link>
+    application can be used to enable or disable data checksums, as well as
+    verify checksums, on an offline cluster.
+   </para>
+
+  </sect2>
+ </sect1>
+
+  <sect1 id="wal-intro">
+   <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
+
+   <indexterm zone="wal">
+    <primary>WAL</primary>
+   </indexterm>
+
+   <indexterm>
+    <primary>transaction log</primary>
+    <see>WAL</see>
+   </indexterm>
+
+   <para>
+    <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
+    is a standard method for ensuring data integrity.  A detailed
+    description can be found in most (if not all) books about
+    transaction processing. Briefly, <acronym>WAL</acronym>'s central
+    concept is that changes to data files (where tables and indexes
+    reside) must be written only after those changes have been logged,
+    that is, after log records describing the changes have been flushed
+    to permanent storage. If we follow this procedure, we do not need
+    to flush data pages to disk on every transaction commit, because we
+    know that in the event of a crash we will be able to recover the
+    database using the log: any changes that have not been applied to
+    the data pages can be redone from the log records.  (This is
+    roll-forward recovery, also known as REDO.)
+   </para>
+
+   <tip>
+    <para>
+     Because <acronym>WAL</acronym> restores database file
+     contents after a crash, journaled file systems are not necessary for
+     reliable storage of the data files or WAL files.  In fact, journaling
+     overhead can reduce performance, especially if journaling
+     causes file system <emphasis>data</emphasis> to be flushed
+     to disk.  Fortunately, data flushing during journaling can
+     often be disabled with a file system mount option, e.g.,
+     <literal>data=writeback</literal> on a Linux ext3 file system.
+     Journaled file systems do improve boot speed after a crash.
+    </para>
+   </tip>
+
+
+   <para>
+    Using <acronym>WAL</acronym> results in a
+    significantly reduced number of disk writes, because only the log
+    file needs to be flushed to disk to guarantee that a transaction is
+    committed, rather than every data file changed by the transaction.
+    The log file is written sequentially,
+    and so the cost of syncing the log is much less than the cost of
+    flushing the data pages.  This is especially true for servers
+    handling many small transactions touching different parts of the data
+    store.  Furthermore, when the server is processing many small concurrent
+    transactions, one <function>fsync</function> of the log file may
+    suffice to commit many transactions.
+   </para>
+
+   <para>
+    <acronym>WAL</acronym> also makes it possible to support on-line
+    backup and point-in-time recovery, as described in <xref
+    linkend="continuous-archiving"/>.  By archiving the WAL data we can support
+    reverting to any time instant covered by the available WAL data:
+    we simply install a prior physical backup of the database, and
+    replay the WAL log just as far as the desired time.  What's more,
+    the physical backup doesn't have to be an instantaneous snapshot
+    of the database state &mdash; if it is made over some period of time,
+    then replaying the WAL log for that period will fix any internal
+    inconsistencies.
+   </para>
+  </sect1>
+
+ <sect1 id="wal-async-commit">
+  <title>Asynchronous Commit</title>
+
+   <indexterm>
+    <primary>synchronous commit</primary>
+   </indexterm>
+
+   <indexterm>
+    <primary>asynchronous commit</primary>
+   </indexterm>
+
+  <para>
+   <firstterm>Asynchronous commit</firstterm> is an option that allows transactions
+   to complete more quickly, at the cost that the most recent transactions may
+   be lost if the database should crash.  In many applications this is an
+   acceptable trade-off.
+  </para>
+
+  <para>
+   As described in the previous section, transaction commit is normally
+   <firstterm>synchronous</firstterm>: the server waits for the transaction's
+   <acronym>WAL</acronym> records to be flushed to permanent storage
+   before returning a success indication to the client.  The client is
+   therefore guaranteed that a transaction reported to be committed will
+   be preserved, even in the event of a server crash immediately after.
+   However, for short transactions this delay is a major component of the
+   total transaction time.  Selecting asynchronous commit mode means that
+   the server returns success as soon as the transaction is logically
+   completed, before the <acronym>WAL</acronym> records it generated have
+   actually made their way to disk.  This can provide a significant boost
+   in throughput for small transactions.
+  </para>
+
+  <para>
+   Asynchronous commit introduces the risk of data loss. There is a short
+   time window between the report of transaction completion to the client
+   and the time that the transaction is truly committed (that is, it is
+   guaranteed not to be lost if the server crashes).  Thus asynchronous
+   commit should not be used if the client will take external actions
+   relying on the assumption that the transaction will be remembered.
+   As an example, a bank would certainly not use asynchronous commit for
+   a transaction recording an ATM's dispensing of cash.  But in many
+   scenarios, such as event logging, there is no need for a strong
+   guarantee of this kind.
+  </para>
+
+  <para>
+   The risk that is taken by using asynchronous commit is of data loss,
+   not data corruption.  If the database should crash, it will recover
+   by replaying <acronym>WAL</acronym> up to the last record that was
+   flushed.  The database will therefore be restored to a self-consistent
+   state, but any transactions that were not yet flushed to disk will
+   not be reflected in that state.  The net effect is therefore loss of
+   the last few transactions.  Because the transactions are replayed in
+   commit order, no inconsistency can be introduced &mdash; for example,
+   if transaction B made changes relying on the effects of a previous
+   transaction A, it is not possible for A's effects to be lost while B's
+   effects are preserved.
+  </para>
+
+  <para>
+   The user can select the commit mode of each transaction, so that
+   it is possible to have both synchronous and asynchronous commit
+   transactions running concurrently.  This allows flexible trade-offs
+   between performance and certainty of transaction durability.
+   The commit mode is controlled by the user-settable parameter
+   <xref linkend="guc-synchronous-commit"/>, which can be changed in any of
+   the ways that a configuration parameter can be set.  The mode used for
+   any one transaction depends on the value of
+   <varname>synchronous_commit</varname> when transaction commit begins.
+  </para>
+
+  <para>
+   Certain utility commands, for instance <command>DROP TABLE</command>, are
+   forced to commit synchronously regardless of the setting of
+   <varname>synchronous_commit</varname>.  This is to ensure consistency
+   between the server's file system and the logical state of the database.
+   The commands supporting two-phase commit, such as <command>PREPARE
+   TRANSACTION</command>, are also always synchronous.
+  </para>
+
+  <para>
+   If the database crashes during the risk window between an
+   asynchronous commit and the writing of the transaction's
+   <acronym>WAL</acronym> records,
+   then changes made during that transaction <emphasis>will</emphasis> be lost.
+   The duration of the
+   risk window is limited because a background process (the <quote>WAL
+   writer</quote>) flushes unwritten <acronym>WAL</acronym> records to disk
+   every <xref linkend="guc-wal-writer-delay"/> milliseconds.
+   The actual maximum duration of the risk window is three times
+   <varname>wal_writer_delay</varname> because the WAL writer is
+   designed to favor writing whole pages at a time during busy periods.
+  </para>
+
+  <caution>
+   <para>
+    An immediate-mode shutdown is equivalent to a server crash, and will
+    therefore cause loss of any unflushed asynchronous commits.
+   </para>
+  </caution>
+
+  <para>
+   Asynchronous commit provides behavior different from setting
+   <xref linkend="guc-fsync"/> = off.
+   <varname>fsync</varname> is a server-wide
+   setting that will alter the behavior of all transactions.  It disables
+   all logic within <productname>PostgreSQL</productname> that attempts to synchronize
+   writes to different portions of the database, and therefore a system
+   crash (that is, a hardware or operating system crash, not a failure of
+   <productname>PostgreSQL</productname> itself) could result in arbitrarily bad
+   corruption of the database state.  In many scenarios, asynchronous
+   commit provides most of the performance improvement that could be
+   obtained by turning off <varname>fsync</varname>, but without the risk
+   of data corruption.
+  </para>
+
+  <para>
+   <xref linkend="guc-commit-delay"/> also sounds very similar to
+   asynchronous commit, but it is actually a synchronous commit method
+   (in fact, <varname>commit_delay</varname> is ignored during an
+   asynchronous commit).  <varname>commit_delay</varname> causes a delay
+   just before a transaction flushes <acronym>WAL</acronym> to disk, in
+   the hope that a single flush executed by one such transaction can also
+   serve other transactions committing at about the same time.  The
+   setting can be thought of as a way of increasing the time window in
+   which transactions can join a group about to participate in a single
+   flush, to amortize the cost of the flush among multiple transactions.
+  </para>
+
+ </sect1>
+
+ <sect1 id="wal-configuration">
+  <title><acronym>WAL</acronym> Configuration</title>
+
+  <para>
+   There are several <acronym>WAL</acronym>-related configuration parameters that
+   affect database performance. This section explains their use.
+   Consult <xref linkend="runtime-config"/> for general information about
+   setting server configuration parameters.
+  </para>
+
+  <para>
+   <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</primary></indexterm>
+   are points in the sequence of transactions at which it is guaranteed
+   that the heap and index data files have been updated with all
+   information written before that checkpoint.  At checkpoint time, all
+   dirty data pages are flushed to disk and a special checkpoint record is
+   written to the log file.  (The change records were previously flushed
+   to the <acronym>WAL</acronym> files.)
+   In the event of a crash, the crash recovery procedure looks at the latest
+   checkpoint record to determine the point in the log (known as the redo
+   record) from which it should start the REDO operation.  Any changes made to
+   data files before that point are guaranteed to be already on disk.
+   Hence, after a checkpoint, log segments preceding the one containing
+   the redo record are no longer needed and can be recycled or removed. (When
+   <acronym>WAL</acronym> archiving is being done, the log segments must be
+   archived before being recycled or removed.)
+  </para>
+
+  <para>
+   The checkpoint requirement of flushing all dirty data pages to disk
+   can cause a significant I/O load.  For this reason, checkpoint
+   activity is throttled so that I/O begins at checkpoint start and completes
+   before the next checkpoint is due to start; this minimizes performance
+   degradation during checkpoints.
+  </para>
+
+  <para>
+   The server's checkpointer process automatically performs
+   a checkpoint every so often.  A checkpoint is begun every <xref
+   linkend="guc-checkpoint-timeout"/> seconds, or if
+   <xref linkend="guc-max-wal-size"/> is about to be exceeded,
+   whichever comes first.
+   The default settings are 5 minutes and 1 GB, respectively.
+   If no WAL has been written since the previous checkpoint, new checkpoints
+   will be skipped even if <varname>checkpoint_timeout</varname> has passed.
+   (If WAL archiving is being used and you want to put a lower limit on how
+   often files are archived in order to bound potential data loss, you should
+   adjust the <xref linkend="guc-archive-timeout"/> parameter rather than the
+   checkpoint parameters.)
+   It is also possible to force a checkpoint by using the SQL
+   command <command>CHECKPOINT</command>.
+  </para>
+
+  <para>
+   Reducing <varname>checkpoint_timeout</varname> and/or
+   <varname>max_wal_size</varname> causes checkpoints to occur
+   more often. This allows faster after-crash recovery, since less work
+   will need to be redone. However, one must balance this against the
+   increased cost of flushing dirty data pages more often. If
+   <xref linkend="guc-full-page-writes"/> is set (as is the default), there is
+   another factor to consider. To ensure data page consistency,
+   the first modification of a data page after each checkpoint results in
+   logging the entire page content. In that case,
+   a smaller checkpoint interval increases the volume of output to the WAL log,
+   partially negating the goal of using a smaller interval,
+   and in any case causing more disk I/O.
+  </para>
+
+  <para>
+   Checkpoints are fairly expensive, first because they require writing
+   out all currently dirty buffers, and second because they result in
+   extra subsequent WAL traffic as discussed above.  It is therefore
+   wise to set the checkpointing parameters high enough so that checkpoints
+   don't happen too often.  As a simple sanity check on your checkpointing
+   parameters, you can set the <xref linkend="guc-checkpoint-warning"/>
+   parameter.  If checkpoints happen closer together than
+   <varname>checkpoint_warning</varname> seconds,
+   a message will be output to the server log recommending increasing
+   <varname>max_wal_size</varname>.  Occasional appearance of such
+   a message is not cause for alarm, but if it appears often then the
+   checkpoint control parameters should be increased. Bulk operations such
+   as large <command>COPY</command> transfers might cause a number of such warnings
+   to appear if you have not set <varname>max_wal_size</varname> high
+   enough.
+  </para>
+
+  <para>
+   To avoid flooding the I/O system with a burst of page writes,
+   writing dirty buffers during a checkpoint is spread over a period of time.
+   That period is controlled by
+   <xref linkend="guc-checkpoint-completion-target"/>, which is
+   given as a fraction of the checkpoint interval (configured by using
+   <varname>checkpoint_timeout</varname>).
+   The I/O rate is adjusted so that the checkpoint finishes when the
+   given fraction of
+   <varname>checkpoint_timeout</varname> seconds have elapsed, or before
+   <varname>max_wal_size</varname> is exceeded, whichever is sooner.
+   With the default value of 0.9,
+   <productname>PostgreSQL</productname> can be expected to complete each checkpoint
+   a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's
+   duration).  This spreads out the I/O as much as possible so that the checkpoint
+   I/O load is consistent throughout the checkpoint interval.  The disadvantage of
+   this is that prolonging checkpoints affects recovery time, because more WAL
+   segments will need to be kept around for possible use in recovery.  A user
+   concerned about the amount of time required to recover might wish to reduce
+   <varname>checkpoint_timeout</varname> so that checkpoints occur more frequently
+   but still spread the I/O across the checkpoint interval.  Alternatively,
+   <varname>checkpoint_completion_target</varname> could be reduced, but this would
+   result in times of more intense I/O (during the checkpoint) and times of less I/O
+   (after the checkpoint completed but before the next scheduled checkpoint) and
+   therefore is not recommended.
+   Although <varname>checkpoint_completion_target</varname> could be set as high as
+   1.0, it is typically recommended to set it to no higher than 0.9 (the default)
+   since checkpoints include some other activities besides writing dirty buffers.
+   A setting of 1.0 is quite likely to result in checkpoints not being
+   completed on time, which would result in performance loss due to
+   unexpected variation in the number of WAL segments needed.
+  </para>
+
+  <para>
+   On Linux and POSIX platforms <xref linkend="guc-checkpoint-flush-after"/>
+   allows to force the OS that pages written by the checkpoint should be
+   flushed to disk after a configurable number of bytes.  Otherwise, these
+   pages may be kept in the OS's page cache, inducing a stall when
+   <literal>fsync</literal> is issued at the end of a checkpoint.  This setting will
+   often help to reduce transaction latency, but it also can have an adverse
+   effect on performance; particularly for workloads that are bigger than
+   <xref linkend="guc-shared-buffers"/>, but smaller than the OS's page cache.
+  </para>
+
+  <para>
+   The number of WAL segment files in <filename>pg_wal</filename> directory depends on
+   <varname>min_wal_size</varname>, <varname>max_wal_size</varname> and
+   the amount of WAL generated in previous checkpoint cycles. When old log
+   segment files are no longer needed, they are removed or recycled (that is,
+   renamed to become future segments in the numbered sequence). If, due to a
+   short-term peak of log output rate, <varname>max_wal_size</varname> is
+   exceeded, the unneeded segment files will be removed until the system
+   gets back under this limit. Below that limit, the system recycles enough
+   WAL files to cover the estimated need until the next checkpoint, and
+   removes the rest. The estimate is based on a moving average of the number
+   of WAL files used in previous checkpoint cycles. The moving average
+   is increased immediately if the actual usage exceeds the estimate, so it
+   accommodates peak usage rather than average usage to some extent.
+   <varname>min_wal_size</varname> puts a minimum on the amount of WAL files
+   recycled for future usage; that much WAL is always recycled for future use,
+   even if the system is idle and the WAL usage estimate suggests that little
+   WAL is needed.
+  </para>
+
+  <para>
+   Independently of <varname>max_wal_size</varname>,
+   the most recent <xref linkend="guc-wal-keep-size"/> megabytes of
+   WAL files plus one additional WAL file are
+   kept at all times. Also, if WAL archiving is used, old segments cannot be
+   removed or recycled until they are archived. If WAL archiving cannot keep up
+   with the pace that WAL is generated, or if <varname>archive_command</varname>
+   or <varname>archive_library</varname>
+   fails repeatedly, old WAL files will accumulate in <filename>pg_wal</filename>
+   until the situation is resolved. A slow or failed standby server that
+   uses a replication slot will have the same effect (see
+   <xref linkend="streaming-replication-slots"/>).
+  </para>
+
+  <para>
+   In archive recovery or standby mode, the server periodically performs
+   <firstterm>restartpoints</firstterm>,<indexterm><primary>restartpoint</primary></indexterm>
+   which are similar to checkpoints in normal operation: the server forces
+   all its state to disk, updates the <filename>pg_control</filename> file to
+   indicate that the already-processed WAL data need not be scanned again,
+   and then recycles any old log segment files in the <filename>pg_wal</filename>
+   directory.
+   Restartpoints can't be performed more frequently than checkpoints on the
+   primary because restartpoints can only be performed at checkpoint records.
+   A restartpoint is triggered when a checkpoint record is reached if at
+   least <varname>checkpoint_timeout</varname> seconds have passed since the last
+   restartpoint, or if WAL size is about to exceed
+   <varname>max_wal_size</varname>. However, because of limitations on when a
+   restartpoint can be performed, <varname>max_wal_size</varname> is often exceeded
+   during recovery, by up to one checkpoint cycle's worth of WAL.
+   (<varname>max_wal_size</varname> is never a hard limit anyway, so you should
+   always leave plenty of headroom to avoid running out of disk space.)
+  </para>
+
+  <para>
+   There are two commonly used internal <acronym>WAL</acronym> functions:
+   <function>XLogInsertRecord</function> and <function>XLogFlush</function>.
+   <function>XLogInsertRecord</function> is used to place a new record into
+   the <acronym>WAL</acronym> buffers in shared memory. If there is no
+   space for the new record, <function>XLogInsertRecord</function> will have
+   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
+   buffers. This is undesirable because <function>XLogInsertRecord</function>
+   is used on every database low level modification (for example, row
+   insertion) at a time when an exclusive lock is held on affected
+   data pages, so the operation needs to be as fast as possible.  What
+   is worse, writing <acronym>WAL</acronym> buffers might also force the
+   creation of a new log segment, which takes even more
+   time. Normally, <acronym>WAL</acronym> buffers should be written
+   and flushed by an <function>XLogFlush</function> request, which is
+   made, for the most part, at transaction commit time to ensure that
+   transaction records are flushed to permanent storage. On systems
+   with high log output, <function>XLogFlush</function> requests might
+   not occur often enough to prevent <function>XLogInsertRecord</function>
+   from having to do writes.  On such systems
+   one should increase the number of <acronym>WAL</acronym> buffers by
+   modifying the <xref linkend="guc-wal-buffers"/> parameter.  When
+   <xref linkend="guc-full-page-writes"/> is set and the system is very busy,
+   setting <varname>wal_buffers</varname> higher will help smooth response times
+   during the period immediately following each checkpoint.
+  </para>
+
+  <para>
+   The <xref linkend="guc-commit-delay"/> parameter defines for how many
+   microseconds a group commit leader process will sleep after acquiring a
+   lock within <function>XLogFlush</function>, while group commit
+   followers queue up behind the leader.  This delay allows other server
+   processes to add their commit records to the WAL buffers so that all of
+   them will be flushed by the leader's eventual sync operation.  No sleep
+   will occur if <xref linkend="guc-fsync"/> is not enabled, or if fewer
+   than <xref linkend="guc-commit-siblings"/> other sessions are currently
+   in active transactions; this avoids sleeping when it's unlikely that
+   any other session will commit soon.  Note that on some platforms, the
+   resolution of a sleep request is ten milliseconds, so that any nonzero
+   <varname>commit_delay</varname> setting between 1 and 10000
+   microseconds would have the same effect.  Note also that on some
+   platforms, sleep operations may take slightly longer than requested by
+   the parameter.
+  </para>
+
+  <para>
+   Since the purpose of <varname>commit_delay</varname> is to allow the
+   cost of each flush operation to be amortized across concurrently
+   committing transactions (potentially at the expense of transaction
+   latency), it is necessary to quantify that cost before the setting can
+   be chosen intelligently.  The higher that cost is, the more effective
+   <varname>commit_delay</varname> is expected to be in increasing
+   transaction throughput, up to a point.  The <xref
+   linkend="pgtestfsync"/> program can be used to measure the average time
+   in microseconds that a single WAL flush operation takes.  A value of
+   half of the average time the program reports it takes to flush after a
+   single 8kB write operation is often the most effective setting for
+   <varname>commit_delay</varname>, so this value is recommended as the
+   starting point to use when optimizing for a particular workload.  While
+   tuning <varname>commit_delay</varname> is particularly useful when the
+   WAL log is stored on high-latency rotating disks, benefits can be
+   significant even on storage media with very fast sync times, such as
+   solid-state drives or RAID arrays with a battery-backed write cache;
+   but this should definitely be tested against a representative workload.
+   Higher values of <varname>commit_siblings</varname> should be used in
+   such cases, whereas smaller <varname>commit_siblings</varname> values
+   are often helpful on higher latency media.  Note that it is quite
+   possible that a setting of <varname>commit_delay</varname> that is too
+   high can increase transaction latency by so much that total transaction
+   throughput suffers.
+  </para>
+
+  <para>
+   When <varname>commit_delay</varname> is set to zero (the default), it
+   is still possible for a form of group commit to occur, but each group
+   will consist only of sessions that reach the point where they need to
+   flush their commit records during the window in which the previous
+   flush operation (if any) is occurring.  At higher client counts a
+   <quote>gangway effect</quote> tends to occur, so that the effects of group
+   commit become significant even when <varname>commit_delay</varname> is
+   zero, and thus explicitly setting <varname>commit_delay</varname> tends
+   to help less.  Setting <varname>commit_delay</varname> can only help
+   when (1) there are some concurrently committing transactions, and (2)
+   throughput is limited to some degree by commit rate; but with high
+   rotational latency this setting can be effective in increasing
+   transaction throughput with as few as two clients (that is, a single
+   committing client with one sibling transaction).
+  </para>
+
+  <para>
+   The <xref linkend="guc-wal-sync-method"/> parameter determines how
+   <productname>PostgreSQL</productname> will ask the kernel to force
+   <acronym>WAL</acronym> updates out to disk.
+   All the options should be the same in terms of reliability, with
+   the exception of <literal>fsync_writethrough</literal>, which can sometimes
+   force a flush of the disk cache even when other options do not do so.
+   However, it's quite platform-specific which one will be the fastest.
+   You can test the speeds of different options using the <xref
+   linkend="pgtestfsync"/> program.
+   Note that this parameter is irrelevant if <varname>fsync</varname>
+   has been turned off.
+  </para>
+
+  <para>
+   Enabling the <xref linkend="guc-wal-debug"/> configuration parameter
+   (provided that <productname>PostgreSQL</productname> has been
+   compiled with support for it) will result in each
+   <function>XLogInsertRecord</function> and <function>XLogFlush</function>
+   <acronym>WAL</acronym> call being logged to the server log. This
+   option might be replaced by a more general mechanism in the future.
+  </para>
+
+  <para>
+   There are two internal functions to write WAL data to disk:
+   <function>XLogWrite</function> and <function>issue_xlog_fsync</function>.
+   When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total
+   amounts of time <function>XLogWrite</function> writes and
+   <function>issue_xlog_fsync</function> syncs WAL data to disk are counted as
+   <literal>wal_write_time</literal> and <literal>wal_sync_time</literal> in
+   <xref linkend="pg-stat-wal-view"/>, respectively.
+   <function>XLogWrite</function> is normally called by
+   <function>XLogInsertRecord</function> (when there is no space for the new
+   record in WAL buffers), <function>XLogFlush</function> and the WAL writer,
+   to write WAL buffers to disk and call <function>issue_xlog_fsync</function>.
+   <function>issue_xlog_fsync</function> is normally called by
+   <function>XLogWrite</function> to sync WAL files to disk.
+   If <varname>wal_sync_method</varname> is either
+   <literal>open_datasync</literal> or <literal>open_sync</literal>,
+   a write operation in <function>XLogWrite</function> guarantees to sync written
+   WAL data to disk and <function>issue_xlog_fsync</function> does nothing.
+   If <varname>wal_sync_method</varname> is either <literal>fdatasync</literal>,
+   <literal>fsync</literal>, or <literal>fsync_writethrough</literal>,
+   the write operation moves WAL buffers to kernel cache and
+   <function>issue_xlog_fsync</function> syncs them to disk. Regardless
+   of the setting of <varname>track_wal_io_timing</varname>, the number
+   of times <function>XLogWrite</function> writes and
+   <function>issue_xlog_fsync</function> syncs WAL data to disk are also
+   counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
+   in <structname>pg_stat_wal</structname>, respectively.
+  </para>
+
+  <para>
+   The <xref linkend="guc-recovery-prefetch"/> parameter can be used to reduce
+   I/O wait times during recovery by instructing the kernel to initiate reads
+   of disk blocks that will soon be needed but are not currently in
+   <productname>PostgreSQL</productname>'s buffer pool.
+   The <xref linkend="guc-maintenance-io-concurrency"/> and
+   <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+   concurrency and distance, respectively.  By default, it is set to
+   <literal>try</literal>, which enables the feature on systems where
+   <function>posix_fadvise</function> is available.
+  </para>
+ </sect1>
+
+ <sect1 id="wal-internals">
+  <title>WAL Internals</title>
+
+  <indexterm zone="wal-internals">
+   <primary>LSN</primary>
+  </indexterm>
+
+  <para>
+   <acronym>WAL</acronym> is automatically enabled; no action is
+   required from the administrator except ensuring that the
+   disk-space requirements for the <acronym>WAL</acronym> logs are met,
+   and that any necessary tuning is done (see <xref
+   linkend="wal-configuration"/>).
+  </para>
+
+  <para>
+   <acronym>WAL</acronym> records are appended to the <acronym>WAL</acronym>
+   logs as each new record is written. The insert position is described by
+   a Log Sequence Number (<acronym>LSN</acronym>) that is a byte offset into
+   the logs, increasing monotonically with each new record.
+   <acronym>LSN</acronym> values are returned as the datatype
+   <link linkend="datatype-pg-lsn"><type>pg_lsn</type></link>. Values can be
+   compared to calculate the volume of <acronym>WAL</acronym> data that
+   separates them, so they are used to measure the progress of replication
+   and recovery.
+  </para>
+
+  <para>
+   <acronym>WAL</acronym> logs are stored in the directory
+   <filename>pg_wal</filename> under the data directory, as a set of
+   segment files, normally each 16 MB in size (but the size can be changed
+   by altering the <option>--wal-segsize</option> <application>initdb</application> option).  Each segment is
+   divided into pages, normally 8 kB each (this size can be changed via the
+   <option>--with-wal-blocksize</option> configure option).  The log record headers
+   are described in <filename>access/xlogrecord.h</filename>; the record
+   content is dependent on the type of event that is being logged.  Segment
+   files are given ever-increasing numbers as names, starting at
+   <filename>000000010000000000000001</filename>.  The numbers do not wrap,
+   but it will take a very, very long time to exhaust the
+   available stock of numbers.
+  </para>
+
+  <para>
+   It is advantageous if the log is located on a different disk from the
+   main database files.  This can be achieved by moving the
+   <filename>pg_wal</filename> directory to another location (while the server
+   is shut down, of course) and creating a symbolic link from the
+   original location in the main data directory to the new location.
+  </para>
+
+  <para>
+   The aim of <acronym>WAL</acronym> is to ensure that the log is
+   written before database records are altered, but this can be subverted by
+   disk drives<indexterm><primary>disk drive</primary></indexterm> that falsely report a
+   successful write to the kernel,
+   when in fact they have only cached the data and not yet stored it
+   on the disk.  A power failure in such a situation might lead to
+   irrecoverable data corruption.  Administrators should try to ensure
+   that disks holding <productname>PostgreSQL</productname>'s
+   <acronym>WAL</acronym> log files do not make such false reports.
+   (See <xref linkend="wal-reliability"/>.)
+  </para>
+
+  <para>
+   After a checkpoint has been made and the log flushed, the
+   checkpoint's position is saved in the file
+   <filename>pg_control</filename>. Therefore, at the start of recovery,
+   the server first reads <filename>pg_control</filename> and
+   then the checkpoint record; then it performs the REDO operation by
+   scanning forward from the log location indicated in the checkpoint
+   record.  Because the entire content of data pages is saved in the
+   log on the first page modification after a checkpoint (assuming
+   <xref linkend="guc-full-page-writes"/> is not disabled), all pages
+   changed since the checkpoint will be restored to a consistent
+   state.
+  </para>
+
+  <para>
+   To deal with the case where <filename>pg_control</filename> is
+   corrupt, we should support the possibility of scanning existing log
+   segments in reverse order &mdash; newest to oldest &mdash; in order to find the
+   latest checkpoint.  This has not been implemented yet.
+   <filename>pg_control</filename> is small enough (less than one disk page)
+   that it is not subject to partial-write problems, and as of this writing
+   there have been no reports of database failures due solely to the inability
+   to read <filename>pg_control</filename> itself.  So while it is
+   theoretically a weak spot, <filename>pg_control</filename> does not
+   seem to be a problem in practice.
+  </para>
+ </sect1>
+</chapter>