diff options
Diffstat (limited to 'doc/src/sgml/wal.sgml')
-rw-r--r-- | doc/src/sgml/wal.sgml | 911 |
1 files changed, 911 insertions, 0 deletions
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml new file mode 100644 index 0000000..27fb020 --- /dev/null +++ b/doc/src/sgml/wal.sgml @@ -0,0 +1,911 @@ +<!-- doc/src/sgml/wal.sgml --> + +<chapter id="wal"> + <title>Reliability and the Write-Ahead Log</title> + + <para> + This chapter explains how the Write-Ahead Log is used to obtain + efficient, reliable operation. + </para> + + <sect1 id="wal-reliability"> + <title>Reliability</title> + + <para> + Reliability is an important property of any serious database + system, and <productname>PostgreSQL</productname> does everything possible to + guarantee reliable operation. One aspect of reliable operation is + that all data recorded by a committed transaction should be stored + in a nonvolatile area that is safe from power loss, operating + system failure, and hardware failure (except failure of the + nonvolatile area itself, of course). Successfully writing the data + to the computer's permanent storage (disk drive or equivalent) + ordinarily meets this requirement. In fact, even if a computer is + fatally damaged, if the disk drives survive they can be moved to + another computer with similar hardware and all committed + transactions will remain intact. + </para> + + <para> + While forcing data to the disk platters periodically might seem like + a simple operation, it is not. Because disk drives are dramatically + slower than main memory and CPUs, several layers of caching exist + between the computer's main memory and the disk platters. + First, there is the operating system's buffer cache, which caches + frequently requested disk blocks and combines disk writes. Fortunately, + all operating systems give applications a way to force writes from + the buffer cache to disk, and <productname>PostgreSQL</productname> uses those + features. (See the <xref linkend="guc-wal-sync-method"/> parameter + to adjust how this is done.) + </para> + + <para> + Next, there might be a cache in the disk drive controller; this is + particularly common on <acronym>RAID</acronym> controller cards. Some of + these caches are <firstterm>write-through</firstterm>, meaning writes are sent + to the drive as soon as they arrive. Others are + <firstterm>write-back</firstterm>, meaning data is sent to the drive at + some later time. Such caches can be a reliability hazard because the + memory in the disk controller cache is volatile, and will lose its + contents in a power failure. Better controller cards have + <firstterm>battery-backup units</firstterm> (<acronym>BBU</acronym>s), meaning + the card has a battery that + maintains power to the cache in case of system power loss. After power + is restored the data will be written to the disk drives. + </para> + + <para> + And finally, most disk drives have caches. Some are write-through + while some are write-back, and the same concerns about data loss + exist for write-back drive caches as for disk controller + caches. Consumer-grade IDE and SATA drives are particularly likely + to have write-back caches that will not survive a power failure. Many + solid-state drives (SSD) also have volatile write-back caches. + </para> + + <para> + These caches can typically be disabled; however, the method for doing + this varies by operating system and drive type: + </para> + + <itemizedlist> + <listitem> + <para> + On <productname>Linux</productname>, IDE and SATA drives can be queried using + <command>hdparm -I</command>; write caching is enabled if there is + a <literal>*</literal> next to <literal>Write cache</literal>. <command>hdparm -W 0</command> + can be used to turn off write caching. SCSI drives can be queried + using <ulink url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</application></ulink>. + Use <command>sdparm --get=WCE</command> to check + whether the write cache is enabled and <command>sdparm --clear=WCE</command> + to disable it. + </para> + </listitem> + + <listitem> + <para> + On <productname>FreeBSD</productname>, IDE drives can be queried using + <command>atacontrol</command> and write caching turned off using + <literal>hw.ata.wc=0</literal> in <filename>/boot/loader.conf</filename>; + SCSI drives can be queried using <command>camcontrol identify</command>, + and the write cache both queried and changed using + <command>sdparm</command> when available. + </para> + </listitem> + + <listitem> + <para> + On <productname>Solaris</productname>, the disk write cache is controlled by + <command>format -e</command>. + (The Solaris <acronym>ZFS</acronym> file system is safe with disk write-cache + enabled because it issues its own disk cache flush commands.) + </para> + </listitem> + + <listitem> + <para> + On <productname>Windows</productname>, if <varname>wal_sync_method</varname> is + <literal>open_datasync</literal> (the default), write caching can be disabled + by unchecking <literal>My Computer\Open\<replaceable>disk drive</replaceable>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</literal>. + Alternatively, set <varname>wal_sync_method</varname> to + <literal>fsync</literal> or <literal>fsync_writethrough</literal>, which prevent + write caching. + </para> + </listitem> + + <listitem> + <para> + On <productname>macOS</productname>, write caching can be prevented by + setting <varname>wal_sync_method</varname> to <literal>fsync_writethrough</literal>. + </para> + </listitem> + </itemizedlist> + + <para> + Recent SATA drives (those following <acronym>ATAPI-6</acronym> or later) + offer a drive cache flush command (<command>FLUSH CACHE EXT</command>), + while SCSI drives have long supported a similar command + <command>SYNCHRONIZE CACHE</command>. These commands are not directly + accessible to <productname>PostgreSQL</productname>, but some file systems + (e.g., <acronym>ZFS</acronym>, <acronym>ext4</acronym>) can use them to flush + data to the platters on write-back-enabled drives. Unfortunately, such + file systems behave suboptimally when combined with battery-backup unit + (<acronym>BBU</acronym>) disk controllers. In such setups, the synchronize + command forces all data from the controller cache to the disks, + eliminating much of the benefit of the BBU. You can run the + <xref linkend="pgtestfsync"/> program to see + if you are affected. If you are affected, the performance benefits + of the BBU can be regained by turning off write barriers in + the file system or reconfiguring the disk controller, if that is + an option. If write barriers are turned off, make sure the battery + remains functional; a faulty battery can potentially lead to data loss. + Hopefully file system and disk controller designers will eventually + address this suboptimal behavior. + </para> + + <para> + When the operating system sends a write request to the storage hardware, + there is little it can do to make sure the data has arrived at a truly + non-volatile storage area. Rather, it is the + administrator's responsibility to make certain that all storage components + ensure integrity for both data and file-system metadata. + Avoid disk controllers that have non-battery-backed write caches. + At the drive level, disable write-back caching if the + drive cannot guarantee the data will be written before shutdown. + If you use SSDs, be aware that many of these do not honor cache flush + commands by default. + You can test for reliable I/O subsystem behavior using <ulink + url="https://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>. + </para> + + <para> + Another risk of data loss is posed by the disk platter write + operations themselves. Disk platters are divided into sectors, + commonly 512 bytes each. Every physical read or write operation + processes a whole sector. + When a write request arrives at the drive, it might be for some multiple + of 512 bytes (<productname>PostgreSQL</productname> typically writes 8192 bytes, or + 16 sectors, at a time), and the process of writing could fail due + to power loss at any time, meaning some of the 512-byte sectors were + written while others were not. To guard against such failures, + <productname>PostgreSQL</productname> periodically writes full page images to + permanent WAL storage <emphasis>before</emphasis> modifying the actual page on + disk. By doing this, during crash recovery <productname>PostgreSQL</productname> can + restore partially-written pages from WAL. If you have file-system software + that prevents partial page writes (e.g., ZFS), you can turn off + this page imaging by turning off the <xref + linkend="guc-full-page-writes"/> parameter. Battery-Backed Unit + (BBU) disk controllers do not prevent partial page writes unless + they guarantee that data is written to the BBU as full (8kB) pages. + </para> + <para> + <productname>PostgreSQL</productname> also protects against some kinds of data corruption + on storage devices that may occur because of hardware errors or media failure over time, + such as reading/writing garbage data. + <itemizedlist> + <listitem> + <para> + Each individual record in a WAL file is protected by a CRC-32 (32-bit) check + that allows us to tell if record contents are correct. The CRC value + is set when we write each WAL record and checked during crash recovery, + archive recovery and replication. + </para> + </listitem> + <listitem> + <para> + Data pages are not currently checksummed by default, though full page images + recorded in WAL records will be protected; see <link + linkend="app-initdb-data-checksums"><application>initdb</application></link> + for details about enabling data checksums. + </para> + </listitem> + <listitem> + <para> + Internal data structures such as <filename>pg_xact</filename>, <filename>pg_subtrans</filename>, <filename>pg_multixact</filename>, + <filename>pg_serial</filename>, <filename>pg_notify</filename>, <filename>pg_stat</filename>, <filename>pg_snapshots</filename> are not directly + checksummed, nor are pages protected by full page writes. However, where + such data structures are persistent, WAL records are written that allow + recent changes to be accurately rebuilt at crash recovery and those + WAL records are protected as discussed above. + </para> + </listitem> + <listitem> + <para> + Individual state files in <filename>pg_twophase</filename> are protected by CRC-32. + </para> + </listitem> + <listitem> + <para> + Temporary data files used in larger SQL queries for sorts, + materializations and intermediate results are not currently checksummed, + nor will WAL records be written for changes to those files. + </para> + </listitem> + </itemizedlist> + </para> + <para> + <productname>PostgreSQL</productname> does not protect against correctable memory errors + and it is assumed you will operate using RAM that uses industry standard + Error Correcting Codes (ECC) or better protection. + </para> + </sect1> + + <sect1 id="checksums"> + <title>Data Checksums</title> + <indexterm> + <primary>checksums</primary> + </indexterm> + + <para> + By default, data pages are not protected by checksums, but this can + optionally be enabled for a cluster. When enabled, each data page includes + a checksum that is updated when the page is written and verified each time + the page is read. Only data pages are protected by checksums; internal data + structures and temporary files are not. + </para> + + <para> + Checksums are normally enabled when the cluster is initialized using <link + linkend="app-initdb-data-checksums"><application>initdb</application></link>. + They can also be enabled or disabled at a later time as an offline + operation. Data checksums are enabled or disabled at the full cluster + level, and cannot be specified individually for databases or tables. + </para> + + <para> + The current state of checksums in the cluster can be verified by viewing the + value of the read-only configuration variable <xref + linkend="guc-data-checksums" /> by issuing the command <command>SHOW + data_checksums</command>. + </para> + + <para> + When attempting to recover from page corruptions, it may be necessary to + bypass the checksum protection. To do this, temporarily set the + configuration parameter <xref linkend="guc-ignore-checksum-failure" />. + </para> + + <sect2 id="checksums-offline-enable-disable"> + <title>Off-line Enabling of Checksums</title> + + <para> + The <link linkend="app-pgchecksums"><application>pg_checksums</application></link> + application can be used to enable or disable data checksums, as well as + verify checksums, on an offline cluster. + </para> + + </sect2> + </sect1> + + <sect1 id="wal-intro"> + <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title> + + <indexterm zone="wal"> + <primary>WAL</primary> + </indexterm> + + <indexterm> + <primary>transaction log</primary> + <see>WAL</see> + </indexterm> + + <para> + <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>) + is a standard method for ensuring data integrity. A detailed + description can be found in most (if not all) books about + transaction processing. Briefly, <acronym>WAL</acronym>'s central + concept is that changes to data files (where tables and indexes + reside) must be written only after those changes have been logged, + that is, after log records describing the changes have been flushed + to permanent storage. If we follow this procedure, we do not need + to flush data pages to disk on every transaction commit, because we + know that in the event of a crash we will be able to recover the + database using the log: any changes that have not been applied to + the data pages can be redone from the log records. (This is + roll-forward recovery, also known as REDO.) + </para> + + <tip> + <para> + Because <acronym>WAL</acronym> restores database file + contents after a crash, journaled file systems are not necessary for + reliable storage of the data files or WAL files. In fact, journaling + overhead can reduce performance, especially if journaling + causes file system <emphasis>data</emphasis> to be flushed + to disk. Fortunately, data flushing during journaling can + often be disabled with a file system mount option, e.g., + <literal>data=writeback</literal> on a Linux ext3 file system. + Journaled file systems do improve boot speed after a crash. + </para> + </tip> + + + <para> + Using <acronym>WAL</acronym> results in a + significantly reduced number of disk writes, because only the log + file needs to be flushed to disk to guarantee that a transaction is + committed, rather than every data file changed by the transaction. + The log file is written sequentially, + and so the cost of syncing the log is much less than the cost of + flushing the data pages. This is especially true for servers + handling many small transactions touching different parts of the data + store. Furthermore, when the server is processing many small concurrent + transactions, one <function>fsync</function> of the log file may + suffice to commit many transactions. + </para> + + <para> + <acronym>WAL</acronym> also makes it possible to support on-line + backup and point-in-time recovery, as described in <xref + linkend="continuous-archiving"/>. By archiving the WAL data we can support + reverting to any time instant covered by the available WAL data: + we simply install a prior physical backup of the database, and + replay the WAL log just as far as the desired time. What's more, + the physical backup doesn't have to be an instantaneous snapshot + of the database state — if it is made over some period of time, + then replaying the WAL log for that period will fix any internal + inconsistencies. + </para> + </sect1> + + <sect1 id="wal-async-commit"> + <title>Asynchronous Commit</title> + + <indexterm> + <primary>synchronous commit</primary> + </indexterm> + + <indexterm> + <primary>asynchronous commit</primary> + </indexterm> + + <para> + <firstterm>Asynchronous commit</firstterm> is an option that allows transactions + to complete more quickly, at the cost that the most recent transactions may + be lost if the database should crash. In many applications this is an + acceptable trade-off. + </para> + + <para> + As described in the previous section, transaction commit is normally + <firstterm>synchronous</firstterm>: the server waits for the transaction's + <acronym>WAL</acronym> records to be flushed to permanent storage + before returning a success indication to the client. The client is + therefore guaranteed that a transaction reported to be committed will + be preserved, even in the event of a server crash immediately after. + However, for short transactions this delay is a major component of the + total transaction time. Selecting asynchronous commit mode means that + the server returns success as soon as the transaction is logically + completed, before the <acronym>WAL</acronym> records it generated have + actually made their way to disk. This can provide a significant boost + in throughput for small transactions. + </para> + + <para> + Asynchronous commit introduces the risk of data loss. There is a short + time window between the report of transaction completion to the client + and the time that the transaction is truly committed (that is, it is + guaranteed not to be lost if the server crashes). Thus asynchronous + commit should not be used if the client will take external actions + relying on the assumption that the transaction will be remembered. + As an example, a bank would certainly not use asynchronous commit for + a transaction recording an ATM's dispensing of cash. But in many + scenarios, such as event logging, there is no need for a strong + guarantee of this kind. + </para> + + <para> + The risk that is taken by using asynchronous commit is of data loss, + not data corruption. If the database should crash, it will recover + by replaying <acronym>WAL</acronym> up to the last record that was + flushed. The database will therefore be restored to a self-consistent + state, but any transactions that were not yet flushed to disk will + not be reflected in that state. The net effect is therefore loss of + the last few transactions. Because the transactions are replayed in + commit order, no inconsistency can be introduced — for example, + if transaction B made changes relying on the effects of a previous + transaction A, it is not possible for A's effects to be lost while B's + effects are preserved. + </para> + + <para> + The user can select the commit mode of each transaction, so that + it is possible to have both synchronous and asynchronous commit + transactions running concurrently. This allows flexible trade-offs + between performance and certainty of transaction durability. + The commit mode is controlled by the user-settable parameter + <xref linkend="guc-synchronous-commit"/>, which can be changed in any of + the ways that a configuration parameter can be set. The mode used for + any one transaction depends on the value of + <varname>synchronous_commit</varname> when transaction commit begins. + </para> + + <para> + Certain utility commands, for instance <command>DROP TABLE</command>, are + forced to commit synchronously regardless of the setting of + <varname>synchronous_commit</varname>. This is to ensure consistency + between the server's file system and the logical state of the database. + The commands supporting two-phase commit, such as <command>PREPARE + TRANSACTION</command>, are also always synchronous. + </para> + + <para> + If the database crashes during the risk window between an + asynchronous commit and the writing of the transaction's + <acronym>WAL</acronym> records, + then changes made during that transaction <emphasis>will</emphasis> be lost. + The duration of the + risk window is limited because a background process (the <quote>WAL + writer</quote>) flushes unwritten <acronym>WAL</acronym> records to disk + every <xref linkend="guc-wal-writer-delay"/> milliseconds. + The actual maximum duration of the risk window is three times + <varname>wal_writer_delay</varname> because the WAL writer is + designed to favor writing whole pages at a time during busy periods. + </para> + + <caution> + <para> + An immediate-mode shutdown is equivalent to a server crash, and will + therefore cause loss of any unflushed asynchronous commits. + </para> + </caution> + + <para> + Asynchronous commit provides behavior different from setting + <xref linkend="guc-fsync"/> = off. + <varname>fsync</varname> is a server-wide + setting that will alter the behavior of all transactions. It disables + all logic within <productname>PostgreSQL</productname> that attempts to synchronize + writes to different portions of the database, and therefore a system + crash (that is, a hardware or operating system crash, not a failure of + <productname>PostgreSQL</productname> itself) could result in arbitrarily bad + corruption of the database state. In many scenarios, asynchronous + commit provides most of the performance improvement that could be + obtained by turning off <varname>fsync</varname>, but without the risk + of data corruption. + </para> + + <para> + <xref linkend="guc-commit-delay"/> also sounds very similar to + asynchronous commit, but it is actually a synchronous commit method + (in fact, <varname>commit_delay</varname> is ignored during an + asynchronous commit). <varname>commit_delay</varname> causes a delay + just before a transaction flushes <acronym>WAL</acronym> to disk, in + the hope that a single flush executed by one such transaction can also + serve other transactions committing at about the same time. The + setting can be thought of as a way of increasing the time window in + which transactions can join a group about to participate in a single + flush, to amortize the cost of the flush among multiple transactions. + </para> + + </sect1> + + <sect1 id="wal-configuration"> + <title><acronym>WAL</acronym> Configuration</title> + + <para> + There are several <acronym>WAL</acronym>-related configuration parameters that + affect database performance. This section explains their use. + Consult <xref linkend="runtime-config"/> for general information about + setting server configuration parameters. + </para> + + <para> + <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</primary></indexterm> + are points in the sequence of transactions at which it is guaranteed + that the heap and index data files have been updated with all + information written before that checkpoint. At checkpoint time, all + dirty data pages are flushed to disk and a special checkpoint record is + written to the log file. (The change records were previously flushed + to the <acronym>WAL</acronym> files.) + In the event of a crash, the crash recovery procedure looks at the latest + checkpoint record to determine the point in the log (known as the redo + record) from which it should start the REDO operation. Any changes made to + data files before that point are guaranteed to be already on disk. + Hence, after a checkpoint, log segments preceding the one containing + the redo record are no longer needed and can be recycled or removed. (When + <acronym>WAL</acronym> archiving is being done, the log segments must be + archived before being recycled or removed.) + </para> + + <para> + The checkpoint requirement of flushing all dirty data pages to disk + can cause a significant I/O load. For this reason, checkpoint + activity is throttled so that I/O begins at checkpoint start and completes + before the next checkpoint is due to start; this minimizes performance + degradation during checkpoints. + </para> + + <para> + The server's checkpointer process automatically performs + a checkpoint every so often. A checkpoint is begun every <xref + linkend="guc-checkpoint-timeout"/> seconds, or if + <xref linkend="guc-max-wal-size"/> is about to be exceeded, + whichever comes first. + The default settings are 5 minutes and 1 GB, respectively. + If no WAL has been written since the previous checkpoint, new checkpoints + will be skipped even if <varname>checkpoint_timeout</varname> has passed. + (If WAL archiving is being used and you want to put a lower limit on how + often files are archived in order to bound potential data loss, you should + adjust the <xref linkend="guc-archive-timeout"/> parameter rather than the + checkpoint parameters.) + It is also possible to force a checkpoint by using the SQL + command <command>CHECKPOINT</command>. + </para> + + <para> + Reducing <varname>checkpoint_timeout</varname> and/or + <varname>max_wal_size</varname> causes checkpoints to occur + more often. This allows faster after-crash recovery, since less work + will need to be redone. However, one must balance this against the + increased cost of flushing dirty data pages more often. If + <xref linkend="guc-full-page-writes"/> is set (as is the default), there is + another factor to consider. To ensure data page consistency, + the first modification of a data page after each checkpoint results in + logging the entire page content. In that case, + a smaller checkpoint interval increases the volume of output to the WAL log, + partially negating the goal of using a smaller interval, + and in any case causing more disk I/O. + </para> + + <para> + Checkpoints are fairly expensive, first because they require writing + out all currently dirty buffers, and second because they result in + extra subsequent WAL traffic as discussed above. It is therefore + wise to set the checkpointing parameters high enough so that checkpoints + don't happen too often. As a simple sanity check on your checkpointing + parameters, you can set the <xref linkend="guc-checkpoint-warning"/> + parameter. If checkpoints happen closer together than + <varname>checkpoint_warning</varname> seconds, + a message will be output to the server log recommending increasing + <varname>max_wal_size</varname>. Occasional appearance of such + a message is not cause for alarm, but if it appears often then the + checkpoint control parameters should be increased. Bulk operations such + as large <command>COPY</command> transfers might cause a number of such warnings + to appear if you have not set <varname>max_wal_size</varname> high + enough. + </para> + + <para> + To avoid flooding the I/O system with a burst of page writes, + writing dirty buffers during a checkpoint is spread over a period of time. + That period is controlled by + <xref linkend="guc-checkpoint-completion-target"/>, which is + given as a fraction of the checkpoint interval (configured by using + <varname>checkpoint_timeout</varname>). + The I/O rate is adjusted so that the checkpoint finishes when the + given fraction of + <varname>checkpoint_timeout</varname> seconds have elapsed, or before + <varname>max_wal_size</varname> is exceeded, whichever is sooner. + With the default value of 0.9, + <productname>PostgreSQL</productname> can be expected to complete each checkpoint + a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's + duration). This spreads out the I/O as much as possible so that the checkpoint + I/O load is consistent throughout the checkpoint interval. The disadvantage of + this is that prolonging checkpoints affects recovery time, because more WAL + segments will need to be kept around for possible use in recovery. A user + concerned about the amount of time required to recover might wish to reduce + <varname>checkpoint_timeout</varname> so that checkpoints occur more frequently + but still spread the I/O across the checkpoint interval. Alternatively, + <varname>checkpoint_completion_target</varname> could be reduced, but this would + result in times of more intense I/O (during the checkpoint) and times of less I/O + (after the checkpoint completed but before the next scheduled checkpoint) and + therefore is not recommended. + Although <varname>checkpoint_completion_target</varname> could be set as high as + 1.0, it is typically recommended to set it to no higher than 0.9 (the default) + since checkpoints include some other activities besides writing dirty buffers. + A setting of 1.0 is quite likely to result in checkpoints not being + completed on time, which would result in performance loss due to + unexpected variation in the number of WAL segments needed. + </para> + + <para> + On Linux and POSIX platforms <xref linkend="guc-checkpoint-flush-after"/> + allows to force the OS that pages written by the checkpoint should be + flushed to disk after a configurable number of bytes. Otherwise, these + pages may be kept in the OS's page cache, inducing a stall when + <literal>fsync</literal> is issued at the end of a checkpoint. This setting will + often help to reduce transaction latency, but it also can have an adverse + effect on performance; particularly for workloads that are bigger than + <xref linkend="guc-shared-buffers"/>, but smaller than the OS's page cache. + </para> + + <para> + The number of WAL segment files in <filename>pg_wal</filename> directory depends on + <varname>min_wal_size</varname>, <varname>max_wal_size</varname> and + the amount of WAL generated in previous checkpoint cycles. When old log + segment files are no longer needed, they are removed or recycled (that is, + renamed to become future segments in the numbered sequence). If, due to a + short-term peak of log output rate, <varname>max_wal_size</varname> is + exceeded, the unneeded segment files will be removed until the system + gets back under this limit. Below that limit, the system recycles enough + WAL files to cover the estimated need until the next checkpoint, and + removes the rest. The estimate is based on a moving average of the number + of WAL files used in previous checkpoint cycles. The moving average + is increased immediately if the actual usage exceeds the estimate, so it + accommodates peak usage rather than average usage to some extent. + <varname>min_wal_size</varname> puts a minimum on the amount of WAL files + recycled for future usage; that much WAL is always recycled for future use, + even if the system is idle and the WAL usage estimate suggests that little + WAL is needed. + </para> + + <para> + Independently of <varname>max_wal_size</varname>, + the most recent <xref linkend="guc-wal-keep-size"/> megabytes of + WAL files plus one additional WAL file are + kept at all times. Also, if WAL archiving is used, old segments cannot be + removed or recycled until they are archived. If WAL archiving cannot keep up + with the pace that WAL is generated, or if <varname>archive_command</varname> + or <varname>archive_library</varname> + fails repeatedly, old WAL files will accumulate in <filename>pg_wal</filename> + until the situation is resolved. A slow or failed standby server that + uses a replication slot will have the same effect (see + <xref linkend="streaming-replication-slots"/>). + </para> + + <para> + In archive recovery or standby mode, the server periodically performs + <firstterm>restartpoints</firstterm>,<indexterm><primary>restartpoint</primary></indexterm> + which are similar to checkpoints in normal operation: the server forces + all its state to disk, updates the <filename>pg_control</filename> file to + indicate that the already-processed WAL data need not be scanned again, + and then recycles any old log segment files in the <filename>pg_wal</filename> + directory. + Restartpoints can't be performed more frequently than checkpoints on the + primary because restartpoints can only be performed at checkpoint records. + A restartpoint is triggered when a checkpoint record is reached if at + least <varname>checkpoint_timeout</varname> seconds have passed since the last + restartpoint, or if WAL size is about to exceed + <varname>max_wal_size</varname>. However, because of limitations on when a + restartpoint can be performed, <varname>max_wal_size</varname> is often exceeded + during recovery, by up to one checkpoint cycle's worth of WAL. + (<varname>max_wal_size</varname> is never a hard limit anyway, so you should + always leave plenty of headroom to avoid running out of disk space.) + </para> + + <para> + There are two commonly used internal <acronym>WAL</acronym> functions: + <function>XLogInsertRecord</function> and <function>XLogFlush</function>. + <function>XLogInsertRecord</function> is used to place a new record into + the <acronym>WAL</acronym> buffers in shared memory. If there is no + space for the new record, <function>XLogInsertRecord</function> will have + to write (move to kernel cache) a few filled <acronym>WAL</acronym> + buffers. This is undesirable because <function>XLogInsertRecord</function> + is used on every database low level modification (for example, row + insertion) at a time when an exclusive lock is held on affected + data pages, so the operation needs to be as fast as possible. What + is worse, writing <acronym>WAL</acronym> buffers might also force the + creation of a new log segment, which takes even more + time. Normally, <acronym>WAL</acronym> buffers should be written + and flushed by an <function>XLogFlush</function> request, which is + made, for the most part, at transaction commit time to ensure that + transaction records are flushed to permanent storage. On systems + with high log output, <function>XLogFlush</function> requests might + not occur often enough to prevent <function>XLogInsertRecord</function> + from having to do writes. On such systems + one should increase the number of <acronym>WAL</acronym> buffers by + modifying the <xref linkend="guc-wal-buffers"/> parameter. When + <xref linkend="guc-full-page-writes"/> is set and the system is very busy, + setting <varname>wal_buffers</varname> higher will help smooth response times + during the period immediately following each checkpoint. + </para> + + <para> + The <xref linkend="guc-commit-delay"/> parameter defines for how many + microseconds a group commit leader process will sleep after acquiring a + lock within <function>XLogFlush</function>, while group commit + followers queue up behind the leader. This delay allows other server + processes to add their commit records to the WAL buffers so that all of + them will be flushed by the leader's eventual sync operation. No sleep + will occur if <xref linkend="guc-fsync"/> is not enabled, or if fewer + than <xref linkend="guc-commit-siblings"/> other sessions are currently + in active transactions; this avoids sleeping when it's unlikely that + any other session will commit soon. Note that on some platforms, the + resolution of a sleep request is ten milliseconds, so that any nonzero + <varname>commit_delay</varname> setting between 1 and 10000 + microseconds would have the same effect. Note also that on some + platforms, sleep operations may take slightly longer than requested by + the parameter. + </para> + + <para> + Since the purpose of <varname>commit_delay</varname> is to allow the + cost of each flush operation to be amortized across concurrently + committing transactions (potentially at the expense of transaction + latency), it is necessary to quantify that cost before the setting can + be chosen intelligently. The higher that cost is, the more effective + <varname>commit_delay</varname> is expected to be in increasing + transaction throughput, up to a point. The <xref + linkend="pgtestfsync"/> program can be used to measure the average time + in microseconds that a single WAL flush operation takes. A value of + half of the average time the program reports it takes to flush after a + single 8kB write operation is often the most effective setting for + <varname>commit_delay</varname>, so this value is recommended as the + starting point to use when optimizing for a particular workload. While + tuning <varname>commit_delay</varname> is particularly useful when the + WAL log is stored on high-latency rotating disks, benefits can be + significant even on storage media with very fast sync times, such as + solid-state drives or RAID arrays with a battery-backed write cache; + but this should definitely be tested against a representative workload. + Higher values of <varname>commit_siblings</varname> should be used in + such cases, whereas smaller <varname>commit_siblings</varname> values + are often helpful on higher latency media. Note that it is quite + possible that a setting of <varname>commit_delay</varname> that is too + high can increase transaction latency by so much that total transaction + throughput suffers. + </para> + + <para> + When <varname>commit_delay</varname> is set to zero (the default), it + is still possible for a form of group commit to occur, but each group + will consist only of sessions that reach the point where they need to + flush their commit records during the window in which the previous + flush operation (if any) is occurring. At higher client counts a + <quote>gangway effect</quote> tends to occur, so that the effects of group + commit become significant even when <varname>commit_delay</varname> is + zero, and thus explicitly setting <varname>commit_delay</varname> tends + to help less. Setting <varname>commit_delay</varname> can only help + when (1) there are some concurrently committing transactions, and (2) + throughput is limited to some degree by commit rate; but with high + rotational latency this setting can be effective in increasing + transaction throughput with as few as two clients (that is, a single + committing client with one sibling transaction). + </para> + + <para> + The <xref linkend="guc-wal-sync-method"/> parameter determines how + <productname>PostgreSQL</productname> will ask the kernel to force + <acronym>WAL</acronym> updates out to disk. + All the options should be the same in terms of reliability, with + the exception of <literal>fsync_writethrough</literal>, which can sometimes + force a flush of the disk cache even when other options do not do so. + However, it's quite platform-specific which one will be the fastest. + You can test the speeds of different options using the <xref + linkend="pgtestfsync"/> program. + Note that this parameter is irrelevant if <varname>fsync</varname> + has been turned off. + </para> + + <para> + Enabling the <xref linkend="guc-wal-debug"/> configuration parameter + (provided that <productname>PostgreSQL</productname> has been + compiled with support for it) will result in each + <function>XLogInsertRecord</function> and <function>XLogFlush</function> + <acronym>WAL</acronym> call being logged to the server log. This + option might be replaced by a more general mechanism in the future. + </para> + + <para> + There are two internal functions to write WAL data to disk: + <function>XLogWrite</function> and <function>issue_xlog_fsync</function>. + When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total + amounts of time <function>XLogWrite</function> writes and + <function>issue_xlog_fsync</function> syncs WAL data to disk are counted as + <literal>wal_write_time</literal> and <literal>wal_sync_time</literal> in + <xref linkend="pg-stat-wal-view"/>, respectively. + <function>XLogWrite</function> is normally called by + <function>XLogInsertRecord</function> (when there is no space for the new + record in WAL buffers), <function>XLogFlush</function> and the WAL writer, + to write WAL buffers to disk and call <function>issue_xlog_fsync</function>. + <function>issue_xlog_fsync</function> is normally called by + <function>XLogWrite</function> to sync WAL files to disk. + If <varname>wal_sync_method</varname> is either + <literal>open_datasync</literal> or <literal>open_sync</literal>, + a write operation in <function>XLogWrite</function> guarantees to sync written + WAL data to disk and <function>issue_xlog_fsync</function> does nothing. + If <varname>wal_sync_method</varname> is either <literal>fdatasync</literal>, + <literal>fsync</literal>, or <literal>fsync_writethrough</literal>, + the write operation moves WAL buffers to kernel cache and + <function>issue_xlog_fsync</function> syncs them to disk. Regardless + of the setting of <varname>track_wal_io_timing</varname>, the number + of times <function>XLogWrite</function> writes and + <function>issue_xlog_fsync</function> syncs WAL data to disk are also + counted as <literal>wal_write</literal> and <literal>wal_sync</literal> + in <structname>pg_stat_wal</structname>, respectively. + </para> + + <para> + The <xref linkend="guc-recovery-prefetch"/> parameter can be used to reduce + I/O wait times during recovery by instructing the kernel to initiate reads + of disk blocks that will soon be needed but are not currently in + <productname>PostgreSQL</productname>'s buffer pool. + The <xref linkend="guc-maintenance-io-concurrency"/> and + <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching + concurrency and distance, respectively. By default, it is set to + <literal>try</literal>, which enables the feature on systems where + <function>posix_fadvise</function> is available. + </para> + </sect1> + + <sect1 id="wal-internals"> + <title>WAL Internals</title> + + <indexterm zone="wal-internals"> + <primary>LSN</primary> + </indexterm> + + <para> + <acronym>WAL</acronym> is automatically enabled; no action is + required from the administrator except ensuring that the + disk-space requirements for the <acronym>WAL</acronym> logs are met, + and that any necessary tuning is done (see <xref + linkend="wal-configuration"/>). + </para> + + <para> + <acronym>WAL</acronym> records are appended to the <acronym>WAL</acronym> + logs as each new record is written. The insert position is described by + a Log Sequence Number (<acronym>LSN</acronym>) that is a byte offset into + the logs, increasing monotonically with each new record. + <acronym>LSN</acronym> values are returned as the datatype + <link linkend="datatype-pg-lsn"><type>pg_lsn</type></link>. Values can be + compared to calculate the volume of <acronym>WAL</acronym> data that + separates them, so they are used to measure the progress of replication + and recovery. + </para> + + <para> + <acronym>WAL</acronym> logs are stored in the directory + <filename>pg_wal</filename> under the data directory, as a set of + segment files, normally each 16 MB in size (but the size can be changed + by altering the <option>--wal-segsize</option> <application>initdb</application> option). Each segment is + divided into pages, normally 8 kB each (this size can be changed via the + <option>--with-wal-blocksize</option> configure option). The log record headers + are described in <filename>access/xlogrecord.h</filename>; the record + content is dependent on the type of event that is being logged. Segment + files are given ever-increasing numbers as names, starting at + <filename>000000010000000000000001</filename>. The numbers do not wrap, + but it will take a very, very long time to exhaust the + available stock of numbers. + </para> + + <para> + It is advantageous if the log is located on a different disk from the + main database files. This can be achieved by moving the + <filename>pg_wal</filename> directory to another location (while the server + is shut down, of course) and creating a symbolic link from the + original location in the main data directory to the new location. + </para> + + <para> + The aim of <acronym>WAL</acronym> is to ensure that the log is + written before database records are altered, but this can be subverted by + disk drives<indexterm><primary>disk drive</primary></indexterm> that falsely report a + successful write to the kernel, + when in fact they have only cached the data and not yet stored it + on the disk. A power failure in such a situation might lead to + irrecoverable data corruption. Administrators should try to ensure + that disks holding <productname>PostgreSQL</productname>'s + <acronym>WAL</acronym> log files do not make such false reports. + (See <xref linkend="wal-reliability"/>.) + </para> + + <para> + After a checkpoint has been made and the log flushed, the + checkpoint's position is saved in the file + <filename>pg_control</filename>. Therefore, at the start of recovery, + the server first reads <filename>pg_control</filename> and + then the checkpoint record; then it performs the REDO operation by + scanning forward from the log location indicated in the checkpoint + record. Because the entire content of data pages is saved in the + log on the first page modification after a checkpoint (assuming + <xref linkend="guc-full-page-writes"/> is not disabled), all pages + changed since the checkpoint will be restored to a consistent + state. + </para> + + <para> + To deal with the case where <filename>pg_control</filename> is + corrupt, we should support the possibility of scanning existing log + segments in reverse order — newest to oldest — in order to find the + latest checkpoint. This has not been implemented yet. + <filename>pg_control</filename> is small enough (less than one disk page) + that it is not subject to partial-write problems, and as of this writing + there have been no reports of database failures due solely to the inability + to read <filename>pg_control</filename> itself. So while it is + theoretically a weak spot, <filename>pg_control</filename> does not + seem to be a problem in practice. + </para> + </sect1> +</chapter> |