summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/html/wal-reliability.html
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src/sgml/html/wal-reliability.html')
-rw-r--r--doc/src/sgml/html/wal-reliability.html161
1 files changed, 161 insertions, 0 deletions
diff --git a/doc/src/sgml/html/wal-reliability.html b/doc/src/sgml/html/wal-reliability.html
new file mode 100644
index 0000000..e615912
--- /dev/null
+++ b/doc/src/sgml/html/wal-reliability.html
@@ -0,0 +1,161 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>29.1. Reliability</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="wal.html" title="Chapter 29. Reliability and the Write-Ahead Log" /><link rel="next" href="wal-intro.html" title="29.2. Write-Ahead Logging (WAL)" /></head><body id="docContent" class="container-fluid col-10"><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">29.1. Reliability</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="wal.html" title="Chapter 29. Reliability and the Write-Ahead Log">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="wal.html" title="Chapter 29. Reliability and the Write-Ahead Log">Up</a></td><th width="60%" align="center">Chapter 29. Reliability and the Write-Ahead Log</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 13.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="wal-intro.html" title="29.2. Write-Ahead Logging (WAL)">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="WAL-RELIABILITY"><div class="titlepage"><div><div><h2 class="title" style="clear: both">29.1. Reliability</h2></div></div></div><p>
+ Reliability is an important property of any serious database
+ system, and <span class="productname">PostgreSQL</span> does everything possible to
+ guarantee reliable operation. One aspect of reliable operation is
+ that all data recorded by a committed transaction should be stored
+ in a nonvolatile area that is safe from power loss, operating
+ system failure, and hardware failure (except failure of the
+ nonvolatile area itself, of course). Successfully writing the data
+ to the computer's permanent storage (disk drive or equivalent)
+ ordinarily meets this requirement. In fact, even if a computer is
+ fatally damaged, if the disk drives survive they can be moved to
+ another computer with similar hardware and all committed
+ transactions will remain intact.
+ </p><p>
+ While forcing data to the disk platters periodically might seem like
+ a simple operation, it is not. Because disk drives are dramatically
+ slower than main memory and CPUs, several layers of caching exist
+ between the computer's main memory and the disk platters.
+ First, there is the operating system's buffer cache, which caches
+ frequently requested disk blocks and combines disk writes. Fortunately,
+ all operating systems give applications a way to force writes from
+ the buffer cache to disk, and <span class="productname">PostgreSQL</span> uses those
+ features. (See the <a class="xref" href="runtime-config-wal.html#GUC-WAL-SYNC-METHOD">wal_sync_method</a> parameter
+ to adjust how this is done.)
+ </p><p>
+ Next, there might be a cache in the disk drive controller; this is
+ particularly common on <acronym class="acronym">RAID</acronym> controller cards. Some of
+ these caches are <em class="firstterm">write-through</em>, meaning writes are sent
+ to the drive as soon as they arrive. Others are
+ <em class="firstterm">write-back</em>, meaning data is sent to the drive at
+ some later time. Such caches can be a reliability hazard because the
+ memory in the disk controller cache is volatile, and will lose its
+ contents in a power failure. Better controller cards have
+ <em class="firstterm">battery-backup units</em> (<acronym class="acronym">BBU</acronym>s), meaning
+ the card has a battery that
+ maintains power to the cache in case of system power loss. After power
+ is restored the data will be written to the disk drives.
+ </p><p>
+ And finally, most disk drives have caches. Some are write-through
+ while some are write-back, and the same concerns about data loss
+ exist for write-back drive caches as for disk controller
+ caches. Consumer-grade IDE and SATA drives are particularly likely
+ to have write-back caches that will not survive a power failure. Many
+ solid-state drives (SSD) also have volatile write-back caches.
+ </p><p>
+ These caches can typically be disabled; however, the method for doing
+ this varies by operating system and drive type:
+ </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
+ On <span class="productname">Linux</span>, IDE and SATA drives can be queried using
+ <code class="command">hdparm -I</code>; write caching is enabled if there is
+ a <code class="literal">*</code> next to <code class="literal">Write cache</code>. <code class="command">hdparm -W 0</code>
+ can be used to turn off write caching. SCSI drives can be queried
+ using <a class="ulink" href="http://sg.danny.cz/sg/sdparm.html" target="_top"><span class="application">sdparm</span></a>.
+ Use <code class="command">sdparm --get=WCE</code> to check
+ whether the write cache is enabled and <code class="command">sdparm --clear=WCE</code>
+ to disable it.
+ </p></li><li class="listitem"><p>
+ On <span class="productname">FreeBSD</span>, IDE drives can be queried using
+ <code class="command">atacontrol</code> and write caching turned off using
+ <code class="literal">hw.ata.wc=0</code> in <code class="filename">/boot/loader.conf</code>;
+ SCSI drives can be queried using <code class="command">camcontrol identify</code>,
+ and the write cache both queried and changed using
+ <code class="command">sdparm</code> when available.
+ </p></li><li class="listitem"><p>
+ On <span class="productname">Solaris</span>, the disk write cache is controlled by
+ <code class="command">format -e</code>.
+ (The Solaris <acronym class="acronym">ZFS</acronym> file system is safe with disk write-cache
+ enabled because it issues its own disk cache flush commands.)
+ </p></li><li class="listitem"><p>
+ On <span class="productname">Windows</span>, if <code class="varname">wal_sync_method</code> is
+ <code class="literal">open_datasync</code> (the default), write caching can be disabled
+ by unchecking <code class="literal">My Computer\Open\<em class="replaceable"><code>disk drive</code></em>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</code>.
+ Alternatively, set <code class="varname">wal_sync_method</code> to
+ <code class="literal">fsync</code> or <code class="literal">fsync_writethrough</code>, which prevent
+ write caching.
+ </p></li><li class="listitem"><p>
+ On <span class="productname">macOS</span>, write caching can be prevented by
+ setting <code class="varname">wal_sync_method</code> to <code class="literal">fsync_writethrough</code>.
+ </p></li></ul></div><p>
+ Recent SATA drives (those following <acronym class="acronym">ATAPI-6</acronym> or later)
+ offer a drive cache flush command (<code class="command">FLUSH CACHE EXT</code>),
+ while SCSI drives have long supported a similar command
+ <code class="command">SYNCHRONIZE CACHE</code>. These commands are not directly
+ accessible to <span class="productname">PostgreSQL</span>, but some file systems
+ (e.g., <acronym class="acronym">ZFS</acronym>, <acronym class="acronym">ext4</acronym>) can use them to flush
+ data to the platters on write-back-enabled drives. Unfortunately, such
+ file systems behave suboptimally when combined with battery-backup unit
+ (<acronym class="acronym">BBU</acronym>) disk controllers. In such setups, the synchronize
+ command forces all data from the controller cache to the disks,
+ eliminating much of the benefit of the BBU. You can run the
+ <a class="xref" href="pgtestfsync.html" title="pg_test_fsync"><span class="refentrytitle"><span class="application">pg_test_fsync</span></span></a> program to see
+ if you are affected. If you are affected, the performance benefits
+ of the BBU can be regained by turning off write barriers in
+ the file system or reconfiguring the disk controller, if that is
+ an option. If write barriers are turned off, make sure the battery
+ remains functional; a faulty battery can potentially lead to data loss.
+ Hopefully file system and disk controller designers will eventually
+ address this suboptimal behavior.
+ </p><p>
+ When the operating system sends a write request to the storage hardware,
+ there is little it can do to make sure the data has arrived at a truly
+ non-volatile storage area. Rather, it is the
+ administrator's responsibility to make certain that all storage components
+ ensure integrity for both data and file-system metadata.
+ Avoid disk controllers that have non-battery-backed write caches.
+ At the drive level, disable write-back caching if the
+ drive cannot guarantee the data will be written before shutdown.
+ If you use SSDs, be aware that many of these do not honor cache flush
+ commands by default.
+ You can test for reliable I/O subsystem behavior using <a class="ulink" href="https://brad.livejournal.com/2116715.html" target="_top"><code class="filename">diskchecker.pl</code></a>.
+ </p><p>
+ Another risk of data loss is posed by the disk platter write
+ operations themselves. Disk platters are divided into sectors,
+ commonly 512 bytes each. Every physical read or write operation
+ processes a whole sector.
+ When a write request arrives at the drive, it might be for some multiple
+ of 512 bytes (<span class="productname">PostgreSQL</span> typically writes 8192 bytes, or
+ 16 sectors, at a time), and the process of writing could fail due
+ to power loss at any time, meaning some of the 512-byte sectors were
+ written while others were not. To guard against such failures,
+ <span class="productname">PostgreSQL</span> periodically writes full page images to
+ permanent WAL storage <span class="emphasis"><em>before</em></span> modifying the actual page on
+ disk. By doing this, during crash recovery <span class="productname">PostgreSQL</span> can
+ restore partially-written pages from WAL. If you have file-system software
+ that prevents partial page writes (e.g., ZFS), you can turn off
+ this page imaging by turning off the <a class="xref" href="runtime-config-wal.html#GUC-FULL-PAGE-WRITES">full_page_writes</a> parameter. Battery-Backed Unit
+ (BBU) disk controllers do not prevent partial page writes unless
+ they guarantee that data is written to the BBU as full (8kB) pages.
+ </p><p>
+ <span class="productname">PostgreSQL</span> also protects against some kinds of data corruption
+ on storage devices that may occur because of hardware errors or media failure over time,
+ such as reading/writing garbage data.
+ </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
+ Each individual record in a WAL file is protected by a CRC-32 (32-bit) check
+ that allows us to tell if record contents are correct. The CRC value
+ is set when we write each WAL record and checked during crash recovery,
+ archive recovery and replication.
+ </p></li><li class="listitem"><p>
+ Data pages are not currently checksummed by default, though full page images
+ recorded in WAL records will be protected; see <a class="link" href="app-initdb.html#APP-INITDB-DATA-CHECKSUMS"><span class="application">initdb</span></a>
+ for details about enabling data page checksums.
+ </p></li><li class="listitem"><p>
+ Internal data structures such as <code class="filename">pg_xact</code>, <code class="filename">pg_subtrans</code>, <code class="filename">pg_multixact</code>,
+ <code class="filename">pg_serial</code>, <code class="filename">pg_notify</code>, <code class="filename">pg_stat</code>, <code class="filename">pg_snapshots</code> are not directly
+ checksummed, nor are pages protected by full page writes. However, where
+ such data structures are persistent, WAL records are written that allow
+ recent changes to be accurately rebuilt at crash recovery and those
+ WAL records are protected as discussed above.
+ </p></li><li class="listitem"><p>
+ Individual state files in <code class="filename">pg_twophase</code> are protected by CRC-32.
+ </p></li><li class="listitem"><p>
+ Temporary data files used in larger SQL queries for sorts,
+ materializations and intermediate results are not currently checksummed,
+ nor will WAL records be written for changes to those files.
+ </p></li></ul></div><p>
+ </p><p>
+ <span class="productname">PostgreSQL</span> does not protect against correctable memory errors
+ and it is assumed you will operate using RAM that uses industry standard
+ Error Correcting Codes (ECC) or better protection.
+ </p></div><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navfooter"><hr></hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="wal.html" title="Chapter 29. Reliability and the Write-Ahead Log">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="wal.html" title="Chapter 29. Reliability and the Write-Ahead Log">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="wal-intro.html" title="29.2. Write-Ahead Logging (WAL)">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 29. Reliability and the Write-Ahead Log </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 13.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 29.2. Write-Ahead Logging (<acronym xmlns="http://www.w3.org/1999/xhtml" class="acronym">WAL</acronym>)</td></tr></table></div></body></html> \ No newline at end of file