diff options
Diffstat (limited to 'doc/src/sgml/html/wal-reliability.html')
-rw-r--r-- | doc/src/sgml/html/wal-reliability.html | 161 |
1 files changed, 161 insertions, 0 deletions
diff --git a/doc/src/sgml/html/wal-reliability.html b/doc/src/sgml/html/wal-reliability.html new file mode 100644 index 0000000..8cac3cf --- /dev/null +++ b/doc/src/sgml/html/wal-reliability.html @@ -0,0 +1,161 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>30.1. Reliability</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="wal.html" title="Chapter 30. Reliability and the Write-Ahead Log" /><link rel="next" href="checksums.html" title="30.2. Data Checksums" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">30.1. Reliability</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="wal.html" title="Chapter 30. Reliability and the Write-Ahead Log">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="wal.html" title="Chapter 30. Reliability and the Write-Ahead Log">Up</a></td><th width="60%" align="center">Chapter 30. Reliability and the Write-Ahead Log</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="checksums.html" title="30.2. Data Checksums">Next</a></td></tr></table><hr /></div><div class="sect1" id="WAL-RELIABILITY"><div class="titlepage"><div><div><h2 class="title" style="clear: both">30.1. Reliability</h2></div></div></div><p> + Reliability is an important property of any serious database + system, and <span class="productname">PostgreSQL</span> does everything possible to + guarantee reliable operation. One aspect of reliable operation is + that all data recorded by a committed transaction should be stored + in a nonvolatile area that is safe from power loss, operating + system failure, and hardware failure (except failure of the + nonvolatile area itself, of course). Successfully writing the data + to the computer's permanent storage (disk drive or equivalent) + ordinarily meets this requirement. In fact, even if a computer is + fatally damaged, if the disk drives survive they can be moved to + another computer with similar hardware and all committed + transactions will remain intact. + </p><p> + While forcing data to the disk platters periodically might seem like + a simple operation, it is not. Because disk drives are dramatically + slower than main memory and CPUs, several layers of caching exist + between the computer's main memory and the disk platters. + First, there is the operating system's buffer cache, which caches + frequently requested disk blocks and combines disk writes. Fortunately, + all operating systems give applications a way to force writes from + the buffer cache to disk, and <span class="productname">PostgreSQL</span> uses those + features. (See the <a class="xref" href="runtime-config-wal.html#GUC-WAL-SYNC-METHOD">wal_sync_method</a> parameter + to adjust how this is done.) + </p><p> + Next, there might be a cache in the disk drive controller; this is + particularly common on <acronym class="acronym">RAID</acronym> controller cards. Some of + these caches are <em class="firstterm">write-through</em>, meaning writes are sent + to the drive as soon as they arrive. Others are + <em class="firstterm">write-back</em>, meaning data is sent to the drive at + some later time. Such caches can be a reliability hazard because the + memory in the disk controller cache is volatile, and will lose its + contents in a power failure. Better controller cards have + <em class="firstterm">battery-backup units</em> (<acronym class="acronym">BBU</acronym>s), meaning + the card has a battery that + maintains power to the cache in case of system power loss. After power + is restored the data will be written to the disk drives. + </p><p> + And finally, most disk drives have caches. Some are write-through + while some are write-back, and the same concerns about data loss + exist for write-back drive caches as for disk controller + caches. Consumer-grade IDE and SATA drives are particularly likely + to have write-back caches that will not survive a power failure. Many + solid-state drives (SSD) also have volatile write-back caches. + </p><p> + These caches can typically be disabled; however, the method for doing + this varies by operating system and drive type: + </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p> + On <span class="productname">Linux</span>, IDE and SATA drives can be queried using + <code class="command">hdparm -I</code>; write caching is enabled if there is + a <code class="literal">*</code> next to <code class="literal">Write cache</code>. <code class="command">hdparm -W 0</code> + can be used to turn off write caching. SCSI drives can be queried + using <a class="ulink" href="http://sg.danny.cz/sg/sdparm.html" target="_top"><span class="application">sdparm</span></a>. + Use <code class="command">sdparm --get=WCE</code> to check + whether the write cache is enabled and <code class="command">sdparm --clear=WCE</code> + to disable it. + </p></li><li class="listitem"><p> + On <span class="productname">FreeBSD</span>, IDE drives can be queried using + <code class="command">atacontrol</code> and write caching turned off using + <code class="literal">hw.ata.wc=0</code> in <code class="filename">/boot/loader.conf</code>; + SCSI drives can be queried using <code class="command">camcontrol identify</code>, + and the write cache both queried and changed using + <code class="command">sdparm</code> when available. + </p></li><li class="listitem"><p> + On <span class="productname">Solaris</span>, the disk write cache is controlled by + <code class="command">format -e</code>. + (The Solaris <acronym class="acronym">ZFS</acronym> file system is safe with disk write-cache + enabled because it issues its own disk cache flush commands.) + </p></li><li class="listitem"><p> + On <span class="productname">Windows</span>, if <code class="varname">wal_sync_method</code> is + <code class="literal">open_datasync</code> (the default), write caching can be disabled + by unchecking <code class="literal">My Computer\Open\<em class="replaceable"><code>disk drive</code></em>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</code>. + Alternatively, set <code class="varname">wal_sync_method</code> to + <code class="literal">fsync</code> or <code class="literal">fsync_writethrough</code>, which prevent + write caching. + </p></li><li class="listitem"><p> + On <span class="productname">macOS</span>, write caching can be prevented by + setting <code class="varname">wal_sync_method</code> to <code class="literal">fsync_writethrough</code>. + </p></li></ul></div><p> + Recent SATA drives (those following <acronym class="acronym">ATAPI-6</acronym> or later) + offer a drive cache flush command (<code class="command">FLUSH CACHE EXT</code>), + while SCSI drives have long supported a similar command + <code class="command">SYNCHRONIZE CACHE</code>. These commands are not directly + accessible to <span class="productname">PostgreSQL</span>, but some file systems + (e.g., <acronym class="acronym">ZFS</acronym>, <acronym class="acronym">ext4</acronym>) can use them to flush + data to the platters on write-back-enabled drives. Unfortunately, such + file systems behave suboptimally when combined with battery-backup unit + (<acronym class="acronym">BBU</acronym>) disk controllers. In such setups, the synchronize + command forces all data from the controller cache to the disks, + eliminating much of the benefit of the BBU. You can run the + <a class="xref" href="pgtestfsync.html" title="pg_test_fsync"><span class="refentrytitle"><span class="application">pg_test_fsync</span></span></a> program to see + if you are affected. If you are affected, the performance benefits + of the BBU can be regained by turning off write barriers in + the file system or reconfiguring the disk controller, if that is + an option. If write barriers are turned off, make sure the battery + remains functional; a faulty battery can potentially lead to data loss. + Hopefully file system and disk controller designers will eventually + address this suboptimal behavior. + </p><p> + When the operating system sends a write request to the storage hardware, + there is little it can do to make sure the data has arrived at a truly + non-volatile storage area. Rather, it is the + administrator's responsibility to make certain that all storage components + ensure integrity for both data and file-system metadata. + Avoid disk controllers that have non-battery-backed write caches. + At the drive level, disable write-back caching if the + drive cannot guarantee the data will be written before shutdown. + If you use SSDs, be aware that many of these do not honor cache flush + commands by default. + You can test for reliable I/O subsystem behavior using <a class="ulink" href="https://brad.livejournal.com/2116715.html" target="_top"><code class="filename">diskchecker.pl</code></a>. + </p><p> + Another risk of data loss is posed by the disk platter write + operations themselves. Disk platters are divided into sectors, + commonly 512 bytes each. Every physical read or write operation + processes a whole sector. + When a write request arrives at the drive, it might be for some multiple + of 512 bytes (<span class="productname">PostgreSQL</span> typically writes 8192 bytes, or + 16 sectors, at a time), and the process of writing could fail due + to power loss at any time, meaning some of the 512-byte sectors were + written while others were not. To guard against such failures, + <span class="productname">PostgreSQL</span> periodically writes full page images to + permanent WAL storage <span class="emphasis"><em>before</em></span> modifying the actual page on + disk. By doing this, during crash recovery <span class="productname">PostgreSQL</span> can + restore partially-written pages from WAL. If you have file-system software + that prevents partial page writes (e.g., ZFS), you can turn off + this page imaging by turning off the <a class="xref" href="runtime-config-wal.html#GUC-FULL-PAGE-WRITES">full_page_writes</a> parameter. Battery-Backed Unit + (BBU) disk controllers do not prevent partial page writes unless + they guarantee that data is written to the BBU as full (8kB) pages. + </p><p> + <span class="productname">PostgreSQL</span> also protects against some kinds of data corruption + on storage devices that may occur because of hardware errors or media failure over time, + such as reading/writing garbage data. + </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p> + Each individual record in a WAL file is protected by a CRC-32 (32-bit) check + that allows us to tell if record contents are correct. The CRC value + is set when we write each WAL record and checked during crash recovery, + archive recovery and replication. + </p></li><li class="listitem"><p> + Data pages are not currently checksummed by default, though full page images + recorded in WAL records will be protected; see <a class="link" href="app-initdb.html#APP-INITDB-DATA-CHECKSUMS"><span class="application">initdb</span></a> + for details about enabling data checksums. + </p></li><li class="listitem"><p> + Internal data structures such as <code class="filename">pg_xact</code>, <code class="filename">pg_subtrans</code>, <code class="filename">pg_multixact</code>, + <code class="filename">pg_serial</code>, <code class="filename">pg_notify</code>, <code class="filename">pg_stat</code>, <code class="filename">pg_snapshots</code> are not directly + checksummed, nor are pages protected by full page writes. However, where + such data structures are persistent, WAL records are written that allow + recent changes to be accurately rebuilt at crash recovery and those + WAL records are protected as discussed above. + </p></li><li class="listitem"><p> + Individual state files in <code class="filename">pg_twophase</code> are protected by CRC-32. + </p></li><li class="listitem"><p> + Temporary data files used in larger SQL queries for sorts, + materializations and intermediate results are not currently checksummed, + nor will WAL records be written for changes to those files. + </p></li></ul></div><p> + </p><p> + <span class="productname">PostgreSQL</span> does not protect against correctable memory errors + and it is assumed you will operate using RAM that uses industry standard + Error Correcting Codes (ECC) or better protection. + </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="wal.html" title="Chapter 30. Reliability and the Write-Ahead Log">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="wal.html" title="Chapter 30. Reliability and the Write-Ahead Log">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="checksums.html" title="30.2. Data Checksums">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 30. Reliability and the Write-Ahead Log </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 30.2. Data Checksums</td></tr></table></div></body></html>
\ No newline at end of file |