Diffstat (limited to 'doc/src/sgml/high-availability.sgml')
-rw-r--r--  doc/src/sgml/high-availability.sgml | 2493
1 file changed, 2493 insertions, 0 deletions
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml new file mode 100644 index 0000000..37c4439 --- /dev/null +++ b/doc/src/sgml/high-availability.sgml @@ -0,0 +1,2493 @@ +<!-- doc/src/sgml/high-availability.sgml --> + +<chapter id="high-availability"> + <title>High Availability, Load Balancing, and Replication</title> + + <indexterm><primary>high availability</primary></indexterm> + <indexterm><primary>failover</primary></indexterm> + <indexterm><primary>replication</primary></indexterm> + <indexterm><primary>load balancing</primary></indexterm> + <indexterm><primary>clustering</primary></indexterm> + <indexterm><primary>data partitioning</primary></indexterm> + + <para> + Database servers can work together to allow a second server to + take over quickly if the primary server fails (high + availability), or to allow several computers to serve the same + data (load balancing). Ideally, database servers could work + together seamlessly. Web servers serving static web pages can + be combined quite easily by merely load-balancing web requests + to multiple machines. In fact, read-only database servers can + be combined relatively easily too. Unfortunately, most database + servers have a read/write mix of requests, and read/write servers + are much harder to combine. This is because though read-only + data needs to be placed on each server only once, a write to any + server has to be propagated to all servers so that future read + requests to those servers return consistent results. + </para> + + <para> + This synchronization problem is the fundamental difficulty for + servers working together. Because there is no single solution + that eliminates the impact of the sync problem for all use cases, + there are multiple solutions. Each solution addresses this + problem in a different way, and minimizes its impact for a specific + workload. + </para> + + <para> + Some solutions deal with synchronization by allowing only one + server to modify the data. Servers that can modify data are + called read/write, <firstterm>master</firstterm> or <firstterm>primary</firstterm> servers. + Servers that track changes in the master are called <firstterm>standby</firstterm> + or <firstterm>secondary</firstterm> servers. A standby server that cannot be connected + to until it is promoted to a master server is called a <firstterm>warm + standby</firstterm> server, and one that can accept connections and serves read-only + queries is called a <firstterm>hot standby</firstterm> server. + </para> + + <para> + Some solutions are synchronous, + meaning that a data-modifying transaction is not considered + committed until all servers have committed the transaction. This + guarantees that a failover will not lose any data and that all + load-balanced servers will return consistent results no matter + which server is queried. In contrast, asynchronous solutions allow some + delay between the time of a commit and its propagation to the other servers, + opening the possibility that some transactions might be lost in + the switch to a backup server, and that load balanced servers + might return slightly stale results. Asynchronous communication + is used when synchronous would be too slow. + </para> + + <para> + Solutions can also be categorized by their granularity. Some solutions + can deal only with an entire database server, while others allow control + at the per-table or per-database level. + </para> + + <para> + Performance must be considered in any choice. 
There is usually a + trade-off between functionality and + performance. For example, a fully synchronous solution over a slow + network might cut performance by more than half, while an asynchronous + one might have a minimal performance impact. + </para> + + <para> + The remainder of this section outlines various failover, replication, + and load balancing solutions. + </para> + + <sect1 id="different-replication-solutions"> + <title>Comparison of Different Solutions</title> + + <variablelist> + + <varlistentry> + <term>Shared Disk Failover</term> + <listitem> + + <para> + Shared disk failover avoids synchronization overhead by having only one + copy of the database. It uses a single disk array that is shared by + multiple servers. If the main database server fails, the standby server + is able to mount and start the database as though it were recovering from + a database crash. This allows rapid failover with no data loss. + </para> + + <para> + Shared hardware functionality is common in network storage devices. + Using a network file system is also possible, though care must be + taken that the file system has full <acronym>POSIX</acronym> behavior (see <xref + linkend="creating-cluster-nfs"/>). One significant limitation of this + method is that if the shared disk array fails or becomes corrupt, the + primary and standby servers are both nonfunctional. Another issue is + that the standby server should never access the shared storage while + the primary server is running. + </para> + + </listitem> + </varlistentry> + + <varlistentry> + <term>File System (Block Device) Replication</term> + <listitem> + + <para> + A modified version of shared hardware functionality is file system + replication, where all changes to a file system are mirrored to a file + system residing on another computer. The only restriction is that + the mirroring must be done in a way that ensures the standby server + has a consistent copy of the file system — specifically, writes + to the standby must be done in the same order as those on the master. + <productname>DRBD</productname> is a popular file system replication solution + for Linux. + </para> + +<!-- +https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html + +Oracle RAC is a shared disk approach and just send cache invalidations +to other nodes but not actual data. As the disk is shared, data is +only committed once to disk and there is a distributed locking +protocol to make nodes agree on a serializable transactional order. +--> + + </listitem> + </varlistentry> + + <varlistentry> + <term>Write-Ahead Log Shipping</term> + <listitem> + + <para> + Warm and hot standby servers can be kept current by reading a + stream of write-ahead log (<acronym>WAL</acronym>) + records. If the main server fails, the standby contains + almost all of the data of the main server, and can be quickly + made the new master database server. This can be synchronous or + asynchronous and can only be done for the entire database server. + </para> + <para> + A standby server can be implemented using file-based log shipping + (<xref linkend="warm-standby"/>) or streaming replication (see + <xref linkend="streaming-replication"/>), or a combination of both. For + information on hot standby, see <xref linkend="hot-standby"/>. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>Logical Replication</term> + <listitem> + <para> + Logical replication allows a database server to send a stream of data + modifications to another server. 
<productname>PostgreSQL</productname> + logical replication constructs a stream of logical data modifications + from the WAL. Logical replication allows the data changes from + individual tables to be replicated. Logical replication doesn't require + a particular server to be designated as a master or a replica but allows + data to flow in multiple directions. For more information on logical + replication, see <xref linkend="logical-replication"/>. Through the + logical decoding interface (<xref linkend="logicaldecoding"/>), + third-party extensions can also provide similar functionality. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>Trigger-Based Master-Standby Replication</term> + <listitem> + + <para> + A master-standby replication setup sends all data modification + queries to the master server. The master server asynchronously + sends data changes to the standby server. The standby can answer + read-only queries while the master server is running. The + standby server is ideal for data warehouse queries. + </para> + + <para> + <productname>Slony-I</productname> is an example of this type of replication, with per-table + granularity, and support for multiple standby servers. Because it + updates the standby server asynchronously (in batches), there is + possible data loss during fail over. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>SQL-Based Replication Middleware</term> + <listitem> + + <para> + With SQL-based replication middleware, a program intercepts + every SQL query and sends it to one or all servers. Each server + operates independently. Read-write queries must be sent to all servers, + so that every server receives any changes. But read-only queries can be + sent to just one server, allowing the read workload to be distributed + among them. + </para> + + <para> + If queries are simply broadcast unmodified, functions like + <function>random()</function>, <function>CURRENT_TIMESTAMP</function>, and + sequences can have different values on different servers. + This is because each server operates independently, and because + SQL queries are broadcast (and not actual modified rows). If + this is unacceptable, either the middleware or the application + must query such values from a single server and then use those + values in write queries. Another option is to use this replication + option with a traditional master-standby setup, i.e., data modification + queries are sent only to the master and are propagated to the + standby servers via master-standby replication, not by the replication + middleware. Care must also be taken that all + transactions either commit or abort on all servers, perhaps + using two-phase commit (<xref linkend="sql-prepare-transaction"/> + and <xref linkend="sql-commit-prepared"/>). + <productname>Pgpool-II</productname> and <productname>Continuent Tungsten</productname> + are examples of this type of replication. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>Asynchronous Multimaster Replication</term> + <listitem> + + <para> + For servers that are not regularly connected or have slow + communication links, like laptops or + remote servers, keeping data consistent among servers is a + challenge. Using asynchronous multimaster replication, each + server works independently, and periodically communicates with + the other servers to identify conflicting transactions. The + conflicts can be resolved by users or conflict resolution rules. + Bucardo is an example of this type of replication. 
+ </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>Synchronous Multimaster Replication</term> + <listitem> + + <para> + In synchronous multimaster replication, each server can accept + write requests, and modified data is transmitted from the + original server to every other server before each transaction + commits. Heavy write activity can cause excessive locking and + commit delays, leading to poor performance. Read requests can + be sent to any server. Some implementations use shared disk + to reduce the communication overhead. Synchronous multimaster + replication is best for mostly read workloads, though its big + advantage is that any server can accept write requests — + there is no need to partition workloads between master and + standby servers, and because the data changes are sent from one + server to another, there is no problem with non-deterministic + functions like <function>random()</function>. + </para> + + <para> + <productname>PostgreSQL</productname> does not offer this type of replication, + though <productname>PostgreSQL</productname> two-phase commit (<xref + linkend="sql-prepare-transaction"/> and <xref + linkend="sql-commit-prepared"/>) + can be used to implement this in application code or middleware. + </para> + </listitem> + </varlistentry> + + </variablelist> + + <para> + <xref linkend="high-availability-matrix"/> summarizes + the capabilities of the various solutions listed above. + </para> + + <table id="high-availability-matrix"> + <title>High Availability, Load Balancing, and Replication Feature Matrix</title> + <tgroup cols="9"> + <colspec colname="col1" colwidth="1.1*"/> + <colspec colname="col2" colwidth="1*"/> + <colspec colname="col3" colwidth="1*"/> + <colspec colname="col4" colwidth="1*"/> + <colspec colname="col5" colwidth="1*"/> + <colspec colname="col6" colwidth="1*"/> + <colspec colname="col7" colwidth="1*"/> + <colspec colname="col8" colwidth="1*"/> + <colspec colname="col9" colwidth="1*"/> + <thead> + <row> + <entry>Feature</entry> + <entry>Shared Disk</entry> + <entry>File System Repl.</entry> + <entry>Write-Ahead Log Shipping</entry> + <entry>Logical Repl.</entry> + <entry>Trigger-Based Repl.</entry> + <entry>SQL Repl. Middle-ware</entry> + <entry>Async. MM Repl.</entry> + <entry>Sync. MM Repl.</entry> + </row> + </thead> + + <tbody> + + <row> + <entry>Popular examples</entry> + <entry align="center">NAS</entry> + <entry align="center">DRBD</entry> + <entry align="center">built-in streaming repl.</entry> + <entry align="center">built-in logical repl., pglogical</entry> + <entry align="center">Londiste, Slony</entry> + <entry align="center">pgpool-II</entry> + <entry align="center">Bucardo</entry> + <entry align="center"></entry> + </row> + + <row> + <entry>Comm. 
method</entry> + <entry align="center">shared disk</entry> + <entry align="center">disk blocks</entry> + <entry align="center">WAL</entry> + <entry align="center">logical decoding</entry> + <entry align="center">table rows</entry> + <entry align="center">SQL</entry> + <entry align="center">table rows</entry> + <entry align="center">table rows and row locks</entry> + </row> + + <row> + <entry>No special hardware required</entry> + <entry align="center"></entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + </row> + + <row> + <entry>Allows multiple master servers</entry> + <entry align="center"></entry> + <entry align="center"></entry> + <entry align="center"></entry> + <entry align="center">•</entry> + <entry align="center"></entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + </row> + + <row> + <entry>No master server overhead</entry> + <entry align="center">•</entry> + <entry align="center"></entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center"></entry> + <entry align="center">•</entry> + <entry align="center"></entry> + <entry align="center"></entry> + </row> + + <row> + <entry>No waiting for multiple servers</entry> + <entry align="center">•</entry> + <entry align="center"></entry> + <entry align="center">with sync off</entry> + <entry align="center">with sync off</entry> + <entry align="center">•</entry> + <entry align="center"></entry> + <entry align="center">•</entry> + <entry align="center"></entry> + </row> + + <row> + <entry>Master failure will never lose data</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">with sync on</entry> + <entry align="center">with sync on</entry> + <entry align="center"></entry> + <entry align="center">•</entry> + <entry align="center"></entry> + <entry align="center">•</entry> + </row> + + <row> + <entry>Replicas accept read-only queries</entry> + <entry align="center"></entry> + <entry align="center"></entry> + <entry align="center">with hot standby</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + </row> + + <row> + <entry>Per-table granularity</entry> + <entry align="center"></entry> + <entry align="center"></entry> + <entry align="center"></entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center"></entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + </row> + + <row> + <entry>No conflict resolution necessary</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center"></entry> + <entry align="center">•</entry> + <entry align="center">•</entry> + <entry align="center"></entry> + <entry align="center">•</entry> + </row> + + </tbody> + </tgroup> + </table> + + <para> + There are a few solutions that do not fit into the above categories: + </para> + + <variablelist> + + <varlistentry> + <term>Data Partitioning</term> + <listitem> + + <para> + Data partitioning splits tables into data sets. Each set can + be modified by only one server. For example, data can be + partitioned by offices, e.g., London and Paris, with a server + in each office. 
If queries combining London and Paris data + are necessary, an application can query both servers, or + master/standby replication can be used to keep a read-only copy + of the other office's data on each server. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>Multiple-Server Parallel Query Execution</term> + <listitem> + + <para> + Many of the above solutions allow multiple servers to handle multiple + queries, but none allow a single query to use multiple servers to + complete faster. This solution allows multiple servers to work + concurrently on a single query. It is usually accomplished by + splitting the data among servers and having each server execute its + part of the query and return results to a central server where they + are combined and returned to the user. This can be implemented using the + <productname>PL/Proxy</productname> tool set. + </para> + + </listitem> + </varlistentry> + + </variablelist> + + <para> + It should also be noted that because <productname>PostgreSQL</productname> + is open source and easily extended, a number of companies have + taken <productname>PostgreSQL</productname> and created commercial + closed-source solutions with unique failover, replication, and load + balancing capabilities. These are not discussed here. + </para> + + </sect1> + + + <sect1 id="warm-standby"> + <title>Log-Shipping Standby Servers</title> + + + <para> + Continuous archiving can be used to create a <firstterm>high + availability</firstterm> (HA) cluster configuration with one or more + <firstterm>standby servers</firstterm> ready to take over operations if the + primary server fails. This capability is widely referred to as + <firstterm>warm standby</firstterm> or <firstterm>log shipping</firstterm>. + </para> + + <para> + The primary and standby server work together to provide this capability, + though the servers are only loosely coupled. The primary server operates + in continuous archiving mode, while each standby server operates in + continuous recovery mode, reading the WAL files from the primary. No + changes to the database tables are required to enable this capability, + so it offers low administration overhead compared to some other + replication solutions. This configuration also has relatively low + performance impact on the primary server. + </para> + + <para> + Directly moving WAL records from one database server to another + is typically described as log shipping. <productname>PostgreSQL</productname> + implements file-based log shipping by transferring WAL records + one file (WAL segment) at a time. WAL files (16MB) can be + shipped easily and cheaply over any distance, whether it be to an + adjacent system, another system at the same site, or another system on + the far side of the globe. The bandwidth required for this technique + varies according to the transaction rate of the primary server. + Record-based log shipping is more granular and streams WAL changes + incrementally over a network connection (see <xref + linkend="streaming-replication"/>). + </para> + + <para> + It should be noted that log shipping is asynchronous, i.e., the WAL + records are shipped after transaction commit. As a result, there is a + window for data loss should the primary server suffer a catastrophic + failure; transactions not yet shipped will be lost. The size of the + data loss window in file-based log shipping can be limited by use of the + <varname>archive_timeout</varname> parameter, which can be set as low + as a few seconds. 
However such a low setting will + substantially increase the bandwidth required for file shipping. + Streaming replication (see <xref linkend="streaming-replication"/>) + allows a much smaller window of data loss. + </para> + + <para> + Recovery performance is sufficiently good that the standby will + typically be only moments away from full + availability once it has been activated. As a result, this is called + a warm standby configuration which offers high + availability. Restoring a server from an archived base backup and + rollforward will take considerably longer, so that technique only + offers a solution for disaster recovery, not high availability. + A standby server can also be used for read-only queries, in which case + it is called a Hot Standby server. See <xref linkend="hot-standby"/> for + more information. + </para> + + <indexterm zone="high-availability"> + <primary>warm standby</primary> + </indexterm> + + <indexterm zone="high-availability"> + <primary>PITR standby</primary> + </indexterm> + + <indexterm zone="high-availability"> + <primary>standby server</primary> + </indexterm> + + <indexterm zone="high-availability"> + <primary>log shipping</primary> + </indexterm> + + <indexterm zone="high-availability"> + <primary>witness server</primary> + </indexterm> + + <indexterm zone="high-availability"> + <primary>STONITH</primary> + </indexterm> + + <sect2 id="standby-planning"> + <title>Planning</title> + + <para> + It is usually wise to create the primary and standby servers + so that they are as similar as possible, at least from the + perspective of the database server. In particular, the path names + associated with tablespaces will be passed across unmodified, so both + primary and standby servers must have the same mount paths for + tablespaces if that feature is used. Keep in mind that if + <xref linkend="sql-createtablespace"/> + is executed on the primary, any new mount point needed for it must + be created on the primary and all standby servers before the command + is executed. Hardware need not be exactly the same, but experience shows + that maintaining two identical systems is easier than maintaining two + dissimilar ones over the lifetime of the application and system. + In any case the hardware architecture must be the same — shipping + from, say, a 32-bit to a 64-bit system will not work. + </para> + + <para> + In general, log shipping between servers running different major + <productname>PostgreSQL</productname> release + levels is not possible. It is the policy of the PostgreSQL Global + Development Group not to make changes to disk formats during minor release + upgrades, so it is likely that running different minor release levels + on primary and standby servers will work successfully. However, no + formal support for that is offered and you are advised to keep primary + and standby servers at the same release level as much as possible. + When updating to a new minor release, the safest policy is to update + the standby servers first — a new minor release is more likely + to be able to read WAL files from a previous minor release than vice + versa. 
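+    As a quick illustration (this is only a sanity check, not something the
+    server enforces for you), the release level of each machine can be
+    compared by running:
+<programlisting>
+-- on the primary and, if hot standby is enabled, on each standby
+SELECT version();
+</programlisting>
+    On standbys that do not accept connections, running <command>postgres
+    --version</command> on the host reports the same release level.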
+ </para> + + </sect2> + + <sect2 id="standby-server-operation" xreflabel="Standby Server Operation"> + <title>Standby Server Operation</title> + + <para> + A server enters standby mode if a + <anchor id="file-standby-signal" xreflabel="standby.signal"/> + <filename>standby.signal</filename> + <indexterm><primary><filename>standby.signal</filename></primary></indexterm> + file exists in the data directory when the server is started. + </para> + + <para> + In standby mode, the server continuously applies WAL received from the + master server. The standby server can read WAL from a WAL archive + (see <xref linkend="guc-restore-command"/>) or directly from the master + over a TCP connection (streaming replication). The standby server will + also attempt to restore any WAL found in the standby cluster's + <filename>pg_wal</filename> directory. That typically happens after a server + restart, when the standby replays again WAL that was streamed from the + master before the restart, but you can also manually copy files to + <filename>pg_wal</filename> at any time to have them replayed. + </para> + + <para> + At startup, the standby begins by restoring all WAL available in the + archive location, calling <varname>restore_command</varname>. Once it + reaches the end of WAL available there and <varname>restore_command</varname> + fails, it tries to restore any WAL available in the <filename>pg_wal</filename> directory. + If that fails, and streaming replication has been configured, the + standby tries to connect to the primary server and start streaming WAL + from the last valid record found in archive or <filename>pg_wal</filename>. If that fails + or streaming replication is not configured, or if the connection is + later disconnected, the standby goes back to step 1 and tries to + restore the file from the archive again. This loop of retries from the + archive, <filename>pg_wal</filename>, and via streaming replication goes on until the server + is stopped or failover is triggered by a trigger file. + </para> + + <para> + Standby mode is exited and the server switches to normal operation + when <command>pg_ctl promote</command> is run, + <function>pg_promote()</function> is called, or a trigger file is found + (<varname>promote_trigger_file</varname>). Before failover, + any WAL immediately available in the archive or in <filename>pg_wal</filename> will be + restored, but no attempt is made to connect to the master. + </para> + </sect2> + + <sect2 id="preparing-master-for-standby"> + <title>Preparing the Master for Standby Servers</title> + + <para> + Set up continuous archiving on the primary to an archive directory + accessible from the standby, as described + in <xref linkend="continuous-archiving"/>. The archive location should be + accessible from the standby even when the master is down, i.e., it should + reside on the standby server itself or another trusted server, not on + the master server. + </para> + + <para> + If you want to use streaming replication, set up authentication on the + primary server to allow replication connections from the standby + server(s); that is, create a role and provide a suitable entry or + entries in <filename>pg_hba.conf</filename> with the database field set to + <literal>replication</literal>. Also ensure <varname>max_wal_senders</varname> is set + to a sufficiently large value in the configuration file of the primary + server. If replication slots will be used, + ensure that <varname>max_replication_slots</varname> is set sufficiently + high as well. 
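+    As an illustration only (the role name, password, address, and values
+    below are arbitrary), the replication role could be created on the
+    primary with:
+<programlisting>
+CREATE ROLE repluser WITH REPLICATION LOGIN PASSWORD 'secret';
+</programlisting>
+    together with a matching <filename>pg_hba.conf</filename> entry and
+    configuration settings along these lines:
+<programlisting>
+# pg_hba.conf: allow replication connections by "repluser" from the standby
+host    replication     repluser        192.168.1.100/32        md5
+
+# postgresql.conf
+max_wal_senders = 10
+max_replication_slots = 10
+</programlisting>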
+ </para> + + <para> + Take a base backup as described in <xref linkend="backup-base-backup"/> + to bootstrap the standby server. + </para> + </sect2> + + <sect2 id="standby-server-setup"> + <title>Setting Up a Standby Server</title> + + <para> + To set up the standby server, restore the base backup taken from primary + server (see <xref linkend="backup-pitr-recovery"/>). Create a file + <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary></indexterm> + in the standby's cluster data + directory. Set <xref linkend="guc-restore-command"/> to a simple command to copy files from + the WAL archive. If you plan to have multiple standby servers for high + availability purposes, make sure that <varname>recovery_target_timeline</varname> is set to + <literal>latest</literal> (the default), to make the standby server follow the timeline change + that occurs at failover to another standby. + </para> + + <note> + <para> + Do not use pg_standby or similar tools with the built-in standby mode + described here. <xref linkend="guc-restore-command"/> should return immediately + if the file does not exist; the server will retry the command again if + necessary. See <xref linkend="log-shipping-alternative"/> + for using tools like pg_standby. + </para> + </note> + + <para> + If you want to use streaming replication, fill in + <xref linkend="guc-primary-conninfo"/> with a libpq connection string, including + the host name (or IP address) and any additional details needed to + connect to the primary server. If the primary needs a password for + authentication, the password needs to be specified in + <xref linkend="guc-primary-conninfo"/> as well. + </para> + + <para> + If you're setting up the standby server for high availability purposes, + set up WAL archiving, connections and authentication like the primary + server, because the standby server will work as a primary server after + failover. + </para> + + <para> + If you're using a WAL archive, its size can be minimized using the <xref + linkend="guc-archive-cleanup-command"/> parameter to remove files that are no + longer required by the standby server. + The <application>pg_archivecleanup</application> utility is designed specifically to + be used with <varname>archive_cleanup_command</varname> in typical single-standby + configurations, see <xref linkend="pgarchivecleanup"/>. + Note however, that if you're using the archive for backup purposes, you + need to retain files needed to recover from at least the latest base + backup, even if they're no longer needed by the standby. + </para> + + <para> + A simple example of configuration is: +<programlisting> +primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass options=''-c wal_sender_timeout=5000''' +restore_command = 'cp /path/to/archive/%f %p' +archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r' +</programlisting> + </para> + + <para> + You can have any number of standby servers, but if you use streaming + replication, make sure you set <varname>max_wal_senders</varname> high enough in + the primary to allow them to be connected simultaneously. + </para> + + </sect2> + + <sect2 id="streaming-replication"> + <title>Streaming Replication</title> + + <indexterm zone="high-availability"> + <primary>Streaming Replication</primary> + </indexterm> + + <para> + Streaming replication allows a standby server to stay more up-to-date + than is possible with file-based log shipping. 
The standby connects + to the primary, which streams WAL records to the standby as they're + generated, without waiting for the WAL file to be filled. + </para> + + <para> + Streaming replication is asynchronous by default + (see <xref linkend="synchronous-replication"/>), in which case there is + a small delay between committing a transaction in the primary and the + changes becoming visible in the standby. This delay is however much + smaller than with file-based log shipping, typically under one second + assuming the standby is powerful enough to keep up with the load. With + streaming replication, <varname>archive_timeout</varname> is not required to + reduce the data loss window. + </para> + + <para> + If you use streaming replication without file-based continuous + archiving, the server might recycle old WAL segments before the standby + has received them. If this occurs, the standby will need to be + reinitialized from a new base backup. You can avoid this by setting + <varname>wal_keep_size</varname> to a value large enough to ensure that + WAL segments are not recycled too early, or by configuring a replication + slot for the standby. If you set up a WAL archive that's accessible from + the standby, these solutions are not required, since the standby can + always use the archive to catch up provided it retains enough segments. + </para> + + <para> + To use streaming replication, set up a file-based log-shipping standby + server as described in <xref linkend="warm-standby"/>. The step that + turns a file-based log-shipping standby into streaming replication + standby is setting the <varname>primary_conninfo</varname> setting + to point to the primary server. Set + <xref linkend="guc-listen-addresses"/> and authentication options + (see <filename>pg_hba.conf</filename>) on the primary so that the standby server + can connect to the <literal>replication</literal> pseudo-database on the primary + server (see <xref linkend="streaming-replication-authentication"/>). + </para> + + <para> + On systems that support the keepalive socket option, setting + <xref linkend="guc-tcp-keepalives-idle"/>, + <xref linkend="guc-tcp-keepalives-interval"/> and + <xref linkend="guc-tcp-keepalives-count"/> helps the primary promptly + notice a broken connection. + </para> + + <para> + Set the maximum number of concurrent connections from the standby servers + (see <xref linkend="guc-max-wal-senders"/> for details). + </para> + + <para> + When the standby is started and <varname>primary_conninfo</varname> is set + correctly, the standby will connect to the primary after replaying all + WAL files available in the archive. If the connection is established + successfully, you will see a <literal>walreceiver</literal> in the standby, and + a corresponding <literal>walsender</literal> process in the primary. + </para> + + <sect3 id="streaming-replication-authentication"> + <title>Authentication</title> + <para> + It is very important that the access privileges for replication be set up + so that only trusted users can read the WAL stream, because it is + easy to extract privileged information from it. Standby servers must + authenticate to the primary as an account that has the + <literal>REPLICATION</literal> privilege or a superuser. It is + recommended to create a dedicated user account with + <literal>REPLICATION</literal> and <literal>LOGIN</literal> + privileges for replication. 
While <literal>REPLICATION</literal> + privilege gives very high permissions, it does not allow the user to + modify any data on the primary system, which the + <literal>SUPERUSER</literal> privilege does. + </para> + + <para> + Client authentication for replication is controlled by a + <filename>pg_hba.conf</filename> record specifying <literal>replication</literal> in the + <replaceable>database</replaceable> field. For example, if the standby is running on + host IP <literal>192.168.1.100</literal> and the account name for replication + is <literal>foo</literal>, the administrator can add the following line to the + <filename>pg_hba.conf</filename> file on the primary: + +<programlisting> +# Allow the user "foo" from host 192.168.1.100 to connect to the primary +# as a replication standby if the user's password is correctly supplied. +# +# TYPE DATABASE USER ADDRESS METHOD +host replication foo 192.168.1.100/32 md5 +</programlisting> + </para> + <para> + The host name and port number of the primary, connection user name, + and password are specified in the <xref linkend="guc-primary-conninfo"/>. + The password can also be set in the <filename>~/.pgpass</filename> file on the + standby (specify <literal>replication</literal> in the <replaceable>database</replaceable> + field). + For example, if the primary is running on host IP <literal>192.168.1.50</literal>, + port <literal>5432</literal>, the account name for replication is + <literal>foo</literal>, and the password is <literal>foopass</literal>, the administrator + can add the following line to the <filename>postgresql.conf</filename> file on the + standby: + +<programlisting> +# The standby connects to the primary that is running on host 192.168.1.50 +# and port 5432 as the user "foo" whose password is "foopass". +primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' +</programlisting> + </para> + </sect3> + + <sect3 id="streaming-replication-monitoring"> + <title>Monitoring</title> + <para> + An important health indicator of streaming replication is the amount + of WAL records generated in the primary, but not yet applied in the + standby. You can calculate this lag by comparing the current WAL write + location on the primary with the last WAL location received by the + standby. These locations can be retrieved using + <function>pg_current_wal_lsn</function> on the primary and + <function>pg_last_wal_receive_lsn</function> on the standby, + respectively (see <xref linkend="functions-admin-backup-table"/> and + <xref linkend="functions-recovery-info-table"/> for details). + The last WAL receive location in the standby is also displayed in the + process status of the WAL receiver process, displayed using the + <command>ps</command> command (see <xref linkend="monitoring-ps"/> for details). + </para> + <para> + You can retrieve a list of WAL sender processes via the + <link linkend="monitoring-pg-stat-replication-view"><structname> + pg_stat_replication</structname></link> view. Large differences between + <function>pg_current_wal_lsn</function> and the view's <literal>sent_lsn</literal> field + might indicate that the master server is under heavy load, while + differences between <literal>sent_lsn</literal> and + <function>pg_last_wal_receive_lsn</function> on the standby might indicate + network delay, or that the standby is under heavy load. 
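+    For example, a rough lag report (an illustrative query, not an official
+    recipe) can be obtained on the primary with:
+<programlisting>
+-- approximate replication lag, in bytes, for each connected standby
+SELECT application_name,
+       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS send_lag_bytes,
+       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
+  FROM pg_stat_replication;
+</programlisting>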
+ </para> + <para> + On a hot standby, the status of the WAL receiver process can be retrieved + via the <link linkend="monitoring-pg-stat-wal-receiver-view"> + <structname>pg_stat_wal_receiver</structname></link> view. A large + difference between <function>pg_last_wal_replay_lsn</function> and the + view's <literal>flushed_lsn</literal> indicates that WAL is being + received faster than it can be replayed. + </para> + </sect3> + </sect2> + + <sect2 id="streaming-replication-slots"> + <title>Replication Slots</title> + <indexterm> + <primary>replication slot</primary> + <secondary>streaming replication</secondary> + </indexterm> + <para> + Replication slots provide an automated way to ensure that the master does + not remove WAL segments until they have been received by all standbys, + and that the master does not remove rows which could cause a + <link linkend="hot-standby-conflict">recovery conflict</link> even when the + standby is disconnected. + </para> + <para> + In lieu of using replication slots, it is possible to prevent the removal + of old WAL segments using <xref linkend="guc-wal-keep-size"/>, or by + storing the segments in an archive using + <xref linkend="guc-archive-command"/>. + However, these methods often result in retaining more WAL segments than + required, whereas replication slots retain only the number of segments + known to be needed. On the other hand, replication slots can retain so + many WAL segments that they fill up the space allocated + for <literal>pg_wal</literal>; + <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files + retained by replication slots. + </para> + <para> + Similarly, <xref linkend="guc-hot-standby-feedback"/> + and <xref linkend="guc-vacuum-defer-cleanup-age"/> provide protection against + relevant rows being removed by vacuum, but the former provides no + protection during any time period when the standby is not connected, + and the latter often needs to be set to a high value to provide adequate + protection. Replication slots overcome these disadvantages. + </para> + <sect3 id="streaming-replication-slots-manipulation"> + <title>Querying and Manipulating Replication Slots</title> + <para> + Each replication slot has a name, which can contain lower-case letters, + numbers, and the underscore character. + </para> + <para> + Existing replication slots and their state can be seen in the + <link linkend="view-pg-replication-slots"><structname>pg_replication_slots</structname></link> + view. + </para> + <para> + Slots can be created and dropped either via the streaming replication + protocol (see <xref linkend="protocol-replication"/>) or via SQL + functions (see <xref linkend="functions-replication"/>). + </para> + </sect3> + <sect3 id="streaming-replication-slots-config"> + <title>Configuration Example</title> + <para> + You can create a replication slot like this: +<programlisting> +postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); + slot_name | lsn +-------------+----- + node_a_slot | + +postgres=# SELECT slot_name, slot_type, active FROM pg_replication_slots; + slot_name | slot_type | active +-------------+-----------+-------- + node_a_slot | physical | f +(1 row) +</programlisting> + To configure the standby to use this slot, <varname>primary_slot_name</varname> + should be configured on the standby. 
Here is a simple example: +<programlisting> +primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' +primary_slot_name = 'node_a_slot' +</programlisting> + </para> + </sect3> + </sect2> + + <sect2 id="cascading-replication"> + <title>Cascading Replication</title> + + <indexterm zone="high-availability"> + <primary>Cascading Replication</primary> + </indexterm> + + <para> + The cascading replication feature allows a standby server to accept replication + connections and stream WAL records to other standbys, acting as a relay. + This can be used to reduce the number of direct connections to the master + and also to minimize inter-site bandwidth overheads. + </para> + + <para> + A standby acting as both a receiver and a sender is known as a cascading + standby. Standbys that are more directly connected to the master are known + as upstream servers, while those standby servers further away are downstream + servers. Cascading replication does not place limits on the number or + arrangement of downstream servers, though each standby connects to only + one upstream server which eventually links to a single master/primary + server. + </para> + + <para> + A cascading standby sends not only WAL records received from the + master but also those restored from the archive. So even if the replication + connection in some upstream connection is terminated, streaming replication + continues downstream for as long as new WAL records are available. + </para> + + <para> + Cascading replication is currently asynchronous. Synchronous replication + (see <xref linkend="synchronous-replication"/>) settings have no effect on + cascading replication at present. + </para> + + <para> + Hot Standby feedback propagates upstream, whatever the cascaded arrangement. + </para> + + <para> + If an upstream standby server is promoted to become new master, downstream + servers will continue to stream from the new master if + <varname>recovery_target_timeline</varname> is set to <literal>'latest'</literal> (the default). + </para> + + <para> + To use cascading replication, set up the cascading standby so that it can + accept replication connections (that is, set + <xref linkend="guc-max-wal-senders"/> and <xref linkend="guc-hot-standby"/>, + and configure + <link linkend="auth-pg-hba-conf">host-based authentication</link>). + You will also need to set <varname>primary_conninfo</varname> in the downstream + standby to point to the cascading standby. + </para> + </sect2> + + <sect2 id="synchronous-replication"> + <title>Synchronous Replication</title> + + <indexterm zone="high-availability"> + <primary>Synchronous Replication</primary> + </indexterm> + + <para> + <productname>PostgreSQL</productname> streaming replication is asynchronous by + default. If the primary server + crashes then some transactions that were committed may not have been + replicated to the standby server, causing data loss. The amount + of data loss is proportional to the replication delay at the time of + failover. + </para> + + <para> + Synchronous replication offers the ability to confirm that all changes + made by a transaction have been transferred to one or more synchronous + standby servers. This extends that standard level of durability + offered by a transaction commit. This level of protection is referred + to as 2-safe replication in computer science theory, and group-1-safe + (group-safe and 1-safe) when <varname>synchronous_commit</varname> is set to + <literal>remote_write</literal>. 
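+    Because <varname>synchronous_commit</varname> can be changed at run time,
+    the level of protection can even be selected per transaction; a minimal
+    sketch (assuming synchronous replication has already been configured as
+    described below):
+<programlisting>
+BEGIN;
+SET LOCAL synchronous_commit = remote_write;
+-- data-modifying statements whose loss the application could tolerate
+COMMIT;
+</programlisting>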
+ </para> + + <para> + When requesting synchronous replication, each commit of a + write transaction will wait until confirmation is + received that the commit has been written to the write-ahead log on disk + of both the primary and standby server. The only possibility that data + can be lost is if both the primary and the standby suffer crashes at the + same time. This can provide a much higher level of durability, though only + if the sysadmin is cautious about the placement and management of the two + servers. Waiting for confirmation increases the user's confidence that the + changes will not be lost in the event of server crashes but it also + necessarily increases the response time for the requesting transaction. + The minimum wait time is the round-trip time between primary to standby. + </para> + + <para> + Read only transactions and transaction rollbacks need not wait for + replies from standby servers. Subtransaction commits do not wait for + responses from standby servers, only top-level commits. Long + running actions such as data loading or index building do not wait + until the very final commit message. All two-phase commit actions + require commit waits, including both prepare and commit. + </para> + + <para> + A synchronous standby can be a physical replication standby or a logical + replication subscriber. It can also be any other physical or logical WAL + replication stream consumer that knows how to send the appropriate + feedback messages. Besides the built-in physical and logical replication + systems, this includes special programs such + as <command>pg_receivewal</command> and <command>pg_recvlogical</command> + as well as some third-party replication systems and custom programs. + Check the respective documentation for details on synchronous replication + support. + </para> + + <sect3 id="synchronous-replication-config"> + <title>Basic Configuration</title> + + <para> + Once streaming replication has been configured, configuring synchronous + replication requires only one additional configuration step: + <xref linkend="guc-synchronous-standby-names"/> must be set to + a non-empty value. <varname>synchronous_commit</varname> must also be set to + <literal>on</literal>, but since this is the default value, typically no change is + required. (See <xref linkend="runtime-config-wal-settings"/> and + <xref linkend="runtime-config-replication-master"/>.) + This configuration will cause each commit to wait for + confirmation that the standby has written the commit record to durable + storage. + <varname>synchronous_commit</varname> can be set by individual + users, so it can be configured in the configuration file, for particular + users or databases, or dynamically by applications, in order to control + the durability guarantee on a per-transaction basis. + </para> + + <para> + After a commit record has been written to disk on the primary, the + WAL record is then sent to the standby. The standby sends reply + messages each time a new batch of WAL data is written to disk, unless + <varname>wal_receiver_status_interval</varname> is set to zero on the standby. + In the case that <varname>synchronous_commit</varname> is set to + <literal>remote_apply</literal>, the standby sends reply messages when the commit + record is replayed, making the transaction visible. 
+ If the standby is chosen as a synchronous standby, according to the setting + of <varname>synchronous_standby_names</varname> on the primary, the reply + messages from that standby will be considered along with those from other + synchronous standbys to decide when to release transactions waiting for + confirmation that the commit record has been received. These parameters + allow the administrator to specify which standby servers should be + synchronous standbys. Note that the configuration of synchronous + replication is mainly on the master. Named standbys must be directly + connected to the master; the master knows nothing about downstream + standby servers using cascaded replication. + </para> + + <para> + Setting <varname>synchronous_commit</varname> to <literal>remote_write</literal> will + cause each commit to wait for confirmation that the standby has received + the commit record and written it out to its own operating system, but not + for the data to be flushed to disk on the standby. This + setting provides a weaker guarantee of durability than <literal>on</literal> + does: the standby could lose the data in the event of an operating system + crash, though not a <productname>PostgreSQL</productname> crash. + However, it's a useful setting in practice + because it can decrease the response time for the transaction. + Data loss could only occur if both the primary and the standby crash and + the database of the primary gets corrupted at the same time. + </para> + + <para> + Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will + cause each commit to wait until the current synchronous standbys report + that they have replayed the transaction, making it visible to user + queries. In simple cases, this allows for load balancing with causal + consistency. + </para> + + <para> + Users will stop waiting if a fast shutdown is requested. However, as + when using asynchronous replication, the server will not fully + shutdown until all outstanding WAL records are transferred to the currently + connected standby servers. + </para> + + </sect3> + + <sect3 id="synchronous-replication-multiple-standbys"> + <title>Multiple Synchronous Standbys</title> + + <para> + Synchronous replication supports one or more synchronous standby servers; + transactions will wait until all the standby servers which are considered + as synchronous confirm receipt of their data. The number of synchronous + standbys that transactions must wait for replies from is specified in + <varname>synchronous_standby_names</varname>. This parameter also specifies + a list of standby names and the method (<literal>FIRST</literal> and + <literal>ANY</literal>) to choose synchronous standbys from the listed ones. + </para> + <para> + The method <literal>FIRST</literal> specifies a priority-based synchronous + replication and makes transaction commits wait until their WAL records are + replicated to the requested number of synchronous standbys chosen based on + their priorities. The standbys whose names appear earlier in the list are + given higher priority and will be considered as synchronous. Other standby + servers appearing later in this list represent potential synchronous + standbys. If any of the current synchronous standbys disconnects for + whatever reason, it will be replaced immediately with the + next-highest-priority standby. 
+ </para> + <para> + An example of <varname>synchronous_standby_names</varname> for + a priority-based multiple synchronous standbys is: +<programlisting> +synchronous_standby_names = 'FIRST 2 (s1, s2, s3)' +</programlisting> + In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>, + <literal>s3</literal> and <literal>s4</literal> are running, the two standbys + <literal>s1</literal> and <literal>s2</literal> will be chosen as synchronous standbys + because their names appear early in the list of standby names. + <literal>s3</literal> is a potential synchronous standby and will take over + the role of synchronous standby when either of <literal>s1</literal> or + <literal>s2</literal> fails. <literal>s4</literal> is an asynchronous standby since + its name is not in the list. + </para> + <para> + The method <literal>ANY</literal> specifies a quorum-based synchronous + replication and makes transaction commits wait until their WAL records + are replicated to <emphasis>at least</emphasis> the requested number of + synchronous standbys in the list. + </para> + <para> + An example of <varname>synchronous_standby_names</varname> for + a quorum-based multiple synchronous standbys is: +<programlisting> +synchronous_standby_names = 'ANY 2 (s1, s2, s3)' +</programlisting> + In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>, + <literal>s3</literal> and <literal>s4</literal> are running, transaction commits will + wait for replies from at least any two standbys of <literal>s1</literal>, + <literal>s2</literal> and <literal>s3</literal>. <literal>s4</literal> is an asynchronous + standby since its name is not in the list. + </para> + <para> + The synchronous states of standby servers can be viewed using + the <structname>pg_stat_replication</structname> view. + </para> + </sect3> + + <sect3 id="synchronous-replication-performance"> + <title>Planning for Performance</title> + + <para> + Synchronous replication usually requires carefully planned and placed + standby servers to ensure applications perform acceptably. Waiting + doesn't utilize system resources, but transaction locks continue to be + held until the transfer is confirmed. As a result, incautious use of + synchronous replication will reduce performance for database + applications because of increased response times and higher contention. + </para> + + <para> + <productname>PostgreSQL</productname> allows the application developer + to specify the durability level required via replication. This can be + specified for the system overall, though it can also be specified for + specific users or connections, or even individual transactions. + </para> + + <para> + For example, an application workload might consist of: + 10% of changes are important customer details, while + 90% of changes are less important data that the business can more + easily survive if it is lost, such as chat messages between users. + </para> + + <para> + With synchronous replication options specified at the application level + (on the primary) we can offer synchronous replication for the most + important changes, without slowing down the bulk of the total workload. + Application level options are an important and practical tool for allowing + the benefits of synchronous replication for high performance applications. + </para> + + <para> + You should consider that the network bandwidth must be higher than + the rate of generation of WAL data. 
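+    One rough way to estimate that rate (a sketch only; sample over a period
+    representative of your workload) is to compare the WAL insert position at
+    two points in time:
+<programlisting>
+-- first sample; note the LSN that is returned
+SELECT pg_current_wal_lsn();
+
+-- after a representative interval, substitute the value from the first
+-- sample to get the number of WAL bytes generated in between
+SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000148');
+</programlisting>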
+ </para> + + </sect3> + + <sect3 id="synchronous-replication-ha"> + <title>Planning for High Availability</title> + + <para> + <varname>synchronous_standby_names</varname> specifies the number and + names of synchronous standbys that transaction commits made when + <varname>synchronous_commit</varname> is set to <literal>on</literal>, + <literal>remote_apply</literal> or <literal>remote_write</literal> will wait for + responses from. Such transaction commits may never be completed + if any one of synchronous standbys should crash. + </para> + + <para> + The best solution for high availability is to ensure you keep as many + synchronous standbys as requested. This can be achieved by naming multiple + potential synchronous standbys using <varname>synchronous_standby_names</varname>. + </para> + + <para> + In a priority-based synchronous replication, the standbys whose names + appear earlier in the list will be used as synchronous standbys. + Standbys listed after these will take over the role of synchronous standby + if one of current ones should fail. + </para> + + <para> + In a quorum-based synchronous replication, all the standbys appearing + in the list will be used as candidates for synchronous standbys. + Even if one of them should fail, the other standbys will keep performing + the role of candidates of synchronous standby. + </para> + + <para> + When a standby first attaches to the primary, it will not yet be properly + synchronized. This is described as <literal>catchup</literal> mode. Once + the lag between standby and primary reaches zero for the first time + we move to real-time <literal>streaming</literal> state. + The catch-up duration may be long immediately after the standby has + been created. If the standby is shut down, then the catch-up period + will increase according to the length of time the standby has been down. + The standby is only able to become a synchronous standby + once it has reached <literal>streaming</literal> state. + This state can be viewed using + the <structname>pg_stat_replication</structname> view. + </para> + + <para> + If primary restarts while commits are waiting for acknowledgment, those + waiting transactions will be marked fully committed once the primary + database recovers. + There is no way to be certain that all standbys have received all + outstanding WAL data at time of the crash of the primary. Some + transactions may not show as committed on the standby, even though + they show as committed on the primary. The guarantee we offer is that + the application will not receive explicit acknowledgment of the + successful commit of a transaction until the WAL data is known to be + safely received by all the synchronous standbys. + </para> + + <para> + If you really cannot keep as many synchronous standbys as requested + then you should decrease the number of synchronous standbys that + transaction commits must wait for responses from + in <varname>synchronous_standby_names</varname> (or disable it) and + reload the configuration file on the primary server. + </para> + + <para> + If the primary is isolated from remaining standby servers you should + fail over to the best candidate of those other remaining standby servers. 
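+    Until such a failover is carried out, transactions left waiting can be
+    released by relaxing the requirement as described above; a sketch (the
+    standby names are only examples, and editing
+    <filename>postgresql.conf</filename> followed by a reload works equally
+    well):
+<programlisting>
+ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (s1, s2, s3)';
+SELECT pg_reload_conf();
+</programlisting>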
+   </para>
+
+   <para>
+    If you need to re-create a standby server while transactions are
+    waiting, make sure that the commands <function>pg_start_backup()</function> and
+    <function>pg_stop_backup()</function> are run in a session with
+    <varname>synchronous_commit</varname> = <literal>off</literal>, otherwise those
+    requests will wait forever for the standby to appear.
+   </para>
+
+  </sect3>
+  </sect2>
+
+  <sect2 id="continuous-archiving-in-standby">
+   <title>Continuous Archiving in Standby</title>
+
+   <indexterm>
+    <primary>continuous archiving</primary>
+    <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+    When continuous WAL archiving is used in a standby, there are two
+    different scenarios: the WAL archive can be shared between the primary
+    and the standby, or the standby can have its own WAL archive. When
+    the standby has its own WAL archive, set <varname>archive_mode</varname>
+    to <literal>always</literal>, and the standby will call the archive
+    command for every WAL segment it receives, whether it's by restoring
+    from the archive or by streaming replication. The shared archive can
+    be handled similarly, but the <varname>archive_command</varname> must
+    test whether the file being archived already exists and whether the
+    existing file has identical contents. This requires more care in the
+    <varname>archive_command</varname>, as it must
+    be careful not to overwrite an existing file with different contents,
+    but to return success if exactly the same file is archived twice. And
+    all that must be done free of race conditions, in case two servers attempt
+    to archive the same file at the same time.
+   </para>
+
+   <para>
+    If <varname>archive_mode</varname> is set to <literal>on</literal>, the
+    archiver is not enabled during recovery or standby mode. If the standby
+    server is promoted, it will start archiving after the promotion, but
+    will not archive any WAL or timeline history files that
+    it did not generate itself. To get a complete
+    series of WAL files in the archive, you must ensure that all WAL is
+    archived before it reaches the standby. This is inherently true with
+    file-based log shipping, as the standby can only restore files that
+    are found in the archive, but not if streaming replication is enabled.
+    When a server is not in recovery mode, there is no difference between
+    <literal>on</literal> and <literal>always</literal> modes.
+   </para>
+  </sect2>
+  </sect1>
+
+  <sect1 id="warm-standby-failover">
+   <title>Failover</title>
+
+   <para>
+    If the primary server fails then the standby server should begin
+    failover procedures.
+   </para>
+
+   <para>
+    If the standby server fails then no failover need take place. If the
+    standby server can be restarted, even some time later, then the recovery
+    process can also be restarted immediately, taking advantage of
+    restartable recovery. If the standby server cannot be restarted, then a
+    full new standby server instance should be created.
+   </para>
+
+   <para>
+    If the primary server fails and the standby server becomes the
+    new primary, and then the old primary restarts, you must have
+    a mechanism for informing the old primary that it is no longer the primary. This is
+    sometimes known as <acronym>STONITH</acronym> (Shoot The Other Node In The Head), which is
+    necessary to avoid situations where both systems think they are the
+    primary, which will lead to confusion and ultimately data loss.
+ </para> + + <para> + Many failover systems use just two systems, the primary and the standby, + connected by some kind of heartbeat mechanism to continually verify the + connectivity between the two and the viability of the primary. It is + also possible to use a third system (called a witness server) to prevent + some cases of inappropriate failover, but the additional complexity + might not be worthwhile unless it is set up with sufficient care and + rigorous testing. + </para> + + <para> + <productname>PostgreSQL</productname> does not provide the system + software required to identify a failure on the primary and notify + the standby database server. Many such tools exist and are well + integrated with the operating system facilities required for + successful failover, such as IP address migration. + </para> + + <para> + Once failover to the standby occurs, there is only a + single server in operation. This is known as a degenerate state. + The former standby is now the primary, but the former primary is down + and might stay down. To return to normal operation, a standby server + must be recreated, + either on the former primary system when it comes up, or on a third, + possibly new, system. The <xref linkend="app-pgrewind"/> utility can be + used to speed up this process on large clusters. + Once complete, the primary and standby can be + considered to have switched roles. Some people choose to use a third + server to provide backup for the new primary until the new standby + server is recreated, + though clearly this complicates the system configuration and + operational processes. + </para> + + <para> + So, switching from primary to standby server can be fast but requires + some time to re-prepare the failover cluster. Regular switching from + primary to standby is useful, since it allows regular downtime on + each system for maintenance. This also serves as a test of the + failover mechanism to ensure that it will really work when you need it. + Written administration procedures are advised. + </para> + + <para> + To trigger failover of a log-shipping standby server, run + <command>pg_ctl promote</command>, call <function>pg_promote()</function>, + or create a trigger file with the file name and path specified by the + <varname>promote_trigger_file</varname>. If you're planning to use + <command>pg_ctl promote</command> or to call + <function>pg_promote()</function> to fail over, + <varname>promote_trigger_file</varname> is not required. If you're + setting up the reporting servers that are only used to offload read-only + queries from the primary, not for high availability purposes, you don't + need to promote it. + </para> + </sect1> + + <sect1 id="log-shipping-alternative"> + <title>Alternative Method for Log Shipping</title> + + <para> + An alternative to the built-in standby mode described in the previous + sections is to use a <varname>restore_command</varname> that polls the archive location. + This was the only option available in versions 8.4 and below. See the + <xref linkend="pgstandby"/> module for a reference implementation of this. + </para> + + <para> + Note that in this mode, the server will apply WAL one file at a + time, so if you use the standby server for queries (see Hot Standby), + there is a delay between an action in the master and when the + action becomes visible in the standby, corresponding to the time it takes + to fill up the WAL file. <varname>archive_timeout</varname> can be used to make that delay + shorter. 
Also note that you can't combine streaming replication with + this method. + </para> + + <para> + The operations that occur on both primary and standby servers are + normal continuous archiving and recovery tasks. The only point of + contact between the two database servers is the archive of WAL files + that both share: primary writing to the archive, standby reading from + the archive. Care must be taken to ensure that WAL archives from separate + primary servers do not become mixed together or confused. The archive + need not be large if it is only required for standby operation. + </para> + + <para> + The magic that makes the two loosely coupled servers work together is + simply a <varname>restore_command</varname> used on the standby that, + when asked for the next WAL file, waits for it to become available from + the primary. Normal recovery + processing would request a file from the WAL archive, reporting failure + if the file was unavailable. For standby processing it is normal for + the next WAL file to be unavailable, so the standby must wait for + it to appear. For files ending in + <literal>.history</literal> there is no need to wait, and a non-zero return + code must be returned. A waiting <varname>restore_command</varname> can be + written as a custom script that loops after polling for the existence of + the next WAL file. There must also be some way to trigger failover, which + should interrupt the <varname>restore_command</varname>, break the loop and + return a file-not-found error to the standby server. This ends recovery + and the standby will then come up as a normal server. + </para> + + <para> + Pseudocode for a suitable <varname>restore_command</varname> is: +<programlisting> +triggered = false; +while (!NextWALFileReady() && !triggered) +{ + sleep(100000L); /* wait for ~0.1 sec */ + if (CheckForExternalTrigger()) + triggered = true; +} +if (!triggered) + CopyWALFileForRecovery(); +</programlisting> + </para> + + <para> + A working example of a waiting <varname>restore_command</varname> is provided + in the <xref linkend="pgstandby"/> module. It + should be used as a reference on how to correctly implement the logic + described above. It can also be extended as needed to support specific + configurations and environments. + </para> + + <para> + The method for triggering failover is an important part of planning + and design. One potential option is the <varname>restore_command</varname> + command. It is executed once for each WAL file, but the process + running the <varname>restore_command</varname> is created and dies for + each file, so there is no daemon or server process, and + signals or a signal handler cannot be used. Therefore, the + <varname>restore_command</varname> is not suitable to trigger failover. + It is possible to use a simple timeout facility, especially if + used in conjunction with a known <varname>archive_timeout</varname> + setting on the primary. However, this is somewhat error prone + since a network problem or busy primary server might be sufficient + to initiate failover. A notification mechanism such as the explicit + creation of a trigger file is ideal, if this can be arranged. + </para> + + <sect2 id="warm-standby-config"> + <title>Implementation</title> + + <para> + The short procedure for configuring a standby server using this alternative + method is as follows. For + full details of each step, refer to previous sections as noted. 
+ <orderedlist> + <listitem> + <para> + Set up primary and standby systems as nearly identical as + possible, including two identical copies of + <productname>PostgreSQL</productname> at the same release level. + </para> + </listitem> + <listitem> + <para> + Set up continuous archiving from the primary to a WAL archive + directory on the standby server. Ensure that + <xref linkend="guc-archive-mode"/>, + <xref linkend="guc-archive-command"/> and + <xref linkend="guc-archive-timeout"/> + are set appropriately on the primary + (see <xref linkend="backup-archiving-wal"/>). + </para> + </listitem> + <listitem> + <para> + Make a base backup of the primary server (see <xref + linkend="backup-base-backup"/>), and load this data onto the standby. + </para> + </listitem> + <listitem> + <para> + Begin recovery on the standby server from the local WAL + archive, using <varname>restore_command</varname> that waits + as described previously (see <xref linkend="backup-pitr-recovery"/>). + </para> + </listitem> + </orderedlist> + </para> + + <para> + Recovery treats the WAL archive as read-only, so once a WAL file has + been copied to the standby system it can be copied to tape at the same + time as it is being read by the standby database server. + Thus, running a standby server for high availability can be performed at + the same time as files are stored for longer term disaster recovery + purposes. + </para> + + <para> + For testing purposes, it is possible to run both primary and standby + servers on the same system. This does not provide any worthwhile + improvement in server robustness, nor would it be described as HA. + </para> + </sect2> + + <sect2 id="warm-standby-record"> + <title>Record-Based Log Shipping</title> + + <para> + It is also possible to implement record-based log shipping using this + alternative method, though this requires custom development, and changes + will still only become visible to hot standby queries after a full WAL + file has been shipped. + </para> + + <para> + An external program can call the <function>pg_walfile_name_offset()</function> + function (see <xref linkend="functions-admin"/>) + to find out the file name and the exact byte offset within it of + the current end of WAL. It can then access the WAL file directly + and copy the data from the last known end of WAL through the current end + over to the standby servers. With this approach, the window for data + loss is the polling cycle time of the copying program, which can be very + small, and there is no wasted bandwidth from forcing partially-used + segment files to be archived. Note that the standby servers' + <varname>restore_command</varname> scripts can only deal with whole WAL files, + so the incrementally copied data is not ordinarily made available to + the standby servers. It is of use only when the primary dies — + then the last partial WAL file is fed to the standby before allowing + it to come up. The correct implementation of this process requires + cooperation of the <varname>restore_command</varname> script with the data + copying program. + </para> + + <para> + Starting with <productname>PostgreSQL</productname> version 9.0, you can use + streaming replication (see <xref linkend="streaming-replication"/>) to + achieve the same benefits with less effort. 
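+   </para>
+
+   <para>
+    If you do implement record-based shipping, a minimal sketch of obtaining
+    the current end of WAL with the function mentioned above (run on the
+    primary) is:
+<programlisting>
+SELECT file_name, file_offset
+  FROM pg_walfile_name_offset(pg_current_wal_lsn());
+</programlisting>
+    The copying program would then transfer the bytes between the previously
+    recorded offset and this one to the standby servers.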
+ </para> + </sect2> + </sect1> + + <sect1 id="hot-standby"> + <title>Hot Standby</title> + + <indexterm zone="high-availability"> + <primary>Hot Standby</primary> + </indexterm> + + <para> + Hot Standby is the term used to describe the ability to connect to + the server and run read-only queries while the server is in archive + recovery or standby mode. This + is useful both for replication purposes and for restoring a backup + to a desired state with great precision. + The term Hot Standby also refers to the ability of the server to move + from recovery through to normal operation while users continue running + queries and/or keep their connections open. + </para> + + <para> + Running queries in hot standby mode is similar to normal query operation, + though there are several usage and administrative differences + explained below. + </para> + + <sect2 id="hot-standby-users"> + <title>User's Overview</title> + + <para> + When the <xref linkend="guc-hot-standby"/> parameter is set to true on a + standby server, it will begin accepting connections once the recovery has + brought the system to a consistent state. All such connections are + strictly read-only; not even temporary tables may be written. + </para> + + <para> + The data on the standby takes some time to arrive from the primary server + so there will be a measurable delay between primary and standby. Running the + same query nearly simultaneously on both primary and standby might therefore + return differing results. We say that data on the standby is + <firstterm>eventually consistent</firstterm> with the primary. Once the + commit record for a transaction is replayed on the standby, the changes + made by that transaction will be visible to any new snapshots taken on + the standby. Snapshots may be taken at the start of each query or at the + start of each transaction, depending on the current transaction isolation + level. For more details, see <xref linkend="transaction-iso"/>. + </para> + + <para> + Transactions started during hot standby may issue the following commands: + + <itemizedlist> + <listitem> + <para> + Query access: <command>SELECT</command>, <command>COPY TO</command> + </para> + </listitem> + <listitem> + <para> + Cursor commands: <command>DECLARE</command>, <command>FETCH</command>, <command>CLOSE</command> + </para> + </listitem> + <listitem> + <para> + Settings: <command>SHOW</command>, <command>SET</command>, <command>RESET</command> + </para> + </listitem> + <listitem> + <para> + Transaction management commands: + <itemizedlist> + <listitem> + <para> + <command>BEGIN</command>, <command>END</command>, <command>ABORT</command>, <command>START TRANSACTION</command> + </para> + </listitem> + <listitem> + <para> + <command>SAVEPOINT</command>, <command>RELEASE</command>, <command>ROLLBACK TO SAVEPOINT</command> + </para> + </listitem> + <listitem> + <para> + <command>EXCEPTION</command> blocks and other internal subtransactions + </para> + </listitem> + </itemizedlist> + </para> + </listitem> + <listitem> + <para> + <command>LOCK TABLE</command>, though only when explicitly in one of these modes: + <literal>ACCESS SHARE</literal>, <literal>ROW SHARE</literal> or <literal>ROW EXCLUSIVE</literal>. 
+ </para> + </listitem> + <listitem> + <para> + Plans and resources: <command>PREPARE</command>, <command>EXECUTE</command>, + <command>DEALLOCATE</command>, <command>DISCARD</command> + </para> + </listitem> + <listitem> + <para> + Plugins and extensions: <command>LOAD</command> + </para> + </listitem> + <listitem> + <para> + <command>UNLISTEN</command> + </para> + </listitem> + </itemizedlist> + </para> + + <para> + Transactions started during hot standby will never be assigned a + transaction ID and cannot write to the system write-ahead log. + Therefore, the following actions will produce error messages: + + <itemizedlist> + <listitem> + <para> + Data Manipulation Language (DML): <command>INSERT</command>, + <command>UPDATE</command>, <command>DELETE</command>, <command>COPY FROM</command>, + <command>TRUNCATE</command>. + Note that there are no allowed actions that result in a trigger + being executed during recovery. This restriction applies even to + temporary tables, because table rows cannot be read or written without + assigning a transaction ID, which is currently not possible in a + Hot Standby environment. + </para> + </listitem> + <listitem> + <para> + Data Definition Language (DDL): <command>CREATE</command>, + <command>DROP</command>, <command>ALTER</command>, <command>COMMENT</command>. + This restriction applies even to temporary tables, because carrying + out these operations would require updating the system catalog tables. + </para> + </listitem> + <listitem> + <para> + <command>SELECT ... FOR SHARE | UPDATE</command>, because row locks cannot be + taken without updating the underlying data files. + </para> + </listitem> + <listitem> + <para> + Rules on <command>SELECT</command> statements that generate DML commands. + </para> + </listitem> + <listitem> + <para> + <command>LOCK</command> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</literal>. + </para> + </listitem> + <listitem> + <para> + <command>LOCK</command> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</literal>. + </para> + </listitem> + <listitem> + <para> + Transaction management commands that explicitly set non-read-only state: + <itemizedlist> + <listitem> + <para> + <command>BEGIN READ WRITE</command>, + <command>START TRANSACTION READ WRITE</command> + </para> + </listitem> + <listitem> + <para> + <command>SET TRANSACTION READ WRITE</command>, + <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</command> + </para> + </listitem> + <listitem> + <para> + <command>SET transaction_read_only = off</command> + </para> + </listitem> + </itemizedlist> + </para> + </listitem> + <listitem> + <para> + Two-phase commit commands: <command>PREPARE TRANSACTION</command>, + <command>COMMIT PREPARED</command>, <command>ROLLBACK PREPARED</command> + because even read-only transactions need to write WAL in the + prepare phase (the first phase of two phase commit). + </para> + </listitem> + <listitem> + <para> + Sequence updates: <function>nextval()</function>, <function>setval()</function> + </para> + </listitem> + <listitem> + <para> + <command>LISTEN</command>, <command>NOTIFY</command> + </para> + </listitem> + </itemizedlist> + </para> + + <para> + In normal operation, <quote>read-only</quote> transactions are allowed to + use <command>LISTEN</command> and <command>NOTIFY</command>, + so Hot Standby sessions operate under slightly tighter + restrictions than ordinary read-only sessions. 
It is possible that some + of these restrictions might be loosened in a future release. + </para> + + <para> + During hot standby, the parameter <varname>transaction_read_only</varname> is always + true and may not be changed. But as long as no attempt is made to modify + the database, connections during hot standby will act much like any other + database connection. If failover or switchover occurs, the database will + switch to normal processing mode. Sessions will remain connected while the + server changes mode. Once hot standby finishes, it will be possible to + initiate read-write transactions (even from a session begun during + hot standby). + </para> + + <para> + Users will be able to tell whether their session is read-only by + issuing <command>SHOW transaction_read_only</command>. In addition, a set of + functions (<xref linkend="functions-recovery-info-table"/>) allow users to + access information about the standby server. These allow you to write + programs that are aware of the current state of the database. These + can be used to monitor the progress of recovery, or to allow you to + write complex programs that restore the database to particular states. + </para> + </sect2> + + <sect2 id="hot-standby-conflict"> + <title>Handling Query Conflicts</title> + + <para> + The primary and standby servers are in many ways loosely connected. Actions + on the primary will have an effect on the standby. As a result, there is + potential for negative interactions or conflicts between them. The easiest + conflict to understand is performance: if a huge data load is taking place + on the primary then this will generate a similar stream of WAL records on the + standby, so standby queries may contend for system resources, such as I/O. + </para> + + <para> + There are also additional types of conflict that can occur with Hot Standby. + These conflicts are <emphasis>hard conflicts</emphasis> in the sense that queries + might need to be canceled and, in some cases, sessions disconnected to resolve them. + The user is provided with several ways to handle these + conflicts. Conflict cases include: + + <itemizedlist> + <listitem> + <para> + Access Exclusive locks taken on the primary server, including both + explicit <command>LOCK</command> commands and various <acronym>DDL</acronym> + actions, conflict with table accesses in standby queries. + </para> + </listitem> + <listitem> + <para> + Dropping a tablespace on the primary conflicts with standby queries + using that tablespace for temporary work files. + </para> + </listitem> + <listitem> + <para> + Dropping a database on the primary conflicts with sessions connected + to that database on the standby. + </para> + </listitem> + <listitem> + <para> + Application of a vacuum cleanup record from WAL conflicts with + standby transactions whose snapshots can still <quote>see</quote> any of + the rows to be removed. + </para> + </listitem> + <listitem> + <para> + Application of a vacuum cleanup record from WAL conflicts with + queries accessing the target page on the standby, whether or not + the data to be removed is visible. + </para> + </listitem> + </itemizedlist> + </para> + + <para> + On the primary server, these cases simply result in waiting; and the + user might choose to cancel either of the conflicting actions. However, + on the standby there is no choice: the WAL-logged action already occurred + on the primary so the standby must not fail to apply it. 
Furthermore, + allowing WAL application to wait indefinitely may be very undesirable, + because the standby's state will become increasingly far behind the + primary's. Therefore, a mechanism is provided to forcibly cancel standby + queries that conflict with to-be-applied WAL records. + </para> + + <para> + An example of the problem situation is an administrator on the primary + server running <command>DROP TABLE</command> on a table that is currently being + queried on the standby server. Clearly the standby query cannot continue + if the <command>DROP TABLE</command> is applied on the standby. If this situation + occurred on the primary, the <command>DROP TABLE</command> would wait until the + other query had finished. But when <command>DROP TABLE</command> is run on the + primary, the primary doesn't have information about what queries are + running on the standby, so it will not wait for any such standby + queries. The WAL change records come through to the standby while the + standby query is still running, causing a conflict. The standby server + must either delay application of the WAL records (and everything after + them, too) or else cancel the conflicting query so that the <command>DROP + TABLE</command> can be applied. + </para> + + <para> + When a conflicting query is short, it's typically desirable to allow it to + complete by delaying WAL application for a little bit; but a long delay in + WAL application is usually not desirable. So the cancel mechanism has + parameters, <xref linkend="guc-max-standby-archive-delay"/> and <xref + linkend="guc-max-standby-streaming-delay"/>, that define the maximum + allowed delay in WAL application. Conflicting queries will be canceled + once it has taken longer than the relevant delay setting to apply any + newly-received WAL data. There are two parameters so that different delay + values can be specified for the case of reading WAL data from an archive + (i.e., initial recovery from a base backup or <quote>catching up</quote> a + standby server that has fallen far behind) versus reading WAL data via + streaming replication. + </para> + + <para> + In a standby server that exists primarily for high availability, it's + best to set the delay parameters relatively short, so that the server + cannot fall far behind the primary due to delays caused by standby + queries. However, if the standby server is meant for executing + long-running queries, then a high or even infinite delay value may be + preferable. Keep in mind however that a long-running query could + cause other sessions on the standby server to not see recent changes + on the primary, if it delays application of WAL records. + </para> + + <para> + Once the delay specified by <varname>max_standby_archive_delay</varname> or + <varname>max_standby_streaming_delay</varname> has been exceeded, conflicting + queries will be canceled. This usually results just in a cancellation + error, although in the case of replaying a <command>DROP DATABASE</command> + the entire conflicting session will be terminated. Also, if the conflict + is over a lock held by an idle transaction, the conflicting session is + terminated (this behavior might change in the future). + </para> + + <para> + Canceled queries may be retried immediately (after beginning a new + transaction, of course). Since query cancellation depends on + the nature of the WAL records being replayed, a query that was + canceled may well succeed if it is executed again. 
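+   </para>
+
+   <para>
+    As an illustration of the delay settings discussed above (the values
+    shown are only examples), a standby used mainly for high availability
+    might be configured with short delays:
+<programlisting>
+max_standby_archive_delay = 30s
+max_standby_streaming_delay = 30s
+</programlisting>
+    while a standby intended for long-running reporting queries might use
+    much larger values, or -1 to wait forever.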
+ </para> + + <para> + Keep in mind that the delay parameters are compared to the elapsed time + since the WAL data was received by the standby server. Thus, the grace + period allowed to any one query on the standby is never more than the + delay parameter, and could be considerably less if the standby has already + fallen behind as a result of waiting for previous queries to complete, or + as a result of being unable to keep up with a heavy update load. + </para> + + <para> + The most common reason for conflict between standby queries and WAL replay + is <quote>early cleanup</quote>. Normally, <productname>PostgreSQL</productname> allows + cleanup of old row versions when there are no transactions that need to + see them to ensure correct visibility of data according to MVCC rules. + However, this rule can only be applied for transactions executing on the + master. So it is possible that cleanup on the master will remove row + versions that are still visible to a transaction on the standby. + </para> + + <para> + Experienced users should note that both row version cleanup and row version + freezing will potentially conflict with standby queries. Running a manual + <command>VACUUM FREEZE</command> is likely to cause conflicts even on tables with + no updated or deleted rows. + </para> + + <para> + Users should be clear that tables that are regularly and heavily updated + on the primary server will quickly cause cancellation of longer running + queries on the standby. In such cases the setting of a finite value for + <varname>max_standby_archive_delay</varname> or + <varname>max_standby_streaming_delay</varname> can be considered similar to + setting <varname>statement_timeout</varname>. + </para> + + <para> + Remedial possibilities exist if the number of standby-query cancellations + is found to be unacceptable. The first option is to set the parameter + <varname>hot_standby_feedback</varname>, which prevents <command>VACUUM</command> from + removing recently-dead rows and so cleanup conflicts do not occur. + If you do this, you + should note that this will delay cleanup of dead rows on the primary, + which may result in undesirable table bloat. However, the cleanup + situation will be no worse than if the standby queries were running + directly on the primary server, and you are still getting the benefit of + off-loading execution onto the standby. + If standby servers connect and disconnect frequently, you + might want to make adjustments to handle the period when + <varname>hot_standby_feedback</varname> feedback is not being provided. + For example, consider increasing <varname>max_standby_archive_delay</varname> + so that queries are not rapidly canceled by conflicts in WAL archive + files during disconnected periods. You should also consider increasing + <varname>max_standby_streaming_delay</varname> to avoid rapid cancellations + by newly-arrived streaming WAL entries after reconnection. + </para> + + <para> + Another option is to increase <xref linkend="guc-vacuum-defer-cleanup-age"/> + on the primary server, so that dead rows will not be cleaned up as quickly + as they normally would be. This will allow more time for queries to + execute before they are canceled on the standby, without having to set + a high <varname>max_standby_streaming_delay</varname>. However it is + difficult to guarantee any specific execution-time window with this + approach, since <varname>vacuum_defer_cleanup_age</varname> is measured in + transactions executed on the primary server. 
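+   </para>
+
+   <para>
+    A minimal sketch of these two remedies (the
+    <varname>vacuum_defer_cleanup_age</varname> value is purely illustrative):
+<programlisting>
+# on each standby server
+hot_standby_feedback = on
+
+# or, on the primary server (measured in transactions)
+vacuum_defer_cleanup_age = 100000
+</programlisting>
+    As discussed above, each choice trades delayed cleanup and possible bloat
+    on the primary for fewer query cancellations on the standby.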
+ </para> + + <para> + The number of query cancels and the reason for them can be viewed using + the <structname>pg_stat_database_conflicts</structname> system view on the standby + server. The <structname>pg_stat_database</structname> system view also contains + summary information. + </para> + </sect2> + + <sect2 id="hot-standby-admin"> + <title>Administrator's Overview</title> + + <para> + If <varname>hot_standby</varname> is <literal>on</literal> in <filename>postgresql.conf</filename> + (the default value) and there is a + <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary><secondary>for hot standby</secondary></indexterm> + file present, the server will run in Hot Standby mode. + However, it may take some time for Hot Standby connections to be allowed, + because the server will not accept connections until it has completed + sufficient recovery to provide a consistent state against which queries + can run. During this period, + clients that attempt to connect will be refused with an error message. + To confirm the server has come up, either loop trying to connect from + the application, or look for these messages in the server logs: + +<programlisting> +LOG: entering standby mode + +... then some time later ... + +LOG: consistent recovery state reached +LOG: database system is ready to accept read only connections +</programlisting> + + Consistency information is recorded once per checkpoint on the primary. + It is not possible to enable hot standby when reading WAL + written during a period when <varname>wal_level</varname> was not set to + <literal>replica</literal> or <literal>logical</literal> on the primary. Reaching + a consistent state can also be delayed in the presence of both of these + conditions: + + <itemizedlist> + <listitem> + <para> + A write transaction has more than 64 subtransactions + </para> + </listitem> + <listitem> + <para> + Very long-lived write transactions + </para> + </listitem> + </itemizedlist> + + If you are running file-based log shipping ("warm standby"), you might need + to wait until the next WAL file arrives, which could be as long as the + <varname>archive_timeout</varname> setting on the primary. + </para> + + <para> + The setting of some parameters on the standby will need reconfiguration + if they have been changed on the primary. For these parameters, + the value on the standby must + be equal to or greater than the value on the primary. + Therefore, if you want to increase these values, you should do so on all + standby servers first, before applying the changes to the primary server. + Conversely, if you want to decrease these values, you should do so on the + primary server first, before applying the changes to all standby servers. + If these parameters + are not set high enough then the standby will refuse to start. + Higher values can then be supplied and the server + restarted to begin recovery again. 
These parameters are: + + <itemizedlist> + <listitem> + <para> + <varname>max_connections</varname> + </para> + </listitem> + <listitem> + <para> + <varname>max_prepared_transactions</varname> + </para> + </listitem> + <listitem> + <para> + <varname>max_locks_per_transaction</varname> + </para> + </listitem> + <listitem> + <para> + <varname>max_wal_senders</varname> + </para> + </listitem> + <listitem> + <para> + <varname>max_worker_processes</varname> + </para> + </listitem> + </itemizedlist> + </para> + + <para> + It is important that the administrator select appropriate settings for + <xref linkend="guc-max-standby-archive-delay"/> and <xref + linkend="guc-max-standby-streaming-delay"/>. The best choices vary + depending on business priorities. For example if the server is primarily + tasked as a High Availability server, then you will want low delay + settings, perhaps even zero, though that is a very aggressive setting. If + the standby server is tasked as an additional server for decision support + queries then it might be acceptable to set the maximum delay values to + many hours, or even -1 which means wait forever for queries to complete. + </para> + + <para> + Transaction status "hint bits" written on the primary are not WAL-logged, + so data on the standby will likely re-write the hints again on the standby. + Thus, the standby server will still perform disk writes even though + all users are read-only; no changes occur to the data values + themselves. Users will still write large sort temporary files and + re-generate relcache info files, so no part of the database + is truly read-only during hot standby mode. + Note also that writes to remote databases using + <application>dblink</application> module, and other operations outside the + database using PL functions will still be possible, even though the + transaction is read-only locally. + </para> + + <para> + The following types of administration commands are not accepted + during recovery mode: + + <itemizedlist> + <listitem> + <para> + Data Definition Language (DDL): e.g., <command>CREATE INDEX</command> + </para> + </listitem> + <listitem> + <para> + Privilege and Ownership: <command>GRANT</command>, <command>REVOKE</command>, + <command>REASSIGN</command> + </para> + </listitem> + <listitem> + <para> + Maintenance commands: <command>ANALYZE</command>, <command>VACUUM</command>, + <command>CLUSTER</command>, <command>REINDEX</command> + </para> + </listitem> + </itemizedlist> + </para> + + <para> + Again, note that some of these commands are actually allowed during + "read only" mode transactions on the primary. + </para> + + <para> + As a result, you cannot create additional indexes that exist solely + on the standby, nor statistics that exist solely on the standby. + If these administration commands are needed, they should be executed + on the primary, and eventually those changes will propagate to the + standby. + </para> + + <para> + <function>pg_cancel_backend()</function> + and <function>pg_terminate_backend()</function> will work on user backends, + but not the Startup process, which performs + recovery. <structname>pg_stat_activity</structname> does not show + recovering transactions as active. As a result, + <structname>pg_prepared_xacts</structname> is always empty during + recovery. If you wish to resolve in-doubt prepared transactions, view + <literal>pg_prepared_xacts</literal> on the primary and issue commands to + resolve transactions there or resolve them after the end of recovery. 
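+   </para>
+
+   <para>
+    For example, in-doubt prepared transactions can be inspected on the
+    primary with:
+<programlisting>
+SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
+</programlisting>
+    and then resolved there with <command>COMMIT PREPARED</command> or
+    <command>ROLLBACK PREPARED</command>, using the reported
+    <literal>gid</literal>.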
+ </para> + + <para> + <structname>pg_locks</structname> will show locks held by backends, + as normal. <structname>pg_locks</structname> also shows + a virtual transaction managed by the Startup process that owns all + <literal>AccessExclusiveLocks</literal> held by transactions being replayed by recovery. + Note that the Startup process does not acquire locks to + make database changes, and thus locks other than <literal>AccessExclusiveLocks</literal> + do not show in <structname>pg_locks</structname> for the Startup + process; they are just presumed to exist. + </para> + + <para> + The <productname>Nagios</productname> plugin <productname>check_pgsql</productname> will + work, because the simple information it checks for exists. + The <productname>check_postgres</productname> monitoring script will also work, + though some reported values could give different or confusing results. + For example, last vacuum time will not be maintained, since no + vacuum occurs on the standby. Vacuums running on the primary + do still send their changes to the standby. + </para> + + <para> + WAL file control commands will not work during recovery, + e.g., <function>pg_start_backup</function>, <function>pg_switch_wal</function> etc. + </para> + + <para> + Dynamically loadable modules work, including <structname>pg_stat_statements</structname>. + </para> + + <para> + Advisory locks work normally in recovery, including deadlock detection. + Note that advisory locks are never WAL logged, so it is impossible for + an advisory lock on either the primary or the standby to conflict with WAL + replay. Nor is it possible to acquire an advisory lock on the primary + and have it initiate a similar advisory lock on the standby. Advisory + locks relate only to the server on which they are acquired. + </para> + + <para> + Trigger-based replication systems such as <productname>Slony</productname>, + <productname>Londiste</productname> and <productname>Bucardo</productname> won't run on the + standby at all, though they will run happily on the primary server as + long as the changes are not sent to standby servers to be applied. + WAL replay is not trigger-based so you cannot relay from the + standby to any system that requires additional database writes or + relies on the use of triggers. + </para> + + <para> + New OIDs cannot be assigned, though some <acronym>UUID</acronym> generators may still + work as long as they do not rely on writing new status to the database. + </para> + + <para> + Currently, temporary table creation is not allowed during read only + transactions, so in some cases existing scripts will not run correctly. + This restriction might be relaxed in a later release. This is + both a SQL Standard compliance issue and a technical issue. + </para> + + <para> + <command>DROP TABLESPACE</command> can only succeed if the tablespace is empty. + Some standby users may be actively using the tablespace via their + <varname>temp_tablespaces</varname> parameter. If there are temporary files in the + tablespace, all active queries are canceled to ensure that temporary + files are removed, so the tablespace can be removed and WAL replay + can continue. + </para> + + <para> + Running <command>DROP DATABASE</command> or <command>ALTER DATABASE ... SET + TABLESPACE</command> on the primary + will generate a WAL entry that will cause all users connected to that + database on the standby to be forcibly disconnected. 
This action occurs + immediately, whatever the setting of + <varname>max_standby_streaming_delay</varname>. Note that + <command>ALTER DATABASE ... RENAME</command> does not disconnect users, which + in most cases will go unnoticed, though might in some cases cause a + program confusion if it depends in some way upon database name. + </para> + + <para> + In normal (non-recovery) mode, if you issue <command>DROP USER</command> or <command>DROP ROLE</command> + for a role with login capability while that user is still connected then + nothing happens to the connected user — they remain connected. The user cannot + reconnect however. This behavior applies in recovery also, so a + <command>DROP USER</command> on the primary does not disconnect that user on the standby. + </para> + + <para> + The statistics collector is active during recovery. All scans, reads, blocks, + index usage, etc., will be recorded normally on the standby. Replayed + actions will not duplicate their effects on primary, so replaying an + insert will not increment the Inserts column of pg_stat_user_tables. + The stats file is deleted at the start of recovery, so stats from primary + and standby will differ; this is considered a feature, not a bug. + </para> + + <para> + Autovacuum is not active during recovery. It will start normally at the + end of recovery. + </para> + + <para> + The checkpointer process and the background writer process are active during + recovery. The checkpointer process will perform restartpoints (similar to + checkpoints on the primary) and the background writer process will perform + normal block cleaning activities. This can include updates of the hint bit + information stored on the standby server. + The <command>CHECKPOINT</command> command is accepted during recovery, + though it performs a restartpoint rather than a new checkpoint. + </para> + </sect2> + + <sect2 id="hot-standby-parameters"> + <title>Hot Standby Parameter Reference</title> + + <para> + Various parameters have been mentioned above in + <xref linkend="hot-standby-conflict"/> and + <xref linkend="hot-standby-admin"/>. + </para> + + <para> + On the primary, parameters <xref linkend="guc-wal-level"/> and + <xref linkend="guc-vacuum-defer-cleanup-age"/> can be used. + <xref linkend="guc-max-standby-archive-delay"/> and + <xref linkend="guc-max-standby-streaming-delay"/> have no effect if set on + the primary. + </para> + + <para> + On the standby, parameters <xref linkend="guc-hot-standby"/>, + <xref linkend="guc-max-standby-archive-delay"/> and + <xref linkend="guc-max-standby-streaming-delay"/> can be used. + <xref linkend="guc-vacuum-defer-cleanup-age"/> has no effect + as long as the server remains in standby mode, though it will + become relevant if the standby becomes primary. + </para> + </sect2> + + <sect2 id="hot-standby-caveats"> + <title>Caveats</title> + + <para> + There are several limitations of Hot Standby. + These can and probably will be fixed in future releases: + + <itemizedlist> + <listitem> + <para> + Full knowledge of running transactions is required before snapshots + can be taken. Transactions that use large numbers of subtransactions + (currently greater than 64) will delay the start of read only + connections until the completion of the longest running write transaction. + If this situation occurs, explanatory messages will be sent to the server log. + </para> + </listitem> + <listitem> + <para> + Valid starting points for standby queries are generated at each + checkpoint on the master. 
If the standby is shut down while the master + is in a shutdown state, it might not be possible to re-enter Hot Standby + until the primary is started up, so that it generates further starting + points in the WAL logs. This situation isn't a problem in the most + common situations where it might happen. Generally, if the primary is + shut down and not available anymore, that's likely due to a serious + failure that requires the standby being converted to operate as + the new primary anyway. And in situations where the primary is + being intentionally taken down, coordinating to make sure the standby + becomes the new primary smoothly is also standard procedure. + </para> + </listitem> + <listitem> + <para> + At the end of recovery, <literal>AccessExclusiveLocks</literal> held by prepared transactions + will require twice the normal number of lock table entries. If you plan + on running either a large number of concurrent prepared transactions + that normally take <literal>AccessExclusiveLocks</literal>, or you plan on having one + large transaction that takes many <literal>AccessExclusiveLocks</literal>, you are + advised to select a larger value of <varname>max_locks_per_transaction</varname>, + perhaps as much as twice the value of the parameter on + the primary server. You need not consider this at all if + your setting of <varname>max_prepared_transactions</varname> is 0. + </para> + </listitem> + <listitem> + <para> + The Serializable transaction isolation level is not yet available in hot + standby. (See <xref linkend="xact-serializable"/> and + <xref linkend="serializable-consistency"/> for details.) + An attempt to set a transaction to the serializable isolation level in + hot standby mode will generate an error. + </para> + </listitem> + </itemizedlist> + + </para> + </sect2> + + </sect1> + +</chapter> |