diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
new file mode 100644
index 0000000..37c4439
--- /dev/null
+++ b/doc/src/sgml/high-availability.sgml
@@ -0,0 +1,2493 @@
+<!-- doc/src/sgml/high-availability.sgml -->
+
+<chapter id="high-availability">
+ <title>High Availability, Load Balancing, and Replication</title>
+
+ <indexterm><primary>high availability</primary></indexterm>
+ <indexterm><primary>failover</primary></indexterm>
+ <indexterm><primary>replication</primary></indexterm>
+ <indexterm><primary>load balancing</primary></indexterm>
+ <indexterm><primary>clustering</primary></indexterm>
+ <indexterm><primary>data partitioning</primary></indexterm>
+
+ <para>
+ Database servers can work together to allow a second server to
+ take over quickly if the primary server fails (high
+ availability), or to allow several computers to serve the same
+ data (load balancing). Ideally, database servers could work
+ together seamlessly. Web servers serving static web pages can
+ be combined quite easily by merely load-balancing web requests
+ to multiple machines. In fact, read-only database servers can
+ be combined relatively easily too. Unfortunately, most database
+ servers have a read/write mix of requests, and read/write servers
+ are much harder to combine. This is because, although read-only
+ data needs to be placed on each server only once, a write to any
+ server must be propagated to all servers so that future read
+ requests to those servers return consistent results.
+ </para>
+
+ <para>
+ This synchronization problem is the fundamental difficulty for
+ servers working together. Because there is no single solution
+ that eliminates the impact of the sync problem for all use cases,
+ there are multiple solutions. Each solution addresses this
+ problem in a different way, and minimizes its impact for a specific
+ workload.
+ </para>
+
+ <para>
+ Some solutions deal with synchronization by allowing only one
+ server to modify the data. Servers that can modify data are
+ called read/write, <firstterm>master</firstterm> or <firstterm>primary</firstterm> servers.
+ Servers that track changes in the master are called <firstterm>standby</firstterm>
+ or <firstterm>secondary</firstterm> servers. A standby server that cannot be connected
+ to until it is promoted to a master server is called a <firstterm>warm
+ standby</firstterm> server, and one that can accept connections and serves read-only
+ queries is called a <firstterm>hot standby</firstterm> server.
+ </para>
+
+ <para>
+ Some solutions are synchronous,
+ meaning that a data-modifying transaction is not considered
+ committed until all servers have committed the transaction. This
+ guarantees that a failover will not lose any data and that all
+ load-balanced servers will return consistent results no matter
+ which server is queried. In contrast, asynchronous solutions allow some
+ delay between the time of a commit and its propagation to the other servers,
+ opening the possibility that some transactions might be lost in
+ the switch to a backup server, and that load-balanced servers
+ might return slightly stale results. Asynchronous communication
+ is used when synchronous would be too slow.
+ </para>
+
+ <para>
+ Solutions can also be categorized by their granularity. Some solutions
+ can deal only with an entire database server, while others allow control
+ at the per-table or per-database level.
+ </para>
+
+ <para>
+ Performance must be considered in any choice. There is usually a
+ trade-off between functionality and
+ performance. For example, a fully synchronous solution over a slow
+ network might cut performance by more than half, while an asynchronous
+ one might have a minimal performance impact.
+ </para>
+
+ <para>
+ The remainder of this section outlines various failover, replication,
+ and load balancing solutions.
+ </para>
+
+ <sect1 id="different-replication-solutions">
+ <title>Comparison of Different Solutions</title>
+
+ <variablelist>
+
+ <varlistentry>
+ <term>Shared Disk Failover</term>
+ <listitem>
+
+ <para>
+ Shared disk failover avoids synchronization overhead by having only one
+ copy of the database. It uses a single disk array that is shared by
+ multiple servers. If the main database server fails, the standby server
+ is able to mount and start the database as though it were recovering from
+ a database crash. This allows rapid failover with no data loss.
+ </para>
+
+ <para>
+ Shared hardware functionality is common in network storage devices.
+ Using a network file system is also possible, though care must be
+ taken that the file system has full <acronym>POSIX</acronym> behavior (see <xref
+ linkend="creating-cluster-nfs"/>). One significant limitation of this
+ method is that if the shared disk array fails or becomes corrupt, the
+ primary and standby servers are both nonfunctional. Another issue is
+ that the standby server should never access the shared storage while
+ the primary server is running.
+ </para>
+
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>File System (Block Device) Replication</term>
+ <listitem>
+
+ <para>
+ A modified version of shared hardware functionality is file system
+ replication, where all changes to a file system are mirrored to a file
+ system residing on another computer. The only restriction is that
+ the mirroring must be done in a way that ensures the standby server
+ has a consistent copy of the file system &mdash; specifically, writes
+ to the standby must be done in the same order as those on the master.
+ <productname>DRBD</productname> is a popular file system replication solution
+ for Linux.
+ </para>
+
+<!--
+https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html
+
+Oracle RAC is a shared disk approach and just send cache invalidations
+to other nodes but not actual data. As the disk is shared, data is
+only committed once to disk and there is a distributed locking
+protocol to make nodes agree on a serializable transactional order.
+-->
+
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Write-Ahead Log Shipping</term>
+ <listitem>
+
+ <para>
+ Warm and hot standby servers can be kept current by reading a
+ stream of write-ahead log (<acronym>WAL</acronym>)
+ records. If the main server fails, the standby contains
+ almost all of the data of the main server, and can be quickly
+ made the new master database server. This can be synchronous or
+ asynchronous and can only be done for the entire database server.
+ </para>
+ <para>
+ A standby server can be implemented using file-based log shipping
+ (<xref linkend="warm-standby"/>) or streaming replication (see
+ <xref linkend="streaming-replication"/>), or a combination of both. For
+ information on hot standby, see <xref linkend="hot-standby"/>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Logical Replication</term>
+ <listitem>
+ <para>
+ Logical replication allows a database server to send a stream of data
+ modifications to another server. <productname>PostgreSQL</productname>
+ logical replication constructs a stream of logical data modifications
+ from the WAL. Logical replication allows the data changes from
+ individual tables to be replicated. Logical replication doesn't require
+ a particular server to be designated as a master or a replica but allows
+ data to flow in multiple directions. For more information on logical
+ replication, see <xref linkend="logical-replication"/>. Through the
+ logical decoding interface (<xref linkend="logicaldecoding"/>),
+ third-party extensions can also provide similar functionality.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Trigger-Based Master-Standby Replication</term>
+ <listitem>
+
+ <para>
+ A master-standby replication setup sends all data modification
+ queries to the master server. The master server asynchronously
+ sends data changes to the standby server. The standby can answer
+ read-only queries while the master server is running. The
+ standby server is ideal for data warehouse queries.
+ </para>
+
+ <para>
+ <productname>Slony-I</productname> is an example of this type of replication, with per-table
+ granularity, and support for multiple standby servers. Because it
+ updates the standby server asynchronously (in batches), data can be
+ lost during failover.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>SQL-Based Replication Middleware</term>
+ <listitem>
+
+ <para>
+ With SQL-based replication middleware, a program intercepts
+ every SQL query and sends it to one or all servers. Each server
+ operates independently. Read-write queries must be sent to all servers,
+ so that every server receives any changes. But read-only queries can be
+ sent to just one server, allowing the read workload to be distributed
+ among them.
+ </para>
+
+ <para>
+ If queries are simply broadcast unmodified, functions like
+ <function>random()</function>, <function>CURRENT_TIMESTAMP</function>, and
+ sequences can have different values on different servers.
+ This is because each server operates independently, and because
+ SQL queries are broadcast (and not actual modified rows). If
+ this is unacceptable, either the middleware or the application
+ must query such values from a single server and then use those
+ values in write queries. Another option is to use this replication
+ option with a traditional master-standby setup, i.e., data modification
+ queries are sent only to the master and are propagated to the
+ standby servers via master-standby replication, not by the replication
+ middleware. Care must also be taken that all
+ transactions either commit or abort on all servers, perhaps
+ using two-phase commit (<xref linkend="sql-prepare-transaction"/>
+ and <xref linkend="sql-commit-prepared"/>).
+ <productname>Pgpool-II</productname> and <productname>Continuent Tungsten</productname>
+ are examples of this type of replication.
+ </para>
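+
+ <para>
+ As an illustrative sketch (the table and column names here are
+ hypothetical), the middleware or application can fetch the
+ non-deterministic value once and then broadcast a statement containing
+ only literals:
+<programlisting>
+-- Broadcast unmodified, this could store different values on each server:
+INSERT INTO events (id, created_at) VALUES (1, CURRENT_TIMESTAMP);
+
+-- Instead, read the value from a single designated server first ...
+SELECT CURRENT_TIMESTAMP;   -- e.g., returns '2023-04-01 12:00:00+00'
+
+-- ... and broadcast a statement containing the literal value:
+INSERT INTO events (id, created_at) VALUES (1, '2023-04-01 12:00:00+00');
+</programlisting>
+ </para>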
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Asynchronous Multimaster Replication</term>
+ <listitem>
+
+ <para>
+ For servers that are not regularly connected or have slow
+ communication links, like laptops or
+ remote servers, keeping data consistent among servers is a
+ challenge. Using asynchronous multimaster replication, each
+ server works independently, and periodically communicates with
+ the other servers to identify conflicting transactions. The
+ conflicts can be resolved by users or conflict resolution rules.
+ Bucardo is an example of this type of replication.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Synchronous Multimaster Replication</term>
+ <listitem>
+
+ <para>
+ In synchronous multimaster replication, each server can accept
+ write requests, and modified data is transmitted from the
+ original server to every other server before each transaction
+ commits. Heavy write activity can cause excessive locking and
+ commit delays, leading to poor performance. Read requests can
+ be sent to any server. Some implementations use shared disk
+ to reduce the communication overhead. Synchronous multimaster
+ replication is best for mostly read workloads, though its big
+ advantage is that any server can accept write requests &mdash;
+ there is no need to partition workloads between master and
+ standby servers, and because the data changes are sent from one
+ server to another, there is no problem with non-deterministic
+ functions like <function>random()</function>.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</productname> does not offer this type of replication,
+ though <productname>PostgreSQL</productname> two-phase commit (<xref
+ linkend="sql-prepare-transaction"/> and <xref
+ linkend="sql-commit-prepared"/>)
+ can be used to implement this in application code or middleware.
+ </para>
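+
+ <para>
+ A minimal sketch of the two-phase commit commands involved, assuming a
+ hypothetical <literal>accounts</literal> table and an
+ application-chosen transaction identifier:
+<programlisting>
+BEGIN;
+UPDATE accounts SET balance = balance - 100.00 WHERE id = 1;
+PREPARE TRANSACTION 'txn_42';    -- run this transaction on every server
+
+-- only after every server has prepared successfully:
+COMMIT PREPARED 'txn_42';        -- on failure: ROLLBACK PREPARED 'txn_42'
+</programlisting>
+ </para>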
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <para>
+ <xref linkend="high-availability-matrix"/> summarizes
+ the capabilities of the various solutions listed above.
+ </para>
+
+ <table id="high-availability-matrix">
+ <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
+ <tgroup cols="9">
+ <colspec colname="col1" colwidth="1.1*"/>
+ <colspec colname="col2" colwidth="1*"/>
+ <colspec colname="col3" colwidth="1*"/>
+ <colspec colname="col4" colwidth="1*"/>
+ <colspec colname="col5" colwidth="1*"/>
+ <colspec colname="col6" colwidth="1*"/>
+ <colspec colname="col7" colwidth="1*"/>
+ <colspec colname="col8" colwidth="1*"/>
+ <colspec colname="col9" colwidth="1*"/>
+ <thead>
+ <row>
+ <entry>Feature</entry>
+ <entry>Shared Disk</entry>
+ <entry>File System Repl.</entry>
+ <entry>Write-Ahead Log Shipping</entry>
+ <entry>Logical Repl.</entry>
+ <entry>Trigger-Based Repl.</entry>
+ <entry>SQL Repl. Middle-ware</entry>
+ <entry>Async. MM Repl.</entry>
+ <entry>Sync. MM Repl.</entry>
+ </row>
+ </thead>
+
+ <tbody>
+
+ <row>
+ <entry>Popular examples</entry>
+ <entry align="center">NAS</entry>
+ <entry align="center">DRBD</entry>
+ <entry align="center">built-in streaming repl.</entry>
+ <entry align="center">built-in logical repl., pglogical</entry>
+ <entry align="center">Londiste, Slony</entry>
+ <entry align="center">pgpool-II</entry>
+ <entry align="center">Bucardo</entry>
+ <entry align="center"></entry>
+ </row>
+
+ <row>
+ <entry>Comm. method</entry>
+ <entry align="center">shared disk</entry>
+ <entry align="center">disk blocks</entry>
+ <entry align="center">WAL</entry>
+ <entry align="center">logical decoding</entry>
+ <entry align="center">table rows</entry>
+ <entry align="center">SQL</entry>
+ <entry align="center">table rows</entry>
+ <entry align="center">table rows and row locks</entry>
+ </row>
+
+ <row>
+ <entry>No special hardware required</entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ </row>
+
+ <row>
+ <entry>Allows multiple master servers</entry>
+ <entry align="center"></entry>
+ <entry align="center"></entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ </row>
+
+ <row>
+ <entry>No master server overhead</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ <entry align="center"></entry>
+ </row>
+
+ <row>
+ <entry>No waiting for multiple servers</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ <entry align="center">with sync off</entry>
+ <entry align="center">with sync off</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ </row>
+
+ <row>
+ <entry>Master failure will never lose data</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">with sync on</entry>
+ <entry align="center">with sync on</entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ </row>
+
+ <row>
+ <entry>Replicas accept read-only queries</entry>
+ <entry align="center"></entry>
+ <entry align="center"></entry>
+ <entry align="center">with hot standby</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ </row>
+
+ <row>
+ <entry>Per-table granularity</entry>
+ <entry align="center"></entry>
+ <entry align="center"></entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ </row>
+
+ <row>
+ <entry>No conflict resolution necessary</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center">&bull;</entry>
+ <entry align="center"></entry>
+ <entry align="center">&bull;</entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ There are a few solutions that do not fit into the above categories:
+ </para>
+
+ <variablelist>
+
+ <varlistentry>
+ <term>Data Partitioning</term>
+ <listitem>
+
+ <para>
+ Data partitioning splits tables into data sets. Each set can
+ be modified by only one server. For example, data can be
+ partitioned by offices, e.g., London and Paris, with a server
+ in each office. If queries combining London and Paris data
+ are necessary, an application can query both servers, or
+ master/standby replication can be used to keep a read-only copy
+ of the other office's data on each server.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Multiple-Server Parallel Query Execution</term>
+ <listitem>
+
+ <para>
+ Many of the above solutions allow multiple servers to handle multiple
+ queries, but none allow a single query to use multiple servers to
+ complete faster. This solution allows multiple servers to work
+ concurrently on a single query. It is usually accomplished by
+ splitting the data among servers and having each server execute its
+ part of the query and return results to a central server where they
+ are combined and returned to the user. This can be implemented using the
+ <productname>PL/Proxy</productname> tool set.
+ </para>
+
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <para>
+ It should also be noted that because <productname>PostgreSQL</productname>
+ is open source and easily extended, a number of companies have
+ taken <productname>PostgreSQL</productname> and created commercial
+ closed-source solutions with unique failover, replication, and load
+ balancing capabilities. These are not discussed here.
+ </para>
+
+ </sect1>
+
+
+ <sect1 id="warm-standby">
+ <title>Log-Shipping Standby Servers</title>
+
+
+ <para>
+ Continuous archiving can be used to create a <firstterm>high
+ availability</firstterm> (HA) cluster configuration with one or more
+ <firstterm>standby servers</firstterm> ready to take over operations if the
+ primary server fails. This capability is widely referred to as
+ <firstterm>warm standby</firstterm> or <firstterm>log shipping</firstterm>.
+ </para>
+
+ <para>
+ The primary and standby server work together to provide this capability,
+ though the servers are only loosely coupled. The primary server operates
+ in continuous archiving mode, while each standby server operates in
+ continuous recovery mode, reading the WAL files from the primary. No
+ changes to the database tables are required to enable this capability,
+ so it offers low administration overhead compared to some other
+ replication solutions. This configuration also has relatively low
+ performance impact on the primary server.
+ </para>
+
+ <para>
+ Directly moving WAL records from one database server to another
+ is typically described as log shipping. <productname>PostgreSQL</productname>
+ implements file-based log shipping by transferring WAL records
+ one file (WAL segment) at a time. WAL files (16MB) can be
+ shipped easily and cheaply over any distance, whether it be to an
+ adjacent system, another system at the same site, or another system on
+ the far side of the globe. The bandwidth required for this technique
+ varies according to the transaction rate of the primary server.
+ Record-based log shipping is more granular and streams WAL changes
+ incrementally over a network connection (see <xref
+ linkend="streaming-replication"/>).
+ </para>
+
+ <para>
+ It should be noted that log shipping is asynchronous, i.e., the WAL
+ records are shipped after transaction commit. As a result, there is a
+ window for data loss should the primary server suffer a catastrophic
+ failure; transactions not yet shipped will be lost. The size of the
+ data loss window in file-based log shipping can be limited by use of the
+ <varname>archive_timeout</varname> parameter, which can be set as low
+ as a few seconds. However, such a low setting will
+ substantially increase the bandwidth required for file shipping.
+ Streaming replication (see <xref linkend="streaming-replication"/>)
+ allows a much smaller window of data loss.
+ </para>
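+
+ <para>
+ For example, to bound the loss window to roughly a minute, at the cost
+ of shipping a (possibly mostly empty) 16MB segment file that often,
+ one might set on the primary:
+<programlisting>
+archive_timeout = 60    # force a WAL segment switch every 60 seconds
+</programlisting>
+ </para>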
+
+ <para>
+ Recovery performance is sufficiently good that the standby will
+ typically be only moments away from full
+ availability once it has been activated. As a result, this is called
+ a warm standby configuration which offers high
+ availability. Restoring a server from an archived base backup and
+ rollforward will take considerably longer, so that technique only
+ offers a solution for disaster recovery, not high availability.
+ A standby server can also be used for read-only queries, in which case
+ it is called a Hot Standby server. See <xref linkend="hot-standby"/> for
+ more information.
+ </para>
+
+ <indexterm zone="high-availability">
+ <primary>warm standby</primary>
+ </indexterm>
+
+ <indexterm zone="high-availability">
+ <primary>PITR standby</primary>
+ </indexterm>
+
+ <indexterm zone="high-availability">
+ <primary>standby server</primary>
+ </indexterm>
+
+ <indexterm zone="high-availability">
+ <primary>log shipping</primary>
+ </indexterm>
+
+ <indexterm zone="high-availability">
+ <primary>witness server</primary>
+ </indexterm>
+
+ <indexterm zone="high-availability">
+ <primary>STONITH</primary>
+ </indexterm>
+
+ <sect2 id="standby-planning">
+ <title>Planning</title>
+
+ <para>
+ It is usually wise to create the primary and standby servers
+ so that they are as similar as possible, at least from the
+ perspective of the database server. In particular, the path names
+ associated with tablespaces will be passed across unmodified, so both
+ primary and standby servers must have the same mount paths for
+ tablespaces if that feature is used. Keep in mind that if
+ <xref linkend="sql-createtablespace"/>
+ is executed on the primary, any new mount point needed for it must
+ be created on the primary and all standby servers before the command
+ is executed. Hardware need not be exactly the same, but experience shows
+ that maintaining two identical systems is easier than maintaining two
+ dissimilar ones over the lifetime of the application and system.
+ In any case the hardware architecture must be the same &mdash; shipping
+ from, say, a 32-bit to a 64-bit system will not work.
+ </para>
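+
+ <para>
+ For example, with a hypothetical mount point, the directory must exist
+ on the primary and on every standby before the command is run on the
+ primary:
+<programlisting>
+-- /mnt/ssd/pgdata must already exist on all servers:
+CREATE TABLESPACE fastspace LOCATION '/mnt/ssd/pgdata';
+</programlisting>
+ </para>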
+
+ <para>
+ In general, log shipping between servers running different major
+ <productname>PostgreSQL</productname> release
+ levels is not possible. It is the policy of the PostgreSQL Global
+ Development Group not to make changes to disk formats during minor release
+ upgrades, so it is likely that running different minor release levels
+ on primary and standby servers will work successfully. However, no
+ formal support for that is offered and you are advised to keep primary
+ and standby servers at the same release level as much as possible.
+ When updating to a new minor release, the safest policy is to update
+ the standby servers first &mdash; a new minor release is more likely
+ to be able to read WAL files from a previous minor release than vice
+ versa.
+ </para>
+
+ </sect2>
+
+ <sect2 id="standby-server-operation" xreflabel="Standby Server Operation">
+ <title>Standby Server Operation</title>
+
+ <para>
+ A server enters standby mode if a
+ <anchor id="file-standby-signal" xreflabel="standby.signal"/>
+ <filename>standby.signal</filename>
+ <indexterm><primary><filename>standby.signal</filename></primary></indexterm>
+ file exists in the data directory when the server is started.
+ </para>
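+
+ <para>
+ For example, assuming a hypothetical data directory, standby mode can
+ be requested like this before starting the server:
+<programlisting>
+touch /usr/local/pgsql/data/standby.signal
+pg_ctl -D /usr/local/pgsql/data start
+</programlisting>
+ </para>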
+
+ <para>
+ In standby mode, the server continuously applies WAL received from the
+ master server. The standby server can read WAL from a WAL archive
+ (see <xref linkend="guc-restore-command"/>) or directly from the master
+ over a TCP connection (streaming replication). The standby server will
+ also attempt to restore any WAL found in the standby cluster's
+ <filename>pg_wal</filename> directory. That typically happens after a server
+ restart, when the standby again replays WAL that was streamed from the
+ master before the restart, but you can also manually copy files to
+ <filename>pg_wal</filename> at any time to have them replayed.
+ </para>
+
+ <para>
+ At startup, the standby begins by restoring all WAL available in the
+ archive location, calling <varname>restore_command</varname>. Once it
+ reaches the end of WAL available there and <varname>restore_command</varname>
+ fails, it tries to restore any WAL available in the <filename>pg_wal</filename> directory.
+ If that fails, and streaming replication has been configured, the
+ standby tries to connect to the primary server and start streaming WAL
+ from the last valid record found in archive or <filename>pg_wal</filename>. If that fails
+ or streaming replication is not configured, or if the connection is
+ later disconnected, the standby returns to the beginning of this sequence
+ and tries to restore the file from the archive again. This loop of retries from the
+ archive, <filename>pg_wal</filename>, and via streaming replication goes on until the server
+ is stopped or failover is triggered by a trigger file.
+ </para>
+
+ <para>
+ Standby mode is exited and the server switches to normal operation
+ when <command>pg_ctl promote</command> is run,
+ <function>pg_promote()</function> is called, or a trigger file is found
+ (<varname>promote_trigger_file</varname>). Before failover,
+ any WAL immediately available in the archive or in <filename>pg_wal</filename> will be
+ restored, but no attempt is made to connect to the master.
+ </para>
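+
+ <para>
+ For example, promotion can be triggered on the standby in either of
+ these ways (the data directory path is hypothetical):
+<programlisting>
+pg_ctl promote -D /usr/local/pgsql/data
+</programlisting>
+ or from a SQL session connected to the standby:
+<programlisting>
+SELECT pg_promote();
+</programlisting>
+ </para>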
+ </sect2>
+
+ <sect2 id="preparing-master-for-standby">
+ <title>Preparing the Master for Standby Servers</title>
+
+ <para>
+ Set up continuous archiving on the primary to an archive directory
+ accessible from the standby, as described
+ in <xref linkend="continuous-archiving"/>. The archive location should be
+ accessible from the standby even when the master is down, i.e., it should
+ reside on the standby server itself or another trusted server, not on
+ the master server.
+ </para>
+
+ <para>
+ If you want to use streaming replication, set up authentication on the
+ primary server to allow replication connections from the standby
+ server(s); that is, create a role and provide a suitable entry or
+ entries in <filename>pg_hba.conf</filename> with the database field set to
+ <literal>replication</literal>. Also ensure <varname>max_wal_senders</varname> is set
+ to a sufficiently large value in the configuration file of the primary
+ server. If replication slots will be used,
+ ensure that <varname>max_replication_slots</varname> is set sufficiently
+ high as well.
+ </para>
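+
+ <para>
+ A sketch of the relevant primary-side settings; the values and the
+ archive path are examples only:
+<programlisting>
+# postgresql.conf on the primary
+wal_level = replica
+max_wal_senders = 10
+max_replication_slots = 10
+archive_mode = on
+archive_command = 'cp %p /mnt/server/archivedir/%f'
+</programlisting>
+ </para>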
+
+ <para>
+ Take a base backup as described in <xref linkend="backup-base-backup"/>
+ to bootstrap the standby server.
+ </para>
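+
+ <para>
+ For example, a base backup can be taken with
+ <application>pg_basebackup</application>; the host, user, and target
+ directory here are hypothetical:
+<programlisting>
+pg_basebackup -h 192.168.1.50 -U foo -D /usr/local/pgsql/data -P
+</programlisting>
+ </para>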
+ </sect2>
+
+ <sect2 id="standby-server-setup">
+ <title>Setting Up a Standby Server</title>
+
+ <para>
+ To set up the standby server, restore the base backup taken from primary
+ server (see <xref linkend="backup-pitr-recovery"/>). Create a file
+ <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary></indexterm>
+ in the standby's cluster data
+ directory. Set <xref linkend="guc-restore-command"/> to a simple command to copy files from
+ the WAL archive. If you plan to have multiple standby servers for high
+ availability purposes, make sure that <varname>recovery_target_timeline</varname> is set to
+ <literal>latest</literal> (the default), to make the standby server follow the timeline change
+ that occurs at failover to another standby.
+ </para>
+
+ <note>
+ <para>
+ Do not use pg_standby or similar tools with the built-in standby mode
+ described here. <xref linkend="guc-restore-command"/> should return immediately
+ if the file does not exist; the server will retry the command if
+ necessary. See <xref linkend="log-shipping-alternative"/>
+ for using tools like pg_standby.
+ </para>
+ </note>
+
+ <para>
+ If you want to use streaming replication, fill in
+ <xref linkend="guc-primary-conninfo"/> with a libpq connection string, including
+ the host name (or IP address) and any additional details needed to
+ connect to the primary server. If the primary needs a password for
+ authentication, the password needs to be specified in
+ <xref linkend="guc-primary-conninfo"/> as well.
+ </para>
+
+ <para>
+ If you're setting up the standby server for high availability purposes,
+ set up WAL archiving, connections and authentication like the primary
+ server, because the standby server will work as a primary server after
+ failover.
+ </para>
+
+ <para>
+ If you're using a WAL archive, its size can be minimized using the <xref
+ linkend="guc-archive-cleanup-command"/> parameter to remove files that are no
+ longer required by the standby server.
+ The <application>pg_archivecleanup</application> utility is designed specifically to
+ be used with <varname>archive_cleanup_command</varname> in typical single-standby
+ configurations, see <xref linkend="pgarchivecleanup"/>.
+ Note however, that if you're using the archive for backup purposes, you
+ need to retain files needed to recover from at least the latest base
+ backup, even if they're no longer needed by the standby.
+ </para>
+
+ <para>
+ A simple example of configuration is:
+<programlisting>
+primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass options=''-c wal_sender_timeout=5000'''
+restore_command = 'cp /path/to/archive/%f %p'
+archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
+</programlisting>
+ </para>
+
+ <para>
+ You can have any number of standby servers, but if you use streaming
+ replication, make sure you set <varname>max_wal_senders</varname> high enough in
+ the primary to allow them to be connected simultaneously.
+ </para>
+
+ </sect2>
+
+ <sect2 id="streaming-replication">
+ <title>Streaming Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Streaming Replication</primary>
+ </indexterm>
+
+ <para>
+ Streaming replication allows a standby server to stay more up-to-date
+ than is possible with file-based log shipping. The standby connects
+ to the primary, which streams WAL records to the standby as they're
+ generated, without waiting for the WAL file to be filled.
+ </para>
+
+ <para>
+ Streaming replication is asynchronous by default
+ (see <xref linkend="synchronous-replication"/>), in which case there is
+ a small delay between committing a transaction in the primary and the
+ changes becoming visible in the standby. This delay is however much
+ smaller than with file-based log shipping, typically under one second
+ assuming the standby is powerful enough to keep up with the load. With
+ streaming replication, <varname>archive_timeout</varname> is not required to
+ reduce the data loss window.
+ </para>
+
+ <para>
+ If you use streaming replication without file-based continuous
+ archiving, the server might recycle old WAL segments before the standby
+ has received them. If this occurs, the standby will need to be
+ reinitialized from a new base backup. You can avoid this by setting
+ <varname>wal_keep_size</varname> to a value large enough to ensure that
+ WAL segments are not recycled too early, or by configuring a replication
+ slot for the standby. If you set up a WAL archive that's accessible from
+ the standby, these solutions are not required, since the standby can
+ always use the archive to catch up provided it retains enough segments.
+ </para>
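+
+ <para>
+ For example, either of the following guards against premature removal
+ of WAL segments; the size and slot name are illustrative:
+<programlisting>
+# postgresql.conf on the primary: keep a fixed amount of WAL around
+wal_keep_size = 1GB
+</programlisting>
+ or, using a replication slot (see <xref linkend="streaming-replication-slots"/>):
+<programlisting>
+SELECT pg_create_physical_replication_slot('node_a_slot');
+</programlisting>
+ </para>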
+
+ <para>
+ To use streaming replication, set up a file-based log-shipping standby
+ server as described in <xref linkend="warm-standby"/>. The step that
+ turns a file-based log-shipping standby into a streaming replication
+ standby is setting the <varname>primary_conninfo</varname> parameter
+ to point to the primary server. Set
+ <xref linkend="guc-listen-addresses"/> and authentication options
+ (see <filename>pg_hba.conf</filename>) on the primary so that the standby server
+ can connect to the <literal>replication</literal> pseudo-database on the primary
+ server (see <xref linkend="streaming-replication-authentication"/>).
+ </para>
+
+ <para>
+ On systems that support the keepalive socket option, setting
+ <xref linkend="guc-tcp-keepalives-idle"/>,
+ <xref linkend="guc-tcp-keepalives-interval"/> and
+ <xref linkend="guc-tcp-keepalives-count"/> helps the primary promptly
+ notice a broken connection.
+ </para>
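+
+ <para>
+ For example (the values are illustrative only):
+<programlisting>
+# postgresql.conf on the primary
+tcp_keepalives_idle = 60        # seconds of idle time before keepalives
+tcp_keepalives_interval = 10    # seconds between keepalive probes
+tcp_keepalives_count = 5        # lost probes before dropping the link
+</programlisting>
+ </para>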
+
+ <para>
+ Set the maximum number of concurrent connections from the standby servers
+ (see <xref linkend="guc-max-wal-senders"/> for details).
+ </para>
+
+ <para>
+ When the standby is started and <varname>primary_conninfo</varname> is set
+ correctly, the standby will connect to the primary after replaying all
+ WAL files available in the archive. If the connection is established
+ successfully, you will see a <literal>walreceiver</literal> in the standby, and
+ a corresponding <literal>walsender</literal> process in the primary.
+ </para>
+
+ <sect3 id="streaming-replication-authentication">
+ <title>Authentication</title>
+ <para>
+ It is very important that the access privileges for replication be set up
+ so that only trusted users can read the WAL stream, because it is
+ easy to extract privileged information from it. Standby servers must
+ authenticate to the primary as an account that has the
+ <literal>REPLICATION</literal> privilege or a superuser. It is
+ recommended to create a dedicated user account with
+ <literal>REPLICATION</literal> and <literal>LOGIN</literal>
+ privileges for replication. While <literal>REPLICATION</literal>
+ privilege gives very high permissions, it does not allow the user to
+ modify any data on the primary system, which the
+ <literal>SUPERUSER</literal> privilege does.
+ </para>
+
+ <para>
+ Client authentication for replication is controlled by a
+ <filename>pg_hba.conf</filename> record specifying <literal>replication</literal> in the
+ <replaceable>database</replaceable> field. For example, if the standby is running on
+ host IP <literal>192.168.1.100</literal> and the account name for replication
+ is <literal>foo</literal>, the administrator can add the following line to the
+ <filename>pg_hba.conf</filename> file on the primary:
+
+<programlisting>
+# Allow the user "foo" from host 192.168.1.100 to connect to the primary
+# as a replication standby if the user's password is correctly supplied.
+#
+# TYPE DATABASE USER ADDRESS METHOD
+host replication foo 192.168.1.100/32 md5
+</programlisting>
+ </para>
+ <para>
+ The host name and port number of the primary, connection user name,
+ and password are specified in the <xref linkend="guc-primary-conninfo"/>.
+ The password can also be set in the <filename>~/.pgpass</filename> file on the
+ standby (specify <literal>replication</literal> in the <replaceable>database</replaceable>
+ field).
+ For example, if the primary is running on host IP <literal>192.168.1.50</literal>,
+ port <literal>5432</literal>, the account name for replication is
+ <literal>foo</literal>, and the password is <literal>foopass</literal>, the administrator
+ can add the following line to the <filename>postgresql.conf</filename> file on the
+ standby:
+
+<programlisting>
+# The standby connects to the primary that is running on host 192.168.1.50
+# and port 5432 as the user "foo" whose password is "foopass".
+primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="streaming-replication-monitoring">
+ <title>Monitoring</title>
+ <para>
+ An important health indicator of streaming replication is the amount
+ of WAL records generated in the primary, but not yet applied in the
+ standby. You can calculate this lag by comparing the current WAL write
+ location on the primary with the last WAL location received by the
+ standby. These locations can be retrieved using
+ <function>pg_current_wal_lsn</function> on the primary and
+ <function>pg_last_wal_receive_lsn</function> on the standby,
+ respectively (see <xref linkend="functions-admin-backup-table"/> and
+ <xref linkend="functions-recovery-info-table"/> for details).
+ The last WAL receive location in the standby is also displayed in the
+ process status of the WAL receiver process, displayed using the
+ <command>ps</command> command (see <xref linkend="monitoring-ps"/> for details).
+ </para>
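+
+ <para>
+ A sketch of the lag calculation; the standby's receive location must
+ be obtained separately and substituted into the last query:
+<programlisting>
+-- on the primary:
+SELECT pg_current_wal_lsn();
+
+-- on the standby:
+SELECT pg_last_wal_receive_lsn();
+
+-- back on the primary, lag in bytes (substitute the standby's value):
+SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000148');
+</programlisting>
+ </para>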
+ <para>
+ You can retrieve a list of WAL sender processes via the
+ <link linkend="monitoring-pg-stat-replication-view"><structname>
+ pg_stat_replication</structname></link> view. Large differences between
+ <function>pg_current_wal_lsn</function> and the view's <literal>sent_lsn</literal> field
+ might indicate that the master server is under heavy load, while
+ differences between <literal>sent_lsn</literal> and
+ <function>pg_last_wal_receive_lsn</function> on the standby might indicate
+ network delay, or that the standby is under heavy load.
+ </para>
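+
+ <para>
+ For example, a quick overview of all WAL senders and their progress:
+<programlisting>
+SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
+ FROM pg_stat_replication;
+</programlisting>
+ </para>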
+ <para>
+ On a hot standby, the status of the WAL receiver process can be retrieved
+ via the <link linkend="monitoring-pg-stat-wal-receiver-view">
+ <structname>pg_stat_wal_receiver</structname></link> view. A large
+ difference between <function>pg_last_wal_replay_lsn</function> and the
+ view's <literal>flushed_lsn</literal> indicates that WAL is being
+ received faster than it can be replayed.
+ </para>
+ </sect3>
+ </sect2>
+
+ <sect2 id="streaming-replication-slots">
+ <title>Replication Slots</title>
+ <indexterm>
+ <primary>replication slot</primary>
+ <secondary>streaming replication</secondary>
+ </indexterm>
+ <para>
+ Replication slots provide an automated way to ensure that the master does
+ not remove WAL segments until they have been received by all standbys,
+ and that the master does not remove rows which could cause a
+ <link linkend="hot-standby-conflict">recovery conflict</link> even when the
+ standby is disconnected.
+ </para>
+ <para>
+ In lieu of using replication slots, it is possible to prevent the removal
+ of old WAL segments using <xref linkend="guc-wal-keep-size"/>, or by
+ storing the segments in an archive using
+ <xref linkend="guc-archive-command"/>.
+ However, these methods often result in retaining more WAL segments than
+ required, whereas replication slots retain only the number of segments
+ known to be needed. On the other hand, replication slots can retain so
+ many WAL segments that they fill up the space allocated
+ for <literal>pg_wal</literal>;
+ <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files
+ retained by replication slots.
+ </para>
+ <para>
+ Similarly, <xref linkend="guc-hot-standby-feedback"/>
+ and <xref linkend="guc-vacuum-defer-cleanup-age"/> provide protection against
+ relevant rows being removed by vacuum, but the former provides no
+ protection during any time period when the standby is not connected,
+ and the latter often needs to be set to a high value to provide adequate
+ protection. Replication slots overcome these disadvantages.
+ </para>
+ <sect3 id="streaming-replication-slots-manipulation">
+ <title>Querying and Manipulating Replication Slots</title>
+ <para>
+ Each replication slot has a name, which can contain lower-case letters,
+ numbers, and the underscore character.
+ </para>
+ <para>
+ Existing replication slots and their state can be seen in the
+ <link linkend="view-pg-replication-slots"><structname>pg_replication_slots</structname></link>
+ view.
+ </para>
+ <para>
+ Slots can be created and dropped either via the streaming replication
+ protocol (see <xref linkend="protocol-replication"/>) or via SQL
+ functions (see <xref linkend="functions-replication"/>).
+ </para>
+ </sect3>
+ <sect3 id="streaming-replication-slots-config">
+ <title>Configuration Example</title>
+ <para>
+ You can create a replication slot like this:
+<programlisting>
+postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot');
+ slot_name | lsn
+-------------+-----
+ node_a_slot |
+(1 row)
+
+postgres=# SELECT slot_name, slot_type, active FROM pg_replication_slots;
+ slot_name | slot_type | active
+-------------+-----------+--------
+ node_a_slot | physical | f
+(1 row)
+</programlisting>
+ To configure the standby to use this slot, <varname>primary_slot_name</varname>
+ should be configured on the standby. Here is a simple example:
+<programlisting>
+primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
+primary_slot_name = 'node_a_slot'
+</programlisting>
+ </para>
+ </sect3>
+ </sect2>
+
+ <sect2 id="cascading-replication">
+ <title>Cascading Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Cascading Replication</primary>
+ </indexterm>
+
+ <para>
+ The cascading replication feature allows a standby server to accept replication
+ connections and stream WAL records to other standbys, acting as a relay.
+ This can be used to reduce the number of direct connections to the master
+ and also to minimize inter-site bandwidth overheads.
+ </para>
+
+ <para>
+ A standby acting as both a receiver and a sender is known as a cascading
+ standby. Standbys that are more directly connected to the master are known
+ as upstream servers, while those standby servers further away are downstream
+ servers. Cascading replication does not place limits on the number or
+ arrangement of downstream servers, though each standby connects to only
+ one upstream server which eventually links to a single master/primary
+ server.
+ </para>
+
+ <para>
+ A cascading standby sends not only WAL records received from the
+ master but also those restored from the archive. So even if the replication
+ connection to some upstream server is terminated, streaming replication
+ continues downstream for as long as new WAL records are available.
+ </para>
+
+ <para>
+ Cascading replication is currently asynchronous. Synchronous replication
+ (see <xref linkend="synchronous-replication"/>) settings have no effect on
+ cascading replication at present.
+ </para>
+
+ <para>
+ Hot Standby feedback propagates upstream, whatever the cascaded arrangement.
+ </para>
+
+ <para>
+ If an upstream standby server is promoted to become new master, downstream
+ servers will continue to stream from the new master if
+ <varname>recovery_target_timeline</varname> is set to <literal>latest</literal> (the default).
+ </para>
+
+ <para>
+ To use cascading replication, set up the cascading standby so that it can
+ accept replication connections (that is, set
+ <xref linkend="guc-max-wal-senders"/> and <xref linkend="guc-hot-standby"/>,
+ and configure
+ <link linkend="auth-pg-hba-conf">host-based authentication</link>).
+ You will also need to set <varname>primary_conninfo</varname> in the downstream
+ standby to point to the cascading standby.
+ </para>
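+
+ <para>
+ A sketch of the two sides of a cascade; the host name is hypothetical:
+<programlisting>
+# on the cascading standby (receives from the master, sends downstream)
+max_wal_senders = 10
+hot_standby = on
+
+# on the downstream standby, point at the cascading standby:
+primary_conninfo = 'host=standby1.example.com port=5432 user=foo'
+</programlisting>
+ </para>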
+ </sect2>
+
+ <sect2 id="synchronous-replication">
+ <title>Synchronous Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Synchronous Replication</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</productname> streaming replication is asynchronous by
+ default. If the primary server
+ crashes then some transactions that were committed may not have been
+ replicated to the standby server, causing data loss. The amount
+ of data loss is proportional to the replication delay at the time of
+ failover.
+ </para>
+
+ <para>
+ Synchronous replication offers the ability to confirm that all changes
+ made by a transaction have been transferred to one or more synchronous
+ standby servers. This extends the standard level of durability
+ offered by a transaction commit. This level of protection is referred
+ to as 2-safe replication in computer science theory, and group-1-safe
+ (group-safe and 1-safe) when <varname>synchronous_commit</varname> is set to
+ <literal>remote_write</literal>.
+ </para>
+
+ <para>
+ When requesting synchronous replication, each commit of a
+ write transaction will wait until confirmation is
+ received that the commit has been written to the write-ahead log on disk
+ of both the primary and standby server. The only possibility that data
+ can be lost is if both the primary and the standby suffer crashes at the
+ same time. This can provide a much higher level of durability, though only
+ if the sysadmin is cautious about the placement and management of the two
+ servers. Waiting for confirmation increases the user's confidence that the
+ changes will not be lost in the event of server crashes but it also
+ necessarily increases the response time for the requesting transaction.
+ The minimum wait time is the round-trip time between the primary and standby.
+ </para>
+
+ <para>
+ Read-only transactions and transaction rollbacks need not wait for
+ replies from standby servers. Subtransaction commits do not wait for
+ responses from standby servers, only top-level commits. Long-running
+ actions such as data loading or index building do not wait
+ until the very final commit message. All two-phase commit actions
+ require commit waits, including both prepare and commit.
+ </para>
+
+ <para>
+ A synchronous standby can be a physical replication standby or a logical
+ replication subscriber. It can also be any other physical or logical WAL
+ replication stream consumer that knows how to send the appropriate
+ feedback messages. Besides the built-in physical and logical replication
+ systems, this includes special programs such
+ as <command>pg_receivewal</command> and <command>pg_recvlogical</command>
+ as well as some third-party replication systems and custom programs.
+ Check the respective documentation for details on synchronous replication
+ support.
+ </para>
+
+ <sect3 id="synchronous-replication-config">
+ <title>Basic Configuration</title>
+
+ <para>
+ Once streaming replication has been configured, configuring synchronous
+ replication requires only one additional configuration step:
+ <xref linkend="guc-synchronous-standby-names"/> must be set to
+ a non-empty value. <varname>synchronous_commit</varname> must also be set to
+ <literal>on</literal>, but since this is the default value, typically no change is
+ required. (See <xref linkend="runtime-config-wal-settings"/> and
+ <xref linkend="runtime-config-replication-master"/>.)
+ This configuration will cause each commit to wait for
+ confirmation that the standby has written the commit record to durable
+ storage.
+ <varname>synchronous_commit</varname> can be set by individual
+ users, so it can be configured in the configuration file, for particular
+ users or databases, or dynamically by applications, in order to control
+ the durability guarantee on a per-transaction basis.
+ </para>
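+
+ <para>
+ For example, assuming hypothetical role and database names:
+<programlisting>
+-- all transactions of this role wait for synchronous standbys:
+ALTER ROLE billing SET synchronous_commit = on;
+
+-- transactions in this database only wait for a local flush:
+ALTER DATABASE chat SET synchronous_commit = local;
+</programlisting>
+ </para>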
+
+ <para>
+ After a commit record has been written to disk on the primary, the
+ WAL record is then sent to the standby. The standby sends reply
+ messages each time a new batch of WAL data is written to disk, unless
+ <varname>wal_receiver_status_interval</varname> is set to zero on the standby.
+ In the case that <varname>synchronous_commit</varname> is set to
+ <literal>remote_apply</literal>, the standby sends reply messages when the commit
+ record is replayed, making the transaction visible.
+ If the standby is chosen as a synchronous standby, according to the setting
+ of <varname>synchronous_standby_names</varname> on the primary, the reply
+ messages from that standby will be considered along with those from other
+ synchronous standbys to decide when to release transactions waiting for
+ confirmation that the commit record has been received. These parameters
+ allow the administrator to specify which standby servers should be
+ synchronous standbys. Note that the configuration of synchronous
+ replication is mainly on the master. Named standbys must be directly
+ connected to the master; the master knows nothing about downstream
+ standby servers using cascaded replication.
+ </para>
+
+ <para>
+ Setting <varname>synchronous_commit</varname> to <literal>remote_write</literal> will
+ cause each commit to wait for confirmation that the standby has received
+ the commit record and written it out to its own operating system, but not
+ for the data to be flushed to disk on the standby. This
+ setting provides a weaker guarantee of durability than <literal>on</literal>
+ does: the standby could lose the data in the event of an operating system
+ crash, though not a <productname>PostgreSQL</productname> crash.
+ However, it's a useful setting in practice
+ because it can decrease the response time for the transaction.
+ Data loss could only occur if both the primary and the standby crash and
+ the database of the primary gets corrupted at the same time.
+ </para>
+
+ <para>
+ Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will
+ cause each commit to wait until the current synchronous standbys report
+ that they have replayed the transaction, making it visible to user
+ queries. In simple cases, this allows for load balancing with causal
+ consistency.
+ </para>
+
+ <para>
+ Users will stop waiting if a fast shutdown is requested. However, as
+ when using asynchronous replication, the server will not fully
+ shut down until all outstanding WAL records are transferred to the currently
+ connected standby servers.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-multiple-standbys">
+ <title>Multiple Synchronous Standbys</title>
+
+ <para>
+ Synchronous replication supports one or more synchronous standby servers;
+ transactions will wait until all the standby servers which are considered
+ as synchronous confirm receipt of their data. The number of synchronous
+ standbys that transactions must wait for replies from is specified in
+ <varname>synchronous_standby_names</varname>. This parameter also specifies
+ a list of standby names and the method (<literal>FIRST</literal> and
+ <literal>ANY</literal>) to choose synchronous standbys from the listed ones.
+ </para>
+ <para>
+ The method <literal>FIRST</literal> specifies a priority-based synchronous
+ replication and makes transaction commits wait until their WAL records are
+ replicated to the requested number of synchronous standbys chosen based on
+ their priorities. The standbys whose names appear earlier in the list are
+ given higher priority and will be considered as synchronous. Other standby
+ servers appearing later in this list represent potential synchronous
+ standbys. If any of the current synchronous standbys disconnects for
+ whatever reason, it will be replaced immediately with the
+ next-highest-priority standby.
+ </para>
+ <para>
+ An example of <varname>synchronous_standby_names</varname> for
+ a priority-based multiple synchronous standbys is:
+<programlisting>
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
+</programlisting>
+ In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>,
+ <literal>s3</literal> and <literal>s4</literal> are running, the two standbys
+ <literal>s1</literal> and <literal>s2</literal> will be chosen as synchronous standbys
+ because their names appear early in the list of standby names.
+ <literal>s3</literal> is a potential synchronous standby and will take over
+ the role of synchronous standby when either of <literal>s1</literal> or
+ <literal>s2</literal> fails. <literal>s4</literal> is an asynchronous standby since
+ its name is not in the list.
+ </para>
+ <para>
+ The method <literal>ANY</literal> specifies a quorum-based synchronous
+ replication and makes transaction commits wait until their WAL records
+ are replicated to <emphasis>at least</emphasis> the requested number of
+ synchronous standbys in the list.
+ </para>
+ <para>
+ An example of <varname>synchronous_standby_names</varname> for
+ a quorum-based multiple synchronous standbys is:
+<programlisting>
+synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+ In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>,
+ <literal>s3</literal> and <literal>s4</literal> are running, transaction commits will
+ wait for replies from at least any two standbys of <literal>s1</literal>,
+ <literal>s2</literal> and <literal>s3</literal>. <literal>s4</literal> is an asynchronous
+ standby since its name is not in the list.
+ </para>
+ <para>
+ The synchronous states of standby servers can be viewed using
+ the <structname>pg_stat_replication</structname> view.
+ </para>
+ </sect3>
+
+ <sect3 id="synchronous-replication-performance">
+ <title>Planning for Performance</title>
+
+ <para>
+ Synchronous replication usually requires carefully planned and placed
+ standby servers to ensure applications perform acceptably. Waiting
+ doesn't utilize system resources, but transaction locks continue to be
+ held until the transfer is confirmed. As a result, incautious use of
+ synchronous replication will reduce performance for database
+ applications because of increased response times and higher contention.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</productname> allows the application developer
+ to specify the durability level required via replication. This can be
+ specified for the system overall, though it can also be specified for
+ specific users or connections, or even individual transactions.
+ </para>
+
+ <para>
+ For example, an application workload might consist of:
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ <para>
+ With synchronous replication options specified at the application level
+ (on the primary) we can offer synchronous replication for the most
+ important changes, without slowing down the bulk of the total workload.
+ Application level options are an important and practical tool for allowing
+ the benefits of synchronous replication for high performance applications.
+ </para>
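+
+ <para>
+ A sketch of this pattern, assuming <varname>synchronous_commit</varname>
+ is <literal>on</literal> by default and a hypothetical
+ <literal>chat_messages</literal> table:
+<programlisting>
+BEGIN;
+SET LOCAL synchronous_commit = off;   -- this transaction need not wait
+INSERT INTO chat_messages (user_id, body) VALUES (42, 'hi');
+COMMIT;
+</programlisting>
+ Important changes simply omit the <command>SET LOCAL</command> and
+ retain the full synchronous guarantee.
+ </para>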
+
+ <para>
+ You should consider that the network bandwidth must be higher than
+ the rate of generation of WAL data.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-ha">
+ <title>Planning for High Availability</title>
+
+ <para>
+ <varname>synchronous_standby_names</varname> specifies the number and
+ names of synchronous standbys that transaction commits made when
+ <varname>synchronous_commit</varname> is set to <literal>on</literal>,
+ <literal>remote_apply</literal> or <literal>remote_write</literal> will wait for
+ responses from. Such transaction commits may never be completed
+ if any one of the synchronous standbys should crash.
+ </para>
+
+ <para>
+ The best solution for high availability is to ensure you keep as many
+ synchronous standbys as requested. This can be achieved by naming multiple
+ potential synchronous standbys using <varname>synchronous_standby_names</varname>.
+ </para>
+
+ <para>
+ In a priority-based synchronous replication, the standbys whose names
+ appear earlier in the list will be used as synchronous standbys.
+ Standbys listed after these will take over the role of synchronous standby
+ if one of the current ones should fail.
+ </para>
+
+ <para>
+ In a quorum-based synchronous replication, all the standbys appearing
+ in the list will be used as candidates for synchronous standbys.
+ Even if one of them should fail, the other standbys will keep performing
+ the role of candidate synchronous standbys.
+ </para>
+
+ <para>
+ When a standby first attaches to the primary, it will not yet be properly
+ synchronized. This is described as <literal>catchup</literal> mode. Once
+ the lag between standby and primary reaches zero for the first time
+ we move to real-time <literal>streaming</literal> state.
+ The catch-up duration may be long immediately after the standby has
+ been created. If the standby is shut down, then the catch-up period
+ will increase according to the length of time the standby has been down.
+ The standby is only able to become a synchronous standby
+ once it has reached <literal>streaming</literal> state.
+ This state can be viewed using
+ the <structname>pg_stat_replication</structname> view.
+ </para>
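+
+    <para>
+     For example, a query along these lines on the primary shows each
+     standby's replication state and synchronous status:
+<programlisting>
+SELECT application_name, state, sync_priority, sync_state
+FROM pg_stat_replication;
+</programlisting>
+    </para>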
+
+ <para>
+    If the primary restarts while commits are waiting for acknowledgment,
+    those waiting transactions will be marked fully committed once the
+    primary database recovers.
+    There is no way to be certain that all standbys have received all
+    outstanding WAL data at the time the primary crashed. Some
+ transactions may not show as committed on the standby, even though
+ they show as committed on the primary. The guarantee we offer is that
+ the application will not receive explicit acknowledgment of the
+ successful commit of a transaction until the WAL data is known to be
+ safely received by all the synchronous standbys.
+ </para>
+
+ <para>
+ If you really cannot keep as many synchronous standbys as requested
+ then you should decrease the number of synchronous standbys that
+ transaction commits must wait for responses from
+ in <varname>synchronous_standby_names</varname> (or disable it) and
+ reload the configuration file on the primary server.
+ </para>
+
+ <para>
+    If the primary is isolated from the remaining standby servers, you
+    should fail over to the best candidate among those standbys.
+ </para>
+
+ <para>
+ If you need to re-create a standby server while transactions are
+    waiting, make sure that the commands <function>pg_start_backup()</function> and
+    <function>pg_stop_backup()</function> are run in a session with
+    <varname>synchronous_commit</varname> = <literal>off</literal>; otherwise those
+    requests will wait forever for the standby to appear.
+ </para>
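+
+    <para>
+     A sketch of such a backup session might look like this (the backup
+     label is arbitrary):
+<programlisting>
+SET synchronous_commit = off;  -- avoid waiting for the missing standby
+SELECT pg_start_backup('rebuild-standby');
+-- ... copy the data directory to the new standby ...
+SELECT pg_stop_backup();
+</programlisting>
+    </para>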
+
+ </sect3>
+ </sect2>
+
+ <sect2 id="continuous-archiving-in-standby">
+ <title>Continuous Archiving in Standby</title>
+
+ <indexterm>
+ <primary>continuous archiving</primary>
+ <secondary>in standby</secondary>
+ </indexterm>
+
+ <para>
+ When continuous WAL archiving is used in a standby, there are two
+ different scenarios: the WAL archive can be shared between the primary
+ and the standby, or the standby can have its own WAL archive. When
+ the standby has its own WAL archive, set <varname>archive_mode</varname>
+ to <literal>always</literal>, and the standby will call the archive
+ command for every WAL segment it receives, whether it's by restoring
+ from the archive or by streaming replication. The shared archive can
+ be handled similarly, but the <varname>archive_command</varname> must
+ test if the file being archived exists already, and if the existing file
+    has identical contents. This requires more care in the
+    <varname>archive_command</varname>, as it must
+    be careful not to overwrite an existing file with different contents,
+    but must return success if exactly the same file is archived twice. And
+    all that must be done free of race conditions, in case two servers attempt
+    to archive the same file at the same time.
+ </para>
+
+ <para>
+ If <varname>archive_mode</varname> is set to <literal>on</literal>, the
+ archiver is not enabled during recovery or standby mode. If the standby
+ server is promoted, it will start archiving after the promotion, but
+ will not archive any WAL or timeline history files that
+ it did not generate itself. To get a complete
+    series of WAL files in the archive, you must ensure that all WAL is
+    archived before it reaches the standby. This is inherently true with
+ file-based log shipping, as the standby can only restore files that
+ are found in the archive, but not if streaming replication is enabled.
+ When a server is not in recovery mode, there is no difference between
+ <literal>on</literal> and <literal>always</literal> modes.
+ </para>
+ </sect2>
+ </sect1>
+
+ <sect1 id="warm-standby-failover">
+ <title>Failover</title>
+
+ <para>
+ If the primary server fails then the standby server should begin
+ failover procedures.
+ </para>
+
+ <para>
+ If the standby server fails then no failover need take place. If the
+ standby server can be restarted, even some time later, then the recovery
+ process can also be restarted immediately, taking advantage of
+ restartable recovery. If the standby server cannot be restarted, then a
+ full new standby server instance should be created.
+ </para>
+
+ <para>
+ If the primary server fails and the standby server becomes the
+ new primary, and then the old primary restarts, you must have
+ a mechanism for informing the old primary that it is no longer the primary. This is
+ sometimes known as <acronym>STONITH</acronym> (Shoot The Other Node In The Head), which is
+ necessary to avoid situations where both systems think they are the
+ primary, which will lead to confusion and ultimately data loss.
+ </para>
+
+ <para>
+ Many failover systems use just two systems, the primary and the standby,
+ connected by some kind of heartbeat mechanism to continually verify the
+ connectivity between the two and the viability of the primary. It is
+ also possible to use a third system (called a witness server) to prevent
+ some cases of inappropriate failover, but the additional complexity
+ might not be worthwhile unless it is set up with sufficient care and
+ rigorous testing.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</productname> does not provide the system
+ software required to identify a failure on the primary and notify
+ the standby database server. Many such tools exist and are well
+ integrated with the operating system facilities required for
+ successful failover, such as IP address migration.
+ </para>
+
+ <para>
+ Once failover to the standby occurs, there is only a
+ single server in operation. This is known as a degenerate state.
+ The former standby is now the primary, but the former primary is down
+ and might stay down. To return to normal operation, a standby server
+ must be recreated,
+ either on the former primary system when it comes up, or on a third,
+ possibly new, system. The <xref linkend="app-pgrewind"/> utility can be
+ used to speed up this process on large clusters.
+ Once complete, the primary and standby can be
+ considered to have switched roles. Some people choose to use a third
+ server to provide backup for the new primary until the new standby
+ server is recreated,
+ though clearly this complicates the system configuration and
+ operational processes.
+ </para>
+
+ <para>
+ So, switching from primary to standby server can be fast but requires
+ some time to re-prepare the failover cluster. Regular switching from
+ primary to standby is useful, since it allows regular downtime on
+ each system for maintenance. This also serves as a test of the
+ failover mechanism to ensure that it will really work when you need it.
+ Written administration procedures are advised.
+ </para>
+
+ <para>
+ To trigger failover of a log-shipping standby server, run
+ <command>pg_ctl promote</command>, call <function>pg_promote()</function>,
+ or create a trigger file with the file name and path specified by the
+ <varname>promote_trigger_file</varname>. If you're planning to use
+ <command>pg_ctl promote</command> or to call
+ <function>pg_promote()</function> to fail over,
+ <varname>promote_trigger_file</varname> is not required. If you're
+    setting up reporting servers that are only used to offload read-only
+    queries from the primary, not for high availability purposes, you don't
+    need to promote them.
+ </para>
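+
+  <para>
+   For example, promotion can be requested from any SQL session on the
+   standby via <function>pg_promote()</function>:
+<programlisting>
+SELECT pg_promote();        -- wait (by default up to 60 seconds) for promotion
+SELECT pg_promote(false);   -- request promotion and return immediately
+</programlisting>
+  </para>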
+ </sect1>
+
+ <sect1 id="log-shipping-alternative">
+ <title>Alternative Method for Log Shipping</title>
+
+ <para>
+ An alternative to the built-in standby mode described in the previous
+ sections is to use a <varname>restore_command</varname> that polls the archive location.
+ This was the only option available in versions 8.4 and below. See the
+ <xref linkend="pgstandby"/> module for a reference implementation of this.
+ </para>
+
+ <para>
+ Note that in this mode, the server will apply WAL one file at a
+    time, so if you use the standby server for queries (see Hot Standby),
+    there is a delay between an action on the primary and when the
+    action becomes visible on the standby, corresponding to the time it takes
+ to fill up the WAL file. <varname>archive_timeout</varname> can be used to make that delay
+ shorter. Also note that you can't combine streaming replication with
+ this method.
+ </para>
+
+ <para>
+ The operations that occur on both primary and standby servers are
+ normal continuous archiving and recovery tasks. The only point of
+ contact between the two database servers is the archive of WAL files
+ that both share: primary writing to the archive, standby reading from
+ the archive. Care must be taken to ensure that WAL archives from separate
+ primary servers do not become mixed together or confused. The archive
+ need not be large if it is only required for standby operation.
+ </para>
+
+ <para>
+ The magic that makes the two loosely coupled servers work together is
+ simply a <varname>restore_command</varname> used on the standby that,
+ when asked for the next WAL file, waits for it to become available from
+ the primary. Normal recovery
+ processing would request a file from the WAL archive, reporting failure
+ if the file was unavailable. For standby processing it is normal for
+ the next WAL file to be unavailable, so the standby must wait for
+ it to appear. For files ending in
+ <literal>.history</literal> there is no need to wait, and a non-zero return
+ code must be returned. A waiting <varname>restore_command</varname> can be
+ written as a custom script that loops after polling for the existence of
+ the next WAL file. There must also be some way to trigger failover, which
+ should interrupt the <varname>restore_command</varname>, break the loop and
+ return a file-not-found error to the standby server. This ends recovery
+ and the standby will then come up as a normal server.
+ </para>
+
+ <para>
+ Pseudocode for a suitable <varname>restore_command</varname> is:
+<programlisting>
+triggered = false;
+while (!NextWALFileReady() &amp;&amp; !triggered)
+{
+    usleep(100000L); /* poll every ~0.1 sec */
+    if (CheckForExternalTrigger())
+        triggered = true;   /* failover requested; stop waiting */
+}
+if (!triggered)
+    CopyWALFileForRecovery();   /* hand the next WAL file to recovery */
+</programlisting>
+ </para>
+
+ <para>
+ A working example of a waiting <varname>restore_command</varname> is provided
+ in the <xref linkend="pgstandby"/> module. It
+ should be used as a reference on how to correctly implement the logic
+ described above. It can also be extended as needed to support specific
+ configurations and environments.
+ </para>
+
+ <para>
+ The method for triggering failover is an important part of planning
+    and design. One potential option is the <varname>restore_command</varname>
+    itself. It is executed once for each WAL file, but the process
+    running the <varname>restore_command</varname> is created anew and exits for
+    each file, so there is no daemon or server process, and
+    signals or a signal handler cannot be used. Therefore, the
+ <varname>restore_command</varname> is not suitable to trigger failover.
+ It is possible to use a simple timeout facility, especially if
+ used in conjunction with a known <varname>archive_timeout</varname>
+ setting on the primary. However, this is somewhat error prone
+ since a network problem or busy primary server might be sufficient
+ to initiate failover. A notification mechanism such as the explicit
+ creation of a trigger file is ideal, if this can be arranged.
+ </para>
+
+ <sect2 id="warm-standby-config">
+ <title>Implementation</title>
+
+ <para>
+ The short procedure for configuring a standby server using this alternative
+ method is as follows. For
+ full details of each step, refer to previous sections as noted.
+ <orderedlist>
+ <listitem>
+ <para>
+ Set up primary and standby systems as nearly identical as
+ possible, including two identical copies of
+ <productname>PostgreSQL</productname> at the same release level.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Set up continuous archiving from the primary to a WAL archive
+ directory on the standby server. Ensure that
+ <xref linkend="guc-archive-mode"/>,
+ <xref linkend="guc-archive-command"/> and
+ <xref linkend="guc-archive-timeout"/>
+ are set appropriately on the primary
+ (see <xref linkend="backup-archiving-wal"/>).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Make a base backup of the primary server (see <xref
+ linkend="backup-base-backup"/>), and load this data onto the standby.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Begin recovery on the standby server from the local WAL
+       archive, using a <varname>restore_command</varname> that waits
+ as described previously (see <xref linkend="backup-pitr-recovery"/>).
+ </para>
+ </listitem>
+ </orderedlist>
+ </para>
+
+ <para>
+ Recovery treats the WAL archive as read-only, so once a WAL file has
+ been copied to the standby system it can be copied to tape at the same
+ time as it is being read by the standby database server.
+ Thus, running a standby server for high availability can be performed at
+ the same time as files are stored for longer term disaster recovery
+ purposes.
+ </para>
+
+ <para>
+ For testing purposes, it is possible to run both primary and standby
+ servers on the same system. This does not provide any worthwhile
+ improvement in server robustness, nor would it be described as HA.
+ </para>
+ </sect2>
+
+ <sect2 id="warm-standby-record">
+ <title>Record-Based Log Shipping</title>
+
+ <para>
+ It is also possible to implement record-based log shipping using this
+ alternative method, though this requires custom development, and changes
+ will still only become visible to hot standby queries after a full WAL
+ file has been shipped.
+ </para>
+
+ <para>
+ An external program can call the <function>pg_walfile_name_offset()</function>
+ function (see <xref linkend="functions-admin"/>)
+ to find out the file name and the exact byte offset within it of
+ the current end of WAL. It can then access the WAL file directly
+ and copy the data from the last known end of WAL through the current end
+ over to the standby servers. With this approach, the window for data
+ loss is the polling cycle time of the copying program, which can be very
+ small, and there is no wasted bandwidth from forcing partially-used
+ segment files to be archived. Note that the standby servers'
+ <varname>restore_command</varname> scripts can only deal with whole WAL files,
+ so the incrementally copied data is not ordinarily made available to
+ the standby servers. It is of use only when the primary dies &mdash;
+ then the last partial WAL file is fed to the standby before allowing
+ it to come up. The correct implementation of this process requires
+ cooperation of the <varname>restore_command</varname> script with the data
+ copying program.
+ </para>
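+
+   <para>
+    For example, the copying program could poll the primary with a query
+    such as the following, which reports the WAL file name and the byte
+    offset of the current end of WAL within it:
+<programlisting>
+SELECT * FROM pg_walfile_name_offset(pg_current_wal_lsn());
+</programlisting>
+   </para>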
+
+ <para>
+ Starting with <productname>PostgreSQL</productname> version 9.0, you can use
+ streaming replication (see <xref linkend="streaming-replication"/>) to
+ achieve the same benefits with less effort.
+ </para>
+ </sect2>
+ </sect1>
+
+ <sect1 id="hot-standby">
+ <title>Hot Standby</title>
+
+ <indexterm zone="high-availability">
+ <primary>Hot Standby</primary>
+ </indexterm>
+
+ <para>
+ Hot Standby is the term used to describe the ability to connect to
+ the server and run read-only queries while the server is in archive
+ recovery or standby mode. This
+ is useful both for replication purposes and for restoring a backup
+ to a desired state with great precision.
+ The term Hot Standby also refers to the ability of the server to move
+ from recovery through to normal operation while users continue running
+ queries and/or keep their connections open.
+ </para>
+
+ <para>
+ Running queries in hot standby mode is similar to normal query operation,
+ though there are several usage and administrative differences
+ explained below.
+ </para>
+
+ <sect2 id="hot-standby-users">
+ <title>User's Overview</title>
+
+ <para>
+ When the <xref linkend="guc-hot-standby"/> parameter is set to true on a
+ standby server, it will begin accepting connections once the recovery has
+ brought the system to a consistent state. All such connections are
+ strictly read-only; not even temporary tables may be written.
+ </para>
+
+ <para>
+ The data on the standby takes some time to arrive from the primary server
+ so there will be a measurable delay between primary and standby. Running the
+ same query nearly simultaneously on both primary and standby might therefore
+ return differing results. We say that data on the standby is
+ <firstterm>eventually consistent</firstterm> with the primary. Once the
+ commit record for a transaction is replayed on the standby, the changes
+ made by that transaction will be visible to any new snapshots taken on
+ the standby. Snapshots may be taken at the start of each query or at the
+ start of each transaction, depending on the current transaction isolation
+ level. For more details, see <xref linkend="transaction-iso"/>.
+ </para>
+
+ <para>
+ Transactions started during hot standby may issue the following commands:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ Query access: <command>SELECT</command>, <command>COPY TO</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Cursor commands: <command>DECLARE</command>, <command>FETCH</command>, <command>CLOSE</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Settings: <command>SHOW</command>, <command>SET</command>, <command>RESET</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Transaction management commands:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <command>BEGIN</command>, <command>END</command>, <command>ABORT</command>, <command>START TRANSACTION</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <command>SAVEPOINT</command>, <command>RELEASE</command>, <command>ROLLBACK TO SAVEPOINT</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <command>EXCEPTION</command> blocks and other internal subtransactions
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <command>LOCK TABLE</command>, though only when explicitly in one of these modes:
+ <literal>ACCESS SHARE</literal>, <literal>ROW SHARE</literal> or <literal>ROW EXCLUSIVE</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Plans and resources: <command>PREPARE</command>, <command>EXECUTE</command>,
+ <command>DEALLOCATE</command>, <command>DISCARD</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Plugins and extensions: <command>LOAD</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <command>UNLISTEN</command>
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
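+
+   <para>
+    For instance, a hot standby session might run a read-only transaction
+    like the following (the table name is hypothetical):
+<programlisting>
+BEGIN;
+DECLARE recent_orders CURSOR FOR
+    SELECT * FROM orders WHERE placed_at > now() - interval '1 day';
+FETCH 10 FROM recent_orders;
+CLOSE recent_orders;
+LOCK TABLE orders IN ACCESS SHARE MODE;   -- one of the permitted lock modes
+COMMIT;
+</programlisting>
+   </para>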
+
+ <para>
+ Transactions started during hot standby will never be assigned a
+ transaction ID and cannot write to the system write-ahead log.
+ Therefore, the following actions will produce error messages:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ Data Manipulation Language (DML): <command>INSERT</command>,
+ <command>UPDATE</command>, <command>DELETE</command>, <command>COPY FROM</command>,
+ <command>TRUNCATE</command>.
+ Note that there are no allowed actions that result in a trigger
+ being executed during recovery. This restriction applies even to
+ temporary tables, because table rows cannot be read or written without
+ assigning a transaction ID, which is currently not possible in a
+ Hot Standby environment.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Data Definition Language (DDL): <command>CREATE</command>,
+ <command>DROP</command>, <command>ALTER</command>, <command>COMMENT</command>.
+ This restriction applies even to temporary tables, because carrying
+ out these operations would require updating the system catalog tables.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <command>SELECT ... FOR SHARE | UPDATE</command>, because row locks cannot be
+ taken without updating the underlying data files.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Rules on <command>SELECT</command> statements that generate DML commands.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <command>LOCK</command> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <command>LOCK</command> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Transaction management commands that explicitly set non-read-only state:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <command>BEGIN READ WRITE</command>,
+ <command>START TRANSACTION READ WRITE</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <command>SET TRANSACTION READ WRITE</command>,
+ <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <command>SET transaction_read_only = off</command>
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Two-phase commit commands: <command>PREPARE TRANSACTION</command>,
+ <command>COMMIT PREPARED</command>, <command>ROLLBACK PREPARED</command>
+ because even read-only transactions need to write WAL in the
+       prepare phase (the first phase of two-phase commit).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Sequence updates: <function>nextval()</function>, <function>setval()</function>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <command>LISTEN</command>, <command>NOTIFY</command>
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <para>
+ In normal operation, <quote>read-only</quote> transactions are allowed to
+ use <command>LISTEN</command> and <command>NOTIFY</command>,
+ so Hot Standby sessions operate under slightly tighter
+ restrictions than ordinary read-only sessions. It is possible that some
+ of these restrictions might be loosened in a future release.
+ </para>
+
+ <para>
+ During hot standby, the parameter <varname>transaction_read_only</varname> is always
+ true and may not be changed. But as long as no attempt is made to modify
+ the database, connections during hot standby will act much like any other
+ database connection. If failover or switchover occurs, the database will
+ switch to normal processing mode. Sessions will remain connected while the
+ server changes mode. Once hot standby finishes, it will be possible to
+ initiate read-write transactions (even from a session begun during
+ hot standby).
+ </para>
+
+ <para>
+ Users will be able to tell whether their session is read-only by
+ issuing <command>SHOW transaction_read_only</command>. In addition, a set of
+ functions (<xref linkend="functions-recovery-info-table"/>) allow users to
+ access information about the standby server. These allow you to write
+ programs that are aware of the current state of the database. These
+ can be used to monitor the progress of recovery, or to allow you to
+ write complex programs that restore the database to particular states.
+ </para>
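+
+   <para>
+    For example, the following can be used to confirm that a session is
+    attached to a standby and to observe replay progress
+    (<function>pg_last_wal_replay_lsn()</function> is one of the recovery
+    information functions):
+<programlisting>
+SHOW transaction_read_only;        -- "on" while in hot standby
+SELECT pg_is_in_recovery();        -- true on a standby server
+SELECT pg_last_wal_replay_lsn();   -- last WAL location replayed so far
+</programlisting>
+   </para>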
+ </sect2>
+
+ <sect2 id="hot-standby-conflict">
+ <title>Handling Query Conflicts</title>
+
+ <para>
+ The primary and standby servers are in many ways loosely connected. Actions
+ on the primary will have an effect on the standby. As a result, there is
+ potential for negative interactions or conflicts between them. The easiest
+ conflict to understand is performance: if a huge data load is taking place
+ on the primary then this will generate a similar stream of WAL records on the
+ standby, so standby queries may contend for system resources, such as I/O.
+ </para>
+
+ <para>
+ There are also additional types of conflict that can occur with Hot Standby.
+ These conflicts are <emphasis>hard conflicts</emphasis> in the sense that queries
+ might need to be canceled and, in some cases, sessions disconnected to resolve them.
+ The user is provided with several ways to handle these
+ conflicts. Conflict cases include:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ Access Exclusive locks taken on the primary server, including both
+ explicit <command>LOCK</command> commands and various <acronym>DDL</acronym>
+ actions, conflict with table accesses in standby queries.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Dropping a tablespace on the primary conflicts with standby queries
+ using that tablespace for temporary work files.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Dropping a database on the primary conflicts with sessions connected
+ to that database on the standby.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Application of a vacuum cleanup record from WAL conflicts with
+ standby transactions whose snapshots can still <quote>see</quote> any of
+ the rows to be removed.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Application of a vacuum cleanup record from WAL conflicts with
+ queries accessing the target page on the standby, whether or not
+ the data to be removed is visible.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <para>
+ On the primary server, these cases simply result in waiting; and the
+ user might choose to cancel either of the conflicting actions. However,
+ on the standby there is no choice: the WAL-logged action already occurred
+ on the primary so the standby must not fail to apply it. Furthermore,
+ allowing WAL application to wait indefinitely may be very undesirable,
+ because the standby's state will become increasingly far behind the
+ primary's. Therefore, a mechanism is provided to forcibly cancel standby
+ queries that conflict with to-be-applied WAL records.
+ </para>
+
+ <para>
+ An example of the problem situation is an administrator on the primary
+ server running <command>DROP TABLE</command> on a table that is currently being
+ queried on the standby server. Clearly the standby query cannot continue
+ if the <command>DROP TABLE</command> is applied on the standby. If this situation
+ occurred on the primary, the <command>DROP TABLE</command> would wait until the
+ other query had finished. But when <command>DROP TABLE</command> is run on the
+ primary, the primary doesn't have information about what queries are
+ running on the standby, so it will not wait for any such standby
+ queries. The WAL change records come through to the standby while the
+ standby query is still running, causing a conflict. The standby server
+ must either delay application of the WAL records (and everything after
+ them, too) or else cancel the conflicting query so that the <command>DROP
+ TABLE</command> can be applied.
+ </para>
+
+ <para>
+ When a conflicting query is short, it's typically desirable to allow it to
+ complete by delaying WAL application for a little bit; but a long delay in
+ WAL application is usually not desirable. So the cancel mechanism has
+ parameters, <xref linkend="guc-max-standby-archive-delay"/> and <xref
+ linkend="guc-max-standby-streaming-delay"/>, that define the maximum
+ allowed delay in WAL application. Conflicting queries will be canceled
+ once it has taken longer than the relevant delay setting to apply any
+ newly-received WAL data. There are two parameters so that different delay
+ values can be specified for the case of reading WAL data from an archive
+ (i.e., initial recovery from a base backup or <quote>catching up</quote> a
+ standby server that has fallen far behind) versus reading WAL data via
+ streaming replication.
+ </para>
+
+ <para>
+ In a standby server that exists primarily for high availability, it's
+ best to set the delay parameters relatively short, so that the server
+ cannot fall far behind the primary due to delays caused by standby
+ queries. However, if the standby server is meant for executing
+ long-running queries, then a high or even infinite delay value may be
+ preferable. Keep in mind however that a long-running query could
+ cause other sessions on the standby server to not see recent changes
+ on the primary, if it delays application of WAL records.
+ </para>
+
+ <para>
+ Once the delay specified by <varname>max_standby_archive_delay</varname> or
+ <varname>max_standby_streaming_delay</varname> has been exceeded, conflicting
+ queries will be canceled. This usually results just in a cancellation
+ error, although in the case of replaying a <command>DROP DATABASE</command>
+ the entire conflicting session will be terminated. Also, if the conflict
+ is over a lock held by an idle transaction, the conflicting session is
+ terminated (this behavior might change in the future).
+ </para>
+
+ <para>
+ Canceled queries may be retried immediately (after beginning a new
+ transaction, of course). Since query cancellation depends on
+ the nature of the WAL records being replayed, a query that was
+ canceled may well succeed if it is executed again.
+ </para>
+
+ <para>
+ Keep in mind that the delay parameters are compared to the elapsed time
+ since the WAL data was received by the standby server. Thus, the grace
+ period allowed to any one query on the standby is never more than the
+ delay parameter, and could be considerably less if the standby has already
+ fallen behind as a result of waiting for previous queries to complete, or
+ as a result of being unable to keep up with a heavy update load.
+ </para>
+
+ <para>
+ The most common reason for conflict between standby queries and WAL replay
+ is <quote>early cleanup</quote>. Normally, <productname>PostgreSQL</productname> allows
+ cleanup of old row versions when there are no transactions that need to
+ see them to ensure correct visibility of data according to MVCC rules.
+    However, this rule can only be applied for transactions executing on the
+    primary. So it is possible that cleanup on the primary will remove row
+    versions that are still visible to a transaction on the standby.
+ </para>
+
+ <para>
+ Experienced users should note that both row version cleanup and row version
+ freezing will potentially conflict with standby queries. Running a manual
+ <command>VACUUM FREEZE</command> is likely to cause conflicts even on tables with
+ no updated or deleted rows.
+ </para>
+
+ <para>
+ Users should be clear that tables that are regularly and heavily updated
+ on the primary server will quickly cause cancellation of longer running
+ queries on the standby. In such cases the setting of a finite value for
+ <varname>max_standby_archive_delay</varname> or
+ <varname>max_standby_streaming_delay</varname> can be considered similar to
+ setting <varname>statement_timeout</varname>.
+ </para>
+
+ <para>
+ Remedial possibilities exist if the number of standby-query cancellations
+ is found to be unacceptable. The first option is to set the parameter
+ <varname>hot_standby_feedback</varname>, which prevents <command>VACUUM</command> from
+ removing recently-dead rows and so cleanup conflicts do not occur.
+ If you do this, you
+ should note that this will delay cleanup of dead rows on the primary,
+ which may result in undesirable table bloat. However, the cleanup
+ situation will be no worse than if the standby queries were running
+ directly on the primary server, and you are still getting the benefit of
+ off-loading execution onto the standby.
+ If standby servers connect and disconnect frequently, you
+ might want to make adjustments to handle the period when
+    <varname>hot_standby_feedback</varname> is not being provided.
+ For example, consider increasing <varname>max_standby_archive_delay</varname>
+ so that queries are not rapidly canceled by conflicts in WAL archive
+ files during disconnected periods. You should also consider increasing
+ <varname>max_standby_streaming_delay</varname> to avoid rapid cancellations
+ by newly-arrived streaming WAL entries after reconnection.
+ </para>
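+
+   <para>
+    A minimal sketch of enabling feedback on the standby follows; the
+    parameter takes effect on a configuration reload, so no restart is
+    needed:
+<programlisting>
+ALTER SYSTEM SET hot_standby_feedback = on;
+SELECT pg_reload_conf();
+</programlisting>
+   </para>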
+
+ <para>
+ Another option is to increase <xref linkend="guc-vacuum-defer-cleanup-age"/>
+ on the primary server, so that dead rows will not be cleaned up as quickly
+ as they normally would be. This will allow more time for queries to
+ execute before they are canceled on the standby, without having to set
+ a high <varname>max_standby_streaming_delay</varname>. However it is
+ difficult to guarantee any specific execution-time window with this
+ approach, since <varname>vacuum_defer_cleanup_age</varname> is measured in
+ transactions executed on the primary server.
+ </para>
+
+ <para>
+ The number of query cancels and the reason for them can be viewed using
+ the <structname>pg_stat_database_conflicts</structname> system view on the standby
+ server. The <structname>pg_stat_database</structname> system view also contains
+ summary information.
+ </para>
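+
+   <para>
+    For example, a query along these lines on the standby breaks the
+    cancellations down by cause:
+<programlisting>
+SELECT datname, confl_tablespace, confl_lock, confl_snapshot,
+       confl_bufferpin, confl_deadlock
+FROM pg_stat_database_conflicts;
+</programlisting>
+   </para>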
+ </sect2>
+
+ <sect2 id="hot-standby-admin">
+ <title>Administrator's Overview</title>
+
+ <para>
+ If <varname>hot_standby</varname> is <literal>on</literal> in <filename>postgresql.conf</filename>
+ (the default value) and there is a
+ <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary><secondary>for hot standby</secondary></indexterm>
+ file present, the server will run in Hot Standby mode.
+ However, it may take some time for Hot Standby connections to be allowed,
+ because the server will not accept connections until it has completed
+ sufficient recovery to provide a consistent state against which queries
+ can run. During this period,
+ clients that attempt to connect will be refused with an error message.
+ To confirm the server has come up, either loop trying to connect from
+ the application, or look for these messages in the server logs:
+
+<programlisting>
+LOG: entering standby mode
+
+... then some time later ...
+
+LOG: consistent recovery state reached
+LOG: database system is ready to accept read only connections
+</programlisting>
+
+ Consistency information is recorded once per checkpoint on the primary.
+ It is not possible to enable hot standby when reading WAL
+ written during a period when <varname>wal_level</varname> was not set to
+ <literal>replica</literal> or <literal>logical</literal> on the primary. Reaching
+ a consistent state can also be delayed in the presence of both of these
+ conditions:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ A write transaction has more than 64 subtransactions
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Very long-lived write transactions
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ If you are running file-based log shipping ("warm standby"), you might need
+ to wait until the next WAL file arrives, which could be as long as the
+ <varname>archive_timeout</varname> setting on the primary.
+ </para>
+
+ <para>
+ The setting of some parameters on the standby will need reconfiguration
+ if they have been changed on the primary. For these parameters,
+ the value on the standby must
+ be equal to or greater than the value on the primary.
+ Therefore, if you want to increase these values, you should do so on all
+ standby servers first, before applying the changes to the primary server.
+ Conversely, if you want to decrease these values, you should do so on the
+ primary server first, before applying the changes to all standby servers.
+ If these parameters
+ are not set high enough then the standby will refuse to start.
+ Higher values can then be supplied and the server
+ restarted to begin recovery again. These parameters are:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <varname>max_connections</varname>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <varname>max_prepared_transactions</varname>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <varname>max_locks_per_transaction</varname>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <varname>max_wal_senders</varname>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <varname>max_worker_processes</varname>
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
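+
+   <para>
+    One way to check these settings is to run the same query on both
+    servers and verify that each value on the standby is at least as
+    large as on the primary:
+<programlisting>
+SELECT name, setting
+FROM pg_settings
+WHERE name IN ('max_connections', 'max_prepared_transactions',
+               'max_locks_per_transaction', 'max_wal_senders',
+               'max_worker_processes');
+</programlisting>
+   </para>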
+
+ <para>
+ It is important that the administrator select appropriate settings for
+ <xref linkend="guc-max-standby-archive-delay"/> and <xref
+ linkend="guc-max-standby-streaming-delay"/>. The best choices vary
+ depending on business priorities. For example if the server is primarily
+ tasked as a High Availability server, then you will want low delay
+ settings, perhaps even zero, though that is a very aggressive setting. If
+ the standby server is tasked as an additional server for decision support
+ queries then it might be acceptable to set the maximum delay values to
+ many hours, or even -1 which means wait forever for queries to complete.
+ </para>
+
+ <para>
+    Transaction status "hint bits" written on the primary are not
+    WAL-logged, so the standby will likely re-write the hints itself.
+ Thus, the standby server will still perform disk writes even though
+ all users are read-only; no changes occur to the data values
+ themselves. Users will still write large sort temporary files and
+ re-generate relcache info files, so no part of the database
+ is truly read-only during hot standby mode.
+ Note also that writes to remote databases using
+ <application>dblink</application> module, and other operations outside the
+ database using PL functions will still be possible, even though the
+ transaction is read-only locally.
+ </para>
+
+ <para>
+ The following types of administration commands are not accepted
+ during recovery mode:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ Data Definition Language (DDL): e.g., <command>CREATE INDEX</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Privilege and Ownership: <command>GRANT</command>, <command>REVOKE</command>,
+ <command>REASSIGN</command>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Maintenance commands: <command>ANALYZE</command>, <command>VACUUM</command>,
+ <command>CLUSTER</command>, <command>REINDEX</command>
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <para>
+ Again, note that some of these commands are actually allowed during
+ "read only" mode transactions on the primary.
+ </para>
+
+ <para>
+ As a result, you cannot create additional indexes that exist solely
+ on the standby, nor statistics that exist solely on the standby.
+ If these administration commands are needed, they should be executed
+ on the primary, and eventually those changes will propagate to the
+ standby.
+ </para>
+
+ <para>
+ <function>pg_cancel_backend()</function>
+ and <function>pg_terminate_backend()</function> will work on user backends,
+ but not the Startup process, which performs
+ recovery. <structname>pg_stat_activity</structname> does not show
+ recovering transactions as active. As a result,
+ <structname>pg_prepared_xacts</structname> is always empty during
+ recovery. If you wish to resolve in-doubt prepared transactions, view
+ <literal>pg_prepared_xacts</literal> on the primary and issue commands to
+ resolve transactions there or resolve them after the end of recovery.
+ </para>
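+
+   <para>
+    As an illustration, a superuser on the standby could cancel all other
+    active user queries with something like the following; the Startup
+    process is unaffected:
+<programlisting>
+SELECT pid, pg_cancel_backend(pid)
+FROM pg_stat_activity
+WHERE state = 'active'
+  AND pid != pg_backend_pid();
+</programlisting>
+   </para>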
+
+ <para>
+ <structname>pg_locks</structname> will show locks held by backends,
+ as normal. <structname>pg_locks</structname> also shows
+ a virtual transaction managed by the Startup process that owns all
+ <literal>AccessExclusiveLocks</literal> held by transactions being replayed by recovery.
+ Note that the Startup process does not acquire locks to
+ make database changes, and thus locks other than <literal>AccessExclusiveLocks</literal>
+ do not show in <structname>pg_locks</structname> for the Startup
+ process; they are just presumed to exist.
+ </para>
+
+ <para>
+ The <productname>Nagios</productname> plugin <productname>check_pgsql</productname> will
+ work, because the simple information it checks for exists.
+ The <productname>check_postgres</productname> monitoring script will also work,
+ though some reported values could give different or confusing results.
+ For example, last vacuum time will not be maintained, since no
+ vacuum occurs on the standby. Vacuums running on the primary
+ do still send their changes to the standby.
+ </para>
+
+ <para>
+ WAL file control commands will not work during recovery,
+    e.g., <function>pg_start_backup</function>, <function>pg_switch_wal</function>, etc.
+ </para>
+
+ <para>
+ Dynamically loadable modules work, including <structname>pg_stat_statements</structname>.
+ </para>
+
+ <para>
+ Advisory locks work normally in recovery, including deadlock detection.
+ Note that advisory locks are never WAL logged, so it is impossible for
+ an advisory lock on either the primary or the standby to conflict with WAL
+ replay. Nor is it possible to acquire an advisory lock on the primary
+ and have it initiate a similar advisory lock on the standby. Advisory
+ locks relate only to the server on which they are acquired.
+ </para>
+
+ <para>
+ Trigger-based replication systems such as <productname>Slony</productname>,
+ <productname>Londiste</productname> and <productname>Bucardo</productname> won't run on the
+ standby at all, though they will run happily on the primary server as
+ long as the changes are not sent to standby servers to be applied.
+ WAL replay is not trigger-based so you cannot relay from the
+ standby to any system that requires additional database writes or
+ relies on the use of triggers.
+ </para>
+
+ <para>
+ New OIDs cannot be assigned, though some <acronym>UUID</acronym> generators may still
+ work as long as they do not rely on writing new status to the database.
+ </para>
+
+ <para>
+ Currently, temporary table creation is not allowed during read only
+ transactions, so in some cases existing scripts will not run correctly.
+ This restriction might be relaxed in a later release. This is
+ both a SQL Standard compliance issue and a technical issue.
+ </para>
+
+ <para>
+ <command>DROP TABLESPACE</command> can only succeed if the tablespace is empty.
+ Some standby users may be actively using the tablespace via their
+ <varname>temp_tablespaces</varname> parameter. If there are temporary files in the
+ tablespace, all active queries are canceled to ensure that temporary
+ files are removed, so the tablespace can be removed and WAL replay
+ can continue.
+ </para>
+
+ <para>
+ Running <command>DROP DATABASE</command> or <command>ALTER DATABASE ... SET
+ TABLESPACE</command> on the primary
+ will generate a WAL entry that will cause all users connected to that
+ database on the standby to be forcibly disconnected. This action occurs
+ immediately, whatever the setting of
+ <varname>max_standby_streaming_delay</varname>. Note that
+    <command>ALTER DATABASE ... RENAME</command> does not disconnect users;
+    in most cases this will go unnoticed, though it might confuse a
+    program that depends in some way upon the database name.
+ </para>
+
+ <para>
+ In normal (non-recovery) mode, if you issue <command>DROP USER</command> or <command>DROP ROLE</command>
+ for a role with login capability while that user is still connected then
+ nothing happens to the connected user &mdash; they remain connected. The user cannot
+    reconnect, however. This behavior applies in recovery also, so a
+ <command>DROP USER</command> on the primary does not disconnect that user on the standby.
+ </para>
+
+ <para>
+ The statistics collector is active during recovery. All scans, reads, blocks,
+    index usage, etc., will be recorded normally on the standby. Replayed
+    actions will not duplicate their effects on the primary, so replaying an
+    insert will not increment the Inserts column of <structname>pg_stat_user_tables</structname>.
+ The stats file is deleted at the start of recovery, so stats from primary
+ and standby will differ; this is considered a feature, not a bug.
+ </para>
+
+ <para>
+ Autovacuum is not active during recovery. It will start normally at the
+ end of recovery.
+ </para>
+
+ <para>
+ The checkpointer process and the background writer process are active during
+ recovery. The checkpointer process will perform restartpoints (similar to
+ checkpoints on the primary) and the background writer process will perform
+ normal block cleaning activities. This can include updates of the hint bit
+ information stored on the standby server.
+ The <command>CHECKPOINT</command> command is accepted during recovery,
+ though it performs a restartpoint rather than a new checkpoint.
+ </para>
+ </sect2>
+
+ <sect2 id="hot-standby-parameters">
+ <title>Hot Standby Parameter Reference</title>
+
+ <para>
+ Various parameters have been mentioned above in
+ <xref linkend="hot-standby-conflict"/> and
+ <xref linkend="hot-standby-admin"/>.
+ </para>
+
+ <para>
+ On the primary, parameters <xref linkend="guc-wal-level"/> and
+ <xref linkend="guc-vacuum-defer-cleanup-age"/> can be used.
+ <xref linkend="guc-max-standby-archive-delay"/> and
+ <xref linkend="guc-max-standby-streaming-delay"/> have no effect if set on
+ the primary.
+ </para>
+
+ <para>
+ On the standby, parameters <xref linkend="guc-hot-standby"/>,
+ <xref linkend="guc-max-standby-archive-delay"/> and
+ <xref linkend="guc-max-standby-streaming-delay"/> can be used.
+ <xref linkend="guc-vacuum-defer-cleanup-age"/> has no effect
+ as long as the server remains in standby mode, though it will
+ become relevant if the standby becomes primary.
+ </para>
+ </sect2>
+
+ <sect2 id="hot-standby-caveats">
+ <title>Caveats</title>
+
+ <para>
+ There are several limitations of Hot Standby.
+ These can and probably will be fixed in future releases:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ Full knowledge of running transactions is required before snapshots
+ can be taken. Transactions that use large numbers of subtransactions
+ (currently greater than 64) will delay the start of read only
+ connections until the completion of the longest running write transaction.
+ If this situation occurs, explanatory messages will be sent to the server log.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Valid starting points for standby queries are generated at each
+       checkpoint on the primary. If the standby is shut down while the primary
+ is in a shutdown state, it might not be possible to re-enter Hot Standby
+ until the primary is started up, so that it generates further starting
+ points in the WAL logs. This situation isn't a problem in the most
+ common situations where it might happen. Generally, if the primary is
+ shut down and not available anymore, that's likely due to a serious
+ failure that requires the standby being converted to operate as
+ the new primary anyway. And in situations where the primary is
+ being intentionally taken down, coordinating to make sure the standby
+ becomes the new primary smoothly is also standard procedure.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ At the end of recovery, <literal>AccessExclusiveLocks</literal> held by prepared transactions
+ will require twice the normal number of lock table entries. If you plan
+ on running either a large number of concurrent prepared transactions
+ that normally take <literal>AccessExclusiveLocks</literal>, or you plan on having one
+ large transaction that takes many <literal>AccessExclusiveLocks</literal>, you are
+ advised to select a larger value of <varname>max_locks_per_transaction</varname>,
+ perhaps as much as twice the value of the parameter on
+ the primary server. You need not consider this at all if
+ your setting of <varname>max_prepared_transactions</varname> is 0.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ The Serializable transaction isolation level is not yet available in hot
+ standby. (See <xref linkend="xact-serializable"/> and
+ <xref linkend="serializable-consistency"/> for details.)
+ An attempt to set a transaction to the serializable isolation level in
+ hot standby mode will generate an error.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ </para>
+ </sect2>
+
+ </sect1>
+
+</chapter>