1 files changed, 558 insertions, 0 deletions
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
new file mode 100644
index 0000000..5d61a33
--- /dev/null
+++ b/doc/src/sgml/amcheck.sgml
@@ -0,0 +1,558 @@
+<!-- doc/src/sgml/amcheck.sgml -->
+
+<sect1 id="amcheck" xreflabel="amcheck">
+ <title>amcheck</title>
+
+ <indexterm zone="amcheck">
+  <primary>amcheck</primary>
+ </indexterm>
+
+ <para>
+  The <filename>amcheck</filename> module provides functions that allow you to
+  verify the logical consistency of the structure of relations.
+ </para>
+
+ <para>
+  The B-Tree checking functions verify various <emphasis>invariants</emphasis> in the
+  structure of the representation of particular relations.  The
+  correctness of the access method functions behind index scans and
+  other important operations relies on these invariants always
+  holding.  For example, certain functions verify, among other things,
+  that all B-Tree pages have items in <quote>logical</quote> order (e.g.,
+  for B-Tree indexes on <type>text</type>, index tuples should be in
+  collated lexical order).  If that particular invariant somehow fails
+  to hold, we can expect binary searches on the affected page to
+  incorrectly guide index scans, resulting in wrong answers to SQL
+  queries.  If the structure appears to be valid, no error is raised.
+ </para>
+ <para>
+  Verification is performed using the same procedures as those used by
+  index scans themselves, which may be user-defined operator class
+  code.  For example, B-Tree index verification relies on comparisons
+  made with one or more B-Tree support function 1 routines.  See <xref
+  linkend="xindex-support"/> for details of operator class support
+  functions.
+ </para>
+ <para>
+  Unlike the B-Tree checking functions which report corruption by raising
+  errors, the heap checking function <function>verify_heapam</function> checks
+  a table and attempts to return a set of rows, one row per corruption
+  detected.  Despite this, if facilities that
+  <function>verify_heapam</function> relies upon are themselves corrupted, the
+  function may be unable to continue and may instead raise an error.
+ </para>
+ <para>
+  Permission to execute <filename>amcheck</filename> functions may be granted
+  to non-superusers, but before granting such permissions careful consideration
+  should be given to data security and privacy concerns.  Although the
+  corruption reports generated by these functions do not focus on the contents
+  of the corrupted data so much as on the structure of that data and the nature
+  of the corruptions found, an attacker who gains permission to execute these
+  functions, particularly if the attacker can also induce corruption, might be
+  able to infer something of the data itself from such messages.
+ </para>
+
+ <sect2>
+  <title>Functions</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <function>bt_index_check(index regclass, heapallindexed boolean) returns void</function>
+     <indexterm>
+      <primary>bt_index_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_check</function> tests that its target, a
+      B-Tree index, respects a variety of invariants.  Example usage:
+<screen>
+test=# SELECT bt_index_check(index =&gt; c.oid, heapallindexed =&gt; i.indisunique),
+               c.relname,
+               c.relpages
+FROM pg_index i
+JOIN pg_opclass op ON i.indclass[0] = op.oid
+JOIN pg_am am ON op.opcmethod = am.oid
+JOIN pg_class c ON i.indexrelid = c.oid
+JOIN pg_namespace n ON c.relnamespace = n.oid
+WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
+-- Don't check temp tables, which may be from another session:
+AND c.relpersistence != 't'
+-- Function may throw an error when this is omitted:
+AND c.relkind = 'i' AND i.indisready AND i.indisvalid
+ORDER BY c.relpages DESC LIMIT 10;
+ bt_index_check |             relname             | relpages
+----------------+---------------------------------+----------
+                | pg_depend_reference_index       |       43
+                | pg_depend_depender_index        |       40
+                | pg_proc_proname_args_nsp_index  |       31
+                | pg_description_o_c_o_index      |       21
+                | pg_attribute_relid_attnam_index |       14
+                | pg_proc_oid_index               |       10
+                | pg_attribute_relid_attnum_index |        9
+                | pg_amproc_fam_proc_index        |        5
+                | pg_amop_opr_fam_index           |        5
+                | pg_amop_fam_strat_index         |        5
+(10 rows)
+</screen>
+      This example shows a session that performs verification of the
+      10 largest catalog indexes in the database <quote>test</quote>.
+      Verification of the presence of heap tuples as index tuples is
+      requested for the subset that are unique indexes.  Since no
+      error is raised, all indexes tested appear to be logically
+      consistent.  Naturally, this query could easily be changed to
+      call <function>bt_index_check</function> for every index in the
+      database where verification is supported.
+     </para>
+     <para>
+      <function>bt_index_check</function> acquires an <literal>AccessShareLock</literal>
+      on the target index and the heap relation it belongs to. This lock mode
+      is the same lock mode acquired on relations by simple
+      <literal>SELECT</literal> statements.
+      <function>bt_index_check</function> does not verify invariants
+      that span child/parent relationships, but will verify the
+      presence of all heap tuples as index tuples within the index
+      when <parameter>heapallindexed</parameter> is
+      <literal>true</literal>.  When a routine, lightweight test for
+      corruption is required in a live production environment, using
+      <function>bt_index_check</function> often provides the best
+      trade-off between thoroughness of verification and limiting the
+      impact on application performance and availability.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>bt_index_parent_check(index regclass, heapallindexed boolean, rootdescend boolean) returns void</function>
+     <indexterm>
+      <primary>bt_index_parent_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_parent_check</function> tests that its
+      target, a B-Tree index, respects a variety of invariants.
+      Optionally, when the <parameter>heapallindexed</parameter>
+      argument is <literal>true</literal>, the function verifies the
+      presence of all heap tuples that should be found within the
+      index.  When the optional <parameter>rootdescend</parameter>
+      argument is <literal>true</literal>, verification re-finds
+      tuples on the leaf level by performing a new search from the
+      root page for each tuple.  The checks that can be performed by
+      <function>bt_index_parent_check</function> are a superset of the
+      checks that can be performed by <function>bt_index_check</function>.
+      <function>bt_index_parent_check</function> can be thought of as
+      a more thorough variant of <function>bt_index_check</function>:
+      unlike <function>bt_index_check</function>,
+      <function>bt_index_parent_check</function> also checks
+      invariants that span parent/child relationships, including checking
+      that there are no missing downlinks in the index structure.
+      <function>bt_index_parent_check</function> follows the general
+      convention of raising an error if it finds a logical
+      inconsistency or other problem.
+     </para>
+     <para>
+      A <literal>ShareLock</literal> is required on the target index by
+      <function>bt_index_parent_check</function> (a
+      <literal>ShareLock</literal> is also acquired on the heap relation).
+      These locks prevent concurrent data modification from
+      <command>INSERT</command>, <command>UPDATE</command>, and <command>DELETE</command>
+      commands.  The locks also prevent the underlying relation from
+      being concurrently processed by <command>VACUUM</command>, as well as
+      all other utility commands.  Note that the function holds locks
+      only while running, not for the entire transaction.
+     </para>
+     <para>
+      <function>bt_index_parent_check</function>'s additional
+      verification is more likely to detect various pathological
+      cases.  These cases may involve an incorrectly implemented
+      B-Tree operator class used by the index that is checked, or,
+      hypothetically, undiscovered bugs in the underlying B-Tree index
+      access method code.  Note that
+      <function>bt_index_parent_check</function> cannot be used when
+      hot standby mode is enabled (i.e., on read-only physical
+      replicas), unlike <function>bt_index_check</function>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+  <tip>
+   <para>
+    <function>bt_index_check</function> and
+    <function>bt_index_parent_check</function> both output log
+    messages about the verification process at
+    <literal>DEBUG1</literal> and <literal>DEBUG2</literal> severity
+    levels.  These messages provide detailed information about the
+    verification process that may be of interest to
+    <productname>PostgreSQL</productname> developers.  Advanced users
+    may also find this information helpful, since it provides
+    additional context should verification actually detect an
+    inconsistency.  Running:
+<programlisting>
+SET client_min_messages = DEBUG1;
+</programlisting>
+    in an interactive <application>psql</application> session before
+    running a verification query will display messages about the
+    progress of verification with a manageable level of detail.
+   </para>
+  </tip>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <function>
+      verify_heapam(relation regclass,
+                    on_error_stop boolean,
+                    check_toast boolean,
+                    skip text,
+                    startblock bigint,
+                    endblock bigint,
+                    blkno OUT bigint,
+                    offnum OUT integer,
+                    attnum OUT integer,
+                    msg OUT text)
+      returns setof record
+     </function>
+    </term>
+    <listitem>
+     <para>
+      Checks a table, sequence, or materialized view for structural corruption,
+      where pages in the relation contain data that is invalidly formatted, and
+      for logical corruption, where pages are structurally valid but
+      inconsistent with the rest of the database cluster.
+     </para>
+     <para>
+      The following optional arguments are recognized:
+     </para>
+     <variablelist>
+      <varlistentry>
+       <term><literal>on_error_stop</literal></term>
+       <listitem>
+        <para>
+         If true, corruption checking stops at the end of the first block in
+         which any corruptions are found.
+        </para>
+        <para>
+         Defaults to false.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry>
+       <term><literal>check_toast</literal></term>
+       <listitem>
+        <para>
+         If true, toasted values are checked against the target relation's
+         TOAST table.
+        </para>
+        <para>
+         This option is known to be slow.  Also, if the toast table or its
+         index is corrupt, checking it against toast values could conceivably
+         crash the server, although in many cases this would just produce an
+         error.
+        </para>
+        <para>
+         Defaults to false.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry>
+       <term><literal>skip</literal></term>
+       <listitem>
+        <para>
+         If not <literal>none</literal>, corruption checking skips blocks that
+         are marked as all-visible or all-frozen, as specified.
+         Valid options are <literal>all-visible</literal>,
+         <literal>all-frozen</literal> and <literal>none</literal>.
+        </para>
+        <para>
+         Defaults to <literal>none</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry>
+       <term><literal>startblock</literal></term>
+       <listitem>
+        <para>
+         If specified, corruption checking begins at the specified block,
+         skipping all previous blocks.  It is an error to specify a
+         <parameter>startblock</parameter> outside the range of blocks in the
+         target table.
+        </para>
+        <para>
+         By default, checking begins at the first block.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry>
+       <term><literal>endblock</literal></term>
+       <listitem>
+        <para>
+         If specified, corruption checking ends at the specified block,
+         skipping all remaining blocks.  It is an error to specify an
+         <parameter>endblock</parameter> outside the range of blocks in the target
+         table.
+        </para>
+        <para>
+         By default, all blocks are checked.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+     <para>
+      For each corruption detected, <function>verify_heapam</function> returns
+      a row with the following columns:
+     </para>
+     <variablelist>
+      <varlistentry>
+       <term><literal>blkno</literal></term>
+       <listitem>
+        <para>
+         The number of the block containing the corrupt page.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry>
+       <term><literal>offnum</literal></term>
+       <listitem>
+        <para>
+         The OffsetNumber of the corrupt tuple.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry>
+       <term><literal>attnum</literal></term>
+       <listitem>
+        <para>
+         The attribute number of the corrupt column in the tuple, if the
+         corruption is specific to a column and not the tuple as a whole.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry>
+       <term><literal>msg</literal></term>
+       <listitem>
+        <para>
+         A message describing the problem detected.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Optional <parameter>heapallindexed</parameter> Verification</title>
+ <para>
+  When the <parameter>heapallindexed</parameter> argument to B-Tree
+  verification functions is <literal>true</literal>, an additional
+  phase of verification is performed against the table associated with
+  the target index relation.  This consists of a <quote>dummy</quote>
+  <command>CREATE INDEX</command> operation, which checks for the
+  presence of all hypothetical new index tuples against a temporary,
+  in-memory summarizing structure (this is built when needed during
+  the basic first phase of verification).  The summarizing structure
+  <quote>fingerprints</quote> every tuple found within the target
+  index.  The high level principle behind
+  <parameter>heapallindexed</parameter> verification is that a new
+  index that is equivalent to the existing, target index must only
+  have entries that can be found in the existing structure.
+ </para>
+ <para>
+  The additional <parameter>heapallindexed</parameter> phase adds
+  significant overhead: verification will typically take several times
+  longer.  However, there is no change to the relation-level locks
+  acquired when <parameter>heapallindexed</parameter> verification is
+  performed.
+ </para>
+ <para>
+  The summarizing structure is bound in size by
+  <varname>maintenance_work_mem</varname>.  In order to ensure that
+  there is no more than a 2% probability of failure to detect an
+  inconsistency for each heap tuple that should be represented in the
+  index, approximately 2 bytes of memory are needed per tuple.  As
+  less memory is made available per tuple, the probability of missing
+  an inconsistency slowly increases.  This approach limits the
+  overhead of verification significantly, while only slightly reducing
+  the probability of detecting a problem, especially for installations
+  where verification is treated as a routine maintenance task.  Any
+  single absent or malformed tuple has a new opportunity to be
+  detected with each new verification attempt.
+ </para>
+
+ </sect2>
+
+ <sect2>
+  <title>Using <filename>amcheck</filename> Effectively</title>
+
+ <para>
+  <filename>amcheck</filename> can be effective at detecting various types of
+  failure modes that <link
+  linkend="app-initdb-data-checksums"><application>data
+  checksums</application></link> will fail to catch.  These include:
+
+  <itemizedlist>
+   <listitem>
+    <para>
+     Structural inconsistencies caused by incorrect operator class
+     implementations.
+    </para>
+    <para>
+     This includes issues caused by the comparison rules of operating
+     system collations changing. Comparisons of datums of a collatable
+     type like <type>text</type> must be immutable (just as all
+     comparisons used for B-Tree index scans must be immutable), which
+     implies that operating system collation rules must never change.
+     Though rare, updates to operating system collation rules can
+     cause these issues. More commonly, an inconsistency in the
+     collation order between a primary server and a standby server is
+     implicated, possibly because the <emphasis>major</emphasis> operating
+     system version in use is inconsistent.  Such inconsistencies will
+     generally only arise on standby servers, and so can generally
+     only be detected on standby servers.
+    </para>
+    <para>
+     If a problem like this arises, it may not affect each individual
+     index that is ordered using an affected collation, simply because
+     <emphasis>indexed</emphasis> values might happen to have the same
+     absolute ordering regardless of the behavioral inconsistency. See
+     <xref linkend="locale"/> and <xref linkend="collation"/> for
+     further details about how <productname>PostgreSQL</productname> uses
+     operating system locales and collations.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Structural inconsistencies between indexes and the heap relations
+     that are indexed (when <parameter>heapallindexed</parameter>
+     verification is performed).
+    </para>
+    <para>
+     There is no cross-checking of indexes against their heap relation
+     during normal operation.  Symptoms of heap corruption can be subtle.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by hypothetical undiscovered bugs in the
+     underlying <productname>PostgreSQL</productname> access method
+     code, sort code, or transaction management code.
+    </para>
+    <para>
+     Automatic verification of the structural integrity of indexes
+     plays a role in the general testing of new or proposed
+     <productname>PostgreSQL</productname> features that could plausibly allow a
+     logical inconsistency to be introduced.  Verification of table
+     structure and associated visibility and transaction status
+     information plays a similar role.  One obvious testing strategy
+     is to call <filename>amcheck</filename> functions continuously
+     when running the standard regression tests.  See <xref
+     linkend="regress-run"/> for details on running the tests.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     File system or storage subsystem faults where checksums happen to
+     simply not be enabled.
+    </para>
+    <para>
+     Note that <filename>amcheck</filename> examines a page as represented in some
+     shared memory buffer at the time of verification if there is only a
+     shared buffer hit when accessing the block. Consequently,
+     <filename>amcheck</filename> does not necessarily examine data read from the
+     file system at the time of verification. Note that when checksums are
+     enabled, <filename>amcheck</filename> may raise an error due to a checksum
+     failure when a corrupt block is read into a buffer.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by faulty RAM, or the broader memory subsystem.
+    </para>
+    <para>
+     <productname>PostgreSQL</productname> does not protect against correctable
+     memory errors and it is assumed you will operate using RAM that
+     uses industry standard Error Correcting Codes (ECC) or better
+     protection.  However, ECC memory is typically only immune to
+     single-bit errors, and should not be assumed to provide
+     <emphasis>absolute</emphasis> protection against failures that
+     result in memory corruption.
+    </para>
+    <para>
+     When <parameter>heapallindexed</parameter> verification is
+     performed, there is generally a greatly increased chance of
+     detecting single-bit errors, since strict binary equality is
+     tested, and the indexed attributes within the heap are tested.
+    </para>
+   </listitem>
+  </itemizedlist>
+ </para>
+
+ <para>
+  Structural corruption can happen due to faulty storage hardware, or
+  relation files being overwritten or modified by unrelated software.
+  This kind of corruption can also be detected with
+  <link linkend="checksums"><application>data page
+  checksums</application></link>.
+ </para>
+
+ <para>
+  Relation pages which are correctly formatted, internally consistent, and
+  correct relative to their own internal checksums may still contain
+  logical corruption.  As such, this kind of corruption cannot be detected
+  with <application>checksums</application>.  Examples include toasted
+  values in the main table which lack a corresponding entry in the toast
+  table, and tuples in the main table with a Transaction ID that is older
+  than the oldest valid Transaction ID in the database or cluster.
+ </para>
+
+ <para>
+  Multiple causes of logical corruption have been observed in production
+  systems, including bugs in the <productname>PostgreSQL</productname>
+  server software, faulty and ill-conceived backup and restore tools, and
+  user error.
+ </para>
+
+ <para>
+  Corrupt relations are most concerning in live production environments,
+  precisely the same environments where high risk activities are least
+  welcome.  For this reason, <function>verify_heapam</function> has been
+  designed to diagnose corruption without undue risk.  It cannot guard
+  against all causes of backend crashes, as even executing the calling
+  query could be unsafe on a badly corrupted system.   Access to <link
+  linkend="catalogs-overview">catalog tables</link> is performed and could
+  be problematic if the catalogs themselves are corrupted.
+ </para>
+
+ <para>
+  In general, <filename>amcheck</filename> can only prove the presence of
+  corruption; it cannot prove its absence.
+ </para>
+
+ </sect2>
+ <sect2>
+  <title>Repairing Corruption</title>
+ <para>
+  No error concerning corruption raised by <filename>amcheck</filename> should
+  ever be a false positive.  <filename>amcheck</filename> raises
+  errors in the event of conditions that, by definition, should never
+  happen, and so careful analysis of <filename>amcheck</filename>
+  errors is often required.
+ </para>
+ <para>
+  There is no general method of repairing problems that
+  <filename>amcheck</filename> detects.  An explanation for the root cause of
+  an invariant violation should be sought.  <xref
+  linkend="pageinspect"/> may play a useful role in diagnosing
+  corruption that <filename>amcheck</filename> detects.  A <command>REINDEX</command>
+  may not be effective in repairing corruption.
+ </para>
+
+ </sect2>
+
+</sect1>