summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/btree.sgml
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src/sgml/btree.sgml')
-rw-r--r--doc/src/sgml/btree.sgml817
1 files changed, 817 insertions, 0 deletions
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
new file mode 100644
index 0000000..8d6eb75
--- /dev/null
+++ b/doc/src/sgml/btree.sgml
@@ -0,0 +1,817 @@
+<!-- doc/src/sgml/btree.sgml -->
+
+<chapter id="btree">
+<title>B-Tree Indexes</title>
+
+ <indexterm>
+ <primary>index</primary>
+ <secondary>B-Tree</secondary>
+ </indexterm>
+
+<sect1 id="btree-intro">
+ <title>Introduction</title>
+
+ <para>
+ <productname>PostgreSQL</productname> includes an implementation of the
+ standard <acronym>btree</acronym> (multi-way balanced tree) index data
+ structure. Any data type that can be sorted into a well-defined linear
+ order can be indexed by a btree index. The only limitation is that an
+ index entry cannot exceed approximately one-third of a page (after TOAST
+ compression, if applicable).
+ </para>
+
+ <para>
+ Because each btree operator class imposes a sort order on its data type,
+ btree operator classes (or, really, operator families) have come to be
+ used as <productname>PostgreSQL</productname>'s general representation
+ and understanding of sorting semantics. Therefore, they've acquired
+ some features that go beyond what would be needed just to support btree
+ indexes, and parts of the system that are quite distant from the
+ btree AM make use of them.
+ </para>
+
+</sect1>
+
+<sect1 id="btree-behavior">
+ <title>Behavior of B-Tree Operator Classes</title>
+
+ <para>
+ As shown in <xref linkend="xindex-btree-strat-table"/>, a btree operator
+ class must provide five comparison operators,
+ <literal>&lt;</literal>,
+ <literal>&lt;=</literal>,
+ <literal>=</literal>,
+ <literal>&gt;=</literal> and
+ <literal>&gt;</literal>.
+ One might expect that <literal>&lt;&gt;</literal> should also be part of
+ the operator class, but it is not, because it would almost never be
+ useful to use a <literal>&lt;&gt;</literal> WHERE clause in an index
+ search. (For some purposes, the planner treats <literal>&lt;&gt;</literal>
+ as associated with a btree operator class; but it finds that operator via
+ the <literal>=</literal> operator's negator link, rather than
+ from <structname>pg_amop</structname>.)
+ </para>
+
+ <para>
+ When several data types share near-identical sorting semantics, their
+ operator classes can be grouped into an operator family. Doing so is
+ advantageous because it allows the planner to make deductions about
+ cross-type comparisons. Each operator class within the family should
+ contain the single-type operators (and associated support functions)
+ for its input data type, while cross-type comparison operators and
+ support functions are <quote>loose</quote> in the family. It is
+ recommendable that a complete set of cross-type operators be included
+ in the family, thus ensuring that the planner can represent any
+ comparison conditions that it deduces from transitivity.
+ </para>
+
+ <para>
+ There are some basic assumptions that a btree operator family must
+ satisfy:
+ </para>
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ An <literal>=</literal> operator must be an equivalence relation; that
+ is, for all non-null values <replaceable>A</replaceable>,
+ <replaceable>B</replaceable>, <replaceable>C</replaceable> of the
+ data type:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <replaceable>A</replaceable> <literal>=</literal>
+ <replaceable>A</replaceable> is true
+ (<firstterm>reflexive law</firstterm>)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ if <replaceable>A</replaceable> <literal>=</literal>
+ <replaceable>B</replaceable>,
+ then <replaceable>B</replaceable> <literal>=</literal>
+ <replaceable>A</replaceable>
+ (<firstterm>symmetric law</firstterm>)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ if <replaceable>A</replaceable> <literal>=</literal>
+ <replaceable>B</replaceable> and <replaceable>B</replaceable>
+ <literal>=</literal> <replaceable>C</replaceable>,
+ then <replaceable>A</replaceable> <literal>=</literal>
+ <replaceable>C</replaceable>
+ (<firstterm>transitive law</firstterm>)
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A <literal>&lt;</literal> operator must be a strong ordering relation;
+ that is, for all non-null values <replaceable>A</replaceable>,
+ <replaceable>B</replaceable>, <replaceable>C</replaceable>:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <replaceable>A</replaceable> <literal>&lt;</literal>
+ <replaceable>A</replaceable> is false
+ (<firstterm>irreflexive law</firstterm>)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ if <replaceable>A</replaceable> <literal>&lt;</literal>
+ <replaceable>B</replaceable>
+ and <replaceable>B</replaceable> <literal>&lt;</literal>
+ <replaceable>C</replaceable>,
+ then <replaceable>A</replaceable> <literal>&lt;</literal>
+ <replaceable>C</replaceable>
+ (<firstterm>transitive law</firstterm>)
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Furthermore, the ordering is total; that is, for all non-null
+ values <replaceable>A</replaceable>, <replaceable>B</replaceable>:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ exactly one of <replaceable>A</replaceable> <literal>&lt;</literal>
+ <replaceable>B</replaceable>, <replaceable>A</replaceable>
+ <literal>=</literal> <replaceable>B</replaceable>, and
+ <replaceable>B</replaceable> <literal>&lt;</literal>
+ <replaceable>A</replaceable> is true
+ (<firstterm>trichotomy law</firstterm>)
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ (The trichotomy law justifies the definition of the comparison support
+ function, of course.)
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ The other three operators are defined in terms of <literal>=</literal>
+ and <literal>&lt;</literal> in the obvious way, and must act consistently
+ with them.
+ </para>
+
+ <para>
+ For an operator family supporting multiple data types, the above laws must
+ hold when <replaceable>A</replaceable>, <replaceable>B</replaceable>,
+ <replaceable>C</replaceable> are taken from any data types in the family.
+ The transitive laws are the trickiest to ensure, as in cross-type
+ situations they represent statements that the behaviors of two or three
+ different operators are consistent.
+ As an example, it would not work to put <type>float8</type>
+ and <type>numeric</type> into the same operator family, at least not with
+ the current semantics that <type>numeric</type> values are converted
+ to <type>float8</type> for comparison to a <type>float8</type>. Because
+ of the limited accuracy of <type>float8</type>, this means there are
+ distinct <type>numeric</type> values that will compare equal to the
+ same <type>float8</type> value, and thus the transitive law would fail.
+ </para>
+
+ <para>
+ Another requirement for a multiple-data-type family is that any implicit
+ or binary-coercion casts that are defined between data types included in
+ the operator family must not change the associated sort ordering.
+ </para>
+
+ <para>
+ It should be fairly clear why a btree index requires these laws to hold
+ within a single data type: without them there is no ordering to arrange
+ the keys with. Also, index searches using a comparison key of a
+ different data type require comparisons to behave sanely across two
+ data types. The extensions to three or more data types within a family
+ are not strictly required by the btree index mechanism itself, but the
+ planner relies on them for optimization purposes.
+ </para>
+
+</sect1>
+
+<sect1 id="btree-support-funcs">
+ <title>B-Tree Support Functions</title>
+
+ <para>
+ As shown in <xref linkend="xindex-btree-support-table"/>, btree defines
+ one required and four optional support functions. The five
+ user-defined methods are:
+ </para>
+ <variablelist>
+ <varlistentry>
+ <term><function>order</function></term>
+ <listitem>
+ <para>
+ For each combination of data types that a btree operator family
+ provides comparison operators for, it must provide a comparison
+ support function, registered in
+ <structname>pg_amproc</structname> with support function number 1
+ and
+ <structfield>amproclefttype</structfield>/<structfield>amprocrighttype</structfield>
+ equal to the left and right data types for the comparison (i.e.,
+ the same data types that the matching operators are registered
+ with in <structname>pg_amop</structname>). The comparison
+ function must take two non-null values
+ <replaceable>A</replaceable> and <replaceable>B</replaceable> and
+ return an <type>int32</type> value that is
+ <literal>&lt;</literal> <literal>0</literal>,
+ <literal>0</literal>, or <literal>&gt;</literal>
+ <literal>0</literal> when <replaceable>A</replaceable>
+ <literal>&lt;</literal> <replaceable>B</replaceable>,
+ <replaceable>A</replaceable> <literal>=</literal>
+ <replaceable>B</replaceable>, or <replaceable>A</replaceable>
+ <literal>&gt;</literal> <replaceable>B</replaceable>,
+ respectively. A null result is disallowed: all values of the
+ data type must be comparable. See
+ <filename>src/backend/access/nbtree/nbtcompare.c</filename> for
+ examples.
+ </para>
+
+ <para>
+ If the compared values are of a collatable data type, the
+ appropriate collation OID will be passed to the comparison
+ support function, using the standard
+ <function>PG_GET_COLLATION()</function> mechanism.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><function>sortsupport</function></term>
+ <listitem>
+ <para>
+ Optionally, a btree operator family may provide <firstterm>sort
+ support</firstterm> function(s), registered under support
+ function number 2. These functions allow implementing
+ comparisons for sorting purposes in a more efficient way than
+ naively calling the comparison support function. The APIs
+ involved in this are defined in
+ <filename>src/include/utils/sortsupport.h</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><function>in_range</function></term>
+ <listitem>
+ <indexterm>
+ <primary>in_range support functions</primary>
+ </indexterm>
+
+ <indexterm>
+ <primary>support functions</primary>
+ <secondary>in_range</secondary>
+ </indexterm>
+ <para>
+ Optionally, a btree operator family may provide
+ <firstterm>in_range</firstterm> support function(s), registered
+ under support function number 3. These are not used during btree
+ index operations; rather, they extend the semantics of the
+ operator family so that it can support window clauses containing
+ the <literal>RANGE</literal> <replaceable>offset</replaceable>
+ <literal>PRECEDING</literal> and <literal>RANGE</literal>
+ <replaceable>offset</replaceable> <literal>FOLLOWING</literal>
+ frame bound types (see <xref
+ linkend="syntax-window-functions"/>). Fundamentally, the extra
+ information provided is how to add or subtract an
+ <replaceable>offset</replaceable> value in a way that is
+ compatible with the family's data ordering.
+ </para>
+
+ <para>
+ An <function>in_range</function> function must have the signature
+<synopsis>
+in_range(<replaceable>val</replaceable> type1, <replaceable>base</replaceable> type1, <replaceable>offset</replaceable> type2, <replaceable>sub</replaceable> bool, <replaceable>less</replaceable> bool)
+returns bool
+</synopsis>
+ <replaceable>val</replaceable> and
+ <replaceable>base</replaceable> must be of the same type, which
+ is one of the types supported by the operator family (i.e., a
+ type for which it provides an ordering). However,
+ <replaceable>offset</replaceable> could be of a different type,
+ which might be one otherwise unsupported by the family. An
+ example is that the built-in <literal>time_ops</literal> family
+ provides an <function>in_range</function> function that has
+ <replaceable>offset</replaceable> of type <type>interval</type>.
+ A family can provide <function>in_range</function> functions for
+ any of its supported types and one or more
+ <replaceable>offset</replaceable> types. Each
+ <function>in_range</function> function should be entered in
+ <structname>pg_amproc</structname> with
+ <structfield>amproclefttype</structfield> equal to
+ <type>type1</type> and <structfield>amprocrighttype</structfield>
+ equal to <type>type2</type>.
+ </para>
+
+ <para>
+ The essential semantics of an <function>in_range</function>
+ function depend on the two Boolean flag parameters. It should
+ add or subtract <replaceable>base</replaceable> and
+ <replaceable>offset</replaceable>, then compare
+ <replaceable>val</replaceable> to the result, as follows:
+ <itemizedlist>
+ <listitem>
+ <para>
+ if <literal>!</literal><replaceable>sub</replaceable> and
+ <literal>!</literal><replaceable>less</replaceable>, return
+ <replaceable>val</replaceable> <literal>&gt;=</literal>
+ (<replaceable>base</replaceable> <literal>+</literal>
+ <replaceable>offset</replaceable>)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ if <literal>!</literal><replaceable>sub</replaceable> and
+ <replaceable>less</replaceable>, return
+ <replaceable>val</replaceable> <literal>&lt;=</literal>
+ (<replaceable>base</replaceable> <literal>+</literal>
+ <replaceable>offset</replaceable>)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ if <replaceable>sub</replaceable> and
+ <literal>!</literal><replaceable>less</replaceable>, return
+ <replaceable>val</replaceable> <literal>&gt;=</literal>
+ (<replaceable>base</replaceable> <literal>-</literal>
+ <replaceable>offset</replaceable>)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ if <replaceable>sub</replaceable> and
+ <replaceable>less</replaceable>, return
+ <replaceable>val</replaceable> <literal>&lt;=</literal>
+ (<replaceable>base</replaceable> <literal>-</literal>
+ <replaceable>offset</replaceable>)
+ </para>
+ </listitem>
+ </itemizedlist>
+ Before doing so, the function should check the sign of
+ <replaceable>offset</replaceable>: if it is less than zero, raise
+ error
+ <literal>ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE</literal>
+ (22013) with error text like <quote>invalid preceding or
+ following size in window function</quote>. (This is required by
+ the SQL standard, although nonstandard operator families might
+ perhaps choose to ignore this restriction, since there seems to
+ be little semantic necessity for it.) This requirement is
+ delegated to the <function>in_range</function> function so that
+ the core code needn't understand what <quote>less than
+ zero</quote> means for a particular data type.
+ </para>
+
+ <para>
+ An additional expectation is that <function>in_range</function>
+ functions should, if practical, avoid throwing an error if
+ <replaceable>base</replaceable> <literal>+</literal>
+ <replaceable>offset</replaceable> or
+ <replaceable>base</replaceable> <literal>-</literal>
+ <replaceable>offset</replaceable> would overflow. The correct
+ comparison result can be determined even if that value would be
+ out of the data type's range. Note that if the data type
+ includes concepts such as <quote>infinity</quote> or
+ <quote>NaN</quote>, extra care may be needed to ensure that
+ <function>in_range</function>'s results agree with the normal
+ sort order of the operator family.
+ </para>
+
+ <para>
+ The results of the <function>in_range</function> function must be
+ consistent with the sort ordering imposed by the operator family.
+ To be precise, given any fixed values of
+ <replaceable>offset</replaceable> and
+ <replaceable>sub</replaceable>, then:
+ <itemizedlist>
+ <listitem>
+ <para>
+ If <function>in_range</function> with
+ <replaceable>less</replaceable> = true is true for some
+ <replaceable>val1</replaceable> and
+ <replaceable>base</replaceable>, it must be true for every
+ <replaceable>val2</replaceable> <literal>&lt;=</literal>
+ <replaceable>val1</replaceable> with the same
+ <replaceable>base</replaceable>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ If <function>in_range</function> with
+ <replaceable>less</replaceable> = true is false for some
+ <replaceable>val1</replaceable> and
+ <replaceable>base</replaceable>, it must be false for every
+ <replaceable>val2</replaceable> <literal>&gt;=</literal>
+ <replaceable>val1</replaceable> with the same
+ <replaceable>base</replaceable>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ If <function>in_range</function> with
+ <replaceable>less</replaceable> = true is true for some
+ <replaceable>val</replaceable> and
+ <replaceable>base1</replaceable>, it must be true for every
+ <replaceable>base2</replaceable> <literal>&gt;=</literal>
+ <replaceable>base1</replaceable> with the same
+ <replaceable>val</replaceable>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ If <function>in_range</function> with
+ <replaceable>less</replaceable> = true is false for some
+ <replaceable>val</replaceable> and
+ <replaceable>base1</replaceable>, it must be false for every
+ <replaceable>base2</replaceable> <literal>&lt;=</literal>
+ <replaceable>base1</replaceable> with the same
+ <replaceable>val</replaceable>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ Analogous statements with inverted conditions hold when
+ <replaceable>less</replaceable> = false.
+ </para>
+
+ <para>
+ If the type being ordered (<type>type1</type>) is collatable, the
+ appropriate collation OID will be passed to the
+ <function>in_range</function> function, using the standard
+ PG_GET_COLLATION() mechanism.
+ </para>
+
+ <para>
+ <function>in_range</function> functions need not handle NULL
+ inputs, and typically will be marked strict.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><function>equalimage</function></term>
+ <listitem>
+ <para>
+ Optionally, a btree operator family may provide
+ <function>equalimage</function> (<quote>equality implies image
+ equality</quote>) support functions, registered under support
+ function number 4. These functions allow the core code to
+ determine when it is safe to apply the btree deduplication
+ optimization. Currently, <function>equalimage</function>
+ functions are only called when building or rebuilding an index.
+ </para>
+ <para>
+ An <function>equalimage</function> function must have the
+ signature
+<synopsis>
+equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
+</synopsis>
+ The return value is static information about an operator class
+ and collation. Returning <literal>true</literal> indicates that
+ the <function>order</function> function for the operator class is
+ guaranteed to only return <literal>0</literal> (<quote>arguments
+ are equal</quote>) when its <replaceable>A</replaceable> and
+ <replaceable>B</replaceable> arguments are also interchangeable
+ without any loss of semantic information. Not registering an
+ <function>equalimage</function> function or returning
+ <literal>false</literal> indicates that this condition cannot be
+ assumed to hold.
+ </para>
+ <para>
+ The <replaceable>opcintype</replaceable> argument is the
+ <literal><structname>pg_type</structname>.oid</literal> of the
+ data type that the operator class indexes. This is a convenience
+ that allows reuse of the same underlying
+ <function>equalimage</function> function across operator classes.
+ If <replaceable>opcintype</replaceable> is a collatable data
+ type, the appropriate collation OID will be passed to the
+ <function>equalimage</function> function, using the standard
+ <function>PG_GET_COLLATION()</function> mechanism.
+ </para>
+ <para>
+ As far as the operator class is concerned, returning
+ <literal>true</literal> indicates that deduplication is safe (or
+ safe for the collation whose OID was passed to its
+ <function>equalimage</function> function). However, the core
+ code will only deem deduplication safe for an index when
+ <emphasis>every</emphasis> indexed column uses an operator class
+ that registers an <function>equalimage</function> function, and
+ each function actually returns <literal>true</literal> when
+ called.
+ </para>
+ <para>
+ Image equality is <emphasis>almost</emphasis> the same condition
+ as simple bitwise equality. There is one subtle difference: When
+ indexing a varlena data type, the on-disk representation of two
+ image equal datums may not be bitwise equal due to inconsistent
+ application of <acronym>TOAST</acronym> compression on input.
+ Formally, when an operator class's
+ <function>equalimage</function> function returns
+ <literal>true</literal>, it is safe to assume that the
+ <literal>datum_image_eq()</literal> C function will always agree
+ with the operator class's <function>order</function> function
+ (provided that the same collation OID is passed to both the
+ <function>equalimage</function> and <function>order</function>
+ functions).
+ </para>
+ <para>
+ The core code is fundamentally unable to deduce anything about
+ the <quote>equality implies image equality</quote> status of an
+ operator class within a multiple-data-type family based on
+ details from other operator classes in the same family. Also, it
+ is not sensible for an operator family to register a cross-type
+ <function>equalimage</function> function, and attempting to do so
+ will result in an error. This is because <quote>equality implies
+ image equality</quote> status does not just depend on
+ sorting/equality semantics, which are more or less defined at the
+ operator family level. In general, the semantics that one
+ particular data type implements must be considered separately.
+ </para>
+ <para>
+ The convention followed by the operator classes included with the
+ core <productname>PostgreSQL</productname> distribution is to
+ register a stock, generic <function>equalimage</function>
+ function. Most operator classes register
+ <function>btequalimage()</function>, which indicates that
+ deduplication is safe unconditionally. Operator classes for
+ collatable data types such as <type>text</type> register
+ <function>btvarstrequalimage()</function>, which indicates that
+ deduplication is safe with deterministic collations. Best
+ practice for third-party extensions is to register their own
+ custom function to retain control.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><function>options</function></term>
+ <listitem>
+ <para>
+ Optionally, a B-tree operator family may provide
+ <function>options</function> (<quote>operator class specific
+ options</quote>) support functions, registered under support
+ function number 5. These functions define a set of user-visible
+ parameters that control operator class behavior.
+ </para>
+ <para>
+ An <function>options</function> support function must have the
+ signature
+<synopsis>
+options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns void
+</synopsis>
+ The function is passed a pointer to a <replaceable>local_relopts</replaceable>
+ struct, which needs to be filled with a set of operator class
+ specific options. The options can be accessed from other support
+ functions using the <literal>PG_HAS_OPCLASS_OPTIONS()</literal> and
+ <literal>PG_GET_OPCLASS_OPTIONS()</literal> macros.
+ </para>
+ <para>
+ Currently, no B-Tree operator class has an <function>options</function>
+ support function. B-tree doesn't allow flexible representation of keys
+ like GiST, SP-GiST, GIN and BRIN do. So, <function>options</function>
+ probably doesn't have much application in the current B-tree index
+ access method. Nevertheless, this support function was added to B-tree
+ for uniformity, and will probably find uses during further
+ evolution of B-tree in <productname>PostgreSQL</productname>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+</sect1>
+
+<sect1 id="btree-implementation">
+ <title>Implementation</title>
+
+ <para>
+ This section covers B-Tree index implementation details that may be
+ of use to advanced users. See
+ <filename>src/backend/access/nbtree/README</filename> in the source
+ distribution for a much more detailed, internals-focused description
+ of the B-Tree implementation.
+ </para>
+ <sect2 id="btree-structure">
+ <title>B-Tree Structure</title>
+ <para>
+ <productname>PostgreSQL</productname> B-Tree indexes are
+ multi-level tree structures, where each level of the tree can be
+ used as a doubly-linked list of pages. A single metapage is stored
+ in a fixed position at the start of the first segment file of the
+ index. All other pages are either leaf pages or internal pages.
+ Leaf pages are the pages on the lowest level of the tree. All
+ other levels consist of internal pages. Each leaf page contains
+ tuples that point to table rows. Each internal page contains
+ tuples that point to the next level down in the tree. Typically,
+ over 99% of all pages are leaf pages. Both internal pages and leaf
+ pages use the standard page format described in <xref
+ linkend="storage-page-layout"/>.
+ </para>
+ <para>
+ New leaf pages are added to a B-Tree index when an existing leaf
+ page cannot fit an incoming tuple. A <firstterm>page
+ split</firstterm> operation makes room for items that originally
+ belonged on the overflowing page by moving a portion of the items
+ to a new page. Page splits must also insert a new
+ <firstterm>downlink</firstterm> to the new page in the parent page,
+ which may cause the parent to split in turn. Page splits
+ <quote>cascade upwards</quote> in a recursive fashion. When the
+ root page finally cannot fit a new downlink, a <firstterm>root page
+ split</firstterm> operation takes place. This adds a new level to
+ the tree structure by creating a new root page that is one level
+ above the original root page.
+ </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication">
+ <title>Deduplication</title>
+ <para>
+ A duplicate is a leaf page tuple (a tuple that points to a table
+ row) where <emphasis>all</emphasis> indexed key columns have values
+ that match corresponding column values from at least one other leaf
+ page tuple in the same index. Duplicate tuples are quite common in
+ practice. B-Tree indexes can use a special, space-efficient
+ representation for duplicates when an optional technique is
+ enabled: <firstterm>deduplication</firstterm>.
+ </para>
+ <para>
+ Deduplication works by periodically merging groups of duplicate
+ tuples together, forming a single <firstterm>posting list</firstterm> tuple for each
+ group. The column key value(s) only appear once in this
+ representation. This is followed by a sorted array of
+ <acronym>TID</acronym>s that point to rows in the table. This
+ significantly reduces the storage size of indexes where each value
+ (or each distinct combination of column values) appears several
+ times on average. The latency of queries can be reduced
+ significantly. Overall query throughput may increase
+ significantly. The overhead of routine index vacuuming may also be
+ reduced significantly.
+ </para>
+ <note>
+ <para>
+ B-Tree deduplication is just as effective with
+ <quote>duplicates</quote> that contain a NULL value, even though
+ NULL values are never equal to each other according to the
+ <literal>=</literal> member of any B-Tree operator class. As far
+ as any part of the implementation that understands the on-disk
+ B-Tree structure is concerned, NULL is just another value from the
+ domain of indexed values.
+ </para>
+ </note>
+ <para>
+ The deduplication process occurs lazily, when a new item is
+ inserted that cannot fit on an existing leaf page. This prevents
+ (or at least delays) leaf page splits. Unlike GIN posting list
+ tuples, B-Tree posting list tuples do not need to expand every time
+ a new duplicate is inserted; they are merely an alternative
+ physical representation of the original logical contents of the
+ leaf page. This design prioritizes consistent performance with
+ mixed read-write workloads. Most client applications will at least
+ see a moderate performance benefit from using deduplication.
+ Deduplication is enabled by default.
+ </para>
+ <para>
+ <command>CREATE INDEX</command> and <command>REINDEX</command>
+ apply deduplication to create posting list tuples, though the
+ strategy they use is slightly different. Each group of duplicate
+ ordinary tuples encountered in the sorted input taken from the
+ table is merged into a posting list tuple
+ <emphasis>before</emphasis> being added to the current pending leaf
+ page. Individual posting list tuples are packed with as many
+ <acronym>TID</acronym>s as possible. Leaf pages are written out in
+ the usual way, without any separate deduplication pass. This
+ strategy is well-suited to <command>CREATE INDEX</command> and
+ <command>REINDEX</command> because they are once-off batch
+ operations.
+ </para>
+ <para>
+ Write-heavy workloads that don't benefit from deduplication due to
+ having few or no duplicate values in indexes will incur a small,
+ fixed performance penalty (unless deduplication is explicitly
+ disabled). The <literal>deduplicate_items</literal> storage
+ parameter can be used to disable deduplication within individual
+ indexes. There is never any performance penalty with read-only
+ workloads, since reading posting list tuples is at least as
+ efficient as reading the standard tuple representation. Disabling
+ deduplication isn't usually helpful.
+ </para>
+ <para>
+ B-Tree indexes are not directly aware that under MVCC, there might
+ be multiple extant versions of the same logical table row; to an
+ index, each tuple is an independent object that needs its own index
+ entry. <quote>Version duplicates</quote> may sometimes accumulate
+ and adversely affect query latency and throughput. This typically
+ occurs with <command>UPDATE</command>-heavy workloads where most
+ individual updates cannot apply the <acronym>HOT</acronym>
+ optimization (often because at least one indexed column gets
+ modified, necessitating a new set of index tuple versions &mdash;
+ one new tuple for <emphasis>each and every</emphasis> index). In
+ effect, B-Tree deduplication ameliorates index bloat caused by
+ version churn. Note that even the tuples from a unique index are
+ not necessarily <emphasis>physically</emphasis> unique when stored
+ on disk due to version churn. The deduplication optimization is
+ selectively applied within unique indexes. It targets those pages
+ that appear to have version duplicates. The high level goal is to
+ give <command>VACUUM</command> more time to run before an
+ <quote>unnecessary</quote> page split caused by version churn can
+ take place.
+ </para>
+ <tip>
+ <para>
+ A special heuristic is applied to determine whether a
+ deduplication pass in a unique index should take place. It can
+ often skip straight to splitting a leaf page, avoiding a
+ performance penalty from wasting cycles on unhelpful deduplication
+ passes. If you're concerned about the overhead of deduplication,
+ consider setting <literal>deduplicate_items = off</literal>
+ selectively. Leaving deduplication enabled in unique indexes has
+ little downside.
+ </para>
+ </tip>
+ <para>
+ Deduplication cannot be used in all cases due to
+ implementation-level restrictions. Deduplication safety is
+ determined when <command>CREATE INDEX</command> or
+ <command>REINDEX</command> is run.
+ </para>
+ <para>
+ Note that deduplication is deemed unsafe and cannot be used in the
+ following cases involving semantically significant differences
+ among equal datums:
+ </para>
+ <para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <type>text</type>, <type>varchar</type>, and <type>char</type>
+ cannot use deduplication when a
+ <emphasis>nondeterministic</emphasis> collation is used. Case
+ and accent differences must be preserved among equal datums.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <type>numeric</type> cannot use deduplication. Numeric display
+ scale must be preserved among equal datums.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <type>jsonb</type> cannot use deduplication, since the
+ <type>jsonb</type> B-Tree operator class uses
+ <type>numeric</type> internally.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <type>float4</type> and <type>float8</type> cannot use
+ deduplication. These types have distinct representations for
+ <literal>-0</literal> and <literal>0</literal>, which are
+ nevertheless considered equal. This difference must be
+ preserved.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ There is one further implementation-level restriction that may be
+ lifted in a future version of
+ <productname>PostgreSQL</productname>:
+ </para>
+ <para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Container types (such as composite types, arrays, or range
+ types) cannot use deduplication.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ There is one further implementation-level restriction that applies
+ regardless of the operator class or collation used:
+ </para>
+ <para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INCLUDE</literal> indexes can never use deduplication.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ </sect2>
+</sect1>
+
+</chapter>