summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/storage.sgml
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src/sgml/storage.sgml')
-rw-r--r--doc/src/sgml/storage.sgml1146
1 files changed, 1146 insertions, 0 deletions
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
new file mode 100644
index 0000000..3ea4e55
--- /dev/null
+++ b/doc/src/sgml/storage.sgml
@@ -0,0 +1,1146 @@
+<!-- doc/src/sgml/storage.sgml -->
+
+<chapter id="storage">
+
+<title>Database Physical Storage</title>
+
+<para>
+This chapter provides an overview of the physical storage format used by
+<productname>PostgreSQL</productname> databases.
+</para>
+
+<sect1 id="storage-file-layout">
+
+<title>Database File Layout</title>
+
+<para>
+This section describes the storage format at the level of files and
+directories.
+</para>
+
+<para>
+Traditionally, the configuration and data files used by a database
+cluster are stored together within the cluster's data
+directory, commonly referred to as <varname>PGDATA</varname> (after the name of the
+environment variable that can be used to define it). A common location for
+<varname>PGDATA</varname> is <filename>/var/lib/pgsql/data</filename>. Multiple clusters,
+managed by different server instances, can exist on the same machine.
+</para>
+
+<para>
+The <varname>PGDATA</varname> directory contains several subdirectories and control
+files, as shown in <xref linkend="pgdata-contents-table"/>. In addition to
+these required items, the cluster configuration files
+<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
+<filename>pg_ident.conf</filename> are traditionally stored in
+<varname>PGDATA</varname>, although it is possible to place them elsewhere.
+</para>
+
+<table tocentry="1" id="pgdata-contents-table">
+<title>Contents of <varname>PGDATA</varname></title>
+<tgroup cols="2">
+<thead>
+<row>
+<entry>
+Item
+</entry>
+<entry>Description</entry>
+</row>
+</thead>
+
+<tbody>
+
+<row>
+ <entry><filename>PG_VERSION</filename></entry>
+ <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
+</row>
+
+<row>
+ <entry><filename>base</filename></entry>
+ <entry>Subdirectory containing per-database subdirectories</entry>
+</row>
+
+<row>
+ <entry><filename>current_logfiles</filename></entry>
+ <entry>File recording the log file(s) currently written to by the logging
+ collector</entry>
+</row>
+
+<row>
+ <entry><filename>global</filename></entry>
+ <entry>Subdirectory containing cluster-wide tables, such as
+ <structname>pg_database</structname></entry>
+</row>
+
+<row>
+ <entry><filename>pg_commit_ts</filename></entry>
+ <entry>Subdirectory containing transaction commit timestamp data</entry>
+</row>
+
+<row>
+ <entry><filename>pg_dynshmem</filename></entry>
+ <entry>Subdirectory containing files used by the dynamic shared memory
+ subsystem</entry>
+</row>
+
+<row>
+ <entry><filename>pg_logical</filename></entry>
+ <entry>Subdirectory containing status data for logical decoding</entry>
+</row>
+
+<row>
+ <entry><filename>pg_multixact</filename></entry>
+ <entry>Subdirectory containing multitransaction status data
+ (used for shared row locks)</entry>
+</row>
+
+<row>
+ <entry><filename>pg_notify</filename></entry>
+ <entry>Subdirectory containing LISTEN/NOTIFY status data</entry>
+</row>
+
+<row>
+ <entry><filename>pg_replslot</filename></entry>
+ <entry>Subdirectory containing replication slot data</entry>
+</row>
+
+<row>
+ <entry><filename>pg_serial</filename></entry>
+ <entry>Subdirectory containing information about committed serializable transactions</entry>
+</row>
+
+<row>
+ <entry><filename>pg_snapshots</filename></entry>
+ <entry>Subdirectory containing exported snapshots</entry>
+</row>
+
+<row>
+ <entry><filename>pg_stat</filename></entry>
+ <entry>Subdirectory containing permanent files for the statistics
+ subsystem</entry>
+</row>
+
+<row>
+ <entry><filename>pg_stat_tmp</filename></entry>
+ <entry>Subdirectory containing temporary files for the statistics
+ subsystem</entry>
+</row>
+
+<row>
+ <entry><filename>pg_subtrans</filename></entry>
+ <entry>Subdirectory containing subtransaction status data</entry>
+</row>
+
+<row>
+ <entry><filename>pg_tblspc</filename></entry>
+ <entry>Subdirectory containing symbolic links to tablespaces</entry>
+</row>
+
+<row>
+ <entry><filename>pg_twophase</filename></entry>
+ <entry>Subdirectory containing state files for prepared transactions</entry>
+</row>
+
+<row>
+ <entry><filename>pg_wal</filename></entry>
+ <entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
+</row>
+
+<row>
+ <entry><filename>pg_xact</filename></entry>
+ <entry>Subdirectory containing transaction commit status data</entry>
+</row>
+
+<row>
+ <entry><filename>postgresql.auto.conf</filename></entry>
+ <entry>A file used for storing configuration parameters that are set by
+<command>ALTER SYSTEM</command></entry>
+</row>
+
+<row>
+ <entry><filename>postmaster.opts</filename></entry>
+ <entry>A file recording the command-line options the server was
+last started with</entry>
+</row>
+
+<row>
+ <entry><filename>postmaster.pid</filename></entry>
+ <entry>A lock file recording the current postmaster process ID (PID),
+ cluster data directory path,
+ postmaster start timestamp,
+ port number,
+ Unix-domain socket directory path (could be empty),
+ first valid listen_address (IP address or <literal>*</literal>, or empty if
+ not listening on TCP),
+ and shared memory segment ID
+ (this file is not present after server shutdown)</entry>
+</row>
+
+</tbody>
+</tgroup>
+</table>
+
+<para>
+For each database in the cluster there is a subdirectory within
+<varname>PGDATA</varname><filename>/base</filename>, named after the database's OID in
+<structname>pg_database</structname>. This subdirectory is the default location
+for the database's files; in particular, its system catalogs are stored
+there.
+</para>
+
+<para>
+ Note that the following sections describe the behavior of the builtin
+ <literal>heap</literal> <link linkend="tableam">table access method</link>,
+ and the builtin <link linkend="indexam">index access methods</link>. Due
+ to the extensible nature of <productname>PostgreSQL</productname>, other
+ access methods might work differently.
+</para>
+
+<para>
+Each table and index is stored in a separate file. For ordinary relations,
+these files are named after the table or index's <firstterm>filenode</firstterm> number,
+which can be found in <structname>pg_class</structname>.<structfield>relfilenode</structfield>. But
+for temporary relations, the file name is of the form
+<literal>t<replaceable>BBB</replaceable>_<replaceable>FFF</replaceable></literal>, where <replaceable>BBB</replaceable>
+is the backend ID of the backend which created the file, and <replaceable>FFF</replaceable>
+is the filenode number. In either case, in addition to the main file (a/k/a
+main fork), each table and index has a <firstterm>free space map</firstterm> (see <xref
+linkend="storage-fsm"/>), which stores information about free space available in
+the relation. The free space map is stored in a file named with the filenode
+number plus the suffix <literal>_fsm</literal>. Tables also have a
+<firstterm>visibility map</firstterm>, stored in a fork with the suffix <literal>_vm</literal>,
+to track which pages are known to have no dead tuples. The visibility map is
+described further in <xref linkend="storage-vm"/>. Unlogged tables and indexes
+have a third fork, known as the initialization fork, which is stored in a fork
+with the suffix <literal>_init</literal> (see <xref linkend="storage-init"/>).
+</para>
+
+<caution>
+<para>
+Note that while a table's filenode often matches its OID, this is
+<emphasis>not</emphasis> necessarily the case; some operations, like
+<command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms
+of <command>ALTER TABLE</command>, can change the filenode while preserving the OID.
+Avoid assuming that filenode and table OID are the same.
+Also, for certain system catalogs including <structname>pg_class</structname> itself,
+<structname>pg_class</structname>.<structfield>relfilenode</structfield> contains zero. The
+actual filenode number of these catalogs is stored in a lower-level data
+structure, and can be obtained using the <function>pg_relation_filenode()</function>
+function.
+</para>
+</caution>
+
+<para>
+When a table or index exceeds 1 GB, it is divided into gigabyte-sized
+<firstterm>segments</firstterm>. The first segment's file name is the same as the
+filenode; subsequent segments are named filenode.1, filenode.2, etc.
+This arrangement avoids problems on platforms that have file size limitations.
+(Actually, 1 GB is just the default segment size. The segment size can be
+adjusted using the configuration option <option>--with-segsize</option>
+when building <productname>PostgreSQL</productname>.)
+In principle, free space map and visibility map forks could require multiple
+segments as well, though this is unlikely to happen in practice.
+</para>
+
+<para>
+A table that has columns with potentially large entries will have an
+associated <firstterm>TOAST</firstterm> table, which is used for out-of-line storage of
+field values that are too large to keep in the table rows proper.
+<structname>pg_class</structname>.<structfield>reltoastrelid</structfield> links from a table to
+its <acronym>TOAST</acronym> table, if any.
+See <xref linkend="storage-toast"/> for more information.
+</para>
+
+<para>
+The contents of tables and indexes are discussed further in
+<xref linkend="storage-page-layout"/>.
+</para>
+
+<para>
+Tablespaces make the scenario more complicated. Each user-defined tablespace
+has a symbolic link inside the <varname>PGDATA</varname><filename>/pg_tblspc</filename>
+directory, which points to the physical tablespace directory (i.e., the
+location specified in the tablespace's <command>CREATE TABLESPACE</command> command).
+This symbolic link is named after
+the tablespace's OID. Inside the physical tablespace directory there is
+a subdirectory with a name that depends on the <productname>PostgreSQL</productname>
+server version, such as <literal>PG_9.0_201008051</literal>. (The reason for using
+this subdirectory is so that successive versions of the database can use
+the same <command>CREATE TABLESPACE</command> location value without conflicts.)
+Within the version-specific subdirectory, there is
+a subdirectory for each database that has elements in the tablespace, named
+after the database's OID. Tables and indexes are stored within that
+directory, using the filenode naming scheme.
+The <literal>pg_default</literal> tablespace is not accessed through
+<filename>pg_tblspc</filename>, but corresponds to
+<varname>PGDATA</varname><filename>/base</filename>. Similarly, the <literal>pg_global</literal>
+tablespace is not accessed through <filename>pg_tblspc</filename>, but corresponds to
+<varname>PGDATA</varname><filename>/global</filename>.
+</para>
+
+<para>
+The <function>pg_relation_filepath()</function> function shows the entire path
+(relative to <varname>PGDATA</varname>) of any relation. It is often useful
+as a substitute for remembering many of the above rules. But keep in
+mind that this function just gives the name of the first segment of the
+main fork of the relation &mdash; you may need to append a segment number
+and/or <literal>_fsm</literal>, <literal>_vm</literal>, or <literal>_init</literal> to find all
+the files associated with the relation.
+</para>
+
+<para>
+Temporary files (for operations such as sorting more data than can fit in
+memory) are created within <varname>PGDATA</varname><filename>/base/pgsql_tmp</filename>,
+or within a <filename>pgsql_tmp</filename> subdirectory of a tablespace directory
+if a tablespace other than <literal>pg_default</literal> is specified for them.
+The name of a temporary file has the form
+<filename>pgsql_tmp<replaceable>PPP</replaceable>.<replaceable>NNN</replaceable></filename>,
+where <replaceable>PPP</replaceable> is the PID of the owning backend and
+<replaceable>NNN</replaceable> distinguishes different temporary files of that backend.
+</para>
+
+</sect1>
+
+<sect1 id="storage-toast">
+
+<title>TOAST</title>
+
+ <indexterm>
+ <primary>TOAST</primary>
+ </indexterm>
+ <indexterm><primary>sliced bread</primary><see>TOAST</see></indexterm>
+
+<para>
+This section provides an overview of <acronym>TOAST</acronym> (The
+Oversized-Attribute Storage Technique).
+</para>
+
+<para>
+<productname>PostgreSQL</productname> uses a fixed page size (commonly
+8 kB), and does not allow tuples to span multiple pages. Therefore, it is
+not possible to store very large field values directly. To overcome
+this limitation, large field values are compressed and/or broken up into
+multiple physical rows. This happens transparently to the user, with only
+small impact on most of the backend code. The technique is affectionately
+known as <acronym>TOAST</acronym> (or <quote>the best thing since sliced bread</quote>).
+The <acronym>TOAST</acronym> infrastructure is also used to improve handling of
+large data values in-memory.
+</para>
+
+<para>
+Only certain data types support <acronym>TOAST</acronym> &mdash; there is no need to
+impose the overhead on data types that cannot produce large field values.
+To support <acronym>TOAST</acronym>, a data type must have a variable-length
+(<firstterm>varlena</firstterm>) representation, in which, ordinarily, the first
+four-byte word of any stored value contains the total length of the value in
+bytes (including itself). <acronym>TOAST</acronym> does not constrain the rest
+of the data type's representation. The special representations collectively
+called <firstterm><acronym>TOAST</acronym>ed values</firstterm> work by modifying or
+reinterpreting this initial length word. Therefore, the C-level functions
+supporting a <acronym>TOAST</acronym>-able data type must be careful about how they
+handle potentially <acronym>TOAST</acronym>ed input values: an input might not
+actually consist of a four-byte length word and contents until after it's
+been <firstterm>detoasted</firstterm>. (This is normally done by invoking
+<function>PG_DETOAST_DATUM</function> before doing anything with an input value,
+but in some cases more efficient approaches are possible.
+See <xref linkend="xtypes-toast"/> for more detail.)
+</para>
+
+<para>
+<acronym>TOAST</acronym> usurps two bits of the varlena length word (the high-order
+bits on big-endian machines, the low-order bits on little-endian machines),
+thereby limiting the logical size of any value of a <acronym>TOAST</acronym>-able
+data type to 1 GB (2<superscript>30</superscript> - 1 bytes). When both bits are zero,
+the value is an ordinary un-<acronym>TOAST</acronym>ed value of the data type, and
+the remaining bits of the length word give the total datum size (including
+length word) in bytes. When the highest-order or lowest-order bit is set,
+the value has only a single-byte header instead of the normal four-byte
+header, and the remaining bits of that byte give the total datum size
+(including length byte) in bytes. This alternative supports space-efficient
+storage of values shorter than 127 bytes, while still allowing the data type
+to grow to 1 GB at need. Values with single-byte headers aren't aligned on
+any particular boundary, whereas values with four-byte headers are aligned on
+at least a four-byte boundary; this omission of alignment padding provides
+additional space savings that is significant compared to short values.
+As a special case, if the remaining bits of a single-byte header are all
+zero (which would be impossible for a self-inclusive length), the value is
+a pointer to out-of-line data, with several possible alternatives as
+described below. The type and size of such a <firstterm>TOAST pointer</firstterm>
+are determined by a code stored in the second byte of the datum.
+Lastly, when the highest-order or lowest-order bit is clear but the adjacent
+bit is set, the content of the datum has been compressed and must be
+decompressed before use. In this case the remaining bits of the four-byte
+length word give the total size of the compressed datum, not the
+original data. Note that compression is also possible for out-of-line data
+but the varlena header does not tell whether it has occurred &mdash;
+the content of the <acronym>TOAST</acronym> pointer tells that, instead.
+</para>
+
+<para>
+The compression technique used for either in-line or out-of-line compressed
+data can be selected for each column by setting
+the <literal>COMPRESSION</literal> column option in <command>CREATE
+TABLE</command> or <command>ALTER TABLE</command>. The default for columns
+with no explicit setting is to consult the
+<xref linkend="guc-default-toast-compression"/> parameter at the time data is
+inserted.
+</para>
+
+<para>
+As mentioned, there are multiple types of <acronym>TOAST</acronym> pointer datums.
+The oldest and most common type is a pointer to out-of-line data stored in
+a <firstterm><acronym>TOAST</acronym> table</firstterm> that is separate from, but
+associated with, the table containing the <acronym>TOAST</acronym> pointer datum
+itself. These <firstterm>on-disk</firstterm> pointer datums are created by the
+<acronym>TOAST</acronym> management code (in <filename>access/common/toast_internals.c</filename>)
+when a tuple to be stored on disk is too large to be stored as-is.
+Further details appear in <xref linkend="storage-toast-ondisk"/>.
+Alternatively, a <acronym>TOAST</acronym> pointer datum can contain a pointer to
+out-of-line data that appears elsewhere in memory. Such datums are
+necessarily short-lived, and will never appear on-disk, but they are very
+useful for avoiding copying and redundant processing of large data values.
+Further details appear in <xref linkend="storage-toast-inmemory"/>.
+</para>
+
+<sect2 id="storage-toast-ondisk">
+ <title>Out-of-Line, On-Disk TOAST Storage</title>
+
+<para>
+If any of the columns of a table are <acronym>TOAST</acronym>-able, the table will
+have an associated <acronym>TOAST</acronym> table, whose OID is stored in the table's
+<structname>pg_class</structname>.<structfield>reltoastrelid</structfield> entry. On-disk
+<acronym>TOAST</acronym>ed values are kept in the <acronym>TOAST</acronym> table, as
+described in more detail below.
+</para>
+
+<para>
+Out-of-line values are divided (after compression if used) into chunks of at
+most <symbol>TOAST_MAX_CHUNK_SIZE</symbol> bytes (by default this value is chosen
+so that four chunk rows will fit on a page, making it about 2000 bytes).
+Each chunk is stored as a separate row in the <acronym>TOAST</acronym> table
+belonging to the owning table. Every
+<acronym>TOAST</acronym> table has the columns <structfield>chunk_id</structfield> (an OID
+identifying the particular <acronym>TOAST</acronym>ed value),
+<structfield>chunk_seq</structfield> (a sequence number for the chunk within its value),
+and <structfield>chunk_data</structfield> (the actual data of the chunk). A unique index
+on <structfield>chunk_id</structfield> and <structfield>chunk_seq</structfield> provides fast
+retrieval of the values. A pointer datum representing an out-of-line on-disk
+<acronym>TOAST</acronym>ed value therefore needs to store the OID of the
+<acronym>TOAST</acronym> table in which to look and the OID of the specific value
+(its <structfield>chunk_id</structfield>). For convenience, pointer datums also store the
+logical datum size (original uncompressed data length), physical stored size
+(different if compression was applied), and the compression method used, if
+any. Allowing for the varlena header bytes,
+the total size of an on-disk <acronym>TOAST</acronym> pointer datum is therefore 18
+bytes regardless of the actual size of the represented value.
+</para>
+
+<para>
+The <acronym>TOAST</acronym> management code is triggered only
+when a row value to be stored in a table is wider than
+<symbol>TOAST_TUPLE_THRESHOLD</symbol> bytes (normally 2 kB).
+The <acronym>TOAST</acronym> code will compress and/or move
+field values out-of-line until the row value is shorter than
+<symbol>TOAST_TUPLE_TARGET</symbol> bytes (also normally 2 kB, adjustable)
+or no more gains can be had. During an UPDATE
+operation, values of unchanged fields are normally preserved as-is; so an
+UPDATE of a row with out-of-line values incurs no <acronym>TOAST</acronym> costs if
+none of the out-of-line values change.
+</para>
+
+<para>
+The <acronym>TOAST</acronym> management code recognizes four different strategies
+for storing <acronym>TOAST</acronym>-able columns on disk:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PLAIN</literal> prevents either compression or
+ out-of-line storage. This is the only possible strategy for
+ columns of non-<acronym>TOAST</acronym>-able data types.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>EXTENDED</literal> allows both compression and out-of-line
+ storage. This is the default for most <acronym>TOAST</acronym>-able data types.
+ Compression will be attempted first, then out-of-line storage if
+ the row is still too big.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>EXTERNAL</literal> allows out-of-line storage but not
+ compression. Use of <literal>EXTERNAL</literal> will
+ make substring operations on wide <type>text</type> and
+ <type>bytea</type> columns faster (at the penalty of increased storage
+ space) because these operations are optimized to fetch only the
+ required parts of the out-of-line value when it is not compressed.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MAIN</literal> allows compression but not out-of-line
+ storage. (Actually, out-of-line storage will still be performed
+ for such columns, but only as a last resort when there is no other
+ way to make the row small enough to fit on a page.)
+ </para>
+ </listitem>
+ </itemizedlist>
+
+Each <acronym>TOAST</acronym>-able data type specifies a default strategy for columns
+of that data type, but the strategy for a given table column can be altered
+with <link linkend="sql-altertable"><command>ALTER TABLE ... SET STORAGE</command></link>.
+</para>
+
+<para>
+<symbol>TOAST_TUPLE_TARGET</symbol> can be adjusted for each table using
+<link linkend="sql-altertable"><command>ALTER TABLE ... SET (toast_tuple_target = N)</command></link>
+</para>
+
+<para>
+This scheme has a number of advantages compared to a more straightforward
+approach such as allowing row values to span pages. Assuming that queries are
+usually qualified by comparisons against relatively small key values, most of
+the work of the executor will be done using the main row entry. The big values
+of <acronym>TOAST</acronym>ed attributes will only be pulled out (if selected at all)
+at the time the result set is sent to the client. Thus, the main table is much
+smaller and more of its rows fit in the shared buffer cache than would be the
+case without any out-of-line storage. Sort sets shrink also, and sorts will
+more often be done entirely in memory. A little test showed that a table
+containing typical HTML pages and their URLs was stored in about half of the
+raw data size including the <acronym>TOAST</acronym> table, and that the main table
+contained only about 10% of the entire data (the URLs and some small HTML
+pages). There was no run time difference compared to an un-<acronym>TOAST</acronym>ed
+comparison table, in which all the HTML pages were cut down to 7 kB to fit.
+</para>
+
+</sect2>
+
+<sect2 id="storage-toast-inmemory">
+ <title>Out-of-Line, In-Memory TOAST Storage</title>
+
+<para>
+<acronym>TOAST</acronym> pointers can point to data that is not on disk, but is
+elsewhere in the memory of the current server process. Such pointers
+obviously cannot be long-lived, but they are nonetheless useful. There
+are currently two sub-cases:
+pointers to <firstterm>indirect</firstterm> data and
+pointers to <firstterm>expanded</firstterm> data.
+</para>
+
+<para>
+Indirect <acronym>TOAST</acronym> pointers simply point at a non-indirect varlena
+value stored somewhere in memory. This case was originally created merely
+as a proof of concept, but it is currently used during logical decoding to
+avoid possibly having to create physical tuples exceeding 1 GB (as pulling
+all out-of-line field values into the tuple might do). The case is of
+limited use since the creator of the pointer datum is entirely responsible
+that the referenced data survives for as long as the pointer could exist,
+and there is no infrastructure to help with this.
+</para>
+
+<para>
+Expanded <acronym>TOAST</acronym> pointers are useful for complex data types
+whose on-disk representation is not especially suited for computational
+purposes. As an example, the standard varlena representation of a
+<productname>PostgreSQL</productname> array includes dimensionality information, a
+nulls bitmap if there are any null elements, then the values of all the
+elements in order. When the element type itself is variable-length, the
+only way to find the <replaceable>N</replaceable>'th element is to scan through all the
+preceding elements. This representation is appropriate for on-disk storage
+because of its compactness, but for computations with the array it's much
+nicer to have an <quote>expanded</quote> or <quote>deconstructed</quote>
+representation in which all the element starting locations have been
+identified. The <acronym>TOAST</acronym> pointer mechanism supports this need by
+allowing a pass-by-reference Datum to point to either a standard varlena
+value (the on-disk representation) or a <acronym>TOAST</acronym> pointer that
+points to an expanded representation somewhere in memory. The details of
+this expanded representation are up to the data type, though it must have
+a standard header and meet the other API requirements given
+in <filename>src/include/utils/expandeddatum.h</filename>. C-level functions
+working with the data type can choose to handle either representation.
+Functions that do not know about the expanded representation, but simply
+apply <function>PG_DETOAST_DATUM</function> to their inputs, will automatically
+receive the traditional varlena representation; so support for an expanded
+representation can be introduced incrementally, one function at a time.
+</para>
+
+<para>
+<acronym>TOAST</acronym> pointers to expanded values are further broken down
+into <firstterm>read-write</firstterm> and <firstterm>read-only</firstterm> pointers.
+The pointed-to representation is the same either way, but a function that
+receives a read-write pointer is allowed to modify the referenced value
+in-place, whereas one that receives a read-only pointer must not; it must
+first create a copy if it wants to make a modified version of the value.
+This distinction and some associated conventions make it possible to avoid
+unnecessary copying of expanded values during query execution.
+</para>
+
+<para>
+For all types of in-memory <acronym>TOAST</acronym> pointer, the <acronym>TOAST</acronym>
+management code ensures that no such pointer datum can accidentally get
+stored on disk. In-memory <acronym>TOAST</acronym> pointers are automatically
+expanded to normal in-line varlena values before storage &mdash; and then
+possibly converted to on-disk <acronym>TOAST</acronym> pointers, if the containing
+tuple would otherwise be too big.
+</para>
+
+</sect2>
+
+</sect1>
+
+<sect1 id="storage-fsm">
+
+<title>Free Space Map</title>
+
+<indexterm>
+ <primary>Free Space Map</primary>
+</indexterm>
+<indexterm><primary>FSM</primary><see>Free Space Map</see></indexterm>
+
+<para>
+Each heap and index relation, except for hash indexes, has a Free Space Map
+(<acronym>FSM</acronym>) to keep track of available space in the relation.
+It's stored alongside the main relation data in a separate relation fork,
+named after the filenode number of the relation, plus a <literal>_fsm</literal>
+suffix. For example, if the filenode of a relation is 12345, the
+<acronym>FSM</acronym> is stored in a file called
+<filename>12345_fsm</filename>, in the same directory as the main relation file.
+</para>
+
+<para>
+The Free Space Map is organized as a tree of <acronym>FSM</acronym> pages. The
+bottom level <acronym>FSM</acronym> pages store the free space available on each
+heap (or index) page, using one byte to represent each such page. The upper
+levels aggregate information from the lower levels.
+</para>
+
+<para>
+Within each <acronym>FSM</acronym> page is a binary tree, stored in an array with
+one byte per node. Each leaf node represents a heap page, or a lower level
+<acronym>FSM</acronym> page. In each non-leaf node, the higher of its children's
+values is stored. The maximum value in the leaf nodes is therefore stored
+at the root.
+</para>
+
+<para>
+See <filename>src/backend/storage/freespace/README</filename> for more details on
+how the <acronym>FSM</acronym> is structured, and how it's updated and searched.
+The <xref linkend="pgfreespacemap"/> module
+can be used to examine the information stored in free space maps.
+</para>
+
+</sect1>
+
+<sect1 id="storage-vm">
+
+<title>Visibility Map</title>
+
+<indexterm>
+ <primary>Visibility Map</primary>
+</indexterm>
+<indexterm><primary>VM</primary><see>Visibility Map</see></indexterm>
+
+<para>
+Each heap relation has a Visibility Map
+(VM) to keep track of which pages contain only tuples that are known to be
+visible to all active transactions; it also keeps track of which pages contain
+only frozen tuples. It's stored
+alongside the main relation data in a separate relation fork, named after the
+filenode number of the relation, plus a <literal>_vm</literal> suffix. For example,
+if the filenode of a relation is 12345, the VM is stored in a file called
+<filename>12345_vm</filename>, in the same directory as the main relation file.
+Note that indexes do not have VMs.
+</para>
+
+<para>
+The visibility map stores two bits per heap page. The first bit, if set,
+indicates that the page is all-visible, or in other words that the page does
+not contain any tuples that need to be vacuumed.
+This information can also be used
+by <link linkend="indexes-index-only-scans"><firstterm>index-only
+scans</firstterm></link> to answer queries using only the index tuple.
+The second bit, if set, means that all tuples on the page have been frozen.
+That means that even an anti-wraparound vacuum need not revisit the page.
+</para>
+
+<para>
+The map is conservative in the sense that we make sure that whenever a bit is
+set, we know the condition is true, but if a bit is not set, it might or
+might not be true. Visibility map bits are only set by vacuum, but are
+cleared by any data-modifying operations on a page.
+</para>
+
+<para>
+The <xref linkend="pgvisibility"/> module can be used to examine the
+information stored in the visibility map.
+</para>
+
+</sect1>
+
+<sect1 id="storage-init">
+
+<title>The Initialization Fork</title>
+
+<indexterm>
+ <primary>Initialization Fork</primary>
+</indexterm>
+
+<para>
+Each unlogged table, and each index on an unlogged table, has an initialization
+fork. The initialization fork is an empty table or index of the appropriate
+type. When an unlogged table must be reset to empty due to a crash, the
+initialization fork is copied over the main fork, and any other forks are
+erased (they will be recreated automatically as needed).
+</para>
+
+</sect1>
+
+<sect1 id="storage-page-layout">
+
+<title>Database Page Layout</title>
+
+<para>
+This section provides an overview of the page format used within
+<productname>PostgreSQL</productname> tables and indexes.<footnote>
+ <para>
+ Actually, use of this page format is not required for either table or
+ index access methods. The <literal>heap</literal> table access method
+ always uses this format. All the existing index methods also use the
+ basic format, but the data kept on index metapages usually doesn't follow
+ the item layout rules.
+ </para>
+</footnote>
+Sequences and <acronym>TOAST</acronym> tables are formatted just like a regular table.
+</para>
+
+<para>
+In the following explanation, a
+<firstterm>byte</firstterm>
+is assumed to contain 8 bits. In addition, the term
+<firstterm>item</firstterm>
+refers to an individual data value that is stored on a page. In a table,
+an item is a row; in an index, an item is an index entry.
+</para>
+
+<para>
+Every table and index is stored as an array of <firstterm>pages</firstterm> of a
+fixed size (usually 8 kB, although a different page size can be selected
+when compiling the server). In a table, all the pages are logically
+equivalent, so a particular item (row) can be stored in any page. In
+indexes, the first page is generally reserved as a <firstterm>metapage</firstterm>
+holding control information, and there can be different types of pages
+within the index, depending on the index access method.
+</para>
+
+<para>
+<xref linkend="page-table"/> shows the overall layout of a page.
+There are five parts to each page.
+</para>
+
+<table tocentry="1" id="page-table">
+<title>Overall Page Layout</title>
+<titleabbrev>Page Layout</titleabbrev>
+<tgroup cols="2">
+<thead>
+<row>
+<entry>
+Item
+</entry>
+<entry>Description</entry>
+</row>
+</thead>
+
+<tbody>
+
+<row>
+ <entry>PageHeaderData</entry>
+ <entry>24 bytes long. Contains general information about the page, including
+free space pointers.</entry>
+</row>
+
+<row>
+<entry>ItemIdData</entry>
+<entry>Array of item identifiers pointing to the actual items. Each
+entry is an (offset,length) pair. 4 bytes per item.</entry>
+</row>
+
+<row>
+<entry>Free space</entry>
+<entry>The unallocated space. New item identifiers are allocated from
+the start of this area, new items from the end.</entry>
+</row>
+
+<row>
+<entry>Items</entry>
+<entry>The actual items themselves.</entry>
+</row>
+
+<row>
+<entry>Special space</entry>
+<entry>Index access method specific data. Different methods store different
+data. Empty in ordinary tables.</entry>
+</row>
+
+</tbody>
+</tgroup>
+</table>
+
+ <para>
+
+ The first 24 bytes of each page consists of a page header
+ (<structname>PageHeaderData</structname>). Its format is detailed in <xref
+ linkend="pageheaderdata-table"/>. The first field tracks the most
+ recent WAL entry related to this page. The second field contains
+ the page checksum if <xref linkend="app-initdb-data-checksums"/> are
+ enabled. Next is a 2-byte field containing flag bits. This is followed
+ by three 2-byte integer fields (<structfield>pd_lower</structfield>,
+ <structfield>pd_upper</structfield>, and
+ <structfield>pd_special</structfield>). These contain byte offsets
+ from the page start to the start of unallocated space, to the end of
+ unallocated space, and to the start of the special space. The next 2
+ bytes of the page header, <structfield>pd_pagesize_version</structfield>,
+ store both the page size and a version indicator. Beginning with
+ <productname>PostgreSQL</productname> 8.3 the version number is 4;
+ <productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3;
+ <productname>PostgreSQL</productname> 8.0 used version number 2;
+ <productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
+ prior releases used version number 0.
+ (The basic page layout and header format has not changed in most of these
+ versions, but the layout of heap row headers has.) The page size
+ is basically only present as a cross-check; there is no support for having
+ more than one page size in an installation.
+ The last field is a hint that shows whether pruning the page is likely
+ to be profitable: it tracks the oldest un-pruned XMAX on the page.
+
+ </para>
+
+ <table tocentry="1" id="pageheaderdata-table">
+ <title>PageHeaderData Layout</title>
+ <titleabbrev>PageHeaderData Layout</titleabbrev>
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Field</entry>
+ <entry>Type</entry>
+ <entry>Length</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>pd_lsn</entry>
+ <entry>PageXLogRecPtr</entry>
+ <entry>8 bytes</entry>
+ <entry>LSN: next byte after last byte of WAL record for last change
+ to this page</entry>
+ </row>
+ <row>
+ <entry>pd_checksum</entry>
+ <entry>uint16</entry>
+ <entry>2 bytes</entry>
+ <entry>Page checksum</entry>
+ </row>
+ <row>
+ <entry>pd_flags</entry>
+ <entry>uint16</entry>
+ <entry>2 bytes</entry>
+ <entry>Flag bits</entry>
+ </row>
+ <row>
+ <entry>pd_lower</entry>
+ <entry>LocationIndex</entry>
+ <entry>2 bytes</entry>
+ <entry>Offset to start of free space</entry>
+ </row>
+ <row>
+ <entry>pd_upper</entry>
+ <entry>LocationIndex</entry>
+ <entry>2 bytes</entry>
+ <entry>Offset to end of free space</entry>
+ </row>
+ <row>
+ <entry>pd_special</entry>
+ <entry>LocationIndex</entry>
+ <entry>2 bytes</entry>
+ <entry>Offset to start of special space</entry>
+ </row>
+ <row>
+ <entry>pd_pagesize_version</entry>
+ <entry>uint16</entry>
+ <entry>2 bytes</entry>
+ <entry>Page size and layout version number information</entry>
+ </row>
+ <row>
+ <entry>pd_prune_xid</entry>
+ <entry>TransactionId</entry>
+ <entry>4 bytes</entry>
+ <entry>Oldest unpruned XMAX on page, or zero if none</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ All the details can be found in
+ <filename>src/include/storage/bufpage.h</filename>.
+ </para>
+
+ <para>
+ Following the page header are item identifiers
+ (<type>ItemIdData</type>), each requiring four bytes.
+ An item identifier contains a byte-offset to
+ the start of an item, its length in bytes, and a few attribute bits
+ which affect its interpretation.
+ New item identifiers are allocated
+ as needed from the beginning of the unallocated space.
+ The number of item identifiers present can be determined by looking at
+ <structfield>pd_lower</structfield>, which is increased to allocate a new identifier.
+ Because an item
+ identifier is never moved until it is freed, its index can be used on a
+ long-term basis to reference an item, even when the item itself is moved
+ around on the page to compact free space. In fact, every pointer to an
+ item (<type>ItemPointer</type>, also known as
+ <type>CTID</type>) created by
+ <productname>PostgreSQL</productname> consists of a page number and the
+ index of an item identifier.
+
+ </para>
+
+ <para>
+
+ The items themselves are stored in space allocated backwards from the end
+ of unallocated space. The exact structure varies depending on what the
+ table is to contain. Tables and sequences both use a structure named
+ <type>HeapTupleHeaderData</type>, described below.
+
+ </para>
+
+ <para>
+
+ The final section is the <quote>special section</quote> which can
+ contain anything the access method wishes to store. For example,
+ b-tree indexes store links to the page's left and right siblings,
+ as well as some other data relevant to the index structure.
+ Ordinary tables do not use a special section at all (indicated by setting
+ <structfield>pd_special</structfield> to equal the page size).
+
+ </para>
+
+ <para>
+ <xref linkend="storage-page-layout-figure"/> illustrates how these parts are
+ laid out in a page.
+ </para>
+
+ <figure id="storage-page-layout-figure">
+ <title>Page Layout</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata fileref="images/pagelayout.svg" format="SVG" width="100%"/>
+ </imageobject>
+ </mediaobject>
+ </figure>
+
+ <sect2 id="storage-tuple-layout">
+
+ <title>Table Row Layout</title>
+
+ <para>
+
+ All table rows are structured in the same way. There is a fixed-size
+ header (occupying 23 bytes on most machines), followed by an optional null
+ bitmap, an optional object ID field, and the user data. The header is
+ detailed
+ in <xref linkend="heaptupleheaderdata-table"/>. The actual user data
+ (columns of the row) begins at the offset indicated by
+ <structfield>t_hoff</structfield>, which must always be a multiple of the MAXALIGN
+ distance for the platform.
+ The null bitmap is
+ only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
+ <structfield>t_infomask</structfield>. If it is present it begins just after
+ the fixed header and occupies enough bytes to have one bit per data column
+ (that is, the number of bits that equals the attribute count in
+ <structfield>t_infomask2</structfield>). In this list of bits, a
+ 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
+ present, all columns are assumed not-null.
+ The object ID is only present if the <firstterm>HEAP_HASOID_OLD</firstterm> bit
+ is set in <structfield>t_infomask</structfield>. If present, it appears just
+ before the <structfield>t_hoff</structfield> boundary. Any padding needed to make
+ <structfield>t_hoff</structfield> a MAXALIGN multiple will appear between the null
+ bitmap and the object ID. (This in turn ensures that the object ID is
+ suitably aligned.)
+
+ </para>
+
+ <table tocentry="1" id="heaptupleheaderdata-table">
+ <title>HeapTupleHeaderData Layout</title>
+ <titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Field</entry>
+ <entry>Type</entry>
+ <entry>Length</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>t_xmin</entry>
+ <entry>TransactionId</entry>
+ <entry>4 bytes</entry>
+ <entry>insert XID stamp</entry>
+ </row>
+ <row>
+ <entry>t_xmax</entry>
+ <entry>TransactionId</entry>
+ <entry>4 bytes</entry>
+ <entry>delete XID stamp</entry>
+ </row>
+ <row>
+ <entry>t_cid</entry>
+ <entry>CommandId</entry>
+ <entry>4 bytes</entry>
+ <entry>insert and/or delete CID stamp (overlays with t_xvac)</entry>
+ </row>
+ <row>
+ <entry>t_xvac</entry>
+ <entry>TransactionId</entry>
+ <entry>4 bytes</entry>
+ <entry>XID for VACUUM operation moving a row version</entry>
+ </row>
+ <row>
+ <entry>t_ctid</entry>
+ <entry>ItemPointerData</entry>
+ <entry>6 bytes</entry>
+ <entry>current TID of this or newer row version</entry>
+ </row>
+ <row>
+ <entry>t_infomask2</entry>
+ <entry>uint16</entry>
+ <entry>2 bytes</entry>
+ <entry>number of attributes, plus various flag bits</entry>
+ </row>
+ <row>
+ <entry>t_infomask</entry>
+ <entry>uint16</entry>
+ <entry>2 bytes</entry>
+ <entry>various flag bits</entry>
+ </row>
+ <row>
+ <entry>t_hoff</entry>
+ <entry>uint8</entry>
+ <entry>1 byte</entry>
+ <entry>offset to user data</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ All the details can be found in
+ <filename>src/include/access/htup_details.h</filename>.
+ </para>
+
+ <para>
+
+ Interpreting the actual data can only be done with information obtained
+ from other tables, mostly <structname>pg_attribute</structname>. The
+ key values needed to identify field locations are
+ <structfield>attlen</structfield> and <structfield>attalign</structfield>.
+ There is no way to directly get a
+ particular attribute, except when there are only fixed width fields and no
+ null values. All this trickery is wrapped up in the functions
+ <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
+ and <firstterm>heap_getsysattr</firstterm>.
+
+ </para>
+ <para>
+
+ To read the data you need to examine each attribute in turn. First check
+ whether the field is NULL according to the null bitmap. If it is, go to
+ the next. Then make sure you have the right alignment. If the field is a
+ fixed width field, then all the bytes are simply placed. If it's a
+ variable length field (attlen = -1) then it's a bit more complicated.
+ All variable-length data types share the common header structure
+ <type>struct varlena</type>, which includes the total length of the stored
+ value and some flag bits. Depending on the flags, the data can be either
+ inline or in a <acronym>TOAST</acronym> table;
+ it might be compressed, too (see <xref linkend="storage-toast"/>).
+
+ </para>
+ </sect2>
+</sect1>
+
+<sect1 id="storage-hot">
+
+ <title>Heap-Only Tuples (<acronym>HOT</acronym>)</title>
+
+ <para>
+ To allow for high concurrency, <productname>PostgreSQL</productname>
+ uses <link linkend="mvcc-intro">multiversion concurrency
+ control</link> (<acronym>MVCC</acronym>) to store rows. However,
+ <acronym>MVCC</acronym> has some downsides for update queries.
+ Specifically, updates require new versions of rows to be added to
+ tables. This can also require new index entries for each updated row,
+ and removal of old versions of rows and their index entries can be
+ expensive.
+ </para>
+
+ <para>
+ To help reduce the overhead of updates,
+ <productname>PostgreSQL</productname> has an optimization called
+ heap-only tuples (<acronym>HOT</acronym>). This optimization is
+ possible when:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The update does not modify any columns referenced by the table's
+ indexes, including expression and partial indexes.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ There is sufficient free space on the page containing the old row
+ for the updated row.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ In such cases, heap-only tuples provide two optimizations:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ New index entries are not needed to represent updated rows.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Old versions of updated rows can be completely removed during normal
+ operation, including <command>SELECT</command>s, instead of requiring
+ periodic vacuum operations. (This is possible because indexes
+ do not reference their <link linkend="storage-page-layout">page
+ item identifiers</link>.)
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <para>
+ In summary, heap-only tuple updates can only be created
+ if columns used by indexes are not updated. You can
+ increase the likelihood of sufficient page space for
+ <acronym>HOT</acronym> updates by decreasing a table's <link
+ linkend="reloption-fillfactor"><literal>fillfactor</literal></link>.
+ If you don't, <acronym>HOT</acronym> updates will still happen because
+ new rows will naturally migrate to new pages and existing pages with
+ sufficient free space for new row versions. The system view <link
+ linkend="monitoring-pg-stat-all-tables-view">pg_stat_all_tables</link>
+ allows monitoring of the occurrence of HOT and non-HOT updates.
+ </para>
+</sect1>
+
+</chapter>