diff options
Diffstat (limited to 'doc/src/sgml/storage.sgml')
-rw-r--r-- | doc/src/sgml/storage.sgml | 1148 |
1 files changed, 1148 insertions, 0 deletions
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml new file mode 100644 index 0000000..148fb1b --- /dev/null +++ b/doc/src/sgml/storage.sgml @@ -0,0 +1,1148 @@ +<!-- doc/src/sgml/storage.sgml --> + +<chapter id="storage"> + +<title>Database Physical Storage</title> + +<para> +This chapter provides an overview of the physical storage format used by +<productname>PostgreSQL</productname> databases. +</para> + +<sect1 id="storage-file-layout"> + +<title>Database File Layout</title> + +<para> +This section describes the storage format at the level of files and +directories. +</para> + +<para> +Traditionally, the configuration and data files used by a database +cluster are stored together within the cluster's data +directory, commonly referred to as <varname>PGDATA</varname> (after the name of the +environment variable that can be used to define it). A common location for +<varname>PGDATA</varname> is <filename>/var/lib/pgsql/data</filename>. Multiple clusters, +managed by different server instances, can exist on the same machine. +</para> + +<para> +The <varname>PGDATA</varname> directory contains several subdirectories and control +files, as shown in <xref linkend="pgdata-contents-table"/>. In addition to +these required items, the cluster configuration files +<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and +<filename>pg_ident.conf</filename> are traditionally stored in +<varname>PGDATA</varname>, although it is possible to place them elsewhere. +</para> + +<table tocentry="1" id="pgdata-contents-table"> +<title>Contents of <varname>PGDATA</varname></title> +<tgroup cols="2"> +<thead> +<row> +<entry> +Item +</entry> +<entry>Description</entry> +</row> +</thead> + +<tbody> + +<row> + <entry><filename>PG_VERSION</filename></entry> + <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry> +</row> + +<row> + <entry><filename>base</filename></entry> + <entry>Subdirectory containing per-database subdirectories</entry> +</row> + +<row> + <entry><filename>current_logfiles</filename></entry> + <entry>File recording the log file(s) currently written to by the logging + collector</entry> +</row> + +<row> + <entry><filename>global</filename></entry> + <entry>Subdirectory containing cluster-wide tables, such as + <structname>pg_database</structname></entry> +</row> + +<row> + <entry><filename>pg_commit_ts</filename></entry> + <entry>Subdirectory containing transaction commit timestamp data</entry> +</row> + +<row> + <entry><filename>pg_dynshmem</filename></entry> + <entry>Subdirectory containing files used by the dynamic shared memory + subsystem</entry> +</row> + +<row> + <entry><filename>pg_logical</filename></entry> + <entry>Subdirectory containing status data for logical decoding</entry> +</row> + +<row> + <entry><filename>pg_multixact</filename></entry> + <entry>Subdirectory containing multitransaction status data + (used for shared row locks)</entry> +</row> + +<row> + <entry><filename>pg_notify</filename></entry> + <entry>Subdirectory containing LISTEN/NOTIFY status data</entry> +</row> + +<row> + <entry><filename>pg_replslot</filename></entry> + <entry>Subdirectory containing replication slot data</entry> +</row> + +<row> + <entry><filename>pg_serial</filename></entry> + <entry>Subdirectory containing information about committed serializable transactions</entry> +</row> + +<row> + <entry><filename>pg_snapshots</filename></entry> + <entry>Subdirectory containing exported snapshots</entry> +</row> + +<row> + <entry><filename>pg_stat</filename></entry> + <entry>Subdirectory containing permanent files for the statistics + subsystem</entry> +</row> + +<row> + <entry><filename>pg_stat_tmp</filename></entry> + <entry>Subdirectory containing temporary files for the statistics + subsystem</entry> +</row> + +<row> + <entry><filename>pg_subtrans</filename></entry> + <entry>Subdirectory containing subtransaction status data</entry> +</row> + +<row> + <entry><filename>pg_tblspc</filename></entry> + <entry>Subdirectory containing symbolic links to tablespaces</entry> +</row> + +<row> + <entry><filename>pg_twophase</filename></entry> + <entry>Subdirectory containing state files for prepared transactions</entry> +</row> + +<row> + <entry><filename>pg_wal</filename></entry> + <entry>Subdirectory containing WAL (Write Ahead Log) files</entry> +</row> + +<row> + <entry><filename>pg_xact</filename></entry> + <entry>Subdirectory containing transaction commit status data</entry> +</row> + +<row> + <entry><filename>postgresql.auto.conf</filename></entry> + <entry>A file used for storing configuration parameters that are set by +<command>ALTER SYSTEM</command></entry> +</row> + +<row> + <entry><filename>postmaster.opts</filename></entry> + <entry>A file recording the command-line options the server was +last started with</entry> +</row> + +<row> + <entry><filename>postmaster.pid</filename></entry> + <entry>A lock file recording the current postmaster process ID (PID), + cluster data directory path, + postmaster start timestamp, + port number, + Unix-domain socket directory path (could be empty), + first valid listen_address (IP address or <literal>*</literal>, or empty if + not listening on TCP), + and shared memory segment ID + (this file is not present after server shutdown)</entry> +</row> + +</tbody> +</tgroup> +</table> + +<para> +For each database in the cluster there is a subdirectory within +<varname>PGDATA</varname><filename>/base</filename>, named after the database's OID in +<structname>pg_database</structname>. This subdirectory is the default location +for the database's files; in particular, its system catalogs are stored +there. +</para> + +<para> + Note that the following sections describe the behavior of the builtin + <literal>heap</literal> <link linkend="tableam">table access method</link>, + and the builtin <link linkend="indexam">index access methods</link>. Due + to the extensible nature of <productname>PostgreSQL</productname>, other + access methods might work differently. +</para> + +<para> +Each table and index is stored in a separate file. For ordinary relations, +these files are named after the table or index's <firstterm>filenode</firstterm> number, +which can be found in <structname>pg_class</structname>.<structfield>relfilenode</structfield>. But +for temporary relations, the file name is of the form +<literal>t<replaceable>BBB</replaceable>_<replaceable>FFF</replaceable></literal>, where <replaceable>BBB</replaceable> +is the backend ID of the backend which created the file, and <replaceable>FFF</replaceable> +is the filenode number. In either case, in addition to the main file (a/k/a +main fork), each table and index has a <firstterm>free space map</firstterm> (see <xref +linkend="storage-fsm"/>), which stores information about free space available in +the relation. The free space map is stored in a file named with the filenode +number plus the suffix <literal>_fsm</literal>. Tables also have a +<firstterm>visibility map</firstterm>, stored in a fork with the suffix <literal>_vm</literal>, +to track which pages are known to have no dead tuples. The visibility map is +described further in <xref linkend="storage-vm"/>. Unlogged tables and indexes +have a third fork, known as the initialization fork, which is stored in a fork +with the suffix <literal>_init</literal> (see <xref linkend="storage-init"/>). +</para> + +<caution> +<para> +Note that while a table's filenode often matches its OID, this is +<emphasis>not</emphasis> necessarily the case; some operations, like +<command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms +of <command>ALTER TABLE</command>, can change the filenode while preserving the OID. +Avoid assuming that filenode and table OID are the same. +Also, for certain system catalogs including <structname>pg_class</structname> itself, +<structname>pg_class</structname>.<structfield>relfilenode</structfield> contains zero. The +actual filenode number of these catalogs is stored in a lower-level data +structure, and can be obtained using the <function>pg_relation_filenode()</function> +function. +</para> +</caution> + +<para> +When a table or index exceeds 1 GB, it is divided into gigabyte-sized +<firstterm>segments</firstterm>. The first segment's file name is the same as the +filenode; subsequent segments are named filenode.1, filenode.2, etc. +This arrangement avoids problems on platforms that have file size limitations. +(Actually, 1 GB is just the default segment size. The segment size can be +adjusted using the configuration option <option>--with-segsize</option> +when building <productname>PostgreSQL</productname>.) +In principle, free space map and visibility map forks could require multiple +segments as well, though this is unlikely to happen in practice. +</para> + +<para> +A table that has columns with potentially large entries will have an +associated <firstterm>TOAST</firstterm> table, which is used for out-of-line storage of +field values that are too large to keep in the table rows proper. +<structname>pg_class</structname>.<structfield>reltoastrelid</structfield> links from a table to +its <acronym>TOAST</acronym> table, if any. +See <xref linkend="storage-toast"/> for more information. +</para> + +<para> +The contents of tables and indexes are discussed further in +<xref linkend="storage-page-layout"/>. +</para> + +<para> +Tablespaces make the scenario more complicated. Each user-defined tablespace +has a symbolic link inside the <varname>PGDATA</varname><filename>/pg_tblspc</filename> +directory, which points to the physical tablespace directory (i.e., the +location specified in the tablespace's <command>CREATE TABLESPACE</command> command). +This symbolic link is named after +the tablespace's OID. Inside the physical tablespace directory there is +a subdirectory with a name that depends on the <productname>PostgreSQL</productname> +server version, such as <literal>PG_9.0_201008051</literal>. (The reason for using +this subdirectory is so that successive versions of the database can use +the same <command>CREATE TABLESPACE</command> location value without conflicts.) +Within the version-specific subdirectory, there is +a subdirectory for each database that has elements in the tablespace, named +after the database's OID. Tables and indexes are stored within that +directory, using the filenode naming scheme. +The <literal>pg_default</literal> tablespace is not accessed through +<filename>pg_tblspc</filename>, but corresponds to +<varname>PGDATA</varname><filename>/base</filename>. Similarly, the <literal>pg_global</literal> +tablespace is not accessed through <filename>pg_tblspc</filename>, but corresponds to +<varname>PGDATA</varname><filename>/global</filename>. +</para> + +<para> +The <function>pg_relation_filepath()</function> function shows the entire path +(relative to <varname>PGDATA</varname>) of any relation. It is often useful +as a substitute for remembering many of the above rules. But keep in +mind that this function just gives the name of the first segment of the +main fork of the relation — you may need to append a segment number +and/or <literal>_fsm</literal>, <literal>_vm</literal>, or <literal>_init</literal> to find all +the files associated with the relation. +</para> + +<para> +Temporary files (for operations such as sorting more data than can fit in +memory) are created within <varname>PGDATA</varname><filename>/base/pgsql_tmp</filename>, +or within a <filename>pgsql_tmp</filename> subdirectory of a tablespace directory +if a tablespace other than <literal>pg_default</literal> is specified for them. +The name of a temporary file has the form +<filename>pgsql_tmp<replaceable>PPP</replaceable>.<replaceable>NNN</replaceable></filename>, +where <replaceable>PPP</replaceable> is the PID of the owning backend and +<replaceable>NNN</replaceable> distinguishes different temporary files of that backend. +</para> + +</sect1> + +<sect1 id="storage-toast"> + +<title>TOAST</title> + + <indexterm> + <primary>TOAST</primary> + </indexterm> + <indexterm><primary>sliced bread</primary><see>TOAST</see></indexterm> + +<para> +This section provides an overview of <acronym>TOAST</acronym> (The +Oversized-Attribute Storage Technique). +</para> + +<para> +<productname>PostgreSQL</productname> uses a fixed page size (commonly +8 kB), and does not allow tuples to span multiple pages. Therefore, it is +not possible to store very large field values directly. To overcome +this limitation, large field values are compressed and/or broken up into +multiple physical rows. This happens transparently to the user, with only +small impact on most of the backend code. The technique is affectionately +known as <acronym>TOAST</acronym> (or <quote>the best thing since sliced bread</quote>). +The <acronym>TOAST</acronym> infrastructure is also used to improve handling of +large data values in-memory. +</para> + +<para> +Only certain data types support <acronym>TOAST</acronym> — there is no need to +impose the overhead on data types that cannot produce large field values. +To support <acronym>TOAST</acronym>, a data type must have a variable-length +(<firstterm>varlena</firstterm>) representation, in which, ordinarily, the first +four-byte word of any stored value contains the total length of the value in +bytes (including itself). <acronym>TOAST</acronym> does not constrain the rest +of the data type's representation. The special representations collectively +called <firstterm><acronym>TOAST</acronym>ed values</firstterm> work by modifying or +reinterpreting this initial length word. Therefore, the C-level functions +supporting a <acronym>TOAST</acronym>-able data type must be careful about how they +handle potentially <acronym>TOAST</acronym>ed input values: an input might not +actually consist of a four-byte length word and contents until after it's +been <firstterm>detoasted</firstterm>. (This is normally done by invoking +<function>PG_DETOAST_DATUM</function> before doing anything with an input value, +but in some cases more efficient approaches are possible. +See <xref linkend="xtypes-toast"/> for more detail.) +</para> + +<para> +<acronym>TOAST</acronym> usurps two bits of the varlena length word (the high-order +bits on big-endian machines, the low-order bits on little-endian machines), +thereby limiting the logical size of any value of a <acronym>TOAST</acronym>-able +data type to 1 GB (2<superscript>30</superscript> - 1 bytes). When both bits are zero, +the value is an ordinary un-<acronym>TOAST</acronym>ed value of the data type, and +the remaining bits of the length word give the total datum size (including +length word) in bytes. When the highest-order or lowest-order bit is set, +the value has only a single-byte header instead of the normal four-byte +header, and the remaining bits of that byte give the total datum size +(including length byte) in bytes. This alternative supports space-efficient +storage of values shorter than 127 bytes, while still allowing the data type +to grow to 1 GB at need. Values with single-byte headers aren't aligned on +any particular boundary, whereas values with four-byte headers are aligned on +at least a four-byte boundary; this omission of alignment padding provides +additional space savings that is significant compared to short values. +As a special case, if the remaining bits of a single-byte header are all +zero (which would be impossible for a self-inclusive length), the value is +a pointer to out-of-line data, with several possible alternatives as +described below. The type and size of such a <firstterm>TOAST pointer</firstterm> +are determined by a code stored in the second byte of the datum. +Lastly, when the highest-order or lowest-order bit is clear but the adjacent +bit is set, the content of the datum has been compressed and must be +decompressed before use. In this case the remaining bits of the four-byte +length word give the total size of the compressed datum, not the +original data. Note that compression is also possible for out-of-line data +but the varlena header does not tell whether it has occurred — +the content of the <acronym>TOAST</acronym> pointer tells that, instead. +</para> + +<para> +The compression technique used for either in-line or out-of-line compressed +data can be selected for each column by setting +the <literal>COMPRESSION</literal> column option in <command>CREATE +TABLE</command> or <command>ALTER TABLE</command>. The default for columns +with no explicit setting is to consult the +<xref linkend="guc-default-toast-compression"/> parameter at the time data is +inserted. +</para> + +<para> +As mentioned, there are multiple types of <acronym>TOAST</acronym> pointer datums. +The oldest and most common type is a pointer to out-of-line data stored in +a <firstterm><acronym>TOAST</acronym> table</firstterm> that is separate from, but +associated with, the table containing the <acronym>TOAST</acronym> pointer datum +itself. These <firstterm>on-disk</firstterm> pointer datums are created by the +<acronym>TOAST</acronym> management code (in <filename>access/common/toast_internals.c</filename>) +when a tuple to be stored on disk is too large to be stored as-is. +Further details appear in <xref linkend="storage-toast-ondisk"/>. +Alternatively, a <acronym>TOAST</acronym> pointer datum can contain a pointer to +out-of-line data that appears elsewhere in memory. Such datums are +necessarily short-lived, and will never appear on-disk, but they are very +useful for avoiding copying and redundant processing of large data values. +Further details appear in <xref linkend="storage-toast-inmemory"/>. +</para> + +<sect2 id="storage-toast-ondisk"> + <title>Out-of-Line, On-Disk TOAST Storage</title> + +<para> +If any of the columns of a table are <acronym>TOAST</acronym>-able, the table will +have an associated <acronym>TOAST</acronym> table, whose OID is stored in the table's +<structname>pg_class</structname>.<structfield>reltoastrelid</structfield> entry. On-disk +<acronym>TOAST</acronym>ed values are kept in the <acronym>TOAST</acronym> table, as +described in more detail below. +</para> + +<para> +Out-of-line values are divided (after compression if used) into chunks of at +most <symbol>TOAST_MAX_CHUNK_SIZE</symbol> bytes (by default this value is chosen +so that four chunk rows will fit on a page, making it about 2000 bytes). +Each chunk is stored as a separate row in the <acronym>TOAST</acronym> table +belonging to the owning table. Every +<acronym>TOAST</acronym> table has the columns <structfield>chunk_id</structfield> (an OID +identifying the particular <acronym>TOAST</acronym>ed value), +<structfield>chunk_seq</structfield> (a sequence number for the chunk within its value), +and <structfield>chunk_data</structfield> (the actual data of the chunk). A unique index +on <structfield>chunk_id</structfield> and <structfield>chunk_seq</structfield> provides fast +retrieval of the values. A pointer datum representing an out-of-line on-disk +<acronym>TOAST</acronym>ed value therefore needs to store the OID of the +<acronym>TOAST</acronym> table in which to look and the OID of the specific value +(its <structfield>chunk_id</structfield>). For convenience, pointer datums also store the +logical datum size (original uncompressed data length), physical stored size +(different if compression was applied), and the compression method used, if +any. Allowing for the varlena header bytes, +the total size of an on-disk <acronym>TOAST</acronym> pointer datum is therefore 18 +bytes regardless of the actual size of the represented value. +</para> + +<para> +The <acronym>TOAST</acronym> management code is triggered only +when a row value to be stored in a table is wider than +<symbol>TOAST_TUPLE_THRESHOLD</symbol> bytes (normally 2 kB). +The <acronym>TOAST</acronym> code will compress and/or move +field values out-of-line until the row value is shorter than +<symbol>TOAST_TUPLE_TARGET</symbol> bytes (also normally 2 kB, adjustable) +or no more gains can be had. During an UPDATE +operation, values of unchanged fields are normally preserved as-is; so an +UPDATE of a row with out-of-line values incurs no <acronym>TOAST</acronym> costs if +none of the out-of-line values change. +</para> + +<para> +The <acronym>TOAST</acronym> management code recognizes four different strategies +for storing <acronym>TOAST</acronym>-able columns on disk: + + <itemizedlist> + <listitem> + <para> + <literal>PLAIN</literal> prevents either compression or + out-of-line storage; furthermore it disables use of single-byte headers + for varlena types. + This is the only possible strategy for + columns of non-<acronym>TOAST</acronym>-able data types. + </para> + </listitem> + <listitem> + <para> + <literal>EXTENDED</literal> allows both compression and out-of-line + storage. This is the default for most <acronym>TOAST</acronym>-able data types. + Compression will be attempted first, then out-of-line storage if + the row is still too big. + </para> + </listitem> + <listitem> + <para> + <literal>EXTERNAL</literal> allows out-of-line storage but not + compression. Use of <literal>EXTERNAL</literal> will + make substring operations on wide <type>text</type> and + <type>bytea</type> columns faster (at the penalty of increased storage + space) because these operations are optimized to fetch only the + required parts of the out-of-line value when it is not compressed. + </para> + </listitem> + <listitem> + <para> + <literal>MAIN</literal> allows compression but not out-of-line + storage. (Actually, out-of-line storage will still be performed + for such columns, but only as a last resort when there is no other + way to make the row small enough to fit on a page.) + </para> + </listitem> + </itemizedlist> + +Each <acronym>TOAST</acronym>-able data type specifies a default strategy for columns +of that data type, but the strategy for a given table column can be altered +with <link linkend="sql-altertable"><command>ALTER TABLE ... SET STORAGE</command></link>. +</para> + +<para> +<symbol>TOAST_TUPLE_TARGET</symbol> can be adjusted for each table using +<link linkend="sql-altertable"><command>ALTER TABLE ... SET (toast_tuple_target = N)</command></link> +</para> + +<para> +This scheme has a number of advantages compared to a more straightforward +approach such as allowing row values to span pages. Assuming that queries are +usually qualified by comparisons against relatively small key values, most of +the work of the executor will be done using the main row entry. The big values +of <acronym>TOAST</acronym>ed attributes will only be pulled out (if selected at all) +at the time the result set is sent to the client. Thus, the main table is much +smaller and more of its rows fit in the shared buffer cache than would be the +case without any out-of-line storage. Sort sets shrink also, and sorts will +more often be done entirely in memory. A little test showed that a table +containing typical HTML pages and their URLs was stored in about half of the +raw data size including the <acronym>TOAST</acronym> table, and that the main table +contained only about 10% of the entire data (the URLs and some small HTML +pages). There was no run time difference compared to an un-<acronym>TOAST</acronym>ed +comparison table, in which all the HTML pages were cut down to 7 kB to fit. +</para> + +</sect2> + +<sect2 id="storage-toast-inmemory"> + <title>Out-of-Line, In-Memory TOAST Storage</title> + +<para> +<acronym>TOAST</acronym> pointers can point to data that is not on disk, but is +elsewhere in the memory of the current server process. Such pointers +obviously cannot be long-lived, but they are nonetheless useful. There +are currently two sub-cases: +pointers to <firstterm>indirect</firstterm> data and +pointers to <firstterm>expanded</firstterm> data. +</para> + +<para> +Indirect <acronym>TOAST</acronym> pointers simply point at a non-indirect varlena +value stored somewhere in memory. This case was originally created merely +as a proof of concept, but it is currently used during logical decoding to +avoid possibly having to create physical tuples exceeding 1 GB (as pulling +all out-of-line field values into the tuple might do). The case is of +limited use since the creator of the pointer datum is entirely responsible +that the referenced data survives for as long as the pointer could exist, +and there is no infrastructure to help with this. +</para> + +<para> +Expanded <acronym>TOAST</acronym> pointers are useful for complex data types +whose on-disk representation is not especially suited for computational +purposes. As an example, the standard varlena representation of a +<productname>PostgreSQL</productname> array includes dimensionality information, a +nulls bitmap if there are any null elements, then the values of all the +elements in order. When the element type itself is variable-length, the +only way to find the <replaceable>N</replaceable>'th element is to scan through all the +preceding elements. This representation is appropriate for on-disk storage +because of its compactness, but for computations with the array it's much +nicer to have an <quote>expanded</quote> or <quote>deconstructed</quote> +representation in which all the element starting locations have been +identified. The <acronym>TOAST</acronym> pointer mechanism supports this need by +allowing a pass-by-reference Datum to point to either a standard varlena +value (the on-disk representation) or a <acronym>TOAST</acronym> pointer that +points to an expanded representation somewhere in memory. The details of +this expanded representation are up to the data type, though it must have +a standard header and meet the other API requirements given +in <filename>src/include/utils/expandeddatum.h</filename>. C-level functions +working with the data type can choose to handle either representation. +Functions that do not know about the expanded representation, but simply +apply <function>PG_DETOAST_DATUM</function> to their inputs, will automatically +receive the traditional varlena representation; so support for an expanded +representation can be introduced incrementally, one function at a time. +</para> + +<para> +<acronym>TOAST</acronym> pointers to expanded values are further broken down +into <firstterm>read-write</firstterm> and <firstterm>read-only</firstterm> pointers. +The pointed-to representation is the same either way, but a function that +receives a read-write pointer is allowed to modify the referenced value +in-place, whereas one that receives a read-only pointer must not; it must +first create a copy if it wants to make a modified version of the value. +This distinction and some associated conventions make it possible to avoid +unnecessary copying of expanded values during query execution. +</para> + +<para> +For all types of in-memory <acronym>TOAST</acronym> pointer, the <acronym>TOAST</acronym> +management code ensures that no such pointer datum can accidentally get +stored on disk. In-memory <acronym>TOAST</acronym> pointers are automatically +expanded to normal in-line varlena values before storage — and then +possibly converted to on-disk <acronym>TOAST</acronym> pointers, if the containing +tuple would otherwise be too big. +</para> + +</sect2> + +</sect1> + +<sect1 id="storage-fsm"> + +<title>Free Space Map</title> + +<indexterm> + <primary>Free Space Map</primary> +</indexterm> +<indexterm><primary>FSM</primary><see>Free Space Map</see></indexterm> + +<para> +Each heap and index relation, except for hash indexes, has a Free Space Map +(<acronym>FSM</acronym>) to keep track of available space in the relation. +It's stored alongside the main relation data in a separate relation fork, +named after the filenode number of the relation, plus a <literal>_fsm</literal> +suffix. For example, if the filenode of a relation is 12345, the +<acronym>FSM</acronym> is stored in a file called +<filename>12345_fsm</filename>, in the same directory as the main relation file. +</para> + +<para> +The Free Space Map is organized as a tree of <acronym>FSM</acronym> pages. The +bottom level <acronym>FSM</acronym> pages store the free space available on each +heap (or index) page, using one byte to represent each such page. The upper +levels aggregate information from the lower levels. +</para> + +<para> +Within each <acronym>FSM</acronym> page is a binary tree, stored in an array with +one byte per node. Each leaf node represents a heap page, or a lower level +<acronym>FSM</acronym> page. In each non-leaf node, the higher of its children's +values is stored. The maximum value in the leaf nodes is therefore stored +at the root. +</para> + +<para> +See <filename>src/backend/storage/freespace/README</filename> for more details on +how the <acronym>FSM</acronym> is structured, and how it's updated and searched. +The <xref linkend="pgfreespacemap"/> module +can be used to examine the information stored in free space maps. +</para> + +</sect1> + +<sect1 id="storage-vm"> + +<title>Visibility Map</title> + +<indexterm> + <primary>Visibility Map</primary> +</indexterm> +<indexterm><primary>VM</primary><see>Visibility Map</see></indexterm> + +<para> +Each heap relation has a Visibility Map +(VM) to keep track of which pages contain only tuples that are known to be +visible to all active transactions; it also keeps track of which pages contain +only frozen tuples. It's stored +alongside the main relation data in a separate relation fork, named after the +filenode number of the relation, plus a <literal>_vm</literal> suffix. For example, +if the filenode of a relation is 12345, the VM is stored in a file called +<filename>12345_vm</filename>, in the same directory as the main relation file. +Note that indexes do not have VMs. +</para> + +<para> +The visibility map stores two bits per heap page. The first bit, if set, +indicates that the page is all-visible, or in other words that the page does +not contain any tuples that need to be vacuumed. +This information can also be used +by <link linkend="indexes-index-only-scans"><firstterm>index-only +scans</firstterm></link> to answer queries using only the index tuple. +The second bit, if set, means that all tuples on the page have been frozen. +That means that even an anti-wraparound vacuum need not revisit the page. +</para> + +<para> +The map is conservative in the sense that we make sure that whenever a bit is +set, we know the condition is true, but if a bit is not set, it might or +might not be true. Visibility map bits are only set by vacuum, but are +cleared by any data-modifying operations on a page. +</para> + +<para> +The <xref linkend="pgvisibility"/> module can be used to examine the +information stored in the visibility map. +</para> + +</sect1> + +<sect1 id="storage-init"> + +<title>The Initialization Fork</title> + +<indexterm> + <primary>Initialization Fork</primary> +</indexterm> + +<para> +Each unlogged table, and each index on an unlogged table, has an initialization +fork. The initialization fork is an empty table or index of the appropriate +type. When an unlogged table must be reset to empty due to a crash, the +initialization fork is copied over the main fork, and any other forks are +erased (they will be recreated automatically as needed). +</para> + +</sect1> + +<sect1 id="storage-page-layout"> + +<title>Database Page Layout</title> + +<para> +This section provides an overview of the page format used within +<productname>PostgreSQL</productname> tables and indexes.<footnote> + <para> + Actually, use of this page format is not required for either table or + index access methods. The <literal>heap</literal> table access method + always uses this format. All the existing index methods also use the + basic format, but the data kept on index metapages usually doesn't follow + the item layout rules. + </para> +</footnote> +Sequences and <acronym>TOAST</acronym> tables are formatted just like a regular table. +</para> + +<para> +In the following explanation, a +<firstterm>byte</firstterm> +is assumed to contain 8 bits. In addition, the term +<firstterm>item</firstterm> +refers to an individual data value that is stored on a page. In a table, +an item is a row; in an index, an item is an index entry. +</para> + +<para> +Every table and index is stored as an array of <firstterm>pages</firstterm> of a +fixed size (usually 8 kB, although a different page size can be selected +when compiling the server). In a table, all the pages are logically +equivalent, so a particular item (row) can be stored in any page. In +indexes, the first page is generally reserved as a <firstterm>metapage</firstterm> +holding control information, and there can be different types of pages +within the index, depending on the index access method. +</para> + +<para> +<xref linkend="page-table"/> shows the overall layout of a page. +There are five parts to each page. +</para> + +<table tocentry="1" id="page-table"> +<title>Overall Page Layout</title> +<titleabbrev>Page Layout</titleabbrev> +<tgroup cols="2"> +<thead> +<row> +<entry> +Item +</entry> +<entry>Description</entry> +</row> +</thead> + +<tbody> + +<row> + <entry>PageHeaderData</entry> + <entry>24 bytes long. Contains general information about the page, including +free space pointers.</entry> +</row> + +<row> +<entry>ItemIdData</entry> +<entry>Array of item identifiers pointing to the actual items. Each +entry is an (offset,length) pair. 4 bytes per item.</entry> +</row> + +<row> +<entry>Free space</entry> +<entry>The unallocated space. New item identifiers are allocated from +the start of this area, new items from the end.</entry> +</row> + +<row> +<entry>Items</entry> +<entry>The actual items themselves.</entry> +</row> + +<row> +<entry>Special space</entry> +<entry>Index access method specific data. Different methods store different +data. Empty in ordinary tables.</entry> +</row> + +</tbody> +</tgroup> +</table> + + <para> + + The first 24 bytes of each page consists of a page header + (<structname>PageHeaderData</structname>). Its format is detailed in <xref + linkend="pageheaderdata-table"/>. The first field tracks the most + recent WAL entry related to this page. The second field contains + the page checksum if <xref linkend="app-initdb-data-checksums"/> are + enabled. Next is a 2-byte field containing flag bits. This is followed + by three 2-byte integer fields (<structfield>pd_lower</structfield>, + <structfield>pd_upper</structfield>, and + <structfield>pd_special</structfield>). These contain byte offsets + from the page start to the start of unallocated space, to the end of + unallocated space, and to the start of the special space. The next 2 + bytes of the page header, <structfield>pd_pagesize_version</structfield>, + store both the page size and a version indicator. Beginning with + <productname>PostgreSQL</productname> 8.3 the version number is 4; + <productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3; + <productname>PostgreSQL</productname> 8.0 used version number 2; + <productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1; + prior releases used version number 0. + (The basic page layout and header format has not changed in most of these + versions, but the layout of heap row headers has.) The page size + is basically only present as a cross-check; there is no support for having + more than one page size in an installation. + The last field is a hint that shows whether pruning the page is likely + to be profitable: it tracks the oldest un-pruned XMAX on the page. + + </para> + + <table tocentry="1" id="pageheaderdata-table"> + <title>PageHeaderData Layout</title> + <titleabbrev>PageHeaderData Layout</titleabbrev> + <tgroup cols="4"> + <thead> + <row> + <entry>Field</entry> + <entry>Type</entry> + <entry>Length</entry> + <entry>Description</entry> + </row> + </thead> + <tbody> + <row> + <entry>pd_lsn</entry> + <entry>PageXLogRecPtr</entry> + <entry>8 bytes</entry> + <entry>LSN: next byte after last byte of WAL record for last change + to this page</entry> + </row> + <row> + <entry>pd_checksum</entry> + <entry>uint16</entry> + <entry>2 bytes</entry> + <entry>Page checksum</entry> + </row> + <row> + <entry>pd_flags</entry> + <entry>uint16</entry> + <entry>2 bytes</entry> + <entry>Flag bits</entry> + </row> + <row> + <entry>pd_lower</entry> + <entry>LocationIndex</entry> + <entry>2 bytes</entry> + <entry>Offset to start of free space</entry> + </row> + <row> + <entry>pd_upper</entry> + <entry>LocationIndex</entry> + <entry>2 bytes</entry> + <entry>Offset to end of free space</entry> + </row> + <row> + <entry>pd_special</entry> + <entry>LocationIndex</entry> + <entry>2 bytes</entry> + <entry>Offset to start of special space</entry> + </row> + <row> + <entry>pd_pagesize_version</entry> + <entry>uint16</entry> + <entry>2 bytes</entry> + <entry>Page size and layout version number information</entry> + </row> + <row> + <entry>pd_prune_xid</entry> + <entry>TransactionId</entry> + <entry>4 bytes</entry> + <entry>Oldest unpruned XMAX on page, or zero if none</entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + All the details can be found in + <filename>src/include/storage/bufpage.h</filename>. + </para> + + <para> + Following the page header are item identifiers + (<type>ItemIdData</type>), each requiring four bytes. + An item identifier contains a byte-offset to + the start of an item, its length in bytes, and a few attribute bits + which affect its interpretation. + New item identifiers are allocated + as needed from the beginning of the unallocated space. + The number of item identifiers present can be determined by looking at + <structfield>pd_lower</structfield>, which is increased to allocate a new identifier. + Because an item + identifier is never moved until it is freed, its index can be used on a + long-term basis to reference an item, even when the item itself is moved + around on the page to compact free space. In fact, every pointer to an + item (<type>ItemPointer</type>, also known as + <type>CTID</type>) created by + <productname>PostgreSQL</productname> consists of a page number and the + index of an item identifier. + + </para> + + <para> + + The items themselves are stored in space allocated backwards from the end + of unallocated space. The exact structure varies depending on what the + table is to contain. Tables and sequences both use a structure named + <type>HeapTupleHeaderData</type>, described below. + + </para> + + <para> + + The final section is the <quote>special section</quote> which can + contain anything the access method wishes to store. For example, + b-tree indexes store links to the page's left and right siblings, + as well as some other data relevant to the index structure. + Ordinary tables do not use a special section at all (indicated by setting + <structfield>pd_special</structfield> to equal the page size). + + </para> + + <para> + <xref linkend="storage-page-layout-figure"/> illustrates how these parts are + laid out in a page. + </para> + + <figure id="storage-page-layout-figure"> + <title>Page Layout</title> + <mediaobject> + <imageobject> + <imagedata fileref="images/pagelayout.svg" format="SVG" width="100%"/> + </imageobject> + </mediaobject> + </figure> + + <sect2 id="storage-tuple-layout"> + + <title>Table Row Layout</title> + + <para> + + All table rows are structured in the same way. There is a fixed-size + header (occupying 23 bytes on most machines), followed by an optional null + bitmap, an optional object ID field, and the user data. The header is + detailed + in <xref linkend="heaptupleheaderdata-table"/>. The actual user data + (columns of the row) begins at the offset indicated by + <structfield>t_hoff</structfield>, which must always be a multiple of the MAXALIGN + distance for the platform. + The null bitmap is + only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in + <structfield>t_infomask</structfield>. If it is present it begins just after + the fixed header and occupies enough bytes to have one bit per data column + (that is, the number of bits that equals the attribute count in + <structfield>t_infomask2</structfield>). In this list of bits, a + 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not + present, all columns are assumed not-null. + The object ID is only present if the <firstterm>HEAP_HASOID_OLD</firstterm> bit + is set in <structfield>t_infomask</structfield>. If present, it appears just + before the <structfield>t_hoff</structfield> boundary. Any padding needed to make + <structfield>t_hoff</structfield> a MAXALIGN multiple will appear between the null + bitmap and the object ID. (This in turn ensures that the object ID is + suitably aligned.) + + </para> + + <table tocentry="1" id="heaptupleheaderdata-table"> + <title>HeapTupleHeaderData Layout</title> + <titleabbrev>HeapTupleHeaderData Layout</titleabbrev> + <tgroup cols="4"> + <thead> + <row> + <entry>Field</entry> + <entry>Type</entry> + <entry>Length</entry> + <entry>Description</entry> + </row> + </thead> + <tbody> + <row> + <entry>t_xmin</entry> + <entry>TransactionId</entry> + <entry>4 bytes</entry> + <entry>insert XID stamp</entry> + </row> + <row> + <entry>t_xmax</entry> + <entry>TransactionId</entry> + <entry>4 bytes</entry> + <entry>delete XID stamp</entry> + </row> + <row> + <entry>t_cid</entry> + <entry>CommandId</entry> + <entry>4 bytes</entry> + <entry>insert and/or delete CID stamp (overlays with t_xvac)</entry> + </row> + <row> + <entry>t_xvac</entry> + <entry>TransactionId</entry> + <entry>4 bytes</entry> + <entry>XID for VACUUM operation moving a row version</entry> + </row> + <row> + <entry>t_ctid</entry> + <entry>ItemPointerData</entry> + <entry>6 bytes</entry> + <entry>current TID of this or newer row version</entry> + </row> + <row> + <entry>t_infomask2</entry> + <entry>uint16</entry> + <entry>2 bytes</entry> + <entry>number of attributes, plus various flag bits</entry> + </row> + <row> + <entry>t_infomask</entry> + <entry>uint16</entry> + <entry>2 bytes</entry> + <entry>various flag bits</entry> + </row> + <row> + <entry>t_hoff</entry> + <entry>uint8</entry> + <entry>1 byte</entry> + <entry>offset to user data</entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + All the details can be found in + <filename>src/include/access/htup_details.h</filename>. + </para> + + <para> + + Interpreting the actual data can only be done with information obtained + from other tables, mostly <structname>pg_attribute</structname>. The + key values needed to identify field locations are + <structfield>attlen</structfield> and <structfield>attalign</structfield>. + There is no way to directly get a + particular attribute, except when there are only fixed width fields and no + null values. All this trickery is wrapped up in the functions + <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm> + and <firstterm>heap_getsysattr</firstterm>. + + </para> + <para> + + To read the data you need to examine each attribute in turn. First check + whether the field is NULL according to the null bitmap. If it is, go to + the next. Then make sure you have the right alignment. If the field is a + fixed width field, then all the bytes are simply placed. If it's a + variable length field (attlen = -1) then it's a bit more complicated. + All variable-length data types share the common header structure + <type>struct varlena</type>, which includes the total length of the stored + value and some flag bits. Depending on the flags, the data can be either + inline or in a <acronym>TOAST</acronym> table; + it might be compressed, too (see <xref linkend="storage-toast"/>). + + </para> + </sect2> +</sect1> + +<sect1 id="storage-hot"> + + <title>Heap-Only Tuples (<acronym>HOT</acronym>)</title> + + <para> + To allow for high concurrency, <productname>PostgreSQL</productname> + uses <link linkend="mvcc-intro">multiversion concurrency + control</link> (<acronym>MVCC</acronym>) to store rows. However, + <acronym>MVCC</acronym> has some downsides for update queries. + Specifically, updates require new versions of rows to be added to + tables. This can also require new index entries for each updated row, + and removal of old versions of rows and their index entries can be + expensive. + </para> + + <para> + To help reduce the overhead of updates, + <productname>PostgreSQL</productname> has an optimization called + heap-only tuples (<acronym>HOT</acronym>). This optimization is + possible when: + + <itemizedlist> + <listitem> + <para> + The update does not modify any columns referenced by the table's + indexes, including expression and partial indexes. + </para> + </listitem> + <listitem> + <para> + There is sufficient free space on the page containing the old row + for the updated row. + </para> + </listitem> + </itemizedlist> + + In such cases, heap-only tuples provide two optimizations: + + <itemizedlist> + <listitem> + <para> + New index entries are not needed to represent updated rows. + </para> + </listitem> + <listitem> + <para> + Old versions of updated rows can be completely removed during normal + operation, including <command>SELECT</command>s, instead of requiring + periodic vacuum operations. (This is possible because indexes + do not reference their <link linkend="storage-page-layout">page + item identifiers</link>.) + </para> + </listitem> + </itemizedlist> + </para> + + <para> + In summary, heap-only tuple updates can only be created + if columns used by indexes are not updated. You can + increase the likelihood of sufficient page space for + <acronym>HOT</acronym> updates by decreasing a table's <link + linkend="reloption-fillfactor"><literal>fillfactor</literal></link>. + If you don't, <acronym>HOT</acronym> updates will still happen because + new rows will naturally migrate to new pages and existing pages with + sufficient free space for new row versions. The system view <link + linkend="monitoring-pg-stat-all-tables-view">pg_stat_all_tables</link> + allows monitoring of the occurrence of HOT and non-HOT updates. + </para> +</sect1> + +</chapter> |