diff options
Diffstat (limited to 'doc/src/sgml/html/storage-toast.html')
-rw-r--r-- | doc/src/sgml/html/storage-toast.html | 221 |
1 files changed, 221 insertions, 0 deletions
diff --git a/doc/src/sgml/html/storage-toast.html b/doc/src/sgml/html/storage-toast.html new file mode 100644 index 0000000..f0724d1 --- /dev/null +++ b/doc/src/sgml/html/storage-toast.html @@ -0,0 +1,221 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>69.2. TOAST</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="storage-file-layout.html" title="69.1. Database File Layout" /><link rel="next" href="storage-fsm.html" title="69.3. Free Space Map" /></head><body id="docContent" class="container-fluid col-10"><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">69.2. TOAST</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="storage-file-layout.html" title="69.1. Database File Layout">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="storage.html" title="Chapter 69. Database Physical Storage">Up</a></td><th width="60%" align="center">Chapter 69. Database Physical Storage</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 13.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="storage-fsm.html" title="69.3. Free Space Map">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="STORAGE-TOAST"><div class="titlepage"><div><div><h2 class="title" style="clear: both">69.2. TOAST</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="storage-toast.html#STORAGE-TOAST-ONDISK">69.2.1. Out-of-Line, On-Disk TOAST Storage</a></span></dt><dt><span class="sect2"><a href="storage-toast.html#STORAGE-TOAST-INMEMORY">69.2.2. Out-of-Line, In-Memory TOAST Storage</a></span></dt></dl></div><a id="id-1.10.22.4.2" class="indexterm"></a><a id="id-1.10.22.4.3" class="indexterm"></a><p> +This section provides an overview of <acronym class="acronym">TOAST</acronym> (The +Oversized-Attribute Storage Technique). +</p><p> +<span class="productname">PostgreSQL</span> uses a fixed page size (commonly +8 kB), and does not allow tuples to span multiple pages. Therefore, it is +not possible to store very large field values directly. To overcome +this limitation, large field values are compressed and/or broken up into +multiple physical rows. This happens transparently to the user, with only +small impact on most of the backend code. The technique is affectionately +known as <acronym class="acronym">TOAST</acronym> (or <span class="quote">“<span class="quote">the best thing since sliced bread</span>”</span>). +The <acronym class="acronym">TOAST</acronym> infrastructure is also used to improve handling of +large data values in-memory. +</p><p> +Only certain data types support <acronym class="acronym">TOAST</acronym> — there is no need to +impose the overhead on data types that cannot produce large field values. +To support <acronym class="acronym">TOAST</acronym>, a data type must have a variable-length +(<em class="firstterm">varlena</em>) representation, in which, ordinarily, the first +four-byte word of any stored value contains the total length of the value in +bytes (including itself). <acronym class="acronym">TOAST</acronym> does not constrain the rest +of the data type's representation. The special representations collectively +called <em class="firstterm"><acronym class="acronym">TOAST</acronym>ed values</em> work by modifying or +reinterpreting this initial length word. Therefore, the C-level functions +supporting a <acronym class="acronym">TOAST</acronym>-able data type must be careful about how they +handle potentially <acronym class="acronym">TOAST</acronym>ed input values: an input might not +actually consist of a four-byte length word and contents until after it's +been <em class="firstterm">detoasted</em>. (This is normally done by invoking +<code class="function">PG_DETOAST_DATUM</code> before doing anything with an input value, +but in some cases more efficient approaches are possible. +See <a class="xref" href="xtypes.html#XTYPES-TOAST" title="37.13.1. TOAST Considerations">Section 37.13.1</a> for more detail.) +</p><p> +<acronym class="acronym">TOAST</acronym> usurps two bits of the varlena length word (the high-order +bits on big-endian machines, the low-order bits on little-endian machines), +thereby limiting the logical size of any value of a <acronym class="acronym">TOAST</acronym>-able +data type to 1 GB (2<sup>30</sup> - 1 bytes). When both bits are zero, +the value is an ordinary un-<acronym class="acronym">TOAST</acronym>ed value of the data type, and +the remaining bits of the length word give the total datum size (including +length word) in bytes. When the highest-order or lowest-order bit is set, +the value has only a single-byte header instead of the normal four-byte +header, and the remaining bits of that byte give the total datum size +(including length byte) in bytes. This alternative supports space-efficient +storage of values shorter than 127 bytes, while still allowing the data type +to grow to 1 GB at need. Values with single-byte headers aren't aligned on +any particular boundary, whereas values with four-byte headers are aligned on +at least a four-byte boundary; this omission of alignment padding provides +additional space savings that is significant compared to short values. +As a special case, if the remaining bits of a single-byte header are all +zero (which would be impossible for a self-inclusive length), the value is +a pointer to out-of-line data, with several possible alternatives as +described below. The type and size of such a <em class="firstterm">TOAST pointer</em> +are determined by a code stored in the second byte of the datum. +Lastly, when the highest-order or lowest-order bit is clear but the adjacent +bit is set, the content of the datum has been compressed and must be +decompressed before use. In this case the remaining bits of the four-byte +length word give the total size of the compressed datum, not the +original data. Note that compression is also possible for out-of-line data +but the varlena header does not tell whether it has occurred — +the content of the <acronym class="acronym">TOAST</acronym> pointer tells that, instead. +</p><p> +As mentioned, there are multiple types of <acronym class="acronym">TOAST</acronym> pointer datums. +The oldest and most common type is a pointer to out-of-line data stored in +a <em class="firstterm"><acronym class="acronym">TOAST</acronym> table</em> that is separate from, but +associated with, the table containing the <acronym class="acronym">TOAST</acronym> pointer datum +itself. These <em class="firstterm">on-disk</em> pointer datums are created by the +<acronym class="acronym">TOAST</acronym> management code (in <code class="filename">access/common/toast_internals.c</code>) +when a tuple to be stored on disk is too large to be stored as-is. +Further details appear in <a class="xref" href="storage-toast.html#STORAGE-TOAST-ONDISK" title="69.2.1. Out-of-Line, On-Disk TOAST Storage">Section 69.2.1</a>. +Alternatively, a <acronym class="acronym">TOAST</acronym> pointer datum can contain a pointer to +out-of-line data that appears elsewhere in memory. Such datums are +necessarily short-lived, and will never appear on-disk, but they are very +useful for avoiding copying and redundant processing of large data values. +Further details appear in <a class="xref" href="storage-toast.html#STORAGE-TOAST-INMEMORY" title="69.2.2. Out-of-Line, In-Memory TOAST Storage">Section 69.2.2</a>. +</p><p> +The compression technique used for either in-line or out-of-line compressed +data is a fairly simple and very fast member +of the LZ family of compression techniques. See +<code class="filename">src/common/pg_lzcompress.c</code> for the details. +</p><div class="sect2" id="STORAGE-TOAST-ONDISK"><div class="titlepage"><div><div><h3 class="title">69.2.1. Out-of-Line, On-Disk TOAST Storage</h3></div></div></div><p> +If any of the columns of a table are <acronym class="acronym">TOAST</acronym>-able, the table will +have an associated <acronym class="acronym">TOAST</acronym> table, whose OID is stored in the table's +<code class="structname">pg_class</code>.<code class="structfield">reltoastrelid</code> entry. On-disk +<acronym class="acronym">TOAST</acronym>ed values are kept in the <acronym class="acronym">TOAST</acronym> table, as +described in more detail below. +</p><p> +Out-of-line values are divided (after compression if used) into chunks of at +most <code class="symbol">TOAST_MAX_CHUNK_SIZE</code> bytes (by default this value is chosen +so that four chunk rows will fit on a page, making it about 2000 bytes). +Each chunk is stored as a separate row in the <acronym class="acronym">TOAST</acronym> table +belonging to the owning table. Every +<acronym class="acronym">TOAST</acronym> table has the columns <code class="structfield">chunk_id</code> (an OID +identifying the particular <acronym class="acronym">TOAST</acronym>ed value), +<code class="structfield">chunk_seq</code> (a sequence number for the chunk within its value), +and <code class="structfield">chunk_data</code> (the actual data of the chunk). A unique index +on <code class="structfield">chunk_id</code> and <code class="structfield">chunk_seq</code> provides fast +retrieval of the values. A pointer datum representing an out-of-line on-disk +<acronym class="acronym">TOAST</acronym>ed value therefore needs to store the OID of the +<acronym class="acronym">TOAST</acronym> table in which to look and the OID of the specific value +(its <code class="structfield">chunk_id</code>). For convenience, pointer datums also store the +logical datum size (original uncompressed data length) and physical stored size +(different if compression was applied). Allowing for the varlena header bytes, +the total size of an on-disk <acronym class="acronym">TOAST</acronym> pointer datum is therefore 18 +bytes regardless of the actual size of the represented value. +</p><p> +The <acronym class="acronym">TOAST</acronym> management code is triggered only +when a row value to be stored in a table is wider than +<code class="symbol">TOAST_TUPLE_THRESHOLD</code> bytes (normally 2 kB). +The <acronym class="acronym">TOAST</acronym> code will compress and/or move +field values out-of-line until the row value is shorter than +<code class="symbol">TOAST_TUPLE_TARGET</code> bytes (also normally 2 kB, adjustable) +or no more gains can be had. During an UPDATE +operation, values of unchanged fields are normally preserved as-is; so an +UPDATE of a row with out-of-line values incurs no <acronym class="acronym">TOAST</acronym> costs if +none of the out-of-line values change. +</p><p> +The <acronym class="acronym">TOAST</acronym> management code recognizes four different strategies +for storing <acronym class="acronym">TOAST</acronym>-able columns on disk: + + </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p> + <code class="literal">PLAIN</code> prevents either compression or + out-of-line storage; furthermore it disables use of single-byte headers + for varlena types. + This is the only possible strategy for + columns of non-<acronym class="acronym">TOAST</acronym>-able data types. + </p></li><li class="listitem"><p> + <code class="literal">EXTENDED</code> allows both compression and out-of-line + storage. This is the default for most <acronym class="acronym">TOAST</acronym>-able data types. + Compression will be attempted first, then out-of-line storage if + the row is still too big. + </p></li><li class="listitem"><p> + <code class="literal">EXTERNAL</code> allows out-of-line storage but not + compression. Use of <code class="literal">EXTERNAL</code> will + make substring operations on wide <code class="type">text</code> and + <code class="type">bytea</code> columns faster (at the penalty of increased storage + space) because these operations are optimized to fetch only the + required parts of the out-of-line value when it is not compressed. + </p></li><li class="listitem"><p> + <code class="literal">MAIN</code> allows compression but not out-of-line + storage. (Actually, out-of-line storage will still be performed + for such columns, but only as a last resort when there is no other + way to make the row small enough to fit on a page.) + </p></li></ul></div><p> + +Each <acronym class="acronym">TOAST</acronym>-able data type specifies a default strategy for columns +of that data type, but the strategy for a given table column can be altered +with <a class="link" href="sql-altertable.html" title="ALTER TABLE"><code class="command">ALTER TABLE ... SET STORAGE</code></a>. +</p><p> +<code class="symbol">TOAST_TUPLE_TARGET</code> can be adjusted for each table using +<a class="link" href="sql-altertable.html" title="ALTER TABLE"><code class="command">ALTER TABLE ... SET (toast_tuple_target = N)</code></a> +</p><p> +This scheme has a number of advantages compared to a more straightforward +approach such as allowing row values to span pages. Assuming that queries are +usually qualified by comparisons against relatively small key values, most of +the work of the executor will be done using the main row entry. The big values +of <acronym class="acronym">TOAST</acronym>ed attributes will only be pulled out (if selected at all) +at the time the result set is sent to the client. Thus, the main table is much +smaller and more of its rows fit in the shared buffer cache than would be the +case without any out-of-line storage. Sort sets shrink also, and sorts will +more often be done entirely in memory. A little test showed that a table +containing typical HTML pages and their URLs was stored in about half of the +raw data size including the <acronym class="acronym">TOAST</acronym> table, and that the main table +contained only about 10% of the entire data (the URLs and some small HTML +pages). There was no run time difference compared to an un-<acronym class="acronym">TOAST</acronym>ed +comparison table, in which all the HTML pages were cut down to 7 kB to fit. +</p></div><div class="sect2" id="STORAGE-TOAST-INMEMORY"><div class="titlepage"><div><div><h3 class="title">69.2.2. Out-of-Line, In-Memory TOAST Storage</h3></div></div></div><p> +<acronym class="acronym">TOAST</acronym> pointers can point to data that is not on disk, but is +elsewhere in the memory of the current server process. Such pointers +obviously cannot be long-lived, but they are nonetheless useful. There +are currently two sub-cases: +pointers to <em class="firstterm">indirect</em> data and +pointers to <em class="firstterm">expanded</em> data. +</p><p> +Indirect <acronym class="acronym">TOAST</acronym> pointers simply point at a non-indirect varlena +value stored somewhere in memory. This case was originally created merely +as a proof of concept, but it is currently used during logical decoding to +avoid possibly having to create physical tuples exceeding 1 GB (as pulling +all out-of-line field values into the tuple might do). The case is of +limited use since the creator of the pointer datum is entirely responsible +that the referenced data survives for as long as the pointer could exist, +and there is no infrastructure to help with this. +</p><p> +Expanded <acronym class="acronym">TOAST</acronym> pointers are useful for complex data types +whose on-disk representation is not especially suited for computational +purposes. As an example, the standard varlena representation of a +<span class="productname">PostgreSQL</span> array includes dimensionality information, a +nulls bitmap if there are any null elements, then the values of all the +elements in order. When the element type itself is variable-length, the +only way to find the <em class="replaceable"><code>N</code></em>'th element is to scan through all the +preceding elements. This representation is appropriate for on-disk storage +because of its compactness, but for computations with the array it's much +nicer to have an <span class="quote">“<span class="quote">expanded</span>”</span> or <span class="quote">“<span class="quote">deconstructed</span>”</span> +representation in which all the element starting locations have been +identified. The <acronym class="acronym">TOAST</acronym> pointer mechanism supports this need by +allowing a pass-by-reference Datum to point to either a standard varlena +value (the on-disk representation) or a <acronym class="acronym">TOAST</acronym> pointer that +points to an expanded representation somewhere in memory. The details of +this expanded representation are up to the data type, though it must have +a standard header and meet the other API requirements given +in <code class="filename">src/include/utils/expandeddatum.h</code>. C-level functions +working with the data type can choose to handle either representation. +Functions that do not know about the expanded representation, but simply +apply <code class="function">PG_DETOAST_DATUM</code> to their inputs, will automatically +receive the traditional varlena representation; so support for an expanded +representation can be introduced incrementally, one function at a time. +</p><p> +<acronym class="acronym">TOAST</acronym> pointers to expanded values are further broken down +into <em class="firstterm">read-write</em> and <em class="firstterm">read-only</em> pointers. +The pointed-to representation is the same either way, but a function that +receives a read-write pointer is allowed to modify the referenced value +in-place, whereas one that receives a read-only pointer must not; it must +first create a copy if it wants to make a modified version of the value. +This distinction and some associated conventions make it possible to avoid +unnecessary copying of expanded values during query execution. +</p><p> +For all types of in-memory <acronym class="acronym">TOAST</acronym> pointer, the <acronym class="acronym">TOAST</acronym> +management code ensures that no such pointer datum can accidentally get +stored on disk. In-memory <acronym class="acronym">TOAST</acronym> pointers are automatically +expanded to normal in-line varlena values before storage — and then +possibly converted to on-disk <acronym class="acronym">TOAST</acronym> pointers, if the containing +tuple would otherwise be too big. +</p></div></div><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navfooter"><hr></hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="storage-file-layout.html" title="69.1. Database File Layout">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="storage.html" title="Chapter 69. Database Physical Storage">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="storage-fsm.html" title="69.3. Free Space Map">Next</a></td></tr><tr><td width="40%" align="left" valign="top">69.1. Database File Layout </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 13.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 69.3. Free Space Map</td></tr></table></div></body></html>
\ No newline at end of file |