summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/storage.sgml
blob: 3ea4e5526df5a054360345dc08ba4ee0072aaaab (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
<!-- doc/src/sgml/storage.sgml -->

<chapter id="storage">

<title>Database Physical Storage</title>

<para>
This chapter provides an overview of the physical storage format used by
<productname>PostgreSQL</productname> databases.
</para>

<sect1 id="storage-file-layout">

<title>Database File Layout</title>

<para>
This section describes the storage format at the level of files and
directories.
</para>

<para>
Traditionally, the configuration and data files used by a database
cluster are stored together within the cluster's data
directory, commonly referred to as <varname>PGDATA</varname> (after the name of the
environment variable that can be used to define it).  A common location for
<varname>PGDATA</varname> is <filename>/var/lib/pgsql/data</filename>.  Multiple clusters,
managed by different server instances, can exist on the same machine.
</para>

<para>
The <varname>PGDATA</varname> directory contains several subdirectories and control
files, as shown in <xref linkend="pgdata-contents-table"/>.  In addition to
these required items, the cluster configuration files
<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
<filename>pg_ident.conf</filename> are traditionally stored in
<varname>PGDATA</varname>, although it is possible to place them elsewhere.
</para>

<table tocentry="1" id="pgdata-contents-table">
<title>Contents of <varname>PGDATA</varname></title>
<tgroup cols="2">
<thead>
<row>
<entry>
Item
</entry>
<entry>Description</entry>
</row>
</thead>

<tbody>

<row>
 <entry><filename>PG_VERSION</filename></entry>
 <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
</row>

<row>
 <entry><filename>base</filename></entry>
 <entry>Subdirectory containing per-database subdirectories</entry>
</row>

<row>
 <entry><filename>current_logfiles</filename></entry>
 <entry>File recording the log file(s) currently written to by the logging
  collector</entry>
</row>

<row>
 <entry><filename>global</filename></entry>
 <entry>Subdirectory containing cluster-wide tables, such as
 <structname>pg_database</structname></entry>
</row>

<row>
 <entry><filename>pg_commit_ts</filename></entry>
 <entry>Subdirectory containing transaction commit timestamp data</entry>
</row>

<row>
 <entry><filename>pg_dynshmem</filename></entry>
 <entry>Subdirectory containing files used by the dynamic shared memory
  subsystem</entry>
</row>

<row>
 <entry><filename>pg_logical</filename></entry>
 <entry>Subdirectory containing status data for logical decoding</entry>
</row>

<row>
 <entry><filename>pg_multixact</filename></entry>
 <entry>Subdirectory containing multitransaction status data
  (used for shared row locks)</entry>
</row>

<row>
 <entry><filename>pg_notify</filename></entry>
 <entry>Subdirectory containing LISTEN/NOTIFY status data</entry>
</row>

<row>
 <entry><filename>pg_replslot</filename></entry>
 <entry>Subdirectory containing replication slot data</entry>
</row>

<row>
 <entry><filename>pg_serial</filename></entry>
 <entry>Subdirectory containing information about committed serializable transactions</entry>
</row>

<row>
 <entry><filename>pg_snapshots</filename></entry>
 <entry>Subdirectory containing exported snapshots</entry>
</row>

<row>
 <entry><filename>pg_stat</filename></entry>
 <entry>Subdirectory containing permanent files for the statistics
  subsystem</entry>
</row>

<row>
 <entry><filename>pg_stat_tmp</filename></entry>
 <entry>Subdirectory containing temporary files for the statistics
  subsystem</entry>
</row>

<row>
 <entry><filename>pg_subtrans</filename></entry>
 <entry>Subdirectory containing subtransaction status data</entry>
</row>

<row>
 <entry><filename>pg_tblspc</filename></entry>
 <entry>Subdirectory containing symbolic links to tablespaces</entry>
</row>

<row>
 <entry><filename>pg_twophase</filename></entry>
 <entry>Subdirectory containing state files for prepared transactions</entry>
</row>

<row>
 <entry><filename>pg_wal</filename></entry>
 <entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
</row>

<row>
 <entry><filename>pg_xact</filename></entry>
 <entry>Subdirectory containing transaction commit status data</entry>
</row>

<row>
 <entry><filename>postgresql.auto.conf</filename></entry>
 <entry>A file used for storing configuration parameters that are set by
<command>ALTER SYSTEM</command></entry>
</row>

<row>
 <entry><filename>postmaster.opts</filename></entry>
 <entry>A file recording the command-line options the server was
last started with</entry>
</row>

<row>
 <entry><filename>postmaster.pid</filename></entry>
 <entry>A lock file recording the current postmaster process ID (PID),
  cluster data directory path,
  postmaster start timestamp,
  port number,
  Unix-domain socket directory path (could be empty),
  first valid listen_address (IP address or <literal>*</literal>, or empty if
  not listening on TCP),
  and shared memory segment ID
  (this file is not present after server shutdown)</entry>
</row>

</tbody>
</tgroup>
</table>

<para>
For each database in the cluster there is a subdirectory within
<varname>PGDATA</varname><filename>/base</filename>, named after the database's OID in
<structname>pg_database</structname>.  This subdirectory is the default location
for the database's files; in particular, its system catalogs are stored
there.
</para>

<para>
 Note that the following sections describe the behavior of the builtin
 <literal>heap</literal> <link linkend="tableam">table access method</link>,
 and the builtin <link linkend="indexam">index access methods</link>. Due
 to the extensible nature of <productname>PostgreSQL</productname>, other
 access methods might work differently.
</para>

<para>
Each table and index is stored in a separate file.  For ordinary relations,
these files are named after the table or index's <firstterm>filenode</firstterm> number,
which can be found in <structname>pg_class</structname>.<structfield>relfilenode</structfield>. But
for temporary relations, the file name is of the form
<literal>t<replaceable>BBB</replaceable>_<replaceable>FFF</replaceable></literal>, where <replaceable>BBB</replaceable>
is the backend ID of the backend which created the file, and <replaceable>FFF</replaceable>
is the filenode number.  In either case, in addition to the main file (a/k/a
main fork), each table and index has a <firstterm>free space map</firstterm> (see <xref
linkend="storage-fsm"/>), which stores information about free space available in
the relation.  The free space map is stored in a file named with the filenode
number plus the suffix <literal>_fsm</literal>.  Tables also have a
<firstterm>visibility map</firstterm>, stored in a fork with the suffix <literal>_vm</literal>,
to track which pages are known to have no dead tuples.  The visibility map is
described further in <xref linkend="storage-vm"/>.  Unlogged tables and indexes
have a third fork, known as the initialization fork, which is stored in a fork
with the suffix <literal>_init</literal> (see <xref linkend="storage-init"/>).
</para>

<caution>
<para>
Note that while a table's filenode often matches its OID, this is
<emphasis>not</emphasis> necessarily the case; some operations, like
<command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms
of <command>ALTER TABLE</command>, can change the filenode while preserving the OID.
Avoid assuming that filenode and table OID are the same.
Also, for certain system catalogs including <structname>pg_class</structname> itself,
<structname>pg_class</structname>.<structfield>relfilenode</structfield> contains zero.  The
actual filenode number of these catalogs is stored in a lower-level data
structure, and can be obtained using the <function>pg_relation_filenode()</function>
function.
</para>
</caution>

<para>
When a table or index exceeds 1 GB, it is divided into gigabyte-sized
<firstterm>segments</firstterm>.  The first segment's file name is the same as the
filenode; subsequent segments are named filenode.1, filenode.2, etc.
This arrangement avoids problems on platforms that have file size limitations.
(Actually, 1 GB is just the default segment size.  The segment size can be
adjusted using the configuration option <option>--with-segsize</option>
when building <productname>PostgreSQL</productname>.)
In principle, free space map and visibility map forks could require multiple
segments as well, though this is unlikely to happen in practice.
</para>

<para>
A table that has columns with potentially large entries will have an
associated <firstterm>TOAST</firstterm> table, which is used for out-of-line storage of
field values that are too large to keep in the table rows proper.
<structname>pg_class</structname>.<structfield>reltoastrelid</structfield> links from a table to
its <acronym>TOAST</acronym> table, if any.
See <xref linkend="storage-toast"/> for more information.
</para>

<para>
The contents of tables and indexes are discussed further in
<xref linkend="storage-page-layout"/>.
</para>

<para>
Tablespaces make the scenario more complicated.  Each user-defined tablespace
has a symbolic link inside the <varname>PGDATA</varname><filename>/pg_tblspc</filename>
directory, which points to the physical tablespace directory (i.e., the
location specified in the tablespace's <command>CREATE TABLESPACE</command> command).
This symbolic link is named after
the tablespace's OID.  Inside the physical tablespace directory there is
a subdirectory with a name that depends on the <productname>PostgreSQL</productname>
server version, such as <literal>PG_9.0_201008051</literal>.  (The reason for using
this subdirectory is so that successive versions of the database can use
the same <command>CREATE TABLESPACE</command> location value without conflicts.)
Within the version-specific subdirectory, there is
a subdirectory for each database that has elements in the tablespace, named
after the database's OID.  Tables and indexes are stored within that
directory, using the filenode naming scheme.
The <literal>pg_default</literal> tablespace is not accessed through
<filename>pg_tblspc</filename>, but corresponds to
<varname>PGDATA</varname><filename>/base</filename>.  Similarly, the <literal>pg_global</literal>
tablespace is not accessed through <filename>pg_tblspc</filename>, but corresponds to
<varname>PGDATA</varname><filename>/global</filename>.
</para>

<para>
The <function>pg_relation_filepath()</function> function shows the entire path
(relative to <varname>PGDATA</varname>) of any relation.  It is often useful
as a substitute for remembering many of the above rules.  But keep in
mind that this function just gives the name of the first segment of the
main fork of the relation &mdash; you may need to append a segment number
and/or <literal>_fsm</literal>, <literal>_vm</literal>, or <literal>_init</literal> to find all
the files associated with the relation.
</para>

<para>
Temporary files (for operations such as sorting more data than can fit in
memory) are created within <varname>PGDATA</varname><filename>/base/pgsql_tmp</filename>,
or within a <filename>pgsql_tmp</filename> subdirectory of a tablespace directory
if a tablespace other than <literal>pg_default</literal> is specified for them.
The name of a temporary file has the form
<filename>pgsql_tmp<replaceable>PPP</replaceable>.<replaceable>NNN</replaceable></filename>,
where <replaceable>PPP</replaceable> is the PID of the owning backend and
<replaceable>NNN</replaceable> distinguishes different temporary files of that backend.
</para>

</sect1>

<sect1 id="storage-toast">

<title>TOAST</title>

    <indexterm>
     <primary>TOAST</primary>
    </indexterm>
    <indexterm><primary>sliced bread</primary><see>TOAST</see></indexterm>

<para>
This section provides an overview of <acronym>TOAST</acronym> (The
Oversized-Attribute Storage Technique).
</para>

<para>
<productname>PostgreSQL</productname> uses a fixed page size (commonly
8 kB), and does not allow tuples to span multiple pages.  Therefore, it is
not possible to store very large field values directly.  To overcome
this limitation, large field values are compressed and/or broken up into
multiple physical rows.  This happens transparently to the user, with only
small impact on most of the backend code.  The technique is affectionately
known as <acronym>TOAST</acronym> (or <quote>the best thing since sliced bread</quote>).
The <acronym>TOAST</acronym> infrastructure is also used to improve handling of
large data values in-memory.
</para>

<para>
Only certain data types support <acronym>TOAST</acronym> &mdash; there is no need to
impose the overhead on data types that cannot produce large field values.
To support <acronym>TOAST</acronym>, a data type must have a variable-length
(<firstterm>varlena</firstterm>) representation, in which, ordinarily, the first
four-byte word of any stored value contains the total length of the value in
bytes (including itself).  <acronym>TOAST</acronym> does not constrain the rest
of the data type's representation.  The special representations collectively
called <firstterm><acronym>TOAST</acronym>ed values</firstterm> work by modifying or
reinterpreting this initial length word.  Therefore, the C-level functions
supporting a <acronym>TOAST</acronym>-able data type must be careful about how they
handle potentially <acronym>TOAST</acronym>ed input values: an input might not
actually consist of a four-byte length word and contents until after it's
been <firstterm>detoasted</firstterm>.  (This is normally done by invoking
<function>PG_DETOAST_DATUM</function> before doing anything with an input value,
but in some cases more efficient approaches are possible.
See <xref linkend="xtypes-toast"/> for more detail.)
</para>

<para>
<acronym>TOAST</acronym> usurps two bits of the varlena length word (the high-order
bits on big-endian machines, the low-order bits on little-endian machines),
thereby limiting the logical size of any value of a <acronym>TOAST</acronym>-able
data type to 1 GB (2<superscript>30</superscript> - 1 bytes).  When both bits are zero,
the value is an ordinary un-<acronym>TOAST</acronym>ed value of the data type, and
the remaining bits of the length word give the total datum size (including
length word) in bytes.  When the highest-order or lowest-order bit is set,
the value has only a single-byte header instead of the normal four-byte
header, and the remaining bits of that byte give the total datum size
(including length byte) in bytes.  This alternative supports space-efficient
storage of values shorter than 127 bytes, while still allowing the data type
to grow to 1 GB at need.  Values with single-byte headers aren't aligned on
any particular boundary, whereas values with four-byte headers are aligned on
at least a four-byte boundary; this omission of alignment padding provides
additional space savings that is significant compared to short values.
As a special case, if the remaining bits of a single-byte header are all
zero (which would be impossible for a self-inclusive length), the value is
a pointer to out-of-line data, with several possible alternatives as
described below.  The type and size of such a <firstterm>TOAST pointer</firstterm>
are determined by a code stored in the second byte of the datum.
Lastly, when the highest-order or lowest-order bit is clear but the adjacent
bit is set, the content of the datum has been compressed and must be
decompressed before use.  In this case the remaining bits of the four-byte
length word give the total size of the compressed datum, not the
original data.  Note that compression is also possible for out-of-line data
but the varlena header does not tell whether it has occurred &mdash;
the content of the <acronym>TOAST</acronym> pointer tells that, instead.
</para>

<para>
The compression technique used for either in-line or out-of-line compressed
data can be selected for each column by setting
the <literal>COMPRESSION</literal> column option in <command>CREATE
TABLE</command> or <command>ALTER TABLE</command>.  The default for columns
with no explicit setting is to consult the
<xref linkend="guc-default-toast-compression"/> parameter at the time data is
inserted.
</para>

<para>
As mentioned, there are multiple types of <acronym>TOAST</acronym> pointer datums.
The oldest and most common type is a pointer to out-of-line data stored in
a <firstterm><acronym>TOAST</acronym> table</firstterm> that is separate from, but
associated with, the table containing the <acronym>TOAST</acronym> pointer datum
itself.  These <firstterm>on-disk</firstterm> pointer datums are created by the
<acronym>TOAST</acronym> management code (in <filename>access/common/toast_internals.c</filename>)
when a tuple to be stored on disk is too large to be stored as-is.
Further details appear in <xref linkend="storage-toast-ondisk"/>.
Alternatively, a <acronym>TOAST</acronym> pointer datum can contain a pointer to
out-of-line data that appears elsewhere in memory.  Such datums are
necessarily short-lived, and will never appear on-disk, but they are very
useful for avoiding copying and redundant processing of large data values.
Further details appear in <xref linkend="storage-toast-inmemory"/>.
</para>

<sect2 id="storage-toast-ondisk">
 <title>Out-of-Line, On-Disk TOAST Storage</title>

<para>
If any of the columns of a table are <acronym>TOAST</acronym>-able, the table will
have an associated <acronym>TOAST</acronym> table, whose OID is stored in the table's
<structname>pg_class</structname>.<structfield>reltoastrelid</structfield> entry.  On-disk
<acronym>TOAST</acronym>ed values are kept in the <acronym>TOAST</acronym> table, as
described in more detail below.
</para>

<para>
Out-of-line values are divided (after compression if used) into chunks of at
most <symbol>TOAST_MAX_CHUNK_SIZE</symbol> bytes (by default this value is chosen
so that four chunk rows will fit on a page, making it about 2000 bytes).
Each chunk is stored as a separate row in the <acronym>TOAST</acronym> table
belonging to the owning table.  Every
<acronym>TOAST</acronym> table has the columns <structfield>chunk_id</structfield> (an OID
identifying the particular <acronym>TOAST</acronym>ed value),
<structfield>chunk_seq</structfield> (a sequence number for the chunk within its value),
and <structfield>chunk_data</structfield> (the actual data of the chunk).  A unique index
on <structfield>chunk_id</structfield> and <structfield>chunk_seq</structfield> provides fast
retrieval of the values.  A pointer datum representing an out-of-line on-disk
<acronym>TOAST</acronym>ed value therefore needs to store the OID of the
<acronym>TOAST</acronym> table in which to look and the OID of the specific value
(its <structfield>chunk_id</structfield>).  For convenience, pointer datums also store the
logical datum size (original uncompressed data length), physical stored size
(different if compression was applied), and the compression method used, if
any.  Allowing for the varlena header bytes,
the total size of an on-disk <acronym>TOAST</acronym> pointer datum is therefore 18
bytes regardless of the actual size of the represented value.
</para>

<para>
The <acronym>TOAST</acronym> management code is triggered only
when a row value to be stored in a table is wider than
<symbol>TOAST_TUPLE_THRESHOLD</symbol> bytes (normally 2 kB).
The <acronym>TOAST</acronym> code will compress and/or move
field values out-of-line until the row value is shorter than
<symbol>TOAST_TUPLE_TARGET</symbol> bytes (also normally 2 kB, adjustable)
or no more gains can be had.  During an UPDATE
operation, values of unchanged fields are normally preserved as-is; so an
UPDATE of a row with out-of-line values incurs no <acronym>TOAST</acronym> costs if
none of the out-of-line values change.
</para>

<para>
The <acronym>TOAST</acronym> management code recognizes four different strategies
for storing <acronym>TOAST</acronym>-able columns on disk:

   <itemizedlist>
    <listitem>
     <para>
      <literal>PLAIN</literal> prevents either compression or
      out-of-line storage.  This is the only possible strategy for
      columns of non-<acronym>TOAST</acronym>-able data types.
     </para>
    </listitem>
    <listitem>
     <para>
      <literal>EXTENDED</literal> allows both compression and out-of-line
      storage.  This is the default for most <acronym>TOAST</acronym>-able data types.
      Compression will be attempted first, then out-of-line storage if
      the row is still too big.
     </para>
    </listitem>
    <listitem>
     <para>
      <literal>EXTERNAL</literal> allows out-of-line storage but not
      compression.  Use of <literal>EXTERNAL</literal> will
      make substring operations on wide <type>text</type> and
      <type>bytea</type> columns faster (at the penalty of increased storage
      space) because these operations are optimized to fetch only the
      required parts of the out-of-line value when it is not compressed.
     </para>
    </listitem>
    <listitem>
     <para>
      <literal>MAIN</literal> allows compression but not out-of-line
      storage.  (Actually, out-of-line storage will still be performed
      for such columns, but only as a last resort when there is no other
      way to make the row small enough to fit on a page.)
     </para>
    </listitem>
   </itemizedlist>

Each <acronym>TOAST</acronym>-able data type specifies a default strategy for columns
of that data type, but the strategy for a given table column can be altered
with <link linkend="sql-altertable"><command>ALTER TABLE ... SET STORAGE</command></link>.
</para>

<para>
<symbol>TOAST_TUPLE_TARGET</symbol> can be adjusted for each table using
<link linkend="sql-altertable"><command>ALTER TABLE ... SET (toast_tuple_target = N)</command></link>
</para>

<para>
This scheme has a number of advantages compared to a more straightforward
approach such as allowing row values to span pages.  Assuming that queries are
usually qualified by comparisons against relatively small key values, most of
the work of the executor will be done using the main row entry. The big values
of <acronym>TOAST</acronym>ed attributes will only be pulled out (if selected at all)
at the time the result set is sent to the client. Thus, the main table is much
smaller and more of its rows fit in the shared buffer cache than would be the
case without any out-of-line storage. Sort sets shrink also, and sorts will
more often be done entirely in memory. A little test showed that a table
containing typical HTML pages and their URLs was stored in about half of the
raw data size including the <acronym>TOAST</acronym> table, and that the main table
contained only about 10% of the entire data (the URLs and some small HTML
pages). There was no run time difference compared to an un-<acronym>TOAST</acronym>ed
comparison table, in which all the HTML pages were cut down to 7 kB to fit.
</para>

</sect2>

<sect2 id="storage-toast-inmemory">
 <title>Out-of-Line, In-Memory TOAST Storage</title>

<para>
<acronym>TOAST</acronym> pointers can point to data that is not on disk, but is
elsewhere in the memory of the current server process.  Such pointers
obviously cannot be long-lived, but they are nonetheless useful.  There
are currently two sub-cases:
pointers to <firstterm>indirect</firstterm> data and
pointers to <firstterm>expanded</firstterm> data.
</para>

<para>
Indirect <acronym>TOAST</acronym> pointers simply point at a non-indirect varlena
value stored somewhere in memory.  This case was originally created merely
as a proof of concept, but it is currently used during logical decoding to
avoid possibly having to create physical tuples exceeding 1 GB (as pulling
all out-of-line field values into the tuple might do).  The case is of
limited use since the creator of the pointer datum is entirely responsible
that the referenced data survives for as long as the pointer could exist,
and there is no infrastructure to help with this.
</para>

<para>
Expanded <acronym>TOAST</acronym> pointers are useful for complex data types
whose on-disk representation is not especially suited for computational
purposes.  As an example, the standard varlena representation of a
<productname>PostgreSQL</productname> array includes dimensionality information, a
nulls bitmap if there are any null elements, then the values of all the
elements in order.  When the element type itself is variable-length, the
only way to find the <replaceable>N</replaceable>'th element is to scan through all the
preceding elements.  This representation is appropriate for on-disk storage
because of its compactness, but for computations with the array it's much
nicer to have an <quote>expanded</quote> or <quote>deconstructed</quote>
representation in which all the element starting locations have been
identified.  The <acronym>TOAST</acronym> pointer mechanism supports this need by
allowing a pass-by-reference Datum to point to either a standard varlena
value (the on-disk representation) or a <acronym>TOAST</acronym> pointer that
points to an expanded representation somewhere in memory.  The details of
this expanded representation are up to the data type, though it must have
a standard header and meet the other API requirements given
in <filename>src/include/utils/expandeddatum.h</filename>.  C-level functions
working with the data type can choose to handle either representation.
Functions that do not know about the expanded representation, but simply
apply <function>PG_DETOAST_DATUM</function> to their inputs, will automatically
receive the traditional varlena representation; so support for an expanded
representation can be introduced incrementally, one function at a time.
</para>

<para>
<acronym>TOAST</acronym> pointers to expanded values are further broken down
into <firstterm>read-write</firstterm> and <firstterm>read-only</firstterm> pointers.
The pointed-to representation is the same either way, but a function that
receives a read-write pointer is allowed to modify the referenced value
in-place, whereas one that receives a read-only pointer must not; it must
first create a copy if it wants to make a modified version of the value.
This distinction and some associated conventions make it possible to avoid
unnecessary copying of expanded values during query execution.
</para>

<para>
For all types of in-memory <acronym>TOAST</acronym> pointer, the <acronym>TOAST</acronym>
management code ensures that no such pointer datum can accidentally get
stored on disk.  In-memory <acronym>TOAST</acronym> pointers are automatically
expanded to normal in-line varlena values before storage &mdash; and then
possibly converted to on-disk <acronym>TOAST</acronym> pointers, if the containing
tuple would otherwise be too big.
</para>

</sect2>

</sect1>

<sect1 id="storage-fsm">

<title>Free Space Map</title>

<indexterm>
 <primary>Free Space Map</primary>
</indexterm>
<indexterm><primary>FSM</primary><see>Free Space Map</see></indexterm>

<para>
Each heap and index relation, except for hash indexes, has a Free Space Map
(<acronym>FSM</acronym>) to keep track of available space in the relation.
It's stored alongside the main relation data in a separate relation fork,
named after the filenode number of the relation, plus a <literal>_fsm</literal>
suffix. For example, if the filenode of a relation is 12345, the
<acronym>FSM</acronym> is stored in a file called
<filename>12345_fsm</filename>, in the same directory as the main relation file.
</para>

<para>
The Free Space Map is organized as a tree of <acronym>FSM</acronym> pages. The
bottom level <acronym>FSM</acronym> pages store the free space available on each
heap (or index) page, using one byte to represent each such page. The upper
levels aggregate information from the lower levels.
</para>

<para>
Within each <acronym>FSM</acronym> page is a binary tree, stored in an array with
one byte per node. Each leaf node represents a heap page, or a lower level
<acronym>FSM</acronym> page. In each non-leaf node, the higher of its children's
values is stored. The maximum value in the leaf nodes is therefore stored
at the root.
</para>

<para>
See <filename>src/backend/storage/freespace/README</filename> for more details on
how the <acronym>FSM</acronym> is structured, and how it's updated and searched.
The <xref linkend="pgfreespacemap"/> module
can be used to examine the information stored in free space maps.
</para>

</sect1>

<sect1 id="storage-vm">

<title>Visibility Map</title>

<indexterm>
 <primary>Visibility Map</primary>
</indexterm>
<indexterm><primary>VM</primary><see>Visibility Map</see></indexterm>

<para>
Each heap relation has a Visibility Map
(VM) to keep track of which pages contain only tuples that are known to be
visible to all active transactions; it also keeps track of which pages contain
only frozen tuples.  It's stored
alongside the main relation data in a separate relation fork, named after the
filenode number of the relation, plus a <literal>_vm</literal> suffix. For example,
if the filenode of a relation is 12345, the VM is stored in a file called
<filename>12345_vm</filename>, in the same directory as the main relation file.
Note that indexes do not have VMs.
</para>

<para>
The visibility map stores two bits per heap page.  The first bit, if set,
indicates that the page is all-visible, or in other words that the page does
not contain any tuples that need to be vacuumed.
This information can also be used
by <link linkend="indexes-index-only-scans"><firstterm>index-only
scans</firstterm></link> to answer queries using only the index tuple.
The second bit, if set, means that all tuples on the page have been frozen.
That means that even an anti-wraparound vacuum need not revisit the page.
</para>

<para>
The map is conservative in the sense that we make sure that whenever a bit is
set, we know the condition is true, but if a bit is not set, it might or
might not be true. Visibility map bits are only set by vacuum, but are
cleared by any data-modifying operations on a page.
</para>

<para>
The <xref linkend="pgvisibility"/> module can be used to examine the
information stored in the visibility map.
</para>

</sect1>

<sect1 id="storage-init">

<title>The Initialization Fork</title>

<indexterm>
 <primary>Initialization Fork</primary>
</indexterm>

<para>
Each unlogged table, and each index on an unlogged table, has an initialization
fork.  The initialization fork is an empty table or index of the appropriate
type.  When an unlogged table must be reset to empty due to a crash, the
initialization fork is copied over the main fork, and any other forks are
erased (they will be recreated automatically as needed).
</para>

</sect1>

<sect1 id="storage-page-layout">

<title>Database Page Layout</title>

<para>
This section provides an overview of the page format used within
<productname>PostgreSQL</productname> tables and indexes.<footnote>
  <para>
    Actually, use of this page format is not required for either table or
    index access methods. The <literal>heap</literal> table access method
    always uses this format.  All the existing index methods also use the
    basic format, but the data kept on index metapages usually doesn't follow
    the item layout rules.
  </para>
</footnote>
Sequences and <acronym>TOAST</acronym> tables are formatted just like a regular table.
</para>

<para>
In the following explanation, a
<firstterm>byte</firstterm>
is assumed to contain 8 bits.  In addition, the term
<firstterm>item</firstterm>
refers to an individual data value that is stored on a page.  In a table,
an item is a row; in an index, an item is an index entry.
</para>

<para>
Every table and index is stored as an array of <firstterm>pages</firstterm> of a
fixed size (usually 8 kB, although a different page size can be selected
when compiling the server).  In a table, all the pages are logically
equivalent, so a particular item (row) can be stored in any page.  In
indexes, the first page is generally reserved as a <firstterm>metapage</firstterm>
holding control information, and there can be different types of pages
within the index, depending on the index access method.
</para>

<para>
<xref linkend="page-table"/> shows the overall layout of a page.
There are five parts to each page.
</para>

<table tocentry="1" id="page-table">
<title>Overall Page Layout</title>
<titleabbrev>Page Layout</titleabbrev>
<tgroup cols="2">
<thead>
<row>
<entry>
Item
</entry>
<entry>Description</entry>
</row>
</thead>

<tbody>

<row>
 <entry>PageHeaderData</entry>
 <entry>24 bytes long. Contains general information about the page, including
free space pointers.</entry>
</row>

<row>
<entry>ItemIdData</entry>
<entry>Array of item identifiers pointing to the actual items. Each
entry is an (offset,length) pair. 4 bytes per item.</entry>
</row>

<row>
<entry>Free space</entry>
<entry>The unallocated space. New item identifiers are allocated from
the start of this area, new items from the end.</entry>
</row>

<row>
<entry>Items</entry>
<entry>The actual items themselves.</entry>
</row>

<row>
<entry>Special space</entry>
<entry>Index access method specific data. Different methods store different
data. Empty in ordinary tables.</entry>
</row>

</tbody>
</tgroup>
</table>

 <para>

  The first 24 bytes of each page consists of a page header
  (<structname>PageHeaderData</structname>). Its format is detailed in <xref
  linkend="pageheaderdata-table"/>. The first field tracks the most
  recent WAL entry related to this page. The second field contains
  the page checksum if <xref linkend="app-initdb-data-checksums"/> are
  enabled.  Next is a 2-byte field containing flag bits. This is followed
  by three 2-byte integer fields (<structfield>pd_lower</structfield>,
  <structfield>pd_upper</structfield>, and
  <structfield>pd_special</structfield>).  These contain byte offsets
  from the page start to the start of unallocated space, to the end of
  unallocated space, and to the start of the special space.  The next 2
  bytes of the page header, <structfield>pd_pagesize_version</structfield>,
  store both the page size and a version indicator.  Beginning with
  <productname>PostgreSQL</productname> 8.3 the version number is 4;
  <productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3;
  <productname>PostgreSQL</productname> 8.0 used version number 2;
  <productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
  prior releases used version number 0.
  (The basic page layout and header format has not changed in most of these
  versions, but the layout of heap row headers has.)  The page size
  is basically only present as a cross-check; there is no support for having
  more than one page size in an installation.
  The last field is a hint that shows whether pruning the page is likely
  to be profitable: it tracks the oldest un-pruned XMAX on the page.

 </para>

 <table tocentry="1" id="pageheaderdata-table">
 <title>PageHeaderData Layout</title>
 <titleabbrev>PageHeaderData Layout</titleabbrev>
 <tgroup cols="4">
 <thead>
  <row>
   <entry>Field</entry>
   <entry>Type</entry>
   <entry>Length</entry>
   <entry>Description</entry>
  </row>
 </thead>
 <tbody>
  <row>
   <entry>pd_lsn</entry>
   <entry>PageXLogRecPtr</entry>
   <entry>8 bytes</entry>
   <entry>LSN: next byte after last byte of WAL record for last change
   to this page</entry>
  </row>
  <row>
   <entry>pd_checksum</entry>
   <entry>uint16</entry>
   <entry>2 bytes</entry>
   <entry>Page checksum</entry>
  </row>
  <row>
   <entry>pd_flags</entry>
   <entry>uint16</entry>
   <entry>2 bytes</entry>
   <entry>Flag bits</entry>
  </row>
  <row>
   <entry>pd_lower</entry>
   <entry>LocationIndex</entry>
   <entry>2 bytes</entry>
   <entry>Offset to start of free space</entry>
  </row>
  <row>
   <entry>pd_upper</entry>
   <entry>LocationIndex</entry>
   <entry>2 bytes</entry>
   <entry>Offset to end of free space</entry>
  </row>
  <row>
   <entry>pd_special</entry>
   <entry>LocationIndex</entry>
   <entry>2 bytes</entry>
   <entry>Offset to start of special space</entry>
  </row>
  <row>
   <entry>pd_pagesize_version</entry>
   <entry>uint16</entry>
   <entry>2 bytes</entry>
   <entry>Page size and layout version number information</entry>
  </row>
  <row>
   <entry>pd_prune_xid</entry>
   <entry>TransactionId</entry>
   <entry>4 bytes</entry>
   <entry>Oldest unpruned XMAX on page, or zero if none</entry>
  </row>
 </tbody>
 </tgroup>
 </table>

 <para>
  All the details can be found in
  <filename>src/include/storage/bufpage.h</filename>.
 </para>

 <para>
  Following the page header are item identifiers
  (<type>ItemIdData</type>), each requiring four bytes.
  An item identifier contains a byte-offset to
  the start of an item, its length in bytes, and a few attribute bits
  which affect its interpretation.
  New item identifiers are allocated
  as needed from the beginning of the unallocated space.
  The number of item identifiers present can be determined by looking at
  <structfield>pd_lower</structfield>, which is increased to allocate a new identifier.
  Because an item
  identifier is never moved until it is freed, its index can be used on a
  long-term basis to reference an item, even when the item itself is moved
  around on the page to compact free space.  In fact, every pointer to an
  item (<type>ItemPointer</type>, also known as
  <type>CTID</type>) created by
  <productname>PostgreSQL</productname> consists of a page number and the
  index of an item identifier.

 </para>

 <para>

  The items themselves are stored in space allocated backwards from the end
  of unallocated space.  The exact structure varies depending on what the
  table is to contain. Tables and sequences both use a structure named
  <type>HeapTupleHeaderData</type>, described below.

 </para>

 <para>

  The final section is the <quote>special section</quote> which can
  contain anything the access method wishes to store.  For example,
  b-tree indexes store links to the page's left and right siblings,
  as well as some other data relevant to the index structure.
  Ordinary tables do not use a special section at all (indicated by setting
  <structfield>pd_special</structfield> to equal the page size).

 </para>

 <para>
  <xref linkend="storage-page-layout-figure"/> illustrates how these parts are
  laid out in a page.
 </para>

 <figure id="storage-page-layout-figure">
  <title>Page Layout</title>
  <mediaobject>
   <imageobject>
    <imagedata fileref="images/pagelayout.svg" format="SVG" width="100%"/>
   </imageobject>
  </mediaobject>
 </figure>

 <sect2 id="storage-tuple-layout">

 <title>Table Row Layout</title>

 <para>

  All table rows are structured in the same way. There is a fixed-size
  header (occupying 23 bytes on most machines), followed by an optional null
  bitmap, an optional object ID field, and the user data. The header is
  detailed
  in <xref linkend="heaptupleheaderdata-table"/>.  The actual user data
  (columns of the row) begins at the offset indicated by
  <structfield>t_hoff</structfield>, which must always be a multiple of the MAXALIGN
  distance for the platform.
  The null bitmap is
  only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
  <structfield>t_infomask</structfield>. If it is present it begins just after
  the fixed header and occupies enough bytes to have one bit per data column
  (that is, the number of bits that equals the attribute count in
  <structfield>t_infomask2</structfield>). In this list of bits, a
  1 bit indicates not-null, a 0 bit is a null.  When the bitmap is not
  present, all columns are assumed not-null.
  The object ID is only present if the <firstterm>HEAP_HASOID_OLD</firstterm> bit
  is set in <structfield>t_infomask</structfield>.  If present, it appears just
  before the <structfield>t_hoff</structfield> boundary.  Any padding needed to make
  <structfield>t_hoff</structfield> a MAXALIGN multiple will appear between the null
  bitmap and the object ID.  (This in turn ensures that the object ID is
  suitably aligned.)

 </para>

 <table tocentry="1" id="heaptupleheaderdata-table">
 <title>HeapTupleHeaderData Layout</title>
 <titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
 <tgroup cols="4">
 <thead>
  <row>
   <entry>Field</entry>
   <entry>Type</entry>
   <entry>Length</entry>
   <entry>Description</entry>
  </row>
 </thead>
 <tbody>
  <row>
   <entry>t_xmin</entry>
   <entry>TransactionId</entry>
   <entry>4 bytes</entry>
   <entry>insert XID stamp</entry>
  </row>
  <row>
   <entry>t_xmax</entry>
   <entry>TransactionId</entry>
   <entry>4 bytes</entry>
   <entry>delete XID stamp</entry>
  </row>
  <row>
   <entry>t_cid</entry>
   <entry>CommandId</entry>
   <entry>4 bytes</entry>
   <entry>insert and/or delete CID stamp (overlays with t_xvac)</entry>
  </row>
  <row>
   <entry>t_xvac</entry>
   <entry>TransactionId</entry>
   <entry>4 bytes</entry>
   <entry>XID for VACUUM operation moving a row version</entry>
  </row>
  <row>
   <entry>t_ctid</entry>
   <entry>ItemPointerData</entry>
   <entry>6 bytes</entry>
   <entry>current TID of this or newer row version</entry>
  </row>
  <row>
   <entry>t_infomask2</entry>
   <entry>uint16</entry>
   <entry>2 bytes</entry>
   <entry>number of attributes, plus various flag bits</entry>
  </row>
  <row>
   <entry>t_infomask</entry>
   <entry>uint16</entry>
   <entry>2 bytes</entry>
   <entry>various flag bits</entry>
  </row>
  <row>
   <entry>t_hoff</entry>
   <entry>uint8</entry>
   <entry>1 byte</entry>
   <entry>offset to user data</entry>
  </row>
 </tbody>
 </tgroup>
 </table>

 <para>
   All the details can be found in
   <filename>src/include/access/htup_details.h</filename>.
 </para>

 <para>

  Interpreting the actual data can only be done with information obtained
  from other tables, mostly <structname>pg_attribute</structname>. The
  key values needed to identify field locations are
  <structfield>attlen</structfield> and <structfield>attalign</structfield>.
  There is no way to directly get a
  particular attribute, except when there are only fixed width fields and no
  null values. All this trickery is wrapped up in the functions
  <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
  and <firstterm>heap_getsysattr</firstterm>.

 </para>
 <para>

  To read the data you need to examine each attribute in turn. First check
  whether the field is NULL according to the null bitmap. If it is, go to
  the next. Then make sure you have the right alignment.  If the field is a
  fixed width field, then all the bytes are simply placed. If it's a
  variable length field (attlen = -1) then it's a bit more complicated.
  All variable-length data types share the common header structure
  <type>struct varlena</type>, which includes the total length of the stored
  value and some flag bits.  Depending on the flags, the data can be either
  inline or in a <acronym>TOAST</acronym> table;
  it might be compressed, too (see <xref linkend="storage-toast"/>).

 </para>
 </sect2>
</sect1>

<sect1 id="storage-hot">

 <title>Heap-Only Tuples (<acronym>HOT</acronym>)</title>

 <para>
  To allow for high concurrency, <productname>PostgreSQL</productname>
  uses <link linkend="mvcc-intro">multiversion concurrency
  control</link> (<acronym>MVCC</acronym>) to store rows.  However,
  <acronym>MVCC</acronym> has some downsides for update queries.
  Specifically, updates require new versions of rows to be added to
  tables.  This can also require new index entries for each updated row,
  and removal of old versions of rows and their index entries can be
  expensive.
 </para>

 <para>
  To help reduce the overhead of updates,
  <productname>PostgreSQL</productname> has an optimization called
  heap-only tuples (<acronym>HOT</acronym>).  This optimization is
  possible when:

  <itemizedlist>
   <listitem>
    <para>
     The update does not modify any columns referenced by the table's
     indexes, including expression and partial indexes.
     </para>
   </listitem>
   <listitem>
    <para>
     There is sufficient free space on the page containing the old row
     for the updated row.
    </para>
   </listitem>
  </itemizedlist>

  In such cases, heap-only tuples provide two optimizations:

  <itemizedlist>
   <listitem>
    <para>
     New index entries are not needed to represent updated rows.
    </para>
   </listitem>
   <listitem>
    <para>
     Old versions of updated rows can be completely removed during normal
     operation, including <command>SELECT</command>s, instead of requiring
     periodic vacuum operations.  (This is possible because indexes
     do not reference their <link linkend="storage-page-layout">page
     item identifiers</link>.)
    </para>
   </listitem>
  </itemizedlist>
 </para>

 <para>
  In summary, heap-only tuple updates can only be created
  if columns used by indexes are not updated.  You can
  increase the likelihood of sufficient page space for
  <acronym>HOT</acronym> updates by decreasing a table's <link
  linkend="reloption-fillfactor"><literal>fillfactor</literal></link>.
  If you don't, <acronym>HOT</acronym> updates will still happen because
  new rows will naturally migrate to new pages and existing pages with
  sufficient free space for new row versions.  The system view <link
  linkend="monitoring-pg-stat-all-tables-view">pg_stat_all_tables</link>
  allows monitoring of the occurrence of HOT and non-HOT updates.
 </para>
</sect1>

</chapter>