diff options
Diffstat (limited to '')
-rw-r--r-- | src/backend/access/heap/README.HOT | 499 |
1 files changed, 499 insertions, 0 deletions
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT new file mode 100644 index 0000000..68c6709 --- /dev/null +++ b/src/backend/access/heap/README.HOT @@ -0,0 +1,499 @@ +src/backend/access/heap/README.HOT + +Heap Only Tuples (HOT) +====================== + +The Heap Only Tuple (HOT) feature eliminates redundant index entries and +allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples +without performing a table-wide vacuum. It does this by allowing +single-page vacuuming, also called "defragmentation". + +Note: there is a Glossary at the end of this document that may be helpful +for first-time readers. + + +Technical Challenges +-------------------- + +Page-at-a-time vacuuming is normally impractical because of the costs of +finding and removing the index entries that link to the tuples to be +reclaimed. Standard vacuuming scans the indexes to ensure all such index +entries are removed, amortizing the index scan cost across as many dead +tuples as possible; this approach does not scale down well to the case of +reclaiming just a few tuples. In principle one could recompute the index +keys and do standard index searches to find the index entries, but this is +risky in the presence of possibly-buggy user-defined functions in +functional indexes. An allegedly immutable function that in fact is not +immutable might prevent us from re-finding an index entry (and we cannot +throw an error for not finding it, in view of the fact that dead index +entries are sometimes reclaimed early). That would lead to a seriously +corrupt index, in the form of entries pointing to tuple slots that by now +contain some unrelated content. In any case we would prefer to be able +to do vacuuming without invoking any user-written code. + +HOT solves this problem for a restricted but useful special case: +where a tuple is repeatedly updated in ways that do not change its +indexed columns. (Here, "indexed column" means any column referenced +at all in an index definition, including for example columns that are +tested in a partial-index predicate but are not stored in the index.) + +An additional property of HOT is that it reduces index size by avoiding +the creation of identically-keyed index entries. This improves search +speeds. + + +Update Chains With a Single Index Entry +--------------------------------------- + +Without HOT, every version of a row in an update chain has its own index +entries, even if all indexed columns are the same. With HOT, a new tuple +placed on the same page and with all indexed columns the same as its +parent row version does not get new index entries. This means there is +only one index entry for the entire update chain on the heap page. +An index-entry-less tuple is marked with the HEAP_ONLY_TUPLE flag. +The prior row version is marked HEAP_HOT_UPDATED, and (as always in an +update chain) its t_ctid field links forward to the newer version. + +For example: + + Index points to 1 + lp [1] [2] + + [111111111]->[2222222222] + +In the above diagram, the index points to line pointer 1, and tuple 1 is +marked as HEAP_HOT_UPDATED. Tuple 2 is a HOT tuple, meaning it has +no index entry pointing to it, and is marked as HEAP_ONLY_TUPLE. +Although tuple 2 is not directly referenced by the index, it can still be +found by an index search: after traversing from the index to tuple 1, +the index search proceeds forward to child tuples as long as it sees the +HEAP_HOT_UPDATED flag set. Since we restrict the HOT chain to lie within +a single page, this requires no additional page fetches and doesn't +introduce much performance penalty. + +Eventually, tuple 1 will no longer be visible to any transaction. +At that point its space could be reclaimed, but its line pointer cannot, +since the index still links to that line pointer and we still need to +be able to find tuple 2 in an index search. HOT handles this by turning +line pointer 1 into a "redirecting line pointer", which links to tuple 2 +but has no actual tuple attached. This state of affairs looks like + + Index points to 1 + lp [1]->[2] + + [2222222222] + +If now the row is updated again, to version 3, the page looks like this: + + Index points to 1 + lp [1]->[2] [3] + + [2222222222]->[3333333333] + +At some later time when no transaction can see tuple 2 in its snapshot, +tuple 2 and its line pointer can be pruned entirely: + + Index points to 1 + lp [1]------>[3] + + [3333333333] + +This is safe because no index entry points to line pointer 2. Subsequent +insertions into the page can now recycle both line pointer 2 and the +space formerly used by tuple 2. + +If an update changes any indexed column, or there is not room on the +same page for the new tuple, then the HOT chain ends: the last member +has a regular t_ctid link to the next version and is not marked +HEAP_HOT_UPDATED. (In principle we could continue a HOT chain across +pages, but this would destroy the desired property of being able to +reclaim space with just page-local manipulations. Anyway, we don't +want to have to chase through multiple heap pages to get from an index +entry to the desired tuple, so it seems better to create a new index +entry for the new tuple.) If further updates occur, the next version +could become the root of a new HOT chain. + +Line pointer 1 has to remain as long as there is any non-dead member of +the chain on the page. When there is not, it is marked "dead". +This lets us reclaim the last child line pointer and associated tuple +immediately. The next regular VACUUM pass can reclaim the index entries +pointing at the line pointer and then the line pointer itself. Since a +line pointer is small compared to a tuple, this does not represent an +undue space cost. + +Note: we can use a "dead" line pointer for any DELETEd tuple, +whether it was part of a HOT chain or not. This allows space reclamation +in advance of running VACUUM for plain DELETEs as well as HOT updates. + +The requirement for doing a HOT update is that none of the indexed +columns are changed. This is checked at execution time by comparing the +binary representation of the old and new values. We insist on bitwise +equality rather than using datatype-specific equality routines. The +main reason to avoid the latter is that there might be multiple notions +of equality for a datatype, and we don't know exactly which one is +relevant for the indexes at hand. We assume that bitwise equality +guarantees equality for all purposes. + + +Abort Cases +----------- + +If a heap-only tuple's xmin is aborted, then it can be removed immediately: +it was never visible to any other transaction, and all descendant row +versions must be aborted as well. Therefore we need not consider it part +of a HOT chain. By the same token, if a HOT-updated tuple's xmax is +aborted, there is no need to follow the chain link. However, there is a +race condition here: the transaction that did the HOT update might abort +between the time we inspect the HOT-updated tuple and the time we reach +the descendant heap-only tuple. It is conceivable that someone prunes +the heap-only tuple before that, and even conceivable that the line pointer +is re-used for another purpose. Therefore, when following a HOT chain, +it is always necessary to be prepared for the possibility that the +linked-to line pointer is unused, dead, or redirected; and if it is a +normal line pointer, we still have to check that XMIN of the tuple matches +the XMAX of the tuple we left. Otherwise we should assume that we have +come to the end of the HOT chain. Note that this sort of XMIN/XMAX +matching is required when following ordinary update chains anyway. + +(Early versions of the HOT code assumed that holding pin on the page +buffer while following a HOT link would prevent this type of problem, +but checking XMIN/XMAX matching is a much more robust solution.) + + +Index/Sequential Scans +---------------------- + +When doing an index scan, whenever we reach a HEAP_HOT_UPDATED tuple whose +xmax is not aborted, we need to follow its t_ctid link and check that +entry as well; possibly repeatedly until we reach the end of the HOT +chain. (When using an MVCC snapshot it is possible to optimize this a +bit: there can be at most one visible tuple in the chain, so we can stop +when we find it. This rule does not work for non-MVCC snapshots, though.) + +Sequential scans do not need to pay attention to the HOT links because +they scan every line pointer on the page anyway. The same goes for a +bitmap heap scan with a lossy bitmap. + + +Pruning +------- + +HOT pruning means updating line pointers so that HOT chains are +reduced in length, by collapsing out line pointers for intermediate dead +tuples. Although this makes those line pointers available for re-use, +it does not immediately make the space occupied by their tuples available. + + +Defragmentation +--------------- + +Defragmentation centralizes unused space. After we have converted root +line pointers to redirected line pointers and pruned away any dead +intermediate line pointers, the tuples they linked to are free space. +But unless that space is adjacent to the central "hole" on the page +(the pd_lower-to-pd_upper area) it cannot be used by tuple insertion. +Defragmentation moves the surviving tuples to coalesce all the free +space into one "hole". This is done with the same PageRepairFragmentation +function that regular VACUUM uses. + + +When can/should we prune or defragment? +--------------------------------------- + +This is the most interesting question in HOT implementation, since there +is no simple right answer: we must use heuristics to determine when it's +most efficient to perform pruning and/or defragmenting. + +We cannot prune or defragment unless we can get a "buffer cleanup lock" +on the target page; otherwise, pruning might destroy line pointers that +other backends have live references to, and defragmenting might move +tuples that other backends have live pointers to. Thus the general +approach must be to heuristically decide if we should try to prune +or defragment, and if so try to acquire the buffer cleanup lock without +blocking. If we succeed we can proceed with our housekeeping work. +If we cannot get the lock (which should not happen often, except under +very heavy contention) then the housekeeping has to be postponed till +some other time. The worst-case consequence of this is only that an +UPDATE cannot be made HOT but has to link to a new tuple version placed on +some other page, for lack of centralized space on the original page. + +Ideally we would do defragmenting only when we are about to attempt +heap_update on a HOT-safe tuple. The difficulty with this approach +is that the update query has certainly got a pin on the old tuple, and +therefore our attempt to acquire a buffer cleanup lock will always fail. +(This corresponds to the idea that we don't want to move the old tuple +out from under where the query's HeapTuple pointer points. It might +be possible to finesse that, but it seems fragile.) + +Pruning, however, is potentially useful even when we are not about to +insert a new tuple, since shortening a HOT chain reduces the cost of +subsequent index searches. However it is unclear that this gain is +large enough to accept any extra maintenance burden for. + +The currently planned heuristic is to prune and defrag when first accessing +a page that potentially has prunable tuples (as flagged by the pd_prune_xid +page hint field) and that either has free space less than MAX(fillfactor +target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to +find enough free space to store an updated tuple version. (These rules +are subject to change.) + +We have effectively implemented the "truncate dead tuples to just line +pointer" idea that has been proposed and rejected before because of fear +of line pointer bloat: we might end up with huge numbers of line pointers +and just a few actual tuples on a page. To limit the damage in the worst +case, and to keep various work arrays as well as the bitmaps in bitmap +scans reasonably sized, the maximum number of line pointers per page +is arbitrarily capped at MaxHeapTuplesPerPage (the most tuples that +could fit without HOT pruning). + +Effectively, space reclamation happens during tuple retrieval when the +page is nearly full (<10% free) and a buffer cleanup lock can be +acquired. This means that UPDATE, DELETE, and SELECT can trigger space +reclamation, but often not during INSERT ... VALUES because it does +not retrieve a row. + + +VACUUM +------ + +There is little change to regular vacuum. It performs pruning to remove +dead heap-only tuples, and cleans up any dead line pointers as if they were +regular dead tuples. + + +Statistics +---------- + +Currently, we count HOT updates the same as cold updates for statistics +purposes, though there is an additional per-table counter that counts +only HOT updates. When a page pruning operation is able to remove a +physical tuple by eliminating an intermediate heap-only tuple or +replacing a physical root tuple by a redirect pointer, a decrement in +the table's number of dead tuples is reported to pgstats, which may +postpone autovacuuming. Note that we do not count replacing a root tuple +by a DEAD line pointer as decrementing n_dead_tuples; we still want +autovacuum to run to clean up the index entries and DEAD item. + +This area probably needs further work ... + + +CREATE INDEX +------------ + +CREATE INDEX presents a problem for HOT updates. While the existing HOT +chains all have the same index values for existing indexes, the columns +in the new index might change within a pre-existing HOT chain, creating +a "broken" chain that can't be indexed properly. + +To address this issue, regular (non-concurrent) CREATE INDEX makes the +new index usable only by new transactions and transactions that don't +have snapshots older than the CREATE INDEX command. This prevents +queries that can see the inconsistent HOT chains from trying to use the +new index and getting incorrect results. Queries that can see the index +can only see the rows that were visible after the index was created, +hence the HOT chains are consistent for them. + +Entries in the new index point to root tuples (tuples with current index +pointers) so that our index uses the same index pointers as all other +indexes on the table. However the row we want to index is actually at +the *end* of the chain, ie, the most recent live tuple on the HOT chain. +That is the one we compute the index entry values for, but the TID +we put into the index is that of the root tuple. Since queries that +will be allowed to use the new index cannot see any of the older tuple +versions in the chain, the fact that they might not match the index entry +isn't a problem. (Such queries will check the tuple visibility +information of the older versions and ignore them, without ever looking at +their contents, so the content inconsistency is OK.) Subsequent updates +to the live tuple will be allowed to extend the HOT chain only if they are +HOT-safe for all the indexes. + +Because we have ShareLock on the table, any DELETE_IN_PROGRESS or +INSERT_IN_PROGRESS tuples should have come from our own transaction. +Therefore we can consider them committed since if the CREATE INDEX +commits, they will be committed, and if it aborts the index is discarded. +An exception to this is that early lock release is customary for system +catalog updates, and so we might find such tuples when reindexing a system +catalog. In that case we deal with it by waiting for the source +transaction to commit or roll back. (We could do that for user tables +too, but since the case is unexpected we prefer to throw an error.) + +Practically, we prevent certain transactions from using the new index by +setting pg_index.indcheckxmin to TRUE. Transactions are allowed to use +such an index only after pg_index.xmin is below their TransactionXmin +horizon, thereby ensuring that any incompatible rows in HOT chains are +dead to them. (pg_index.xmin will be the XID of the CREATE INDEX +transaction. The reason for using xmin rather than a normal column is +that the regular vacuum freezing mechanism will take care of converting +xmin to FrozenTransactionId before it can wrap around.) + +This means in particular that the transaction creating the index will be +unable to use the index if the transaction has old snapshots. We +alleviate that problem somewhat by not setting indcheckxmin unless the +table actually contains HOT chains with RECENTLY_DEAD members. + +Another unpleasant consequence is that it is now risky to use SnapshotAny +in an index scan: if the index was created more recently than the last +vacuum, it's possible that some of the visited tuples do not match the +index entry they are linked to. This does not seem to be a fatal +objection, since there are few users of SnapshotAny and most use seqscans. +The only exception at this writing is CLUSTER, which is okay because it +does not require perfect ordering of the indexscan readout (and especially +so because CLUSTER tends to write recently-dead tuples out of order anyway). + + +CREATE INDEX CONCURRENTLY +------------------------- + +In the concurrent case we must take a different approach. We create the +pg_index entry immediately, before we scan the table. The pg_index entry +is marked as "not ready for inserts". Then we commit and wait for any +transactions which have the table open to finish. This ensures that no +new HOT updates will change the key value for our new index, because all +transactions will see the existence of the index and will respect its +constraint on which updates can be HOT. Other transactions must include +such an index when determining HOT-safety of updates, even though they +must ignore it for both insertion and searching purposes. + +We must do this to avoid making incorrect index entries. For example, +suppose we are building an index on column X and we make an index entry for +a non-HOT tuple with X=1. Then some other backend, unaware that X is an +indexed column, HOT-updates the row to have X=2, and commits. We now have +an index entry for X=1 pointing at a HOT chain whose live row has X=2. +We could make an index entry with X=2 during the validation pass, but +there is no nice way to get rid of the wrong entry with X=1. So we must +have the HOT-safety property enforced before we start to build the new +index. + +After waiting for transactions which had the table open, we build the index +for all rows that are valid in a fresh snapshot. Any tuples visible in the +snapshot will have only valid forward-growing HOT chains. (They might have +older HOT updates behind them which are broken, but this is OK for the same +reason it's OK in a regular index build.) As above, we point the index +entry at the root of the HOT-update chain but we use the key value from the +live tuple. + +We mark the index open for inserts (but still not ready for reads) then +we again wait for transactions which have the table open. Then we take +a second reference snapshot and validate the index. This searches for +tuples missing from the index, and inserts any missing ones. Again, +the index entries have to have TIDs equal to HOT-chain root TIDs, but +the value to be inserted is the one from the live tuple. + +Then we wait until every transaction that could have a snapshot older than +the second reference snapshot is finished. This ensures that nobody is +alive any longer who could need to see any tuples that might be missing +from the index, as well as ensuring that no one can see any inconsistent +rows in a broken HOT chain (the first condition is stronger than the +second). Finally, we can mark the index valid for searches. + +Note that we do not need to set pg_index.indcheckxmin in this code path, +because we have outwaited any transactions that would need to avoid using +the index. (indcheckxmin is only needed because non-concurrent CREATE +INDEX doesn't want to wait; its stronger lock would create too much risk of +deadlock if it did.) + + +DROP INDEX CONCURRENTLY +----------------------- + +DROP INDEX CONCURRENTLY is sort of the reverse sequence of CREATE INDEX +CONCURRENTLY. We first mark the index as not indisvalid, and then wait for +any transactions that could be using it in queries to end. (During this +time, index updates must still be performed as normal, since such +transactions might expect freshly inserted tuples to be findable.) +Then, we clear indisready and indislive, and again wait for transactions +that could be updating the index to end. Finally we can drop the index +normally (though taking only ShareUpdateExclusiveLock on its parent table). + +The reason we need the pg_index.indislive flag is that after the second +wait step begins, we don't want transactions to be touching the index at +all; otherwise they might suffer errors if the DROP finally commits while +they are reading catalog entries for the index. If we had only indisvalid +and indisready, this state would be indistinguishable from the first stage +of CREATE INDEX CONCURRENTLY --- but in that state, we *do* want +transactions to examine the index, since they must consider it in +HOT-safety checks. + + +Limitations and Restrictions +---------------------------- + +It is worth noting that HOT forever forecloses alternative approaches +to vacuuming, specifically the recompute-the-index-keys approach alluded +to in Technical Challenges above. It'll be tough to recompute the index +keys for a root line pointer you don't have data for anymore ... + + +Glossary +-------- + +Broken HOT Chain + + A HOT chain in which the key value for an index has changed. + + This is not allowed to occur normally but if a new index is created + it can happen. In that case various strategies are used to ensure + that no transaction for which the older tuples are visible can + use the index. + +Cold update + + A normal, non-HOT update, in which index entries are made for + the new version of the tuple. + +Dead line pointer + + A stub line pointer, that does not point to anything, but cannot + be removed or reused yet because there are index pointers to it. + Semantically same as a dead tuple. It has state LP_DEAD. + +Heap-only tuple + + A heap tuple with no index pointers, which can only be reached + from indexes indirectly through its ancestral root tuple. + Marked with HEAP_ONLY_TUPLE flag. + +HOT-safe + + A proposed tuple update is said to be HOT-safe if it changes + none of the tuple's indexed columns. It will only become an + actual HOT update if we can find room on the same page for + the new tuple version. + +HOT update + + An UPDATE where the new tuple becomes a heap-only tuple, and no + new index entries are made. + +HOT-updated tuple + + An updated tuple, for which the next tuple in the chain is a + heap-only tuple. Marked with HEAP_HOT_UPDATED flag. + +Indexed column + + A column used in an index definition. The column might not + actually be stored in the index --- it could be used in a + functional index's expression, or used in a partial index + predicate. HOT treats all these cases alike. + +Redirecting line pointer + + A line pointer that points to another line pointer and has no + associated tuple. It has the special lp_flags state LP_REDIRECT, + and lp_off is the OffsetNumber of the line pointer it links to. + This is used when a root tuple becomes dead but we cannot prune + the line pointer because there are non-dead heap-only tuples + further down the chain. + +Root tuple + + The first tuple in a HOT update chain; the one that indexes point to. + +Update chain + + A chain of updated tuples, in which each tuple's ctid points to + the next tuple in the chain. A HOT update chain is an update chain + (or portion of an update chain) that consists of a root tuple and + one or more heap-only tuples. A complete update chain can contain + both HOT and non-HOT (cold) updated tuples. |