diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
new file mode 100644
index 0000000..52e646c
--- /dev/null
+++ b/src/backend/access/nbtree/README
@@ -0,0 +1,1083 @@
+src/backend/access/nbtree/README
+
+Btree Indexing
+==============
+
+This directory contains a correct implementation of Lehman and Yao's
+high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
+Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
+on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We also
+use a simplified version of the deletion logic described in Lanin and
+Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,
+Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).
+
+The basic Lehman & Yao Algorithm
+--------------------------------
+
+Compared to a classic B-tree, L&Y adds a right-link pointer to each page,
+to the page's right sibling. It also adds a "high key" to each page, which
+is an upper bound on the keys that are allowed on that page. These two
+additions make it possible to detect a concurrent page split, which allows
+the tree to be searched without holding any read locks (except to keep a
+single page from being modified while reading it).
+
+When a search follows a downlink to a child page, it compares the page's
+high key with the search key. If the search key is greater than the high
+key, the page must've been split concurrently, and you must follow the
+right-link to find the new page containing the key range you're looking
+for. This might need to be repeated, if the page has been split more than
+once.
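+
+The move-right check can be sketched in C as below.  This is only an
+illustration with made-up types and helper functions (bt_page, read_lock,
+high_key, and so on); the real code lives in _bt_moveright().
+
+    #include <stdbool.h>
+
+    typedef struct bt_page bt_page;     /* hypothetical page handle */
+
+    extern void     read_lock(bt_page *p);
+    extern void     read_unlock(bt_page *p);
+    extern bool     is_rightmost(bt_page *p);
+    extern bt_page *right_sibling(bt_page *p);
+    extern void    *high_key(bt_page *p);
+    extern int      key_cmp(const void *scankey, const void *highkey);
+
+    /*
+     * After following a downlink to "child", move right for as long as the
+     * search key is greater than the page's high key; that can only happen
+     * if the page was split after we read the downlink.
+     */
+    static bt_page *
+    move_right_if_split(bt_page *child, const void *scankey)
+    {
+        read_lock(child);
+        while (!is_rightmost(child) &&
+               key_cmp(scankey, high_key(child)) > 0)
+        {
+            bt_page    *next = right_sibling(child);
+
+            read_unlock(child);     /* release before locking the next page */
+            child = next;
+            read_lock(child);
+        }
+        return child;               /* returned with a read lock held */
+    }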
+
+Lehman and Yao talk about alternating "separator" keys and downlinks in
+internal pages rather than tuples or records. We use the term "pivot"
+tuple to refer to tuples which don't point to heap tuples, that are used
+only for tree navigation. All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples. Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago. A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute. Logical duplicates are sorted in heap TID
+order. This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which is assured by having reliably unique keys). Keys are always unique
+on their level, with the exception of a leaf page's high key, which can be
+fully equal to the last item on the page.
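+
+To illustrate the tiebreaker, the full comparison can be sketched in C.
+The struct layout and the compare_user_keys() helper are made up for the
+example; they are not the real nbtree tuple format.
+
+    typedef struct
+    {
+        unsigned    block;          /* heap block number */
+        unsigned    offset;         /* heap item offset */
+    } heap_tid;
+
+    typedef struct
+    {
+        const void *keys;           /* user-visible key attributes */
+        heap_tid    htid;           /* heap TID, the tiebreaker */
+    } index_tuple;
+
+    extern int  compare_user_keys(const void *a, const void *b);
+
+    /* Logical duplicates are ordered by heap TID, so no two keys are equal. */
+    static int
+    full_key_cmp(const index_tuple *a, const index_tuple *b)
+    {
+        int         c = compare_user_keys(a->keys, b->keys);
+
+        if (c != 0)
+            return c;
+        if (a->htid.block != b->htid.block)
+            return (a->htid.block < b->htid.block) ? -1 : 1;
+        if (a->htid.offset != b->htid.offset)
+            return (a->htid.offset < b->htid.offset) ? -1 : 1;
+        return 0;
+    }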
+
+The Postgres implementation of suffix truncation must make sure that the
+Lehman and Yao invariants hold, and represents that absent/truncated
+attributes in pivot tuples have the sentinel value "minus infinity". The
+later section on suffix truncation will be helpful if it's unclear how the
+Lehman & Yao invariants work with a real world example.
+
+Differences to the Lehman & Yao algorithm
+-----------------------------------------
+
+We have made the following changes in order to incorporate the L&Y algorithm
+into Postgres:
+
+Lehman and Yao don't require read locks, but assume that in-memory
+copies of tree pages are unshared. Postgres shares in-memory buffers
+among backends. As a result, we do page-level read locking on btree
+pages in order to guarantee that no record is modified while we are
+examining it. This reduces concurrency but guarantees correct
+behavior.
+
+We support the notion of an ordered "scan" of an index as well as
+insertions, deletions, and simple lookups. A scan in the forward
+direction is no problem, we just use the right-sibling pointers that
+L&Y require anyway. (Thus, once we have descended the tree to the
+correct start point for the scan, the scan looks only at leaf pages
+and never at higher tree levels.) To support scans in the backward
+direction, we also store a "left sibling" link much like the "right
+sibling". (This adds an extra step to the L&Y split algorithm: while
+holding the write lock on the page being split, we also lock its former
+right sibling to update that page's left-link. This is safe since no
+writer of that page can be interested in acquiring a write lock on our
+page.) A backwards scan has one additional bit of complexity: after
+following the left-link we must account for the possibility that the
+left sibling page got split before we could read it. So, we have to
+move right until we find a page whose right-link matches the page we
+came from. (Actually, it's even harder than that; see page deletion
+discussion below.)
+
+Page read locks are held only for as long as a scan is examining a page.
+To minimize lock/unlock traffic, an index scan always searches a leaf page
+to identify all the matching items at once, copying their heap tuple IDs
+into backend-local storage. The heap tuple IDs are then processed while
+not holding any page lock within the index. We do continue to hold a pin
+on the leaf page in some circumstances, to protect against concurrent
+deletions (see below). In this state the scan is effectively stopped
+"between" pages, either before or after the page it has pinned. This is
+safe in the presence of concurrent insertions and even page splits, because
+items are never moved across pre-existing page boundaries --- so the scan
+cannot miss any items it should have seen, nor accidentally return the same
+item twice. The scan must remember the page's right-link at the time it
+was scanned, since that is the page to move right to; if we move right to
+the current right-link then we'd re-scan any items moved by a page split.
+We don't similarly remember the left-link, since it's best to use the most
+up-to-date left-link when trying to move left (see detailed move-left
+algorithm below).
+
+In most cases we release our lock and pin on a page before attempting
+to acquire pin and lock on the page we are moving to. In a few places
+it is necessary to lock the next page before releasing the current one.
+This is safe when moving right or up, but not when moving left or down
+(else we'd create the possibility of deadlocks).
+
+Lehman and Yao fail to discuss what must happen when the root page
+becomes full and must be split. Our implementation is to split the
+root in the same way that any other page would be split, then construct
+a new root page holding pointers to both of the resulting pages (which
+now become siblings on the next level of the tree). The new root page
+is then installed by altering the root pointer in the meta-data page (see
+below). This works because the root is not treated specially in any
+other way --- in particular, searches will move right using its link
+pointer if the link is set. Therefore, searches will find the data
+that's been moved into the right sibling even if they read the meta-data
+page before it got updated. This is the same reasoning that makes a
+split of a non-root page safe. The locking considerations are similar too.
+
+When an inserter recurses up the tree, splitting internal pages to insert
+links to pages inserted on the level below, it is possible that it will
+need to access a page above the level that was the root when it began its
+descent (or more accurately, the level that was the root when it read the
+meta-data page). In this case the stack it made while descending does not
+help for finding the correct page. When this happens, we find the correct
+place by re-descending the tree until we reach the level one above the
+level we need to insert a link to, and then moving right as necessary.
+(Typically this will take only two fetches, the meta-data page and the new
+root, but in principle there could have been more than one root split
+since we saw the root. We can identify the correct tree level by means of
+the level numbers stored in each page. The situation is rare enough that
+we do not need a more efficient solution.)
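+
+A sketch of this re-descent, relying on the level number stored in each
+page (the helper names are made up for illustration):
+
+    typedef struct bt_page bt_page;     /* hypothetical page handle */
+
+    extern bt_page *get_root_from_metapage(void);
+    extern int      page_level(bt_page *p);     /* leaf level is 0 */
+    extern bt_page *follow_downlink(bt_page *p, const void *scankey);
+
+    /*
+     * Re-descend from the (possibly new) root until reaching the level just
+     * above "child_level", which is where the new downlink must be
+     * inserted.  Any moving right on that level happens afterwards.
+     */
+    static bt_page *
+    redescend_to_parent_level(const void *scankey, int child_level)
+    {
+        bt_page    *page = get_root_from_metapage();
+
+        while (page_level(page) > child_level + 1)
+            page = follow_downlink(page, scankey);
+
+        return page;
+    }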
+
+Lehman and Yao must couple/chain locks as part of moving right when
+relocating a child page's downlink during an ascent of the tree. This is
+the only point where Lehman and Yao have to simultaneously hold three
+locks (a lock on the child, the original parent, and the original parent's
+right sibling). We don't need to couple internal page locks for pages on
+the same level, though. We match a child's block number to a downlink
+from a pivot tuple one level up, whereas Lehman and Yao match on the
+separator key associated with the downlink that was followed during the
+initial descent. We can release the lock on the original parent page
+before acquiring a lock on its right sibling, since there is never any
+need to deal with the case where the separator key that we must relocate
+becomes the original parent's high key. Lanin and Shasha don't couple
+locks here either, though they also don't couple locks between levels
+during ascents. They are willing to "wait and try again" to avoid races.
+Their algorithm is optimistic, which means that "an insertion holds no
+more than one write lock at a time during its ascent". We more or less
+stick with Lehman and Yao's approach of conservatively coupling parent and
+child locks when ascending the tree, since it's far simpler.
+
+Lehman and Yao assume fixed-size keys, but we must deal with
+variable-size keys. Therefore there is not a fixed maximum number of
+keys per page; we just stuff in as many as will fit. When we split a
+page, we try to equalize the number of bytes, not items, assigned to
+pages (though suffix truncation is also considered). Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
+
+Deleting index tuples during VACUUM
+-----------------------------------
+
+Before deleting a leaf item, we get a full cleanup lock on the target
+page, so that no other backend has a pin on the page when the deletion
+starts. This is not necessary for correctness in terms of the btree index
+operations themselves; as explained above, index scans logically stop
+"between" pages and so can't lose their place. The reason we do it is to
+provide an interlock between VACUUM and index scans that are not prepared
+to deal with concurrent TID recycling when visiting the heap. Since only
+VACUUM can ever mark pointed-to items LP_UNUSED in the heap, and since
+this only ever happens _after_ btbulkdelete returns, having index scans
+hold on to the pin (used when reading from the leaf page) until _after_
+they're done visiting the heap (for TIDs from pinned leaf page) prevents
+concurrent TID recycling. VACUUM cannot get a conflicting cleanup lock
+until the index scan is totally finished processing its leaf page.
+
+This approach is fairly coarse, so we avoid it whenever possible. In
+practice most index scans won't hold onto their pin, and so won't block
+VACUUM. These index scans must deal with TID recycling directly, which is
+more complicated and not always possible. See later section on making
+concurrent TID recycling safe.
+
+Opportunistic index tuple deletion performs almost the same page-level
+modifications while only holding an exclusive lock. This is safe because
+there is no question of TID recycling taking place later on -- only VACUUM
+can make TIDs recyclable. See also simple deletion and bottom-up
+deletion, below.
+
+Because a pin is not always held, and a page can be split even while
+someone does hold a pin on it, it is possible that an indexscan will
+return items that are no longer stored on the page it has a pin on, but
+rather somewhere to the right of that page. To ensure that VACUUM can't
+prematurely make TIDs recyclable in this scenario, we require btbulkdelete
+to obtain a cleanup lock on every leaf page in the index, even pages that
+don't contain any deletable tuples. Note that this requirement does not
+say that btbulkdelete must visit the pages in any particular order.
+
+VACUUM's linear scan, concurrent page splits
+--------------------------------------------
+
+VACUUM accesses the index by doing a linear scan to search for deletable
+TIDs, while considering the possibility of deleting empty pages in
+passing. This is in physical/block order, not logical/keyspace order.
+The tricky part of this is avoiding missing any deletable tuples in the
+presence of concurrent page splits: a page split could easily move some
+tuples from a page not yet passed over by the sequential scan to a
+lower-numbered page already passed over.
+
+To implement this, we provide a "vacuum cycle ID" mechanism that makes it
+possible to determine whether a page has been split since the current
+btbulkdelete cycle started. If btbulkdelete finds a page that has been
+split since it started, and has a right-link pointing to a lower page
+number, then it temporarily suspends its sequential scan and visits that
+page instead. It must continue to follow right-links and vacuum dead
+tuples until reaching a page that either hasn't been split since
+btbulkdelete started, or is above the location of the outer sequential
+scan. Then it can resume the sequential scan. This ensures that all
+tuples are visited. It may be that some tuples are visited twice, but
+that has no worse effect than an inaccurate index tuple count (and we
+can't guarantee an accurate count anyway in the face of concurrent
+activity). Note that this still works if the has-been-recently-split test
+has a small probability of false positives, so long as it never gives a
+false negative. This makes it possible to implement the test with a small
+counter value stored on each index page.
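+
+The has-been-recently-split test can be sketched as follows; the struct
+and parameter names are illustrative, not the real page opaque data:
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    typedef struct
+    {
+        uint32_t    right_blkno;    /* right sibling's block number */
+        uint16_t    cycleid;        /* vacuum cycle ID stamped at split time */
+    } leaf_page_opaque;
+
+    /*
+     * During btbulkdelete's linear scan: if this page was split during the
+     * current VACUUM cycle and its right sibling landed at a block number
+     * the scan has already passed, we must detour to that sibling now, or
+     * tuples moved there by the split could be missed.
+     */
+    static bool
+    must_follow_right_link(const leaf_page_opaque *page,
+                           uint16_t current_cycleid,
+                           uint32_t scan_position)
+    {
+        return page->cycleid == current_cycleid &&
+               page->right_blkno < scan_position;
+    }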
+
+Deleting entire pages during VACUUM
+-----------------------------------
+
+We consider deleting an entire page from the btree only when it's become
+completely empty of items. (Merging partly-full pages would allow better
+space reuse, but it seems impractical to move existing data items left or
+right to make this happen --- a scan moving in the opposite direction
+might miss the items if so.) Also, we *never* delete the rightmost page
+on a tree level (this restriction simplifies the traversal algorithms, as
+explained below). Page deletion always begins from an empty leaf page. An
+internal page can only be deleted as part of deleting an entire subtree.
+This is always a "skinny" subtree consisting of a "chain" of internal pages
+plus a single leaf page. There is one page on each level of the subtree,
+and each level/page covers the same key space.
+
+Deleting a leaf page is a two-stage process. In the first stage, the page
+is unlinked from its parent, and marked as half-dead. The parent page must
+be found using the same type of search as used to find the parent during an
+insertion split. We lock the target and the parent pages, change the
+target's downlink to point to the right sibling, and remove its old
+downlink. This causes the target page's key space to effectively belong to
+its right sibling. (Neither the left nor right sibling pages need to
+change their "high key" if any; so there is no problem with possibly not
+having enough space to replace a high key.) At the same time, we mark the
+target page as half-dead, which causes any subsequent searches to ignore it
+and move right (or left, in a backwards scan). This leaves the tree in a
+similar state as during a page split: the page has no downlink pointing to
+it, but it's still linked to its siblings.
+
+(Note: Lanin and Shasha prefer to make the key space move left, but their
+argument for doing so hinges on not having left-links, which we have
+anyway. So we simplify the algorithm by moving the key space right. This
+is only possible because we don't match on a separator key when ascending
+the tree during a page split, unlike Lehman and Yao/Lanin and Shasha -- it
+doesn't matter if the downlink is re-found in a pivot tuple whose separator
+key does not match the one encountered when inserter initially descended
+the tree.)
+
+To preserve consistency on the parent level, we cannot merge the key space
+of a page into its right sibling unless the right sibling is a child of
+the same parent --- otherwise, the parent's key space assignment changes
+too, meaning we'd have to make bounding-key updates in its parent, and
+perhaps all the way up the tree. Since we can't possibly do that
+atomically, we forbid this case. That means that the rightmost child of a
+parent node can't be deleted unless it's the only remaining child, in which
+case we will delete the parent too (see below).
+
+In the second stage, the half-dead leaf page is unlinked from its siblings.
+We first lock the left sibling (if any) of the target, the target page
+itself, and its right sibling (there must be one) in that order. Then we
+update the side-links in the siblings, and mark the target page deleted.
+
+When we're about to delete the last remaining child of a parent page, things
+are slightly more complicated. In the first stage, we leave the immediate
+parent of the leaf page alone, and remove the downlink to the parent page
+instead, from the grandparent. If it's the last child of the grandparent
+too, we recurse up until we find a parent with more than one child, and
+remove the downlink of that page. The leaf page is marked as half-dead, and
+the block number of the page whose downlink was removed is stashed in the
+half-dead leaf page. This leaves us with a chain of internal pages, with
+one downlink each, leading to the half-dead leaf page, and no downlink
+pointing to the topmost page in the chain.
+
+While we recurse up to find the topmost parent in the chain, we keep the
+leaf page locked, but don't need to hold locks on the intermediate pages
+between the leaf and the topmost parent -- insertions into upper tree levels
+happen only as a result of splits of child pages, and that can't happen as
+long as we're keeping the leaf locked. The internal pages in the chain
+cannot acquire new children afterwards either, because the leaf page is
+marked as half-dead and won't be split.
+
+Removing the downlink to the top of the to-be-deleted subtree/chain
+effectively transfers the key space to the right sibling for all the
+intermediate levels too, in one atomic operation. A concurrent search might
+still visit the intermediate pages, but it will move right when it reaches
+the half-dead page at the leaf level. In particular, the search will move to
+the subtree to the right of the half-dead leaf page/to-be-deleted subtree,
+since the half-dead leaf page's right sibling must be a "cousin" page, not a
+"true" sibling page (or a second cousin page when the to-be-deleted chain
+starts at leaf page's grandparent page, and so on).
+
+In the second stage, the topmost page in the chain is unlinked from its
+siblings, and the half-dead leaf page is updated to point to the next page
+down in the chain. This is repeated until there are no internal pages left
+in the chain. Finally, the half-dead leaf page itself is unlinked from its
+siblings.
+
+A deleted page cannot be recycled immediately, since there may be other
+processes waiting to reference it (ie, search processes that just left the
+parent, or scans moving right or left from one of the siblings). These
+processes must be able to observe a deleted page for some time after the
+deletion operation, in order to be able to at least recover from it (they
+recover by moving right, as with concurrent page splits). Searchers never
+have to worry about concurrent page recycling.
+
+See "Placing deleted pages in the FSM" section below for a description of
+when and how deleted pages become safe for VACUUM to make recyclable.
+
+Page deletion and backwards scans
+---------------------------------
+
+Moving left in a backward scan is complicated because we must consider
+the possibility that the left sibling was just split (meaning we must find
+the rightmost page derived from the left sibling), plus the possibility
+that the page we were just on has now been deleted and hence isn't in the
+sibling chain at all anymore. So the move-left algorithm becomes:
+
+0. Remember the page we are on as the "original page".
+1. Follow the original page's left-link (we're done if this is zero).
+2. If the current page is live and its right-link matches the "original
+ page", we are done.
+3. Otherwise, move right one or more times looking for a live page whose
+ right-link matches the "original page". If found, we are done. (In
+ principle we could scan all the way to the right end of the index, but
+ in practice it seems better to give up after a small number of tries.
+ It's unlikely the original page's sibling split more than a few times
+ while we were in flight to it; if we do not find a matching link in a
+ few tries, then most likely the original page is deleted.)
+4. Return to the "original page". If it is still live, return to step 1
+ (we guessed wrong about it being deleted, and should restart with its
+ current left-link). If it is dead, move right until a non-dead page
+ is found (there must be one, since rightmost pages are never deleted),
+ mark that as the new "original page", and return to step 1.
+
+This algorithm is correct because the live page found by step 4 will have
+the same left keyspace boundary as the page we started from. Therefore,
+when we ultimately exit, it must be on a page whose right keyspace
+boundary matches the left boundary of where we started --- which is what
+we need to be sure we don't miss or re-scan any items.
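+
+The move-left steps can be sketched in C as below.  The helpers and the
+retry limit are made up for illustration; the real logic lives in
+_bt_walk_left().
+
+    #include <stdbool.h>
+    #include <stddef.h>
+
+    typedef struct bt_page bt_page;     /* hypothetical page handle */
+
+    extern bt_page *get_page(unsigned blkno);   /* read-locks the page */
+    extern void     release(bt_page *p);
+    extern bool     is_dead(bt_page *p);        /* deleted or half-dead */
+    extern unsigned left_link(bt_page *p);
+    extern unsigned right_link(bt_page *p);
+    extern unsigned block_number(bt_page *p);
+
+    #define MAX_TRIES   4           /* give up after a few steps right */
+
+    static bt_page *
+    walk_left(bt_page *original)
+    {
+        for (;;)
+        {
+            unsigned    orig_blkno = block_number(original);
+            unsigned    left = left_link(original);
+            bt_page    *page;
+            int         tries;
+
+            if (left == 0)
+                return NULL;        /* step 1: leftmost page, we are done */
+            release(original);
+            page = get_page(left);
+
+            /* steps 2 and 3: look for a live page that points back at us */
+            for (tries = 0; tries < MAX_TRIES; tries++)
+            {
+                unsigned    next;
+
+                if (!is_dead(page) && right_link(page) == orig_blkno)
+                    return page;
+                next = right_link(page);
+                if (next == 0)
+                    break;          /* hit the rightmost page; stop trying */
+                release(page);
+                page = get_page(next);
+            }
+            release(page);
+
+            /* step 4: go back to the original page and reassess */
+            original = get_page(orig_blkno);
+            while (is_dead(original))
+            {
+                unsigned    next = right_link(original);
+
+                release(original);
+                original = get_page(next);
+            }
+            /* retry from step 1 with the (possibly new) original page */
+        }
+    }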
+
+Page deletion and tree height
+-----------------------------
+
+Because we never delete the rightmost page of any level (and in particular
+never delete the root), it's impossible for the height of the tree to
+decrease. After massive deletions we might have a scenario in which the
+tree is "skinny", with several single-page levels below the root.
+Operations will still be correct in this case, but we'd waste cycles
+descending through the single-page levels. To handle this we use an idea
+from Lanin and Shasha: we keep track of the "fast root" level, which is
+the lowest single-page level. The meta-data page keeps a pointer to this
+level as well as the true root. All ordinary operations initiate their
+searches at the fast root not the true root. When we split a page that is
+alone on its level or delete the next-to-last page on a level (both cases
+are easily detected), we have to make sure that the fast root pointer is
+adjusted appropriately. In the split case, we do this work as part of the
+atomic update for the insertion into the parent level; in the delete case
+as part of the atomic update for the delete (either way, the metapage has
+to be the last page locked in the update to avoid deadlock risks). This
+avoids race conditions if two such operations are executing concurrently.
+
+Placing deleted pages in the FSM
+--------------------------------
+
+Recycling a page is decoupled from page deletion. A deleted page can only
+be put in the FSM to be recycled once there is no possible scan or search
+that has a reference to it; until then, it must stay in place with its
+sibling links undisturbed, as a tombstone that allows concurrent searches
+to detect and then recover from concurrent deletions (which are rather
+like concurrent page splits to searchers). This design is an
+implementation of what Lanin and Shasha call "the drain technique".
+
+We implement the technique by waiting until all active snapshots and
+registered snapshots as of the page deletion are gone, which is overly
+strong but simple to implement within Postgres. When marked fully
+dead, a deleted page is labeled with the next-transaction counter value.
+VACUUM can reclaim the page for re-use when the stored XID is guaranteed
+to be "visible to everyone". As collateral damage, we wait for snapshots
+taken until the next transaction to allocate an XID commits. We also wait
+for running XIDs with no snapshots.
+
+Prior to PostgreSQL 14, VACUUM would only place _old_ deleted pages that
+it encounters during its linear scan (pages deleted by a previous VACUUM
+operation) in the FSM. Newly deleted pages were never placed in the FSM,
+because that was assumed to _always_ be unsafe. That assumption was
+unnecessarily pessimistic in practice, though -- it often doesn't take
+very long for newly deleted pages to become safe to place in the FSM.
+There is no truly principled way to predict when deleted pages will become
+safe to place in the FSM for recycling -- it might become safe almost
+immediately (long before the current VACUUM completes), or it might not
+even be safe by the time the next VACUUM takes place. Recycle safety is
+purely a question of maintaining the consistency (or at least the apparent
+consistency) of a physical data structure. The state within the backend
+running VACUUM is simply not relevant.
+
+PostgreSQL 14 added the ability for VACUUM to consider if it's possible to
+recycle newly deleted pages at the end of the full index scan where the
+page deletion took place. It is convenient to check if it's safe at that
+point. This does require that VACUUM keep around a little bookkeeping
+information about newly deleted pages, but that's very cheap. Using
+in-memory state for this avoids the need to revisit newly deleted pages a
+second time later on -- we can just use safexid values from the local
+bookkeeping state to determine recycle safety in a deferred fashion.
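+
+A sketch of the deferred check, assuming that VACUUM keeps an in-memory
+array of (block number, safexid) pairs for the pages it deleted; the
+helper names below are stand-ins, not the real functions:
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    typedef struct
+    {
+        uint32_t    blkno;          /* block number of the deleted page */
+        uint64_t    safexid;        /* next-XID counter value at deletion */
+    } deleted_page;
+
+    extern bool xid_visible_to_everyone(uint64_t fxid);
+    extern void record_free_page(uint32_t blkno);   /* "place in the FSM" */
+
+    /*
+     * At the end of the full index scan, place in the FSM those newly
+     * deleted pages whose safexid has already become visible to everyone.
+     * Pages that are not yet safe simply wait for some later VACUUM.
+     */
+    static void
+    recycle_newly_deleted(const deleted_page *pages, size_t npages)
+    {
+        size_t      i;
+
+        for (i = 0; i < npages; i++)
+        {
+            if (xid_visible_to_everyone(pages[i].safexid))
+                record_free_page(pages[i].blkno);
+        }
+    }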
+
+The need for additional FSM indirection after a page deletion operation
+takes place is a natural consequence of the highly permissive rules for
+index scans with Lehman and Yao's design. In general an index scan
+doesn't have to hold a lock or even a pin on any page when it descends the
+tree (nothing that you'd usually think of as an interlock is held "between
+levels"). At the same time, index scans cannot be allowed to land on a
+truly unrelated page due to concurrent recycling (not to be confused with
+concurrent deletion), because that results in wrong answers to queries.
+Simpler approaches to page deletion that don't need to defer recycling are
+possible, but none seem compatible with Lehman and Yao's design.
+
+Placing an already-deleted page in the FSM to be recycled when needed
+doesn't actually change the state of the page. The page will be changed
+whenever it is subsequently taken from the FSM for reuse. The deleted
+page's contents will be overwritten by the split operation (it will become
+the new right sibling page).
+
+Making concurrent TID recycling safe
+------------------------------------
+
+As explained in the earlier section about deleting index tuples during
+VACUUM, we implement a locking protocol that allows individual index scans
+to avoid concurrent TID recycling. Index scans opt-out (and so drop their
+leaf page pin when visiting the heap) whenever it's safe to do so, though.
+Dropping the pin early is useful because it avoids blocking progress by
+VACUUM. This is particularly important with index scans used by cursors,
+since idle cursors sometimes stop for relatively long periods of time. In
+extreme cases, a client application may hold on to an idle cursor for
+hours or even days. Blocking VACUUM for that long could be disastrous.
+
+Index scans that don't hold on to a buffer pin are protected by holding an
+MVCC snapshot instead. This more limited interlock prevents wrong answers
+to queries, but it does not prevent concurrent TID recycling itself (only
+holding onto the leaf page pin while accessing the heap ensures that).
+
+Index-only scans can never drop their buffer pin, since they are unable to
+tolerate having a referenced TID become recyclable. Index-only scans
+typically just visit the visibility map (not the heap proper), and so will
+not reliably notice that any stale TID reference (for a TID that pointed
+to a dead-to-all heap item at first) was concurrently marked LP_UNUSED in
+the heap by VACUUM. This could easily allow VACUUM to set the whole heap
+page to all-visible in the visibility map immediately afterwards. An MVCC
+snapshot is only sufficient to avoid problems during plain index scans
+because they must access granular visibility information from the heap
+proper. A plain index scan will even recognize LP_UNUSED items in the
+heap (items that could be recycled but haven't been just yet) as "not
+visible" -- even when the heap page is generally considered all-visible.
+
+LP_DEAD setting of index tuples by the kill_prior_tuple optimization
+(described in full in simple deletion, below) is also more complicated for
+index scans that drop their leaf page pins. We must be careful to avoid
+LP_DEAD-marking any new index tuple that looks like a known-dead index
+tuple because it happens to share the same TID, following concurrent TID
+recycling. It's just about possible that some other session inserted a
+new, unrelated index tuple, on the same leaf page, which has the same
+original TID. It would be totally wrong to LP_DEAD-set this new,
+unrelated index tuple.
+
+We handle this kill_prior_tuple race condition by having affected index
+scans conservatively assume that any change to the leaf page at all
+implies that it was reached by btbulkdelete in the interim period when no
+buffer pin was held. This is implemented by not setting any LP_DEAD bits
+on the leaf page at all when the page's LSN has changed. (That won't work
+with an unlogged index, so for now we don't ever apply the "don't hold
+onto pin" optimization there.)
+
+Fastpath For Index Insertion
+----------------------------
+
+We optimize for a common case of insertion of increasing index key
+values by caching the last page to which this backend inserted the last
+value, if this page was the rightmost leaf page. For the next insert, we
+can then quickly check if the cached page is still the rightmost leaf
+page and also the correct place to hold the current value. We can avoid
+the cost of walking down the tree in such common cases.
+
+The optimization works on the assumption that there can only be one
+non-ignorable leaf rightmost page, and so not even a visible-to-everyone
+style interlock is required. We cannot fail to detect that our hint was
+invalidated, because there can only be one such page in the B-Tree at
+any time. It's possible that the page will be deleted and recycled
+without a backend's cached page also being detected as invalidated, but
+only when we happen to recycle a block that once again gets recycled as the
+rightmost leaf page.
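+
+The fastpath check can be sketched as below.  The helpers are made up for
+illustration; the important point is that every condition is re-verified
+with the write lock held, so a stale hint only costs us a fall back to an
+ordinary descent from the root.
+
+    #include <stdbool.h>
+    #include <stddef.h>
+
+    typedef struct bt_page bt_page;     /* hypothetical page handle */
+
+    extern bt_page *write_lock_block(unsigned blkno);
+    extern void     release(bt_page *p);
+    extern bool     is_leaf(bt_page *p);
+    extern bool     is_rightmost(bt_page *p);
+    extern bool     has_room_for(bt_page *p, const void *newtup);
+    extern bool     greater_than_all_items(bt_page *p, const void *newtup);
+
+    static bt_page *
+    try_insert_fastpath(unsigned cached_blkno, const void *newtup)
+    {
+        bt_page    *page = write_lock_block(cached_blkno);
+
+        if (is_leaf(page) && is_rightmost(page) &&
+            has_room_for(page, newtup) &&
+            greater_than_all_items(page, newtup))
+            return page;            /* insert here, skipping the descent */
+
+        release(page);
+        return NULL;                /* fall back to a normal descent */
+    }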
+
+Simple deletion
+---------------
+
+If a process visits a heap tuple and finds that it's dead and removable
+(ie, dead to all open transactions, not only that process), then we can
+return to the index and mark the corresponding index entry "known dead",
+allowing subsequent index scans to skip visiting the heap tuple. The
+"known dead" marking works by setting the index item's lp_flags state
+to LP_DEAD. This is currently only done in plain indexscans, not bitmap
+scans, because only plain scans visit the heap and index "in sync" and so
+there's not a convenient way to do it for bitmap scans. Note also that
+LP_DEAD bits are often set when checking a unique index for conflicts on
+insert (this is simpler because it takes place when we hold an exclusive
+lock on the leaf page).
+
+Once an index tuple has been marked LP_DEAD it can actually be deleted
+from the index immediately; since index scans only stop "between" pages,
+no scan can lose its place from such a deletion. We separate the steps
+because we allow LP_DEAD to be set with only a share lock (it's like a
+hint bit for a heap tuple), but physically deleting tuples requires an
+exclusive lock. We also need to generate a snapshotConflictHorizon for
+each deletion operation's WAL record, which requires additional
+coordination with the tableam when the deletion actually takes place.
+(snapshotConflictHorizon value may be used to generate a conflict during
+subsequent REDO of the record by a standby.)
+
+Delaying and batching index tuple deletion like this enables a further
+optimization: opportunistic checking of "extra" nearby index tuples
+(tuples that are not LP_DEAD-set) when they happen to be very cheap to
+check in passing (because we already know that the tableam will be
+visiting their table block to generate a snapshotConflictHorizon). Any
+index tuples that turn out to be safe to delete will also be deleted.
+Simple deletion will behave as if the extra tuples that actually turn
+out to be delete-safe had their LP_DEAD bits set right from the start.
+
+Deduplication can also prevent a page split, but index tuple deletion is
+our preferred approach. Note that posting list tuples can only have
+their LP_DEAD bit set when every table TID within the posting list is
+known dead. This isn't much of a problem in practice because LP_DEAD
+bits are just a starting point for deletion. What really matters is
+that _some_ deletion operation that targets related nearby-in-table TIDs
+takes place at some point before the page finally splits. That's all
+that's required for the deletion process to perform granular removal of
+groups of dead TIDs from posting list tuples (without the situation ever
+being allowed to get out of hand).
+
+Bottom-Up deletion
+------------------
+
+We attempt to delete whatever duplicates happen to be present on the page
+when the duplicates are suspected to be caused by version churn from
+successive UPDATEs. This only happens when we receive an executor hint
+indicating that optimizations like heapam's HOT have not worked out for
+the index -- the incoming tuple must be a logically unchanged duplicate
+which is needed for MVCC purposes, suggesting that that might well be the
+dominant source of new index tuples on the leaf page in question. (Also,
+bottom-up deletion is triggered within unique indexes in cases with
+continual INSERT and DELETE related churn, since that is easy to detect
+without any external hint.)
+
+Simple deletion will already have failed to prevent a page split when a
+bottom-up deletion pass takes place (often because no LP_DEAD bits were
+ever set on the page). The two mechanisms have closely related
+implementations. The same WAL records are used for each operation, and
+the same tableam infrastructure is used to determine what TIDs/tuples are
+actually safe to delete. The implementations only differ in how they pick
+TIDs to consider for deletion, and whether or not the tableam will give up
+before accessing all table blocks (bottom-up deletion lives with the
+uncertainty of its success by keeping the cost of failure low). Even
+still, the two mechanisms are clearly distinct at the conceptual level.
+
+Bottom-up index deletion is driven entirely by heuristics (whereas simple
+deletion is guaranteed to delete at least those index tuples that are
+already LP_DEAD marked -- there must be at least one). We have no
+certainty that we'll find even one index tuple to delete. That's why we
+closely cooperate with the tableam to keep the costs it pays in balance
+with the benefits we receive. The interface that we use for this is
+described in detail in access/tableam.h.
+
+Bottom-up index deletion can be thought of as a backstop mechanism against
+unnecessary version-driven page splits. It is based in part on an idea
+from generational garbage collection: the "generational hypothesis". This
+is the empirical observation that "most objects die young". Within
+nbtree, new index tuples often quickly appear in the same place, and then
+quickly become garbage. There can be intense concentrations of garbage in
+relatively few leaf pages with certain workloads (or there could be in
+earlier versions of PostgreSQL without bottom-up index deletion, at
+least). See doc/src/sgml/btree.sgml for a high-level description of the
+design principles behind bottom-up index deletion in nbtree, including
+details of how it complements VACUUM.
+
+We expect to find a reasonably large number of tuples that are safe to
+delete within each bottom-up pass. If we don't then we won't need to
+consider the question of bottom-up deletion for the same leaf page for
+quite a while (usually because the page splits, which resolves the
+situation for the time being). We expect to perform regular bottom-up
+deletion operations against pages that are at constant risk of unnecessary
+page splits caused only by version churn. When the mechanism works well
+we'll constantly be "on the verge" of having version-churn-driven page
+splits, but never actually have even one.
+
+Our duplicate heuristics work well despite being fairly simple.
+Unnecessary page splits only occur when there are truly pathological
+levels of version churn (in theory a small amount of version churn could
+make a page split occur earlier than strictly necessary, but that's pretty
+harmless). We don't have to understand the underlying workload; we only
+have to understand the general nature of the pathology that we target.
+Version churn is easy to spot when it is truly pathological. Affected
+leaf pages are fairly homogeneous.
+
+WAL Considerations
+------------------
+
+The insertion and deletion algorithms in themselves don't guarantee btree
+consistency after a crash. To provide robustness, we depend on WAL
+replay. A single WAL entry is effectively an atomic action --- we can
+redo it from the log if it fails to complete.
+
+Ordinary item insertions (that don't force a page split) are of course
+single WAL entries, since they only affect one page. The same for
+leaf-item deletions (if the deletion brings the leaf page to zero items,
+it is now a candidate to be deleted, but that is a separate action).
+
+An insertion that causes a page split is logged as a single WAL entry for
+the changes occurring on the insertion's level --- including update of the
+right sibling's left-link --- followed by a second WAL entry for the
+insertion on the parent level (which might itself be a page split, requiring
+an additional insertion above that, etc).
+
+For a root split, the follow-on WAL entry is a "new root" entry rather than
+an "insertion" entry, but details are otherwise much the same.
+
+Because splitting involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks on-the-fly, when searching
+the tree for a new insertion. It could be done during searches, too, but
+it seems best not to put any extra updates in what would otherwise be a
+read-only operation (updating is not possible in hot standby mode anyway).
+It would seem natural to add the missing downlinks in VACUUM, but since
+inserting a downlink might require splitting a page, it might fail if you
+run out of disk space. That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space. In contrast,
+an insertion can require enlarging the physical file anyway. There is one
+minor exception: VACUUM finishes interrupted splits of internal pages when
+deleting their children. This allows the code for re-finding parent items
+to be used by both page splits and page deletion.
+
+To identify missing downlinks, when a page is split, the left page is
+flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is cleared atomically
+with the insertion. The child page is kept locked until the insertion in
+the parent is finished and the flag in the child cleared, but can be
+released immediately after that, before recursing up the tree if the parent
+also needs to be split. This ensures that incompletely split pages should
+not be seen under normal circumstances; only if insertion to the parent
+has failed for some reason. (It's also possible for a reader to observe
+a page with the incomplete split flag set during recovery; see later
+section on "Scans during Recovery" for details.)
+
+We flag the left page, even though it's the right page that's missing the
+downlink, because it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
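+
+A sketch of how a writer reacts to the flag (the helpers are made up; the
+real code path is _bt_finish_split()):
+
+    #include <stdbool.h>
+
+    typedef struct bt_page bt_page;     /* hypothetical page handle */
+
+    extern bool     incomplete_split(bt_page *p);  /* INCOMPLETE_SPLIT set? */
+    extern bt_page *right_sibling(bt_page *p);
+    extern void     insert_downlink_and_clear_flag(bt_page *left,
+                                                   bt_page *right);
+    extern bool     has_room_for(bt_page *p, const void *newtup);
+    extern void     insert_item(bt_page *p, const void *newtup);
+    extern void     split_page(bt_page *p, const void *newtup);
+
+    static void
+    insert_on_page(bt_page *page, const void *newtup)
+    {
+        /*
+         * If an earlier split of this page never got its downlink into the
+         * parent (a crash, or an out-of-space failure), finish that split
+         * first: insert the missing downlink and clear the flag.
+         */
+        if (incomplete_split(page))
+            insert_downlink_and_clear_flag(page, right_sibling(page));
+
+        if (has_room_for(page, newtup))
+            insert_item(page, newtup);
+        else
+            split_page(page, newtup);   /* sets INCOMPLETE_SPLIT on "page" */
+    }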
+
+When splitting a non-root page that is alone on its level, the required
+metapage update (of the "fast root" link) is performed and logged as part
+of the insertion into the parent level. When splitting the root page, the
+metapage update is handled as part of the "new root" action.
+
+Each step in page deletion is logged as a separate WAL entry: marking the
+leaf as half-dead and removing the downlink is one record, and unlinking a
+page is a second record. If vacuum is interrupted for some reason, or the
+system crashes, the tree is consistent for searches and insertions. The
+next VACUUM will find the half-dead leaf page and continue the deletion.
+
+Before 9.4, we used to keep track of incomplete splits and page deletions
+during recovery and finish them immediately at end of recovery, instead of
+doing it lazily at the next insertion or vacuum. However, that made the
+recovery much more complicated, and only fixed the problem when crash
+recovery was performed. An incomplete split can also occur if an otherwise
+recoverable error, like out-of-memory or out-of-disk-space, happens while
+inserting the downlink to the parent.
+
+Scans during Recovery
+---------------------
+
+nbtree indexes support read queries in Hot Standby mode. Every atomic
+action/WAL record makes isolated changes that leave the tree in a
+consistent state for readers. Readers lock pages according to the same
+rules that readers follow on the primary. (Readers may have to move
+right to recover from a "concurrent" page split or page deletion, just
+like on the primary.)
+
+However, there are a couple of differences in how pages are locked by
+replay/the startup process as compared to the original write operation
+on the primary. The exceptions involve page splits and page deletions.
+The first phase and second phase of a page split are processed
+independently during replay, since they are independent atomic actions.
+We do not attempt to recreate the coupling of parent and child page
+write locks that took place on the primary. This is safe because readers
+never care about the incomplete split flag anyway. Holding on to an
+extra write lock on the primary is only necessary so that a second
+writer cannot observe the incomplete split flag before the first writer
+finishes the split. If we let concurrent writers on the primary observe
+an incomplete split flag on the same page, each writer would attempt to
+complete the unfinished split, corrupting the parent page. (Similarly,
+replay of page deletion records does not hold a write lock on the target
+leaf page throughout; only the primary needs to block out concurrent
+writers that insert on to the page being deleted.)
+
+WAL replay holds same-level locks in a way that matches the approach
+taken during original execution, though. This prevents readers from
+observing same-level inconsistencies. It's probably possible to be more
+lax about how same-level locks are acquired during recovery (most kinds
+of readers could still move right to recover if we didn't couple
+same-level locks), but we prefer to be conservative here.
+
+During recovery all index scans start with ignore_killed_tuples = false
+and we never set kill_prior_tuple. We do this because the oldest xmin
+on the standby server can be older than the oldest xmin on the primary
+server, which means tuples can be marked LP_DEAD even when they are
+still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
+they can still appear in the standby because of full page writes. So
+we must always ignore them in standby, and that means it's not worth
+setting them either. (When LP_DEAD-marked tuples are eventually deleted
+on the primary, the deletion is WAL-logged. Queries that run on a
+standby therefore get much of the benefit of any LP_DEAD setting that
+takes place on the primary.)
+
+Note that we talk about scans that are started during recovery. We go to
+a little trouble to allow a scan to start during recovery and end during
+normal running after recovery has completed. This is a key capability
+because it allows running applications to continue while the standby
+changes state into a normally running server.
+
+The interlocking required to avoid returning incorrect results from
+non-MVCC scans is not required on standby nodes. We still get a full
+cleanup lock when replaying VACUUM records during recovery, but recovery
+does not need to lock every leaf page (only those leaf pages that have
+items to delete) -- that's sufficient to avoid breaking index-only scans
+during recovery (see section above about making TID recycling safe). That
+leaves concern only for plain index scans. (XXX: Not actually clear why
+this is totally unnecessary during recovery.)
+
+MVCC snapshot plain index scans are always safe, for the same reasons that
+they're safe during original execution. HeapTupleSatisfiesToast() doesn't
+use MVCC semantics, though that's because it doesn't need to - if the main
+heap row is visible then the toast rows will also be visible. So as long
+as we follow a toast pointer from a visible (live) tuple the corresponding
+toast rows will also be visible, so we do not need to recheck MVCC on
+them.
+
+Other Things That Are Handy to Know
+-----------------------------------
+
+Page zero of every btree is a meta-data page. This page stores the
+location of the root page --- both the true root and the current effective
+root ("fast" root). To avoid fetching the metapage for every single index
+search, we cache a copy of the meta-data information in the index's
+relcache entry (rd_amcache). This is a bit ticklish since using the cache
+implies following a root page pointer that could be stale. However, a
+backend following a cached pointer can sufficiently verify whether it
+reached the intended page; either by checking the is-root flag when it
+is going to the true root, or by checking that the page has no siblings
+when going to the fast root. At worst, this could result in descending
+some extra tree levels if we have a cached pointer to a fast root that is
+now above the real fast root. Such cases shouldn't arise often enough to
+be worth optimizing; and in any case we can expect a relcache flush will
+discard the cached metapage before long, since a VACUUM that's moved the
+fast root pointer can be expected to issue a statistics update for the
+index.
+
+The algorithm assumes we can fit at least three items per page
+(a "high key" and two real data items). Therefore it's unsafe
+to accept items larger than 1/3rd page size. Larger items would
+work sometimes, but could cause failures later on depending on
+what else gets put on their page.
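+
+For the default 8kB block size this works out to a limit in the vicinity
+of 2700 bytes per index item.  The arithmetic below is only approximate;
+the overhead constants are rough stand-ins, and the authoritative
+computation is the BTMaxItemSize() macro.
+
+    #include <stdio.h>
+
+    int
+    main(void)
+    {
+        const int   page_size = 8192;               /* default BLCKSZ */
+        const int   overhead = 24 + 16 + 3 * 4;     /* page header, special
+                                                     * space, three line
+                                                     * pointers (approximate) */
+        int         max_item = (page_size - overhead) / 3;
+
+        printf("max index item size ~ %d bytes\n", max_item);
+        return 0;
+    }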
+
+"ScanKey" data structures are used in two fundamentally different ways
+in this code, which we describe as "search" scankeys and "insertion"
+scankeys. A search scankey is the kind passed to btbeginscan() or
+btrescan() from outside the btree code. The sk_func pointers in a search
+scankey point to comparison functions that return boolean, such as int4lt.
+There might be more than one scankey entry for a given index column, or
+none at all. (We require the keys to appear in index column order, but
+the order of multiple keys for a given column is unspecified.) An
+insertion scankey ("BTScanInsert" data structure) uses a similar
+array-of-ScanKey data structure, but the sk_func pointers point to btree
+comparison support functions (ie, 3-way comparators that return int4 values
+interpreted as <0, =0, >0). In an insertion scankey there is at most one
+entry per index column. There is also other data about the rules used to
+locate where to begin the scan, such as whether or not the scan is a
+"nextkey" scan. Insertion scankeys are built within the btree code (eg, by
+_bt_mkscankey()) and are used to locate the starting point of a scan, as
+well as for locating the place to insert a new index tuple. (Note: in the
+case of an insertion scankey built from a search scankey or built from a
+truncated pivot tuple, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
+
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split. The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page. Tuples logically retain truncated key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead. Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short. INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes. They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out. The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26). The Postgres implementation is loosely
+based on their paper. Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees. Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces). Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text. An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
+
+There are sophisticated criteria for choosing a leaf page split point. The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split. The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page). Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible. Evenly balancing space among each half of the
+split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
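+
+As a made-up example of how the split point choice affects truncation,
+consider an index on (lastname, firstname).  If the chosen leaf split
+point falls between the tuples
+
+    ("Baker", "Tom")    <- last tuple on the post-split left page
+    ("Clark", "Ann")    <- first tuple on the post-split right page
+
+then the new high key only needs the first attribute: ("Clark", -inf).
+The truncated attribute (and the heap TID) behave as minus infinity, so
+the high key still falls strictly between the last left tuple and the
+first right tuple.  Had the split point fallen between ("Clark", "Ann")
+and ("Clark", "Bob"), both attributes would have been needed, and only
+the heap TID could have been truncated away.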
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective. Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page. The split point that implies a downlink be inserted in
+the parent that's the smallest one available within an acceptable range of
+the fillfactor-wise optimal split point is chosen. This idea also comes
+from the Prefix B-Tree paper. This process has much in common with what
+happens at the leaf level to make suffix truncation effective. The overall
+effect is that suffix truncation tends to produce smaller, more
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, smaller pivot tuples
+end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration. The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must. When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order. The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization. Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient. Deduplication can be performed infrequently,
+without merging together existing posting list tuples too often.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid (or at least delay) page splits. Note that the
+goals for deduplication in unique indexes are rather different; see later
+section for details. Deduplication alters the physical representation of
+tuples without changing the logical contents of the index, and without
+adding overhead to read queries. Non-pivot tuples are merged together
+into a single physical tuple with a posting list (a simple array of heap
+TIDs with the standard item pointer format). Deduplication is always
+applied lazily, at the point where it would otherwise be necessary to
+perform a page split. It occurs only when LP_DEAD items have been
+removed, as our last line of defense against splitting a leaf page
+(bottom-up index deletion may be attempted first, as our second last line
+of defense). We can set the LP_DEAD bit with posting list tuples, though
+only when all TIDs are known dead.
+
+Our lazy approach to deduplication allows the page space accounting used
+during page splits to have absolutely minimal special case logic for
+posting lists. Posting lists can be thought of as extra payload that
+suffix truncation will reliably truncate away as needed during page
+splits, just like non-key columns from an INCLUDE index tuple.
+Incoming/new tuples can generally be treated as non-overlapping plain
+items (though see section on posting list splits for information about how
+overlapping new/incoming items are really handled).
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists. Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting (see later section), so it's not clear how
+compression could be integrated with nbtree. Besides, posting list
+compression does not offer a compelling trade-off for nbtree, since in
+general nbtree is optimized for consistent performance with many
+concurrent readers and writers. Compression would also make the deletion
+of a subset of TIDs from a posting list slow and complicated, which would
+be a big problem for workloads that depend heavily on bottom-up index
+deletion.
+
+A major goal of our lazy approach to deduplication is to limit the
+performance impact of deduplication with random updates. Even concurrent
+append-only inserts of the same key value will tend to have inserts of
+individual index tuples in an order that doesn't quite match heap TID
+order. Delaying deduplication minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Very often, the number of distinct values that can ever be placed on
+almost any given leaf page in a unique index is fixed and permanent. For
+example, a primary key on an identity column will usually only have leaf
+page splits caused by the insertion of new logical rows within the
+rightmost leaf page. If there is a split of a non-rightmost leaf page,
+then the split must have been triggered by inserts associated with UPDATEs
+of existing logical rows. Splitting a leaf page purely to store multiple
+versions is a false economy. In effect, we're permanently degrading the
+index structure just to absorb a temporary burst of duplicates.
+
+Deduplication in unique indexes helps to prevent these pathological page
+splits. Storing duplicates in a space efficient manner is not the goal,
+since in the long run there won't be any duplicates anyway. Rather, we're
+buying time for standard garbage collection mechanisms to run before a
+page split is needed.
+
+Unique index leaf pages only get a deduplication pass when an insertion
+(that might have to split the page) observed an existing duplicate on the
+page in passing. This is based on the assumption that deduplication will
+only work out when _all_ new insertions are duplicates from UPDATEs. This
+may mean that we miss an opportunity to delay a page split, but that's
+okay because our ultimate goal is to delay leaf page splits _indefinitely_
+(i.e. to prevent them altogether). There is little point in trying to
+delay a split that is probably inevitable anyway. This allows us to avoid
+the overhead of attempting to deduplicate with unique indexes that always
+have few or no duplicates.
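+
+In outline, the trigger amounts to a flag set when the insertion's
+initial search of the leaf page happens to see a duplicate of the new
+key. A hypothetical sketch of the decision (the real state lives in the
+insertion code in nbtinsert.c):
+
+    #include <stdbool.h>
+
+    typedef struct ToyInsertionState
+    {
+        bool    is_unique_index;    /* is this a unique index? */
+        bool    sawduplicate;       /* duplicate of new key seen on the
+                                     * leaf page during the descent? */
+    } ToyInsertionState;
+
+    static bool
+    try_dedup_before_split(const ToyInsertionState *state)
+    {
+        if (!state->is_unique_index)
+            return true;        /* non-unique index: always worth a try */
+
+        /*
+         * Unique index: only worthwhile when the incoming tuple is itself
+         * a duplicate (almost certainly a new version from an UPDATE).
+         * Otherwise a page split is probably inevitable, so don't waste
+         * cycles trying to delay it.
+         */
+        return state->sawduplicate;
+    }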
+
+Note: Avoiding "unnecessary" page splits driven by version churn is also
+the goal of bottom-up index deletion, which was added to PostgreSQL 14.
+Bottom-up index deletion is now the preferred way to deal with this
+problem (with all kinds of indexes, though especially with unique
+indexes). Still, deduplication can sometimes augment bottom-up index
+deletion. When deletion cannot free tuples (due to an old snapshot
+holding up cleanup), falling back on deduplication provides additional
+capacity. Delaying the page split by deduplicating can allow a future
+bottom-up deletion pass of the same page to succeed.
+
+Posting list splits
+-------------------
+
+When the incoming tuple happens to overlap with an existing posting list,
+a posting list split is performed. Like a page split, a posting list
+split resolves a situation where a new/incoming item "won't fit", while
+inserting the incoming item in passing (i.e. as part of the same atomic
+action). It's possible (though not particularly likely) that an insert of
+a new item on to an almost-full page will overlap with a posting list,
+resulting in both a posting list split and a page split. Even then, the
+atomic action that splits the posting list also inserts the new item
+(since page splits always insert the new item in passing). Including the
+posting list split in the same atomic action as the insert avoids problems
+caused by concurrent inserts into the same posting list -- the exact
+details of how we change the posting list depend upon the new item, and
+vice-versa. A single atomic action also minimizes the volume of extra
+WAL required for a posting list split, since we don't have to explicitly
+WAL-log the original posting list tuple.
+
+Despite piggy-backing on the same atomic action that inserts a new tuple,
+posting list splits can be thought of as a separate action, in addition
+to the insert itself (or to the page split itself). Posting list splits
+conceptually "rewrite" an insert that overlaps with an existing posting
+list into an insert that adds its final new item just to the right of the
+posting list instead. The size of the posting list won't change, and so
+page space accounting code does not need to care about posting list splits
+at all. This is an important upside of our design; the page split point
+choice logic is very subtle even without it needing to deal with posting
+list splits.
+
+Only a few isolated extra steps are required to preserve the illusion that
+the new item never overlapped with an existing posting list in the first
+place: the incoming tuple has its heap TID replaced with the
+rightmost/max heap TID from the existing/originally overlapping posting
+list. Similarly, the original incoming item's heap TID is relocated to the
+appropriate offset in the posting list (we usually shift TIDs out of the
+way to make a hole for it). Finally, the posting-split-with-page-split
+case must generate a new high key based on an imaginary version of the
+original page that has both the final new item and the after-list-split
+posting tuple (page splits usually just operate against an imaginary
+version that contains the new item/item that won't fit).
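+
+A toy sketch of that TID shuffle follows. Heap TIDs are flattened to a
+single integer purely to keep the example short; the real work is done by
+the posting split code in nbtdedup.c and nbtinsert.c.
+
+    #include <stdint.h>
+
+    /*
+     * htids[] is the posting list's sorted TID array of length nhtids;
+     * newitem_tid falls within its range.  Rewrite things as if the
+     * overlap never happened: newitem_tid takes its sorted place inside
+     * the list, and the list's old maximum TID is returned, to be carried
+     * by the rewritten incoming item inserted just to the right of the
+     * posting list.
+     */
+    static uint64_t
+    toy_posting_split(uint64_t *htids, int nhtids, uint64_t newitem_tid)
+    {
+        uint64_t maxtid = htids[nhtids - 1];
+        int      i = nhtids - 1;
+
+        /* shift TIDs larger than newitem_tid right to open up a hole */
+        while (i > 0 && htids[i - 1] > newitem_tid)
+        {
+            htids[i] = htids[i - 1];
+            i--;
+        }
+        htids[i] = newitem_tid;
+
+        return maxtid;          /* TID for the rewritten incoming item */
+    }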
+
+This approach avoids inventing an "eager" atomic posting split operation
+that splits the posting list without simultaneously finishing the insert
+of the incoming item. This alternative design might seem cleaner, but it
+creates subtle problems for page space accounting. In general, there
+might not be enough free space on the page to split a posting list such
+that the incoming/new item no longer overlaps with either posting list
+half --- the operation could fail before the actual retail insert of the
+new item even begins. We'd end up having to handle posting list splits
+that need a page split anyway. Besides, supporting variable "split points"
+while splitting posting lists won't actually improve overall space
+utilization.
+
+Notes About Data Representation
+-------------------------------
+
+The right-sibling link required by L&Y is kept in the page "opaque
+data" area, as is the left-sibling link, the page level, and some flags.
+The page level counts upwards from zero at the leaf level, to the tree
+depth minus 1 at the root. (Counting up from the leaves ensures that we
+don't need to renumber any existing pages when splitting the root.)
+
+The Postgres disk block data format (an array of items) doesn't fit
+Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
+so we have to play some games. (The alternating-keys-and-pointers
+notion is important for internal page splits, which conceptually split
+at the middle of an existing pivot tuple -- the tuple's "separator" key
+goes on the left side of the split as the left side's new high key,
+while the tuple's pointer/downlink goes on the right side as the
+first/minus infinity downlink.)
+
+On a page that is not rightmost in its tree level, the "high key" is
+kept in the page's first item, and real data items start at item 2.
+The link portion of the "high key" item goes unused. A page that is
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item. Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
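+
+In code terms this comes down to where the first data item starts on a
+page. The standalone sketch below imitates what the P_FIRSTDATAKEY()
+macro in nbtree.h expresses; the struct and flag are simplified
+stand-ins:
+
+    #include <stdbool.h>
+
+    typedef struct ToyPageOpaque
+    {
+        bool    rightmost;      /* no right sibling on this level? */
+    } ToyPageOpaque;
+
+    #define TOY_HIKEY    1      /* offset of the high key item */
+    #define TOY_FIRSTKEY 2      /* first data item on non-rightmost pages */
+
+    static int
+    toy_first_data_offset(const ToyPageOpaque *opaque)
+    {
+        /*
+         * Rightmost pages have no high key (it is implicitly positive
+         * infinity), so their data items start at offset 1.  All other
+         * pages store the high key as item 1, pushing data items to
+         * offset 2 and beyond.
+         */
+        return opaque->rightmost ? TOY_HIKEY : TOY_FIRSTKEY;
+    }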
+
+On a leaf page, the data items are simply links to (TIDs of) tuples
+in the relation being indexed, with the associated key values.
+
+On a non-leaf page, the data items are down-links to child pages with
+bounding keys. The key in each data item is a strict lower bound for
+keys on that child page, so logically the key is to the left of that
+downlink. The high key (if present) is the upper bound for the last
+downlink. The first data item on each such page has no lower bound
+--- or lower bound of minus infinity, if you prefer. The comparison
+routines must treat it accordingly. The actual key stored in the
+item is irrelevant, and need not be stored at all. This arrangement
+corresponds to the fact that an L&Y non-leaf page has one more pointer
+than key. Suffix truncation's negative infinity attributes behave in
+the same way.
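+
+For search this means a special case in the comparison path: on an
+internal page, the first data item compares as less than any possible
+scan key, without its stored key (if any) ever being examined. A sketch
+of that special case, with hypothetical helper names (the real check is
+part of _bt_compare() in nbtsearch.c):
+
+    #include <stdbool.h>
+
+    /* compare the scan key against the key stored at "offset" */
+    typedef int (*toy_key_cmp_fn) (const void *scankey, int offset);
+
+    static int
+    toy_compare(const void *scankey, int offset, bool is_leaf,
+                bool is_rightmost, toy_key_cmp_fn real_cmp)
+    {
+        int     first_data = is_rightmost ? 1 : 2;  /* see sketch above */
+
+        /*
+         * The first data item on an internal page has an implicit "minus
+         * infinity" key: every possible scan key compares as greater than
+         * it, no matter what bytes happen to be stored in the item.
+         */
+        if (!is_leaf && offset == first_data)
+            return 1;           /* scan key > item, always */
+
+        return real_cmp(scankey, offset);
+    }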