summaryrefslogtreecommitdiffstats
path: root/src/backend/access/nbtree/README
diff options
context:
space:
mode:
Diffstat (limited to 'src/backend/access/nbtree/README')
-rw-r--r--src/backend/access/nbtree/README909
1 files changed, 909 insertions, 0 deletions
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
new file mode 100644
index 0000000..216d419
--- /dev/null
+++ b/src/backend/access/nbtree/README
@@ -0,0 +1,909 @@
+src/backend/access/nbtree/README
+
+Btree Indexing
+==============
+
+This directory contains a correct implementation of Lehman and Yao's
+high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
+Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
+on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We also
+use a simplified version of the deletion logic described in Lanin and
+Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,
+Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).
+
+The basic Lehman & Yao Algorithm
+--------------------------------
+
+Compared to a classic B-tree, L&Y adds a right-link pointer to each page,
+to the page's right sibling. It also adds a "high key" to each page, which
+is an upper bound on the keys that are allowed on that page. These two
+additions make it possible detect a concurrent page split, which allows the
+tree to be searched without holding any read locks (except to keep a single
+page from being modified while reading it).
+
+When a search follows a downlink to a child page, it compares the page's
+high key with the search key. If the search key is greater than the high
+key, the page must've been split concurrently, and you must follow the
+right-link to find the new page containing the key range you're looking
+for. This might need to be repeated, if the page has been split more than
+once.
+
+Lehman and Yao talk about alternating "separator" keys and downlinks in
+internal pages rather than tuples or records. We use the term "pivot"
+tuple to refer to tuples which don't point to heap tuples, that are used
+only for tree navigation. All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples. Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago. A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute. Logical duplicates are sorted in heap TID
+order. This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which is assured by having reliably unique keys). Keys are always unique
+on their level, with the exception of a leaf page's high key, which can be
+fully equal to the last item on the page.
+
+The Postgres implementation of suffix truncation must make sure that the
+Lehman and Yao invariants hold, and represents that absent/truncated
+attributes in pivot tuples have the sentinel value "minus infinity". The
+later section on suffix truncation will be helpful if it's unclear how the
+Lehman & Yao invariants work with a real world example.
+
+Differences to the Lehman & Yao algorithm
+-----------------------------------------
+
+We have made the following changes in order to incorporate the L&Y algorithm
+into Postgres:
+
+Lehman and Yao don't require read locks, but assume that in-memory
+copies of tree pages are unshared. Postgres shares in-memory buffers
+among backends. As a result, we do page-level read locking on btree
+pages in order to guarantee that no record is modified while we are
+examining it. This reduces concurrency but guarantees correct
+behavior.
+
+We support the notion of an ordered "scan" of an index as well as
+insertions, deletions, and simple lookups. A scan in the forward
+direction is no problem, we just use the right-sibling pointers that
+L&Y require anyway. (Thus, once we have descended the tree to the
+correct start point for the scan, the scan looks only at leaf pages
+and never at higher tree levels.) To support scans in the backward
+direction, we also store a "left sibling" link much like the "right
+sibling". (This adds an extra step to the L&Y split algorithm: while
+holding the write lock on the page being split, we also lock its former
+right sibling to update that page's left-link. This is safe since no
+writer of that page can be interested in acquiring a write lock on our
+page.) A backwards scan has one additional bit of complexity: after
+following the left-link we must account for the possibility that the
+left sibling page got split before we could read it. So, we have to
+move right until we find a page whose right-link matches the page we
+came from. (Actually, it's even harder than that; see deletion discussion
+below.)
+
+Page read locks are held only for as long as a scan is examining a page.
+To minimize lock/unlock traffic, an index scan always searches a leaf page
+to identify all the matching items at once, copying their heap tuple IDs
+into backend-local storage. The heap tuple IDs are then processed while
+not holding any page lock within the index. We do continue to hold a pin
+on the leaf page in some circumstances, to protect against concurrent
+deletions (see below). In this state the scan is effectively stopped
+"between" pages, either before or after the page it has pinned. This is
+safe in the presence of concurrent insertions and even page splits, because
+items are never moved across pre-existing page boundaries --- so the scan
+cannot miss any items it should have seen, nor accidentally return the same
+item twice. The scan must remember the page's right-link at the time it
+was scanned, since that is the page to move right to; if we move right to
+the current right-link then we'd re-scan any items moved by a page split.
+We don't similarly remember the left-link, since it's best to use the most
+up-to-date left-link when trying to move left (see detailed move-left
+algorithm below).
+
+In most cases we release our lock and pin on a page before attempting
+to acquire pin and lock on the page we are moving to. In a few places
+it is necessary to lock the next page before releasing the current one.
+This is safe when moving right or up, but not when moving left or down
+(else we'd create the possibility of deadlocks).
+
+Lehman and Yao fail to discuss what must happen when the root page
+becomes full and must be split. Our implementation is to split the
+root in the same way that any other page would be split, then construct
+a new root page holding pointers to both of the resulting pages (which
+now become siblings on the next level of the tree). The new root page
+is then installed by altering the root pointer in the meta-data page (see
+below). This works because the root is not treated specially in any
+other way --- in particular, searches will move right using its link
+pointer if the link is set. Therefore, searches will find the data
+that's been moved into the right sibling even if they read the meta-data
+page before it got updated. This is the same reasoning that makes a
+split of a non-root page safe. The locking considerations are similar too.
+
+When an inserter recurses up the tree, splitting internal pages to insert
+links to pages inserted on the level below, it is possible that it will
+need to access a page above the level that was the root when it began its
+descent (or more accurately, the level that was the root when it read the
+meta-data page). In this case the stack it made while descending does not
+help for finding the correct page. When this happens, we find the correct
+place by re-descending the tree until we reach the level one above the
+level we need to insert a link to, and then moving right as necessary.
+(Typically this will take only two fetches, the meta-data page and the new
+root, but in principle there could have been more than one root split
+since we saw the root. We can identify the correct tree level by means of
+the level numbers stored in each page. The situation is rare enough that
+we do not need a more efficient solution.)
+
+Lehman and Yao must couple/chain locks as part of moving right when
+relocating a child page's downlink during an ascent of the tree. This is
+the only point where Lehman and Yao have to simultaneously hold three
+locks (a lock on the child, the original parent, and the original parent's
+right sibling). We don't need to couple internal page locks for pages on
+the same level, though. We match a child's block number to a downlink
+from a pivot tuple one level up, whereas Lehman and Yao match on the
+separator key associated with the downlink that was followed during the
+initial descent. We can release the lock on the original parent page
+before acquiring a lock on its right sibling, since there is never any
+need to deal with the case where the separator key that we must relocate
+becomes the original parent's high key. Lanin and Shasha don't couple
+locks here either, though they also don't couple locks between levels
+during ascents. They are willing to "wait and try again" to avoid races.
+Their algorithm is optimistic, which means that "an insertion holds no
+more than one write lock at a time during its ascent". We more or less
+stick with Lehman and Yao's approach of conservatively coupling parent and
+child locks when ascending the tree, since it's far simpler.
+
+Lehman and Yao assume fixed-size keys, but we must deal with
+variable-size keys. Therefore there is not a fixed maximum number of
+keys per page; we just stuff in as many as will fit. When we split a
+page, we try to equalize the number of bytes, not items, assigned to
+pages (though suffix truncation is also considered). Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
+
+The Deletion Algorithm
+----------------------
+
+Before deleting a leaf item, we get a super-exclusive lock on the target
+page, so that no other backend has a pin on the page when the deletion
+starts. This is not necessary for correctness in terms of the btree index
+operations themselves; as explained above, index scans logically stop
+"between" pages and so can't lose their place. The reason we do it is to
+provide an interlock between non-full VACUUM and indexscans. Since VACUUM
+deletes index entries before reclaiming heap tuple line pointers, the
+super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
+line pointer that an indexscanning process might be about to visit. This
+guarantee works only for simple indexscans that visit the heap in sync
+with the index scan, not for bitmap scans. We only need the guarantee
+when using non-MVCC snapshot rules; when using an MVCC snapshot, it
+doesn't matter if the heap tuple is replaced with an unrelated tuple at
+the same TID, because the new tuple won't be visible to our scan anyway.
+Therefore, a scan using an MVCC snapshot which has no other confounding
+factors will not hold the pin after the page contents are read. The
+current reasons for exceptions, where a pin is still needed, are if the
+index is not WAL-logged or if the scan is an index-only scan. If later
+work allows the pin to be dropped for all cases we will be able to
+simplify the vacuum code, since the concept of a super-exclusive lock
+for btree indexes will no longer be needed.
+
+Because a pin is not always held, and a page can be split even while
+someone does hold a pin on it, it is possible that an indexscan will
+return items that are no longer stored on the page it has a pin on, but
+rather somewhere to the right of that page. To ensure that VACUUM can't
+prematurely remove such heap tuples, we require btbulkdelete to obtain a
+super-exclusive lock on every leaf page in the index, even pages that
+don't contain any deletable tuples. Any scan which could yield incorrect
+results if the tuple at a TID matching the scan's range and filter
+conditions were replaced by a different tuple while the scan is in
+progress must hold the pin on each index page until all index entries read
+from the page have been processed. This guarantees that the btbulkdelete
+call cannot return while any indexscan is still holding a copy of a
+deleted index tuple if the scan could be confused by that. Note that this
+requirement does not say that btbulkdelete must visit the pages in any
+particular order. (See also on-the-fly deletion, below.)
+
+There is no such interlocking for deletion of items in internal pages,
+since backends keep no lock nor pin on a page they have descended past.
+Hence, when a backend is ascending the tree using its stack, it must
+be prepared for the possibility that the item it wants is to the left of
+the recorded position (but it can't have moved left out of the recorded
+page). Since we hold a lock on the lower page (per L&Y) until we have
+re-found the parent item that links to it, we can be assured that the
+parent item does still exist and can't have been deleted.
+
+Page Deletion
+-------------
+
+We consider deleting an entire page from the btree only when it's become
+completely empty of items. (Merging partly-full pages would allow better
+space reuse, but it seems impractical to move existing data items left or
+right to make this happen --- a scan moving in the opposite direction
+might miss the items if so.) Also, we *never* delete the rightmost page
+on a tree level (this restriction simplifies the traversal algorithms, as
+explained below). Page deletion always begins from an empty leaf page. An
+internal page can only be deleted as part of deleting an entire subtree.
+This is always a "skinny" subtree consisting of a "chain" of internal pages
+plus a single leaf page. There is one page on each level of the subtree,
+and each level/page covers the same key space.
+
+Deleting a leaf page is a two-stage process. In the first stage, the page
+is unlinked from its parent, and marked as half-dead. The parent page must
+be found using the same type of search as used to find the parent during an
+insertion split. We lock the target and the parent pages, change the
+target's downlink to point to the right sibling, and remove its old
+downlink. This causes the target page's key space to effectively belong to
+its right sibling. (Neither the left nor right sibling pages need to
+change their "high key" if any; so there is no problem with possibly not
+having enough space to replace a high key.) At the same time, we mark the
+target page as half-dead, which causes any subsequent searches to ignore it
+and move right (or left, in a backwards scan). This leaves the tree in a
+similar state as during a page split: the page has no downlink pointing to
+it, but it's still linked to its siblings.
+
+(Note: Lanin and Shasha prefer to make the key space move left, but their
+argument for doing so hinges on not having left-links, which we have
+anyway. So we simplify the algorithm by moving the key space right. This
+is only possible because we don't match on a separator key when ascending
+the tree during a page split, unlike Lehman and Yao/Lanin and Shasha -- it
+doesn't matter if the downlink is re-found in a pivot tuple whose separator
+key does not match the one encountered when inserter initially descended
+the tree.)
+
+To preserve consistency on the parent level, we cannot merge the key space
+of a page into its right sibling unless the right sibling is a child of
+the same parent --- otherwise, the parent's key space assignment changes
+too, meaning we'd have to make bounding-key updates in its parent, and
+perhaps all the way up the tree. Since we can't possibly do that
+atomically, we forbid this case. That means that the rightmost child of a
+parent node can't be deleted unless it's the only remaining child, in which
+case we will delete the parent too (see below).
+
+In the second-stage, the half-dead leaf page is unlinked from its siblings.
+We first lock the left sibling (if any) of the target, the target page
+itself, and its right sibling (there must be one) in that order. Then we
+update the side-links in the siblings, and mark the target page deleted.
+
+When we're about to delete the last remaining child of a parent page, things
+are slightly more complicated. In the first stage, we leave the immediate
+parent of the leaf page alone, and remove the downlink to the parent page
+instead, from the grandparent. If it's the last child of the grandparent
+too, we recurse up until we find a parent with more than one child, and
+remove the downlink of that page. The leaf page is marked as half-dead, and
+the block number of the page whose downlink was removed is stashed in the
+half-dead leaf page. This leaves us with a chain of internal pages, with
+one downlink each, leading to the half-dead leaf page, and no downlink
+pointing to the topmost page in the chain.
+
+While we recurse up to find the topmost parent in the chain, we keep the
+leaf page locked, but don't need to hold locks on the intermediate pages
+between the leaf and the topmost parent -- insertions into upper tree levels
+happen only as a result of splits of child pages, and that can't happen as
+long as we're keeping the leaf locked. The internal pages in the chain
+cannot acquire new children afterwards either, because the leaf page is
+marked as half-dead and won't be split.
+
+Removing the downlink to the top of the to-be-deleted subtree/chain
+effectively transfers the key space to the right sibling for all the
+intermediate levels too, in one atomic operation. A concurrent search might
+still visit the intermediate pages, but it will move right when it reaches
+the half-dead page at the leaf level. In particular, the search will move to
+the subtree to the right of the half-dead leaf page/to-be-deleted subtree,
+since the half-dead leaf page's right sibling must be a "cousin" page, not a
+"true" sibling page (or a second cousin page when the to-be-deleted chain
+starts at leaf page's grandparent page, and so on).
+
+In the second stage, the topmost page in the chain is unlinked from its
+siblings, and the half-dead leaf page is updated to point to the next page
+down in the chain. This is repeated until there are no internal pages left
+in the chain. Finally, the half-dead leaf page itself is unlinked from its
+siblings.
+
+A deleted page cannot be reclaimed immediately, since there may be other
+processes waiting to reference it (ie, search processes that just left the
+parent, or scans moving right or left from one of the siblings). These
+processes must observe that the page is marked dead and recover
+accordingly. Searches and forward scans simply follow the right-link
+until they find a non-dead page --- this will be where the deleted page's
+key-space moved to.
+
+Moving left in a backward scan is complicated because we must consider
+the possibility that the left sibling was just split (meaning we must find
+the rightmost page derived from the left sibling), plus the possibility
+that the page we were just on has now been deleted and hence isn't in the
+sibling chain at all anymore. So the move-left algorithm becomes:
+0. Remember the page we are on as the "original page".
+1. Follow the original page's left-link (we're done if this is zero).
+2. If the current page is live and its right-link matches the "original
+ page", we are done.
+3. Otherwise, move right one or more times looking for a live page whose
+ right-link matches the "original page". If found, we are done. (In
+ principle we could scan all the way to the right end of the index, but
+ in practice it seems better to give up after a small number of tries.
+ It's unlikely the original page's sibling split more than a few times
+ while we were in flight to it; if we do not find a matching link in a
+ few tries, then most likely the original page is deleted.)
+4. Return to the "original page". If it is still live, return to step 1
+ (we guessed wrong about it being deleted, and should restart with its
+ current left-link). If it is dead, move right until a non-dead page
+ is found (there must be one, since rightmost pages are never deleted),
+ mark that as the new "original page", and return to step 1.
+This algorithm is correct because the live page found by step 4 will have
+the same left keyspace boundary as the page we started from. Therefore,
+when we ultimately exit, it must be on a page whose right keyspace
+boundary matches the left boundary of where we started --- which is what
+we need to be sure we don't miss or re-scan any items.
+
+A deleted page can only be reclaimed once there is no scan or search that
+has a reference to it; until then, it must stay in place with its
+right-link undisturbed. We implement this by waiting until all active
+snapshots and registered snapshots as of the deletion are gone; which is
+overly strong, but is simple to implement within Postgres. When marked
+dead, a deleted page is labeled with the next-transaction counter value.
+VACUUM can reclaim the page for re-use when this transaction number is
+older than RecentGlobalXmin. As collateral damage, this implementation
+also waits for running XIDs with no snapshots and for snapshots taken
+until the next transaction to allocate an XID commits.
+
+Reclaiming a page doesn't actually change its state on disk --- we simply
+record it in the shared-memory free space map, from which it will be
+handed out the next time a new page is needed for a page split. The
+deleted page's contents will be overwritten by the split operation.
+(Note: if we find a deleted page with an extremely old transaction
+number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
+a later xid wraparound can't cause us to think the page is unreclaimable.
+But in more normal situations this would be a waste of a disk write.)
+
+Because we never delete the rightmost page of any level (and in particular
+never delete the root), it's impossible for the height of the tree to
+decrease. After massive deletions we might have a scenario in which the
+tree is "skinny", with several single-page levels below the root.
+Operations will still be correct in this case, but we'd waste cycles
+descending through the single-page levels. To handle this we use an idea
+from Lanin and Shasha: we keep track of the "fast root" level, which is
+the lowest single-page level. The meta-data page keeps a pointer to this
+level as well as the true root. All ordinary operations initiate their
+searches at the fast root not the true root. When we split a page that is
+alone on its level or delete the next-to-last page on a level (both cases
+are easily detected), we have to make sure that the fast root pointer is
+adjusted appropriately. In the split case, we do this work as part of the
+atomic update for the insertion into the parent level; in the delete case
+as part of the atomic update for the delete (either way, the metapage has
+to be the last page locked in the update to avoid deadlock risks). This
+avoids race conditions if two such operations are executing concurrently.
+
+VACUUM needs to do a linear scan of an index to search for deleted pages
+that can be reclaimed because they are older than all open transactions.
+For efficiency's sake, we'd like to use the same linear scan to search for
+deletable tuples. Before Postgres 8.2, btbulkdelete scanned the leaf pages
+in index order, but it is possible to visit them in physical order instead.
+The tricky part of this is to avoid missing any deletable tuples in the
+presence of concurrent page splits: a page split could easily move some
+tuples from a page not yet passed over by the sequential scan to a
+lower-numbered page already passed over. (This wasn't a concern for the
+index-order scan, because splits always split right.) To implement this,
+we provide a "vacuum cycle ID" mechanism that makes it possible to
+determine whether a page has been split since the current btbulkdelete
+cycle started. If btbulkdelete finds a page that has been split since
+it started, and has a right-link pointing to a lower page number, then
+it temporarily suspends its sequential scan and visits that page instead.
+It must continue to follow right-links and vacuum dead tuples until
+reaching a page that either hasn't been split since btbulkdelete started,
+or is above the location of the outer sequential scan. Then it can resume
+the sequential scan. This ensures that all tuples are visited. It may be
+that some tuples are visited twice, but that has no worse effect than an
+inaccurate index tuple count (and we can't guarantee an accurate count
+anyway in the face of concurrent activity). Note that this still works
+if the has-been-recently-split test has a small probability of false
+positives, so long as it never gives a false negative. This makes it
+possible to implement the test with a small counter value stored on each
+index page.
+
+Fastpath For Index Insertion
+----------------------------
+
+We optimize for a common case of insertion of increasing index key
+values by caching the last page to which this backend inserted the last
+value, if this page was the rightmost leaf page. For the next insert, we
+can then quickly check if the cached page is still the rightmost leaf
+page and also the correct place to hold the current value. We can avoid
+the cost of walking down the tree in such common cases.
+
+The optimization works on the assumption that there can only be one
+non-ignorable leaf rightmost page, and so even a RecentGlobalXmin style
+interlock isn't required. We cannot fail to detect that our hint was
+invalidated, because there can only be one such page in the B-Tree at
+any time. It's possible that the page will be deleted and recycled
+without a backend's cached page also being detected as invalidated, but
+only when we happen to recycle a block that once again gets recycled as the
+rightmost leaf page.
+
+On-the-Fly Deletion Of Index Tuples
+-----------------------------------
+
+If a process visits a heap tuple and finds that it's dead and removable
+(ie, dead to all open transactions, not only that process), then we can
+return to the index and mark the corresponding index entry "known dead",
+allowing subsequent index scans to skip visiting the heap tuple. The
+"known dead" marking works by setting the index item's lp_flags state
+to LP_DEAD. This is currently only done in plain indexscans, not bitmap
+scans, because only plain scans visit the heap and index "in sync" and so
+there's not a convenient way to do it for bitmap scans.
+
+Once an index tuple has been marked LP_DEAD it can actually be removed
+from the index immediately; since index scans only stop "between" pages,
+no scan can lose its place from such a deletion. We separate the steps
+because we allow LP_DEAD to be set with only a share lock (it's exactly
+like a hint bit for a heap tuple), but physically removing tuples requires
+exclusive lock. In the current code we try to remove LP_DEAD tuples when
+we are otherwise faced with having to split a page to do an insertion (and
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every table TID within the posting list is known dead.)
+
+This leaves the index in a state where it has no entry for a dead tuple
+that still exists in the heap. This is not a problem for the current
+implementation of VACUUM, but it could be a problem for anything that
+explicitly tries to find index entries for dead tuples. (However, the
+same situation is created by REINDEX, since it doesn't enter dead
+tuples into the index.)
+
+It's sufficient to have an exclusive lock on the index page, not a
+super-exclusive lock, to do deletion of LP_DEAD items. It might seem
+that this breaks the interlock between VACUUM and indexscans, but that is
+not so: as long as an indexscanning process has a pin on the page where
+the index item used to be, VACUUM cannot complete its btbulkdelete scan
+and so cannot remove the heap tuple. This is another reason why
+btbulkdelete has to get a super-exclusive lock on every leaf page, not
+only the ones where it actually sees items to delete. So that we can
+handle the cases where we attempt LP_DEAD flagging for a page after we
+have released its pin, we remember the LSN of the index page when we read
+the index tuples from it; we do not attempt to flag index tuples as dead
+if the we didn't hold the pin the entire time and the LSN has changed.
+
+WAL Considerations
+------------------
+
+The insertion and deletion algorithms in themselves don't guarantee btree
+consistency after a crash. To provide robustness, we depend on WAL
+replay. A single WAL entry is effectively an atomic action --- we can
+redo it from the log if it fails to complete.
+
+Ordinary item insertions (that don't force a page split) are of course
+single WAL entries, since they only affect one page. The same for
+leaf-item deletions (if the deletion brings the leaf page to zero items,
+it is now a candidate to be deleted, but that is a separate action).
+
+An insertion that causes a page split is logged as a single WAL entry for
+the changes occurring on the insertion's level --- including update of the
+right sibling's left-link --- followed by a second WAL entry for the
+insertion on the parent level (which might itself be a page split, requiring
+an additional insertion above that, etc).
+
+For a root split, the follow-on WAL entry is a "new root" entry rather than
+an "insertion" entry, but details are otherwise much the same.
+
+Because splitting involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks on-the-fly, when searching
+the tree for a new insertion. It could be done during searches, too, but
+it seems best not to put any extra updates in what would otherwise be a
+read-only operation (updating is not possible in hot standby mode anyway).
+It would seem natural to add the missing downlinks in VACUUM, but since
+inserting a downlink might require splitting a page, it might fail if you
+run out of disk space. That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space. In contrast,
+an insertion can require enlarging the physical file anyway. There is one
+minor exception: VACUUM finishes interrupted splits of internal pages when
+deleting their children. This allows the code for re-finding parent items
+to be used by both page splits and page deletion.
+
+To identify missing downlinks, when a page is split, the left page is
+flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is cleared atomically
+with the insertion. The child page is kept locked until the insertion in
+the parent is finished and the flag in the child cleared, but can be
+released immediately after that, before recursing up the tree if the parent
+also needs to be split. This ensures that incompletely split pages should
+not be seen under normal circumstances; only if insertion to the parent
+has failed for some reason. (It's also possible for a reader to observe
+a page with the incomplete split flag set during recovery; see later
+section on "Scans during Recovery" for details.)
+
+We flag the left page, even though it's the right page that's missing the
+downlink, because it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
+
+When splitting a non-root page that is alone on its level, the required
+metapage update (of the "fast root" link) is performed and logged as part
+of the insertion into the parent level. When splitting the root page, the
+metapage update is handled as part of the "new root" action.
+
+Each step in page deletion is logged as a separate WAL entry: marking the
+leaf as half-dead and removing the downlink is one record, and unlinking a
+page is a second record. If vacuum is interrupted for some reason, or the
+system crashes, the tree is consistent for searches and insertions. The
+next VACUUM will find the half-dead leaf page and continue the deletion.
+
+Before 9.4, we used to keep track of incomplete splits and page deletions
+during recovery and finish them immediately at end of recovery, instead of
+doing it lazily at the next insertion or vacuum. However, that made the
+recovery much more complicated, and only fixed the problem when crash
+recovery was performed. An incomplete split can also occur if an otherwise
+recoverable error, like out-of-memory or out-of-disk-space, happens while
+inserting the downlink to the parent.
+
+Scans during Recovery
+---------------------
+
+nbtree indexes support read queries in Hot Standby mode. Every atomic
+action/WAL record makes isolated changes that leave the tree in a
+consistent state for readers. Readers lock pages according to the same
+rules that readers follow on the primary. (Readers may have to move
+right to recover from a "concurrent" page split or page deletion, just
+like on the primary.)
+
+However, there are a couple of differences in how pages are locked by
+replay/the startup process as compared to the original write operation
+on the primary. The exceptions involve page splits and page deletions.
+The first phase and second phase of a page split are processed
+independently during replay, since they are independent atomic actions.
+We do not attempt to recreate the coupling of parent and child page
+write locks that took place on the primary. This is safe because readers
+never care about the incomplete split flag anyway. Holding on to an
+extra write lock on the primary is only necessary so that a second
+writer cannot observe the incomplete split flag before the first writer
+finishes the split. If we let concurrent writers on the primary observe
+an incomplete split flag on the same page, each writer would attempt to
+complete the unfinished split, corrupting the parent page. (Similarly,
+replay of page deletion records does not hold a write lock on the target
+leaf page throughout; only the primary needs to block out concurrent
+writers that insert on to the page being deleted.)
+
+During recovery all index scans start with ignore_killed_tuples = false
+and we never set kill_prior_tuple. We do this because the oldest xmin
+on the standby server can be older than the oldest xmin on the master
+server, which means tuples can be marked LP_DEAD even when they are
+still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
+they can still appear in the standby because of full page writes. So
+we must always ignore them in standby, and that means it's not worth
+setting them either. (When LP_DEAD-marked tuples are eventually deleted
+on the primary, the deletion is WAL-logged. Queries that run on a
+standby therefore get much of the benefit of any LP_DEAD setting that
+takes place on the primary.)
+
+Note that we talk about scans that are started during recovery. We go to
+a little trouble to allow a scan to start during recovery and end during
+normal running after recovery has completed. This is a key capability
+because it allows running applications to continue while the standby
+changes state into a normally running server.
+
+The interlocking required to avoid returning incorrect results from
+non-MVCC scans is not required on standby nodes. We still get a
+super-exclusive lock ("cleanup lock") when replaying VACUUM records
+during recovery, but recovery does not need to lock every leaf page
+(only those leaf pages that have items to delete). That is safe because
+HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
+HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only ever
+used during write transactions, which cannot exist on the standby. MVCC
+scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem. The optimizer looks at the boundaries of value ranges
+using HeapTupleSatisfiesNonVacuumable() with an index-only scan, which
+is also safe. That leaves concern only for HeapTupleSatisfiesToast().
+
+HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
+because it doesn't need to - if the main heap row is visible then the
+toast rows will also be visible. So as long as we follow a toast
+pointer from a visible (live) tuple the corresponding toast rows
+will also be visible, so we do not need to recheck MVCC on them.
+
+Other Things That Are Handy to Know
+-----------------------------------
+
+Page zero of every btree is a meta-data page. This page stores the
+location of the root page --- both the true root and the current effective
+root ("fast" root). To avoid fetching the metapage for every single index
+search, we cache a copy of the meta-data information in the index's
+relcache entry (rd_amcache). This is a bit ticklish since using the cache
+implies following a root page pointer that could be stale. However, a
+backend following a cached pointer can sufficiently verify whether it
+reached the intended page; either by checking the is-root flag when it
+is going to the true root, or by checking that the page has no siblings
+when going to the fast root. At worst, this could result in descending
+some extra tree levels if we have a cached pointer to a fast root that is
+now above the real fast root. Such cases shouldn't arise often enough to
+be worth optimizing; and in any case we can expect a relcache flush will
+discard the cached metapage before long, since a VACUUM that's moved the
+fast root pointer can be expected to issue a statistics update for the
+index.
+
+The algorithm assumes we can fit at least three items per page
+(a "high key" and two real data items). Therefore it's unsafe
+to accept items larger than 1/3rd page size. Larger items would
+work sometimes, but could cause failures later on depending on
+what else gets put on their page.
+
+"ScanKey" data structures are used in two fundamentally different ways
+in this code, which we describe as "search" scankeys and "insertion"
+scankeys. A search scankey is the kind passed to btbeginscan() or
+btrescan() from outside the btree code. The sk_func pointers in a search
+scankey point to comparison functions that return boolean, such as int4lt.
+There might be more than one scankey entry for a given index column, or
+none at all. (We require the keys to appear in index column order, but
+the order of multiple keys for a given column is unspecified.) An
+insertion scankey ("BTScanInsert" data structure) uses a similar
+array-of-ScanKey data structure, but the sk_func pointers point to btree
+comparison support functions (ie, 3-way comparators that return int4 values
+interpreted as <0, =0, >0). In an insertion scankey there is at most one
+entry per index column. There is also other data about the rules used to
+locate where to begin the scan, such as whether or not the scan is a
+"nextkey" scan. Insertion scankeys are built within the btree code (eg, by
+_bt_mkscankey()) and are used to locate the starting point of a scan, as
+well as for locating the place to insert a new index tuple. (Note: in the
+case of an insertion scankey built from a search scankey or built from a
+truncated pivot tuple, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
+
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split. The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page. Tuples logically retain truncated key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead. Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short. INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes. They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out. The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26). The Postgres implementation is loosely
+based on their paper. Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees. Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces). Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text. An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
+
+There is sophisticated criteria for choosing a leaf page split point. The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split. The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page). Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible. Evenly balancing space among each half of the
+split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective. Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page. The split point that implies a downlink be inserted in
+the parent that's the smallest one available within an acceptable range of
+the fillfactor-wise optimal split point is chosen. This idea also comes
+from the Prefix B-Tree paper. This process has much in common with what
+happens at the leaf level to make suffix truncation effective. The overall
+effect is that suffix truncation tends to produce smaller, more
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, smaller pivot tuples
+end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration. The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must. When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order. The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization. Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient. Deduplication can be performed infrequently,
+without merging together existing posting list tuples too often.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid (or at least delay) page splits. Note that the
+goals for deduplication in unique indexes are rather different; see later
+section for details. Deduplication alters the physical representation of
+tuples without changing the logical contents of the index, and without
+adding overhead to read queries. Non-pivot tuples are merged together
+into a single physical tuple with a posting list (a simple array of heap
+TIDs with the standard item pointer format). Deduplication is always
+applied lazily, at the point where it would otherwise be necessary to
+perform a page split. It occurs only when LP_DEAD items have been
+removed, as our last line of defense against splitting a leaf page. We
+can set the LP_DEAD bit with posting list tuples, though only when all
+TIDs are known dead.
+
+Our lazy approach to deduplication allows the page space accounting used
+during page splits to have absolutely minimal special case logic for
+posting lists. Posting lists can be thought of as extra payload that
+suffix truncation will reliably truncate away as needed during page
+splits, just like non-key columns from an INCLUDE index tuple.
+Incoming/new tuples can generally be treated as non-overlapping plain
+items (though see section on posting list splits for information about how
+overlapping new/incoming items are really handled).
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists. Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting (see later section), so it's not clear how
+compression could be integrated with nbtree. Besides, posting list
+compression does not offer a compelling trade-off for nbtree, since in
+general nbtree is optimized for consistent performance with many
+concurrent readers and writers.
+
+A major goal of our lazy approach to deduplication is to limit the
+performance impact of deduplication with random updates. Even concurrent
+append-only inserts of the same key value will tend to have inserts of
+individual index tuples in an order that doesn't quite match heap TID
+order. Delaying deduplication minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Very often, the number of distinct values that can ever be placed on
+almost any given leaf page in a unique index is fixed and permanent. For
+example, a primary key on an identity column will usually only have leaf
+page splits caused by the insertion of new logical rows within the
+rightmost leaf page. If there is a split of a non-rightmost leaf page,
+then the split must have been triggered by inserts associated with UPDATEs
+of existing logical rows. Splitting a leaf page purely to store multiple
+versions is a false economy. In effect, we're permanently degrading the
+index structure just to absorb a temporary burst of duplicates.
+
+Deduplication in unique indexes helps to prevent these pathological page
+splits. Storing duplicates in a space efficient manner is not the goal,
+since in the long run there won't be any duplicates anyway. Rather, we're
+buying time for standard garbage collection mechanisms to run before a
+page split is needed.
+
+Unique index leaf pages only get a deduplication pass when an insertion
+(that might have to split the page) observed an existing duplicate on the
+page in passing. This is based on the assumption that deduplication will
+only work out when _all_ new insertions are duplicates from UPDATEs. This
+may mean that we miss an opportunity to delay a page split, but that's
+okay because our ultimate goal is to delay leaf page splits _indefinitely_
+(i.e. to prevent them altogether). There is little point in trying to
+delay a split that is probably inevitable anyway. This allows us to avoid
+the overhead of attempting to deduplicate with unique indexes that always
+have few or no duplicates.
+
+Posting list splits
+-------------------
+
+When the incoming tuple happens to overlap with an existing posting list,
+a posting list split is performed. Like a page split, a posting list
+split resolves a situation where a new/incoming item "won't fit", while
+inserting the incoming item in passing (i.e. as part of the same atomic
+action). It's possible (though not particularly likely) that an insert of
+a new item on to an almost-full page will overlap with a posting list,
+resulting in both a posting list split and a page split. Even then, the
+atomic action that splits the posting list also inserts the new item
+(since page splits always insert the new item in passing). Including the
+posting list split in the same atomic action as the insert avoids problems
+caused by concurrent inserts into the same posting list -- the exact
+details of how we change the posting list depend upon the new item, and
+vice-versa. A single atomic action also minimizes the volume of extra
+WAL required for a posting list split, since we don't have to explicitly
+WAL-log the original posting list tuple.
+
+Despite piggy-backing on the same atomic action that inserts a new tuple,
+posting list splits can be thought of as a separate, extra action to the
+insert itself (or to the page split itself). Posting list splits
+conceptually "rewrite" an insert that overlaps with an existing posting
+list into an insert that adds its final new item just to the right of the
+posting list instead. The size of the posting list won't change, and so
+page space accounting code does not need to care about posting list splits
+at all. This is an important upside of our design; the page split point
+choice logic is very subtle even without it needing to deal with posting
+list splits.
+
+Only a few isolated extra steps are required to preserve the illusion that
+the new item never overlapped with an existing posting list in the first
+place: the heap TID of the incoming tuple has its TID replaced with the
+rightmost/max heap TID from the existing/originally overlapping posting
+list. Similarly, the original incoming item's TID is relocated to the
+appropriate offset in the posting list (we usually shift TIDs out of the
+way to make a hole for it). Finally, the posting-split-with-page-split
+case must generate a new high key based on an imaginary version of the
+original page that has both the final new item and the after-list-split
+posting tuple (page splits usually just operate against an imaginary
+version that contains the new item/item that won't fit).
+
+This approach avoids inventing an "eager" atomic posting split operation
+that splits the posting list without simultaneously finishing the insert
+of the incoming item. This alternative design might seem cleaner, but it
+creates subtle problems for page space accounting. In general, there
+might not be enough free space on the page to split a posting list such
+that the incoming/new item no longer overlaps with either posting list
+half --- the operation could fail before the actual retail insert of the
+new item even begins. We'd end up having to handle posting list splits
+that need a page split anyway. Besides, supporting variable "split points"
+while splitting posting lists won't actually improve overall space
+utilization.
+
+Notes About Data Representation
+-------------------------------
+
+The right-sibling link required by L&Y is kept in the page "opaque
+data" area, as is the left-sibling link, the page level, and some flags.
+The page level counts upwards from zero at the leaf level, to the tree
+depth minus 1 at the root. (Counting up from the leaves ensures that we
+don't need to renumber any existing pages when splitting the root.)
+
+The Postgres disk block data format (an array of items) doesn't fit
+Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
+so we have to play some games. (The alternating-keys-and-pointers
+notion is important for internal page splits, which conceptually split
+at the middle of an existing pivot tuple -- the tuple's "separator" key
+goes on the left side of the split as the left side's new high key,
+while the tuple's pointer/downlink goes on the right side as the
+first/minus infinity downlink.)
+
+On a page that is not rightmost in its tree level, the "high key" is
+kept in the page's first item, and real data items start at item 2.
+The link portion of the "high key" item goes unused. A page that is
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item. Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
+
+On a leaf page, the data items are simply links to (TIDs of) tuples
+in the relation being indexed, with the associated key values.
+
+On a non-leaf page, the data items are down-links to child pages with
+bounding keys. The key in each data item is a strict lower bound for
+keys on that child page, so logically the key is to the left of that
+downlink. The high key (if present) is the upper bound for the last
+downlink. The first data item on each such page has no lower bound
+--- or lower bound of minus infinity, if you prefer. The comparison
+routines must treat it accordingly. The actual key stored in the
+item is irrelevant, and need not be stored at all. This arrangement
+corresponds to the fact that an L&Y non-leaf page has one more pointer
+than key. Suffix truncation's negative infinity attributes behave in
+the same way.