summaryrefslogtreecommitdiffstats
path: root/src/backend/access/spgist/README
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 12:15:05 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 12:15:05 +0000
commit46651ce6fe013220ed397add242004d764fc0153 (patch)
tree6e5299f990f88e60174a1d3ae6e48eedd2688b2b /src/backend/access/spgist/README
parentInitial commit. (diff)
downloadpostgresql-14-upstream.tar.xz
postgresql-14-upstream.zip
Adding upstream version 14.5.upstream/14.5upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r--src/backend/access/spgist/README389
1 files changed, 389 insertions, 0 deletions
diff --git a/src/backend/access/spgist/README b/src/backend/access/spgist/README
new file mode 100644
index 0000000..7117e02
--- /dev/null
+++ b/src/backend/access/spgist/README
@@ -0,0 +1,389 @@
+src/backend/access/spgist/README
+
+SP-GiST is an abbreviation of space-partitioned GiST. It provides a
+generalized infrastructure for implementing space-partitioned data
+structures, such as quadtrees, k-d trees, and radix trees (tries). When
+implemented in main memory, these structures are usually designed as a set of
+dynamically-allocated nodes linked by pointers. This is not suitable for
+direct storing on disk, since the chains of pointers can be rather long and
+require too many disk accesses. In contrast, disk based data structures
+should have a high fanout to minimize I/O. The challenge is to map tree
+nodes to disk pages in such a way that the search algorithm accesses only a
+few disk pages, even if it traverses many nodes.
+
+
+COMMON STRUCTURE DESCRIPTION
+
+Logically, an SP-GiST tree is a set of tuples, each of which can be either
+an inner or leaf tuple. Each inner tuple contains "nodes", which are
+(label,pointer) pairs, where the pointer (ItemPointerData) is a pointer to
+another inner tuple or to the head of a list of leaf tuples. Inner tuples
+can have different numbers of nodes (children). Branches can be of different
+depth (actually, there is no control or code to support balancing), which
+means that the tree is non-balanced. However, leaf and inner tuples cannot
+be intermixed at the same level: a downlink from a node of an inner tuple
+leads either to one inner tuple, or to a list of leaf tuples.
+
+The SP-GiST core requires that inner and leaf tuples fit on a single index
+page, and even more stringently that the list of leaf tuples reached from a
+single inner-tuple node all be stored on the same index page. (Restricting
+such lists to not cross pages reduces seeks, and allows the list links to be
+stored as simple 2-byte OffsetNumbers.) SP-GiST index opclasses should
+therefore ensure that not too many nodes can be needed in one inner tuple,
+and that inner-tuple prefixes and leaf-node datum values not be too large.
+
+Inner and leaf tuples are stored separately: the former are stored only on
+"inner" pages, the latter only on "leaf" pages. Also, there are special
+restrictions on the root page. Early in an index's life, when there is only
+one page's worth of data, the root page contains an unorganized set of leaf
+tuples. After the first page split has occurred, the root is required to
+contain exactly one inner tuple.
+
+When the search traversal algorithm reaches an inner tuple, it chooses a set
+of nodes to continue tree traverse in depth. If it reaches a leaf page it
+scans a list of leaf tuples to find the ones that match the query. SP-GiST
+also supports ordered (nearest-neighbor) searches - that is during scan pending
+nodes are put into priority queue, so traversal is performed by the
+closest-first model.
+
+
+The insertion algorithm descends the tree similarly, except it must choose
+just one node to descend to from each inner tuple. Insertion might also have
+to modify the inner tuple before it can descend: it could add a new node, or
+it could "split" the tuple to obtain a less-specific prefix that can match
+the value to be inserted. If it's necessary to append a new leaf tuple to a
+list and there is no free space on page, then SP-GiST creates a new inner
+tuple and distributes leaf tuples into a set of lists on, perhaps, several
+pages.
+
+An inner tuple consists of:
+
+ optional prefix value - all successors must be consistent with it.
+ Example:
+ radix tree - prefix value is a common prefix string
+ quad tree - centroid
+ k-d tree - one coordinate
+
+ list of nodes, where node is a (label, pointer) pair.
+ Example of a label: a single character for radix tree
+
+A leaf tuple consists of:
+
+ a leaf value
+ Example:
+ radix tree - the rest of string (postfix)
+ quad and k-d tree - the point itself
+
+ ItemPointer to the corresponding heap tuple
+ nextOffset number of next leaf tuple in a chain on a leaf page
+
+ optional nulls bitmask
+ optional INCLUDE-column values
+
+For compatibility with pre-v14 indexes, a leaf tuple has a nulls bitmask
+only if there are null values (among the leaf value and the INCLUDE values)
+*and* there is at least one INCLUDE column. The null-ness of the leaf
+value can be inferred from whether the tuple is on a "nulls page" (see below)
+so it is not necessary to represent it explicitly. But we include it anyway
+in a bitmask used with INCLUDE values, so that standard tuple deconstruction
+code can be used.
+
+
+NULLS HANDLING
+
+We assume that SPGiST-indexable operators are strict (can never succeed for
+null inputs). It is still desirable to index nulls, so that whole-table
+indexscans are possible and so that "x IS NULL" can be implemented by an
+SPGiST indexscan. However, we prefer that SPGiST index opclasses not have
+to cope with nulls. Therefore, the main tree of an SPGiST index does not
+include any null entries. We store null entries in a separate SPGiST tree
+occupying a disjoint set of pages (in particular, its own root page).
+Insertions and searches in the nulls tree do not use any of the
+opclass-supplied functions, but just use hardwired logic comparable to
+AllTheSame cases in the normal tree.
+
+
+INSERTION ALGORITHM
+
+Insertion algorithm is designed to keep the tree in a consistent state at
+any moment. Here is a simplified insertion algorithm specification
+(numbers refer to notes below):
+
+ Start with the first tuple on the root page (1)
+
+ loop:
+ if (page is leaf) then
+ if (enough space)
+ insert on page and exit (5)
+ else (7)
+ call PickSplitFn() (2)
+ end if
+ else
+ switch (chooseFn())
+ case MatchNode - descend through selected node
+ case AddNode - add node and then retry chooseFn (3, 6)
+ case SplitTuple - split inner tuple to prefix and postfix, then
+ retry chooseFn with the prefix tuple (4, 6)
+ end if
+
+Notes:
+
+(1) Initially, we just dump leaf tuples into the root page until it is full;
+then we split it. Once the root is not a leaf page, it can have only one
+inner tuple, so as to keep the amount of free space on the root as large as
+possible. Both of these rules are meant to postpone doing PickSplit on the
+root for as long as possible, so that the topmost partitioning of the search
+space is as good as we can easily make it.
+
+(2) Current implementation allows to do picksplit and insert a new leaf tuple
+in one operation, if the new list of leaf tuples fits on one page. It's
+always possible for trees with small nodes like quad tree or k-d tree, but
+radix trees may require another picksplit.
+
+(3) Addition of node must keep size of inner tuple small enough to fit on a
+page. After addition, inner tuple could become too large to be stored on
+current page because of other tuples on page. In this case it will be moved
+to another inner page (see notes about page management). When moving tuple to
+another page, we can't change the numbers of other tuples on the page, else
+we'd make downlink pointers to them invalid. To prevent that, SP-GiST leaves
+a "placeholder" tuple, which can be reused later whenever another tuple is
+added to the page. See also Concurrency and Vacuum sections below. Right now
+only radix trees could add a node to the tuple; quad trees and k-d trees
+make all possible nodes at once in PickSplitFn() call.
+
+(4) Prefix value could only partially match a new value, so the SplitTuple
+action allows breaking the current tree branch into upper and lower sections.
+Another way to say it is that we can split the current inner tuple into
+"prefix" and "postfix" parts, where the prefix part is able to match the
+incoming new value. Consider example of insertion into a radix tree. We use
+the following notation, where tuple's id is just for discussion (no such id
+is actually stored):
+
+inner tuple: {tuple id}(prefix string)[ comma separated list of node labels ]
+leaf tuple: {tuple id}<value>
+
+Suppose we need to insert string 'www.gogo.com' into inner tuple
+
+ {1}(www.google.com/)[a, i]
+
+The string does not match the prefix so we cannot descend. We must
+split the inner tuple into two tuples:
+
+ {2}(www.go)[o] - prefix tuple
+ |
+ {3}(gle.com/)[a,i] - postfix tuple
+
+On the next iteration of loop we find that 'www.gogo.com' matches the
+prefix, but not any node label, so we add a node [g] to tuple {2}:
+
+ NIL (no child exists yet)
+ |
+ {2}(www.go)[o, g]
+ |
+ {3}(gle.com/)[a,i]
+
+Now we can descend through the [g] node, which will cause us to update
+the target string to just 'o.com'. Finally, we'll insert a leaf tuple
+bearing that string:
+
+ {4}<o.com>
+ |
+ {2}(www.go)[o, g]
+ |
+ {3}(gle.com/)[a,i]
+
+As we can see, the original tuple's node array moves to postfix tuple without
+any changes. Note also that SP-GiST core assumes that prefix tuple is not
+larger than old inner tuple. That allows us to store prefix tuple directly
+in place of old inner tuple. SP-GiST core will try to store postfix tuple on
+the same page if possible, but will use another page if there is not enough
+free space (see notes 5 and 6). Currently, quad and k-d trees don't use this
+feature, because they have no concept of a prefix being "inconsistent" with
+any new value. They grow their depth only by PickSplitFn() call.
+
+(5) If pointer from node of parent is a NIL pointer, algorithm chooses a leaf
+page to store on. At first, it tries to use the last-used leaf page with the
+largest free space (which we track in each backend) to better utilize disk
+space. If that's not large enough, then the algorithm allocates a new page.
+
+(6) Management of inner pages is very similar to management of leaf pages,
+described in (5).
+
+(7) Actually, current implementation can move the whole list of leaf tuples
+and a new tuple to another page, if the list is short enough. This improves
+space utilization, but doesn't change the basis of the algorithm.
+
+
+CONCURRENCY
+
+While descending the tree, the insertion algorithm holds exclusive lock on
+two tree levels at a time, ie both parent and child pages (but parent and
+child pages can be the same, see notes above). There is a possibility of
+deadlock between two insertions if there are cross-referenced pages in
+different branches. That is, if inner tuple on page M has a child on page N
+while an inner tuple from another branch is on page N and has a child on
+page M, then two insertions descending the two branches could deadlock,
+since they will each hold their parent page's lock while trying to get the
+child page's lock.
+
+Currently, we deal with this by conditionally locking buffers as we descend
+the tree. If we fail to get lock on a buffer, we release both buffers and
+restart the insertion process. This is potentially inefficient, but the
+locking costs of a more deterministic approach seem very high.
+
+To reduce the number of cases where that happens, we introduce a concept of
+"triple parity" of pages: if inner tuple is on page with BlockNumber N, then
+its child tuples should be placed on the same page, or else on a page with
+BlockNumber M where (N+1) mod 3 == M mod 3. This rule ensures that tuples
+on page M will have no children on page N, since (M+1) mod 3 != N mod 3.
+That makes it unlikely that two insertion processes will conflict against
+each other while descending the tree. It's not perfect though: in the first
+place, we could still get a deadlock among three or more insertion processes,
+and in the second place, it's impractical to preserve this invariant in every
+case when we expand or split an inner tuple. So we still have to allow for
+deadlocks.
+
+Insertion may also need to take locks on an additional inner and/or leaf page
+to add tuples of the right type(s), when there's not enough room on the pages
+it descended through. However, we don't care exactly which such page we add
+to, so deadlocks can be avoided by conditionally locking the additional
+buffers: if we fail to get lock on an additional page, just try another one.
+
+Search traversal algorithm is rather traditional. At each non-leaf level, it
+share-locks the page, identifies which node(s) in the current inner tuple
+need to be visited, and puts those addresses on a stack of pages to examine
+later. It then releases lock on the current buffer before visiting the next
+stack item. So only one page is locked at a time, and no deadlock is
+possible. But instead, we have to worry about race conditions: by the time
+we arrive at a pointed-to page, a concurrent insertion could have replaced
+the target inner tuple (or leaf tuple chain) with data placed elsewhere.
+To handle that, whenever the insertion algorithm changes a nonempty downlink
+in an inner tuple, it places a "redirect tuple" in place of the lower-level
+inner tuple or leaf-tuple chain head that the link formerly led to. Scans
+(though not insertions) must be prepared to honor such redirects. Only a
+scan that had already visited the parent level could possibly reach such a
+redirect tuple, so we can remove redirects once all active transactions have
+been flushed out of the system.
+
+
+DEAD TUPLES
+
+Tuples on leaf pages can be in one of four states:
+
+SPGIST_LIVE: normal, live pointer to a heap tuple.
+
+SPGIST_REDIRECT: placeholder that contains a link to another place in the
+index. When a chain of leaf tuples has to be moved to another page, a
+redirect tuple is inserted in place of the chain's head tuple. The parent
+inner tuple's downlink is updated when this happens, but concurrent scans
+might be "in flight" from the parent page to the child page (since they
+release lock on the parent page before attempting to lock the child).
+The redirect pointer serves to tell such a scan where to go. A redirect
+pointer is only needed for as long as such concurrent scans could be in
+progress. Eventually, it's converted into a PLACEHOLDER dead tuple by
+VACUUM, and is then a candidate for replacement. Searches that find such
+a tuple (which should never be part of a chain) should immediately proceed
+to the other place, forgetting about the redirect tuple. Insertions that
+reach such a tuple should raise error, since a valid downlink should never
+point to such a tuple.
+
+SPGIST_DEAD: tuple is dead, but it cannot be removed or moved to a
+different offset on the page because there is a link leading to it from
+some inner tuple elsewhere in the index. (Such a tuple is never part of a
+chain, since we don't need one unless there is nothing live left in its
+chain.) Searches should ignore such entries. If an insertion action
+arrives at such a tuple, it should either replace it in-place (if there's
+room on the page to hold the desired new leaf tuple) or replace it with a
+redirection pointer to wherever it puts the new leaf tuple.
+
+SPGIST_PLACEHOLDER: tuple is dead, and there are known to be no links to
+it from elsewhere. When a live tuple is deleted or moved away, and not
+replaced by a redirect pointer, it is replaced by a placeholder to keep
+the offsets of later tuples on the same page from changing. Placeholders
+can be freely replaced when adding a new tuple to the page, and also
+VACUUM will delete any that are at the end of the range of valid tuple
+offsets. Both searches and insertions should complain if a link from
+elsewhere leads them to a placeholder tuple.
+
+When the root page is also a leaf, all its tuple should be in LIVE state;
+there's no need for the others since there are no links and no need to
+preserve offset numbers.
+
+Tuples on inner pages can be in LIVE, REDIRECT, or PLACEHOLDER states.
+The REDIRECT state has the same function as on leaf pages, to send
+concurrent searches to the place where they need to go after an inner
+tuple is moved to another page. Expired REDIRECT pointers are converted
+to PLACEHOLDER status by VACUUM, and are then candidates for replacement.
+DEAD state is not currently possible, since VACUUM does not attempt to
+remove unused inner tuples.
+
+
+VACUUM
+
+VACUUM (or more precisely, spgbulkdelete) performs a single sequential scan
+over the entire index. On both leaf and inner pages, we can convert old
+REDIRECT tuples into PLACEHOLDER status, and then remove any PLACEHOLDERs
+that are at the end of the page (since they aren't needed to preserve the
+offsets of any live tuples). On leaf pages, we scan for tuples that need
+to be deleted because their heap TIDs match a vacuum target TID.
+
+If we find a deletable tuple that is not at the head of its chain, we
+can simply replace it with a PLACEHOLDER, updating the chain links to
+remove it from the chain. If it is at the head of its chain, but there's
+at least one live tuple remaining in the chain, we move that live tuple
+to the head tuple's offset, replacing it with a PLACEHOLDER to preserve
+the offsets of other tuples. This keeps the parent inner tuple's downlink
+valid. If we find ourselves deleting all live tuples in a chain, we
+replace the head tuple with a DEAD tuple and the rest with PLACEHOLDERS.
+The parent inner tuple's downlink thus points to the DEAD tuple, and the
+rules explained in the previous section keep everything working.
+
+VACUUM doesn't know a-priori which tuples are heads of their chains, but
+it can easily figure that out by constructing a predecessor array that's
+the reverse map of the nextOffset links (ie, when we see tuple x links to
+tuple y, we set predecessor[y] = x). Then head tuples are the ones with
+no predecessor.
+
+Because insertions can occur while VACUUM runs, a pure sequential scan
+could miss deleting some target leaf tuples, because they could get moved
+from a not-yet-visited leaf page to an already-visited leaf page as a
+consequence of a PickSplit or MoveLeafs operation. Failing to delete any
+target TID is not acceptable, so we have to extend the algorithm to cope
+with such cases. We recognize that such a move might have occurred when
+we see a leaf-page REDIRECT tuple whose XID indicates it might have been
+created after the VACUUM scan started. We add the redirection target TID
+to a "pending list" of places we need to recheck. Between pages of the
+main sequential scan, we empty the pending list by visiting each listed
+TID. If it points to an inner tuple (from a PickSplit), add each downlink
+TID to the pending list. If it points to a leaf page, vacuum that page.
+(We could just vacuum the single pointed-to chain, but vacuuming the
+whole page simplifies the code and reduces the odds of VACUUM having to
+modify the same page multiple times.) To ensure that pending-list
+processing can never get into an endless loop, even in the face of
+concurrent index changes, we don't remove list entries immediately but
+only after we've completed all pending-list processing; instead we just
+mark items as done after processing them. Adding a TID that's already in
+the list is a no-op, whether or not that item is marked done yet.
+
+spgbulkdelete also updates the index's free space map.
+
+Currently, spgvacuumcleanup has nothing to do if spgbulkdelete was
+performed; otherwise, it does an spgbulkdelete scan with an empty target
+list, so as to clean up redirections and placeholders, update the free
+space map, and gather statistics.
+
+
+LAST USED PAGE MANAGEMENT
+
+The list of last used pages contains four pages - a leaf page and three
+inner pages, one from each "triple parity" group. (Actually, there's one
+such list for the main tree and a separate one for the nulls tree.) This
+list is stored between calls on the index meta page, but updates are never
+WAL-logged to decrease WAL traffic. Incorrect data on meta page isn't
+critical, because we could allocate a new page at any moment.
+
+
+AUTHORS
+
+ Teodor Sigaev <teodor@sigaev.ru>
+ Oleg Bartunov <oleg@sai.msu.su>