src/backend/access/hash/README

Hash Indexing
=============

This directory contains an implementation of hash indexing for Postgres.
Most of the core ideas are taken from Margo Seltzer and Ozan Yigit,
A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference,
January 1991.  (Our in-memory hashtable implementation,
src/backend/utils/hash/dynahash.c, also relies on some of the same concepts;
it is derived from code written by Esmond Pitt and later improved by Margo
among others.)

A hash index consists of two or more "buckets", into which tuples are
placed whenever their hash key maps to the bucket number.  The
key-to-bucket-number mapping is chosen so that the index can be
incrementally expanded.  When a new bucket is to be added to the index,
exactly one existing bucket will need to be "split", with some of its
tuples being transferred to the new bucket according to the updated
key-to-bucket-number mapping.  This is essentially the same hash table
management technique embodied in src/backend/utils/hash/dynahash.c for
in-memory hash tables.

Each bucket in the hash index comprises one or more index pages.  The
bucket's first page is permanently assigned to it when the bucket is
created.  Additional pages, called "overflow pages", are added if the
bucket receives too many tuples to fit in the primary bucket page.
The pages of a bucket are chained together in a doubly-linked list
using fields in the index page special space.

There is currently no provision to shrink a hash index, other than by
rebuilding it with REINDEX.  Overflow pages can be recycled for reuse
in other buckets, but we never give them back to the operating system.
There is no provision for reducing the number of buckets, either.

As of PostgreSQL 8.4, hash index entries store only the hash code, not the
actual data value, for each indexed item.  This makes the index entries
smaller (perhaps very substantially so) and speeds up various operations.
In particular, we can speed searches by keeping the index entries in any
one index page sorted by hash code, thus allowing binary search to be used
within an index page.  Note however that there is *no* assumption about the
relative ordering of hash codes across different index pages of a bucket.


Page Addressing
---------------

There are four kinds of pages in a hash index: the meta page (page zero),
which contains statically allocated control information; primary bucket
pages; overflow pages; and bitmap pages, which keep track of overflow
pages that have been freed and are available for re-use.  For addressing
purposes, bitmap pages are regarded as a subset of the overflow pages.

Primary bucket pages and overflow pages are allocated independently (since
any given index might need more or fewer overflow pages relative to its
number of buckets).  The hash code uses an interesting set of addressing
rules to support a variable number of overflow pages while not having to
move primary bucket pages around after they are created.

Primary bucket pages (henceforth just "bucket pages") are allocated in
power-of-2 groups, called "split points" in the code.  That means at every
new splitpoint we double the existing number of buckets.
Allocating huge chunks of bucket pages all at once isn't optimal, since it
could take a very long time to consume them.  To limit this growth of the
index size, we break up the allocation of buckets at a splitpoint into 4
equal phases.  If (2 ^ x) is the total number of buckets to be allocated at
a splitpoint (from now on we shall call this a splitpoint group), then we
allocate one quarter, (2 ^ (x - 2)), of the total buckets in each phase of
the splitpoint group.  The next quarter is allocated only once the buckets
of the previous phase have been consumed.  For the initial splitpoint
groups < 10 we allocate all of their buckets in a single phase, since the
number of buckets allocated in those groups is small.  For groups >= 10 the
allocation is distributed among four equal phases.  At group 10 we allocate
(2 ^ 9) buckets in 4 different phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, where
the numbers in curly braces indicate the number of buckets allocated within
each phase of splitpoint group 10.  For splitpoint groups 11 and 12 the
allocation phases are {2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and
{2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively.  At each splitpoint group we
thus still double the total number of buckets relative to the previous
group, but incrementally, one phase at a time.  The bucket pages allocated
within one phase of a splitpoint group appear consecutively in the index.
This addressing scheme allows the physical location of a bucket page to be
computed from the bucket number relatively easily, using only a small
amount of control information.  As can be seen in the function
_hash_spareindex, for a given bucket number we first compute the splitpoint
group it belongs to and then the phase within that group; adding them gives
the global splitpoint phase number S to which the bucket belongs.  Adding
"hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
metapage) to the given bucket number yields its physical block number.
hashm_spares[S] can be interpreted as the total number of overflow pages
that have been allocated before the bucket pages of splitpoint phase S.
hashm_spares[0] is always 0, so that buckets 0 and 1 always appear at block
numbers 1 and 2, just after the meta page.  We always have
hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
former.  The difference between the two represents the number of overflow
pages appearing between the bucket page groups of splitpoint phases N and
N+1.  (Note: the above describes what happens when filling an initially
minimally sized hash index.  In practice, we try to estimate the required
index size and allocate a suitable number of splitpoint phases immediately,
to avoid expensive re-splitting during the initial index build.)
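
To make the allocation schedule concrete, here is a minimal standalone C
sketch of the group/phase arithmetic described above.  It is not the
PostgreSQL implementation (the real logic lives in _hash_spareindex() and
related code, and the block-number computation additionally needs the
hashm_spares[] array); the constant names are illustrative only.

    /*
     * Sketch of the splitpoint-group allocation schedule: groups below 10
     * allocate all of their buckets in one phase, later groups in four.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define SINGLE_PHASE_GROUPS 10   /* illustrative name for the cutover */
    #define PHASES_PER_GROUP    4

    /* Total buckets allocated by splitpoint group g (doubling the count). */
    static uint64_t
    buckets_in_group(int g)
    {
        return (g == 0) ? 1 : ((uint64_t) 1 << (g - 1));
    }

    static int
    phases_in_group(int g)
    {
        return (g < SINGLE_PHASE_GROUPS) ? 1 : PHASES_PER_GROUP;
    }

    int
    main(void)
    {
        for (int g = 8; g <= 12; g++)
        {
            int         nphases = phases_in_group(g);
            uint64_t    per_phase = buckets_in_group(g) / nphases;

            printf("group %2d: %d phase(s) of %llu bucket(s) each\n",
                   g, nphases, (unsigned long long) per_phase);
        }
        return 0;
    }

Running this prints, for example, "group 10: 4 phase(s) of 128 bucket(s)
each", matching the {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7} example above.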

When S splitpoints exist altogether, the array entries hashm_spares[0]
through hashm_spares[S] are valid; hashm_spares[S] records the current
total number of overflow pages.  New overflow pages are created as needed
at the end of the index, and recorded by incrementing hashm_spares[S].
When it is time to create a new splitpoint phase's worth of bucket pages,
we copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
stored in the hashm_ovflpoint field of the meta page).  This has the
effect of reserving the correct number of bucket pages at the end of the
index, and preparing to allocate additional overflow pages after those
bucket pages.  hashm_spares[] entries before S cannot change anymore,
since that would require moving already-created bucket pages.

The last page nominally used by the index is always determinable from
hashm_spares[S].  To avoid complaints from smgr, the logical EOF as seen by
the filesystem and smgr must always be greater than or equal to this page.
We have to allow the case "greater than" because it's possible that during
an index extension we crash after allocating filesystem space and before
updating the metapage.  Note that on filesystems that allow "holes" in
files, it's entirely likely that pages before the logical EOF are not yet
allocated: when we allocate a new splitpoint phase's worth of bucket pages,
we physically zero the last such page to force the EOF up, and the first
such page will be used immediately, but the intervening pages are not
written until needed.

Since overflow pages may be recycled if enough tuples are deleted from
their bucket, we need a way to keep track of currently-free overflow
pages.  The state of each overflow page (0 = available, 1 = not available)
is recorded in "bitmap" pages dedicated to this purpose.  The entries in
the bitmap are indexed by "bit number", a zero-based count in which every
overflow page has a unique entry.  We can convert between an overflow
page's physical block number and its bit number using the information in
hashm_spares[] (see hashovfl.c for details).  The bit number sequence
includes the bitmap pages, which is the reason for saying that bitmap
pages are a subset of the overflow pages.  It turns out in fact that each
bitmap page's first bit represents itself --- this is not an essential
property, but falls out of the fact that we only allocate another bitmap
page when we really need one.  Bit number zero always corresponds to the
first bitmap page, which is allocated during index creation just after all
the initially created buckets.


Lock Definitions
----------------

Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks.  Here as elsewhere in PostgreSQL,
cleanup lock means that we hold an exclusive lock on the buffer and have
observed at some point after acquiring the lock that we hold the only pin
on that buffer.  For hash indexes, a cleanup lock on a primary bucket page
represents the right to perform an arbitrary reorganization of the entire
bucket.  Therefore, scans retain a pin on the primary bucket page for the
bucket they are currently scanning.  Splitting a bucket requires a cleanup
lock on both the old and new primary bucket pages.  VACUUM therefore takes
a cleanup lock on every bucket page in order to remove tuples.  It can also
remove tuples copied to a new bucket by any previous split operation,
because the cleanup lock taken on the primary bucket page guarantees that
no scans which started prior to the most recent split can still be in
progress.  After cleaning each page individually, it attempts to take a
cleanup lock on the primary bucket page in order to "squeeze" the bucket
down to the minimum possible number of pages.

To avoid deadlocks, we must be consistent about the lock order in which we
lock buckets for operations that require locks on two different buckets.
We choose to always lock the lower-numbered bucket first.  The metapage is
only ever locked after all bucket locks have been taken.


Metapage Caching
----------------

Both scanning the index and inserting tuples require locating the bucket
where a given tuple ought to be located.  To do this, we need the bucket
count, highmask, and lowmask from the metapage; however, it's undesirable
for performance reasons to have to lock and pin the metapage for every such
operation.  Instead, we retain a cached copy of the metapage in each
backend's relcache entry.  This will produce the correct bucket mapping as
long as the target bucket hasn't been split since the last cache refresh.

To guard against the possibility that such a split has occurred, the
primary page of each bucket chain stores the number of buckets that
existed as of the time the bucket was last split, or, if never split, as
of the time it was created, in the space normally used for the previous
block number (that is, hasho_prevblkno).  This doesn't cost anything
because the primary bucket page is always the first page in the chain, and
the previous block number is therefore always, in reality,
InvalidBlockNumber.

After computing the ostensibly-correct bucket number based on our cached
copy of the metapage, we lock the corresponding primary bucket page and
check whether the bucket count stored in hasho_prevblkno is greater than
the number of buckets stored in our cached copy of the metapage.  If so,
the bucket has certainly been split, because the count must originally
have been less than the number of buckets that existed at that time and
can't have increased except due to a split.  If not, the bucket can't have
been split, because a split would have created a new bucket with a higher
bucket number than any we'd seen previously.  In the latter case, we've
locked the correct bucket and can proceed; in the former case, we must
release the lock on this bucket, lock the metapage, update our cache,
unlock the metapage, and retry.

Needing to retry occasionally might seem expensive, but the number of
times any given bucket can be split is limited to a few dozen no matter
how many times the hash index is accessed, because the total number of
buckets is limited to less than 2^32.  On the other hand, the number of
times we access a bucket is unbounded and will be several orders of
magnitude larger even in unsympathetic cases.

(The metapage cache is new in v10.  Older hash indexes had the primary
bucket page's hasho_prevblkno initialized to InvalidBuffer.)
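
As an illustration of why the cached bucket count, highmask, and lowmask
suffice to locate a bucket, here is a small standalone C sketch of the
linear-hashing style key-to-bucket mapping (a sketch only, not the actual
PostgreSQL code; the real mapping lives in hashutil.c).  After computing
the bucket this way, the backend validates it against hasho_prevblkno and
retries as described above.  The values in main() are made up for
illustration.

    #include <stdio.h>
    #include <stdint.h>

    /* Map a 32-bit hash code to a bucket number, linear-hashing style. */
    static uint32_t
    hashkey_to_bucket(uint32_t hashcode,
                      uint32_t maxbucket,  /* highest existing bucket number */
                      uint32_t highmask,   /* mask spanning the current doubling */
                      uint32_t lowmask)    /* mask spanning the previous doubling */
    {
        uint32_t    bucket = hashcode & highmask;

        if (bucket > maxbucket)
            bucket = bucket & lowmask;     /* target not created yet: fold back */
        return bucket;
    }

    int
    main(void)
    {
        /* Suppose buckets 0..5 exist: maxbucket = 5, highmask = 7, lowmask = 3. */
        uint32_t    hash = 0x9c6b72f6;

        printf("hash 0x%08x -> bucket %u\n",
               (unsigned) hash,
               (unsigned) hashkey_to_bucket(hash, 5, 7, 3));
        return 0;
    }

Because 0x9c6b72f6 & 7 is 6, which is greater than maxbucket 5, the sketch
folds back with lowmask and reports bucket 2.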


Pseudocode Algorithms
---------------------

Various flags that are used in hash index operations are described below:

The bucket-being-split and bucket-being-populated flags indicate that a
split operation is in progress for a bucket.  During a split operation, the
bucket-being-split flag is set on the old bucket and the
bucket-being-populated flag is set on the new bucket.  These flags are
cleared once the split operation is finished.

The split-cleanup flag indicates that a bucket which has been recently
split still contains tuples that were also copied to the new bucket; it
essentially marks the split as incomplete.  Once we're certain that no
scans which started before the new bucket was fully populated are still in
progress, we can remove the copies from the old bucket and clear the flag.
We insist that this flag must be clear before splitting a bucket; thus, a
bucket can't be split again until the previous split is totally complete.

The moved-by-split flag on a tuple indicates that the tuple was moved from
the old to the new bucket.  Concurrent scans will skip such tuples until
the split operation is finished.  Once the tuple is marked as
moved-by-split, it will remain so forever, but that does no harm.  We have
intentionally not cleared it, as that could generate additional I/O which
is not necessary.

The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the same
bucket); insertion of a new tuple into the correct bucket; enlarging the
hash table by splitting an existing bucket; and garbage collection
(deletion of dead tuples and compaction of buckets).  Bucket splitting is
done at conclusion of any insertion that leaves the hash table more full
than the target load factor, but it is convenient to consider it as an
independent operation.  Note that we do not have a bucket-merge operation
--- the number of buckets never shrinks.  Insertion, splitting, and
garbage collection may all need access to freelist management, which keeps
track of available overflow pages.

The reader algorithm is:

    lock the primary bucket page of the target bucket
    if the target bucket is still being populated by a split:
        release the buffer content lock on current bucket page
        pin and acquire the buffer content lock on old bucket in shared mode
        release the buffer content lock on old bucket, but not pin
        retake the buffer content lock on new bucket
        arrange to scan the old bucket normally and the new bucket for
         tuples which are not moved-by-split
-- then, per read request:
    reacquire content lock on current page
    step to next page if necessary (no chaining of content locks, but keep
     the pin on the primary bucket throughout the scan)
    save all the matching tuples from current index page into an items array
    release pin and content lock (but if it is primary bucket page retain
     its pin till the end of the scan)
    get tuple from an items array
-- at scan shutdown:
    release all pins still held

Holding the buffer pin on the primary bucket page for the whole scan
prevents the reader's current-tuple pointer from being invalidated by
splits or compactions.  (Of course, other buckets can still be split or
compacted.)

To minimize lock/unlock traffic, a hash index scan always searches the
entire hash page to identify all the matching items at once, copying their
heap tuple IDs into backend-local storage.  The heap tuple IDs are then
processed while not holding any page lock within the index, thereby
allowing concurrent insertions into the same index page without any need
for the reader to re-find its current scan position.  We do continue to
hold a pin on the bucket page, to protect against concurrent deletions and
bucket splits.

To allow for scans during a bucket split, if at the start of the scan the
bucket is marked as bucket-being-populated, the scan covers all the tuples
in that bucket except for those that are marked as moved-by-split.  Once it
finishes scanning all the tuples in the current bucket, it scans the old
bucket from which this bucket was formed by the split.
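
The "save all the matching tuples from current index page into an items
array" step relies on the entries within a page being kept sorted by hash
code.  The following standalone C sketch (not the actual PostgreSQL code;
the real search is done in hashutil.c) shows the idea: binary-search the
first possible match, then collect the run of equal hash codes into a
backend-local array.  Here a page is modeled as a plain sorted array of
hash codes, with array offsets standing in for heap tuple IDs.

    #include <stdio.h>
    #include <stdint.h>

    /* Return the first array position whose hash code is >= "search". */
    static int
    first_ge(const uint32_t *codes, int nitems, uint32_t search)
    {
        int     lo = 0, hi = nitems;       /* half-open range [lo, hi) */

        while (lo < hi)
        {
            int     mid = lo + (hi - lo) / 2;

            if (codes[mid] < search)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    int
    main(void)
    {
        /* hash codes on one index page, kept sorted as described above */
        uint32_t    codes[] = {7, 19, 19, 19, 42, 42, 96, 103};
        int         nitems = (int) (sizeof(codes) / sizeof(codes[0]));
        uint32_t    search = 42;
        int         matches[8];
        int         nmatches = 0;

        /* binary-search the first possible match, then scan forward */
        for (int i = first_ge(codes, nitems, search);
             i < nitems && codes[i] == search;
             i++)
            matches[nmatches++] = i;       /* offsets stand in for heap TIDs */

        printf("found %d matching item(s) for hash code %u\n",
               nmatches, (unsigned) search);
        return 0;
    }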

The insertion algorithm is rather similar:

    lock the primary bucket page of the target bucket
-- (so far same as reader, except for acquisition of buffer content lock in
    exclusive mode on primary bucket page)
    if the bucket-being-split flag is set for a bucket and pin count on it
     is one, then finish the split
        release the buffer content lock on current bucket
        get the "new" bucket which was being populated by the split
        scan the new bucket and form the hash table of TIDs
        conditionally get the cleanup lock on old and new buckets
        if we get the lock on both the buckets
            finish the split using algorithm mentioned below for split
        release the pin on old bucket and restart the insert from beginning
    if current page is full, first check if this page contains any dead
     tuples; if yes, remove dead tuples from the current page and again
     check for the availability of the space.  If enough space is found,
     insert the tuple, else release lock but not pin, read/exclusive-lock
     next page; repeat as needed
    >> see below if no space in any page of bucket
    take buffer content lock in exclusive mode on metapage
    insert tuple at appropriate place in page
    mark current page dirty
    increment tuple count, decide if split needed
    mark meta page dirty
    write WAL for insertion of tuple
    release the buffer content lock on metapage
    release buffer content lock on current page
    if current page is not a bucket page, release the pin on bucket page
    if split is needed, enter Split algorithm below
    release the pin on metapage

To speed searches, the index entries within any individual index page are
kept sorted by hash code; the insertion code must take care to insert new
entries in the right place.  It is okay for an insertion to take place in a
bucket that is being actively scanned, because readers can cope with this
as explained above.  We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.

To avoid deadlock between readers and inserters, whenever there is a need
to lock multiple buckets, we always take the locks in the order suggested
in Lock Definitions above.  This allows a very high degree of concurrency.
(The exclusive metapage lock taken to update the tuple count is stronger
than necessary, since readers do not care about the tuple count, but the
lock is held for such a short time that this is probably not an issue.)

When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
Details of that part of the algorithm appear later.
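
The "decide if split needed" step above is essentially a load-factor test:
the index is considered overfull once the tuple count exceeds the target
number of tuples per bucket times the current number of buckets.  A minimal
standalone sketch of that test, with illustrative numbers rather than the
real metapage fields:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Overfull when tuples exceed (target tuples per bucket) * (buckets). */
    static bool
    split_needed(double ntuples, uint32_t nbuckets, uint16_t ffactor)
    {
        return ntuples > (double) ffactor * nbuckets;
    }

    int
    main(void)
    {
        /* e.g. 6 buckets with a target of 80 tuples per bucket */
        printf("480 tuples: %s\n", split_needed(480, 6, 80) ? "split" : "no split");
        printf("481 tuples: %s\n", split_needed(481, 6, 80) ? "split" : "no split");
        return 0;
    }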

The page split algorithm is entered whenever an inserter observes that the
index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:

    pin meta page and take buffer content lock in exclusive mode
    check split still needed
    if split not needed anymore, drop buffer content lock and pin and exit
    decide which bucket to split
    try to take a cleanup lock on that bucket; if fail, give up
    if that bucket is still being split or has split-cleanup work:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
    mark the old and new buckets indicating split is in progress
    mark both old and new buckets as dirty
    write WAL for allocation of new page for split
    copy the tuples that belong to the new bucket from the old bucket,
     marking them as moved-by-split
    write WAL record for moving tuples to the new page once the new page is
     full or all the pages of the old bucket are finished
    release lock but not pin for primary bucket page of old bucket,
     read/shared-lock next page; repeat as needed
    clear the bucket-being-split and bucket-being-populated flags
    mark the old bucket indicating split-cleanup
    write WAL for changing the flags on both old and new buckets

The split operation's attempt to acquire a cleanup lock on the old bucket
could fail if another process holds any lock or pin on it.  We do not want
to wait if that happens, because we don't want to wait while holding the
metapage exclusive-lock.  So, this is a conditional LWLockAcquire
operation, and if it fails we just abandon the attempt to split.  This is
all right since the index is overfull but perfectly functional.  Every
subsequent inserter will try to split, and eventually one will succeed.  If
multiple inserters failed to split, the index might still be overfull, but
eventually the index will not be overfull and split attempts will stop.
(We could make a successful splitter loop to see if the index is still
overfull, but it seems better to distribute the split overhead across
successive insertions.)

If a split fails partway through (e.g. due to insufficient disk space or an
interrupt), the index will not be corrupted.  Instead, we'll retry the
split every time a tuple is inserted into the old bucket prior to inserting
the new tuple; eventually, we should succeed.  The fact that a split is
left unfinished doesn't prevent subsequent buckets from being split, but we
won't try to split the bucket again until the prior split is finished.  In
other words, a bucket can be in the middle of being split for some time,
but it can't be in the middle of two splits at the same time.
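
The "copy the tuples that belong to the new bucket" step in the algorithm
above simply re-applies the key-to-bucket mapping with the post-split
bucket count and masks: a tuple whose hash code now maps to the new bucket
is copied there (and the copy marked moved-by-split), while other tuples
stay put.  The standalone sketch below is illustrative only (not the actual
PostgreSQL code), and the mask values in main() are an assumed example
state.

    #include <stdio.h>
    #include <stdint.h>

    static uint32_t
    hashkey_to_bucket(uint32_t hashcode, uint32_t maxbucket,
                      uint32_t highmask, uint32_t lowmask)
    {
        uint32_t    bucket = hashcode & highmask;

        if (bucket > maxbucket)
            bucket = bucket & lowmask;
        return bucket;
    }

    int
    main(void)
    {
        /*
         * Assume buckets 0..3 existed and a split creates bucket 4, which
         * splits old bucket 0; post-split state: maxbucket = 4,
         * highmask = 7, lowmask = 3.
         */
        uint32_t    new_bucket = 4;
        uint32_t    hashcodes[] = {0x10, 0x14, 0x08, 0x0c}; /* tuples in old bucket 0 */

        for (int i = 0; i < 4; i++)
        {
            uint32_t    dest = hashkey_to_bucket(hashcodes[i], 4, 7, 3);

            printf("hash 0x%02x -> bucket %u%s\n",
                   (unsigned) hashcodes[i], (unsigned) dest,
                   dest == new_bucket ? " (copied, marked moved-by-split)"
                                      : " (stays in old bucket)");
        }
        return 0;
    }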

The fourth operation is garbage collection (bulk deletion):

    next bucket := 0
    pin metapage and take buffer content lock in exclusive mode
    fetch current max bucket number
    release meta page buffer content lock and pin
    while next bucket <= max bucket do
        acquire cleanup lock on primary bucket page
        loop:
            scan and remove tuples
            mark the target page dirty
            write WAL for deleting tuples from target page
            if this is the last bucket page, break out of loop
            pin and x-lock next page
            release prior lock and pin (except keep pin on primary bucket
             page)
        if the page we have locked is not the primary bucket page:
            release lock and take exclusive lock on primary bucket page
        if there are no other pins on the primary bucket page:
            squeeze the bucket to remove free space
        release the pin on primary bucket page
        next bucket ++
    end loop
    pin metapage and take buffer content lock in exclusive mode
    check if number of buckets changed
    if so, release content lock and pin and return to for-each-bucket loop
    else update metapage tuple count
    mark meta page dirty and write WAL for update of metapage
    release buffer content lock and pin

Note that this is designed to allow concurrent splits and scans.  If a
split occurs, tuples relocated into the new bucket will be visited twice by
the scan, but that does no harm.  See also "Interlocking Between Scans and
VACUUM", below.

We must be careful about the statistics reported by the VACUUM operation.
What we can do is count the number of tuples scanned, and believe this in
preference to the stored tuple count if the stored tuple count and number
of buckets did *not* change at any time during the scan.  This provides a
way of correcting the stored tuple count if it gets out of sync for some
reason.  But if a split or insertion does occur concurrently, the scan
count is untrustworthy; instead, subtract the number of tuples deleted from
the stored tuple count and use that.


Interlocking Between Scans and VACUUM
-------------------------------------

Since we release the lock on the bucket page during a cleanup scan of a
bucket, a concurrent scan could start in that bucket before we've finished
vacuuming it.  If a scan gets ahead of cleanup, we could have the following
problem: (1) the scan sees heap TIDs that are about to be removed before
they are processed by VACUUM, (2) the scan decides that one or more of
those TIDs are dead, (3) VACUUM completes, (4) one or more of the TIDs the
scan decided were dead are reused for an unrelated tuple, and finally (5)
the scan wakes up and erroneously kills the new tuple.

Note that this requires VACUUM and a scan to be active in the same bucket
at the same time.  If VACUUM completes before the scan starts, the scan
never has a chance to see the dead tuples; if the scan completes before the
VACUUM starts, the heap TIDs can't have been reused meanwhile.
Furthermore, VACUUM can't start on a bucket that has an active scan,
because the scan holds a pin on the primary bucket page, and VACUUM must
take a cleanup lock on that page in order to begin cleanup.  Therefore, the
only way this problem can occur is for a scan to start after VACUUM has
released the cleanup lock on the bucket but before it has processed the
entire bucket and then overtake the cleanup operation.

Currently, we prevent this using lock chaining: cleanup locks the next page
in the chain before releasing the lock and pin on the page just processed.


Free Space Management
---------------------

(Question: why is this so complicated?  Why not just have a linked list
of free pages with the list head in the metapage?  It's not like we
avoid needing to modify the metapage with all this.)

Free space management consists of two sub-algorithms, one for reserving
an overflow page to add to a bucket chain, and one for returning an empty
overflow page to the free pool.

Obtaining an overflow page:

    take metapage content lock in exclusive mode
    determine next bitmap page number; if none, exit loop
    release meta page content lock
    pin bitmap page and take content lock in exclusive mode
    search for a free page (zero bit in bitmap)
    if found:
        set bit in bitmap
        mark bitmap page dirty
        take metapage buffer content lock in exclusive mode
        if first-free-bit value did not change,
            update it and mark meta page dirty
    else (not found):
        release bitmap page buffer content lock
        loop back to try next bitmap page, if any
-- here when we have checked all bitmap pages; we hold meta excl. lock
    extend index to add another overflow page; update meta information
    mark meta page dirty
    return page number

It is slightly annoying to release and reacquire the metapage lock
multiple times, but it seems best to do it that way to minimize loss of
concurrency against processes just entering the index.  We don't want
to hold the metapage exclusive lock while reading in a bitmap page.
(We can at least avoid repeated buffer pin/unpin here.)

The normal path for extending the index does not require doing I/O while
holding the metapage lock.  We do have to do I/O when the extension
requires adding a new bitmap page as well as the required overflow page
... but that is an infrequent case, so the loss of concurrency seems
acceptable.

The portion of tuple insertion that calls the above subroutine looks
like this:

    -- having determined that no space is free in the target bucket:
    remember last page of bucket, drop write lock on it
    re-write-lock last page of bucket
    if it is not last anymore, step to the last page
    execute free-page-acquire (obtaining an overflow page) mechanism
     described above
    update (former) last page to point to the new page and mark buffer
     dirty
    write-lock and initialize new page, with back link to former last page
    write WAL for addition of overflow page
    release the locks on meta page and bitmap page acquired in
     free-page-acquire algorithm
    release the lock on former last page
    release the lock on new overflow page
    insert tuple into new page
    -- etc.

Notice this handles the case where two concurrent inserters try to extend
the same bucket.  They will end up with a valid, though perhaps
space-inefficient, configuration: two overflow pages will be added to the
bucket, each containing one tuple.

The last part of this violates the rule about holding write lock on two
pages concurrently, but it should be okay to write-lock the previously
free page; there can be no other process holding lock on it.

Bucket splitting uses a similar algorithm if it has to extend the new
bucket, but it need not worry about concurrent extension since it has
buffer content lock in exclusive mode on the new bucket.
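
For the "search for a free page (zero bit in bitmap)" step of the
free-page-acquire algorithm above, a bitmap page is scanned for its first
clear bit.  The following standalone C sketch is illustrative only; the
real code and page layout live in hashovfl.c, and WORDS_PER_PAGE below is a
made-up constant rather than the real bitmap-page capacity.

    #include <stdio.h>
    #include <stdint.h>

    #define BITS_PER_WORD   32
    #define WORDS_PER_PAGE  4      /* illustrative; real pages hold far more */

    /* Return the first clear (free) bit on the page, or -1 if none. */
    static int
    first_free_bit(const uint32_t words[WORDS_PER_PAGE])
    {
        for (int w = 0; w < WORDS_PER_PAGE; w++)
        {
            if (words[w] == UINT32_MAX)
                continue;                       /* no zero bit in this word */
            for (int b = 0; b < BITS_PER_WORD; b++)
            {
                if ((words[w] & ((uint32_t) 1 << b)) == 0)
                    return w * BITS_PER_WORD + b;
            }
        }
        return -1;                              /* page full; try next bitmap page */
    }

    int
    main(void)
    {
        /* first two words fully allocated; first zero bit is bit 12 of word 2 */
        uint32_t    words[WORDS_PER_PAGE] = {UINT32_MAX, UINT32_MAX, 0xffffefff, 0};

        printf("first free bit on this bitmap page: %d\n", first_free_bit(words));
        return 0;
    }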

Freeing an overflow page requires the process to hold buffer content lock
in exclusive mode on the containing bucket, so it need not worry about
other accessors of pages in the bucket.  The algorithm is:

    delink overflow page from bucket chain
    (this requires read/update/write/release of fore and aft siblings)
    pin meta page and take buffer content lock in shared mode
    determine which bitmap page contains the free space bit for page
    release meta page buffer content lock
    pin bitmap page and take buffer content lock in exclusive mode
    retake meta page buffer content lock in exclusive mode
    move (insert) tuples that belong to the overflow page being freed
    update bitmap bit
    mark bitmap page dirty
    if page number is still less than first-free-bit,
        update first-free-bit field and mark meta page dirty
    write WAL for delinking overflow page operation
    release buffer content lock and pin
    release meta page buffer content lock and pin

We have to do it this way because we must clear the bitmap bit before
changing the first-free-bit field (hashm_firstfree).  It is possible that
we set first-free-bit too small (because someone has already reused the
page we just freed), but that is okay; the only cost is that the next
overflow page acquirer will scan more bitmap bits than it needs to.  What
must be avoided is having first-free-bit greater than the actual first
free bit, because then that free page would never be found by searchers.

The reason for moving tuples from the overflow page while delinking the
latter is to make that a single atomic operation.  Not doing so could lead
to spurious reads on standby.  Basically, the user might see the same tuple
twice.


WAL Considerations
------------------

Hash index operations like create index, insert, delete, bucket split,
allocate overflow page, and squeeze do not in themselves guarantee hash
index consistency after a crash.  To provide robustness, we write WAL for
each of these operations.

CREATE INDEX writes multiple WAL records.  First, we write a record to
cover the initialization of the metapage, followed by one for each new
bucket created, followed by one for the initial bitmap page.  It's not
important for index creation to appear atomic, because the index isn't yet
visible to any other transaction, and the creating transaction will roll
back in the event of a crash.  It would be difficult to cover the whole
operation with a single write-ahead log record anyway, because we can log
only a fixed number of pages, as given by XLR_MAX_BLOCK_ID (32), with the
current XLog machinery.

Ordinary item insertions (that don't force a page split or need a new
overflow page) are single WAL entries.  They touch a single bucket page and
the metapage.  The metapage is updated during replay just as it is updated
during the original operation.

If an insertion causes the addition of an overflow page, there will be one
WAL entry for the new overflow page and a second entry for the insert
itself.

If an insertion causes a bucket split, there will be one WAL entry for the
insert itself, followed by a WAL entry for allocating a new bucket,
followed by a WAL entry for each overflow bucket page in the new bucket to
which tuples are moved from the old bucket, followed by a WAL entry to
indicate that the split is complete for both old and new buckets.  A split
operation which requires overflow pages to complete will also need to write
a WAL record for each allocation of an overflow page.

As splitting involves multiple atomic actions, it's possible that the
system crashes between moving tuples from the bucket pages of the old
bucket to the new bucket.
In such a case, after recovery, the old and new buckets will be marked
with the bucket-being-split and bucket-being-populated flags respectively,
which indicates that a split is in progress for those buckets.  The reader
algorithm works correctly, as it will scan both the old and new buckets
while the split is in progress, as explained in the reader algorithm
section above.

We finish the split at the next insert or split operation on the old
bucket, as explained in the insert and split algorithms above.  It could be
done during searches, too, but it seems best not to put any extra updates
in what would otherwise be a read-only operation (updating is not possible
in hot standby mode anyway).  It would seem natural to complete the split
in VACUUM, but since splitting a bucket might require allocating a new
page, it might fail if you run out of disk space.  That would be bad
during VACUUM --- the reason for running VACUUM in the first place might be
that you ran out of disk space, and now VACUUM won't finish because you're
out of disk space.  In contrast, an insertion can require enlarging the
physical file anyway.

Deletion of tuples from a bucket is performed for two reasons: to remove
dead tuples, and to remove tuples that were moved by a bucket split.  A WAL
entry is made for each bucket page from which tuples are removed, and then
another WAL entry is made when we clear the needs-split-cleanup flag.  If
dead tuples are removed, a separate WAL entry is made to update the
metapage.

As deletion involves multiple atomic operations, it is quite possible that
the system crashes (a) after removing tuples from some of the bucket pages,
(b) before clearing the garbage flag, or (c) before updating the metapage.
If the system crashes before completing (b), it will again try to clean
the bucket during the next vacuum or insert after recovery, which can have
some performance impact, but it will work fine.  If the system crashes
before completing (c), after recovery there could be some additional splits
until the next vacuum updates the metapage, but the other operations like
insert, delete and scan will work correctly.  We can fix this problem by
actually updating the metapage based on the delete operation during replay,
but it's not clear whether it's worth the complication.

A squeeze operation moves tuples from pages later in the bucket chain to
pages earlier in the chain, and writes a WAL record when either the page to
which it is writing tuples becomes full or the page from which it is
removing tuples becomes empty.

As a squeeze operation involves multiple atomic operations, it is quite
possible that the system crashes before completing the operation on the
entire bucket.  After recovery, the operations will work correctly, but
the index will remain bloated and this can impact the performance of read
and insert operations until the next vacuum squeezes the bucket completely.


Other Notes
-----------

Cleanup locks prevent a split from occurring while *another* process is
stopped in a given bucket.  They also ensure that one of our *own*
backend's scans is not stopped in the bucket.