src/backend/access/hash/README

Hash Indexing
=============

This directory contains an implementation of hash indexing for Postgres.  Most of the core ideas are taken from Margo Seltzer and Ozan Yigit, A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference, January 1991.  (Our in-memory hashtable implementation, src/backend/utils/hash/dynahash.c, also relies on some of the same concepts; it is derived from code written by Esmond Pitt and later improved by Margo among others.)

A hash index consists of two or more "buckets", into which tuples are placed whenever their hash key maps to the bucket number.  The key-to-bucket-number mapping is chosen so that the index can be incrementally expanded.  When a new bucket is to be added to the index, exactly one existing bucket will need to be "split", with some of its tuples being transferred to the new bucket according to the updated key-to-bucket-number mapping.  This is essentially the same hash table management technique embodied in src/backend/utils/hash/dynahash.c for in-memory hash tables.

Each bucket in the hash index comprises one or more index pages.  The bucket's first page is permanently assigned to it when the bucket is created.  Additional pages, called "overflow pages", are added if the bucket receives too many tuples to fit in the primary bucket page.  The pages of a bucket are chained together in a doubly-linked list using fields in the index page special space.

There is currently no provision to shrink a hash index, other than by rebuilding it with REINDEX.  Overflow pages can be recycled for reuse in other buckets, but we never give them back to the operating system.  There is no provision for reducing the number of buckets, either.

As of PostgreSQL 8.4, hash index entries store only the hash code, not the actual data value, for each indexed item.  This makes the index entries smaller (perhaps very substantially so) and speeds up various operations.  In particular, we can speed searches by keeping the index entries in any one index page sorted by hash code, thus allowing binary search to be used within an index page.  Note however that there is *no* assumption about the relative ordering of hash codes across different index pages of a bucket.

Page Addressing
---------------

There are four kinds of pages in a hash index: the meta page (page zero), which contains statically allocated control information; primary bucket pages; overflow pages; and bitmap pages, which keep track of overflow pages that have been freed and are available for re-use.  For addressing purposes, bitmap pages are regarded as a subset of the overflow pages.

Primary bucket pages and overflow pages are allocated independently (since any given index might need more or fewer overflow pages relative to its number of buckets).  The hash code uses an interesting set of addressing rules to support a variable number of overflow pages while not having to move primary bucket pages around after they are created.

Primary bucket pages (henceforth just "bucket pages") are allocated in power-of-2 groups, called "split points" in the code.  That means at every new splitpoint we double the existing number of buckets.  Allocating such a group all at once isn't optimal for the larger splitpoints, since it could take a very long time before that many buckets are actually needed.  To avoid this exponential growth of index size, the allocation of buckets at a splitpoint is broken up into 4 equal phases.
If (2 ^ x) is the total number of buckets to be allocated at a splitpoint (from now on we shall call this a "splitpoint group"), then we allocate 1/4th of them (2 ^ (x - 2)) in each phase of the splitpoint group.  The next quarter is allocated only once the buckets of the previous phase have been consumed.  For the initial splitpoint groups < 10 we allocate all of their buckets in a single phase, since the number of buckets in those groups is small.  For groups >= 10 the allocation is distributed among four equal phases.  At group 10 we allocate (2 ^ 9) buckets in 4 different phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}; the numbers in curly braces indicate the number of buckets allocated within each phase of splitpoint group 10.  For splitpoint groups 11 and 12 the allocation phases are {2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and {2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively.  Thus at each splitpoint group we still double the total number of buckets relative to the previous group, but we do so incrementally.  The bucket pages allocated within one phase of a splitpoint group appear consecutively in the index.

This addressing scheme allows the physical location of a bucket page to be computed from the bucket number relatively easily, using only a small amount of control information.  As in the function _hash_spareindex, for a given bucket number we first compute the splitpoint group it belongs to and then the phase within that group to which the bucket belongs.  Combining them gives the global splitpoint phase number S to which the bucket belongs; we then simply add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the metapage) to the given bucket number to compute its physical address.  hashm_spares[S] can be interpreted as the total number of overflow pages that have been allocated before the bucket pages of splitpoint phase S.  hashm_spares[0] is always 0, so that buckets 0 and 1 always appear at block numbers 1 and 2, just after the meta page.  We always have hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the former.  The difference between the two represents the number of overflow pages appearing between the bucket page groups of splitpoint phases N and N+1.

(Note: the above describes what happens when filling an initially minimally sized hash index.  In practice, we try to estimate the required index size and allocate a suitable number of splitpoint phases immediately, to avoid expensive re-splitting during the initial index build.)

When S splitpoint phases exist altogether, the array entries hashm_spares[0] through hashm_spares[S] are valid; hashm_spares[S] records the current total number of overflow pages.  New overflow pages are created as needed at the end of the index, and recorded by incrementing hashm_spares[S].  When it is time to create a new splitpoint phase's worth of bucket pages, we copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is stored in the hashm_ovflpoint field of the meta page).  This has the effect of reserving the correct number of bucket pages at the end of the index, and preparing to allocate additional overflow pages after those bucket pages.  hashm_spares[] entries before S cannot change anymore, since that would require moving already-created bucket pages.
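To make the arithmetic concrete, here is a small self-contained sketch of the phase computation described above.  It is illustrative only: the real logic lives in _hash_spareindex() and BUCKET_TO_BLKNO(), and the constant and helper names below are invented for the example.

    #include <stdint.h>

    /* Splitpoint groups below this number are allocated in a single phase. */
    #define SINGLE_PHASE_GROUPS 10

    /* smallest e such that 2^e >= n (assumes 1 <= n <= 2^31) */
    static uint32_t
    ceil_log2(uint32_t n)
    {
        uint32_t e = 0;

        while (((uint32_t) 1 << e) < n)
            e++;
        return e;
    }

    /*
     * Global splitpoint phase number S for a given bucket number.  Early
     * groups contribute one phase each; groups >= SINGLE_PHASE_GROUPS
     * contribute four phases, each covering one quarter of the group.
     */
    static uint32_t
    splitpoint_phase(uint32_t bucket)
    {
        uint32_t group = ceil_log2(bucket + 1);
        uint32_t group_start;           /* first bucket of this group */
        uint32_t quarter;               /* buckets allocated per phase */

        if (group < SINGLE_PHASE_GROUPS)
            return group;

        group_start = (uint32_t) 1 << (group - 1);
        quarter = group_start / 4;

        return SINGLE_PHASE_GROUPS                  /* the single-phase groups */
            + (group - SINGLE_PHASE_GROUPS) * 4     /* earlier four-phase groups */
            + (bucket - group_start) / quarter;     /* phase within this group */
    }

For example, bucket 600 falls in splitpoint group 10 (buckets 512..1023) and in that group's first quarter (buckets 512..639), so its global phase is 10; per the formula above, its bucket page then lives at block 600 + hashm_spares[10] + 1.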
The last page nominally used by the index is always determinable from hashm_spares[S].  To avoid complaints from smgr, the logical EOF as seen by the filesystem and smgr must always be greater than or equal to this page.  We have to allow the case "greater than" because it's possible that during an index extension we crash after allocating filesystem space and before updating the metapage.  Note that on filesystems that allow "holes" in files, it's entirely likely that pages before the logical EOF are not yet allocated: when we allocate a new splitpoint phase's worth of bucket pages, we physically zero the last such page to force the EOF up, and the first such page will be used immediately, but the intervening pages are not written until needed.

Since overflow pages may be recycled if enough tuples are deleted from their bucket, we need a way to keep track of currently-free overflow pages.  The state of each overflow page (0 = available, 1 = not available) is recorded in "bitmap" pages dedicated to this purpose.  The entries in the bitmap are indexed by "bit number", a zero-based count in which every overflow page has a unique entry.  We can convert between an overflow page's physical block number and its bit number using the information in hashm_spares[] (see hashovfl.c for details).  The bit number sequence includes the bitmap pages, which is the reason for saying that bitmap pages are a subset of the overflow pages.  It turns out in fact that each bitmap page's first bit represents itself --- this is not an essential property, but falls out of the fact that we only allocate another bitmap page when we really need one.  Bit number zero always corresponds to the first bitmap page, which is allocated during index creation just after all the initially created buckets.

Lock Definitions
----------------

Concurrency control for hash indexes is provided using buffer content locks, buffer pins, and cleanup locks.  Here as elsewhere in PostgreSQL, cleanup lock means that we hold an exclusive lock on the buffer and have observed at some point after acquiring the lock that we hold the only pin on that buffer.  For hash indexes, a cleanup lock on a primary bucket page represents the right to perform an arbitrary reorganization of the entire bucket.  Therefore, scans retain a pin on the primary bucket page for the bucket they are currently scanning.  Splitting a bucket requires a cleanup lock on both the old and new primary bucket pages.  VACUUM therefore takes a cleanup lock on every bucket page in order to remove tuples.  It can also remove tuples copied to a new bucket by any previous split operation, because the cleanup lock taken on the primary bucket page guarantees that no scans which started prior to the most recent split can still be in progress.  After cleaning each page individually, it attempts to take a cleanup lock on the primary bucket page in order to "squeeze" the bucket down to the minimum possible number of pages.

To avoid deadlocks, we must be consistent about the lock order in which we lock the buckets for operations that require locks on two different buckets.  We choose to always lock the lower-numbered bucket first.  The metapage is only ever locked after all bucket locks have been taken.

Metapage Caching
----------------

Both scanning the index and inserting tuples require locating the bucket where a given tuple ought to be located.  To do this, we need the bucket count, highmask, and lowmask from the metapage; however, it's undesirable for performance reasons to have to lock and pin the metapage for every such operation.  Instead, we retain a cached copy of the metapage in each backend's relcache entry.
This will produce the correct bucket mapping as long as the target bucket hasn't been split since the last cache refresh.

To guard against the possibility that such a split has occurred, the primary page of each bucket chain stores the number of buckets that existed as of the time the bucket was last split, or if never split as of the time it was created, in the space normally used for the previous block number (that is, hasho_prevblkno).  This doesn't cost anything because the primary bucket page is always the first page in the chain, and the previous block number is therefore always, in reality, InvalidBlockNumber.

After computing the ostensibly-correct bucket number based on our cached copy of the metapage, we lock the corresponding primary bucket page and check whether the bucket count stored in hasho_prevblkno is greater than the number of buckets stored in our cached copy of the metapage.  If so, the bucket has certainly been split, because the count must originally have been less than the number of buckets that existed at that time and can't have increased except due to a split.  If not, the bucket can't have been split, because a split would have created a new bucket with a higher bucket number than any we'd seen previously.  In the latter case, we've locked the correct bucket and can proceed; in the former case, we must release the lock on this bucket, lock the metapage, update our cache, unlock the metapage, and retry.

Needing to retry occasionally might seem expensive, but the number of times any given bucket can be split is limited to a few dozen no matter how many times the hash index is accessed, because the total number of buckets is limited to less than 2^32.  On the other hand, the number of times we access a bucket is unbounded and will be several orders of magnitude larger even in unsympathetic cases.

(The metapage cache is new in v10.  Older hash indexes had the primary bucket page's hasho_prevblkno initialized to InvalidBuffer.)
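The check-and-retry protocol can be sketched as follows.  This is a simplified illustration rather than the backend's actual code: the HashMetaDataCache struct and the lower-case helper functions are hypothetical stand-ins, while LockBuffer(), UnlockReleaseBuffer() and the hasho_prevblkno field are real.

    /* Hypothetical snapshot of the metapage fields we need (see text above). */
    typedef struct HashMetaDataCache
    {
        uint32      num_buckets;        /* bucket count at the time of caching */
        uint32      highmask;           /* masks used to reduce a hash value */
        uint32      lowmask;            /* to a bucket number */
    } HashMetaDataCache;

    static Buffer
    lock_bucket_for_hashvalue(Relation rel, uint32 hashvalue)
    {
        for (;;)
        {
            HashMetaDataCache *cache = get_cached_metapage(rel);   /* relcache copy */
            Bucket      bucket = hashvalue_to_bucket(cache, hashvalue);
            Buffer      buf = read_primary_bucket_page(rel, bucket);
            HashPageOpaque pageopaque;

            /* content-lock the ostensibly-correct primary bucket page */
            LockBuffer(buf, BUFFER_LOCK_SHARE);
            pageopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));

            /*
             * hasho_prevblkno holds the number of buckets that existed when
             * this bucket was last split (or created).  If that exceeds the
             * count in our cached metapage, the cache is stale.
             */
            if (pageopaque->hasho_prevblkno <= cache->num_buckets)
                return buf;                         /* correct bucket; keep pin */

            UnlockReleaseBuffer(buf);
            refresh_cached_metapage(rel);           /* lock metapage, copy, unlock */
        }
    }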
Pseudocode Algorithms
---------------------

Various flags that are used in hash index operations are described below:

The bucket-being-split and bucket-being-populated flags indicate that a split operation is in progress for a bucket.  During the split operation, the bucket-being-split flag is set on the old bucket and the bucket-being-populated flag is set on the new bucket.  These flags are cleared once the split operation is finished.

The split-cleanup flag indicates that a bucket which has been recently split still contains tuples that were also copied to the new bucket; it essentially marks the split as incomplete.  Once we're certain that no scans which started before the new bucket was fully populated are still in progress, we can remove the copies from the old bucket and clear the flag.  We insist that this flag must be clear before splitting a bucket; thus, a bucket can't be split again until the previous split is totally complete.

The moved-by-split flag on a tuple indicates that the tuple was moved from the old to the new bucket.  Concurrent scans skip such tuples until the split operation is finished.  Once a tuple is marked as moved-by-split it remains so forever, but that does no harm; we intentionally never clear the flag, since doing so would generate additional I/O that is not necessary.

The operations we need to support are: readers scanning the index for entries of a particular hash code (which by definition are all in the same bucket); insertion of a new tuple into the correct bucket; enlarging the hash table by splitting an existing bucket; and garbage collection (deletion of dead tuples and compaction of buckets).  Bucket splitting is done at the conclusion of any insertion that leaves the hash table more full than the target load factor, but it is convenient to consider it as an independent operation.  Note that we do not have a bucket-merge operation --- the number of buckets never shrinks.  Insertion, splitting, and garbage collection may all need access to freelist management, which keeps track of available overflow pages.

The reader algorithm is:

    lock the primary bucket page of the target bucket
    if the target bucket is still being populated by a split:
        release the buffer content lock on current bucket page
        pin and acquire the buffer content lock on old bucket in shared mode
        release the buffer content lock on old bucket, but not pin
        retake the buffer content lock on new bucket
        arrange to scan the old bucket normally and the new bucket for
         tuples which are not moved-by-split
-- then, per read request:
    reacquire content lock on current page
    step to next page if necessary (no chaining of content locks, but keep
     the pin on the primary bucket throughout the scan)
    save all the matching tuples from current index page into an items array
    release pin and content lock (but if it is primary bucket page retain
     its pin till the end of the scan)
    get tuple from the items array
-- at scan shutdown:
    release all pins still held

Holding the buffer pin on the primary bucket page for the whole scan prevents the reader's current-tuple pointer from being invalidated by splits or compactions.  (Of course, other buckets can still be split or compacted.)

To minimize lock/unlock traffic, a hash index scan always searches the entire hash page to identify all the matching items at once, copying their heap tuple IDs into backend-local storage.  The heap tuple IDs are then processed while not holding any page lock within the index, thereby allowing concurrent insertions to happen on the same index page without the reader having to re-find its current scan position.  We do continue to hold a pin on the bucket page, to protect against concurrent deletions and bucket splits.

To allow for scans during a bucket split, if at the start of the scan the bucket is marked as bucket-being-populated, the scan reads all the tuples in that bucket except for those that are marked as moved-by-split.  Once it finishes scanning all the tuples in the current bucket, it scans the old bucket from which this bucket was formed by the split.
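To illustrate the per-page step above (and the sorted-by-hash-code page layout described earlier), here is a sketch of collecting the matching heap TIDs from a single index page.  The HashScanItem struct and the hash_code_of() accessor are illustrative stand-ins; the page-inspection macros are the standard ones.

    /* illustrative accessor: the hash code stored in an index entry */
    extern uint32 hash_code_of(IndexTuple itup);

    typedef struct HashScanItem
    {
        ItemPointerData heap_tid;       /* TID saved into backend-local array */
    } HashScanItem;

    static int
    collect_matches(Page page, uint32 hashvalue, HashScanItem *items)
    {
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
        OffsetNumber lo = FirstOffsetNumber;
        OffsetNumber hi = maxoff + 1;
        int         n = 0;

        /* binary search for the first entry whose hash code is >= hashvalue */
        while (lo < hi)
        {
            OffsetNumber mid = lo + (hi - lo) / 2;
            IndexTuple  itup = (IndexTuple) PageGetItem(page,
                                                        PageGetItemId(page, mid));

            if (hash_code_of(itup) < hashvalue)
                lo = mid + 1;
            else
                hi = mid;
        }

        /* walk forward over the run of equal hash codes, saving heap TIDs */
        for (; lo <= maxoff; lo++)
        {
            IndexTuple  itup = (IndexTuple) PageGetItem(page,
                                                        PageGetItemId(page, lo));

            if (hash_code_of(itup) != hashvalue)
                break;
            items[n++].heap_tid = itup->t_tid;
        }

        return n;
    }

A real scan would additionally skip tuples marked moved-by-split when the bucket is still being populated, as described above.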
The insertion algorithm is rather similar:

    lock the primary bucket page of the target bucket
    -- (so far same as reader, except for acquisition of buffer content lock
        in exclusive mode on primary bucket page)
    if the bucket-being-split flag is set for a bucket and pin count on it
     is one, then finish the split
        release the buffer content lock on current bucket
        get the "new" bucket which was being populated by the split
        scan the new bucket and form the hash table of TIDs
        conditionally get the cleanup lock on old and new buckets
        if we get the lock on both the buckets
            finish the split using algorithm mentioned below for split
        release the pin on old bucket and restart the insert from beginning
    if current page is full, first check if this page contains any dead tuples,
        if yes, remove dead tuples from the current page and again check
        for the availability of the space.  If enough space found, insert
        the tuple, else release lock but not pin, read/exclusive-lock
        next page; repeat as needed
    >> see below if no space in any page of bucket
    take buffer content lock in exclusive mode on metapage
    insert tuple at appropriate place in page
    mark current page dirty
    increment tuple count, decide if split needed
    mark meta page dirty
    write WAL for insertion of tuple
    release the buffer content lock on metapage
    release buffer content lock on current page
    if current page is not a bucket page, release the pin on bucket page
    if split is needed, enter Split algorithm below
    release the pin on metapage

To speed searches, the index entries within any individual index page are kept sorted by hash code; the insertion code must take care to insert new entries in the right place.  It is okay for an insertion to take place in a bucket that is being actively scanned, because readers can cope with this as explained above.  We only need the short-term buffer locks to ensure that readers do not see a partially-updated page.

To avoid deadlock between readers and inserters, whenever there is a need to lock multiple buckets, we always take the locks in the order suggested in Lock Definitions above.  This algorithm allows readers and inserters a very high degree of concurrency.  (The exclusive metapage lock taken to update the tuple count is stronger than necessary, since readers do not care about the tuple count, but the lock is held for such a short time that this is probably not an issue.)

When an inserter cannot find space in any existing page of a bucket, it must obtain an overflow page and add that page to the bucket's chain.  Details of that part of the algorithm appear later.

The page split algorithm is entered whenever an inserter observes that the index is overfull (has a higher-than-wanted ratio of tuples to buckets).  The algorithm attempts, but does not necessarily succeed, to split one existing bucket in two, thereby lowering the fill ratio:

    pin meta page and take buffer content lock in exclusive mode
    check split still needed
    if split not needed anymore, drop buffer content lock and pin and exit
    decide which bucket to split
    try to take a cleanup lock on that bucket; if fail, give up
    if that bucket is still being split or has split-cleanup work:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
    mark the old and new buckets indicating split is in progress
    mark both old and new buckets as dirty
    write WAL for allocation of new page for split
    copy the tuples that belong to new bucket from old bucket, marking
     them as moved-by-split
    write WAL record for moving tuples to new page once the new page is full
     or all the pages of old bucket are finished
    release lock but not pin for primary bucket page of old bucket,
     read/shared-lock next page; repeat as needed
    clear the bucket-being-split and bucket-being-populated flags
    mark the old bucket indicating split-cleanup
    write WAL for changing the flags on both old and new buckets

The split operation's attempt to acquire a cleanup lock on the old bucket could fail if another process holds any lock or pin on it.  We do not want to wait if that happens, because we don't want to wait while holding the metapage exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if it fails we just abandon the attempt to split.  This is all right since the index is overfull but perfectly functional.
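The conditional acquisition looks roughly like this (read_primary_bucket_page() is an illustrative stand-in; ConditionalLockBufferForCleanup() is the routine that tries for a cleanup lock without waiting):

    Buffer      bucket_buf = read_primary_bucket_page(rel, bucket_to_split);

    if (!ConditionalLockBufferForCleanup(bucket_buf))
    {
        /* someone else holds a pin or lock on the bucket; do not wait */
        ReleaseBuffer(bucket_buf);
        return;                         /* abandon this split attempt */
    }

    /* we hold a cleanup lock on the old primary bucket page: proceed to split */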
Every subsequent inserter will try to split, and eventually one will succeed.  If multiple inserters failed to split, the index might still be overfull, but eventually the index will not be overfull and split attempts will stop.  (We could make a successful splitter loop to see if the index is still overfull, but it seems better to distribute the split overhead across successive insertions.)

If a split fails partway through (e.g. due to insufficient disk space or an interrupt), the index will not be corrupted.  Instead, we'll retry the split every time a tuple is inserted into the old bucket prior to inserting the new tuple; eventually, we should succeed.  The fact that a split is left unfinished doesn't prevent subsequent buckets from being split, but we won't try to split the bucket again until the prior split is finished.  In other words, a bucket can be in the middle of being split for some time, but it can't be in the middle of two splits at the same time.

The fourth operation is garbage collection (bulk deletion):

    next bucket := 0
    pin metapage and take buffer content lock in exclusive mode
    fetch current max bucket number
    release meta page buffer content lock and pin
    while next bucket <= max bucket do
        acquire cleanup lock on primary bucket page
        loop:
            scan and remove tuples
            mark the target page dirty
            write WAL for deleting tuples from target page
            if this is the last bucket page, break out of loop
            pin and x-lock next page
            release prior lock and pin (except keep pin on primary bucket page)
        if the page we have locked is not the primary bucket page:
            release lock and take exclusive lock on primary bucket page
        if there are no other pins on the primary bucket page:
            squeeze the bucket to remove free space
        release the pin on primary bucket page
        next bucket ++
    end loop
    pin metapage and take buffer content lock in exclusive mode
    check if number of buckets changed
    if so, release content lock and pin and return to for-each-bucket loop
    else update metapage tuple count
    mark meta page dirty and write WAL for update of metapage
    release buffer content lock and pin

Note that this is designed to allow concurrent splits and scans.  If a split occurs, tuples relocated into the new bucket will be visited twice by the scan, but that does no harm.  See also "Interlocking Between Scans and VACUUM", below.

We must be careful about the statistics reported by the VACUUM operation.  What we can do is count the number of tuples scanned, and believe this in preference to the stored tuple count if the stored tuple count and number of buckets did *not* change at any time during the scan.  This provides a way of correcting the stored tuple count if it gets out of sync for some reason.  But if a split or insertion does occur concurrently, the scan count is untrustworthy; instead, subtract the number of tuples deleted from the stored tuple count and use that.
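In code form, the reporting decision is simply (variable names illustrative):

    /* decide what tuple count to report and store after the bulk-delete scan */
    if (!tuple_count_changed_during_scan && !bucket_count_changed_during_scan)
        new_stored_tuple_count = tuples_counted_by_scan;    /* resync the count */
    else
        new_stored_tuple_count = old_stored_tuple_count - tuples_deleted;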
Interlocking Between Scans and VACUUM
-------------------------------------

Since we release the lock on bucket page during a cleanup scan of a bucket, a concurrent scan could start in that bucket before we've finished vacuuming it.  If a scan gets ahead of cleanup, we could have the following problem: (1) the scan sees heap TIDs that are about to be removed before they are processed by VACUUM, (2) the scan decides that one or more of those TIDs are dead, (3) VACUUM completes, (4) one or more of the TIDs the scan decided were dead are reused for an unrelated tuple, and finally (5) the scan wakes up and erroneously kills the new tuple.

Note that this requires VACUUM and a scan to be active in the same bucket at the same time.  If VACUUM completes before the scan starts, the scan never has a chance to see the dead tuples; if the scan completes before the VACUUM starts, the heap TIDs can't have been reused meanwhile.  Furthermore, VACUUM can't start on a bucket that has an active scan, because the scan holds a pin on the primary bucket page, and VACUUM must take a cleanup lock on that page in order to begin cleanup.  Therefore, the only way this problem can occur is for a scan to start after VACUUM has released the cleanup lock on the bucket but before it has processed the entire bucket, and then overtake the cleanup operation.  Currently, we prevent this using lock chaining: cleanup locks the next page in the chain before releasing the lock and pin on the page just processed.

Free Space Management
---------------------

(Question: why is this so complicated?  Why not just have a linked list of free pages with the list head in the metapage?  It's not like we avoid needing to modify the metapage with all this.)

Free space management consists of two sub-algorithms, one for reserving an overflow page to add to a bucket chain, and one for returning an empty overflow page to the free pool.

Obtaining an overflow page:

    take metapage content lock in exclusive mode
    determine next bitmap page number; if none, exit loop
    release meta page content lock
    pin bitmap page and take content lock in exclusive mode
    search for a free page (zero bit in bitmap)
    if found:
        set bit in bitmap
        mark bitmap page dirty
        take metapage buffer content lock in exclusive mode
        if first-free-bit value did not change,
            update it and mark meta page dirty
    else (not found):
        release bitmap page buffer content lock
        loop back to try next bitmap page, if any
-- here when we have checked all bitmap pages; we hold meta excl. lock
    extend index to add another overflow page; update meta information
    mark meta page dirty
    return page number

It is slightly annoying to release and reacquire the metapage lock multiple times, but it seems best to do it that way to minimize loss of concurrency against processes just entering the index.  We don't want to hold the metapage exclusive lock while reading in a bitmap page.  (We can at least avoid repeated buffer pin/unpin here.)

The normal path for extending the index does not require doing I/O while holding the metapage lock.  We do have to do I/O when the extension requires adding a new bitmap page as well as the required overflow page ... but that is an infrequent case, so the loss of concurrency seems acceptable.

The portion of tuple insertion that calls the above subroutine looks like this:

    -- having determined that no space is free in the target bucket:
    remember last page of bucket, drop write lock on it
    re-write-lock last page of bucket
    if it is not last anymore, step to the last page
    execute free-page-acquire (obtaining an overflow page) mechanism
      described above
    update (former) last page to point to the new page and mark buffer dirty
    write-lock and initialize new page, with back link to former last page
    write WAL for addition of overflow page
    release the locks on meta page and bitmap page acquired in
      free-page-acquire algorithm
    release the lock on former last page
    release the lock on new overflow page
    insert tuple into new page -- etc.
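The bitmap bookkeeping behind the zero-bit search in the free-page-acquire algorithm above is plain word-and-bit arithmetic.  The sketch below is illustrative; the real macros have different names and derive the bits-per-page figure from the page size (see hash.h).

    #include <stdint.h>

    /* assumed number of usable bits per bitmap page; illustrative only */
    #define BITS_PER_MAP_PAGE   (8 * 4096)

    /* which bitmap page, and which bit within it, covers overflow bit "bitno"? */
    static void
    locate_free_space_bit(uint32_t bitno, uint32_t *map_page, uint32_t *bit_in_page)
    {
        *map_page = bitno / BITS_PER_MAP_PAGE;
        *bit_in_page = bitno % BITS_PER_MAP_PAGE;
    }

    /* a set bit means "not available" (the page is in use), per the convention above */
    static int
    overflow_page_in_use(const uint32_t *map, uint32_t bit_in_page)
    {
        return (map[bit_in_page / 32] >> (bit_in_page % 32)) & 1;
    }

    static void
    mark_overflow_page_in_use(uint32_t *map, uint32_t bit_in_page)
    {
        map[bit_in_page / 32] |= (uint32_t) 1 << (bit_in_page % 32);
    }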
Notice that the insertion procedure above handles the case where two concurrent inserters try to extend the same bucket.  They will end up with a valid, though perhaps space-inefficient, configuration: two overflow pages will be added to the bucket, each containing one tuple.

The last part of this violates the rule about holding write lock on two pages concurrently, but it should be okay to write-lock the previously free page; there can be no other process holding lock on it.

Bucket splitting uses a similar algorithm if it has to extend the new bucket, but it need not worry about concurrent extension since it has buffer content lock in exclusive mode on the new bucket.

Freeing an overflow page requires the process to hold buffer content lock in exclusive mode on the containing bucket, so it need not worry about other accessors of pages in the bucket.  The algorithm is:

    delink overflow page from bucket chain
    (this requires read/update/write/release of fore and aft siblings)
    pin meta page and take buffer content lock in shared mode
    determine which bitmap page contains the free space bit for page
    release meta page buffer content lock
    pin bitmap page and take buffer content lock in exclusive mode
    retake meta page buffer content lock in exclusive mode
    move (insert) tuples that belong to the overflow page being freed
    update bitmap bit
    mark bitmap page dirty
    if page number is still less than first-free-bit,
        update first-free-bit field and mark meta page dirty
    write WAL for delinking overflow page operation
    release buffer content lock and pin
    release meta page buffer content lock and pin

We have to do it this way because we must clear the bitmap bit before changing the first-free-bit field (hashm_firstfree).  It is possible that we set first-free-bit too small (because someone has already reused the page we just freed), but that is okay; the only cost is that the next overflow page acquirer will scan more bitmap bits than it needs to.  What must be avoided is having first-free-bit greater than the actual first free bit, because then that free page would never be found by searchers.

The reason for moving tuples out of the overflow page while delinking the latter is to make the whole thing a single atomic operation.  Not doing so could lead to spurious reads on standby; basically, the user might see the same tuple twice.

WAL Considerations
------------------

The hash index operations like create index, insert, delete, bucket split, allocate overflow page, and squeeze do not in themselves guarantee hash index consistency after a crash.  To provide robustness, we write WAL for each of these operations.

CREATE INDEX writes multiple WAL records.  First, we write a record to cover the initialization of the metapage, followed by one for each new bucket created, followed by one for the initial bitmap page.  It's not important for index creation to appear atomic, because the index isn't yet visible to any other transaction, and the creating transaction will roll back in the event of a crash.  It would be difficult to cover the whole operation with a single write-ahead log record anyway, because we can log only a fixed number of pages, as given by XLR_MAX_BLOCK_ID (32), with the current XLog machinery.

Ordinary item insertions (that don't force a page split or need a new overflow page) are single WAL entries.  They touch a single bucket page and the metapage.  The metapage is updated during replay, just as it is updated during the original operation.  If an insertion causes the addition of an overflow page, there will be one WAL entry for the new overflow page and a second entry for the insert itself.
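Schematically, building that single insertion record with the generic WAL machinery looks like the sketch below.  This is condensed and simplified (no critical section, no error handling), and it assumes buf, metabuf, itup and itup_off from the surrounding insertion code; see hashinsert.c for the real thing.

    /* sketch: log an ordinary hash-index insertion as one WAL record */
    xl_hash_insert xlrec;
    XLogRecPtr  recptr;

    xlrec.offnum = itup_off;            /* offset at which the tuple was placed */

    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, sizeof(xl_hash_insert));

    /* block 0: the bucket (or overflow) page that received the tuple */
    XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
    XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));

    /* block 1: the metapage, whose tuple count was bumped */
    XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

    recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

    /* both modified pages carry the new record's LSN */
    PageSetLSN(BufferGetPage(buf), recptr);
    PageSetLSN(BufferGetPage(metabuf), recptr);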
If an insertion causes a bucket split, there will be one WAL entry for the insert itself, followed by a WAL entry for allocating a new bucket, followed by a WAL entry for each overflow bucket page in the new bucket to which tuples are moved from the old bucket, followed by a WAL entry to indicate that the split is complete for both the old and new buckets.  A split operation which requires overflow pages to complete the operation will need to write a WAL record for each new allocation of an overflow page.

As splitting involves multiple atomic actions, it's possible that the system crashes partway through moving tuples from the old bucket's pages to the new bucket.  In such a case, after recovery, the old and new buckets will be marked with the bucket-being-split and bucket-being-populated flags respectively, which indicates that a split is in progress for those buckets.  The reader algorithm works correctly, as it will scan both the old and new buckets while the split is in progress, as explained in the reader algorithm section above.

We finish the split at the next insert or split operation on the old bucket, as explained in the insert and split algorithms above.  It could be done during searches, too, but it seems best not to put any extra updates in what would otherwise be a read-only operation (updating is not possible in hot standby mode anyway).  It would seem natural to complete the split in VACUUM, but since splitting a bucket might require allocating a new page, it might fail if you run out of disk space.  That would be bad during VACUUM --- the reason for running VACUUM in the first place might be that you ran out of disk space, and now VACUUM won't finish because you're out of disk space.  In contrast, an insertion can require enlarging the physical file anyway.

Deletion of tuples from a bucket is performed for two reasons: to remove dead tuples, and to remove tuples that were moved by a bucket split.  A WAL entry is made for each bucket page from which tuples are removed, and then another WAL entry is made when we clear the needs-split-cleanup flag.  If dead tuples are removed, a separate WAL entry is made to update the metapage.

As deletion involves multiple atomic operations, it is quite possible that the system crashes (a) after removing tuples from some of the bucket pages, (b) before clearing the garbage flag, or (c) before updating the metapage.  If the system crashes before completing (b), it will again try to clean the bucket during the next vacuum or insert after recovery, which can have some performance impact, but it will work fine.  If the system crashes before completing (c), after recovery there could be some additional splits until the next vacuum updates the metapage, but the other operations like insert, delete and scan will work correctly.  We could fix this problem by actually updating the metapage based on the delete operation during replay, but it's not clear whether it's worth the complication.

A squeeze operation moves tuples from one of the pages later in the bucket chain to one of the pages earlier in the chain, and writes a WAL record when either the page to which it is writing tuples becomes full or the page from which it is removing tuples becomes empty.  As a squeeze involves multiple atomic operations, it is quite possible that the system crashes before completing the operation on the entire bucket.  After recovery, the operations will work correctly, but the index will remain bloated, and this can impact performance of read and insert operations until the next vacuum squeezes the bucket completely.
Other Notes
-----------

Cleanup locks prevent a split from occurring while *another* process is stopped in a given bucket.  They also ensure that one of our *own* backend's scans is not stopped in the bucket.