diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
new file mode 100644
index 0000000..2227ebf
--- /dev/null
+++ b/src/backend/access/hash/README
@@ -0,0 +1,651 @@
+src/backend/access/hash/README
+
+Hash Indexing
+=============
+
+This directory contains an implementation of hash indexing for Postgres.
+Most of the core ideas are taken from Margo Seltzer and Ozan Yigit,
+A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference,
+January 1991. (Our in-memory hashtable implementation,
+src/backend/utils/hash/dynahash.c, also relies on some of the same concepts;
+it is derived from code written by Esmond Pitt and later improved by Margo
+among others.)
+
+A hash index consists of two or more "buckets", into which tuples are
+placed whenever their hash key maps to the bucket number. The
+key-to-bucket-number mapping is chosen so that the index can be
+incrementally expanded. When a new bucket is to be added to the index,
+exactly one existing bucket will need to be "split", with some of its
+tuples being transferred to the new bucket according to the updated
+key-to-bucket-number mapping. This is essentially the same hash table
+management technique embodied in src/backend/utils/hash/dynahash.c for
+in-memory hash tables.
+
+Each bucket in the hash index comprises one or more index pages. The
+bucket's first page is permanently assigned to it when the bucket is
+created. Additional pages, called "overflow pages", are added if the
+bucket receives too many tuples to fit in the primary bucket page.
+The pages of a bucket are chained together in a doubly-linked list
+using fields in the index page special space.
+
+There is currently no provision to shrink a hash index, other than by
+rebuilding it with REINDEX. Overflow pages can be recycled for reuse
+in other buckets, but we never give them back to the operating system.
+There is no provision for reducing the number of buckets, either.
+
+As of PostgreSQL 8.4, hash index entries store only the hash code, not the
+actual data value, for each indexed item. This makes the index entries
+smaller (perhaps very substantially so) and speeds up various operations.
+In particular, we can speed searches by keeping the index entries in any
+one index page sorted by hash code, thus allowing binary search to be used
+within an index page. Note however that there is *no* assumption about the
+relative ordering of hash codes across different index pages of a bucket.
+
+
+Page Addressing
+---------------
+
+There are four kinds of pages in a hash index: the meta page (page zero),
+which contains statically allocated control information; primary bucket
+pages; overflow pages; and bitmap pages, which keep track of overflow
+pages that have been freed and are available for re-use. For addressing
+purposes, bitmap pages are regarded as a subset of the overflow pages.
+
+Primary bucket pages and overflow pages are allocated independently (since
+any given index might need more or fewer overflow pages relative to its
+number of buckets). The hash index code uses an interesting set of addressing
+rules to support a variable number of overflow pages while not having to
+move primary bucket pages around after they are created.
+
+Primary bucket pages (henceforth just "bucket pages") are allocated in
+power-of-2 groups, called "split points" in the code. That means at every new
+splitpoint we double the existing number of buckets. Allocating a huge chunk
+of bucket pages all at once isn't optimal, since it could take a very long
+time to consume them. To avoid such a large jump in index size, we break up
+the allocation of buckets at each splitpoint into 4 equal phases. If (2 ^ x)
+buckets in total need to be allocated at a splitpoint (from now on we shall
+call this a splitpoint group), then we allocate a quarter of them,
+(2 ^ (x - 2)) buckets, in each phase of the splitpoint group. The next quarter
+is allocated only once the buckets of the previous phase have been consumed.
+For the initial splitpoint groups (< 10) we allocate all of a group's buckets
+in a single phase, since the number of buckets in those groups is small. For
+groups >= 10 the allocation is distributed among four equal phases. At group
+10 we allocate (2 ^ 9) buckets in 4 phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, where
+the numbers in curly braces indicate the number of buckets allocated within
+each phase of splitpoint group 10. Similarly, for splitpoint groups 11 and 12
+the allocation phases are {2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and
+{2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively. Thus at each splitpoint group we
+still double the total number of buckets relative to the previous group, but
+we do so incrementally, one phase at a time. The bucket pages
+allocated within one phase of a splitpoint group will appear consecutively in
+the index. This addressing scheme allows the physical location of a bucket
+page to be computed from the bucket number relatively easily, using only a
+small amount of control information. Given a bucket number, the function
+_hash_spareindex first computes the splitpoint group to which the bucket
+belongs, and then the phase within that group to which it belongs. Adding
+these gives the global splitpoint phase number S for the bucket; the bucket's
+physical block number is then simply the bucket number plus
+"hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
+metapage). hashm_spares[S] can be interpreted as the total number of overflow
+pages that have been allocated before the bucket pages of splitpoint phase S.
+hashm_spares[0] is always 0, so that buckets 0 and 1
+always appear at block numbers 1 and 2, just after the meta page. We always
+have hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
+former. The difference between the two represents the number of overflow pages
+appearing between the bucket page groups of splitpoint phases N and N+1.
+(Note: the above describes what happens when filling an initially minimally
+sized hash index. In practice, we try to estimate the required index size and
+allocate a suitable number of splitpoint phases immediately, to avoid
+expensive re-splitting during initial index build.)
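+
+To make the arithmetic concrete, here is a small illustrative sketch (not the
+actual PostgreSQL code; the authoritative logic lives in _hash_spareindex and
+the related block-numbering code) of how many bucket pages one phase of a
+given splitpoint group allocates, following only the rules described above:
+
+    #include <stdint.h>
+
+    /* Illustrative only: bucket pages allocated by one phase of splitpoint
+     * group "group" (group >= 1), per the description above. */
+    static uint32_t
+    buckets_per_phase(uint32_t group)
+    {
+        /* each new group doubles the bucket count, adding 2^(group - 1) */
+        uint32_t new_buckets = ((uint32_t) 1) << (group - 1);
+
+        if (group < 10)             /* single-phase groups */
+            return new_buckets;
+        return new_buckets / 4;     /* four equal phases of 2^(group - 3) */
+    }
+
+For example, buckets_per_phase(10) returns 2 ^ 7 = 128, matching the
+{2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7} example above.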
+
+When S splitpoints exist altogether, the array entries hashm_spares[0]
+through hashm_spares[S] are valid; hashm_spares[S] records the current
+total number of overflow pages. New overflow pages are created as needed
+at the end of the index, and recorded by incrementing hashm_spares[S].
+When it is time to create a new splitpoint phase's worth of bucket pages, we
+copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
+stored in the hashm_ovflpoint field of the meta page). This has the
+effect of reserving the correct number of bucket pages at the end of the
+index, and preparing to allocate additional overflow pages after those
+bucket pages. hashm_spares[] entries before S cannot change anymore,
+since that would require moving already-created bucket pages.
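+
+In outline, the bookkeeping at a new splitpoint phase amounts to the following
+sketch (based only on the description above and using the metapage field names
+already mentioned; the real code does more, such as actually allocating and
+WAL-logging pages):
+
+    #include <stdint.h>
+
+    /* Sketch: reserve bucket pages for the next splitpoint phase. */
+    static void
+    start_new_splitpoint_phase(uint32_t hashm_spares[],
+                               uint32_t *hashm_ovflpoint)
+    {
+        uint32_t S = *hashm_ovflpoint;
+
+        /* carry the overflow-page count forward; entries before the new
+         * splitpoint can no longer change */
+        hashm_spares[S + 1] = hashm_spares[S];
+        *hashm_ovflpoint = S + 1;
+    }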
+
+The last page nominally used by the index is always determinable from
+hashm_spares[S]. To avoid complaints from smgr, the logical EOF as seen by
+the filesystem and smgr must always be greater than or equal to this page.
+We have to allow the case "greater than" because it's possible that during
+an index extension we crash after allocating filesystem space and before
+updating the metapage. Note that on filesystems that allow "holes" in
+files, it's entirely likely that pages before the logical EOF are not yet
+allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
+physically zero the last such page to force the EOF up, and the first such
+page will be used immediately, but the intervening pages are not written
+until needed.
+
+Since overflow pages may be recycled if enough tuples are deleted from
+their bucket, we need a way to keep track of currently-free overflow
+pages. The state of each overflow page (0 = available, 1 = not available)
+is recorded in "bitmap" pages dedicated to this purpose. The entries in
+the bitmap are indexed by "bit number", a zero-based count in which every
+overflow page has a unique entry. We can convert between an overflow
+page's physical block number and its bit number using the information in
+hashm_spares[] (see hashovfl.c for details). The bit number sequence
+includes the bitmap pages, which is the reason for saying that bitmap
+pages are a subset of the overflow pages. It turns out in fact that each
+bitmap page's first bit represents itself --- this is not an essential
+property, but falls out of the fact that we only allocate another bitmap
+page when we really need one. Bit number zero always corresponds to the
+first bitmap page, which is allocated during index creation just after all
+the initially created buckets.
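+
+Locating the bit for a given overflow page within the bitmaps is simple
+arithmetic, roughly as in this sketch (the names and the bits-per-page
+parameter are illustrative; the real constants and conversion helpers are in
+hash.h and hashovfl.c):
+
+    #include <stdint.h>
+
+    /* Sketch: which bitmap page, and which bit on it, records the state of
+     * the overflow page with the given bit number. */
+    static void
+    locate_overflow_bit(uint32_t bit_number, uint32_t bits_per_bitmap_page,
+                        uint32_t *bitmap_page_index, uint32_t *bit_within_page)
+    {
+        *bitmap_page_index = bit_number / bits_per_bitmap_page;
+        *bit_within_page = bit_number % bits_per_bitmap_page;
+    }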
+
+
+Lock Definitions
+----------------
+
+Concurrency control for hash indexes is provided using buffer content
+locks, buffer pins, and cleanup locks. Here as elsewhere in PostgreSQL,
+cleanup lock means that we hold an exclusive lock on the buffer and have
+observed at some point after acquiring the lock that we hold the only pin
+on that buffer. For hash indexes, a cleanup lock on a primary bucket page
+represents the right to perform an arbitrary reorganization of the entire
+bucket. Therefore, scans retain a pin on the primary bucket page for the
+bucket they are currently scanning. Splitting a bucket requires a cleanup
+lock on both the old and new primary bucket pages. VACUUM therefore takes a
+cleanup lock on the primary page of each bucket to remove tuples. It can also
+remove tuples copied to a new bucket by any previous split operation, because
+the cleanup lock taken on the primary bucket page guarantees that no scans
+which started prior to the most recent split can still be in progress. After
+cleaning each page individually, it attempts to take a cleanup lock on the
+primary bucket page in order to "squeeze" the bucket down to the minimum
+possible number of pages.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We choose to always lock the lower-numbered bucket first. The metapage is
+only ever locked after all bucket locks have been taken.
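+
+Schematically, the ordering rule is just (an illustrative sketch, not actual
+code):
+
+    #include <stdint.h>
+
+    /* Sketch: consistent lock ordering for operations that must lock two
+     * buckets; the metapage, if needed, is locked only after both. */
+    static void
+    order_buckets_for_locking(uint32_t a, uint32_t b,
+                              uint32_t *lock_first, uint32_t *lock_second)
+    {
+        /* always lock the lower-numbered bucket first */
+        *lock_first = (a < b) ? a : b;
+        *lock_second = (a < b) ? b : a;
+    }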
+
+
+Metapage Caching
+----------------
+
+Both scanning the index and inserting tuples require locating the bucket
+where a given tuple ought to be located. To do this, we need the bucket
+count, highmask, and lowmask from the metapage; however, it's undesirable
+for performance reasons to have to lock and pin the metapage for
+every such operation. Instead, we retain a cached copy of the metapage
+in each backend's relcache entry. This will produce the correct
+bucket mapping as long as the target bucket hasn't been split since the
+last cache refresh.
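+
+The bucket mapping itself is the usual linear-hashing calculation, roughly as
+in the sketch below (the real function is _hash_hashkey2bucket in hashutil.c;
+treat the details here as an approximation):
+
+    #include <stdint.h>
+
+    /* Sketch: map a hash code to a bucket number using the fields cached
+     * from the metapage (bucket count / maxbucket, highmask, lowmask). */
+    static uint32_t
+    hashkey_to_bucket(uint32_t hashkey, uint32_t maxbucket,
+                      uint32_t highmask, uint32_t lowmask)
+    {
+        uint32_t bucket = hashkey & highmask;
+
+        /* buckets beyond maxbucket don't exist yet; their keys still live
+         * in the lower-numbered buckets they would later be split from */
+        if (bucket > maxbucket)
+            bucket = bucket & lowmask;
+
+        return bucket;
+    }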
+
+To guard against the possibility that such a split has occurred, the
+primary page of each bucket chain stores the number of buckets that
+existed as of the time the bucket was last split, or if never split as
+of the time it was created, in the space normally used for the
+previous block number (that is, hasho_prevblkno). This doesn't cost
+anything because the primary bucket page is always the first page in
+the chain, and the previous block number is therefore always, in
+reality, InvalidBlockNumber.
+
+After computing the ostensibly-correct bucket number based on our cached
+copy of the metapage, we lock the corresponding primary bucket page and
+check whether the bucket count stored in hasho_prevblkno is greater than
+the number of buckets stored in our cached copy of the metapage. If
+so, the bucket has certainly been split, because the count must originally
+have been less than the number of buckets that existed at that time and
+can't have increased except due to a split. If not, the bucket can't have
+been split, because a split would have created a new bucket with a higher
+bucket number than any we'd seen previously. In the latter case, we've
+locked the correct bucket and can proceed; in the former case, we must
+release the lock on this bucket, lock the metapage, update our cache,
+unlock the metapage, and retry.
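+
+The staleness test itself is just a comparison, as in this sketch (the
+surrounding locking, cache refresh, and retry are elided, and the names are
+illustrative apart from hasho_prevblkno):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* Sketch: is our cached metapage too old to trust for this bucket?
+     * stored_buckets is the bucket count saved in the primary bucket page's
+     * hasho_prevblkno field; cached_buckets comes from the relcache copy. */
+    static bool
+    cached_metapage_is_stale(uint32_t stored_buckets, uint32_t cached_buckets)
+    {
+        /* true means: refresh the cache, recompute the bucket, and retry */
+        return stored_buckets > cached_buckets;
+    }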
+
+Needing to retry occasionally might seem expensive, but the number of times
+any given bucket can be split is limited to a few dozen no matter how
+many times the hash index is accessed, because the total number of
+buckets is limited to less than 2^32. On the other hand, the number of
+times we access a bucket is unbounded and will be several orders of
+magnitude larger even in unsympathetic cases.
+
+(The metapage cache is new in v10. Older hash indexes had the primary
+bucket page's hasho_prevblkno initialized to InvalidBlockNumber.)
+
+Pseudocode Algorithms
+---------------------
+
+The various flags used in hash index operations are described below:
+
+The bucket-being-split and bucket-being-populated flags indicate that a split
+operation is in progress for a bucket. During a split, the bucket-being-split
+flag is set on the old bucket and the bucket-being-populated flag is set on
+the new bucket. These flags are cleared once the split operation
+is finished.
+
+The split-cleanup flag indicates that a bucket which has been recently split
+still contains tuples that were also copied to the new bucket; it essentially
+marks the split as incomplete. Once we're certain that no scans which
+started before the new bucket was fully populated are still in progress, we
+can remove the copies from the old bucket and clear the flag. We insist that
+this flag must be clear before splitting a bucket; thus, a bucket can't be
+split again until the previous split is totally complete.
+
+The moved-by-split flag on a tuple indicates that the tuple was moved from the
+old to the new bucket. Concurrent scans will skip such tuples until the split
+operation is finished. Once a tuple is marked as moved-by-split, it remains so
+forever, but that does no harm. We intentionally do not clear the flag, since
+doing so would generate additional I/O for no real benefit.
+
+The operations we need to support are: readers scanning the index for
+entries of a particular hash code (which by definition are all in the same
+bucket); insertion of a new tuple into the correct bucket; enlarging the
+hash table by splitting an existing bucket; and garbage collection
+(deletion of dead tuples and compaction of buckets). Bucket splitting is
+done at conclusion of any insertion that leaves the hash table more full
+than the target load factor, but it is convenient to consider it as an
+independent operation. Note that we do not have a bucket-merge operation
+--- the number of buckets never shrinks. Insertion, splitting, and
+garbage collection may all need access to freelist management, which keeps
+track of available overflow pages.
+
+The reader algorithm is:
+
+ lock the primary bucket page of the target bucket
+ if the target bucket is still being populated by a split:
+        release the buffer content lock on current bucket page
+        pin and acquire the buffer content lock on old bucket in shared mode
+        release the buffer content lock on old bucket, but not pin
+        retake the buffer content lock on new bucket
+        arrange to scan the old bucket normally and the new bucket for
+         tuples which are not moved-by-split
+-- then, per read request:
+ reacquire content lock on current page
+ step to next page if necessary (no chaining of content locks, but keep
+ the pin on the primary bucket throughout the scan)
+ save all the matching tuples from current index page into an items array
+ release pin and content lock (but if it is primary bucket page retain
+ its pin till the end of the scan)
+ get tuple from an item array
+-- at scan shutdown:
+ release all pins still held
+
+Holding the buffer pin on the primary bucket page for the whole scan prevents
+the reader's current-tuple pointer from being invalidated by splits or
+compactions. (Of course, other buckets can still be split or compacted.)
+
+To minimize lock/unlock traffic, a hash index scan always searches the entire
+hash page to identify all the matching items at once, copying their heap tuple
+IDs into backend-local storage. The heap tuple IDs are then processed while
+not holding any page lock within the index, thereby allowing concurrent
+insertions to happen on the same index page without the reader having to
+re-find its current scan position. We do continue to hold a pin on the
+bucket page, to protect against concurrent deletions and bucket splits.
+
+To allow scans during a bucket split, if at the start of the scan the bucket
+is marked as bucket-being-populated, the scan visits all the tuples in that
+bucket except for those that are marked as moved-by-split. Once it finishes
+scanning all the tuples in the current bucket, it scans the old bucket from
+which this bucket was formed by the split.
+
+The insertion algorithm is rather similar:
+
+ lock the primary bucket page of the target bucket
+-- (so far same as reader, except for acquisition of buffer content lock in
+ exclusive mode on primary bucket page)
+ if the bucket-being-split flag is set for a bucket and pin count on it is
+ one, then finish the split
+        release the buffer content lock on current bucket
+        get the "new" bucket which was being populated by the split
+        scan the new bucket and form the hash table of TIDs
+        conditionally get the cleanup lock on old and new buckets
+        if we get the lock on both the buckets
+            finish the split using algorithm mentioned below for split
+        release the pin on old bucket and restart the insert from beginning.
+ if current page is full, first check if this page contains any dead tuples.
+ if yes, remove dead tuples from the current page and again check for the
+ availability of the space. If enough space found, insert the tuple else
+ release lock but not pin, read/exclusive-lock
+ next page; repeat as needed
+ >> see below if no space in any page of bucket
+ take buffer content lock in exclusive mode on metapage
+ insert tuple at appropriate place in page
+ mark current page dirty
+ increment tuple count, decide if split needed
+ mark meta page dirty
+ write WAL for insertion of tuple
+ release the buffer content lock on metapage
+ release buffer content lock on current page
+ if current page is not a bucket page, release the pin on bucket page
+ if split is needed, enter Split algorithm below
+ release the pin on metapage
+
+To speed searches, the index entries within any individual index page are
+kept sorted by hash code; the insertion code must take care to insert new
+entries in the right place. It is okay for an insertion to take place in a
+bucket that is being actively scanned, because readers can cope with this
+as explained above. We only need the short-term buffer locks to ensure
+that readers do not see a partially-updated page.
+
+To avoid deadlock between readers and inserters, whenever there is a need to
+lock multiple buckets, we always take the locks in the order suggested in Lock
+Definitions above. This allows a very high degree of concurrency.
+(The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
+
+When an inserter cannot find space in any existing page of a bucket, it
+must obtain an overflow page and add that page to the bucket's chain.
+Details of that part of the algorithm appear later.
+
+The page split algorithm is entered whenever an inserter observes that the
+index is overfull (has a higher-than-wanted ratio of tuples to buckets).
+The algorithm attempts, but does not necessarily succeed, to split one
+existing bucket in two, thereby lowering the fill ratio:
+
+ pin meta page and take buffer content lock in exclusive mode
+ check split still needed
+ if split not needed anymore, drop buffer content lock and pin and exit
+ decide which bucket to split
+ try to take a cleanup lock on that bucket; if fail, give up
+ if that bucket is still being split or has split-cleanup work:
+        try to finish the split and the cleanup work
+        if that succeeds, start over; if it fails, give up
+ mark the old and new buckets indicating split is in progress
+ mark both old and new buckets as dirty
+ write WAL for allocation of new page for split
+    copy the tuples that belong to the new bucket from the old bucket, marking
+ them as moved-by-split
+ write WAL record for moving tuples to new page once the new page is full
+ or all the pages of old bucket are finished
+ release lock but not pin for primary bucket page of old bucket,
+ read/shared-lock next page; repeat as needed
+ clear the bucket-being-split and bucket-being-populated flags
+ mark the old bucket indicating split-cleanup
+ write WAL for changing the flags on both old and new buckets
+
+The split operation's attempt to acquire a cleanup lock on the old bucket
+could fail if another process holds any lock or pin on it. We do not want to
+wait if that happens, because we don't want to wait while holding the metapage
+exclusive-lock. So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split. This is all right since the
+index is overfull but perfectly functional. Every subsequent inserter will
+try to split, and eventually one will succeed. If multiple inserters fail to
+split, the index might remain overfull for a while, but eventually enough
+splits will succeed that it is no longer overfull, and split attempts will
+stop. (We could make a successful
+splitter loop to see if the index is still overfull, but it seems better to
+distribute the split overhead across successive insertions.)
+
+If a split fails partway through (e.g. due to insufficient disk space or an
+interrupt), the index will not be corrupted. Instead, we'll retry the split
+every time a tuple is inserted into the old bucket prior to inserting the new
+tuple; eventually, we should succeed. The fact that a split is left
+unfinished doesn't prevent subsequent buckets from being split, but we won't
+try to split the bucket again until the prior split is finished. In other
+words, a bucket can be in the middle of being split for some time, but it can't
+be in the middle of two splits at the same time.
+
+The fourth operation is garbage collection (bulk deletion):
+
+ next bucket := 0
+ pin metapage and take buffer content lock in exclusive mode
+ fetch current max bucket number
+ release meta page buffer content lock and pin
+ while next bucket <= max bucket do
+        acquire cleanup lock on primary bucket page
+        loop:
+            scan and remove tuples
+            mark the target page dirty
+            write WAL for deleting tuples from target page
+            if this is the last bucket page, break out of loop
+            pin and x-lock next page
+            release prior lock and pin (except keep pin on primary bucket page)
+        if the page we have locked is not the primary bucket page:
+            release lock and take exclusive lock on primary bucket page
+        if there are no other pins on the primary bucket page:
+            squeeze the bucket to remove free space
+        release the pin on primary bucket page
+        next bucket ++
+ end loop
+ pin metapage and take buffer content lock in exclusive mode
+ check if number of buckets changed
+ if so, release content lock and pin and return to for-each-bucket loop
+ else update metapage tuple count
+ mark meta page dirty and write WAL for update of metapage
+ release buffer content lock and pin
+
+Note that this is designed to allow concurrent splits and scans. If a split
+occurs, tuples relocated into the new bucket will be visited twice by the
+scan, but that does no harm. See also "Interlocking Between Scans and
+VACUUM", below.
+
+We must be careful about the statistics reported by the VACUUM operation.
+What we can do is count the number of tuples scanned, and believe this in
+preference to the stored tuple count if the stored tuple count and number of
+buckets did *not* change at any time during the scan. This provides a way of
+correcting the stored tuple count if it gets out of sync for some reason. But
+if a split or insertion does occur concurrently, the scan count is
+untrustworthy; instead, subtract the number of tuples deleted from the stored
+tuple count and use that.
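+
+In terms of the counts involved, the rule reduces to this sketch (the variable
+names are illustrative):
+
+    #include <stdbool.h>
+
+    /* Sketch: which tuple count should VACUUM report for the index? */
+    static double
+    reported_tuple_count(bool counts_changed_during_scan,
+                         double tuples_counted_by_scan,
+                         double stored_tuple_count,
+                         double tuples_deleted)
+    {
+        if (!counts_changed_during_scan)
+            return tuples_counted_by_scan;      /* trust our own count */
+
+        /* otherwise fall back to the stored count minus what we deleted */
+        return stored_tuple_count - tuples_deleted;
+    }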
+
+Interlocking Between Scans and VACUUM
+-------------------------------------
+
+Since we release the bucket page lock during a cleanup scan of a bucket, a
+concurrent scan could start in that bucket before we've finished vacuuming it.
+If a scan gets ahead of cleanup, we could have the following problem: (1) the
+scan sees heap TIDs that are about to be removed before they are processed by
+VACUUM, (2) the scan decides that one or more of those TIDs are dead, (3)
+VACUUM completes, (4) one or more of the TIDs the scan decided were dead are
+reused for an unrelated tuple, and finally (5) the scan wakes up and
+erroneously kills the new tuple.
+
+Note that this requires VACUUM and a scan to be active in the same bucket at
+the same time. If VACUUM completes before the scan starts, the scan never has
+a chance to see the dead tuples; if the scan completes before the VACUUM
+starts, the heap TIDs can't have been reused meanwhile. Furthermore, VACUUM
+can't start on a bucket that has an active scan, because the scan holds a pin
+on the primary bucket page, and VACUUM must take a cleanup lock on that page
+in order to begin cleanup. Therefore, the only way this problem can occur is
+for a scan to start after VACUUM has released the cleanup lock on the bucket
+but before it has processed the entire bucket and then overtake the cleanup
+operation.
+
+Currently, we prevent this using lock chaining: cleanup locks the next page
+in the chain before releasing the lock and pin on the page just processed.
+
+Free Space Management
+---------------------
+
+(Question: why is this so complicated? Why not just have a linked list
+of free pages with the list head in the metapage? It's not like we
+avoid needing to modify the metapage with all this.)
+
+Free space management consists of two sub-algorithms, one for reserving
+an overflow page to add to a bucket chain, and one for returning an empty
+overflow page to the free pool.
+
+Obtaining an overflow page:
+
+ take metapage content lock in exclusive mode
+ determine next bitmap page number; if none, exit loop
+ release meta page content lock
+ pin bitmap page and take content lock in exclusive mode
+ search for a free page (zero bit in bitmap)
+ if found:
+        set bit in bitmap
+        mark bitmap page dirty
+        take metapage buffer content lock in exclusive mode
+        if first-free-bit value did not change,
+            update it and mark meta page dirty
+    else (not found):
+        release bitmap page buffer content lock
+        loop back to try next bitmap page, if any
+-- here when we have checked all bitmap pages; we hold meta excl. lock
+ extend index to add another overflow page; update meta information
+ mark meta page dirty
+ return page number
+
+It is slightly annoying to release and reacquire the metapage lock
+multiple times, but it seems best to do it that way to minimize loss of
+concurrency against processes just entering the index. We don't want
+to hold the metapage exclusive lock while reading in a bitmap page.
+(We can at least avoid repeated buffer pin/unpin here.)
+
+The normal path for extending the index does not require doing I/O while
+holding the metapage lock. We do have to do I/O when the extension
+requires adding a new bitmap page as well as the required overflow page
+... but that is an infrequent case, so the loss of concurrency seems
+acceptable.
+
+The portion of tuple insertion that calls the above subroutine looks
+like this:
+
+ -- having determined that no space is free in the target bucket:
+ remember last page of bucket, drop write lock on it
+ re-write-lock last page of bucket
+ if it is not last anymore, step to the last page
+ execute free-page-acquire (obtaining an overflow page) mechanism
+ described above
+ update (former) last page to point to the new page and mark buffer dirty
+ write-lock and initialize new page, with back link to former last page
+ write WAL for addition of overflow page
+ release the locks on meta page and bitmap page acquired in
+ free-page-acquire algorithm
+ release the lock on former last page
+ release the lock on new overflow page
+ insert tuple into new page
+ -- etc.
+
+Notice this handles the case where two concurrent inserters try to extend
+the same bucket. They will end up with a valid, though perhaps
+space-inefficient, configuration: two overflow pages will be added to the
+bucket, each containing one tuple.
+
+The last part of this violates the rule about holding write lock on two
+pages concurrently, but it should be okay to write-lock the previously
+free page; there can be no other process holding lock on it.
+
+Bucket splitting uses a similar algorithm if it has to extend the new
+bucket, but it need not worry about concurrent extension since it has
+buffer content lock in exclusive mode on the new bucket.
+
+Freeing an overflow page requires the process to hold buffer content lock in
+exclusive mode on the containing bucket, so we need not worry about other
+accessors of pages in the bucket. The algorithm is:
+
+ delink overflow page from bucket chain
+ (this requires read/update/write/release of fore and aft siblings)
+ pin meta page and take buffer content lock in shared mode
+ determine which bitmap page contains the free space bit for page
+ release meta page buffer content lock
+ pin bitmap page and take buffer content lock in exclusive mode
+ retake meta page buffer content lock in exclusive mode
+ move (insert) tuples that belong to the overflow page being freed
+ update bitmap bit
+ mark bitmap page dirty
+ if page number is still less than first-free-bit,
+        update first-free-bit field and mark meta page dirty
+ write WAL for delinking overflow page operation
+ release buffer content lock and pin
+ release meta page buffer content lock and pin
+
+We have to do it this way because we must clear the bitmap bit before
+changing the first-free-bit field (hashm_firstfree). It is possible that
+we set first-free-bit too small (because someone has already reused the
+page we just freed), but that is okay; the only cost is the next overflow
+page acquirer will scan more bitmap bits than he needs to. What must be
+avoided is having first-free-bit greater than the actual first free bit,
+because then that free page would never be found by searchers.
+
+The reason for moving tuples out of the overflow page at the same time as
+delinking it is to make the whole change a single atomic operation. Not doing
+so could lead to spurious reads on a standby; basically, the user might see
+the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, and squeeze in themselves don't guarantee hash index
+consistency after a crash. To provide robustness, we write WAL for each of
+these operations.
+
+CREATE INDEX writes multiple WAL records. First, we write a record to cover
+the initialization of the metapage, followed by one for each new bucket
+created, followed by one for the initial bitmap page. It's not important for
+index creation to appear atomic, because the index isn't yet visible to any
+other transaction, and the creating transaction will roll back in the event of
+a crash. It would be difficult to cover the whole operation with a single
+write-ahead log record anyway, because we can log only a fixed number of
+pages, as given by XLR_MAX_BLOCK_ID (32), with current XLog machinery.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries. They touch a single bucket page and the
+metapage. The metapage is updated during replay as it is updated during
+original operation.
+
+If an insertion causes the addition of an overflow page, there will be one
+WAL entry for the new overflow page and a second entry for the insert itself.
+
+If an insertion causes a bucket split, there will be one WAL entry for the
+insert itself, followed by a WAL entry for allocating a new bucket, followed
+by a WAL entry for each overflow bucket page in the new bucket to which tuples
+are moved from the old bucket, followed by a WAL entry to indicate that the
+split is complete for both old and new buckets. A split operation which
+requires overflow pages to complete the operation will need to write a WAL
+record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the system
+crashes partway through moving tuples from the old bucket's pages to the new
+bucket. In such a case, after recovery, the old and new buckets will still be
+marked with the bucket-being-split and bucket-being-populated flags
+respectively, which indicates that a split is in progress for them. The reader
+algorithm works correctly, as it will scan both the old and new buckets when
+the split is in progress as explained in the reader algorithm section above.
+
+We finish the split at the next insert or split operation on the old bucket,
+as explained in the insert and split algorithms above. It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway). It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space. That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space. In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: to remove dead
+tuples, and to remove tuples that were moved by a bucket split. A WAL entry
+is made for each bucket page from which tuples are removed, and then another
+WAL entry is made when we clear the needs-split-cleanup flag. If dead tuples
+are removed, a separate WAL entry is made to update the metapage.
+
+As deletion involves multiple atomic operations, it is quite possible that the
+system crashes after (a) removing tuples from some of the bucket pages, but
+(b) before clearing the garbage flag, or (c) before updating the metapage.
+If the system crashes before completing (b), then after recovery it will again
+try to clean the bucket during the next vacuum or insert, which can have some
+performance impact, but it will work fine. If the system crashes before
+completing (c),
+after recovery there could be some additional splits until the next vacuum
+updates the metapage, but the other operations like insert, delete and scan
+will work correctly. We can fix this problem by actually updating the
+metapage based on delete operation during replay, but it's not clear whether
+it's worth the complication.
+
+A squeeze operation moves tuples from pages later in the bucket chain to pages
+earlier in the chain, and writes a WAL record when either the page to which it
+is writing tuples becomes full or the page from which it is removing tuples
+becomes empty.
+
+As a squeeze operation involves multiple atomic operations, it is quite
+possible that the system crashes before completing the operation on the
+entire bucket. After recovery, the operations will work correctly, but
+the index will remain bloated and this can impact the performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
+
+
+Other Notes
+-----------
+
+Cleanup locks prevent a split from occurring while *another* process is
+stopped in a given bucket. They also ensure that one of our *own* backend's
+scans is not stopped in the bucket.