diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-04 12:15:05 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-04 12:15:05 +0000 |
commit | 46651ce6fe013220ed397add242004d764fc0153 (patch) | |
tree | 6e5299f990f88e60174a1d3ae6e48eedd2688b2b /src/backend/access/hash | |
parent | Initial commit. (diff) | |
download | postgresql-14-46651ce6fe013220ed397add242004d764fc0153.tar.xz postgresql-14-46651ce6fe013220ed397add242004d764fc0153.zip |
Adding upstream version 14.5.upstream/14.5upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/backend/access/hash')
-rw-r--r-- | src/backend/access/hash/Makefile | 27 | ||||
-rw-r--r-- | src/backend/access/hash/README | 651 | ||||
-rw-r--r-- | src/backend/access/hash/hash.c | 918 | ||||
-rw-r--r-- | src/backend/access/hash/hash_xlog.c | 1145 | ||||
-rw-r--r-- | src/backend/access/hash/hashfunc.c | 411 | ||||
-rw-r--r-- | src/backend/access/hash/hashinsert.c | 432 | ||||
-rw-r--r-- | src/backend/access/hash/hashovfl.c | 1083 | ||||
-rw-r--r-- | src/backend/access/hash/hashpage.c | 1612 | ||||
-rw-r--r-- | src/backend/access/hash/hashsearch.c | 721 | ||||
-rw-r--r-- | src/backend/access/hash/hashsort.c | 152 | ||||
-rw-r--r-- | src/backend/access/hash/hashutil.c | 622 | ||||
-rw-r--r-- | src/backend/access/hash/hashvalidate.c | 439 |
12 files changed, 8213 insertions, 0 deletions
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile new file mode 100644 index 0000000..75bf365 --- /dev/null +++ b/src/backend/access/hash/Makefile @@ -0,0 +1,27 @@ +#------------------------------------------------------------------------- +# +# Makefile-- +# Makefile for access/hash +# +# IDENTIFICATION +# src/backend/access/hash/Makefile +# +#------------------------------------------------------------------------- + +subdir = src/backend/access/hash +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global + +OBJS = \ + hash.o \ + hash_xlog.o \ + hashfunc.o \ + hashinsert.o \ + hashovfl.o \ + hashpage.o \ + hashsearch.o \ + hashsort.o \ + hashutil.o \ + hashvalidate.o + +include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README new file mode 100644 index 0000000..2227ebf --- /dev/null +++ b/src/backend/access/hash/README @@ -0,0 +1,651 @@ +src/backend/access/hash/README + +Hash Indexing +============= + +This directory contains an implementation of hash indexing for Postgres. +Most of the core ideas are taken from Margo Seltzer and Ozan Yigit, +A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference, +January 1991. (Our in-memory hashtable implementation, +src/backend/utils/hash/dynahash.c, also relies on some of the same concepts; +it is derived from code written by Esmond Pitt and later improved by Margo +among others.) + +A hash index consists of two or more "buckets", into which tuples are +placed whenever their hash key maps to the bucket number. The +key-to-bucket-number mapping is chosen so that the index can be +incrementally expanded. When a new bucket is to be added to the index, +exactly one existing bucket will need to be "split", with some of its +tuples being transferred to the new bucket according to the updated +key-to-bucket-number mapping. This is essentially the same hash table +management technique embodied in src/backend/utils/hash/dynahash.c for +in-memory hash tables. + +Each bucket in the hash index comprises one or more index pages. The +bucket's first page is permanently assigned to it when the bucket is +created. Additional pages, called "overflow pages", are added if the +bucket receives too many tuples to fit in the primary bucket page. +The pages of a bucket are chained together in a doubly-linked list +using fields in the index page special space. + +There is currently no provision to shrink a hash index, other than by +rebuilding it with REINDEX. Overflow pages can be recycled for reuse +in other buckets, but we never give them back to the operating system. +There is no provision for reducing the number of buckets, either. + +As of PostgreSQL 8.4, hash index entries store only the hash code, not the +actual data value, for each indexed item. This makes the index entries +smaller (perhaps very substantially so) and speeds up various operations. +In particular, we can speed searches by keeping the index entries in any +one index page sorted by hash code, thus allowing binary search to be used +within an index page. Note however that there is *no* assumption about the +relative ordering of hash codes across different index pages of a bucket. + + +Page Addressing +--------------- + +There are four kinds of pages in a hash index: the meta page (page zero), +which contains statically allocated control information; primary bucket +pages; overflow pages; and bitmap pages, which keep track of overflow +pages that have been freed and are available for re-use. For addressing +purposes, bitmap pages are regarded as a subset of the overflow pages. + +Primary bucket pages and overflow pages are allocated independently (since +any given index might need more or fewer overflow pages relative to its +number of buckets). The hash code uses an interesting set of addressing +rules to support a variable number of overflow pages while not having to +move primary bucket pages around after they are created. + +Primary bucket pages (henceforth just "bucket pages") are allocated in +power-of-2 groups, called "split points" in the code. That means at every new +splitpoint we double the existing number of buckets. Allocating huge chunks +of bucket pages all at once isn't optimal and we will take ages to consume +those. To avoid this exponential growth of index size, we did use a trick to +break up allocation of buckets at the splitpoint into 4 equal phases. If +(2 ^ x) are the total buckets need to be allocated at a splitpoint (from now on +we shall call this as a splitpoint group), then we allocate 1/4th (2 ^ (x - 2)) +of total buckets at each phase of splitpoint group. Next quarter of allocation +will only happen if buckets of the previous phase have been already consumed. +For the initial splitpoint groups < 10 we will allocate all of their buckets in +single phase only, as number of buckets allocated at initial groups are small +in numbers. And for the groups >= 10 the allocation process is distributed +among four equal phases. At group 10 we allocate (2 ^ 9) buckets in 4 +different phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, the numbers in curly braces +indicate the number of buckets allocated within each phase of splitpoint group +10. And, for splitpoint group 11 and 12 allocation phases will be +{2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and {2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively. We +can see that at each splitpoint group we double the total number of buckets +from the previous group but in an incremental phase. The bucket pages +allocated within one phase of a splitpoint group will appear consecutively in +the index. This addressing scheme allows the physical location of a bucket +page to be computed from the bucket number relatively easily, using only a +small amount of control information. If we look at the function +_hash_spareindex for a given bucket number we first compute the +splitpoint group it belongs to and then the phase to which the bucket belongs +to. Adding them we get the global splitpoint phase number S to which the +bucket belongs and then simply add "hashm_spares[S] + 1" (where hashm_spares[] +is an array stored in the metapage) with given bucket number to compute its +physical address. The hashm_spares[S] can be interpreted as the total number +of overflow pages that have been allocated before the bucket pages of +splitpoint phase S. The hashm_spares[0] is always 0, so that buckets 0 and 1 +always appear at block numbers 1 and 2, just after the meta page. We always +have hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the +former. The difference between the two represents the number of overflow pages +appearing between the bucket page groups of splitpoints phase N and N+1. +(Note: the above describes what happens when filling an initially minimally +sized hash index. In practice, we try to estimate the required index size and +allocate a suitable number of splitpoints phases immediately, to avoid +expensive re-splitting during initial index build.) + +When S splitpoints exist altogether, the array entries hashm_spares[0] +through hashm_spares[S] are valid; hashm_spares[S] records the current +total number of overflow pages. New overflow pages are created as needed +at the end of the index, and recorded by incrementing hashm_spares[S]. +When it is time to create a new splitpoint phase's worth of bucket pages, we +copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is +stored in the hashm_ovflpoint field of the meta page). This has the +effect of reserving the correct number of bucket pages at the end of the +index, and preparing to allocate additional overflow pages after those +bucket pages. hashm_spares[] entries before S cannot change anymore, +since that would require moving already-created bucket pages. + +The last page nominally used by the index is always determinable from +hashm_spares[S]. To avoid complaints from smgr, the logical EOF as seen by +the filesystem and smgr must always be greater than or equal to this page. +We have to allow the case "greater than" because it's possible that during +an index extension we crash after allocating filesystem space and before +updating the metapage. Note that on filesystems that allow "holes" in +files, it's entirely likely that pages before the logical EOF are not yet +allocated: when we allocate a new splitpoint phase's worth of bucket pages, we +physically zero the last such page to force the EOF up, and the first such +page will be used immediately, but the intervening pages are not written +until needed. + +Since overflow pages may be recycled if enough tuples are deleted from +their bucket, we need a way to keep track of currently-free overflow +pages. The state of each overflow page (0 = available, 1 = not available) +is recorded in "bitmap" pages dedicated to this purpose. The entries in +the bitmap are indexed by "bit number", a zero-based count in which every +overflow page has a unique entry. We can convert between an overflow +page's physical block number and its bit number using the information in +hashm_spares[] (see hashovfl.c for details). The bit number sequence +includes the bitmap pages, which is the reason for saying that bitmap +pages are a subset of the overflow pages. It turns out in fact that each +bitmap page's first bit represents itself --- this is not an essential +property, but falls out of the fact that we only allocate another bitmap +page when we really need one. Bit number zero always corresponds to the +first bitmap page, which is allocated during index creation just after all +the initially created buckets. + + +Lock Definitions +---------------- + +Concurrency control for hash indexes is provided using buffer content +locks, buffer pins, and cleanup locks. Here as elsewhere in PostgreSQL, +cleanup lock means that we hold an exclusive lock on the buffer and have +observed at some point after acquiring the lock that we hold the only pin +on that buffer. For hash indexes, a cleanup lock on a primary bucket page +represents the right to perform an arbitrary reorganization of the entire +bucket. Therefore, scans retain a pin on the primary bucket page for the +bucket they are currently scanning. Splitting a bucket requires a cleanup +lock on both the old and new primary bucket pages. VACUUM therefore takes +a cleanup lock on every bucket page in order to remove tuples. It can also +remove tuples copied to a new bucket by any previous split operation, because +the cleanup lock taken on the primary bucket page guarantees that no scans +which started prior to the most recent split can still be in progress. After +cleaning each page individually, it attempts to take a cleanup lock on the +primary bucket page in order to "squeeze" the bucket down to the minimum +possible number of pages. + +To avoid deadlocks, we must be consistent about the lock order in which we +lock the buckets for operations that requires locks on two different buckets. +We choose to always lock the lower-numbered bucket first. The metapage is +only ever locked after all bucket locks have been taken. + + +Metapage Caching +---------------- + +Both scanning the index and inserting tuples require locating the bucket +where a given tuple ought to be located. To do this, we need the bucket +count, highmask, and lowmask from the metapage; however, it's undesirable +for performance reasons to have to have to lock and pin the metapage for +every such operation. Instead, we retain a cached copy of the metapage +in each backend's relcache entry. This will produce the correct +bucket mapping as long as the target bucket hasn't been split since the +last cache refresh. + +To guard against the possibility that such a split has occurred, the +primary page of each bucket chain stores the number of buckets that +existed as of the time the bucket was last split, or if never split as +of the time it was created, in the space normally used for the +previous block number (that is, hasho_prevblkno). This doesn't cost +anything because the primary bucket page is always the first page in +the chain, and the previous block number is therefore always, in +reality, InvalidBlockNumber. + +After computing the ostensibly-correct bucket number based on our cached +copy of the metapage, we lock the corresponding primary bucket page and +check whether the bucket count stored in hasho_prevblkno is greater than +the number of buckets stored in our cached copy of the metapage. If +so, the bucket has certainly been split, because the count must originally +have been less than the number of buckets that existed at that time and +can't have increased except due to a split. If not, the bucket can't have +been split, because a split would have created a new bucket with a higher +bucket number than any we'd seen previously. In the latter case, we've +locked the correct bucket and can proceed; in the former case, we must +release the lock on this bucket, lock the metapage, update our cache, +unlock the metapage, and retry. + +Needing to retry occasionally might seem expensive, but the number of times +any given bucket can be split is limited to a few dozen no matter how +many times the hash index is accessed, because the total number of +buckets is limited to less than 2^32. On the other hand, the number of +times we access a bucket is unbounded and will be several orders of +magnitude larger even in unsympathetic cases. + +(The metapage cache is new in v10. Older hash indexes had the primary +bucket page's hasho_prevblkno initialized to InvalidBuffer.) + +Pseudocode Algorithms +--------------------- + +Various flags that are used in hash index operations are described as below: + +The bucket-being-split and bucket-being-populated flags indicate that split +the operation is in progress for a bucket. During split operation, a +bucket-being-split flag is set on the old bucket and bucket-being-populated +flag is set on new bucket. These flags are cleared once the split operation +is finished. + +The split-cleanup flag indicates that a bucket which has been recently split +still contains tuples that were also copied to the new bucket; it essentially +marks the split as incomplete. Once we're certain that no scans which +started before the new bucket was fully populated are still in progress, we +can remove the copies from the old bucket and clear the flag. We insist that +this flag must be clear before splitting a bucket; thus, a bucket can't be +split again until the previous split is totally complete. + +The moved-by-split flag on a tuple indicates that tuple is moved from old to +new bucket. Concurrent scans will skip such tuples until the split operation +is finished. Once the tuple is marked as moved-by-split, it will remain so +forever but that does no harm. We have intentionally not cleared it as that +can generate an additional I/O which is not necessary. + +The operations we need to support are: readers scanning the index for +entries of a particular hash code (which by definition are all in the same +bucket); insertion of a new tuple into the correct bucket; enlarging the +hash table by splitting an existing bucket; and garbage collection +(deletion of dead tuples and compaction of buckets). Bucket splitting is +done at conclusion of any insertion that leaves the hash table more full +than the target load factor, but it is convenient to consider it as an +independent operation. Note that we do not have a bucket-merge operation +--- the number of buckets never shrinks. Insertion, splitting, and +garbage collection may all need access to freelist management, which keeps +track of available overflow pages. + +The reader algorithm is: + + lock the primary bucket page of the target bucket + if the target bucket is still being populated by a split: + release the buffer content lock on current bucket page + pin and acquire the buffer content lock on old bucket in shared mode + release the buffer content lock on old bucket, but not pin + retake the buffer content lock on new bucket + arrange to scan the old bucket normally and the new bucket for + tuples which are not moved-by-split +-- then, per read request: + reacquire content lock on current page + step to next page if necessary (no chaining of content locks, but keep + the pin on the primary bucket throughout the scan) + save all the matching tuples from current index page into an items array + release pin and content lock (but if it is primary bucket page retain + its pin till the end of the scan) + get tuple from an item array +-- at scan shutdown: + release all pins still held + +Holding the buffer pin on the primary bucket page for the whole scan prevents +the reader's current-tuple pointer from being invalidated by splits or +compactions. (Of course, other buckets can still be split or compacted.) + +To minimize lock/unlock traffic, hash index scan always searches the entire +hash page to identify all the matching items at once, copying their heap tuple +IDs into backend-local storage. The heap tuple IDs are then processed while not +holding any page lock within the index thereby, allowing concurrent insertion +to happen on the same index page without any requirement of re-finding the +current scan position for the reader. We do continue to hold a pin on the +bucket page, to protect against concurrent deletions and bucket split. + +To allow for scans during a bucket split, if at the start of the scan, the +bucket is marked as bucket-being-populated, it scan all the tuples in that +bucket except for those that are marked as moved-by-split. Once it finishes +the scan of all the tuples in the current bucket, it scans the old bucket from +which this bucket is formed by split. + +The insertion algorithm is rather similar: + + lock the primary bucket page of the target bucket +-- (so far same as reader, except for acquisition of buffer content lock in + exclusive mode on primary bucket page) + if the bucket-being-split flag is set for a bucket and pin count on it is + one, then finish the split + release the buffer content lock on current bucket + get the "new" bucket which was being populated by the split + scan the new bucket and form the hash table of TIDs + conditionally get the cleanup lock on old and new buckets + if we get the lock on both the buckets + finish the split using algorithm mentioned below for split + release the pin on old bucket and restart the insert from beginning. + if current page is full, first check if this page contains any dead tuples. + if yes, remove dead tuples from the current page and again check for the + availability of the space. If enough space found, insert the tuple else + release lock but not pin, read/exclusive-lock + next page; repeat as needed + >> see below if no space in any page of bucket + take buffer content lock in exclusive mode on metapage + insert tuple at appropriate place in page + mark current page dirty + increment tuple count, decide if split needed + mark meta page dirty + write WAL for insertion of tuple + release the buffer content lock on metapage + release buffer content lock on current page + if current page is not a bucket page, release the pin on bucket page + if split is needed, enter Split algorithm below + release the pin on metapage + +To speed searches, the index entries within any individual index page are +kept sorted by hash code; the insertion code must take care to insert new +entries in the right place. It is okay for an insertion to take place in a +bucket that is being actively scanned, because readers can cope with this +as explained above. We only need the short-term buffer locks to ensure +that readers do not see a partially-updated page. + +To avoid deadlock between readers and inserters, whenever there is a need +to lock multiple buckets, we always take in the order suggested in Lock +Definitions above. This algorithm allows them a very high degree of +concurrency. (The exclusive metapage lock taken to update the tuple count +is stronger than necessary, since readers do not care about the tuple count, +but the lock is held for such a short time that this is probably not an +issue.) + +When an inserter cannot find space in any existing page of a bucket, it +must obtain an overflow page and add that page to the bucket's chain. +Details of that part of the algorithm appear later. + +The page split algorithm is entered whenever an inserter observes that the +index is overfull (has a higher-than-wanted ratio of tuples to buckets). +The algorithm attempts, but does not necessarily succeed, to split one +existing bucket in two, thereby lowering the fill ratio: + + pin meta page and take buffer content lock in exclusive mode + check split still needed + if split not needed anymore, drop buffer content lock and pin and exit + decide which bucket to split + try to take a cleanup lock on that bucket; if fail, give up + if that bucket is still being split or has split-cleanup work: + try to finish the split and the cleanup work + if that succeeds, start over; if it fails, give up + mark the old and new buckets indicating split is in progress + mark both old and new buckets as dirty + write WAL for allocation of new page for split + copy the tuples that belongs to new bucket from old bucket, marking + them as moved-by-split + write WAL record for moving tuples to new page once the new page is full + or all the pages of old bucket are finished + release lock but not pin for primary bucket page of old bucket, + read/shared-lock next page; repeat as needed + clear the bucket-being-split and bucket-being-populated flags + mark the old bucket indicating split-cleanup + write WAL for changing the flags on both old and new buckets + +The split operation's attempt to acquire cleanup-lock on the old bucket number +could fail if another process holds any lock or pin on it. We do not want to +wait if that happens, because we don't want to wait while holding the metapage +exclusive-lock. So, this is a conditional LWLockAcquire operation, and if +it fails we just abandon the attempt to split. This is all right since the +index is overfull but perfectly functional. Every subsequent inserter will +try to split, and eventually one will succeed. If multiple inserters failed +to split, the index might still be overfull, but eventually, the index will +not be overfull and split attempts will stop. (We could make a successful +splitter loop to see if the index is still overfull, but it seems better to +distribute the split overhead across successive insertions.) + +If a split fails partway through (e.g. due to insufficient disk space or an +interrupt), the index will not be corrupted. Instead, we'll retry the split +every time a tuple is inserted into the old bucket prior to inserting the new +tuple; eventually, we should succeed. The fact that a split is left +unfinished doesn't prevent subsequent buckets from being split, but we won't +try to split the bucket again until the prior split is finished. In other +words, a bucket can be in the middle of being split for some time, but it can't +be in the middle of two splits at the same time. + +The fourth operation is garbage collection (bulk deletion): + + next bucket := 0 + pin metapage and take buffer content lock in exclusive mode + fetch current max bucket number + release meta page buffer content lock and pin + while next bucket <= max bucket do + acquire cleanup lock on primary bucket page + loop: + scan and remove tuples + mark the target page dirty + write WAL for deleting tuples from target page + if this is the last bucket page, break out of loop + pin and x-lock next page + release prior lock and pin (except keep pin on primary bucket page) + if the page we have locked is not the primary bucket page: + release lock and take exclusive lock on primary bucket page + if there are no other pins on the primary bucket page: + squeeze the bucket to remove free space + release the pin on primary bucket page + next bucket ++ + end loop + pin metapage and take buffer content lock in exclusive mode + check if number of buckets changed + if so, release content lock and pin and return to for-each-bucket loop + else update metapage tuple count + mark meta page dirty and write WAL for update of metapage + release buffer content lock and pin + +Note that this is designed to allow concurrent splits and scans. If a split +occurs, tuples relocated into the new bucket will be visited twice by the +scan, but that does no harm. See also "Interlocking Between Scans and +VACUUM", below. + +We must be careful about the statistics reported by the VACUUM operation. +What we can do is count the number of tuples scanned, and believe this in +preference to the stored tuple count if the stored tuple count and number of +buckets did *not* change at any time during the scan. This provides a way of +correcting the stored tuple count if it gets out of sync for some reason. But +if a split or insertion does occur concurrently, the scan count is +untrustworthy; instead, subtract the number of tuples deleted from the stored +tuple count and use that. + +Interlocking Between Scans and VACUUM +------------------------------------- + +Since we release the lock on bucket page during a cleanup scan of a bucket, a +concurrent scan could start in that bucket before we've finished vacuuming it. +If a scan gets ahead of cleanup, we could have the following problem: (1) the +scan sees heap TIDs that are about to be removed before they are processed by +VACUUM, (2) the scan decides that one or more of those TIDs are dead, (3) +VACUUM completes, (4) one or more of the TIDs the scan decided were dead are +reused for an unrelated tuple, and finally (5) the scan wakes up and +erroneously kills the new tuple. + +Note that this requires VACUUM and a scan to be active in the same bucket at +the same time. If VACUUM completes before the scan starts, the scan never has +a chance to see the dead tuples; if the scan completes before the VACUUM +starts, the heap TIDs can't have been reused meanwhile. Furthermore, VACUUM +can't start on a bucket that has an active scan, because the scan holds a pin +on the primary bucket page, and VACUUM must take a cleanup lock on that page +in order to begin cleanup. Therefore, the only way this problem can occur is +for a scan to start after VACUUM has released the cleanup lock on the bucket +but before it has processed the entire bucket and then overtake the cleanup +operation. + +Currently, we prevent this using lock chaining: cleanup locks the next page +in the chain before releasing the lock and pin on the page just processed. + +Free Space Management +--------------------- + +(Question: why is this so complicated? Why not just have a linked list +of free pages with the list head in the metapage? It's not like we +avoid needing to modify the metapage with all this.) + +Free space management consists of two sub-algorithms, one for reserving +an overflow page to add to a bucket chain, and one for returning an empty +overflow page to the free pool. + +Obtaining an overflow page: + + take metapage content lock in exclusive mode + determine next bitmap page number; if none, exit loop + release meta page content lock + pin bitmap page and take content lock in exclusive mode + search for a free page (zero bit in bitmap) + if found: + set bit in bitmap + mark bitmap page dirty + take metapage buffer content lock in exclusive mode + if first-free-bit value did not change, + update it and mark meta page dirty + else (not found): + release bitmap page buffer content lock + loop back to try next bitmap page, if any +-- here when we have checked all bitmap pages; we hold meta excl. lock + extend index to add another overflow page; update meta information + mark meta page dirty + return page number + +It is slightly annoying to release and reacquire the metapage lock +multiple times, but it seems best to do it that way to minimize loss of +concurrency against processes just entering the index. We don't want +to hold the metapage exclusive lock while reading in a bitmap page. +(We can at least avoid repeated buffer pin/unpin here.) + +The normal path for extending the index does not require doing I/O while +holding the metapage lock. We do have to do I/O when the extension +requires adding a new bitmap page as well as the required overflow page +... but that is an infrequent case, so the loss of concurrency seems +acceptable. + +The portion of tuple insertion that calls the above subroutine looks +like this: + + -- having determined that no space is free in the target bucket: + remember last page of bucket, drop write lock on it + re-write-lock last page of bucket + if it is not last anymore, step to the last page + execute free-page-acquire (obtaining an overflow page) mechanism + described above + update (former) last page to point to the new page and mark buffer dirty + write-lock and initialize new page, with back link to former last page + write WAL for addition of overflow page + release the locks on meta page and bitmap page acquired in + free-page-acquire algorithm + release the lock on former last page + release the lock on new overflow page + insert tuple into new page + -- etc. + +Notice this handles the case where two concurrent inserters try to extend +the same bucket. They will end up with a valid, though perhaps +space-inefficient, configuration: two overflow pages will be added to the +bucket, each containing one tuple. + +The last part of this violates the rule about holding write lock on two +pages concurrently, but it should be okay to write-lock the previously +free page; there can be no other process holding lock on it. + +Bucket splitting uses a similar algorithm if it has to extend the new +bucket, but it need not worry about concurrent extension since it has +buffer content lock in exclusive mode on the new bucket. + +Freeing an overflow page requires the process to hold buffer content lock in +exclusive mode on the containing bucket, so need not worry about other +accessors of pages in the bucket. The algorithm is: + + delink overflow page from bucket chain + (this requires read/update/write/release of fore and aft siblings) + pin meta page and take buffer content lock in shared mode + determine which bitmap page contains the free space bit for page + release meta page buffer content lock + pin bitmap page and take buffer content lock in exclusive mode + retake meta page buffer content lock in exclusive mode + move (insert) tuples that belong to the overflow page being freed + update bitmap bit + mark bitmap page dirty + if page number is still less than first-free-bit, + update first-free-bit field and mark meta page dirty + write WAL for delinking overflow page operation + release buffer content lock and pin + release meta page buffer content lock and pin + +We have to do it this way because we must clear the bitmap bit before +changing the first-free-bit field (hashm_firstfree). It is possible that +we set first-free-bit too small (because someone has already reused the +page we just freed), but that is okay; the only cost is the next overflow +page acquirer will scan more bitmap bits than he needs to. What must be +avoided is having first-free-bit greater than the actual first free bit, +because then that free page would never be found by searchers. + +The reason of moving tuples from overflow page while delinking the later is +to make that as an atomic operation. Not doing so could lead to spurious reads +on standby. Basically, the user might see the same tuple twice. + + +WAL Considerations +------------------ + +The hash index operations like create index, insert, delete, bucket split, +allocate overflow page, and squeeze in themselves don't guarantee hash index +consistency after a crash. To provide robustness, we write WAL for each of +these operations. + +CREATE INDEX writes multiple WAL records. First, we write a record to cover +the initializatoin of the metapage, followed by one for each new bucket +created, followed by one for the initial bitmap page. It's not important for +index creation to appear atomic, because the index isn't yet visible to any +other transaction, and the creating transaction will roll back in the event of +a crash. It would be difficult to cover the whole operation with a single +write-ahead log record anyway, because we can log only a fixed number of +pages, as given by XLR_MAX_BLOCK_ID (32), with current XLog machinery. + +Ordinary item insertions (that don't force a page split or need a new overflow +page) are single WAL entries. They touch a single bucket page and the +metapage. The metapage is updated during replay as it is updated during +original operation. + +If an insertion causes the addition of an overflow page, there will be one +WAL entry for the new overflow page and second entry for insert itself. + +If an insertion causes a bucket split, there will be one WAL entry for insert +itself, followed by a WAL entry for allocating a new bucket, followed by a WAL +entry for each overflow bucket page in the new bucket to which the tuples are +moved from old bucket, followed by a WAL entry to indicate that split is +complete for both old and new buckets. A split operation which requires +overflow pages to complete the operation will need to write a WAL record for +each new allocation of an overflow page. + +As splitting involves multiple atomic actions, it's possible that the system +crashes between moving tuples from bucket pages of the old bucket to new +bucket. In such a case, after recovery, the old and new buckets will be +marked with bucket-being-split and bucket-being-populated flags respectively +which indicates that split is in progress for those buckets. The reader +algorithm works correctly, as it will scan both the old and new buckets when +the split is in progress as explained in the reader algorithm section above. + +We finish the split at next insert or split operation on the old bucket as +explained in insert and split algorithm above. It could be done during +searches, too, but it seems best not to put any extra updates in what would +otherwise be a read-only operation (updating is not possible in hot standby +mode anyway). It would seem natural to complete the split in VACUUM, but since +splitting a bucket might require allocating a new page, it might fail if you +run out of disk space. That would be bad during VACUUM - the reason for +running VACUUM in the first place might be that you run out of disk space, +and now VACUUM won't finish because you're out of disk space. In contrast, +an insertion can require enlarging the physical file anyway. + +Deletion of tuples from a bucket is performed for two reasons: to remove dead +tuples, and to remove tuples that were moved by a bucket split. A WAL entry +is made for each bucket page from which tuples are removed, and then another +WAL entry is made when we clear the needs-split-cleanup flag. If dead tuples +are removed, a separate WAL entry is made to update the metapage. + +As deletion involves multiple atomic operations, it is quite possible that +system crashes after (a) removing tuples from some of the bucket pages, (b) +before clearing the garbage flag, or (c) before updating the metapage. If the +system crashes before completing (b), it will again try to clean the bucket +during next vacuum or insert after recovery which can have some performance +impact, but it will work fine. If the system crashes before completing (c), +after recovery there could be some additional splits until the next vacuum +updates the metapage, but the other operations like insert, delete and scan +will work correctly. We can fix this problem by actually updating the +metapage based on delete operation during replay, but it's not clear whether +it's worth the complication. + +A squeeze operation moves tuples from one of the buckets later in the chain to +one of the bucket earlier in chain and writes WAL record when either the +bucket to which it is writing tuples is filled or bucket from which it +is removing the tuples becomes empty. + +As a squeeze operation involves writing multiple atomic operations, it is +quite possible that the system crashes before completing the operation on +entire bucket. After recovery, the operations will work correctly, but +the index will remain bloated and this can impact performance of read and +insert operations until the next vacuum squeeze the bucket completely. + + +Other Notes +----------- + +Clean up locks prevent a split from occurring while *another* process is stopped +in a given bucket. It also ensures that one of our *own* backend's scans is not +stopped in the bucket. diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c new file mode 100644 index 0000000..0752fb3 --- /dev/null +++ b/src/backend/access/hash/hash.c @@ -0,0 +1,918 @@ +/*------------------------------------------------------------------------- + * + * hash.c + * Implementation of Margo Seltzer's Hashing package for postgres. + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/hash/hash.c + * + * NOTES + * This file contains only the public interface routines. + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "access/hash_xlog.h" +#include "access/relscan.h" +#include "access/tableam.h" +#include "catalog/index.h" +#include "commands/progress.h" +#include "commands/vacuum.h" +#include "miscadmin.h" +#include "optimizer/plancat.h" +#include "pgstat.h" +#include "utils/builtins.h" +#include "utils/index_selfuncs.h" +#include "utils/rel.h" + +/* Working state for hashbuild and its callback */ +typedef struct +{ + HSpool *spool; /* NULL if not using spooling */ + double indtuples; /* # tuples accepted into index */ + Relation heapRel; /* heap relation descriptor */ +} HashBuildState; + +static void hashbuildCallback(Relation index, + ItemPointer tid, + Datum *values, + bool *isnull, + bool tupleIsAlive, + void *state); + + +/* + * Hash handler function: return IndexAmRoutine with access method parameters + * and callbacks. + */ +Datum +hashhandler(PG_FUNCTION_ARGS) +{ + IndexAmRoutine *amroutine = makeNode(IndexAmRoutine); + + amroutine->amstrategies = HTMaxStrategyNumber; + amroutine->amsupport = HASHNProcs; + amroutine->amoptsprocnum = HASHOPTIONS_PROC; + amroutine->amcanorder = false; + amroutine->amcanorderbyop = false; + amroutine->amcanbackward = true; + amroutine->amcanunique = false; + amroutine->amcanmulticol = false; + amroutine->amoptionalkey = false; + amroutine->amsearcharray = false; + amroutine->amsearchnulls = false; + amroutine->amstorage = false; + amroutine->amclusterable = false; + amroutine->ampredlocks = true; + amroutine->amcanparallel = false; + amroutine->amcaninclude = false; + amroutine->amusemaintenanceworkmem = false; + amroutine->amparallelvacuumoptions = + VACUUM_OPTION_PARALLEL_BULKDEL; + amroutine->amkeytype = INT4OID; + + amroutine->ambuild = hashbuild; + amroutine->ambuildempty = hashbuildempty; + amroutine->aminsert = hashinsert; + amroutine->ambulkdelete = hashbulkdelete; + amroutine->amvacuumcleanup = hashvacuumcleanup; + amroutine->amcanreturn = NULL; + amroutine->amcostestimate = hashcostestimate; + amroutine->amoptions = hashoptions; + amroutine->amproperty = NULL; + amroutine->ambuildphasename = NULL; + amroutine->amvalidate = hashvalidate; + amroutine->amadjustmembers = hashadjustmembers; + amroutine->ambeginscan = hashbeginscan; + amroutine->amrescan = hashrescan; + amroutine->amgettuple = hashgettuple; + amroutine->amgetbitmap = hashgetbitmap; + amroutine->amendscan = hashendscan; + amroutine->ammarkpos = NULL; + amroutine->amrestrpos = NULL; + amroutine->amestimateparallelscan = NULL; + amroutine->aminitparallelscan = NULL; + amroutine->amparallelrescan = NULL; + + PG_RETURN_POINTER(amroutine); +} + +/* + * hashbuild() -- build a new hash index. + */ +IndexBuildResult * +hashbuild(Relation heap, Relation index, IndexInfo *indexInfo) +{ + IndexBuildResult *result; + BlockNumber relpages; + double reltuples; + double allvisfrac; + uint32 num_buckets; + long sort_threshold; + HashBuildState buildstate; + + /* + * We expect to be called exactly once for any index relation. If that's + * not the case, big trouble's what we have. + */ + if (RelationGetNumberOfBlocks(index) != 0) + elog(ERROR, "index \"%s\" already contains data", + RelationGetRelationName(index)); + + /* Estimate the number of rows currently present in the table */ + estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac); + + /* Initialize the hash index metadata page and initial buckets */ + num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM); + + /* + * If we just insert the tuples into the index in scan order, then + * (assuming their hash codes are pretty random) there will be no locality + * of access to the index, and if the index is bigger than available RAM + * then we'll thrash horribly. To prevent that scenario, we can sort the + * tuples by (expected) bucket number. However, such a sort is useless + * overhead when the index does fit in RAM. We choose to sort if the + * initial index size exceeds maintenance_work_mem, or the number of + * buffers usable for the index, whichever is less. (Limiting by the + * number of buffers should reduce thrashing between PG buffers and kernel + * buffers, which seems useful even if no physical I/O results. Limiting + * by maintenance_work_mem is useful to allow easy testing of the sort + * code path, and may be useful to DBAs as an additional control knob.) + * + * NOTE: this test will need adjustment if a bucket is ever different from + * one page. Also, "initial index size" accounting does not include the + * metapage, nor the first bitmap page. + */ + sort_threshold = (maintenance_work_mem * 1024L) / BLCKSZ; + if (index->rd_rel->relpersistence != RELPERSISTENCE_TEMP) + sort_threshold = Min(sort_threshold, NBuffers); + else + sort_threshold = Min(sort_threshold, NLocBuffer); + + if (num_buckets >= (uint32) sort_threshold) + buildstate.spool = _h_spoolinit(heap, index, num_buckets); + else + buildstate.spool = NULL; + + /* prepare to build the index */ + buildstate.indtuples = 0; + buildstate.heapRel = heap; + + /* do the heap scan */ + reltuples = table_index_build_scan(heap, index, indexInfo, true, true, + hashbuildCallback, + (void *) &buildstate, NULL); + pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, + buildstate.indtuples); + + if (buildstate.spool) + { + /* sort the tuples and insert them into the index */ + _h_indexbuild(buildstate.spool, buildstate.heapRel); + _h_spooldestroy(buildstate.spool); + } + + /* + * Return statistics + */ + result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult)); + + result->heap_tuples = reltuples; + result->index_tuples = buildstate.indtuples; + + return result; +} + +/* + * hashbuildempty() -- build an empty hash index in the initialization fork + */ +void +hashbuildempty(Relation index) +{ + _hash_init(index, 0, INIT_FORKNUM); +} + +/* + * Per-tuple callback for table_index_build_scan + */ +static void +hashbuildCallback(Relation index, + ItemPointer tid, + Datum *values, + bool *isnull, + bool tupleIsAlive, + void *state) +{ + HashBuildState *buildstate = (HashBuildState *) state; + Datum index_values[1]; + bool index_isnull[1]; + IndexTuple itup; + + /* convert data to a hash key; on failure, do not insert anything */ + if (!_hash_convert_tuple(index, + values, isnull, + index_values, index_isnull)) + return; + + /* Either spool the tuple for sorting, or just put it into the index */ + if (buildstate->spool) + _h_spool(buildstate->spool, tid, index_values, index_isnull); + else + { + /* form an index tuple and point it at the heap tuple */ + itup = index_form_tuple(RelationGetDescr(index), + index_values, index_isnull); + itup->t_tid = *tid; + _hash_doinsert(index, itup, buildstate->heapRel); + pfree(itup); + } + + buildstate->indtuples += 1; +} + +/* + * hashinsert() -- insert an index tuple into a hash table. + * + * Hash on the heap tuple's key, form an index tuple with hash code. + * Find the appropriate location for the new tuple, and put it there. + */ +bool +hashinsert(Relation rel, Datum *values, bool *isnull, + ItemPointer ht_ctid, Relation heapRel, + IndexUniqueCheck checkUnique, + bool indexUnchanged, + IndexInfo *indexInfo) +{ + Datum index_values[1]; + bool index_isnull[1]; + IndexTuple itup; + + /* convert data to a hash key; on failure, do not insert anything */ + if (!_hash_convert_tuple(rel, + values, isnull, + index_values, index_isnull)) + return false; + + /* form an index tuple and point it at the heap tuple */ + itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull); + itup->t_tid = *ht_ctid; + + _hash_doinsert(rel, itup, heapRel); + + pfree(itup); + + return false; +} + + +/* + * hashgettuple() -- Get the next tuple in the scan. + */ +bool +hashgettuple(IndexScanDesc scan, ScanDirection dir) +{ + HashScanOpaque so = (HashScanOpaque) scan->opaque; + bool res; + + /* Hash indexes are always lossy since we store only the hash code */ + scan->xs_recheck = true; + + /* + * If we've already initialized this scan, we can just advance it in the + * appropriate direction. If we haven't done so yet, we call a routine to + * get the first item in the scan. + */ + if (!HashScanPosIsValid(so->currPos)) + res = _hash_first(scan, dir); + else + { + /* + * Check to see if we should kill the previously-fetched tuple. + */ + if (scan->kill_prior_tuple) + { + /* + * Yes, so remember it for later. (We'll deal with all such tuples + * at once right after leaving the index page or at end of scan.) + * In case if caller reverses the indexscan direction it is quite + * possible that the same item might get entered multiple times. + * But, we don't detect that; instead, we just forget any excess + * entries. + */ + if (so->killedItems == NULL) + so->killedItems = (int *) + palloc(MaxIndexTuplesPerPage * sizeof(int)); + + if (so->numKilled < MaxIndexTuplesPerPage) + so->killedItems[so->numKilled++] = so->currPos.itemIndex; + } + + /* + * Now continue the scan. + */ + res = _hash_next(scan, dir); + } + + return res; +} + + +/* + * hashgetbitmap() -- get all tuples at once + */ +int64 +hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm) +{ + HashScanOpaque so = (HashScanOpaque) scan->opaque; + bool res; + int64 ntids = 0; + HashScanPosItem *currItem; + + res = _hash_first(scan, ForwardScanDirection); + + while (res) + { + currItem = &so->currPos.items[so->currPos.itemIndex]; + + /* + * _hash_first and _hash_next handle eliminate dead index entries + * whenever scan->ignore_killed_tuples is true. Therefore, there's + * nothing to do here except add the results to the TIDBitmap. + */ + tbm_add_tuples(tbm, &(currItem->heapTid), 1, true); + ntids++; + + res = _hash_next(scan, ForwardScanDirection); + } + + return ntids; +} + + +/* + * hashbeginscan() -- start a scan on a hash index + */ +IndexScanDesc +hashbeginscan(Relation rel, int nkeys, int norderbys) +{ + IndexScanDesc scan; + HashScanOpaque so; + + /* no order by operators allowed */ + Assert(norderbys == 0); + + scan = RelationGetIndexScan(rel, nkeys, norderbys); + + so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData)); + HashScanPosInvalidate(so->currPos); + so->hashso_bucket_buf = InvalidBuffer; + so->hashso_split_bucket_buf = InvalidBuffer; + + so->hashso_buc_populated = false; + so->hashso_buc_split = false; + + so->killedItems = NULL; + so->numKilled = 0; + + scan->opaque = so; + + return scan; +} + +/* + * hashrescan() -- rescan an index relation + */ +void +hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys, + ScanKey orderbys, int norderbys) +{ + HashScanOpaque so = (HashScanOpaque) scan->opaque; + Relation rel = scan->indexRelation; + + if (HashScanPosIsValid(so->currPos)) + { + /* Before leaving current page, deal with any killed items */ + if (so->numKilled > 0) + _hash_kill_items(scan); + } + + _hash_dropscanbuf(rel, so); + + /* set position invalid (this will cause _hash_first call) */ + HashScanPosInvalidate(so->currPos); + + /* Update scan key, if a new one is given */ + if (scankey && scan->numberOfKeys > 0) + { + memmove(scan->keyData, + scankey, + scan->numberOfKeys * sizeof(ScanKeyData)); + } + + so->hashso_buc_populated = false; + so->hashso_buc_split = false; +} + +/* + * hashendscan() -- close down a scan + */ +void +hashendscan(IndexScanDesc scan) +{ + HashScanOpaque so = (HashScanOpaque) scan->opaque; + Relation rel = scan->indexRelation; + + if (HashScanPosIsValid(so->currPos)) + { + /* Before leaving current page, deal with any killed items */ + if (so->numKilled > 0) + _hash_kill_items(scan); + } + + _hash_dropscanbuf(rel, so); + + if (so->killedItems != NULL) + pfree(so->killedItems); + pfree(so); + scan->opaque = NULL; +} + +/* + * Bulk deletion of all index entries pointing to a set of heap tuples. + * The set of target tuples is specified via a callback routine that tells + * whether any given heap tuple (identified by ItemPointer) is being deleted. + * + * This function also deletes the tuples that are moved by split to other + * bucket. + * + * Result: a palloc'd struct containing statistical info for VACUUM displays. + */ +IndexBulkDeleteResult * +hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, + IndexBulkDeleteCallback callback, void *callback_state) +{ + Relation rel = info->index; + double tuples_removed; + double num_index_tuples; + double orig_ntuples; + Bucket orig_maxbucket; + Bucket cur_maxbucket; + Bucket cur_bucket; + Buffer metabuf = InvalidBuffer; + HashMetaPage metap; + HashMetaPage cachedmetap; + + tuples_removed = 0; + num_index_tuples = 0; + + /* + * We need a copy of the metapage so that we can use its hashm_spares[] + * values to compute bucket page addresses, but a cached copy should be + * good enough. (If not, we'll detect that further down and refresh the + * cache as necessary.) + */ + cachedmetap = _hash_getcachedmetap(rel, &metabuf, false); + Assert(cachedmetap != NULL); + + orig_maxbucket = cachedmetap->hashm_maxbucket; + orig_ntuples = cachedmetap->hashm_ntuples; + + /* Scan the buckets that we know exist */ + cur_bucket = 0; + cur_maxbucket = orig_maxbucket; + +loop_top: + while (cur_bucket <= cur_maxbucket) + { + BlockNumber bucket_blkno; + BlockNumber blkno; + Buffer bucket_buf; + Buffer buf; + HashPageOpaque bucket_opaque; + Page page; + bool split_cleanup = false; + + /* Get address of bucket's start page */ + bucket_blkno = BUCKET_TO_BLKNO(cachedmetap, cur_bucket); + + blkno = bucket_blkno; + + /* + * We need to acquire a cleanup lock on the primary bucket page to out + * wait concurrent scans before deleting the dead tuples. + */ + buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy); + LockBufferForCleanup(buf); + _hash_checkpage(rel, buf, LH_BUCKET_PAGE); + + page = BufferGetPage(buf); + bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page); + + /* + * If the bucket contains tuples that are moved by split, then we need + * to delete such tuples. We can't delete such tuples if the split + * operation on bucket is not finished as those are needed by scans. + */ + if (!H_BUCKET_BEING_SPLIT(bucket_opaque) && + H_NEEDS_SPLIT_CLEANUP(bucket_opaque)) + { + split_cleanup = true; + + /* + * This bucket might have been split since we last held a lock on + * the metapage. If so, hashm_maxbucket, hashm_highmask and + * hashm_lowmask might be old enough to cause us to fail to remove + * tuples left behind by the most recent split. To prevent that, + * now that the primary page of the target bucket has been locked + * (and thus can't be further split), check whether we need to + * update our cached metapage data. + */ + Assert(bucket_opaque->hasho_prevblkno != InvalidBlockNumber); + if (bucket_opaque->hasho_prevblkno > cachedmetap->hashm_maxbucket) + { + cachedmetap = _hash_getcachedmetap(rel, &metabuf, true); + Assert(cachedmetap != NULL); + } + } + + bucket_buf = buf; + + hashbucketcleanup(rel, cur_bucket, bucket_buf, blkno, info->strategy, + cachedmetap->hashm_maxbucket, + cachedmetap->hashm_highmask, + cachedmetap->hashm_lowmask, &tuples_removed, + &num_index_tuples, split_cleanup, + callback, callback_state); + + _hash_dropbuf(rel, bucket_buf); + + /* Advance to next bucket */ + cur_bucket++; + } + + if (BufferIsInvalid(metabuf)) + metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_NOLOCK, LH_META_PAGE); + + /* Write-lock metapage and check for split since we started */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + metap = HashPageGetMeta(BufferGetPage(metabuf)); + + if (cur_maxbucket != metap->hashm_maxbucket) + { + /* There's been a split, so process the additional bucket(s) */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + cachedmetap = _hash_getcachedmetap(rel, &metabuf, true); + Assert(cachedmetap != NULL); + cur_maxbucket = cachedmetap->hashm_maxbucket; + goto loop_top; + } + + /* Okay, we're really done. Update tuple count in metapage. */ + START_CRIT_SECTION(); + + if (orig_maxbucket == metap->hashm_maxbucket && + orig_ntuples == metap->hashm_ntuples) + { + /* + * No one has split or inserted anything since start of scan, so + * believe our count as gospel. + */ + metap->hashm_ntuples = num_index_tuples; + } + else + { + /* + * Otherwise, our count is untrustworthy since we may have + * double-scanned tuples in split buckets. Proceed by dead-reckoning. + * (Note: we still return estimated_count = false, because using this + * count is better than not updating reltuples at all.) + */ + if (metap->hashm_ntuples > tuples_removed) + metap->hashm_ntuples -= tuples_removed; + else + metap->hashm_ntuples = 0; + num_index_tuples = metap->hashm_ntuples; + } + + MarkBufferDirty(metabuf); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + xl_hash_update_meta_page xlrec; + XLogRecPtr recptr; + + xlrec.ntuples = metap->hashm_ntuples; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfHashUpdateMetaPage); + + XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE); + PageSetLSN(BufferGetPage(metabuf), recptr); + } + + END_CRIT_SECTION(); + + _hash_relbuf(rel, metabuf); + + /* return statistics */ + if (stats == NULL) + stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult)); + stats->estimated_count = false; + stats->num_index_tuples = num_index_tuples; + stats->tuples_removed += tuples_removed; + /* hashvacuumcleanup will fill in num_pages */ + + return stats; +} + +/* + * Post-VACUUM cleanup. + * + * Result: a palloc'd struct containing statistical info for VACUUM displays. + */ +IndexBulkDeleteResult * +hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats) +{ + Relation rel = info->index; + BlockNumber num_pages; + + /* If hashbulkdelete wasn't called, return NULL signifying no change */ + /* Note: this covers the analyze_only case too */ + if (stats == NULL) + return NULL; + + /* update statistics */ + num_pages = RelationGetNumberOfBlocks(rel); + stats->num_pages = num_pages; + + return stats; +} + +/* + * Helper function to perform deletion of index entries from a bucket. + * + * This function expects that the caller has acquired a cleanup lock on the + * primary bucket page, and will return with a write lock again held on the + * primary bucket page. The lock won't necessarily be held continuously, + * though, because we'll release it when visiting overflow pages. + * + * There can't be any concurrent scans in progress when we first enter this + * function because of the cleanup lock we hold on the primary bucket page, + * but as soon as we release that lock, there might be. If those scans got + * ahead of our cleanup scan, they might see a tuple before we kill it and + * wake up only after VACUUM has completed and the TID has been recycled for + * an unrelated tuple. To avoid that calamity, we prevent scans from passing + * our cleanup scan by locking the next page in the bucket chain before + * releasing the lock on the previous page. (This type of lock chaining is not + * ideal, so we might want to look for a better solution at some point.) + * + * We need to retain a pin on the primary bucket to ensure that no concurrent + * split can start. + */ +void +hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf, + BlockNumber bucket_blkno, BufferAccessStrategy bstrategy, + uint32 maxbucket, uint32 highmask, uint32 lowmask, + double *tuples_removed, double *num_index_tuples, + bool split_cleanup, + IndexBulkDeleteCallback callback, void *callback_state) +{ + BlockNumber blkno; + Buffer buf; + Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket; + bool bucket_dirty = false; + + blkno = bucket_blkno; + buf = bucket_buf; + + if (split_cleanup) + new_bucket = _hash_get_newbucket_from_oldbucket(rel, cur_bucket, + lowmask, maxbucket); + + /* Scan each page in bucket */ + for (;;) + { + HashPageOpaque opaque; + OffsetNumber offno; + OffsetNumber maxoffno; + Buffer next_buf; + Page page; + OffsetNumber deletable[MaxOffsetNumber]; + int ndeletable = 0; + bool retain_pin = false; + bool clear_dead_marking = false; + + vacuum_delay_point(); + + page = BufferGetPage(buf); + opaque = (HashPageOpaque) PageGetSpecialPointer(page); + + /* Scan each tuple in page */ + maxoffno = PageGetMaxOffsetNumber(page); + for (offno = FirstOffsetNumber; + offno <= maxoffno; + offno = OffsetNumberNext(offno)) + { + ItemPointer htup; + IndexTuple itup; + Bucket bucket; + bool kill_tuple = false; + + itup = (IndexTuple) PageGetItem(page, + PageGetItemId(page, offno)); + htup = &(itup->t_tid); + + /* + * To remove the dead tuples, we strictly want to rely on results + * of callback function. refer btvacuumpage for detailed reason. + */ + if (callback && callback(htup, callback_state)) + { + kill_tuple = true; + if (tuples_removed) + *tuples_removed += 1; + } + else if (split_cleanup) + { + /* delete the tuples that are moved by split. */ + bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup), + maxbucket, + highmask, + lowmask); + /* mark the item for deletion */ + if (bucket != cur_bucket) + { + /* + * We expect tuples to either belong to current bucket or + * new_bucket. This is ensured because we don't allow + * further splits from bucket that contains garbage. See + * comments in _hash_expandtable. + */ + Assert(bucket == new_bucket); + kill_tuple = true; + } + } + + if (kill_tuple) + { + /* mark the item for deletion */ + deletable[ndeletable++] = offno; + } + else + { + /* we're keeping it, so count it */ + if (num_index_tuples) + *num_index_tuples += 1; + } + } + + /* retain the pin on primary bucket page till end of bucket scan */ + if (blkno == bucket_blkno) + retain_pin = true; + else + retain_pin = false; + + blkno = opaque->hasho_nextblkno; + + /* + * Apply deletions, advance to next page and write page if needed. + */ + if (ndeletable > 0) + { + /* No ereport(ERROR) until changes are logged */ + START_CRIT_SECTION(); + + PageIndexMultiDelete(page, deletable, ndeletable); + bucket_dirty = true; + + /* + * Let us mark the page as clean if vacuum removes the DEAD tuples + * from an index page. We do this by clearing + * LH_PAGE_HAS_DEAD_TUPLES flag. + */ + if (tuples_removed && *tuples_removed > 0 && + H_HAS_DEAD_TUPLES(opaque)) + { + opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; + clear_dead_marking = true; + } + + MarkBufferDirty(buf); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + xl_hash_delete xlrec; + XLogRecPtr recptr; + + xlrec.clear_dead_marking = clear_dead_marking; + xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfHashDelete); + + /* + * bucket buffer needs to be registered to ensure that we can + * acquire a cleanup lock on it during replay. + */ + if (!xlrec.is_primary_bucket_page) + XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE); + + XLogRegisterBuffer(1, buf, REGBUF_STANDARD); + XLogRegisterBufData(1, (char *) deletable, + ndeletable * sizeof(OffsetNumber)); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE); + PageSetLSN(BufferGetPage(buf), recptr); + } + + END_CRIT_SECTION(); + } + + /* bail out if there are no more pages to scan. */ + if (!BlockNumberIsValid(blkno)) + break; + + next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE, + LH_OVERFLOW_PAGE, + bstrategy); + + /* + * release the lock on previous page after acquiring the lock on next + * page + */ + if (retain_pin) + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + else + _hash_relbuf(rel, buf); + + buf = next_buf; + } + + /* + * lock the bucket page to clear the garbage flag and squeeze the bucket. + * if the current buffer is same as bucket buffer, then we already have + * lock on bucket page. + */ + if (buf != bucket_buf) + { + _hash_relbuf(rel, buf); + LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE); + } + + /* + * Clear the garbage flag from bucket after deleting the tuples that are + * moved by split. We purposefully clear the flag before squeeze bucket, + * so that after restart, vacuum shouldn't again try to delete the moved + * by split tuples. + */ + if (split_cleanup) + { + HashPageOpaque bucket_opaque; + Page page; + + page = BufferGetPage(bucket_buf); + bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page); + + /* No ereport(ERROR) until changes are logged */ + START_CRIT_SECTION(); + + bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP; + MarkBufferDirty(bucket_buf); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + + XLogBeginInsert(); + XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP); + PageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + } + + /* + * If we have deleted anything, try to compact free space. For squeezing + * the bucket, we must have a cleanup lock, else it can impact the + * ordering of tuples for a scan that has started before it. + */ + if (bucket_dirty && IsBufferCleanupOK(bucket_buf)) + _hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf, + bstrategy); + else + LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK); +} diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c new file mode 100644 index 0000000..af35a99 --- /dev/null +++ b/src/backend/access/hash/hash_xlog.c @@ -0,0 +1,1145 @@ +/*------------------------------------------------------------------------- + * + * hash_xlog.c + * WAL replay logic for hash index. + * + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/hash/hash_xlog.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/bufmask.h" +#include "access/hash.h" +#include "access/hash_xlog.h" +#include "access/transam.h" +#include "access/xlog.h" +#include "access/xlogutils.h" +#include "miscadmin.h" +#include "storage/procarray.h" + +/* + * replay a hash index meta page + */ +static void +hash_xlog_init_meta_page(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Page page; + Buffer metabuf; + ForkNumber forknum; + + xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record); + + /* create the index' metapage */ + metabuf = XLogInitBufferForRedo(record, 0); + Assert(BufferIsValid(metabuf)); + _hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid, + xlrec->ffactor, true); + page = (Page) BufferGetPage(metabuf); + PageSetLSN(page, lsn); + MarkBufferDirty(metabuf); + + /* + * Force the on-disk state of init forks to always be in sync with the + * state in shared buffers. See XLogReadBufferForRedoExtended. We need + * special handling for init forks as create index operations don't log a + * full page image of the metapage. + */ + XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL); + if (forknum == INIT_FORKNUM) + FlushOneBuffer(metabuf); + + /* all done */ + UnlockReleaseBuffer(metabuf); +} + +/* + * replay a hash index bitmap page + */ +static void +hash_xlog_init_bitmap_page(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer bitmapbuf; + Buffer metabuf; + Page page; + HashMetaPage metap; + uint32 num_buckets; + ForkNumber forknum; + + xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record); + + /* + * Initialize bitmap page + */ + bitmapbuf = XLogInitBufferForRedo(record, 0); + _hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true); + PageSetLSN(BufferGetPage(bitmapbuf), lsn); + MarkBufferDirty(bitmapbuf); + + /* + * Force the on-disk state of init forks to always be in sync with the + * state in shared buffers. See XLogReadBufferForRedoExtended. We need + * special handling for init forks as create index operations don't log a + * full page image of the metapage. + */ + XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL); + if (forknum == INIT_FORKNUM) + FlushOneBuffer(bitmapbuf); + UnlockReleaseBuffer(bitmapbuf); + + /* add the new bitmap page to the metapage's list of bitmaps */ + if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO) + { + /* + * Note: in normal operation, we'd update the metapage while still + * holding lock on the bitmap page. But during replay it's not + * necessary to hold that lock, since nobody can see it yet; the + * creating transaction hasn't yet committed. + */ + page = BufferGetPage(metabuf); + metap = HashPageGetMeta(page); + + num_buckets = metap->hashm_maxbucket + 1; + metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1; + metap->hashm_nmaps++; + + PageSetLSN(page, lsn); + MarkBufferDirty(metabuf); + + XLogRecGetBlockTag(record, 1, NULL, &forknum, NULL); + if (forknum == INIT_FORKNUM) + FlushOneBuffer(metabuf); + } + if (BufferIsValid(metabuf)) + UnlockReleaseBuffer(metabuf); +} + +/* + * replay a hash index insert without split + */ +static void +hash_xlog_insert(XLogReaderState *record) +{ + HashMetaPage metap; + XLogRecPtr lsn = record->EndRecPtr; + xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record); + Buffer buffer; + Page page; + + if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO) + { + Size datalen; + char *datapos = XLogRecGetBlockData(record, 0, &datalen); + + page = BufferGetPage(buffer); + + if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum, + false, false) == InvalidOffsetNumber) + elog(PANIC, "hash_xlog_insert: failed to add item"); + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO) + { + /* + * Note: in normal operation, we'd update the metapage while still + * holding lock on the page we inserted into. But during replay it's + * not necessary to hold that lock, since no other index updates can + * be happening concurrently. + */ + page = BufferGetPage(buffer); + metap = HashPageGetMeta(page); + metap->hashm_ntuples += 1; + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); +} + +/* + * replay addition of overflow page for hash index + */ +static void +hash_xlog_add_ovfl_page(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_hash_add_ovfl_page *xlrec = (xl_hash_add_ovfl_page *) XLogRecGetData(record); + Buffer leftbuf; + Buffer ovflbuf; + Buffer metabuf; + BlockNumber leftblk; + BlockNumber rightblk; + BlockNumber newmapblk = InvalidBlockNumber; + Page ovflpage; + HashPageOpaque ovflopaque; + uint32 *num_bucket; + char *data; + Size datalen PG_USED_FOR_ASSERTS_ONLY; + bool new_bmpage = false; + + XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk); + XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk); + + ovflbuf = XLogInitBufferForRedo(record, 0); + Assert(BufferIsValid(ovflbuf)); + + data = XLogRecGetBlockData(record, 0, &datalen); + num_bucket = (uint32 *) data; + Assert(datalen == sizeof(uint32)); + _hash_initbuf(ovflbuf, InvalidBlockNumber, *num_bucket, LH_OVERFLOW_PAGE, + true); + /* update backlink */ + ovflpage = BufferGetPage(ovflbuf); + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + ovflopaque->hasho_prevblkno = leftblk; + + PageSetLSN(ovflpage, lsn); + MarkBufferDirty(ovflbuf); + + if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO) + { + Page leftpage; + HashPageOpaque leftopaque; + + leftpage = BufferGetPage(leftbuf); + leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage); + leftopaque->hasho_nextblkno = rightblk; + + PageSetLSN(leftpage, lsn); + MarkBufferDirty(leftbuf); + } + + if (BufferIsValid(leftbuf)) + UnlockReleaseBuffer(leftbuf); + UnlockReleaseBuffer(ovflbuf); + + /* + * Note: in normal operation, we'd update the bitmap and meta page while + * still holding lock on the overflow pages. But during replay it's not + * necessary to hold those locks, since no other index updates can be + * happening concurrently. + */ + if (XLogRecHasBlockRef(record, 2)) + { + Buffer mapbuffer; + + if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO) + { + Page mappage = (Page) BufferGetPage(mapbuffer); + uint32 *freep = NULL; + char *data; + uint32 *bitmap_page_bit; + + freep = HashPageGetBitmap(mappage); + + data = XLogRecGetBlockData(record, 2, &datalen); + bitmap_page_bit = (uint32 *) data; + + SETBIT(freep, *bitmap_page_bit); + + PageSetLSN(mappage, lsn); + MarkBufferDirty(mapbuffer); + } + if (BufferIsValid(mapbuffer)) + UnlockReleaseBuffer(mapbuffer); + } + + if (XLogRecHasBlockRef(record, 3)) + { + Buffer newmapbuf; + + newmapbuf = XLogInitBufferForRedo(record, 3); + + _hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true); + + new_bmpage = true; + newmapblk = BufferGetBlockNumber(newmapbuf); + + MarkBufferDirty(newmapbuf); + PageSetLSN(BufferGetPage(newmapbuf), lsn); + + UnlockReleaseBuffer(newmapbuf); + } + + if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO) + { + HashMetaPage metap; + Page page; + uint32 *firstfree_ovflpage; + + data = XLogRecGetBlockData(record, 4, &datalen); + firstfree_ovflpage = (uint32 *) data; + + page = BufferGetPage(metabuf); + metap = HashPageGetMeta(page); + metap->hashm_firstfree = *firstfree_ovflpage; + + if (!xlrec->bmpage_found) + { + metap->hashm_spares[metap->hashm_ovflpoint]++; + + if (new_bmpage) + { + Assert(BlockNumberIsValid(newmapblk)); + + metap->hashm_mapp[metap->hashm_nmaps] = newmapblk; + metap->hashm_nmaps++; + metap->hashm_spares[metap->hashm_ovflpoint]++; + } + } + + PageSetLSN(page, lsn); + MarkBufferDirty(metabuf); + } + if (BufferIsValid(metabuf)) + UnlockReleaseBuffer(metabuf); +} + +/* + * replay allocation of page for split operation + */ +static void +hash_xlog_split_allocate_page(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_hash_split_allocate_page *xlrec = (xl_hash_split_allocate_page *) XLogRecGetData(record); + Buffer oldbuf; + Buffer newbuf; + Buffer metabuf; + Size datalen PG_USED_FOR_ASSERTS_ONLY; + char *data; + XLogRedoAction action; + + /* + * To be consistent with normal operation, here we take cleanup locks on + * both the old and new buckets even though there can't be any concurrent + * inserts. + */ + + /* replay the record for old bucket */ + action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &oldbuf); + + /* + * Note that we still update the page even if it was restored from a full + * page image, because the special space is not included in the image. + */ + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + Page oldpage; + HashPageOpaque oldopaque; + + oldpage = BufferGetPage(oldbuf); + oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage); + + oldopaque->hasho_flag = xlrec->old_bucket_flag; + oldopaque->hasho_prevblkno = xlrec->new_bucket; + + PageSetLSN(oldpage, lsn); + MarkBufferDirty(oldbuf); + } + + /* replay the record for new bucket */ + newbuf = XLogInitBufferForRedo(record, 1); + _hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket, + xlrec->new_bucket_flag, true); + if (!IsBufferCleanupOK(newbuf)) + elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock"); + MarkBufferDirty(newbuf); + PageSetLSN(BufferGetPage(newbuf), lsn); + + /* + * We can release the lock on old bucket early as well but doing here to + * consistent with normal operation. + */ + if (BufferIsValid(oldbuf)) + UnlockReleaseBuffer(oldbuf); + if (BufferIsValid(newbuf)) + UnlockReleaseBuffer(newbuf); + + /* + * Note: in normal operation, we'd update the meta page while still + * holding lock on the old and new bucket pages. But during replay it's + * not necessary to hold those locks, since no other bucket splits can be + * happening concurrently. + */ + + /* replay the record for metapage changes */ + if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO) + { + Page page; + HashMetaPage metap; + + page = BufferGetPage(metabuf); + metap = HashPageGetMeta(page); + metap->hashm_maxbucket = xlrec->new_bucket; + + data = XLogRecGetBlockData(record, 2, &datalen); + + if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) + { + uint32 lowmask; + uint32 *highmask; + + /* extract low and high masks. */ + memcpy(&lowmask, data, sizeof(uint32)); + highmask = (uint32 *) ((char *) data + sizeof(uint32)); + + /* update metapage */ + metap->hashm_lowmask = lowmask; + metap->hashm_highmask = *highmask; + + data += sizeof(uint32) * 2; + } + + if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) + { + uint32 ovflpoint; + uint32 *ovflpages; + + /* extract information of overflow pages. */ + memcpy(&ovflpoint, data, sizeof(uint32)); + ovflpages = (uint32 *) ((char *) data + sizeof(uint32)); + + /* update metapage */ + metap->hashm_spares[ovflpoint] = *ovflpages; + metap->hashm_ovflpoint = ovflpoint; + } + + MarkBufferDirty(metabuf); + PageSetLSN(BufferGetPage(metabuf), lsn); + } + + if (BufferIsValid(metabuf)) + UnlockReleaseBuffer(metabuf); +} + +/* + * replay of split operation + */ +static void +hash_xlog_split_page(XLogReaderState *record) +{ + Buffer buf; + + if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED) + elog(ERROR, "Hash split record did not contain a full-page image"); + + UnlockReleaseBuffer(buf); +} + +/* + * replay completion of split operation + */ +static void +hash_xlog_split_complete(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record); + Buffer oldbuf; + Buffer newbuf; + XLogRedoAction action; + + /* replay the record for old bucket */ + action = XLogReadBufferForRedo(record, 0, &oldbuf); + + /* + * Note that we still update the page even if it was restored from a full + * page image, because the bucket flag is not included in the image. + */ + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + Page oldpage; + HashPageOpaque oldopaque; + + oldpage = BufferGetPage(oldbuf); + oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage); + + oldopaque->hasho_flag = xlrec->old_bucket_flag; + + PageSetLSN(oldpage, lsn); + MarkBufferDirty(oldbuf); + } + if (BufferIsValid(oldbuf)) + UnlockReleaseBuffer(oldbuf); + + /* replay the record for new bucket */ + action = XLogReadBufferForRedo(record, 1, &newbuf); + + /* + * Note that we still update the page even if it was restored from a full + * page image, because the bucket flag is not included in the image. + */ + if (action == BLK_NEEDS_REDO || action == BLK_RESTORED) + { + Page newpage; + HashPageOpaque nopaque; + + newpage = BufferGetPage(newbuf); + nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage); + + nopaque->hasho_flag = xlrec->new_bucket_flag; + + PageSetLSN(newpage, lsn); + MarkBufferDirty(newbuf); + } + if (BufferIsValid(newbuf)) + UnlockReleaseBuffer(newbuf); +} + +/* + * replay move of page contents for squeeze operation of hash index + */ +static void +hash_xlog_move_page_contents(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record); + Buffer bucketbuf = InvalidBuffer; + Buffer writebuf = InvalidBuffer; + Buffer deletebuf = InvalidBuffer; + XLogRedoAction action; + + /* + * Ensure we have a cleanup lock on primary bucket page before we start + * with the actual replay operation. This is to ensure that neither a + * scan can start nor a scan can be already-in-progress during the replay + * of this operation. If we allow scans during this operation, then they + * can miss some records or show the same record multiple times. + */ + if (xldata->is_prim_bucket_same_wrt) + action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf); + else + { + /* + * we don't care for return value as the purpose of reading bucketbuf + * is to ensure a cleanup lock on primary bucket page. + */ + (void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf); + + action = XLogReadBufferForRedo(record, 1, &writebuf); + } + + /* replay the record for adding entries in overflow buffer */ + if (action == BLK_NEEDS_REDO) + { + Page writepage; + char *begin; + char *data; + Size datalen; + uint16 ninserted = 0; + + data = begin = XLogRecGetBlockData(record, 1, &datalen); + + writepage = (Page) BufferGetPage(writebuf); + + if (xldata->ntups > 0) + { + OffsetNumber *towrite = (OffsetNumber *) data; + + data += sizeof(OffsetNumber) * xldata->ntups; + + while (data - begin < datalen) + { + IndexTuple itup = (IndexTuple) data; + Size itemsz; + OffsetNumber l; + + itemsz = IndexTupleSize(itup); + itemsz = MAXALIGN(itemsz); + + data += itemsz; + + l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false); + if (l == InvalidOffsetNumber) + elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes", + (int) itemsz); + + ninserted++; + } + } + + /* + * number of tuples inserted must be same as requested in REDO record. + */ + Assert(ninserted == xldata->ntups); + + PageSetLSN(writepage, lsn); + MarkBufferDirty(writebuf); + } + + /* replay the record for deleting entries from overflow buffer */ + if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO) + { + Page page; + char *ptr; + Size len; + + ptr = XLogRecGetBlockData(record, 2, &len); + + page = (Page) BufferGetPage(deletebuf); + + if (len > 0) + { + OffsetNumber *unused; + OffsetNumber *unend; + + unused = (OffsetNumber *) ptr; + unend = (OffsetNumber *) ((char *) ptr + len); + + if ((unend - unused) > 0) + PageIndexMultiDelete(page, unused, unend - unused); + } + + PageSetLSN(page, lsn); + MarkBufferDirty(deletebuf); + } + + /* + * Replay is complete, now we can release the buffers. We release locks at + * end of replay operation to ensure that we hold lock on primary bucket + * page till end of operation. We can optimize by releasing the lock on + * write buffer as soon as the operation for same is complete, if it is + * not same as primary bucket page, but that doesn't seem to be worth + * complicating the code. + */ + if (BufferIsValid(deletebuf)) + UnlockReleaseBuffer(deletebuf); + + if (BufferIsValid(writebuf)) + UnlockReleaseBuffer(writebuf); + + if (BufferIsValid(bucketbuf)) + UnlockReleaseBuffer(bucketbuf); +} + +/* + * replay squeeze page operation of hash index + */ +static void +hash_xlog_squeeze_page(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record); + Buffer bucketbuf = InvalidBuffer; + Buffer writebuf; + Buffer ovflbuf; + Buffer prevbuf = InvalidBuffer; + Buffer mapbuf; + XLogRedoAction action; + + /* + * Ensure we have a cleanup lock on primary bucket page before we start + * with the actual replay operation. This is to ensure that neither a + * scan can start nor a scan can be already-in-progress during the replay + * of this operation. If we allow scans during this operation, then they + * can miss some records or show the same record multiple times. + */ + if (xldata->is_prim_bucket_same_wrt) + action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf); + else + { + /* + * we don't care for return value as the purpose of reading bucketbuf + * is to ensure a cleanup lock on primary bucket page. + */ + (void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf); + + action = XLogReadBufferForRedo(record, 1, &writebuf); + } + + /* replay the record for adding entries in overflow buffer */ + if (action == BLK_NEEDS_REDO) + { + Page writepage; + char *begin; + char *data; + Size datalen; + uint16 ninserted = 0; + + data = begin = XLogRecGetBlockData(record, 1, &datalen); + + writepage = (Page) BufferGetPage(writebuf); + + if (xldata->ntups > 0) + { + OffsetNumber *towrite = (OffsetNumber *) data; + + data += sizeof(OffsetNumber) * xldata->ntups; + + while (data - begin < datalen) + { + IndexTuple itup = (IndexTuple) data; + Size itemsz; + OffsetNumber l; + + itemsz = IndexTupleSize(itup); + itemsz = MAXALIGN(itemsz); + + data += itemsz; + + l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false); + if (l == InvalidOffsetNumber) + elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes", + (int) itemsz); + + ninserted++; + } + } + + /* + * number of tuples inserted must be same as requested in REDO record. + */ + Assert(ninserted == xldata->ntups); + + /* + * if the page on which are adding tuples is a page previous to freed + * overflow page, then update its nextblkno. + */ + if (xldata->is_prev_bucket_same_wrt) + { + HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage); + + writeopaque->hasho_nextblkno = xldata->nextblkno; + } + + PageSetLSN(writepage, lsn); + MarkBufferDirty(writebuf); + } + + /* replay the record for initializing overflow buffer */ + if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO) + { + Page ovflpage; + HashPageOpaque ovflopaque; + + ovflpage = BufferGetPage(ovflbuf); + + _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf)); + + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + + ovflopaque->hasho_prevblkno = InvalidBlockNumber; + ovflopaque->hasho_nextblkno = InvalidBlockNumber; + ovflopaque->hasho_bucket = -1; + ovflopaque->hasho_flag = LH_UNUSED_PAGE; + ovflopaque->hasho_page_id = HASHO_PAGE_ID; + + PageSetLSN(ovflpage, lsn); + MarkBufferDirty(ovflbuf); + } + if (BufferIsValid(ovflbuf)) + UnlockReleaseBuffer(ovflbuf); + + /* replay the record for page previous to the freed overflow page */ + if (!xldata->is_prev_bucket_same_wrt && + XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO) + { + Page prevpage = BufferGetPage(prevbuf); + HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage); + + prevopaque->hasho_nextblkno = xldata->nextblkno; + + PageSetLSN(prevpage, lsn); + MarkBufferDirty(prevbuf); + } + if (BufferIsValid(prevbuf)) + UnlockReleaseBuffer(prevbuf); + + /* replay the record for page next to the freed overflow page */ + if (XLogRecHasBlockRef(record, 4)) + { + Buffer nextbuf; + + if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO) + { + Page nextpage = BufferGetPage(nextbuf); + HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage); + + nextopaque->hasho_prevblkno = xldata->prevblkno; + + PageSetLSN(nextpage, lsn); + MarkBufferDirty(nextbuf); + } + if (BufferIsValid(nextbuf)) + UnlockReleaseBuffer(nextbuf); + } + + if (BufferIsValid(writebuf)) + UnlockReleaseBuffer(writebuf); + + if (BufferIsValid(bucketbuf)) + UnlockReleaseBuffer(bucketbuf); + + /* + * Note: in normal operation, we'd update the bitmap and meta page while + * still holding lock on the primary bucket page and overflow pages. But + * during replay it's not necessary to hold those locks, since no other + * index updates can be happening concurrently. + */ + /* replay the record for bitmap page */ + if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO) + { + Page mappage = (Page) BufferGetPage(mapbuf); + uint32 *freep = NULL; + char *data; + uint32 *bitmap_page_bit; + Size datalen; + + freep = HashPageGetBitmap(mappage); + + data = XLogRecGetBlockData(record, 5, &datalen); + bitmap_page_bit = (uint32 *) data; + + CLRBIT(freep, *bitmap_page_bit); + + PageSetLSN(mappage, lsn); + MarkBufferDirty(mapbuf); + } + if (BufferIsValid(mapbuf)) + UnlockReleaseBuffer(mapbuf); + + /* replay the record for meta page */ + if (XLogRecHasBlockRef(record, 6)) + { + Buffer metabuf; + + if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO) + { + HashMetaPage metap; + Page page; + char *data; + uint32 *firstfree_ovflpage; + Size datalen; + + data = XLogRecGetBlockData(record, 6, &datalen); + firstfree_ovflpage = (uint32 *) data; + + page = BufferGetPage(metabuf); + metap = HashPageGetMeta(page); + metap->hashm_firstfree = *firstfree_ovflpage; + + PageSetLSN(page, lsn); + MarkBufferDirty(metabuf); + } + if (BufferIsValid(metabuf)) + UnlockReleaseBuffer(metabuf); + } +} + +/* + * replay delete operation of hash index + */ +static void +hash_xlog_delete(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record); + Buffer bucketbuf = InvalidBuffer; + Buffer deletebuf; + Page page; + XLogRedoAction action; + + /* + * Ensure we have a cleanup lock on primary bucket page before we start + * with the actual replay operation. This is to ensure that neither a + * scan can start nor a scan can be already-in-progress during the replay + * of this operation. If we allow scans during this operation, then they + * can miss some records or show the same record multiple times. + */ + if (xldata->is_primary_bucket_page) + action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf); + else + { + /* + * we don't care for return value as the purpose of reading bucketbuf + * is to ensure a cleanup lock on primary bucket page. + */ + (void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf); + + action = XLogReadBufferForRedo(record, 1, &deletebuf); + } + + /* replay the record for deleting entries in bucket page */ + if (action == BLK_NEEDS_REDO) + { + char *ptr; + Size len; + + ptr = XLogRecGetBlockData(record, 1, &len); + + page = (Page) BufferGetPage(deletebuf); + + if (len > 0) + { + OffsetNumber *unused; + OffsetNumber *unend; + + unused = (OffsetNumber *) ptr; + unend = (OffsetNumber *) ((char *) ptr + len); + + if ((unend - unused) > 0) + PageIndexMultiDelete(page, unused, unend - unused); + } + + /* + * Mark the page as not containing any LP_DEAD items only if + * clear_dead_marking flag is set to true. See comments in + * hashbucketcleanup() for details. + */ + if (xldata->clear_dead_marking) + { + HashPageOpaque pageopaque; + + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; + } + + PageSetLSN(page, lsn); + MarkBufferDirty(deletebuf); + } + if (BufferIsValid(deletebuf)) + UnlockReleaseBuffer(deletebuf); + + if (BufferIsValid(bucketbuf)) + UnlockReleaseBuffer(bucketbuf); +} + +/* + * replay split cleanup flag operation for primary bucket page. + */ +static void +hash_xlog_split_cleanup(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + Buffer buffer; + Page page; + + if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO) + { + HashPageOpaque bucket_opaque; + + page = (Page) BufferGetPage(buffer); + + bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page); + bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP; + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); +} + +/* + * replay for update meta page + */ +static void +hash_xlog_update_meta_page(XLogReaderState *record) +{ + HashMetaPage metap; + XLogRecPtr lsn = record->EndRecPtr; + xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record); + Buffer metabuf; + Page page; + + if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO) + { + page = BufferGetPage(metabuf); + metap = HashPageGetMeta(page); + + metap->hashm_ntuples = xldata->ntuples; + + PageSetLSN(page, lsn); + MarkBufferDirty(metabuf); + } + if (BufferIsValid(metabuf)) + UnlockReleaseBuffer(metabuf); +} + +/* + * replay delete operation in hash index to remove + * tuples marked as DEAD during index tuple insertion. + */ +static void +hash_xlog_vacuum_one_page(XLogReaderState *record) +{ + XLogRecPtr lsn = record->EndRecPtr; + xl_hash_vacuum_one_page *xldata; + Buffer buffer; + Buffer metabuf; + Page page; + XLogRedoAction action; + HashPageOpaque pageopaque; + + xldata = (xl_hash_vacuum_one_page *) XLogRecGetData(record); + + /* + * If we have any conflict processing to do, it must happen before we + * update the page. + * + * Hash index records that are marked as LP_DEAD and being removed during + * hash index tuple insertion can conflict with standby queries. You might + * think that vacuum records would conflict as well, but we've handled + * that already. XLOG_HEAP2_PRUNE records provide the highest xid cleaned + * by the vacuum of the heap and so we can resolve any conflicts just once + * when that arrives. After that we know that no conflicts exist from + * individual hash index vacuum records on that index. + */ + if (InHotStandby) + { + RelFileNode rnode; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL); + ResolveRecoveryConflictWithSnapshot(xldata->latestRemovedXid, rnode); + } + + action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer); + + if (action == BLK_NEEDS_REDO) + { + page = (Page) BufferGetPage(buffer); + + if (XLogRecGetDataLen(record) > SizeOfHashVacuumOnePage) + { + OffsetNumber *unused; + + unused = (OffsetNumber *) ((char *) xldata + SizeOfHashVacuumOnePage); + + PageIndexMultiDelete(page, unused, xldata->ntuples); + } + + /* + * Mark the page as not containing any LP_DEAD items. See comments in + * _hash_vacuum_one_page() for details. + */ + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; + + PageSetLSN(page, lsn); + MarkBufferDirty(buffer); + } + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); + + if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO) + { + Page metapage; + HashMetaPage metap; + + metapage = BufferGetPage(metabuf); + metap = HashPageGetMeta(metapage); + + metap->hashm_ntuples -= xldata->ntuples; + + PageSetLSN(metapage, lsn); + MarkBufferDirty(metabuf); + } + if (BufferIsValid(metabuf)) + UnlockReleaseBuffer(metabuf); +} + +void +hash_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info) + { + case XLOG_HASH_INIT_META_PAGE: + hash_xlog_init_meta_page(record); + break; + case XLOG_HASH_INIT_BITMAP_PAGE: + hash_xlog_init_bitmap_page(record); + break; + case XLOG_HASH_INSERT: + hash_xlog_insert(record); + break; + case XLOG_HASH_ADD_OVFL_PAGE: + hash_xlog_add_ovfl_page(record); + break; + case XLOG_HASH_SPLIT_ALLOCATE_PAGE: + hash_xlog_split_allocate_page(record); + break; + case XLOG_HASH_SPLIT_PAGE: + hash_xlog_split_page(record); + break; + case XLOG_HASH_SPLIT_COMPLETE: + hash_xlog_split_complete(record); + break; + case XLOG_HASH_MOVE_PAGE_CONTENTS: + hash_xlog_move_page_contents(record); + break; + case XLOG_HASH_SQUEEZE_PAGE: + hash_xlog_squeeze_page(record); + break; + case XLOG_HASH_DELETE: + hash_xlog_delete(record); + break; + case XLOG_HASH_SPLIT_CLEANUP: + hash_xlog_split_cleanup(record); + break; + case XLOG_HASH_UPDATE_META_PAGE: + hash_xlog_update_meta_page(record); + break; + case XLOG_HASH_VACUUM_ONE_PAGE: + hash_xlog_vacuum_one_page(record); + break; + default: + elog(PANIC, "hash_redo: unknown op code %u", info); + } +} + +/* + * Mask a hash page before performing consistency checks on it. + */ +void +hash_mask(char *pagedata, BlockNumber blkno) +{ + Page page = (Page) pagedata; + HashPageOpaque opaque; + int pagetype; + + mask_page_lsn_and_checksum(page); + + mask_page_hint_bits(page); + mask_unused_space(page); + + opaque = (HashPageOpaque) PageGetSpecialPointer(page); + + pagetype = opaque->hasho_flag & LH_PAGE_TYPE; + if (pagetype == LH_UNUSED_PAGE) + { + /* + * Mask everything on a UNUSED page. + */ + mask_page_content(page); + } + else if (pagetype == LH_BUCKET_PAGE || + pagetype == LH_OVERFLOW_PAGE) + { + /* + * In hash bucket and overflow pages, it is possible to modify the + * LP_FLAGS without emitting any WAL record. Hence, mask the line + * pointer flags. See hashgettuple(), _hash_kill_items() for details. + */ + mask_lp_flags(page); + } + + /* + * It is possible that the hint bit LH_PAGE_HAS_DEAD_TUPLES may remain + * unlogged. So, mask it. See _hash_kill_items() for details. + */ + opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; +} diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c new file mode 100644 index 0000000..2423339 --- /dev/null +++ b/src/backend/access/hash/hashfunc.c @@ -0,0 +1,411 @@ +/*------------------------------------------------------------------------- + * + * hashfunc.c + * Support functions for hash access method. + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/hash/hashfunc.c + * + * NOTES + * These functions are stored in pg_amproc. For each operator class + * defined for hash indexes, they compute the hash value of the argument. + * + * Additional hash functions appear in /utils/adt/ files for various + * specialized datatypes. + * + * It is expected that every bit of a hash function's 32-bit result is + * as random as every other; failure to ensure this is likely to lead + * to poor performance of hash joins, for example. In most cases a hash + * function should use hash_any() or its variant hash_uint32(). + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "catalog/pg_collation.h" +#include "common/hashfn.h" +#include "utils/builtins.h" +#include "utils/float.h" +#include "utils/pg_locale.h" + +/* + * Datatype-specific hash functions. + * + * These support both hash indexes and hash joins. + * + * NOTE: some of these are also used by catcache operations, without + * any direct connection to hash indexes. Also, the common hash_any + * routine is also used by dynahash tables. + */ + +/* Note: this is used for both "char" and boolean datatypes */ +Datum +hashchar(PG_FUNCTION_ARGS) +{ + return hash_uint32((int32) PG_GETARG_CHAR(0)); +} + +Datum +hashcharextended(PG_FUNCTION_ARGS) +{ + return hash_uint32_extended((int32) PG_GETARG_CHAR(0), PG_GETARG_INT64(1)); +} + +Datum +hashint2(PG_FUNCTION_ARGS) +{ + return hash_uint32((int32) PG_GETARG_INT16(0)); +} + +Datum +hashint2extended(PG_FUNCTION_ARGS) +{ + return hash_uint32_extended((int32) PG_GETARG_INT16(0), PG_GETARG_INT64(1)); +} + +Datum +hashint4(PG_FUNCTION_ARGS) +{ + return hash_uint32(PG_GETARG_INT32(0)); +} + +Datum +hashint4extended(PG_FUNCTION_ARGS) +{ + return hash_uint32_extended(PG_GETARG_INT32(0), PG_GETARG_INT64(1)); +} + +Datum +hashint8(PG_FUNCTION_ARGS) +{ + /* + * The idea here is to produce a hash value compatible with the values + * produced by hashint4 and hashint2 for logically equal inputs; this is + * necessary to support cross-type hash joins across these input types. + * Since all three types are signed, we can xor the high half of the int8 + * value if the sign is positive, or the complement of the high half when + * the sign is negative. + */ + int64 val = PG_GETARG_INT64(0); + uint32 lohalf = (uint32) val; + uint32 hihalf = (uint32) (val >> 32); + + lohalf ^= (val >= 0) ? hihalf : ~hihalf; + + return hash_uint32(lohalf); +} + +Datum +hashint8extended(PG_FUNCTION_ARGS) +{ + /* Same approach as hashint8 */ + int64 val = PG_GETARG_INT64(0); + uint32 lohalf = (uint32) val; + uint32 hihalf = (uint32) (val >> 32); + + lohalf ^= (val >= 0) ? hihalf : ~hihalf; + + return hash_uint32_extended(lohalf, PG_GETARG_INT64(1)); +} + +Datum +hashoid(PG_FUNCTION_ARGS) +{ + return hash_uint32((uint32) PG_GETARG_OID(0)); +} + +Datum +hashoidextended(PG_FUNCTION_ARGS) +{ + return hash_uint32_extended((uint32) PG_GETARG_OID(0), PG_GETARG_INT64(1)); +} + +Datum +hashenum(PG_FUNCTION_ARGS) +{ + return hash_uint32((uint32) PG_GETARG_OID(0)); +} + +Datum +hashenumextended(PG_FUNCTION_ARGS) +{ + return hash_uint32_extended((uint32) PG_GETARG_OID(0), PG_GETARG_INT64(1)); +} + +Datum +hashfloat4(PG_FUNCTION_ARGS) +{ + float4 key = PG_GETARG_FLOAT4(0); + float8 key8; + + /* + * On IEEE-float machines, minus zero and zero have different bit patterns + * but should compare as equal. We must ensure that they have the same + * hash value, which is most reliably done this way: + */ + if (key == (float4) 0) + PG_RETURN_UINT32(0); + + /* + * To support cross-type hashing of float8 and float4, we want to return + * the same hash value hashfloat8 would produce for an equal float8 value. + * So, widen the value to float8 and hash that. (We must do this rather + * than have hashfloat8 try to narrow its value to float4; that could fail + * on overflow.) + */ + key8 = key; + + /* + * Similarly, NaNs can have different bit patterns but they should all + * compare as equal. For backwards-compatibility reasons we force them to + * have the hash value of a standard float8 NaN. (You'd think we could + * replace key with a float4 NaN and then widen it; but on some old + * platforms, that way produces a different bit pattern.) + */ + if (isnan(key8)) + key8 = get_float8_nan(); + + return hash_any((unsigned char *) &key8, sizeof(key8)); +} + +Datum +hashfloat4extended(PG_FUNCTION_ARGS) +{ + float4 key = PG_GETARG_FLOAT4(0); + uint64 seed = PG_GETARG_INT64(1); + float8 key8; + + /* Same approach as hashfloat4 */ + if (key == (float4) 0) + PG_RETURN_UINT64(seed); + key8 = key; + if (isnan(key8)) + key8 = get_float8_nan(); + + return hash_any_extended((unsigned char *) &key8, sizeof(key8), seed); +} + +Datum +hashfloat8(PG_FUNCTION_ARGS) +{ + float8 key = PG_GETARG_FLOAT8(0); + + /* + * On IEEE-float machines, minus zero and zero have different bit patterns + * but should compare as equal. We must ensure that they have the same + * hash value, which is most reliably done this way: + */ + if (key == (float8) 0) + PG_RETURN_UINT32(0); + + /* + * Similarly, NaNs can have different bit patterns but they should all + * compare as equal. For backwards-compatibility reasons we force them to + * have the hash value of a standard NaN. + */ + if (isnan(key)) + key = get_float8_nan(); + + return hash_any((unsigned char *) &key, sizeof(key)); +} + +Datum +hashfloat8extended(PG_FUNCTION_ARGS) +{ + float8 key = PG_GETARG_FLOAT8(0); + uint64 seed = PG_GETARG_INT64(1); + + /* Same approach as hashfloat8 */ + if (key == (float8) 0) + PG_RETURN_UINT64(seed); + if (isnan(key)) + key = get_float8_nan(); + + return hash_any_extended((unsigned char *) &key, sizeof(key), seed); +} + +Datum +hashoidvector(PG_FUNCTION_ARGS) +{ + oidvector *key = (oidvector *) PG_GETARG_POINTER(0); + + return hash_any((unsigned char *) key->values, key->dim1 * sizeof(Oid)); +} + +Datum +hashoidvectorextended(PG_FUNCTION_ARGS) +{ + oidvector *key = (oidvector *) PG_GETARG_POINTER(0); + + return hash_any_extended((unsigned char *) key->values, + key->dim1 * sizeof(Oid), + PG_GETARG_INT64(1)); +} + +Datum +hashname(PG_FUNCTION_ARGS) +{ + char *key = NameStr(*PG_GETARG_NAME(0)); + + return hash_any((unsigned char *) key, strlen(key)); +} + +Datum +hashnameextended(PG_FUNCTION_ARGS) +{ + char *key = NameStr(*PG_GETARG_NAME(0)); + + return hash_any_extended((unsigned char *) key, strlen(key), + PG_GETARG_INT64(1)); +} + +Datum +hashtext(PG_FUNCTION_ARGS) +{ + text *key = PG_GETARG_TEXT_PP(0); + Oid collid = PG_GET_COLLATION(); + pg_locale_t mylocale = 0; + Datum result; + + if (!collid) + ereport(ERROR, + (errcode(ERRCODE_INDETERMINATE_COLLATION), + errmsg("could not determine which collation to use for string hashing"), + errhint("Use the COLLATE clause to set the collation explicitly."))); + + if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID) + mylocale = pg_newlocale_from_collation(collid); + + if (!mylocale || mylocale->deterministic) + { + result = hash_any((unsigned char *) VARDATA_ANY(key), + VARSIZE_ANY_EXHDR(key)); + } + else + { +#ifdef USE_ICU + if (mylocale->provider == COLLPROVIDER_ICU) + { + int32_t ulen = -1; + UChar *uchar = NULL; + Size bsize; + uint8_t *buf; + + ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key)); + + bsize = ucol_getSortKey(mylocale->info.icu.ucol, + uchar, ulen, NULL, 0); + buf = palloc(bsize); + ucol_getSortKey(mylocale->info.icu.ucol, + uchar, ulen, buf, bsize); + + result = hash_any(buf, bsize); + + pfree(buf); + } + else +#endif + /* shouldn't happen */ + elog(ERROR, "unsupported collprovider: %c", mylocale->provider); + } + + /* Avoid leaking memory for toasted inputs */ + PG_FREE_IF_COPY(key, 0); + + return result; +} + +Datum +hashtextextended(PG_FUNCTION_ARGS) +{ + text *key = PG_GETARG_TEXT_PP(0); + Oid collid = PG_GET_COLLATION(); + pg_locale_t mylocale = 0; + Datum result; + + if (!collid) + ereport(ERROR, + (errcode(ERRCODE_INDETERMINATE_COLLATION), + errmsg("could not determine which collation to use for string hashing"), + errhint("Use the COLLATE clause to set the collation explicitly."))); + + if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID) + mylocale = pg_newlocale_from_collation(collid); + + if (!mylocale || mylocale->deterministic) + { + result = hash_any_extended((unsigned char *) VARDATA_ANY(key), + VARSIZE_ANY_EXHDR(key), + PG_GETARG_INT64(1)); + } + else + { +#ifdef USE_ICU + if (mylocale->provider == COLLPROVIDER_ICU) + { + int32_t ulen = -1; + UChar *uchar = NULL; + Size bsize; + uint8_t *buf; + + ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key)); + + bsize = ucol_getSortKey(mylocale->info.icu.ucol, + uchar, ulen, NULL, 0); + buf = palloc(bsize); + ucol_getSortKey(mylocale->info.icu.ucol, + uchar, ulen, buf, bsize); + + result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1)); + + pfree(buf); + } + else +#endif + /* shouldn't happen */ + elog(ERROR, "unsupported collprovider: %c", mylocale->provider); + } + + PG_FREE_IF_COPY(key, 0); + + return result; +} + +/* + * hashvarlena() can be used for any varlena datatype in which there are + * no non-significant bits, ie, distinct bitpatterns never compare as equal. + */ +Datum +hashvarlena(PG_FUNCTION_ARGS) +{ + struct varlena *key = PG_GETARG_VARLENA_PP(0); + Datum result; + + result = hash_any((unsigned char *) VARDATA_ANY(key), + VARSIZE_ANY_EXHDR(key)); + + /* Avoid leaking memory for toasted inputs */ + PG_FREE_IF_COPY(key, 0); + + return result; +} + +Datum +hashvarlenaextended(PG_FUNCTION_ARGS) +{ + struct varlena *key = PG_GETARG_VARLENA_PP(0); + Datum result; + + result = hash_any_extended((unsigned char *) VARDATA_ANY(key), + VARSIZE_ANY_EXHDR(key), + PG_GETARG_INT64(1)); + + PG_FREE_IF_COPY(key, 0); + + return result; +} diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c new file mode 100644 index 0000000..d254a00 --- /dev/null +++ b/src/backend/access/hash/hashinsert.c @@ -0,0 +1,432 @@ +/*------------------------------------------------------------------------- + * + * hashinsert.c + * Item insertion in hash tables for Postgres. + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/hash/hashinsert.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "access/hash_xlog.h" +#include "miscadmin.h" +#include "storage/buf_internals.h" +#include "storage/lwlock.h" +#include "storage/predicate.h" +#include "utils/rel.h" + +static void _hash_vacuum_one_page(Relation rel, Relation hrel, + Buffer metabuf, Buffer buf); + +/* + * _hash_doinsert() -- Handle insertion of a single index tuple. + * + * This routine is called by the public interface routines, hashbuild + * and hashinsert. By here, itup is completely filled in. + */ +void +_hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel) +{ + Buffer buf = InvalidBuffer; + Buffer bucket_buf; + Buffer metabuf; + HashMetaPage metap; + HashMetaPage usedmetap = NULL; + Page metapage; + Page page; + HashPageOpaque pageopaque; + Size itemsz; + bool do_expand; + uint32 hashkey; + Bucket bucket; + OffsetNumber itup_off; + + /* + * Get the hash key for the item (it's stored in the index tuple itself). + */ + hashkey = _hash_get_indextuple_hashkey(itup); + + /* compute item size too */ + itemsz = IndexTupleSize(itup); + itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we + * need to be consistent */ + +restart_insert: + + /* + * Read the metapage. We don't lock it yet; HashMaxItemSize() will + * examine pd_pagesize_version, but that can't change so we can examine it + * without a lock. + */ + metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_NOLOCK, LH_META_PAGE); + metapage = BufferGetPage(metabuf); + + /* + * Check whether the item can fit on a hash page at all. (Eventually, we + * ought to try to apply TOAST methods if not.) Note that at this point, + * itemsz doesn't include the ItemId. + * + * XXX this is useless code if we are only storing hash keys. + */ + if (itemsz > HashMaxItemSize(metapage)) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("index row size %zu exceeds hash maximum %zu", + itemsz, HashMaxItemSize(metapage)), + errhint("Values larger than a buffer page cannot be indexed."))); + + /* Lock the primary bucket page for the target bucket. */ + buf = _hash_getbucketbuf_from_hashkey(rel, hashkey, HASH_WRITE, + &usedmetap); + Assert(usedmetap != NULL); + + CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(buf)); + + /* remember the primary bucket buffer to release the pin on it at end. */ + bucket_buf = buf; + + page = BufferGetPage(buf); + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + bucket = pageopaque->hasho_bucket; + + /* + * If this bucket is in the process of being split, try to finish the + * split before inserting, because that might create room for the + * insertion to proceed without allocating an additional overflow page. + * It's only interesting to finish the split if we're trying to insert + * into the bucket from which we're removing tuples (the "old" bucket), + * not if we're trying to insert into the bucket into which tuples are + * being moved (the "new" bucket). + */ + if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf)) + { + /* release the lock on bucket buffer, before completing the split. */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + _hash_finish_split(rel, metabuf, buf, bucket, + usedmetap->hashm_maxbucket, + usedmetap->hashm_highmask, + usedmetap->hashm_lowmask); + + /* release the pin on old and meta buffer. retry for insert. */ + _hash_dropbuf(rel, buf); + _hash_dropbuf(rel, metabuf); + goto restart_insert; + } + + /* Do the insertion */ + while (PageGetFreeSpace(page) < itemsz) + { + BlockNumber nextblkno; + + /* + * Check if current page has any DEAD tuples. If yes, delete these + * tuples and see if we can get a space for the new item to be + * inserted before moving to the next page in the bucket chain. + */ + if (H_HAS_DEAD_TUPLES(pageopaque)) + { + + if (IsBufferCleanupOK(buf)) + { + _hash_vacuum_one_page(rel, heapRel, metabuf, buf); + + if (PageGetFreeSpace(page) >= itemsz) + break; /* OK, now we have enough space */ + } + } + + /* + * no space on this page; check for an overflow page + */ + nextblkno = pageopaque->hasho_nextblkno; + + if (BlockNumberIsValid(nextblkno)) + { + /* + * ovfl page exists; go get it. if it doesn't have room, we'll + * find out next pass through the loop test above. we always + * release both the lock and pin if this is an overflow page, but + * only the lock if this is the primary bucket page, since the pin + * on the primary bucket must be retained throughout the scan. + */ + if (buf != bucket_buf) + _hash_relbuf(rel, buf); + else + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE); + page = BufferGetPage(buf); + } + else + { + /* + * we're at the end of the bucket chain and we haven't found a + * page with enough room. allocate a new overflow page. + */ + + /* release our write lock without modifying buffer */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + /* chain to a new overflow page */ + buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf) ? true : false); + page = BufferGetPage(buf); + + /* should fit now, given test above */ + Assert(PageGetFreeSpace(page) >= itemsz); + } + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + Assert((pageopaque->hasho_flag & LH_PAGE_TYPE) == LH_OVERFLOW_PAGE); + Assert(pageopaque->hasho_bucket == bucket); + } + + /* + * Write-lock the metapage so we can increment the tuple count. After + * incrementing it, check to see if it's time for a split. + */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + /* Do the update. No ereport(ERROR) until changes are logged */ + START_CRIT_SECTION(); + + /* found page with enough space, so add the item here */ + itup_off = _hash_pgaddtup(rel, buf, itemsz, itup); + MarkBufferDirty(buf); + + /* metapage operations */ + metap = HashPageGetMeta(metapage); + metap->hashm_ntuples += 1; + + /* Make sure this stays in sync with _hash_expandtable() */ + do_expand = metap->hashm_ntuples > + (double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1); + + MarkBufferDirty(metabuf); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + xl_hash_insert xlrec; + XLogRecPtr recptr; + + xlrec.offnum = itup_off; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfHashInsert); + + XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD); + + XLogRegisterBuffer(0, buf, REGBUF_STANDARD); + XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup)); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT); + + PageSetLSN(BufferGetPage(buf), recptr); + PageSetLSN(BufferGetPage(metabuf), recptr); + } + + END_CRIT_SECTION(); + + /* drop lock on metapage, but keep pin */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + /* + * Release the modified page and ensure to release the pin on primary + * page. + */ + _hash_relbuf(rel, buf); + if (buf != bucket_buf) + _hash_dropbuf(rel, bucket_buf); + + /* Attempt to split if a split is needed */ + if (do_expand) + _hash_expandtable(rel, metabuf); + + /* Finally drop our pin on the metapage */ + _hash_dropbuf(rel, metabuf); +} + +/* + * _hash_pgaddtup() -- add a tuple to a particular page in the index. + * + * This routine adds the tuple to the page as requested; it does not write out + * the page. It is an error to call this function without pin and write lock + * on the target buffer. + * + * Returns the offset number at which the tuple was inserted. This function + * is responsible for preserving the condition that tuples in a hash index + * page are sorted by hashkey value. + */ +OffsetNumber +_hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup) +{ + OffsetNumber itup_off; + Page page; + uint32 hashkey; + + _hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE); + page = BufferGetPage(buf); + + /* Find where to insert the tuple (preserving page's hashkey ordering) */ + hashkey = _hash_get_indextuple_hashkey(itup); + itup_off = _hash_binsearch(page, hashkey); + + if (PageAddItem(page, (Item) itup, itemsize, itup_off, false, false) + == InvalidOffsetNumber) + elog(ERROR, "failed to add index item to \"%s\"", + RelationGetRelationName(rel)); + + return itup_off; +} + +/* + * _hash_pgaddmultitup() -- add a tuple vector to a particular page in the + * index. + * + * This routine has same requirements for locking and tuple ordering as + * _hash_pgaddtup(). + * + * Returns the offset number array at which the tuples were inserted. + */ +void +_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups, + OffsetNumber *itup_offsets, uint16 nitups) +{ + OffsetNumber itup_off; + Page page; + uint32 hashkey; + int i; + + _hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE); + page = BufferGetPage(buf); + + for (i = 0; i < nitups; i++) + { + Size itemsize; + + itemsize = IndexTupleSize(itups[i]); + itemsize = MAXALIGN(itemsize); + + /* Find where to insert the tuple (preserving page's hashkey ordering) */ + hashkey = _hash_get_indextuple_hashkey(itups[i]); + itup_off = _hash_binsearch(page, hashkey); + + itup_offsets[i] = itup_off; + + if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false) + == InvalidOffsetNumber) + elog(ERROR, "failed to add index item to \"%s\"", + RelationGetRelationName(rel)); + } +} + +/* + * _hash_vacuum_one_page - vacuum just one index page. + * + * Try to remove LP_DEAD items from the given page. We must acquire cleanup + * lock on the page being modified before calling this function. + */ + +static void +_hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf) +{ + OffsetNumber deletable[MaxOffsetNumber]; + int ndeletable = 0; + OffsetNumber offnum, + maxoff; + Page page = BufferGetPage(buf); + HashPageOpaque pageopaque; + HashMetaPage metap; + + /* Scan each tuple in page to see if it is marked as LP_DEAD */ + maxoff = PageGetMaxOffsetNumber(page); + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemId = PageGetItemId(page, offnum); + + if (ItemIdIsDead(itemId)) + deletable[ndeletable++] = offnum; + } + + if (ndeletable > 0) + { + TransactionId latestRemovedXid; + + latestRemovedXid = + index_compute_xid_horizon_for_tuples(rel, hrel, buf, + deletable, ndeletable); + + /* + * Write-lock the meta page so that we can decrement tuple count. + */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + /* No ereport(ERROR) until changes are logged */ + START_CRIT_SECTION(); + + PageIndexMultiDelete(page, deletable, ndeletable); + + /* + * Mark the page as not containing any LP_DEAD items. This is not + * certainly true (there might be some that have recently been marked, + * but weren't included in our target-item list), but it will almost + * always be true and it doesn't seem worth an additional page scan to + * check it. Remember that LH_PAGE_HAS_DEAD_TUPLES is only a hint + * anyway. + */ + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; + + metap = HashPageGetMeta(BufferGetPage(metabuf)); + metap->hashm_ntuples -= ndeletable; + + MarkBufferDirty(buf); + MarkBufferDirty(metabuf); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + xl_hash_vacuum_one_page xlrec; + XLogRecPtr recptr; + + xlrec.latestRemovedXid = latestRemovedXid; + xlrec.ntuples = ndeletable; + + XLogBeginInsert(); + XLogRegisterBuffer(0, buf, REGBUF_STANDARD); + XLogRegisterData((char *) &xlrec, SizeOfHashVacuumOnePage); + + /* + * We need the target-offsets array whether or not we store the + * whole buffer, to allow us to find the latestRemovedXid on a + * standby server. + */ + XLogRegisterData((char *) deletable, + ndeletable * sizeof(OffsetNumber)); + + XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE); + + PageSetLSN(BufferGetPage(buf), recptr); + PageSetLSN(BufferGetPage(metabuf), recptr); + } + + END_CRIT_SECTION(); + + /* + * Releasing write lock on meta page as we have updated the tuple + * count. + */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + } +} diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c new file mode 100644 index 0000000..1ff2e0c --- /dev/null +++ b/src/backend/access/hash/hashovfl.c @@ -0,0 +1,1083 @@ +/*------------------------------------------------------------------------- + * + * hashovfl.c + * Overflow page management code for the Postgres hash access method + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/hash/hashovfl.c + * + * NOTES + * Overflow pages look like ordinary relation pages. + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/hash.h" +#include "access/hash_xlog.h" +#include "miscadmin.h" +#include "utils/rel.h" + + +static uint32 _hash_firstfreebit(uint32 map); + + +/* + * Convert overflow page bit number (its index in the free-page bitmaps) + * to block number within the index. + */ +static BlockNumber +bitno_to_blkno(HashMetaPage metap, uint32 ovflbitnum) +{ + uint32 splitnum = metap->hashm_ovflpoint; + uint32 i; + + /* Convert zero-based bitnumber to 1-based page number */ + ovflbitnum += 1; + + /* Determine the split number for this page (must be >= 1) */ + for (i = 1; + i < splitnum && ovflbitnum > metap->hashm_spares[i]; + i++) + /* loop */ ; + + /* + * Convert to absolute page number by adding the number of bucket pages + * that exist before this split point. + */ + return (BlockNumber) (_hash_get_totalbuckets(i) + ovflbitnum); +} + +/* + * _hash_ovflblkno_to_bitno + * + * Convert overflow page block number to bit number for free-page bitmap. + */ +uint32 +_hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno) +{ + uint32 splitnum = metap->hashm_ovflpoint; + uint32 i; + uint32 bitnum; + + /* Determine the split number containing this page */ + for (i = 1; i <= splitnum; i++) + { + if (ovflblkno <= (BlockNumber) _hash_get_totalbuckets(i)) + break; /* oops */ + bitnum = ovflblkno - _hash_get_totalbuckets(i); + + /* + * bitnum has to be greater than number of overflow page added in + * previous split point. The overflow page at this splitnum (i) if any + * should start from (_hash_get_totalbuckets(i) + + * metap->hashm_spares[i - 1] + 1). + */ + if (bitnum > metap->hashm_spares[i - 1] && + bitnum <= metap->hashm_spares[i]) + return bitnum - 1; /* -1 to convert 1-based to 0-based */ + } + + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("invalid overflow block number %u", ovflblkno))); + return 0; /* keep compiler quiet */ +} + +/* + * _hash_addovflpage + * + * Add an overflow page to the bucket whose last page is pointed to by 'buf'. + * + * On entry, the caller must hold a pin but no lock on 'buf'. The pin is + * dropped before exiting (we assume the caller is not interested in 'buf' + * anymore) if not asked to retain. The pin will be retained only for the + * primary bucket. The returned overflow page will be pinned and + * write-locked; it is guaranteed to be empty. + * + * The caller must hold a pin, but no lock, on the metapage buffer. + * That buffer is returned in the same state. + * + * NB: since this could be executed concurrently by multiple processes, + * one should not assume that the returned overflow page will be the + * immediate successor of the originally passed 'buf'. Additional overflow + * pages might have been added to the bucket chain in between. + */ +Buffer +_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin) +{ + Buffer ovflbuf; + Page page; + Page ovflpage; + HashPageOpaque pageopaque; + HashPageOpaque ovflopaque; + HashMetaPage metap; + Buffer mapbuf = InvalidBuffer; + Buffer newmapbuf = InvalidBuffer; + BlockNumber blkno; + uint32 orig_firstfree; + uint32 splitnum; + uint32 *freep = NULL; + uint32 max_ovflpg; + uint32 bit; + uint32 bitmap_page_bit; + uint32 first_page; + uint32 last_bit; + uint32 last_page; + uint32 i, + j; + bool page_found = false; + + /* + * Write-lock the tail page. Here, we need to maintain locking order such + * that, first acquire the lock on tail page of bucket, then on meta page + * to find and lock the bitmap page and if it is found, then lock on meta + * page is released, then finally acquire the lock on new overflow buffer. + * We need this locking order to avoid deadlock with backends that are + * doing inserts. + * + * Note: We could have avoided locking many buffers here if we made two + * WAL records for acquiring an overflow page (one to allocate an overflow + * page and another to add it to overflow bucket chain). However, doing + * so can leak an overflow page, if the system crashes after allocation. + * Needless to say, it is better to have a single record from a + * performance point of view as well. + */ + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + /* probably redundant... */ + _hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE); + + /* loop to find current tail page, in case someone else inserted too */ + for (;;) + { + BlockNumber nextblkno; + + page = BufferGetPage(buf); + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + nextblkno = pageopaque->hasho_nextblkno; + + if (!BlockNumberIsValid(nextblkno)) + break; + + /* we assume we do not need to write the unmodified page */ + if (retain_pin) + { + /* pin will be retained only for the primary bucket page */ + Assert((pageopaque->hasho_flag & LH_PAGE_TYPE) == LH_BUCKET_PAGE); + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + } + else + _hash_relbuf(rel, buf); + + retain_pin = false; + + buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE); + } + + /* Get exclusive lock on the meta page */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + _hash_checkpage(rel, metabuf, LH_META_PAGE); + metap = HashPageGetMeta(BufferGetPage(metabuf)); + + /* start search at hashm_firstfree */ + orig_firstfree = metap->hashm_firstfree; + first_page = orig_firstfree >> BMPG_SHIFT(metap); + bit = orig_firstfree & BMPG_MASK(metap); + i = first_page; + j = bit / BITS_PER_MAP; + bit &= ~(BITS_PER_MAP - 1); + + /* outer loop iterates once per bitmap page */ + for (;;) + { + BlockNumber mapblkno; + Page mappage; + uint32 last_inpage; + + /* want to end search with the last existing overflow page */ + splitnum = metap->hashm_ovflpoint; + max_ovflpg = metap->hashm_spares[splitnum] - 1; + last_page = max_ovflpg >> BMPG_SHIFT(metap); + last_bit = max_ovflpg & BMPG_MASK(metap); + + if (i > last_page) + break; + + Assert(i < metap->hashm_nmaps); + mapblkno = metap->hashm_mapp[i]; + + if (i == last_page) + last_inpage = last_bit; + else + last_inpage = BMPGSZ_BIT(metap) - 1; + + /* Release exclusive lock on metapage while reading bitmap page */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + mapbuf = _hash_getbuf(rel, mapblkno, HASH_WRITE, LH_BITMAP_PAGE); + mappage = BufferGetPage(mapbuf); + freep = HashPageGetBitmap(mappage); + + for (; bit <= last_inpage; j++, bit += BITS_PER_MAP) + { + if (freep[j] != ALL_SET) + { + page_found = true; + + /* Reacquire exclusive lock on the meta page */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + /* convert bit to bit number within page */ + bit += _hash_firstfreebit(freep[j]); + bitmap_page_bit = bit; + + /* convert bit to absolute bit number */ + bit += (i << BMPG_SHIFT(metap)); + /* Calculate address of the recycled overflow page */ + blkno = bitno_to_blkno(metap, bit); + + /* Fetch and init the recycled page */ + ovflbuf = _hash_getinitbuf(rel, blkno); + + goto found; + } + } + + /* No free space here, try to advance to next map page */ + _hash_relbuf(rel, mapbuf); + mapbuf = InvalidBuffer; + i++; + j = 0; /* scan from start of next map page */ + bit = 0; + + /* Reacquire exclusive lock on the meta page */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + } + + /* + * No free pages --- have to extend the relation to add an overflow page. + * First, check to see if we have to add a new bitmap page too. + */ + if (last_bit == (uint32) (BMPGSZ_BIT(metap) - 1)) + { + /* + * We create the new bitmap page with all pages marked "in use". + * Actually two pages in the new bitmap's range will exist + * immediately: the bitmap page itself, and the following page which + * is the one we return to the caller. Both of these are correctly + * marked "in use". Subsequent pages do not exist yet, but it is + * convenient to pre-mark them as "in use" too. + */ + bit = metap->hashm_spares[splitnum]; + + /* metapage already has a write lock */ + if (metap->hashm_nmaps >= HASH_MAX_BITMAPS) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("out of overflow pages in hash index \"%s\"", + RelationGetRelationName(rel)))); + + newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM); + } + else + { + /* + * Nothing to do here; since the page will be past the last used page, + * we know its bitmap bit was preinitialized to "in use". + */ + } + + /* Calculate address of the new overflow page */ + bit = BufferIsValid(newmapbuf) ? + metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum]; + blkno = bitno_to_blkno(metap, bit); + + /* + * Fetch the page with _hash_getnewbuf to ensure smgr's idea of the + * relation length stays in sync with ours. XXX It's annoying to do this + * with metapage write lock held; would be better to use a lock that + * doesn't block incoming searches. + * + * It is okay to hold two buffer locks here (one on tail page of bucket + * and other on new overflow page) since there cannot be anyone else + * contending for access to ovflbuf. + */ + ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM); + +found: + + /* + * Do the update. No ereport(ERROR) until changes are logged. We want to + * log the changes for bitmap page and overflow page together to avoid + * loss of pages in case the new page is added. + */ + START_CRIT_SECTION(); + + if (page_found) + { + Assert(BufferIsValid(mapbuf)); + + /* mark page "in use" in the bitmap */ + SETBIT(freep, bitmap_page_bit); + MarkBufferDirty(mapbuf); + } + else + { + /* update the count to indicate new overflow page is added */ + metap->hashm_spares[splitnum]++; + + if (BufferIsValid(newmapbuf)) + { + _hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false); + MarkBufferDirty(newmapbuf); + + /* add the new bitmap page to the metapage's list of bitmaps */ + metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf); + metap->hashm_nmaps++; + metap->hashm_spares[splitnum]++; + } + + MarkBufferDirty(metabuf); + + /* + * for new overflow page, we don't need to explicitly set the bit in + * bitmap page, as by default that will be set to "in use". + */ + } + + /* + * Adjust hashm_firstfree to avoid redundant searches. But don't risk + * changing it if someone moved it while we were searching bitmap pages. + */ + if (metap->hashm_firstfree == orig_firstfree) + { + metap->hashm_firstfree = bit + 1; + MarkBufferDirty(metabuf); + } + + /* initialize new overflow page */ + ovflpage = BufferGetPage(ovflbuf); + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf); + ovflopaque->hasho_nextblkno = InvalidBlockNumber; + ovflopaque->hasho_bucket = pageopaque->hasho_bucket; + ovflopaque->hasho_flag = LH_OVERFLOW_PAGE; + ovflopaque->hasho_page_id = HASHO_PAGE_ID; + + MarkBufferDirty(ovflbuf); + + /* logically chain overflow page to previous page */ + pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf); + + MarkBufferDirty(buf); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + xl_hash_add_ovfl_page xlrec; + + xlrec.bmpage_found = page_found; + xlrec.bmsize = metap->hashm_bmsize; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage); + + XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT); + XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket)); + + XLogRegisterBuffer(1, buf, REGBUF_STANDARD); + + if (BufferIsValid(mapbuf)) + { + XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD); + XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32)); + } + + if (BufferIsValid(newmapbuf)) + XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT); + + XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD); + XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32)); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE); + + PageSetLSN(BufferGetPage(ovflbuf), recptr); + PageSetLSN(BufferGetPage(buf), recptr); + + if (BufferIsValid(mapbuf)) + PageSetLSN(BufferGetPage(mapbuf), recptr); + + if (BufferIsValid(newmapbuf)) + PageSetLSN(BufferGetPage(newmapbuf), recptr); + + PageSetLSN(BufferGetPage(metabuf), recptr); + } + + END_CRIT_SECTION(); + + if (retain_pin) + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + else + _hash_relbuf(rel, buf); + + if (BufferIsValid(mapbuf)) + _hash_relbuf(rel, mapbuf); + + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + if (BufferIsValid(newmapbuf)) + _hash_relbuf(rel, newmapbuf); + + return ovflbuf; +} + +/* + * _hash_firstfreebit() + * + * Return the number of the first bit that is not set in the word 'map'. + */ +static uint32 +_hash_firstfreebit(uint32 map) +{ + uint32 i, + mask; + + mask = 0x1; + for (i = 0; i < BITS_PER_MAP; i++) + { + if (!(mask & map)) + return i; + mask <<= 1; + } + + elog(ERROR, "firstfreebit found no free bit"); + + return 0; /* keep compiler quiet */ +} + +/* + * _hash_freeovflpage() - + * + * Remove this overflow page from its bucket's chain, and mark the page as + * free. On entry, ovflbuf is write-locked; it is released before exiting. + * + * Add the tuples (itups) to wbuf in this function. We could do that in the + * caller as well, but the advantage of doing it here is we can easily write + * the WAL for XLOG_HASH_SQUEEZE_PAGE operation. Addition of tuples and + * removal of overflow page has to done as an atomic operation, otherwise + * during replay on standby users might find duplicate records. + * + * Since this function is invoked in VACUUM, we provide an access strategy + * parameter that controls fetches of the bucket pages. + * + * Returns the block number of the page that followed the given page + * in the bucket, or InvalidBlockNumber if no following page. + * + * NB: caller must not hold lock on metapage, nor on page, that's next to + * ovflbuf in the bucket chain. We don't acquire the lock on page that's + * prior to ovflbuf in chain if it is same as wbuf because the caller already + * has a lock on same. + */ +BlockNumber +_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf, + Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets, + Size *tups_size, uint16 nitups, + BufferAccessStrategy bstrategy) +{ + HashMetaPage metap; + Buffer metabuf; + Buffer mapbuf; + BlockNumber ovflblkno; + BlockNumber prevblkno; + BlockNumber blkno; + BlockNumber nextblkno; + BlockNumber writeblkno; + HashPageOpaque ovflopaque; + Page ovflpage; + Page mappage; + uint32 *freep; + uint32 ovflbitno; + int32 bitmappage, + bitmapbit; + Bucket bucket PG_USED_FOR_ASSERTS_ONLY; + Buffer prevbuf = InvalidBuffer; + Buffer nextbuf = InvalidBuffer; + bool update_metap = false; + + /* Get information from the doomed page */ + _hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE); + ovflblkno = BufferGetBlockNumber(ovflbuf); + ovflpage = BufferGetPage(ovflbuf); + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + nextblkno = ovflopaque->hasho_nextblkno; + prevblkno = ovflopaque->hasho_prevblkno; + writeblkno = BufferGetBlockNumber(wbuf); + bucket = ovflopaque->hasho_bucket; + + /* + * Fix up the bucket chain. this is a doubly-linked list, so we must fix + * up the bucket chain members behind and ahead of the overflow page being + * deleted. Concurrency issues are avoided by using lock chaining as + * described atop hashbucketcleanup. + */ + if (BlockNumberIsValid(prevblkno)) + { + if (prevblkno == writeblkno) + prevbuf = wbuf; + else + prevbuf = _hash_getbuf_with_strategy(rel, + prevblkno, + HASH_WRITE, + LH_BUCKET_PAGE | LH_OVERFLOW_PAGE, + bstrategy); + } + if (BlockNumberIsValid(nextblkno)) + nextbuf = _hash_getbuf_with_strategy(rel, + nextblkno, + HASH_WRITE, + LH_OVERFLOW_PAGE, + bstrategy); + + /* Note: bstrategy is intentionally not used for metapage and bitmap */ + + /* Read the metapage so we can determine which bitmap page to use */ + metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE); + metap = HashPageGetMeta(BufferGetPage(metabuf)); + + /* Identify which bit to set */ + ovflbitno = _hash_ovflblkno_to_bitno(metap, ovflblkno); + + bitmappage = ovflbitno >> BMPG_SHIFT(metap); + bitmapbit = ovflbitno & BMPG_MASK(metap); + + if (bitmappage >= metap->hashm_nmaps) + elog(ERROR, "invalid overflow bit number %u", ovflbitno); + blkno = metap->hashm_mapp[bitmappage]; + + /* Release metapage lock while we access the bitmap page */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + /* read the bitmap page to clear the bitmap bit */ + mapbuf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BITMAP_PAGE); + mappage = BufferGetPage(mapbuf); + freep = HashPageGetBitmap(mappage); + Assert(ISSET(freep, bitmapbit)); + + /* Get write-lock on metapage to update firstfree */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + /* This operation needs to log multiple tuples, prepare WAL for that */ + if (RelationNeedsWAL(rel)) + XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups); + + START_CRIT_SECTION(); + + /* + * we have to insert tuples on the "write" page, being careful to preserve + * hashkey ordering. (If we insert many tuples into the same "write" page + * it would be worth qsort'ing them). + */ + if (nitups > 0) + { + _hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups); + MarkBufferDirty(wbuf); + } + + /* + * Reinitialize the freed overflow page. Just zeroing the page won't + * work, because WAL replay routines expect pages to be initialized. See + * explanation of RBM_NORMAL mode atop XLogReadBufferExtended. We are + * careful to make the special space valid here so that tools like + * pageinspect won't get confused. + */ + _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf)); + + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + + ovflopaque->hasho_prevblkno = InvalidBlockNumber; + ovflopaque->hasho_nextblkno = InvalidBlockNumber; + ovflopaque->hasho_bucket = -1; + ovflopaque->hasho_flag = LH_UNUSED_PAGE; + ovflopaque->hasho_page_id = HASHO_PAGE_ID; + + MarkBufferDirty(ovflbuf); + + if (BufferIsValid(prevbuf)) + { + Page prevpage = BufferGetPage(prevbuf); + HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage); + + Assert(prevopaque->hasho_bucket == bucket); + prevopaque->hasho_nextblkno = nextblkno; + MarkBufferDirty(prevbuf); + } + if (BufferIsValid(nextbuf)) + { + Page nextpage = BufferGetPage(nextbuf); + HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage); + + Assert(nextopaque->hasho_bucket == bucket); + nextopaque->hasho_prevblkno = prevblkno; + MarkBufferDirty(nextbuf); + } + + /* Clear the bitmap bit to indicate that this overflow page is free */ + CLRBIT(freep, bitmapbit); + MarkBufferDirty(mapbuf); + + /* if this is now the first free page, update hashm_firstfree */ + if (ovflbitno < metap->hashm_firstfree) + { + metap->hashm_firstfree = ovflbitno; + update_metap = true; + MarkBufferDirty(metabuf); + } + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + xl_hash_squeeze_page xlrec; + XLogRecPtr recptr; + int i; + + xlrec.prevblkno = prevblkno; + xlrec.nextblkno = nextblkno; + xlrec.ntups = nitups; + xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf); + xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf); + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage); + + /* + * bucket buffer needs to be registered to ensure that we can acquire + * a cleanup lock on it during replay. + */ + if (!xlrec.is_prim_bucket_same_wrt) + XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE); + + XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD); + if (xlrec.ntups > 0) + { + XLogRegisterBufData(1, (char *) itup_offsets, + nitups * sizeof(OffsetNumber)); + for (i = 0; i < nitups; i++) + XLogRegisterBufData(1, (char *) itups[i], tups_size[i]); + } + + XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD); + + /* + * If prevpage and the writepage (block in which we are moving tuples + * from overflow) are same, then no need to separately register + * prevpage. During replay, we can directly update the nextblock in + * writepage. + */ + if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt) + XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD); + + if (BufferIsValid(nextbuf)) + XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD); + + XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD); + XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32)); + + if (update_metap) + { + XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD); + XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32)); + } + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE); + + PageSetLSN(BufferGetPage(wbuf), recptr); + PageSetLSN(BufferGetPage(ovflbuf), recptr); + + if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt) + PageSetLSN(BufferGetPage(prevbuf), recptr); + if (BufferIsValid(nextbuf)) + PageSetLSN(BufferGetPage(nextbuf), recptr); + + PageSetLSN(BufferGetPage(mapbuf), recptr); + + if (update_metap) + PageSetLSN(BufferGetPage(metabuf), recptr); + } + + END_CRIT_SECTION(); + + /* release previous bucket if it is not same as write bucket */ + if (BufferIsValid(prevbuf) && prevblkno != writeblkno) + _hash_relbuf(rel, prevbuf); + + if (BufferIsValid(ovflbuf)) + _hash_relbuf(rel, ovflbuf); + + if (BufferIsValid(nextbuf)) + _hash_relbuf(rel, nextbuf); + + _hash_relbuf(rel, mapbuf); + _hash_relbuf(rel, metabuf); + + return nextblkno; +} + + +/* + * _hash_initbitmapbuffer() + * + * Initialize a new bitmap page. All bits in the new bitmap page are set to + * "1", indicating "in use". + */ +void +_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage) +{ + Page pg; + HashPageOpaque op; + uint32 *freep; + + pg = BufferGetPage(buf); + + /* initialize the page */ + if (initpage) + _hash_pageinit(pg, BufferGetPageSize(buf)); + + /* initialize the page's special space */ + op = (HashPageOpaque) PageGetSpecialPointer(pg); + op->hasho_prevblkno = InvalidBlockNumber; + op->hasho_nextblkno = InvalidBlockNumber; + op->hasho_bucket = -1; + op->hasho_flag = LH_BITMAP_PAGE; + op->hasho_page_id = HASHO_PAGE_ID; + + /* set all of the bits to 1 */ + freep = HashPageGetBitmap(pg); + MemSet(freep, 0xFF, bmsize); + + /* + * Set pd_lower just past the end of the bitmap page data. We could even + * set pd_lower equal to pd_upper, but this is more precise and makes the + * page look compressible to xlog.c. + */ + ((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg; +} + + +/* + * _hash_squeezebucket(rel, bucket) + * + * Try to squeeze the tuples onto pages occurring earlier in the + * bucket chain in an attempt to free overflow pages. When we start + * the "squeezing", the page from which we start taking tuples (the + * "read" page) is the last bucket in the bucket chain and the page + * onto which we start squeezing tuples (the "write" page) is the + * first page in the bucket chain. The read page works backward and + * the write page works forward; the procedure terminates when the + * read page and write page are the same page. + * + * At completion of this procedure, it is guaranteed that all pages in + * the bucket are nonempty, unless the bucket is totally empty (in + * which case all overflow pages will be freed). The original implementation + * required that to be true on entry as well, but it's a lot easier for + * callers to leave empty overflow pages and let this guy clean it up. + * + * Caller must acquire cleanup lock on the primary page of the target + * bucket to exclude any scans that are in progress, which could easily + * be confused into returning the same tuple more than once or some tuples + * not at all by the rearrangement we are performing here. To prevent + * any concurrent scan to cross the squeeze scan we use lock chaining + * similar to hashbucketcleanup. Refer comments atop hashbucketcleanup. + * + * We need to retain a pin on the primary bucket to ensure that no concurrent + * split can start. + * + * Since this function is invoked in VACUUM, we provide an access strategy + * parameter that controls fetches of the bucket pages. + */ +void +_hash_squeezebucket(Relation rel, + Bucket bucket, + BlockNumber bucket_blkno, + Buffer bucket_buf, + BufferAccessStrategy bstrategy) +{ + BlockNumber wblkno; + BlockNumber rblkno; + Buffer wbuf; + Buffer rbuf; + Page wpage; + Page rpage; + HashPageOpaque wopaque; + HashPageOpaque ropaque; + + /* + * start squeezing into the primary bucket page. + */ + wblkno = bucket_blkno; + wbuf = bucket_buf; + wpage = BufferGetPage(wbuf); + wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage); + + /* + * if there aren't any overflow pages, there's nothing to squeeze. caller + * is responsible for releasing the pin on primary bucket page. + */ + if (!BlockNumberIsValid(wopaque->hasho_nextblkno)) + { + LockBuffer(wbuf, BUFFER_LOCK_UNLOCK); + return; + } + + /* + * Find the last page in the bucket chain by starting at the base bucket + * page and working forward. Note: we assume that a hash bucket chain is + * usually smaller than the buffer ring being used by VACUUM, else using + * the access strategy here would be counterproductive. + */ + rbuf = InvalidBuffer; + ropaque = wopaque; + do + { + rblkno = ropaque->hasho_nextblkno; + if (rbuf != InvalidBuffer) + _hash_relbuf(rel, rbuf); + rbuf = _hash_getbuf_with_strategy(rel, + rblkno, + HASH_WRITE, + LH_OVERFLOW_PAGE, + bstrategy); + rpage = BufferGetPage(rbuf); + ropaque = (HashPageOpaque) PageGetSpecialPointer(rpage); + Assert(ropaque->hasho_bucket == bucket); + } while (BlockNumberIsValid(ropaque->hasho_nextblkno)); + + /* + * squeeze the tuples. + */ + for (;;) + { + OffsetNumber roffnum; + OffsetNumber maxroffnum; + OffsetNumber deletable[MaxOffsetNumber]; + IndexTuple itups[MaxIndexTuplesPerPage]; + Size tups_size[MaxIndexTuplesPerPage]; + OffsetNumber itup_offsets[MaxIndexTuplesPerPage]; + uint16 ndeletable = 0; + uint16 nitups = 0; + Size all_tups_size = 0; + int i; + bool retain_pin = false; + +readpage: + /* Scan each tuple in "read" page */ + maxroffnum = PageGetMaxOffsetNumber(rpage); + for (roffnum = FirstOffsetNumber; + roffnum <= maxroffnum; + roffnum = OffsetNumberNext(roffnum)) + { + IndexTuple itup; + Size itemsz; + + /* skip dead tuples */ + if (ItemIdIsDead(PageGetItemId(rpage, roffnum))) + continue; + + itup = (IndexTuple) PageGetItem(rpage, + PageGetItemId(rpage, roffnum)); + itemsz = IndexTupleSize(itup); + itemsz = MAXALIGN(itemsz); + + /* + * Walk up the bucket chain, looking for a page big enough for + * this item and all other accumulated items. Exit if we reach + * the read page. + */ + while (PageGetFreeSpaceForMultipleTuples(wpage, nitups + 1) < (all_tups_size + itemsz)) + { + Buffer next_wbuf = InvalidBuffer; + bool tups_moved = false; + + Assert(!PageIsEmpty(wpage)); + + if (wblkno == bucket_blkno) + retain_pin = true; + + wblkno = wopaque->hasho_nextblkno; + Assert(BlockNumberIsValid(wblkno)); + + /* don't need to move to next page if we reached the read page */ + if (wblkno != rblkno) + next_wbuf = _hash_getbuf_with_strategy(rel, + wblkno, + HASH_WRITE, + LH_OVERFLOW_PAGE, + bstrategy); + + if (nitups > 0) + { + Assert(nitups == ndeletable); + + /* + * This operation needs to log multiple tuples, prepare + * WAL for that. + */ + if (RelationNeedsWAL(rel)) + XLogEnsureRecordSpace(0, 3 + nitups); + + START_CRIT_SECTION(); + + /* + * we have to insert tuples on the "write" page, being + * careful to preserve hashkey ordering. (If we insert + * many tuples into the same "write" page it would be + * worth qsort'ing them). + */ + _hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups); + MarkBufferDirty(wbuf); + + /* Delete tuples we already moved off read page */ + PageIndexMultiDelete(rpage, deletable, ndeletable); + MarkBufferDirty(rbuf); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + xl_hash_move_page_contents xlrec; + + xlrec.ntups = nitups; + xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents); + + /* + * bucket buffer needs to be registered to ensure that + * we can acquire a cleanup lock on it during replay. + */ + if (!xlrec.is_prim_bucket_same_wrt) + XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE); + + XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD); + XLogRegisterBufData(1, (char *) itup_offsets, + nitups * sizeof(OffsetNumber)); + for (i = 0; i < nitups; i++) + XLogRegisterBufData(1, (char *) itups[i], tups_size[i]); + + XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD); + XLogRegisterBufData(2, (char *) deletable, + ndeletable * sizeof(OffsetNumber)); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS); + + PageSetLSN(BufferGetPage(wbuf), recptr); + PageSetLSN(BufferGetPage(rbuf), recptr); + } + + END_CRIT_SECTION(); + + tups_moved = true; + } + + /* + * release the lock on previous page after acquiring the lock + * on next page + */ + if (retain_pin) + LockBuffer(wbuf, BUFFER_LOCK_UNLOCK); + else + _hash_relbuf(rel, wbuf); + + /* nothing more to do if we reached the read page */ + if (rblkno == wblkno) + { + _hash_relbuf(rel, rbuf); + return; + } + + wbuf = next_wbuf; + wpage = BufferGetPage(wbuf); + wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage); + Assert(wopaque->hasho_bucket == bucket); + retain_pin = false; + + /* be tidy */ + for (i = 0; i < nitups; i++) + pfree(itups[i]); + nitups = 0; + all_tups_size = 0; + ndeletable = 0; + + /* + * after moving the tuples, rpage would have been compacted, + * so we need to rescan it. + */ + if (tups_moved) + goto readpage; + } + + /* remember tuple for deletion from "read" page */ + deletable[ndeletable++] = roffnum; + + /* + * we need a copy of index tuples as they can be freed as part of + * overflow page, however we need them to write a WAL record in + * _hash_freeovflpage. + */ + itups[nitups] = CopyIndexTuple(itup); + tups_size[nitups++] = itemsz; + all_tups_size += itemsz; + } + + /* + * If we reach here, there are no live tuples on the "read" page --- + * it was empty when we got to it, or we moved them all. So we can + * just free the page without bothering with deleting tuples + * individually. Then advance to the previous "read" page. + * + * Tricky point here: if our read and write pages are adjacent in the + * bucket chain, our write lock on wbuf will conflict with + * _hash_freeovflpage's attempt to update the sibling links of the + * removed page. In that case, we don't need to lock it again. + */ + rblkno = ropaque->hasho_prevblkno; + Assert(BlockNumberIsValid(rblkno)); + + /* free this overflow page (releases rbuf) */ + _hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets, + tups_size, nitups, bstrategy); + + /* be tidy */ + for (i = 0; i < nitups; i++) + pfree(itups[i]); + + /* are we freeing the page adjacent to wbuf? */ + if (rblkno == wblkno) + { + /* retain the pin on primary bucket page till end of bucket scan */ + if (wblkno == bucket_blkno) + LockBuffer(wbuf, BUFFER_LOCK_UNLOCK); + else + _hash_relbuf(rel, wbuf); + return; + } + + rbuf = _hash_getbuf_with_strategy(rel, + rblkno, + HASH_WRITE, + LH_OVERFLOW_PAGE, + bstrategy); + rpage = BufferGetPage(rbuf); + ropaque = (HashPageOpaque) PageGetSpecialPointer(rpage); + Assert(ropaque->hasho_bucket == bucket); + } + + /* NOTREACHED */ +} diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c new file mode 100644 index 0000000..49a9867 --- /dev/null +++ b/src/backend/access/hash/hashpage.c @@ -0,0 +1,1612 @@ +/*------------------------------------------------------------------------- + * + * hashpage.c + * Hash table page management code for the Postgres hash access method + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/hash/hashpage.c + * + * NOTES + * Postgres hash pages look like ordinary relation pages. The opaque + * data at high addresses includes information about the page including + * whether a page is an overflow page or a true bucket, the bucket + * number, and the block numbers of the preceding and following pages + * in the same bucket. + * + * The first page in a hash relation, page zero, is special -- it stores + * information describing the hash table; it is referred to as the + * "meta page." Pages one and higher store the actual data. + * + * There are also bitmap pages, which are not manipulated here; + * see hashovfl.c. + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/hash.h" +#include "access/hash_xlog.h" +#include "miscadmin.h" +#include "port/pg_bitutils.h" +#include "storage/lmgr.h" +#include "storage/predicate.h" +#include "storage/smgr.h" + +static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock, + uint32 nblocks); +static void _hash_splitbucket(Relation rel, Buffer metabuf, + Bucket obucket, Bucket nbucket, + Buffer obuf, + Buffer nbuf, + HTAB *htab, + uint32 maxbucket, + uint32 highmask, uint32 lowmask); +static void log_split_page(Relation rel, Buffer buf); + + +/* + * _hash_getbuf() -- Get a buffer by block number for read or write. + * + * 'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK. + * 'flags' is a bitwise OR of the allowed page types. + * + * This must be used only to fetch pages that are expected to be valid + * already. _hash_checkpage() is applied using the given flags. + * + * When this routine returns, the appropriate lock is set on the + * requested buffer and its reference count has been incremented + * (ie, the buffer is "locked and pinned"). + * + * P_NEW is disallowed because this routine can only be used + * to access pages that are known to be before the filesystem EOF. + * Extending the index should be done with _hash_getnewbuf. + */ +Buffer +_hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags) +{ + Buffer buf; + + if (blkno == P_NEW) + elog(ERROR, "hash AM does not use P_NEW"); + + buf = ReadBuffer(rel, blkno); + + if (access != HASH_NOLOCK) + LockBuffer(buf, access); + + /* ref count and lock type are correct */ + + _hash_checkpage(rel, buf, flags); + + return buf; +} + +/* + * _hash_getbuf_with_condlock_cleanup() -- Try to get a buffer for cleanup. + * + * We read the page and try to acquire a cleanup lock. If we get it, + * we return the buffer; otherwise, we return InvalidBuffer. + */ +Buffer +_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags) +{ + Buffer buf; + + if (blkno == P_NEW) + elog(ERROR, "hash AM does not use P_NEW"); + + buf = ReadBuffer(rel, blkno); + + if (!ConditionalLockBufferForCleanup(buf)) + { + ReleaseBuffer(buf); + return InvalidBuffer; + } + + /* ref count and lock type are correct */ + + _hash_checkpage(rel, buf, flags); + + return buf; +} + +/* + * _hash_getinitbuf() -- Get and initialize a buffer by block number. + * + * This must be used only to fetch pages that are known to be before + * the index's filesystem EOF, but are to be filled from scratch. + * _hash_pageinit() is applied automatically. Otherwise it has + * effects similar to _hash_getbuf() with access = HASH_WRITE. + * + * When this routine returns, a write lock is set on the + * requested buffer and its reference count has been incremented + * (ie, the buffer is "locked and pinned"). + * + * P_NEW is disallowed because this routine can only be used + * to access pages that are known to be before the filesystem EOF. + * Extending the index should be done with _hash_getnewbuf. + */ +Buffer +_hash_getinitbuf(Relation rel, BlockNumber blkno) +{ + Buffer buf; + + if (blkno == P_NEW) + elog(ERROR, "hash AM does not use P_NEW"); + + buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_ZERO_AND_LOCK, + NULL); + + /* ref count and lock type are correct */ + + /* initialize the page */ + _hash_pageinit(BufferGetPage(buf), BufferGetPageSize(buf)); + + return buf; +} + +/* + * _hash_initbuf() -- Get and initialize a buffer by bucket number. + */ +void +_hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket, uint32 flag, + bool initpage) +{ + HashPageOpaque pageopaque; + Page page; + + page = BufferGetPage(buf); + + /* initialize the page */ + if (initpage) + _hash_pageinit(page, BufferGetPageSize(buf)); + + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + + /* + * Set hasho_prevblkno with current hashm_maxbucket. This value will be + * used to validate cached HashMetaPageData. See + * _hash_getbucketbuf_from_hashkey(). + */ + pageopaque->hasho_prevblkno = max_bucket; + pageopaque->hasho_nextblkno = InvalidBlockNumber; + pageopaque->hasho_bucket = num_bucket; + pageopaque->hasho_flag = flag; + pageopaque->hasho_page_id = HASHO_PAGE_ID; +} + +/* + * _hash_getnewbuf() -- Get a new page at the end of the index. + * + * This has the same API as _hash_getinitbuf, except that we are adding + * a page to the index, and hence expect the page to be past the + * logical EOF. (However, we have to support the case where it isn't, + * since a prior try might have crashed after extending the filesystem + * EOF but before updating the metapage to reflect the added page.) + * + * It is caller's responsibility to ensure that only one process can + * extend the index at a time. In practice, this function is called + * only while holding write lock on the metapage, because adding a page + * is always associated with an update of metapage data. + */ +Buffer +_hash_getnewbuf(Relation rel, BlockNumber blkno, ForkNumber forkNum) +{ + BlockNumber nblocks = RelationGetNumberOfBlocksInFork(rel, forkNum); + Buffer buf; + + if (blkno == P_NEW) + elog(ERROR, "hash AM does not use P_NEW"); + if (blkno > nblocks) + elog(ERROR, "access to noncontiguous page in hash index \"%s\"", + RelationGetRelationName(rel)); + + /* smgr insists we use P_NEW to extend the relation */ + if (blkno == nblocks) + { + buf = ReadBufferExtended(rel, forkNum, P_NEW, RBM_NORMAL, NULL); + if (BufferGetBlockNumber(buf) != blkno) + elog(ERROR, "unexpected hash relation size: %u, should be %u", + BufferGetBlockNumber(buf), blkno); + LockBuffer(buf, HASH_WRITE); + } + else + { + buf = ReadBufferExtended(rel, forkNum, blkno, RBM_ZERO_AND_LOCK, + NULL); + } + + /* ref count and lock type are correct */ + + /* initialize the page */ + _hash_pageinit(BufferGetPage(buf), BufferGetPageSize(buf)); + + return buf; +} + +/* + * _hash_getbuf_with_strategy() -- Get a buffer with nondefault strategy. + * + * This is identical to _hash_getbuf() but also allows a buffer access + * strategy to be specified. We use this for VACUUM operations. + */ +Buffer +_hash_getbuf_with_strategy(Relation rel, BlockNumber blkno, + int access, int flags, + BufferAccessStrategy bstrategy) +{ + Buffer buf; + + if (blkno == P_NEW) + elog(ERROR, "hash AM does not use P_NEW"); + + buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy); + + if (access != HASH_NOLOCK) + LockBuffer(buf, access); + + /* ref count and lock type are correct */ + + _hash_checkpage(rel, buf, flags); + + return buf; +} + +/* + * _hash_relbuf() -- release a locked buffer. + * + * Lock and pin (refcount) are both dropped. + */ +void +_hash_relbuf(Relation rel, Buffer buf) +{ + UnlockReleaseBuffer(buf); +} + +/* + * _hash_dropbuf() -- release an unlocked buffer. + * + * This is used to unpin a buffer on which we hold no lock. + */ +void +_hash_dropbuf(Relation rel, Buffer buf) +{ + ReleaseBuffer(buf); +} + +/* + * _hash_dropscanbuf() -- release buffers used in scan. + * + * This routine unpins the buffers used during scan on which we + * hold no lock. + */ +void +_hash_dropscanbuf(Relation rel, HashScanOpaque so) +{ + /* release pin we hold on primary bucket page */ + if (BufferIsValid(so->hashso_bucket_buf) && + so->hashso_bucket_buf != so->currPos.buf) + _hash_dropbuf(rel, so->hashso_bucket_buf); + so->hashso_bucket_buf = InvalidBuffer; + + /* release pin we hold on primary bucket page of bucket being split */ + if (BufferIsValid(so->hashso_split_bucket_buf) && + so->hashso_split_bucket_buf != so->currPos.buf) + _hash_dropbuf(rel, so->hashso_split_bucket_buf); + so->hashso_split_bucket_buf = InvalidBuffer; + + /* release any pin we still hold */ + if (BufferIsValid(so->currPos.buf)) + _hash_dropbuf(rel, so->currPos.buf); + so->currPos.buf = InvalidBuffer; + + /* reset split scan */ + so->hashso_buc_populated = false; + so->hashso_buc_split = false; +} + + +/* + * _hash_init() -- Initialize the metadata page of a hash index, + * the initial buckets, and the initial bitmap page. + * + * The initial number of buckets is dependent on num_tuples, an estimate + * of the number of tuples to be loaded into the index initially. The + * chosen number of buckets is returned. + * + * We are fairly cavalier about locking here, since we know that no one else + * could be accessing this index. In particular the rule about not holding + * multiple buffer locks is ignored. + */ +uint32 +_hash_init(Relation rel, double num_tuples, ForkNumber forkNum) +{ + Buffer metabuf; + Buffer buf; + Buffer bitmapbuf; + Page pg; + HashMetaPage metap; + RegProcedure procid; + int32 data_width; + int32 item_width; + int32 ffactor; + uint32 num_buckets; + uint32 i; + bool use_wal; + + /* safety check */ + if (RelationGetNumberOfBlocksInFork(rel, forkNum) != 0) + elog(ERROR, "cannot initialize non-empty hash index \"%s\"", + RelationGetRelationName(rel)); + + /* + * WAL log creation of pages if the relation is persistent, or this is the + * init fork. Init forks for unlogged relations always need to be WAL + * logged. + */ + use_wal = RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM; + + /* + * Determine the target fill factor (in tuples per bucket) for this index. + * The idea is to make the fill factor correspond to pages about as full + * as the user-settable fillfactor parameter says. We can compute it + * exactly since the index datatype (i.e. uint32 hash key) is fixed-width. + */ + data_width = sizeof(uint32); + item_width = MAXALIGN(sizeof(IndexTupleData)) + MAXALIGN(data_width) + + sizeof(ItemIdData); /* include the line pointer */ + ffactor = HashGetTargetPageUsage(rel) / item_width; + /* keep to a sane range */ + if (ffactor < 10) + ffactor = 10; + + procid = index_getprocid(rel, 1, HASHSTANDARD_PROC); + + /* + * We initialize the metapage, the first N bucket pages, and the first + * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend() + * calls to occur. This ensures that the smgr level has the right idea of + * the physical index length. + * + * Critical section not required, because on error the creation of the + * whole relation will be rolled back. + */ + metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum); + _hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false); + MarkBufferDirty(metabuf); + + pg = BufferGetPage(metabuf); + metap = HashPageGetMeta(pg); + + /* XLOG stuff */ + if (use_wal) + { + xl_hash_init_meta_page xlrec; + XLogRecPtr recptr; + + xlrec.num_tuples = num_tuples; + xlrec.procid = metap->hashm_procid; + xlrec.ffactor = metap->hashm_ffactor; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage); + XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE); + + PageSetLSN(BufferGetPage(metabuf), recptr); + } + + num_buckets = metap->hashm_maxbucket + 1; + + /* + * Release buffer lock on the metapage while we initialize buckets. + * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS + * won't accomplish anything. It's a bad idea to hold buffer locks for + * long intervals in any case, since that can block the bgwriter. + */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + /* + * Initialize and WAL Log the first N buckets + */ + for (i = 0; i < num_buckets; i++) + { + BlockNumber blkno; + + /* Allow interrupts, in case N is huge */ + CHECK_FOR_INTERRUPTS(); + + blkno = BUCKET_TO_BLKNO(metap, i); + buf = _hash_getnewbuf(rel, blkno, forkNum); + _hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false); + MarkBufferDirty(buf); + + if (use_wal) + log_newpage(&rel->rd_node, + forkNum, + blkno, + BufferGetPage(buf), + true); + _hash_relbuf(rel, buf); + } + + /* Now reacquire buffer lock on metapage */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + /* + * Initialize bitmap page + */ + bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum); + _hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false); + MarkBufferDirty(bitmapbuf); + + /* add the new bitmap page to the metapage's list of bitmaps */ + /* metapage already has a write lock */ + if (metap->hashm_nmaps >= HASH_MAX_BITMAPS) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("out of overflow pages in hash index \"%s\"", + RelationGetRelationName(rel)))); + + metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1; + + metap->hashm_nmaps++; + MarkBufferDirty(metabuf); + + /* XLOG stuff */ + if (use_wal) + { + xl_hash_init_bitmap_page xlrec; + XLogRecPtr recptr; + + xlrec.bmsize = metap->hashm_bmsize; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage); + XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT); + + /* + * This is safe only because nobody else can be modifying the index at + * this stage; it's only visible to the transaction that is creating + * it. + */ + XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE); + + PageSetLSN(BufferGetPage(bitmapbuf), recptr); + PageSetLSN(BufferGetPage(metabuf), recptr); + } + + /* all done */ + _hash_relbuf(rel, bitmapbuf); + _hash_relbuf(rel, metabuf); + + return num_buckets; +} + +/* + * _hash_init_metabuffer() -- Initialize the metadata page of a hash index. + */ +void +_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid, + uint16 ffactor, bool initpage) +{ + HashMetaPage metap; + HashPageOpaque pageopaque; + Page page; + double dnumbuckets; + uint32 num_buckets; + uint32 spare_index; + uint32 lshift; + + /* + * Choose the number of initial bucket pages to match the fill factor + * given the estimated number of tuples. We round up the result to the + * total number of buckets which has to be allocated before using its + * hashm_spares element. However always force at least 2 bucket pages. The + * upper limit is determined by considerations explained in + * _hash_expandtable(). + */ + dnumbuckets = num_tuples / ffactor; + if (dnumbuckets <= 2.0) + num_buckets = 2; + else if (dnumbuckets >= (double) 0x40000000) + num_buckets = 0x40000000; + else + num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets)); + + spare_index = _hash_spareindex(num_buckets); + Assert(spare_index < HASH_MAX_SPLITPOINTS); + + page = BufferGetPage(buf); + if (initpage) + _hash_pageinit(page, BufferGetPageSize(buf)); + + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + pageopaque->hasho_prevblkno = InvalidBlockNumber; + pageopaque->hasho_nextblkno = InvalidBlockNumber; + pageopaque->hasho_bucket = -1; + pageopaque->hasho_flag = LH_META_PAGE; + pageopaque->hasho_page_id = HASHO_PAGE_ID; + + metap = HashPageGetMeta(page); + + metap->hashm_magic = HASH_MAGIC; + metap->hashm_version = HASH_VERSION; + metap->hashm_ntuples = 0; + metap->hashm_nmaps = 0; + metap->hashm_ffactor = ffactor; + metap->hashm_bsize = HashGetMaxBitmapSize(page); + + /* find largest bitmap array size that will fit in page size */ + lshift = pg_leftmost_one_pos32(metap->hashm_bsize); + Assert(lshift > 0); + metap->hashm_bmsize = 1 << lshift; + metap->hashm_bmshift = lshift + BYTE_TO_BIT; + Assert((1 << BMPG_SHIFT(metap)) == (BMPG_MASK(metap) + 1)); + + /* + * Label the index with its primary hash support function's OID. This is + * pretty useless for normal operation (in fact, hashm_procid is not used + * anywhere), but it might be handy for forensic purposes so we keep it. + */ + metap->hashm_procid = procid; + + /* + * We initialize the index with N buckets, 0 .. N-1, occupying physical + * blocks 1 to N. The first freespace bitmap page is in block N+1. + */ + metap->hashm_maxbucket = num_buckets - 1; + + /* + * Set highmask as next immediate ((2 ^ x) - 1), which should be + * sufficient to cover num_buckets. + */ + metap->hashm_highmask = pg_nextpower2_32(num_buckets + 1) - 1; + metap->hashm_lowmask = (metap->hashm_highmask >> 1); + + MemSet(metap->hashm_spares, 0, sizeof(metap->hashm_spares)); + MemSet(metap->hashm_mapp, 0, sizeof(metap->hashm_mapp)); + + /* Set up mapping for one spare page after the initial splitpoints */ + metap->hashm_spares[spare_index] = 1; + metap->hashm_ovflpoint = spare_index; + metap->hashm_firstfree = 0; + + /* + * Set pd_lower just past the end of the metadata. This is essential, + * because without doing so, metadata will be lost if xlog.c compresses + * the page. + */ + ((PageHeader) page)->pd_lower = + ((char *) metap + sizeof(HashMetaPageData)) - (char *) page; +} + +/* + * _hash_pageinit() -- Initialize a new hash index page. + */ +void +_hash_pageinit(Page page, Size size) +{ + PageInit(page, size, sizeof(HashPageOpaqueData)); +} + +/* + * Attempt to expand the hash table by creating one new bucket. + * + * This will silently do nothing if we don't get cleanup lock on old or + * new bucket. + * + * Complete the pending splits and remove the tuples from old bucket, + * if there are any left over from the previous split. + * + * The caller must hold a pin, but no lock, on the metapage buffer. + * The buffer is returned in the same state. + */ +void +_hash_expandtable(Relation rel, Buffer metabuf) +{ + HashMetaPage metap; + Bucket old_bucket; + Bucket new_bucket; + uint32 spare_ndx; + BlockNumber start_oblkno; + BlockNumber start_nblkno; + Buffer buf_nblkno; + Buffer buf_oblkno; + Page opage; + Page npage; + HashPageOpaque oopaque; + HashPageOpaque nopaque; + uint32 maxbucket; + uint32 highmask; + uint32 lowmask; + bool metap_update_masks = false; + bool metap_update_splitpoint = false; + +restart_expand: + + /* + * Write-lock the meta page. It used to be necessary to acquire a + * heavyweight lock to begin a split, but that is no longer required. + */ + LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); + + _hash_checkpage(rel, metabuf, LH_META_PAGE); + metap = HashPageGetMeta(BufferGetPage(metabuf)); + + /* + * Check to see if split is still needed; someone else might have already + * done one while we waited for the lock. + * + * Make sure this stays in sync with _hash_doinsert() + */ + if (metap->hashm_ntuples <= + (double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1)) + goto fail; + + /* + * Can't split anymore if maxbucket has reached its maximum possible + * value. + * + * Ideally we'd allow bucket numbers up to UINT_MAX-1 (no higher because + * the calculation maxbucket+1 mustn't overflow). Currently we restrict + * to half that to prevent failure of pg_ceil_log2_32() and insufficient + * space in hashm_spares[]. It's moot anyway because an index with 2^32 + * buckets would certainly overflow BlockNumber and hence + * _hash_alloc_buckets() would fail, but if we supported buckets smaller + * than a disk block then this would be an independent constraint. + * + * If you change this, see also the maximum initial number of buckets in + * _hash_init(). + */ + if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE) + goto fail; + + /* + * Determine which bucket is to be split, and attempt to take cleanup lock + * on the old bucket. If we can't get the lock, give up. + * + * The cleanup lock protects us not only against other backends, but + * against our own backend as well. + * + * The cleanup lock is mainly to protect the split from concurrent + * inserts. See src/backend/access/hash/README, Lock Definitions for + * further details. Due to this locking restriction, if there is any + * pending scan, the split will give up which is not good, but harmless. + */ + new_bucket = metap->hashm_maxbucket + 1; + + old_bucket = (new_bucket & metap->hashm_lowmask); + + start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket); + + buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE); + if (!buf_oblkno) + goto fail; + + opage = BufferGetPage(buf_oblkno); + oopaque = (HashPageOpaque) PageGetSpecialPointer(opage); + + /* + * We want to finish the split from a bucket as there is no apparent + * benefit by not doing so and it will make the code complicated to finish + * the split that involves multiple buckets considering the case where new + * split also fails. We don't need to consider the new bucket for + * completing the split here as it is not possible that a re-split of new + * bucket starts when there is still a pending split from old bucket. + */ + if (H_BUCKET_BEING_SPLIT(oopaque)) + { + /* + * Copy bucket mapping info now; refer the comment in code below where + * we copy this information before calling _hash_splitbucket to see + * why this is okay. + */ + maxbucket = metap->hashm_maxbucket; + highmask = metap->hashm_highmask; + lowmask = metap->hashm_lowmask; + + /* + * Release the lock on metapage and old_bucket, before completing the + * split. + */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + LockBuffer(buf_oblkno, BUFFER_LOCK_UNLOCK); + + _hash_finish_split(rel, metabuf, buf_oblkno, old_bucket, maxbucket, + highmask, lowmask); + + /* release the pin on old buffer and retry for expand. */ + _hash_dropbuf(rel, buf_oblkno); + + goto restart_expand; + } + + /* + * Clean the tuples remained from the previous split. This operation + * requires cleanup lock and we already have one on the old bucket, so + * let's do it. We also don't want to allow further splits from the bucket + * till the garbage of previous split is cleaned. This has two + * advantages; first, it helps in avoiding the bloat due to garbage and + * second is, during cleanup of bucket, we are always sure that the + * garbage tuples belong to most recently split bucket. On the contrary, + * if we allow cleanup of bucket after meta page is updated to indicate + * the new split and before the actual split, the cleanup operation won't + * be able to decide whether the tuple has been moved to the newly created + * bucket and ended up deleting such tuples. + */ + if (H_NEEDS_SPLIT_CLEANUP(oopaque)) + { + /* + * Copy bucket mapping info now; refer to the comment in code below + * where we copy this information before calling _hash_splitbucket to + * see why this is okay. + */ + maxbucket = metap->hashm_maxbucket; + highmask = metap->hashm_highmask; + lowmask = metap->hashm_lowmask; + + /* Release the metapage lock. */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL, + maxbucket, highmask, lowmask, NULL, NULL, true, + NULL, NULL); + + _hash_dropbuf(rel, buf_oblkno); + + goto restart_expand; + } + + /* + * There shouldn't be any active scan on new bucket. + * + * Note: it is safe to compute the new bucket's blkno here, even though we + * may still need to update the BUCKET_TO_BLKNO mapping. This is because + * the current value of hashm_spares[hashm_ovflpoint] correctly shows + * where we are going to put a new splitpoint's worth of buckets. + */ + start_nblkno = BUCKET_TO_BLKNO(metap, new_bucket); + + /* + * If the split point is increasing we need to allocate a new batch of + * bucket pages. + */ + spare_ndx = _hash_spareindex(new_bucket + 1); + if (spare_ndx > metap->hashm_ovflpoint) + { + uint32 buckets_to_add; + + Assert(spare_ndx == metap->hashm_ovflpoint + 1); + + /* + * We treat allocation of buckets as a separate WAL-logged action. + * Even if we fail after this operation, won't leak bucket pages; + * rather, the next split will consume this space. In any case, even + * without failure we don't use all the space in one split operation. + */ + buckets_to_add = _hash_get_totalbuckets(spare_ndx) - new_bucket; + if (!_hash_alloc_buckets(rel, start_nblkno, buckets_to_add)) + { + /* can't split due to BlockNumber overflow */ + _hash_relbuf(rel, buf_oblkno); + goto fail; + } + } + + /* + * Physically allocate the new bucket's primary page. We want to do this + * before changing the metapage's mapping info, in case we can't get the + * disk space. Ideally, we don't need to check for cleanup lock on new + * bucket as no other backend could find this bucket unless meta page is + * updated. However, it is good to be consistent with old bucket locking. + */ + buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM); + if (!IsBufferCleanupOK(buf_nblkno)) + { + _hash_relbuf(rel, buf_oblkno); + _hash_relbuf(rel, buf_nblkno); + goto fail; + } + + /* + * Since we are scribbling on the pages in the shared buffers, establish a + * critical section. Any failure in this next code leaves us with a big + * problem: the metapage is effectively corrupt but could get written back + * to disk. + */ + START_CRIT_SECTION(); + + /* + * Okay to proceed with split. Update the metapage bucket mapping info. + */ + metap->hashm_maxbucket = new_bucket; + + if (new_bucket > metap->hashm_highmask) + { + /* Starting a new doubling */ + metap->hashm_lowmask = metap->hashm_highmask; + metap->hashm_highmask = new_bucket | metap->hashm_lowmask; + metap_update_masks = true; + } + + /* + * If the split point is increasing we need to adjust the hashm_spares[] + * array and hashm_ovflpoint so that future overflow pages will be created + * beyond this new batch of bucket pages. + */ + if (spare_ndx > metap->hashm_ovflpoint) + { + metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint]; + metap->hashm_ovflpoint = spare_ndx; + metap_update_splitpoint = true; + } + + MarkBufferDirty(metabuf); + + /* + * Copy bucket mapping info now; this saves re-accessing the meta page + * inside _hash_splitbucket's inner loop. Note that once we drop the + * split lock, other splits could begin, so these values might be out of + * date before _hash_splitbucket finishes. That's okay, since all it + * needs is to tell which of these two buckets to map hashkeys into. + */ + maxbucket = metap->hashm_maxbucket; + highmask = metap->hashm_highmask; + lowmask = metap->hashm_lowmask; + + opage = BufferGetPage(buf_oblkno); + oopaque = (HashPageOpaque) PageGetSpecialPointer(opage); + + /* + * Mark the old bucket to indicate that split is in progress. (At + * operation end, we will clear the split-in-progress flag.) Also, for a + * primary bucket page, hasho_prevblkno stores the number of buckets that + * existed as of the last split, so we must update that value here. + */ + oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT; + oopaque->hasho_prevblkno = maxbucket; + + MarkBufferDirty(buf_oblkno); + + npage = BufferGetPage(buf_nblkno); + + /* + * initialize the new bucket's primary page and mark it to indicate that + * split is in progress. + */ + nopaque = (HashPageOpaque) PageGetSpecialPointer(npage); + nopaque->hasho_prevblkno = maxbucket; + nopaque->hasho_nextblkno = InvalidBlockNumber; + nopaque->hasho_bucket = new_bucket; + nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED; + nopaque->hasho_page_id = HASHO_PAGE_ID; + + MarkBufferDirty(buf_nblkno); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + xl_hash_split_allocate_page xlrec; + XLogRecPtr recptr; + + xlrec.new_bucket = maxbucket; + xlrec.old_bucket_flag = oopaque->hasho_flag; + xlrec.new_bucket_flag = nopaque->hasho_flag; + xlrec.flags = 0; + + XLogBeginInsert(); + + XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD); + XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT); + XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD); + + if (metap_update_masks) + { + xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS; + XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32)); + XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32)); + } + + if (metap_update_splitpoint) + { + xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT; + XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint, + sizeof(uint32)); + XLogRegisterBufData(2, + (char *) &metap->hashm_spares[metap->hashm_ovflpoint], + sizeof(uint32)); + } + + XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE); + + PageSetLSN(BufferGetPage(buf_oblkno), recptr); + PageSetLSN(BufferGetPage(buf_nblkno), recptr); + PageSetLSN(BufferGetPage(metabuf), recptr); + } + + END_CRIT_SECTION(); + + /* drop lock, but keep pin */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); + + /* Relocate records to the new bucket */ + _hash_splitbucket(rel, metabuf, + old_bucket, new_bucket, + buf_oblkno, buf_nblkno, NULL, + maxbucket, highmask, lowmask); + + /* all done, now release the pins on primary buckets. */ + _hash_dropbuf(rel, buf_oblkno); + _hash_dropbuf(rel, buf_nblkno); + + return; + + /* Here if decide not to split or fail to acquire old bucket lock */ +fail: + + /* We didn't write the metapage, so just drop lock */ + LockBuffer(metabuf, BUFFER_LOCK_UNLOCK); +} + + +/* + * _hash_alloc_buckets -- allocate a new splitpoint's worth of bucket pages + * + * This does not need to initialize the new bucket pages; we'll do that as + * each one is used by _hash_expandtable(). But we have to extend the logical + * EOF to the end of the splitpoint; this keeps smgr's idea of the EOF in + * sync with ours, so that we don't get complaints from smgr. + * + * We do this by writing a page of zeroes at the end of the splitpoint range. + * We expect that the filesystem will ensure that the intervening pages read + * as zeroes too. On many filesystems this "hole" will not be allocated + * immediately, which means that the index file may end up more fragmented + * than if we forced it all to be allocated now; but since we don't scan + * hash indexes sequentially anyway, that probably doesn't matter. + * + * XXX It's annoying that this code is executed with the metapage lock held. + * We need to interlock against _hash_addovflpage() adding a new overflow page + * concurrently, but it'd likely be better to use LockRelationForExtension + * for the purpose. OTOH, adding a splitpoint is a very infrequent operation, + * so it may not be worth worrying about. + * + * Returns true if successful, or false if allocation failed due to + * BlockNumber overflow. + */ +static bool +_hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks) +{ + BlockNumber lastblock; + PGAlignedBlock zerobuf; + Page page; + HashPageOpaque ovflopaque; + + lastblock = firstblock + nblocks - 1; + + /* + * Check for overflow in block number calculation; if so, we cannot extend + * the index anymore. + */ + if (lastblock < firstblock || lastblock == InvalidBlockNumber) + return false; + + page = (Page) zerobuf.data; + + /* + * Initialize the page. Just zeroing the page won't work; see + * _hash_freeovflpage for similar usage. We take care to make the special + * space valid for the benefit of tools such as pageinspect. + */ + _hash_pageinit(page, BLCKSZ); + + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(page); + + ovflopaque->hasho_prevblkno = InvalidBlockNumber; + ovflopaque->hasho_nextblkno = InvalidBlockNumber; + ovflopaque->hasho_bucket = -1; + ovflopaque->hasho_flag = LH_UNUSED_PAGE; + ovflopaque->hasho_page_id = HASHO_PAGE_ID; + + if (RelationNeedsWAL(rel)) + log_newpage(&rel->rd_node, + MAIN_FORKNUM, + lastblock, + zerobuf.data, + true); + + RelationOpenSmgr(rel); + PageSetChecksumInplace(page, lastblock); + smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf.data, false); + + return true; +} + + +/* + * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket' + * + * This routine is used to partition the tuples between old and new bucket and + * is used to finish the incomplete split operations. To finish the previously + * interrupted split operation, the caller needs to fill htab. If htab is set, + * then we skip the movement of tuples that exists in htab, otherwise NULL + * value of htab indicates movement of all the tuples that belong to the new + * bucket. + * + * We are splitting a bucket that consists of a base bucket page and zero + * or more overflow (bucket chain) pages. We must relocate tuples that + * belong in the new bucket. + * + * The caller must hold cleanup locks on both buckets to ensure that + * no one else is trying to access them (see README). + * + * The caller must hold a pin, but no lock, on the metapage buffer. + * The buffer is returned in the same state. (The metapage is only + * touched if it becomes necessary to add or remove overflow pages.) + * + * Split needs to retain pin on primary bucket pages of both old and new + * buckets till end of operation. This is to prevent vacuum from starting + * while a split is in progress. + * + * In addition, the caller must have created the new bucket's base page, + * which is passed in buffer nbuf, pinned and write-locked. The lock will be + * released here and pin must be released by the caller. (The API is set up + * this way because we must do _hash_getnewbuf() before releasing the metapage + * write lock. So instead of passing the new bucket's start block number, we + * pass an actual buffer.) + */ +static void +_hash_splitbucket(Relation rel, + Buffer metabuf, + Bucket obucket, + Bucket nbucket, + Buffer obuf, + Buffer nbuf, + HTAB *htab, + uint32 maxbucket, + uint32 highmask, + uint32 lowmask) +{ + Buffer bucket_obuf; + Buffer bucket_nbuf; + Page opage; + Page npage; + HashPageOpaque oopaque; + HashPageOpaque nopaque; + OffsetNumber itup_offsets[MaxIndexTuplesPerPage]; + IndexTuple itups[MaxIndexTuplesPerPage]; + Size all_tups_size = 0; + int i; + uint16 nitups = 0; + + bucket_obuf = obuf; + opage = BufferGetPage(obuf); + oopaque = (HashPageOpaque) PageGetSpecialPointer(opage); + + bucket_nbuf = nbuf; + npage = BufferGetPage(nbuf); + nopaque = (HashPageOpaque) PageGetSpecialPointer(npage); + + /* Copy the predicate locks from old bucket to new bucket. */ + PredicateLockPageSplit(rel, + BufferGetBlockNumber(bucket_obuf), + BufferGetBlockNumber(bucket_nbuf)); + + /* + * Partition the tuples in the old bucket between the old bucket and the + * new bucket, advancing along the old bucket's overflow bucket chain and + * adding overflow pages to the new bucket as needed. Outer loop iterates + * once per page in old bucket. + */ + for (;;) + { + BlockNumber oblkno; + OffsetNumber ooffnum; + OffsetNumber omaxoffnum; + + /* Scan each tuple in old page */ + omaxoffnum = PageGetMaxOffsetNumber(opage); + for (ooffnum = FirstOffsetNumber; + ooffnum <= omaxoffnum; + ooffnum = OffsetNumberNext(ooffnum)) + { + IndexTuple itup; + Size itemsz; + Bucket bucket; + bool found = false; + + /* skip dead tuples */ + if (ItemIdIsDead(PageGetItemId(opage, ooffnum))) + continue; + + /* + * Before inserting a tuple, probe the hash table containing TIDs + * of tuples belonging to new bucket, if we find a match, then + * skip that tuple, else fetch the item's hash key (conveniently + * stored in the item) and determine which bucket it now belongs + * in. + */ + itup = (IndexTuple) PageGetItem(opage, + PageGetItemId(opage, ooffnum)); + + if (htab) + (void) hash_search(htab, &itup->t_tid, HASH_FIND, &found); + + if (found) + continue; + + bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup), + maxbucket, highmask, lowmask); + + if (bucket == nbucket) + { + IndexTuple new_itup; + + /* + * make a copy of index tuple as we have to scribble on it. + */ + new_itup = CopyIndexTuple(itup); + + /* + * mark the index tuple as moved by split, such tuples are + * skipped by scan if there is split in progress for a bucket. + */ + new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK; + + /* + * insert the tuple into the new bucket. if it doesn't fit on + * the current page in the new bucket, we must allocate a new + * overflow page and place the tuple on that page instead. + */ + itemsz = IndexTupleSize(new_itup); + itemsz = MAXALIGN(itemsz); + + if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz)) + { + /* + * Change the shared buffer state in critical section, + * otherwise any error could make it unrecoverable. + */ + START_CRIT_SECTION(); + + _hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups); + MarkBufferDirty(nbuf); + /* log the split operation before releasing the lock */ + log_split_page(rel, nbuf); + + END_CRIT_SECTION(); + + /* drop lock, but keep pin */ + LockBuffer(nbuf, BUFFER_LOCK_UNLOCK); + + /* be tidy */ + for (i = 0; i < nitups; i++) + pfree(itups[i]); + nitups = 0; + all_tups_size = 0; + + /* chain to a new overflow page */ + nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false); + npage = BufferGetPage(nbuf); + nopaque = (HashPageOpaque) PageGetSpecialPointer(npage); + } + + itups[nitups++] = new_itup; + all_tups_size += itemsz; + } + else + { + /* + * the tuple stays on this page, so nothing to do. + */ + Assert(bucket == obucket); + } + } + + oblkno = oopaque->hasho_nextblkno; + + /* retain the pin on the old primary bucket */ + if (obuf == bucket_obuf) + LockBuffer(obuf, BUFFER_LOCK_UNLOCK); + else + _hash_relbuf(rel, obuf); + + /* Exit loop if no more overflow pages in old bucket */ + if (!BlockNumberIsValid(oblkno)) + { + /* + * Change the shared buffer state in critical section, otherwise + * any error could make it unrecoverable. + */ + START_CRIT_SECTION(); + + _hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups); + MarkBufferDirty(nbuf); + /* log the split operation before releasing the lock */ + log_split_page(rel, nbuf); + + END_CRIT_SECTION(); + + if (nbuf == bucket_nbuf) + LockBuffer(nbuf, BUFFER_LOCK_UNLOCK); + else + _hash_relbuf(rel, nbuf); + + /* be tidy */ + for (i = 0; i < nitups; i++) + pfree(itups[i]); + break; + } + + /* Else, advance to next old page */ + obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE); + opage = BufferGetPage(obuf); + oopaque = (HashPageOpaque) PageGetSpecialPointer(opage); + } + + /* + * We're at the end of the old bucket chain, so we're done partitioning + * the tuples. Mark the old and new buckets to indicate split is + * finished. + * + * To avoid deadlocks due to locking order of buckets, first lock the old + * bucket and then the new bucket. + */ + LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE); + opage = BufferGetPage(bucket_obuf); + oopaque = (HashPageOpaque) PageGetSpecialPointer(opage); + + LockBuffer(bucket_nbuf, BUFFER_LOCK_EXCLUSIVE); + npage = BufferGetPage(bucket_nbuf); + nopaque = (HashPageOpaque) PageGetSpecialPointer(npage); + + START_CRIT_SECTION(); + + oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT; + nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED; + + /* + * After the split is finished, mark the old bucket to indicate that it + * contains deletable tuples. We will clear split-cleanup flag after + * deleting such tuples either at the end of split or at the next split + * from old bucket or at the time of vacuum. + */ + oopaque->hasho_flag |= LH_BUCKET_NEEDS_SPLIT_CLEANUP; + + /* + * now write the buffers, here we don't release the locks as caller is + * responsible to release locks. + */ + MarkBufferDirty(bucket_obuf); + MarkBufferDirty(bucket_nbuf); + + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + xl_hash_split_complete xlrec; + + xlrec.old_bucket_flag = oopaque->hasho_flag; + xlrec.new_bucket_flag = nopaque->hasho_flag; + + XLogBeginInsert(); + + XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete); + + XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD); + XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE); + + PageSetLSN(BufferGetPage(bucket_obuf), recptr); + PageSetLSN(BufferGetPage(bucket_nbuf), recptr); + } + + END_CRIT_SECTION(); + + /* + * If possible, clean up the old bucket. We might not be able to do this + * if someone else has a pin on it, but if not then we can go ahead. This + * isn't absolutely necessary, but it reduces bloat; if we don't do it + * now, VACUUM will do it eventually, but maybe not until new overflow + * pages have been allocated. Note that there's no need to clean up the + * new bucket. + */ + if (IsBufferCleanupOK(bucket_obuf)) + { + LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK); + hashbucketcleanup(rel, obucket, bucket_obuf, + BufferGetBlockNumber(bucket_obuf), NULL, + maxbucket, highmask, lowmask, NULL, NULL, true, + NULL, NULL); + } + else + { + LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK); + LockBuffer(bucket_obuf, BUFFER_LOCK_UNLOCK); + } +} + +/* + * _hash_finish_split() -- Finish the previously interrupted split operation + * + * To complete the split operation, we form the hash table of TIDs in new + * bucket which is then used by split operation to skip tuples that are + * already moved before the split operation was previously interrupted. + * + * The caller must hold a pin, but no lock, on the metapage and old bucket's + * primary page buffer. The buffers are returned in the same state. (The + * metapage is only touched if it becomes necessary to add or remove overflow + * pages.) + */ +void +_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket, + uint32 maxbucket, uint32 highmask, uint32 lowmask) +{ + HASHCTL hash_ctl; + HTAB *tidhtab; + Buffer bucket_nbuf = InvalidBuffer; + Buffer nbuf; + Page npage; + BlockNumber nblkno; + BlockNumber bucket_nblkno; + HashPageOpaque npageopaque; + Bucket nbucket; + bool found; + + /* Initialize hash tables used to track TIDs */ + hash_ctl.keysize = sizeof(ItemPointerData); + hash_ctl.entrysize = sizeof(ItemPointerData); + hash_ctl.hcxt = CurrentMemoryContext; + + tidhtab = + hash_create("bucket ctids", + 256, /* arbitrary initial size */ + &hash_ctl, + HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); + + bucket_nblkno = nblkno = _hash_get_newblock_from_oldbucket(rel, obucket); + + /* + * Scan the new bucket and build hash table of TIDs + */ + for (;;) + { + OffsetNumber noffnum; + OffsetNumber nmaxoffnum; + + nbuf = _hash_getbuf(rel, nblkno, HASH_READ, + LH_BUCKET_PAGE | LH_OVERFLOW_PAGE); + + /* remember the primary bucket buffer to acquire cleanup lock on it. */ + if (nblkno == bucket_nblkno) + bucket_nbuf = nbuf; + + npage = BufferGetPage(nbuf); + npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage); + + /* Scan each tuple in new page */ + nmaxoffnum = PageGetMaxOffsetNumber(npage); + for (noffnum = FirstOffsetNumber; + noffnum <= nmaxoffnum; + noffnum = OffsetNumberNext(noffnum)) + { + IndexTuple itup; + + /* Fetch the item's TID and insert it in hash table. */ + itup = (IndexTuple) PageGetItem(npage, + PageGetItemId(npage, noffnum)); + + (void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found); + + Assert(!found); + } + + nblkno = npageopaque->hasho_nextblkno; + + /* + * release our write lock without modifying buffer and ensure to + * retain the pin on primary bucket. + */ + if (nbuf == bucket_nbuf) + LockBuffer(nbuf, BUFFER_LOCK_UNLOCK); + else + _hash_relbuf(rel, nbuf); + + /* Exit loop if no more overflow pages in new bucket */ + if (!BlockNumberIsValid(nblkno)) + break; + } + + /* + * Conditionally get the cleanup lock on old and new buckets to perform + * the split operation. If we don't get the cleanup locks, silently give + * up and next insertion on old bucket will try again to complete the + * split. + */ + if (!ConditionalLockBufferForCleanup(obuf)) + { + hash_destroy(tidhtab); + return; + } + if (!ConditionalLockBufferForCleanup(bucket_nbuf)) + { + LockBuffer(obuf, BUFFER_LOCK_UNLOCK); + hash_destroy(tidhtab); + return; + } + + npage = BufferGetPage(bucket_nbuf); + npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage); + nbucket = npageopaque->hasho_bucket; + + _hash_splitbucket(rel, metabuf, obucket, + nbucket, obuf, bucket_nbuf, tidhtab, + maxbucket, highmask, lowmask); + + _hash_dropbuf(rel, bucket_nbuf); + hash_destroy(tidhtab); +} + +/* + * log_split_page() -- Log the split operation + * + * We log the split operation when the new page in new bucket gets full, + * so we log the entire page. + * + * 'buf' must be locked by the caller which is also responsible for unlocking + * it. + */ +static void +log_split_page(Relation rel, Buffer buf) +{ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + + XLogBeginInsert(); + + XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD); + + recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE); + + PageSetLSN(BufferGetPage(buf), recptr); + } +} + +/* + * _hash_getcachedmetap() -- Returns cached metapage data. + * + * If metabuf is not InvalidBuffer, caller must hold a pin, but no lock, on + * the metapage. If not set, we'll set it before returning if we have to + * refresh the cache, and return with a pin but no lock on it; caller is + * responsible for releasing the pin. + * + * We refresh the cache if it's not initialized yet or force_refresh is true. + */ +HashMetaPage +_hash_getcachedmetap(Relation rel, Buffer *metabuf, bool force_refresh) +{ + Page page; + + Assert(metabuf); + if (force_refresh || rel->rd_amcache == NULL) + { + char *cache = NULL; + + /* + * It's important that we don't set rd_amcache to an invalid value. + * Either MemoryContextAlloc or _hash_getbuf could fail, so don't + * install a pointer to the newly-allocated storage in the actual + * relcache entry until both have succeeded. + */ + if (rel->rd_amcache == NULL) + cache = MemoryContextAlloc(rel->rd_indexcxt, + sizeof(HashMetaPageData)); + + /* Read the metapage. */ + if (BufferIsValid(*metabuf)) + LockBuffer(*metabuf, BUFFER_LOCK_SHARE); + else + *metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, + LH_META_PAGE); + page = BufferGetPage(*metabuf); + + /* Populate the cache. */ + if (rel->rd_amcache == NULL) + rel->rd_amcache = cache; + memcpy(rel->rd_amcache, HashPageGetMeta(page), + sizeof(HashMetaPageData)); + + /* Release metapage lock, but keep the pin. */ + LockBuffer(*metabuf, BUFFER_LOCK_UNLOCK); + } + + return (HashMetaPage) rel->rd_amcache; +} + +/* + * _hash_getbucketbuf_from_hashkey() -- Get the bucket's buffer for the given + * hashkey. + * + * Bucket pages do not move or get removed once they are allocated. This give + * us an opportunity to use the previously saved metapage contents to reach + * the target bucket buffer, instead of reading from the metapage every time. + * This saves one buffer access every time we want to reach the target bucket + * buffer, which is very helpful savings in bufmgr traffic and contention. + * + * The access type parameter (HASH_READ or HASH_WRITE) indicates whether the + * bucket buffer has to be locked for reading or writing. + * + * The out parameter cachedmetap is set with metapage contents used for + * hashkey to bucket buffer mapping. Some callers need this info to reach the + * old bucket in case of bucket split, see _hash_doinsert(). + */ +Buffer +_hash_getbucketbuf_from_hashkey(Relation rel, uint32 hashkey, int access, + HashMetaPage *cachedmetap) +{ + HashMetaPage metap; + Buffer buf; + Buffer metabuf = InvalidBuffer; + Page page; + Bucket bucket; + BlockNumber blkno; + HashPageOpaque opaque; + + /* We read from target bucket buffer, hence locking is must. */ + Assert(access == HASH_READ || access == HASH_WRITE); + + metap = _hash_getcachedmetap(rel, &metabuf, false); + Assert(metap != NULL); + + /* + * Loop until we get a lock on the correct target bucket. + */ + for (;;) + { + /* + * Compute the target bucket number, and convert to block number. + */ + bucket = _hash_hashkey2bucket(hashkey, + metap->hashm_maxbucket, + metap->hashm_highmask, + metap->hashm_lowmask); + + blkno = BUCKET_TO_BLKNO(metap, bucket); + + /* Fetch the primary bucket page for the bucket */ + buf = _hash_getbuf(rel, blkno, access, LH_BUCKET_PAGE); + page = BufferGetPage(buf); + opaque = (HashPageOpaque) PageGetSpecialPointer(page); + Assert(opaque->hasho_bucket == bucket); + Assert(opaque->hasho_prevblkno != InvalidBlockNumber); + + /* + * If this bucket hasn't been split, we're done. + */ + if (opaque->hasho_prevblkno <= metap->hashm_maxbucket) + break; + + /* Drop lock on this buffer, update cached metapage, and retry. */ + _hash_relbuf(rel, buf); + metap = _hash_getcachedmetap(rel, &metabuf, true); + Assert(metap != NULL); + } + + if (BufferIsValid(metabuf)) + _hash_dropbuf(rel, metabuf); + + if (cachedmetap) + *cachedmetap = metap; + + return buf; +} diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c new file mode 100644 index 0000000..2ffa28e --- /dev/null +++ b/src/backend/access/hash/hashsearch.c @@ -0,0 +1,721 @@ +/*------------------------------------------------------------------------- + * + * hashsearch.c + * search code for postgres hash tables + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/hash/hashsearch.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/hash.h" +#include "access/relscan.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "storage/predicate.h" +#include "utils/rel.h" + +static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP, + ScanDirection dir); +static int _hash_load_qualified_items(IndexScanDesc scan, Page page, + OffsetNumber offnum, ScanDirection dir); +static inline void _hash_saveitem(HashScanOpaque so, int itemIndex, + OffsetNumber offnum, IndexTuple itup); +static void _hash_readnext(IndexScanDesc scan, Buffer *bufp, + Page *pagep, HashPageOpaque *opaquep); + +/* + * _hash_next() -- Get the next item in a scan. + * + * On entry, so->currPos describes the current page, which may + * be pinned but not locked, and so->currPos.itemIndex identifies + * which item was previously returned. + * + * On successful exit, scan->xs_ctup.t_self is set to the TID + * of the next heap tuple. so->currPos is updated as needed. + * + * On failure exit (no more tuples), we return false with pin + * held on bucket page but no pins or locks held on overflow + * page. + */ +bool +_hash_next(IndexScanDesc scan, ScanDirection dir) +{ + Relation rel = scan->indexRelation; + HashScanOpaque so = (HashScanOpaque) scan->opaque; + HashScanPosItem *currItem; + BlockNumber blkno; + Buffer buf; + bool end_of_scan = false; + + /* + * Advance to the next tuple on the current page; or if done, try to read + * data from the next or previous page based on the scan direction. Before + * moving to the next or previous page make sure that we deal with all the + * killed items. + */ + if (ScanDirectionIsForward(dir)) + { + if (++so->currPos.itemIndex > so->currPos.lastItem) + { + if (so->numKilled > 0) + _hash_kill_items(scan); + + blkno = so->currPos.nextPage; + if (BlockNumberIsValid(blkno)) + { + buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE); + TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf)); + if (!_hash_readpage(scan, &buf, dir)) + end_of_scan = true; + } + else + end_of_scan = true; + } + } + else + { + if (--so->currPos.itemIndex < so->currPos.firstItem) + { + if (so->numKilled > 0) + _hash_kill_items(scan); + + blkno = so->currPos.prevPage; + if (BlockNumberIsValid(blkno)) + { + buf = _hash_getbuf(rel, blkno, HASH_READ, + LH_BUCKET_PAGE | LH_OVERFLOW_PAGE); + TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf)); + + /* + * We always maintain the pin on bucket page for whole scan + * operation, so releasing the additional pin we have acquired + * here. + */ + if (buf == so->hashso_bucket_buf || + buf == so->hashso_split_bucket_buf) + _hash_dropbuf(rel, buf); + + if (!_hash_readpage(scan, &buf, dir)) + end_of_scan = true; + } + else + end_of_scan = true; + } + } + + if (end_of_scan) + { + _hash_dropscanbuf(rel, so); + HashScanPosInvalidate(so->currPos); + return false; + } + + /* OK, itemIndex says what to return */ + currItem = &so->currPos.items[so->currPos.itemIndex]; + scan->xs_heaptid = currItem->heapTid; + + return true; +} + +/* + * Advance to next page in a bucket, if any. If we are scanning the bucket + * being populated during split operation then this function advances to the + * bucket being split after the last bucket page of bucket being populated. + */ +static void +_hash_readnext(IndexScanDesc scan, + Buffer *bufp, Page *pagep, HashPageOpaque *opaquep) +{ + BlockNumber blkno; + Relation rel = scan->indexRelation; + HashScanOpaque so = (HashScanOpaque) scan->opaque; + bool block_found = false; + + blkno = (*opaquep)->hasho_nextblkno; + + /* + * Retain the pin on primary bucket page till the end of scan. Refer the + * comments in _hash_first to know the reason of retaining pin. + */ + if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf) + LockBuffer(*bufp, BUFFER_LOCK_UNLOCK); + else + _hash_relbuf(rel, *bufp); + + *bufp = InvalidBuffer; + /* check for interrupts while we're not holding any buffer lock */ + CHECK_FOR_INTERRUPTS(); + if (BlockNumberIsValid(blkno)) + { + *bufp = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE); + block_found = true; + } + else if (so->hashso_buc_populated && !so->hashso_buc_split) + { + /* + * end of bucket, scan bucket being split if there was a split in + * progress at the start of scan. + */ + *bufp = so->hashso_split_bucket_buf; + + /* + * buffer for bucket being split must be valid as we acquire the pin + * on it before the start of scan and retain it till end of scan. + */ + Assert(BufferIsValid(*bufp)); + + LockBuffer(*bufp, BUFFER_LOCK_SHARE); + PredicateLockPage(rel, BufferGetBlockNumber(*bufp), scan->xs_snapshot); + + /* + * setting hashso_buc_split to true indicates that we are scanning + * bucket being split. + */ + so->hashso_buc_split = true; + + block_found = true; + } + + if (block_found) + { + *pagep = BufferGetPage(*bufp); + TestForOldSnapshot(scan->xs_snapshot, rel, *pagep); + *opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep); + } +} + +/* + * Advance to previous page in a bucket, if any. If the current scan has + * started during split operation then this function advances to bucket + * being populated after the first bucket page of bucket being split. + */ +static void +_hash_readprev(IndexScanDesc scan, + Buffer *bufp, Page *pagep, HashPageOpaque *opaquep) +{ + BlockNumber blkno; + Relation rel = scan->indexRelation; + HashScanOpaque so = (HashScanOpaque) scan->opaque; + bool haveprevblk; + + blkno = (*opaquep)->hasho_prevblkno; + + /* + * Retain the pin on primary bucket page till the end of scan. Refer the + * comments in _hash_first to know the reason of retaining pin. + */ + if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf) + { + LockBuffer(*bufp, BUFFER_LOCK_UNLOCK); + haveprevblk = false; + } + else + { + _hash_relbuf(rel, *bufp); + haveprevblk = true; + } + + *bufp = InvalidBuffer; + /* check for interrupts while we're not holding any buffer lock */ + CHECK_FOR_INTERRUPTS(); + + if (haveprevblk) + { + Assert(BlockNumberIsValid(blkno)); + *bufp = _hash_getbuf(rel, blkno, HASH_READ, + LH_BUCKET_PAGE | LH_OVERFLOW_PAGE); + *pagep = BufferGetPage(*bufp); + TestForOldSnapshot(scan->xs_snapshot, rel, *pagep); + *opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep); + + /* + * We always maintain the pin on bucket page for whole scan operation, + * so releasing the additional pin we have acquired here. + */ + if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf) + _hash_dropbuf(rel, *bufp); + } + else if (so->hashso_buc_populated && so->hashso_buc_split) + { + /* + * end of bucket, scan bucket being populated if there was a split in + * progress at the start of scan. + */ + *bufp = so->hashso_bucket_buf; + + /* + * buffer for bucket being populated must be valid as we acquire the + * pin on it before the start of scan and retain it till end of scan. + */ + Assert(BufferIsValid(*bufp)); + + LockBuffer(*bufp, BUFFER_LOCK_SHARE); + *pagep = BufferGetPage(*bufp); + *opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep); + + /* move to the end of bucket chain */ + while (BlockNumberIsValid((*opaquep)->hasho_nextblkno)) + _hash_readnext(scan, bufp, pagep, opaquep); + + /* + * setting hashso_buc_split to false indicates that we are scanning + * bucket being populated. + */ + so->hashso_buc_split = false; + } +} + +/* + * _hash_first() -- Find the first item in a scan. + * + * We find the first item (or, if backward scan, the last item) in the + * index that satisfies the qualification associated with the scan + * descriptor. + * + * On successful exit, if the page containing current index tuple is an + * overflow page, both pin and lock are released whereas if it is a bucket + * page then it is pinned but not locked and data about the matching + * tuple(s) on the page has been loaded into so->currPos, + * scan->xs_ctup.t_self is set to the heap TID of the current tuple. + * + * On failure exit (no more tuples), we return false, with pin held on + * bucket page but no pins or locks held on overflow page. + */ +bool +_hash_first(IndexScanDesc scan, ScanDirection dir) +{ + Relation rel = scan->indexRelation; + HashScanOpaque so = (HashScanOpaque) scan->opaque; + ScanKey cur; + uint32 hashkey; + Bucket bucket; + Buffer buf; + Page page; + HashPageOpaque opaque; + HashScanPosItem *currItem; + + pgstat_count_index_scan(rel); + + /* + * We do not support hash scans with no index qualification, because we + * would have to read the whole index rather than just one bucket. That + * creates a whole raft of problems, since we haven't got a practical way + * to lock all the buckets against splits or compactions. + */ + if (scan->numberOfKeys < 1) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("hash indexes do not support whole-index scans"))); + + /* There may be more than one index qual, but we hash only the first */ + cur = &scan->keyData[0]; + + /* We support only single-column hash indexes */ + Assert(cur->sk_attno == 1); + /* And there's only one operator strategy, too */ + Assert(cur->sk_strategy == HTEqualStrategyNumber); + + /* + * If the constant in the index qual is NULL, assume it cannot match any + * items in the index. + */ + if (cur->sk_flags & SK_ISNULL) + return false; + + /* + * Okay to compute the hash key. We want to do this before acquiring any + * locks, in case a user-defined hash function happens to be slow. + * + * If scankey operator is not a cross-type comparison, we can use the + * cached hash function; otherwise gotta look it up in the catalogs. + * + * We support the convention that sk_subtype == InvalidOid means the + * opclass input type; this is a hack to simplify life for ScanKeyInit(). + */ + if (cur->sk_subtype == rel->rd_opcintype[0] || + cur->sk_subtype == InvalidOid) + hashkey = _hash_datum2hashkey(rel, cur->sk_argument); + else + hashkey = _hash_datum2hashkey_type(rel, cur->sk_argument, + cur->sk_subtype); + + so->hashso_sk_hash = hashkey; + + buf = _hash_getbucketbuf_from_hashkey(rel, hashkey, HASH_READ, NULL); + PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot); + page = BufferGetPage(buf); + TestForOldSnapshot(scan->xs_snapshot, rel, page); + opaque = (HashPageOpaque) PageGetSpecialPointer(page); + bucket = opaque->hasho_bucket; + + so->hashso_bucket_buf = buf; + + /* + * If a bucket split is in progress, then while scanning the bucket being + * populated, we need to skip tuples that were copied from bucket being + * split. We also need to maintain a pin on the bucket being split to + * ensure that split-cleanup work done by vacuum doesn't remove tuples + * from it till this scan is done. We need to maintain a pin on the + * bucket being populated to ensure that vacuum doesn't squeeze that + * bucket till this scan is complete; otherwise, the ordering of tuples + * can't be maintained during forward and backward scans. Here, we have + * to be cautious about locking order: first, acquire the lock on bucket + * being split; then, release the lock on it but not the pin; then, + * acquire a lock on bucket being populated and again re-verify whether + * the bucket split is still in progress. Acquiring the lock on bucket + * being split first ensures that the vacuum waits for this scan to + * finish. + */ + if (H_BUCKET_BEING_POPULATED(opaque)) + { + BlockNumber old_blkno; + Buffer old_buf; + + old_blkno = _hash_get_oldblock_from_newbucket(rel, bucket); + + /* + * release the lock on new bucket and re-acquire it after acquiring + * the lock on old bucket. + */ + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE); + TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf)); + + /* + * remember the split bucket buffer so as to use it later for + * scanning. + */ + so->hashso_split_bucket_buf = old_buf; + LockBuffer(old_buf, BUFFER_LOCK_UNLOCK); + + LockBuffer(buf, BUFFER_LOCK_SHARE); + page = BufferGetPage(buf); + opaque = (HashPageOpaque) PageGetSpecialPointer(page); + Assert(opaque->hasho_bucket == bucket); + + if (H_BUCKET_BEING_POPULATED(opaque)) + so->hashso_buc_populated = true; + else + { + _hash_dropbuf(rel, so->hashso_split_bucket_buf); + so->hashso_split_bucket_buf = InvalidBuffer; + } + } + + /* If a backwards scan is requested, move to the end of the chain */ + if (ScanDirectionIsBackward(dir)) + { + /* + * Backward scans that start during split needs to start from end of + * bucket being split. + */ + while (BlockNumberIsValid(opaque->hasho_nextblkno) || + (so->hashso_buc_populated && !so->hashso_buc_split)) + _hash_readnext(scan, &buf, &page, &opaque); + } + + /* remember which buffer we have pinned, if any */ + Assert(BufferIsInvalid(so->currPos.buf)); + so->currPos.buf = buf; + + /* Now find all the tuples satisfying the qualification from a page */ + if (!_hash_readpage(scan, &buf, dir)) + return false; + + /* OK, itemIndex says what to return */ + currItem = &so->currPos.items[so->currPos.itemIndex]; + scan->xs_heaptid = currItem->heapTid; + + /* if we're here, _hash_readpage found a valid tuples */ + return true; +} + +/* + * _hash_readpage() -- Load data from current index page into so->currPos + * + * We scan all the items in the current index page and save them into + * so->currPos if it satisfies the qualification. If no matching items + * are found in the current page, we move to the next or previous page + * in a bucket chain as indicated by the direction. + * + * Return true if any matching items are found else return false. + */ +static bool +_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir) +{ + Relation rel = scan->indexRelation; + HashScanOpaque so = (HashScanOpaque) scan->opaque; + Buffer buf; + Page page; + HashPageOpaque opaque; + OffsetNumber offnum; + uint16 itemIndex; + + buf = *bufP; + Assert(BufferIsValid(buf)); + _hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE); + page = BufferGetPage(buf); + opaque = (HashPageOpaque) PageGetSpecialPointer(page); + + so->currPos.buf = buf; + so->currPos.currPage = BufferGetBlockNumber(buf); + + if (ScanDirectionIsForward(dir)) + { + BlockNumber prev_blkno = InvalidBlockNumber; + + for (;;) + { + /* new page, locate starting position by binary search */ + offnum = _hash_binsearch(page, so->hashso_sk_hash); + + itemIndex = _hash_load_qualified_items(scan, page, offnum, dir); + + if (itemIndex != 0) + break; + + /* + * Could not find any matching tuples in the current page, move to + * the next page. Before leaving the current page, deal with any + * killed items. + */ + if (so->numKilled > 0) + _hash_kill_items(scan); + + /* + * If this is a primary bucket page, hasho_prevblkno is not a real + * block number. + */ + if (so->currPos.buf == so->hashso_bucket_buf || + so->currPos.buf == so->hashso_split_bucket_buf) + prev_blkno = InvalidBlockNumber; + else + prev_blkno = opaque->hasho_prevblkno; + + _hash_readnext(scan, &buf, &page, &opaque); + if (BufferIsValid(buf)) + { + so->currPos.buf = buf; + so->currPos.currPage = BufferGetBlockNumber(buf); + } + else + { + /* + * Remember next and previous block numbers for scrollable + * cursors to know the start position and return false + * indicating that no more matching tuples were found. Also, + * don't reset currPage or lsn, because we expect + * _hash_kill_items to be called for the old page after this + * function returns. + */ + so->currPos.prevPage = prev_blkno; + so->currPos.nextPage = InvalidBlockNumber; + so->currPos.buf = buf; + return false; + } + } + + so->currPos.firstItem = 0; + so->currPos.lastItem = itemIndex - 1; + so->currPos.itemIndex = 0; + } + else + { + BlockNumber next_blkno = InvalidBlockNumber; + + for (;;) + { + /* new page, locate starting position by binary search */ + offnum = _hash_binsearch_last(page, so->hashso_sk_hash); + + itemIndex = _hash_load_qualified_items(scan, page, offnum, dir); + + if (itemIndex != MaxIndexTuplesPerPage) + break; + + /* + * Could not find any matching tuples in the current page, move to + * the previous page. Before leaving the current page, deal with + * any killed items. + */ + if (so->numKilled > 0) + _hash_kill_items(scan); + + if (so->currPos.buf == so->hashso_bucket_buf || + so->currPos.buf == so->hashso_split_bucket_buf) + next_blkno = opaque->hasho_nextblkno; + + _hash_readprev(scan, &buf, &page, &opaque); + if (BufferIsValid(buf)) + { + so->currPos.buf = buf; + so->currPos.currPage = BufferGetBlockNumber(buf); + } + else + { + /* + * Remember next and previous block numbers for scrollable + * cursors to know the start position and return false + * indicating that no more matching tuples were found. Also, + * don't reset currPage or lsn, because we expect + * _hash_kill_items to be called for the old page after this + * function returns. + */ + so->currPos.prevPage = InvalidBlockNumber; + so->currPos.nextPage = next_blkno; + so->currPos.buf = buf; + return false; + } + } + + so->currPos.firstItem = itemIndex; + so->currPos.lastItem = MaxIndexTuplesPerPage - 1; + so->currPos.itemIndex = MaxIndexTuplesPerPage - 1; + } + + if (so->currPos.buf == so->hashso_bucket_buf || + so->currPos.buf == so->hashso_split_bucket_buf) + { + so->currPos.prevPage = InvalidBlockNumber; + so->currPos.nextPage = opaque->hasho_nextblkno; + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); + } + else + { + so->currPos.prevPage = opaque->hasho_prevblkno; + so->currPos.nextPage = opaque->hasho_nextblkno; + _hash_relbuf(rel, so->currPos.buf); + so->currPos.buf = InvalidBuffer; + } + + Assert(so->currPos.firstItem <= so->currPos.lastItem); + return true; +} + +/* + * Load all the qualified items from a current index page + * into so->currPos. Helper function for _hash_readpage. + */ +static int +_hash_load_qualified_items(IndexScanDesc scan, Page page, + OffsetNumber offnum, ScanDirection dir) +{ + HashScanOpaque so = (HashScanOpaque) scan->opaque; + IndexTuple itup; + int itemIndex; + OffsetNumber maxoff; + + maxoff = PageGetMaxOffsetNumber(page); + + if (ScanDirectionIsForward(dir)) + { + /* load items[] in ascending order */ + itemIndex = 0; + + while (offnum <= maxoff) + { + Assert(offnum >= FirstOffsetNumber); + itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum)); + + /* + * skip the tuples that are moved by split operation for the scan + * that has started when split was in progress. Also, skip the + * tuples that are marked as dead. + */ + if ((so->hashso_buc_populated && !so->hashso_buc_split && + (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) || + (scan->ignore_killed_tuples && + (ItemIdIsDead(PageGetItemId(page, offnum))))) + { + offnum = OffsetNumberNext(offnum); /* move forward */ + continue; + } + + if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) && + _hash_checkqual(scan, itup)) + { + /* tuple is qualified, so remember it */ + _hash_saveitem(so, itemIndex, offnum, itup); + itemIndex++; + } + else + { + /* + * No more matching tuples exist in this page. so, exit while + * loop. + */ + break; + } + + offnum = OffsetNumberNext(offnum); + } + + Assert(itemIndex <= MaxIndexTuplesPerPage); + return itemIndex; + } + else + { + /* load items[] in descending order */ + itemIndex = MaxIndexTuplesPerPage; + + while (offnum >= FirstOffsetNumber) + { + Assert(offnum <= maxoff); + itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum)); + + /* + * skip the tuples that are moved by split operation for the scan + * that has started when split was in progress. Also, skip the + * tuples that are marked as dead. + */ + if ((so->hashso_buc_populated && !so->hashso_buc_split && + (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) || + (scan->ignore_killed_tuples && + (ItemIdIsDead(PageGetItemId(page, offnum))))) + { + offnum = OffsetNumberPrev(offnum); /* move back */ + continue; + } + + if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) && + _hash_checkqual(scan, itup)) + { + itemIndex--; + /* tuple is qualified, so remember it */ + _hash_saveitem(so, itemIndex, offnum, itup); + } + else + { + /* + * No more matching tuples exist in this page. so, exit while + * loop. + */ + break; + } + + offnum = OffsetNumberPrev(offnum); + } + + Assert(itemIndex >= 0); + return itemIndex; + } +} + +/* Save an index item into so->currPos.items[itemIndex] */ +static inline void +_hash_saveitem(HashScanOpaque so, int itemIndex, + OffsetNumber offnum, IndexTuple itup) +{ + HashScanPosItem *currItem = &so->currPos.items[itemIndex]; + + currItem->heapTid = itup->t_tid; + currItem->indexOffset = offnum; +} diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c new file mode 100644 index 0000000..3ce4248 --- /dev/null +++ b/src/backend/access/hash/hashsort.c @@ -0,0 +1,152 @@ +/*------------------------------------------------------------------------- + * + * hashsort.c + * Sort tuples for insertion into a new hash index. + * + * When building a very large hash index, we pre-sort the tuples by bucket + * number to improve locality of access to the index, and thereby avoid + * thrashing. We use tuplesort.c to sort the given index tuples into order. + * + * Note: if the number of rows in the table has been underestimated, + * bucket splits may occur during the index build. In that case we'd + * be inserting into two or more buckets for each possible masked-off + * hash code value. That's no big problem though, since we'll still have + * plenty of locality of access. + * + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/hash/hashsort.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/hash.h" +#include "commands/progress.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "port/pg_bitutils.h" +#include "utils/tuplesort.h" + + +/* + * Status record for spooling/sorting phase. + */ +struct HSpool +{ + Tuplesortstate *sortstate; /* state data for tuplesort.c */ + Relation index; + + /* + * We sort the hash keys based on the buckets they belong to. Below masks + * are used in _hash_hashkey2bucket to determine the bucket of given hash + * key. + */ + uint32 high_mask; + uint32 low_mask; + uint32 max_buckets; +}; + + +/* + * create and initialize a spool structure + */ +HSpool * +_h_spoolinit(Relation heap, Relation index, uint32 num_buckets) +{ + HSpool *hspool = (HSpool *) palloc0(sizeof(HSpool)); + + hspool->index = index; + + /* + * Determine the bitmask for hash code values. Since there are currently + * num_buckets buckets in the index, the appropriate mask can be computed + * as follows. + * + * NOTE : This hash mask calculation should be in sync with similar + * calculation in _hash_init_metabuffer. + */ + hspool->high_mask = pg_nextpower2_32(num_buckets + 1) - 1; + hspool->low_mask = (hspool->high_mask >> 1); + hspool->max_buckets = num_buckets - 1; + + /* + * We size the sort area as maintenance_work_mem rather than work_mem to + * speed index creation. This should be OK since a single backend can't + * run multiple index creations in parallel. + */ + hspool->sortstate = tuplesort_begin_index_hash(heap, + index, + hspool->high_mask, + hspool->low_mask, + hspool->max_buckets, + maintenance_work_mem, + NULL, + false); + + return hspool; +} + +/* + * clean up a spool structure and its substructures. + */ +void +_h_spooldestroy(HSpool *hspool) +{ + tuplesort_end(hspool->sortstate); + pfree(hspool); +} + +/* + * spool an index entry into the sort file. + */ +void +_h_spool(HSpool *hspool, ItemPointer self, Datum *values, bool *isnull) +{ + tuplesort_putindextuplevalues(hspool->sortstate, hspool->index, + self, values, isnull); +} + +/* + * given a spool loaded by successive calls to _h_spool, + * create an entire index. + */ +void +_h_indexbuild(HSpool *hspool, Relation heapRel) +{ + IndexTuple itup; + int64 tups_done = 0; +#ifdef USE_ASSERT_CHECKING + uint32 hashkey = 0; +#endif + + tuplesort_performsort(hspool->sortstate); + + while ((itup = tuplesort_getindextuple(hspool->sortstate, true)) != NULL) + { + /* + * Technically, it isn't critical that hash keys be found in sorted + * order, since this sorting is only used to increase locality of + * access as a performance optimization. It still seems like a good + * idea to test tuplesort.c's handling of hash index tuple sorts + * through an assertion, though. + */ +#ifdef USE_ASSERT_CHECKING + uint32 lasthashkey = hashkey; + + hashkey = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup), + hspool->max_buckets, hspool->high_mask, + hspool->low_mask); + Assert(hashkey >= lasthashkey); +#endif + + _hash_doinsert(hspool->index, itup, heapRel); + + pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, + ++tups_done); + } +} diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c new file mode 100644 index 0000000..5198728 --- /dev/null +++ b/src/backend/access/hash/hashutil.c @@ -0,0 +1,622 @@ +/*------------------------------------------------------------------------- + * + * hashutil.c + * Utility code for Postgres hash implementation. + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/access/hash/hashutil.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/hash.h" +#include "access/reloptions.h" +#include "access/relscan.h" +#include "port/pg_bitutils.h" +#include "storage/buf_internals.h" +#include "utils/lsyscache.h" +#include "utils/rel.h" + +#define CALC_NEW_BUCKET(old_bucket, lowmask) \ + old_bucket | (lowmask + 1) + +/* + * _hash_checkqual -- does the index tuple satisfy the scan conditions? + */ +bool +_hash_checkqual(IndexScanDesc scan, IndexTuple itup) +{ + /* + * Currently, we can't check any of the scan conditions since we do not + * have the original index entry value to supply to the sk_func. Always + * return true; we expect that hashgettuple already set the recheck flag + * to make the main indexscan code do it. + */ +#ifdef NOT_USED + TupleDesc tupdesc = RelationGetDescr(scan->indexRelation); + ScanKey key = scan->keyData; + int scanKeySize = scan->numberOfKeys; + + while (scanKeySize > 0) + { + Datum datum; + bool isNull; + Datum test; + + datum = index_getattr(itup, + key->sk_attno, + tupdesc, + &isNull); + + /* assume sk_func is strict */ + if (isNull) + return false; + if (key->sk_flags & SK_ISNULL) + return false; + + test = FunctionCall2Coll(&key->sk_func, key->sk_collation, + datum, key->sk_argument); + + if (!DatumGetBool(test)) + return false; + + key++; + scanKeySize--; + } +#endif + + return true; +} + +/* + * _hash_datum2hashkey -- given a Datum, call the index's hash function + * + * The Datum is assumed to be of the index's column type, so we can use the + * "primary" hash function that's tracked for us by the generic index code. + */ +uint32 +_hash_datum2hashkey(Relation rel, Datum key) +{ + FmgrInfo *procinfo; + Oid collation; + + /* XXX assumes index has only one attribute */ + procinfo = index_getprocinfo(rel, 1, HASHSTANDARD_PROC); + collation = rel->rd_indcollation[0]; + + return DatumGetUInt32(FunctionCall1Coll(procinfo, collation, key)); +} + +/* + * _hash_datum2hashkey_type -- given a Datum of a specified type, + * hash it in a fashion compatible with this index + * + * This is much more expensive than _hash_datum2hashkey, so use it only in + * cross-type situations. + */ +uint32 +_hash_datum2hashkey_type(Relation rel, Datum key, Oid keytype) +{ + RegProcedure hash_proc; + Oid collation; + + /* XXX assumes index has only one attribute */ + hash_proc = get_opfamily_proc(rel->rd_opfamily[0], + keytype, + keytype, + HASHSTANDARD_PROC); + if (!RegProcedureIsValid(hash_proc)) + elog(ERROR, "missing support function %d(%u,%u) for index \"%s\"", + HASHSTANDARD_PROC, keytype, keytype, + RelationGetRelationName(rel)); + collation = rel->rd_indcollation[0]; + + return DatumGetUInt32(OidFunctionCall1Coll(hash_proc, collation, key)); +} + +/* + * _hash_hashkey2bucket -- determine which bucket the hashkey maps to. + */ +Bucket +_hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket, + uint32 highmask, uint32 lowmask) +{ + Bucket bucket; + + bucket = hashkey & highmask; + if (bucket > maxbucket) + bucket = bucket & lowmask; + + return bucket; +} + +/* + * _hash_spareindex -- returns spare index / global splitpoint phase of the + * bucket + */ +uint32 +_hash_spareindex(uint32 num_bucket) +{ + uint32 splitpoint_group; + uint32 splitpoint_phases; + + splitpoint_group = pg_ceil_log2_32(num_bucket); + + if (splitpoint_group < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) + return splitpoint_group; + + /* account for single-phase groups */ + splitpoint_phases = HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE; + + /* account for multi-phase groups before splitpoint_group */ + splitpoint_phases += + ((splitpoint_group - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) << + HASH_SPLITPOINT_PHASE_BITS); + + /* account for phases within current group */ + splitpoint_phases += + (((num_bucket - 1) >> + (splitpoint_group - (HASH_SPLITPOINT_PHASE_BITS + 1))) & + HASH_SPLITPOINT_PHASE_MASK); /* to 0-based value. */ + + return splitpoint_phases; +} + +/* + * _hash_get_totalbuckets -- returns total number of buckets allocated till + * the given splitpoint phase. + */ +uint32 +_hash_get_totalbuckets(uint32 splitpoint_phase) +{ + uint32 splitpoint_group; + uint32 total_buckets; + uint32 phases_within_splitpoint_group; + + if (splitpoint_phase < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) + return (1 << splitpoint_phase); + + /* get splitpoint's group */ + splitpoint_group = HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE; + splitpoint_group += + ((splitpoint_phase - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) >> + HASH_SPLITPOINT_PHASE_BITS); + + /* account for buckets before splitpoint_group */ + total_buckets = (1 << (splitpoint_group - 1)); + + /* account for buckets within splitpoint_group */ + phases_within_splitpoint_group = + (((splitpoint_phase - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) & + HASH_SPLITPOINT_PHASE_MASK) + 1); /* from 0-based to 1-based */ + total_buckets += + (((1 << (splitpoint_group - 1)) >> HASH_SPLITPOINT_PHASE_BITS) * + phases_within_splitpoint_group); + + return total_buckets; +} + +/* + * _hash_checkpage -- sanity checks on the format of all hash pages + * + * If flags is not zero, it is a bitwise OR of the acceptable page types + * (values of hasho_flag & LH_PAGE_TYPE). + */ +void +_hash_checkpage(Relation rel, Buffer buf, int flags) +{ + Page page = BufferGetPage(buf); + + /* + * ReadBuffer verifies that every newly-read page passes + * PageHeaderIsValid, which means it either contains a reasonably sane + * page header or is all-zero. We have to defend against the all-zero + * case, however. + */ + if (PageIsNew(page)) + ereport(ERROR, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index \"%s\" contains unexpected zero page at block %u", + RelationGetRelationName(rel), + BufferGetBlockNumber(buf)), + errhint("Please REINDEX it."))); + + /* + * Additionally check that the special area looks sane. + */ + if (PageGetSpecialSize(page) != MAXALIGN(sizeof(HashPageOpaqueData))) + ereport(ERROR, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index \"%s\" contains corrupted page at block %u", + RelationGetRelationName(rel), + BufferGetBlockNumber(buf)), + errhint("Please REINDEX it."))); + + if (flags) + { + HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page); + + if ((opaque->hasho_flag & flags) == 0) + ereport(ERROR, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index \"%s\" contains corrupted page at block %u", + RelationGetRelationName(rel), + BufferGetBlockNumber(buf)), + errhint("Please REINDEX it."))); + } + + /* + * When checking the metapage, also verify magic number and version. + */ + if (flags == LH_META_PAGE) + { + HashMetaPage metap = HashPageGetMeta(page); + + if (metap->hashm_magic != HASH_MAGIC) + ereport(ERROR, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index \"%s\" is not a hash index", + RelationGetRelationName(rel)))); + + if (metap->hashm_version != HASH_VERSION) + ereport(ERROR, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index \"%s\" has wrong hash version", + RelationGetRelationName(rel)), + errhint("Please REINDEX it."))); + } +} + +bytea * +hashoptions(Datum reloptions, bool validate) +{ + static const relopt_parse_elt tab[] = { + {"fillfactor", RELOPT_TYPE_INT, offsetof(HashOptions, fillfactor)}, + }; + + return (bytea *) build_reloptions(reloptions, validate, + RELOPT_KIND_HASH, + sizeof(HashOptions), + tab, lengthof(tab)); +} + +/* + * _hash_get_indextuple_hashkey - get the hash index tuple's hash key value + */ +uint32 +_hash_get_indextuple_hashkey(IndexTuple itup) +{ + char *attp; + + /* + * We assume the hash key is the first attribute and can't be null, so + * this can be done crudely but very very cheaply ... + */ + attp = (char *) itup + IndexInfoFindDataOffset(itup->t_info); + return *((uint32 *) attp); +} + +/* + * _hash_convert_tuple - convert raw index data to hash key + * + * Inputs: values and isnull arrays for the user data column(s) + * Outputs: values and isnull arrays for the index tuple, suitable for + * passing to index_form_tuple(). + * + * Returns true if successful, false if not (because there are null values). + * On a false result, the given data need not be indexed. + * + * Note: callers know that the index-column arrays are always of length 1. + * In principle, there could be more than one input column, though we do not + * currently support that. + */ +bool +_hash_convert_tuple(Relation index, + Datum *user_values, bool *user_isnull, + Datum *index_values, bool *index_isnull) +{ + uint32 hashkey; + + /* + * We do not insert null values into hash indexes. This is okay because + * the only supported search operator is '=', and we assume it is strict. + */ + if (user_isnull[0]) + return false; + + hashkey = _hash_datum2hashkey(index, user_values[0]); + index_values[0] = UInt32GetDatum(hashkey); + index_isnull[0] = false; + return true; +} + +/* + * _hash_binsearch - Return the offset number in the page where the + * specified hash value should be sought or inserted. + * + * We use binary search, relying on the assumption that the existing entries + * are ordered by hash key. + * + * Returns the offset of the first index entry having hashkey >= hash_value, + * or the page's max offset plus one if hash_value is greater than all + * existing hash keys in the page. This is the appropriate place to start + * a search, or to insert a new item. + */ +OffsetNumber +_hash_binsearch(Page page, uint32 hash_value) +{ + OffsetNumber upper; + OffsetNumber lower; + + /* Loop invariant: lower <= desired place <= upper */ + upper = PageGetMaxOffsetNumber(page) + 1; + lower = FirstOffsetNumber; + + while (upper > lower) + { + OffsetNumber off; + IndexTuple itup; + uint32 hashkey; + + off = (upper + lower) / 2; + Assert(OffsetNumberIsValid(off)); + + itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, off)); + hashkey = _hash_get_indextuple_hashkey(itup); + if (hashkey < hash_value) + lower = off + 1; + else + upper = off; + } + + return lower; +} + +/* + * _hash_binsearch_last + * + * Same as above, except that if there are multiple matching items in the + * page, we return the offset of the last one instead of the first one, + * and the possible range of outputs is 0..maxoffset not 1..maxoffset+1. + * This is handy for starting a new page in a backwards scan. + */ +OffsetNumber +_hash_binsearch_last(Page page, uint32 hash_value) +{ + OffsetNumber upper; + OffsetNumber lower; + + /* Loop invariant: lower <= desired place <= upper */ + upper = PageGetMaxOffsetNumber(page); + lower = FirstOffsetNumber - 1; + + while (upper > lower) + { + IndexTuple itup; + OffsetNumber off; + uint32 hashkey; + + off = (upper + lower + 1) / 2; + Assert(OffsetNumberIsValid(off)); + + itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, off)); + hashkey = _hash_get_indextuple_hashkey(itup); + if (hashkey > hash_value) + upper = off - 1; + else + lower = off; + } + + return lower; +} + +/* + * _hash_get_oldblock_from_newbucket() -- get the block number of a bucket + * from which current (new) bucket is being split. + */ +BlockNumber +_hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bucket) +{ + Bucket old_bucket; + uint32 mask; + Buffer metabuf; + HashMetaPage metap; + BlockNumber blkno; + + /* + * To get the old bucket from the current bucket, we need a mask to modulo + * into lower half of table. This mask is stored in meta page as + * hashm_lowmask, but here we can't rely on the same, because we need a + * value of lowmask that was prevalent at the time when bucket split was + * started. Masking the most significant bit of new bucket would give us + * old bucket. + */ + mask = (((uint32) 1) << (fls(new_bucket) - 1)) - 1; + old_bucket = new_bucket & mask; + + metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE); + metap = HashPageGetMeta(BufferGetPage(metabuf)); + + blkno = BUCKET_TO_BLKNO(metap, old_bucket); + + _hash_relbuf(rel, metabuf); + + return blkno; +} + +/* + * _hash_get_newblock_from_oldbucket() -- get the block number of a bucket + * that will be generated after split from old bucket. + * + * This is used to find the new bucket from old bucket based on current table + * half. It is mainly required to finish the incomplete splits where we are + * sure that not more than one bucket could have split in progress from old + * bucket. + */ +BlockNumber +_hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket) +{ + Bucket new_bucket; + Buffer metabuf; + HashMetaPage metap; + BlockNumber blkno; + + metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE); + metap = HashPageGetMeta(BufferGetPage(metabuf)); + + new_bucket = _hash_get_newbucket_from_oldbucket(rel, old_bucket, + metap->hashm_lowmask, + metap->hashm_maxbucket); + blkno = BUCKET_TO_BLKNO(metap, new_bucket); + + _hash_relbuf(rel, metabuf); + + return blkno; +} + +/* + * _hash_get_newbucket_from_oldbucket() -- get the new bucket that will be + * generated after split from current (old) bucket. + * + * This is used to find the new bucket from old bucket. New bucket can be + * obtained by OR'ing old bucket with most significant bit of current table + * half (lowmask passed in this function can be used to identify msb of + * current table half). There could be multiple buckets that could have + * been split from current bucket. We need the first such bucket that exists. + * Caller must ensure that no more than one split has happened from old + * bucket. + */ +Bucket +_hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket, + uint32 lowmask, uint32 maxbucket) +{ + Bucket new_bucket; + + new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask); + if (new_bucket > maxbucket) + { + lowmask = lowmask >> 1; + new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask); + } + + return new_bucket; +} + +/* + * _hash_kill_items - set LP_DEAD state for items an indexscan caller has + * told us were killed. + * + * scan->opaque, referenced locally through so, contains information about the + * current page and killed tuples thereon (generally, this should only be + * called if so->numKilled > 0). + * + * The caller does not have a lock on the page and may or may not have the + * page pinned in a buffer. Note that read-lock is sufficient for setting + * LP_DEAD status (which is only a hint). + * + * The caller must have pin on bucket buffer, but may or may not have pin + * on overflow buffer, as indicated by HashScanPosIsPinned(so->currPos). + * + * We match items by heap TID before assuming they are the right ones to + * delete. + * + * There are never any scans active in a bucket at the time VACUUM begins, + * because VACUUM takes a cleanup lock on the primary bucket page and scans + * hold a pin. A scan can begin after VACUUM leaves the primary bucket page + * but before it finishes the entire bucket, but it can never pass VACUUM, + * because VACUUM always locks the next page before releasing the lock on + * the previous one. Therefore, we don't have to worry about accidentally + * killing a TID that has been reused for an unrelated tuple. + */ +void +_hash_kill_items(IndexScanDesc scan) +{ + HashScanOpaque so = (HashScanOpaque) scan->opaque; + Relation rel = scan->indexRelation; + BlockNumber blkno; + Buffer buf; + Page page; + HashPageOpaque opaque; + OffsetNumber offnum, + maxoff; + int numKilled = so->numKilled; + int i; + bool killedsomething = false; + bool havePin = false; + + Assert(so->numKilled > 0); + Assert(so->killedItems != NULL); + Assert(HashScanPosIsValid(so->currPos)); + + /* + * Always reset the scan state, so we don't look for same items on other + * pages. + */ + so->numKilled = 0; + + blkno = so->currPos.currPage; + if (HashScanPosIsPinned(so->currPos)) + { + /* + * We already have pin on this buffer, so, all we need to do is + * acquire lock on it. + */ + havePin = true; + buf = so->currPos.buf; + LockBuffer(buf, BUFFER_LOCK_SHARE); + } + else + buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE); + + page = BufferGetPage(buf); + opaque = (HashPageOpaque) PageGetSpecialPointer(page); + maxoff = PageGetMaxOffsetNumber(page); + + for (i = 0; i < numKilled; i++) + { + int itemIndex = so->killedItems[i]; + HashScanPosItem *currItem = &so->currPos.items[itemIndex]; + + offnum = currItem->indexOffset; + + Assert(itemIndex >= so->currPos.firstItem && + itemIndex <= so->currPos.lastItem); + + while (offnum <= maxoff) + { + ItemId iid = PageGetItemId(page, offnum); + IndexTuple ituple = (IndexTuple) PageGetItem(page, iid); + + if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid)) + { + /* found the item */ + ItemIdMarkDead(iid); + killedsomething = true; + break; /* out of inner search loop */ + } + offnum = OffsetNumberNext(offnum); + } + } + + /* + * Since this can be redone later if needed, mark as dirty hint. Whenever + * we mark anything LP_DEAD, we also set the page's + * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint. + */ + if (killedsomething) + { + opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES; + MarkBufferDirtyHint(buf, true); + } + + if (so->hashso_bucket_buf == so->currPos.buf || + havePin) + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); + else + _hash_relbuf(rel, buf); +} diff --git a/src/backend/access/hash/hashvalidate.c b/src/backend/access/hash/hashvalidate.c new file mode 100644 index 0000000..1e343df --- /dev/null +++ b/src/backend/access/hash/hashvalidate.c @@ -0,0 +1,439 @@ +/*------------------------------------------------------------------------- + * + * hashvalidate.c + * Opclass validator for hash. + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/access/hash/hashvalidate.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/amvalidate.h" +#include "access/hash.h" +#include "access/htup_details.h" +#include "access/xact.h" +#include "catalog/pg_am.h" +#include "catalog/pg_amop.h" +#include "catalog/pg_amproc.h" +#include "catalog/pg_opclass.h" +#include "catalog/pg_opfamily.h" +#include "catalog/pg_proc.h" +#include "catalog/pg_type.h" +#include "parser/parse_coerce.h" +#include "utils/builtins.h" +#include "utils/fmgroids.h" +#include "utils/lsyscache.h" +#include "utils/regproc.h" +#include "utils/syscache.h" + + +static bool check_hash_func_signature(Oid funcid, int16 amprocnum, Oid argtype); + + +/* + * Validator for a hash opclass. + * + * Some of the checks done here cover the whole opfamily, and therefore are + * redundant when checking each opclass in a family. But they don't run long + * enough to be much of a problem, so we accept the duplication rather than + * complicate the amvalidate API. + */ +bool +hashvalidate(Oid opclassoid) +{ + bool result = true; + HeapTuple classtup; + Form_pg_opclass classform; + Oid opfamilyoid; + Oid opcintype; + char *opclassname; + HeapTuple familytup; + Form_pg_opfamily familyform; + char *opfamilyname; + CatCList *proclist, + *oprlist; + List *grouplist; + OpFamilyOpFuncGroup *opclassgroup; + List *hashabletypes = NIL; + int i; + ListCell *lc; + + /* Fetch opclass information */ + classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid)); + if (!HeapTupleIsValid(classtup)) + elog(ERROR, "cache lookup failed for operator class %u", opclassoid); + classform = (Form_pg_opclass) GETSTRUCT(classtup); + + opfamilyoid = classform->opcfamily; + opcintype = classform->opcintype; + opclassname = NameStr(classform->opcname); + + /* Fetch opfamily information */ + familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid)); + if (!HeapTupleIsValid(familytup)) + elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid); + familyform = (Form_pg_opfamily) GETSTRUCT(familytup); + + opfamilyname = NameStr(familyform->opfname); + + /* Fetch all operators and support functions of the opfamily */ + oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid)); + proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid)); + + /* Check individual support functions */ + for (i = 0; i < proclist->n_members; i++) + { + HeapTuple proctup = &proclist->members[i]->tuple; + Form_pg_amproc procform = (Form_pg_amproc) GETSTRUCT(proctup); + + /* + * All hash functions should be registered with matching left/right + * types + */ + if (procform->amproclefttype != procform->amprocrighttype) + { + ereport(INFO, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("operator family \"%s\" of access method %s contains support function %s with different left and right input types", + opfamilyname, "hash", + format_procedure(procform->amproc)))); + result = false; + } + + /* Check procedure numbers and function signatures */ + switch (procform->amprocnum) + { + case HASHSTANDARD_PROC: + case HASHEXTENDED_PROC: + if (!check_hash_func_signature(procform->amproc, procform->amprocnum, + procform->amproclefttype)) + { + ereport(INFO, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("operator family \"%s\" of access method %s contains function %s with wrong signature for support number %d", + opfamilyname, "hash", + format_procedure(procform->amproc), + procform->amprocnum))); + result = false; + } + else + { + /* Remember which types we can hash */ + hashabletypes = + list_append_unique_oid(hashabletypes, + procform->amproclefttype); + } + break; + case HASHOPTIONS_PROC: + if (!check_amoptsproc_signature(procform->amproc)) + result = false; + break; + default: + ereport(INFO, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("operator family \"%s\" of access method %s contains function %s with invalid support number %d", + opfamilyname, "hash", + format_procedure(procform->amproc), + procform->amprocnum))); + result = false; + break; + } + } + + /* Check individual operators */ + for (i = 0; i < oprlist->n_members; i++) + { + HeapTuple oprtup = &oprlist->members[i]->tuple; + Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup); + + /* Check that only allowed strategy numbers exist */ + if (oprform->amopstrategy < 1 || + oprform->amopstrategy > HTMaxStrategyNumber) + { + ereport(INFO, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("operator family \"%s\" of access method %s contains operator %s with invalid strategy number %d", + opfamilyname, "hash", + format_operator(oprform->amopopr), + oprform->amopstrategy))); + result = false; + } + + /* hash doesn't support ORDER BY operators */ + if (oprform->amoppurpose != AMOP_SEARCH || + OidIsValid(oprform->amopsortfamily)) + { + ereport(INFO, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("operator family \"%s\" of access method %s contains invalid ORDER BY specification for operator %s", + opfamilyname, "hash", + format_operator(oprform->amopopr)))); + result = false; + } + + /* Check operator signature --- same for all hash strategies */ + if (!check_amop_signature(oprform->amopopr, BOOLOID, + oprform->amoplefttype, + oprform->amoprighttype)) + { + ereport(INFO, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("operator family \"%s\" of access method %s contains operator %s with wrong signature", + opfamilyname, "hash", + format_operator(oprform->amopopr)))); + result = false; + } + + /* There should be relevant hash functions for each datatype */ + if (!list_member_oid(hashabletypes, oprform->amoplefttype) || + !list_member_oid(hashabletypes, oprform->amoprighttype)) + { + ereport(INFO, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("operator family \"%s\" of access method %s lacks support function for operator %s", + opfamilyname, "hash", + format_operator(oprform->amopopr)))); + result = false; + } + } + + /* Now check for inconsistent groups of operators/functions */ + grouplist = identify_opfamily_groups(oprlist, proclist); + opclassgroup = NULL; + foreach(lc, grouplist) + { + OpFamilyOpFuncGroup *thisgroup = (OpFamilyOpFuncGroup *) lfirst(lc); + + /* Remember the group exactly matching the test opclass */ + if (thisgroup->lefttype == opcintype && + thisgroup->righttype == opcintype) + opclassgroup = thisgroup; + + /* + * Complain if there seems to be an incomplete set of operators for + * this datatype pair (implying that we have a hash function but no + * operator). + */ + if (thisgroup->operatorset != (1 << HTEqualStrategyNumber)) + { + ereport(INFO, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("operator family \"%s\" of access method %s is missing operator(s) for types %s and %s", + opfamilyname, "hash", + format_type_be(thisgroup->lefttype), + format_type_be(thisgroup->righttype)))); + result = false; + } + } + + /* Check that the originally-named opclass is supported */ + /* (if group is there, we already checked it adequately above) */ + if (!opclassgroup) + { + ereport(INFO, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("operator class \"%s\" of access method %s is missing operator(s)", + opclassname, "hash"))); + result = false; + } + + /* + * Complain if the opfamily doesn't have entries for all possible + * combinations of its supported datatypes. While missing cross-type + * operators are not fatal, it seems reasonable to insist that all + * built-in hash opfamilies be complete. + */ + if (list_length(grouplist) != + list_length(hashabletypes) * list_length(hashabletypes)) + { + ereport(INFO, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("operator family \"%s\" of access method %s is missing cross-type operator(s)", + opfamilyname, "hash"))); + result = false; + } + + ReleaseCatCacheList(proclist); + ReleaseCatCacheList(oprlist); + ReleaseSysCache(familytup); + ReleaseSysCache(classtup); + + return result; +} + + +/* + * We need a custom version of check_amproc_signature because of assorted + * hacks in the core hash opclass definitions. + */ +static bool +check_hash_func_signature(Oid funcid, int16 amprocnum, Oid argtype) +{ + bool result = true; + Oid restype; + int16 nargs; + HeapTuple tp; + Form_pg_proc procform; + + switch (amprocnum) + { + case HASHSTANDARD_PROC: + restype = INT4OID; + nargs = 1; + break; + + case HASHEXTENDED_PROC: + restype = INT8OID; + nargs = 2; + break; + + default: + elog(ERROR, "invalid amprocnum"); + } + + tp = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid)); + if (!HeapTupleIsValid(tp)) + elog(ERROR, "cache lookup failed for function %u", funcid); + procform = (Form_pg_proc) GETSTRUCT(tp); + + if (procform->prorettype != restype || procform->proretset || + procform->pronargs != nargs) + result = false; + + if (!IsBinaryCoercible(argtype, procform->proargtypes.values[0])) + { + /* + * Some of the built-in hash opclasses cheat by using hash functions + * that are different from but physically compatible with the opclass + * datatype. In some of these cases, even a "binary coercible" check + * fails because there's no relevant cast. For the moment, fix it by + * having a list of allowed cases. Test the specific function + * identity, not just its input type, because hashvarlena() takes + * INTERNAL and allowing any such function seems too scary. + */ + if ((funcid == F_HASHINT4 || funcid == F_HASHINT4EXTENDED) && + (argtype == DATEOID || + argtype == XIDOID || argtype == CIDOID)) + /* okay, allowed use of hashint4() */ ; + else if ((funcid == F_HASHINT8 || funcid == F_HASHINT8EXTENDED) && + (argtype == XID8OID)) + /* okay, allowed use of hashint8() */ ; + else if ((funcid == F_TIMESTAMP_HASH || + funcid == F_TIMESTAMP_HASH_EXTENDED) && + argtype == TIMESTAMPTZOID) + /* okay, allowed use of timestamp_hash() */ ; + else if ((funcid == F_HASHCHAR || funcid == F_HASHCHAREXTENDED) && + argtype == BOOLOID) + /* okay, allowed use of hashchar() */ ; + else if ((funcid == F_HASHVARLENA || funcid == F_HASHVARLENAEXTENDED) && + argtype == BYTEAOID) + /* okay, allowed use of hashvarlena() */ ; + else + result = false; + } + + /* If function takes a second argument, it must be for a 64-bit salt. */ + if (nargs == 2 && procform->proargtypes.values[1] != INT8OID) + result = false; + + ReleaseSysCache(tp); + return result; +} + +/* + * Prechecking function for adding operators/functions to a hash opfamily. + */ +void +hashadjustmembers(Oid opfamilyoid, + Oid opclassoid, + List *operators, + List *functions) +{ + Oid opcintype; + ListCell *lc; + + /* + * Hash operators and required support functions are always "loose" + * members of the opfamily if they are cross-type. If they are not + * cross-type, we prefer to tie them to the appropriate opclass ... but if + * the user hasn't created one, we can't do that, and must fall back to + * using the opfamily dependency. (We mustn't force creation of an + * opclass in such a case, as leaving an incomplete opclass laying about + * would be bad. Throwing an error is another undesirable alternative.) + * + * This behavior results in a bit of a dump/reload hazard, in that the + * order of restoring objects could affect what dependencies we end up + * with. pg_dump's existing behavior will preserve the dependency choices + * in most cases, but not if a cross-type operator has been bound tightly + * into an opclass. That's a mistake anyway, so silently "fixing" it + * isn't awful. + * + * Optional support functions are always "loose" family members. + * + * To avoid repeated lookups, we remember the most recently used opclass's + * input type. + */ + if (OidIsValid(opclassoid)) + { + /* During CREATE OPERATOR CLASS, need CCI to see the pg_opclass row */ + CommandCounterIncrement(); + opcintype = get_opclass_input_type(opclassoid); + } + else + opcintype = InvalidOid; + + /* + * We handle operators and support functions almost identically, so rather + * than duplicate this code block, just join the lists. + */ + foreach(lc, list_concat_copy(operators, functions)) + { + OpFamilyMember *op = (OpFamilyMember *) lfirst(lc); + + if (op->is_func && op->number != HASHSTANDARD_PROC) + { + /* Optional support proc, so always a soft family dependency */ + op->ref_is_hard = false; + op->ref_is_family = true; + op->refobjid = opfamilyoid; + } + else if (op->lefttype != op->righttype) + { + /* Cross-type, so always a soft family dependency */ + op->ref_is_hard = false; + op->ref_is_family = true; + op->refobjid = opfamilyoid; + } + else + { + /* Not cross-type; is there a suitable opclass? */ + if (op->lefttype != opcintype) + { + /* Avoid repeating this expensive lookup, even if it fails */ + opcintype = op->lefttype; + opclassoid = opclass_for_family_datatype(HASH_AM_OID, + opfamilyoid, + opcintype); + } + if (OidIsValid(opclassoid)) + { + /* Hard dependency on opclass */ + op->ref_is_hard = true; + op->ref_is_family = false; + op->refobjid = opclassoid; + } + else + { + /* We're stuck, so make a soft dependency on the opfamily */ + op->ref_is_hard = false; + op->ref_is_family = true; + op->refobjid = opfamilyoid; + } + } + } +} |