diff options
Diffstat (limited to 'src/backend/storage/freespace')
-rw-r--r-- | src/backend/storage/freespace/Makefile | 20 | ||||
-rw-r--r-- | src/backend/storage/freespace/README | 196 | ||||
-rw-r--r-- | src/backend/storage/freespace/freespace.c | 865 | ||||
-rw-r--r-- | src/backend/storage/freespace/fsmpage.c | 374 | ||||
-rw-r--r-- | src/backend/storage/freespace/indexfsm.c | 74 | ||||
-rw-r--r-- | src/backend/storage/freespace/meson.build | 7 |
6 files changed, 1536 insertions, 0 deletions
diff --git a/src/backend/storage/freespace/Makefile b/src/backend/storage/freespace/Makefile new file mode 100644 index 0000000..ac0fa8b --- /dev/null +++ b/src/backend/storage/freespace/Makefile @@ -0,0 +1,20 @@ +#------------------------------------------------------------------------- +# +# Makefile-- +# Makefile for storage/freespace +# +# IDENTIFICATION +# src/backend/storage/freespace/Makefile +# +#------------------------------------------------------------------------- + +subdir = src/backend/storage/freespace +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global + +OBJS = \ + freespace.o \ + fsmpage.o \ + indexfsm.o + +include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/storage/freespace/README b/src/backend/storage/freespace/README new file mode 100644 index 0000000..e7ff23b --- /dev/null +++ b/src/backend/storage/freespace/README @@ -0,0 +1,196 @@ +src/backend/storage/freespace/README + +Free Space Map +-------------- + +The purpose of the free space map is to quickly locate a page with enough +free space to hold a tuple to be stored; or to determine that no such page +exists and the relation must be extended by one page. As of PostgreSQL 8.4 +each relation has its own, extensible free space map stored in a separate +"fork" of its relation. This eliminates the disadvantages of the former +fixed-size FSM. + +It is important to keep the map small so that it can be searched rapidly. +Therefore, we don't attempt to record the exact free space on a page. +We allocate one map byte to each page, allowing us to record free space +at a granularity of 1/256th of a page. Another way to say it is that +the stored value is the free space divided by BLCKSZ/256 (rounding down). +We assume that the free space must always be less than BLCKSZ, since +all pages have some overhead; so the maximum map value is 255. + +To assist in fast searching, the map isn't simply an array of per-page +entries, but has a tree structure above those entries. There is a tree +structure of pages, and a tree structure within each page, as described +below. + +FSM page structure +------------------ + +Within each FSM page, we use a binary tree structure where leaf nodes store +the amount of free space on heap pages (or lower level FSM pages, see +"Higher-level structure" below), with one leaf node per heap page. A non-leaf +node stores the max amount of free space on any of its children. + +For example: + + 4 + 4 2 +3 4 0 2 <- This level represents heap pages + +We need two basic operations: search and update. + +To search for a page with X amount of free space, traverse down the tree +along a path where n >= X, until you hit the bottom. If both children of a +node satisfy the condition, you can pick either one arbitrarily. + +To update the amount of free space on a page to X, first update the leaf node +corresponding to the heap page, then "bubble up" the change to upper nodes, +by walking up to each parent and recomputing its value as the max of its +two children. Repeat until reaching the root or a parent whose value +doesn't change. + +This data structure has a couple of nice properties: +- to discover that there is no page with X bytes of free space, you only + need to look at the root node +- by varying which child to traverse to in the search algorithm, when you have + a choice, we can implement various strategies, like preferring pages closer + to a given page, or spreading the load across the table. + +Higher-level routines that use FSM pages access them through the fsm_set_avail() +and fsm_search_avail() functions. The interface to those functions hides the +page's internal tree structure, treating the FSM page as a black box that has +a certain number of "slots" for storing free space information. (However, +the higher routines have to be aware of the tree structure of the whole map.) + +The binary tree is stored on each FSM page as an array. Because the page +header takes some space on a page, the binary tree isn't perfect. That is, +a few right-most leaf nodes are missing, and there are some useless non-leaf +nodes at the right. So the tree looks something like this: + + 0 + 1 2 + 3 4 5 6 +7 8 9 A B + +where the numbers denote each node's position in the array. Note that the +tree is guaranteed complete above the leaf level; only some leaf nodes are +missing. This is reflected in the number of usable "slots" per page not +being an exact power of 2. + +A FSM page also has a next slot pointer, fp_next_slot, that determines where +to start the next search for free space within that page. The reason for that +is to spread out the pages that are returned by FSM searches. When several +backends are concurrently inserting into a relation, contention can be avoided +by having them insert into different pages. But it is also desirable to fill +up pages in sequential order, to get the benefit of OS prefetching and batched +writes. The FSM is responsible for making that happen, and the next slot +pointer helps provide the desired behavior. + +Higher-level structure +---------------------- + +To scale up the data structure described above beyond a single page, we +maintain a similar tree-structure across pages. Leaf nodes in higher level +pages correspond to lower level FSM pages. The root node within each page +has the same value as the corresponding leaf node on its parent page. + +The root page is always stored at physical block 0. + +For example, assuming each FSM page can hold information about 4 pages (in +reality, it holds (BLCKSZ - headers) / 2, or ~4000 with default BLCKSZ), +we get a disk layout like this: + + 0 <-- page 0 at level 2 (root page) + 0 <-- page 0 at level 1 + 0 <-- page 0 at level 0 + 1 <-- page 1 at level 0 + 2 <-- ... + 3 + 1 <-- page 1 at level 1 + 4 + 5 + 6 + 7 + 2 + 8 + 9 + 10 + 11 + 3 + 12 + 13 + 14 + 15 + +where the numbers are page numbers *at that level*, starting from 0. + +To find the physical block # corresponding to leaf page n, we need to +count the number of leaf and upper-level pages preceding page n. +This turns out to be + +y = n + (n / F + 1) + (n / F^2 + 1) + ... + 1 + +where F is the fanout (4 in the above example). The first term n is the number +of preceding leaf pages, the second term is the number of pages at level 1, +and so forth. + +To keep things simple, the tree is always constant height. To cover the +maximum relation size of 2^32-1 blocks, three levels is enough with the default +BLCKSZ (4000^3 > 2^32). + +Addressing +---------- + +The higher-level routines operate on "logical" addresses, consisting of +- level, +- logical page number, and +- slot (if applicable) + +Bottom level FSM pages have level of 0, the level above that 1, and root 2. +As in the diagram above, logical page number is the page number at that level, +starting from 0. + +Locking +------- + +When traversing down to search for free space, only one page is locked at a +time: the parent page is released before locking the child. If the child page +is concurrently modified, and there no longer is free space on the child page +when you land on it, you need to start from scratch (after correcting the +parent page, so that you don't get into an infinite loop). + +We use shared buffer locks when searching, but exclusive buffer lock when +updating a page. However, the next slot search pointer is updated during +searches even though we have only a shared lock. fp_next_slot is just a hint +and we can easily reset it if it gets corrupted; so it seems better to accept +some risk of that type than to pay the overhead of exclusive locking. + +Recovery +-------- + +The FSM is not explicitly WAL-logged. Instead, we rely on a bunch of +self-correcting measures to repair possible corruption. As a result when +we write to the FSM we treat that as a hint and thus use MarkBufferDirtyHint() +rather than MarkBufferDirty(). + +First of all, whenever a value is set on an FSM page, the root node of the +page is compared against the new value after bubbling up the change is +finished. It should be greater than or equal to the value just set, or we +have a corrupted page, with a parent somewhere with too small a value. +Secondly, if we detect corrupted pages while we search, traversing down +the tree. That check will notice if a parent node is set to too high a value. +In both cases, the upper nodes on the page are immediately rebuilt, fixing +the corruption so far as that page is concerned. + +VACUUM updates all the bottom-level FSM pages with the correct amount of free +space on corresponding heap pages, as it proceeds through the heap. This +goes through fsm_set_avail(), so that the upper nodes on those pages are +immediately updated. Periodically, VACUUM calls FreeSpaceMapVacuum[Range] +to propagate the new free-space info into the upper pages of the FSM tree. + +TODO +---- + +- fastroot to avoid traversing upper nodes with just 1 child +- use a different system for tables that fit into one FSM page, with a + mechanism to switch to the real thing as it grows. diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c new file mode 100644 index 0000000..fb9440f --- /dev/null +++ b/src/backend/storage/freespace/freespace.c @@ -0,0 +1,865 @@ +/*------------------------------------------------------------------------- + * + * freespace.c + * POSTGRES free space map for quickly finding free space in relations + * + * + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/storage/freespace/freespace.c + * + * + * NOTES: + * + * Free Space Map keeps track of the amount of free space on pages, and + * allows quickly searching for a page with enough free space. The FSM is + * stored in a dedicated relation fork of all heap relations, and those + * index access methods that need it (see also indexfsm.c). See README for + * more information. + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/htup_details.h" +#include "access/xloginsert.h" +#include "access/xlogutils.h" +#include "miscadmin.h" +#include "storage/freespace.h" +#include "storage/fsm_internals.h" +#include "storage/lmgr.h" +#include "storage/smgr.h" + + +/* + * We use just one byte to store the amount of free space on a page, so we + * divide the amount of free space a page can have into 256 different + * categories. The highest category, 255, represents a page with at least + * MaxFSMRequestSize bytes of free space, and the second highest category + * represents the range from 254 * FSM_CAT_STEP, inclusive, to + * MaxFSMRequestSize, exclusive. + * + * MaxFSMRequestSize depends on the architecture and BLCKSZ, but assuming + * default 8k BLCKSZ, and that MaxFSMRequestSize is 8164 bytes, the + * categories look like this: + * + * + * Range Category + * 0 - 31 0 + * 32 - 63 1 + * ... ... ... + * 8096 - 8127 253 + * 8128 - 8163 254 + * 8164 - 8192 255 + * + * The reason that MaxFSMRequestSize is special is that if MaxFSMRequestSize + * isn't equal to a range boundary, a page with exactly MaxFSMRequestSize + * bytes of free space wouldn't satisfy a request for MaxFSMRequestSize + * bytes. If there isn't more than MaxFSMRequestSize bytes of free space on a + * completely empty page, that would mean that we could never satisfy a + * request of exactly MaxFSMRequestSize bytes. + */ +#define FSM_CATEGORIES 256 +#define FSM_CAT_STEP (BLCKSZ / FSM_CATEGORIES) +#define MaxFSMRequestSize MaxHeapTupleSize + +/* + * Depth of the on-disk tree. We need to be able to address 2^32-1 blocks, + * and 1626 is the smallest number that satisfies X^3 >= 2^32-1. Likewise, + * 256 is the smallest number that satisfies X^4 >= 2^32-1. In practice, + * this means that 4096 bytes is the smallest BLCKSZ that we can get away + * with a 3-level tree, and 512 is the smallest we support. + */ +#define FSM_TREE_DEPTH ((SlotsPerFSMPage >= 1626) ? 3 : 4) + +#define FSM_ROOT_LEVEL (FSM_TREE_DEPTH - 1) +#define FSM_BOTTOM_LEVEL 0 + +/* + * The internal FSM routines work on a logical addressing scheme. Each + * level of the tree can be thought of as a separately addressable file. + */ +typedef struct +{ + int level; /* level */ + int logpageno; /* page number within the level */ +} FSMAddress; + +/* Address of the root page. */ +static const FSMAddress FSM_ROOT_ADDRESS = {FSM_ROOT_LEVEL, 0}; + +/* functions to navigate the tree */ +static FSMAddress fsm_get_child(FSMAddress parent, uint16 slot); +static FSMAddress fsm_get_parent(FSMAddress child, uint16 *slot); +static FSMAddress fsm_get_location(BlockNumber heapblk, uint16 *slot); +static BlockNumber fsm_get_heap_blk(FSMAddress addr, uint16 slot); +static BlockNumber fsm_logical_to_physical(FSMAddress addr); + +static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend); +static Buffer fsm_extend(Relation rel, BlockNumber fsm_nblocks); + +/* functions to convert amount of free space to a FSM category */ +static uint8 fsm_space_avail_to_cat(Size avail); +static uint8 fsm_space_needed_to_cat(Size needed); +static Size fsm_space_cat_to_avail(uint8 cat); + +/* workhorse functions for various operations */ +static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, + uint8 newValue, uint8 minValue); +static BlockNumber fsm_search(Relation rel, uint8 min_cat); +static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, + BlockNumber start, BlockNumber end, + bool *eof_p); + + +/******** Public API ********/ + +/* + * GetPageWithFreeSpace - try to find a page in the given relation with + * at least the specified amount of free space. + * + * If successful, return the block number; if not, return InvalidBlockNumber. + * + * The caller must be prepared for the possibility that the returned page + * will turn out to have too little space available by the time the caller + * gets a lock on it. In that case, the caller should report the actual + * amount of free space available on that page and then try again (see + * RecordAndGetPageWithFreeSpace). If InvalidBlockNumber is returned, + * extend the relation. + */ +BlockNumber +GetPageWithFreeSpace(Relation rel, Size spaceNeeded) +{ + uint8 min_cat = fsm_space_needed_to_cat(spaceNeeded); + + return fsm_search(rel, min_cat); +} + +/* + * RecordAndGetPageWithFreeSpace - update info about a page and try again. + * + * We provide this combo form to save some locking overhead, compared to + * separate RecordPageWithFreeSpace + GetPageWithFreeSpace calls. There's + * also some effort to return a page close to the old page; if there's a + * page with enough free space on the same FSM page where the old one page + * is located, it is preferred. + */ +BlockNumber +RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage, + Size oldSpaceAvail, Size spaceNeeded) +{ + int old_cat = fsm_space_avail_to_cat(oldSpaceAvail); + int search_cat = fsm_space_needed_to_cat(spaceNeeded); + FSMAddress addr; + uint16 slot; + int search_slot; + + /* Get the location of the FSM byte representing the heap block */ + addr = fsm_get_location(oldPage, &slot); + + search_slot = fsm_set_and_search(rel, addr, slot, old_cat, search_cat); + + /* + * If fsm_set_and_search found a suitable new block, return that. + * Otherwise, search as usual. + */ + if (search_slot != -1) + return fsm_get_heap_blk(addr, search_slot); + else + return fsm_search(rel, search_cat); +} + +/* + * RecordPageWithFreeSpace - update info about a page. + * + * Note that if the new spaceAvail value is higher than the old value stored + * in the FSM, the space might not become visible to searchers until the next + * FreeSpaceMapVacuum call, which updates the upper level pages. + */ +void +RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail) +{ + int new_cat = fsm_space_avail_to_cat(spaceAvail); + FSMAddress addr; + uint16 slot; + + /* Get the location of the FSM byte representing the heap block */ + addr = fsm_get_location(heapBlk, &slot); + + fsm_set_and_search(rel, addr, slot, new_cat, 0); +} + +/* + * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in + * WAL replay + */ +void +XLogRecordPageWithFreeSpace(RelFileLocator rlocator, BlockNumber heapBlk, + Size spaceAvail) +{ + int new_cat = fsm_space_avail_to_cat(spaceAvail); + FSMAddress addr; + uint16 slot; + BlockNumber blkno; + Buffer buf; + Page page; + + /* Get the location of the FSM byte representing the heap block */ + addr = fsm_get_location(heapBlk, &slot); + blkno = fsm_logical_to_physical(addr); + + /* If the page doesn't exist already, extend */ + buf = XLogReadBufferExtended(rlocator, FSM_FORKNUM, blkno, + RBM_ZERO_ON_ERROR, InvalidBuffer); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buf); + if (PageIsNew(page)) + PageInit(page, BLCKSZ, 0); + + if (fsm_set_avail(page, slot, new_cat)) + MarkBufferDirtyHint(buf, false); + UnlockReleaseBuffer(buf); +} + +/* + * GetRecordedFreeSpace - return the amount of free space on a particular page, + * according to the FSM. + */ +Size +GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk) +{ + FSMAddress addr; + uint16 slot; + Buffer buf; + uint8 cat; + + /* Get the location of the FSM byte representing the heap block */ + addr = fsm_get_location(heapBlk, &slot); + + buf = fsm_readbuf(rel, addr, false); + if (!BufferIsValid(buf)) + return 0; + cat = fsm_get_avail(BufferGetPage(buf), slot); + ReleaseBuffer(buf); + + return fsm_space_cat_to_avail(cat); +} + +/* + * FreeSpaceMapPrepareTruncateRel - prepare for truncation of a relation. + * + * nblocks is the new size of the heap. + * + * Return the number of blocks of new FSM. + * If it's InvalidBlockNumber, there is nothing to truncate; + * otherwise the caller is responsible for calling smgrtruncate() + * to truncate the FSM pages, and FreeSpaceMapVacuumRange() + * to update upper-level pages in the FSM. + */ +BlockNumber +FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks) +{ + BlockNumber new_nfsmblocks; + FSMAddress first_removed_address; + uint16 first_removed_slot; + Buffer buf; + + /* + * If no FSM has been created yet for this relation, there's nothing to + * truncate. + */ + if (!smgrexists(RelationGetSmgr(rel), FSM_FORKNUM)) + return InvalidBlockNumber; + + /* Get the location in the FSM of the first removed heap block */ + first_removed_address = fsm_get_location(nblocks, &first_removed_slot); + + /* + * Zero out the tail of the last remaining FSM page. If the slot + * representing the first removed heap block is at a page boundary, as the + * first slot on the FSM page that first_removed_address points to, we can + * just truncate that page altogether. + */ + if (first_removed_slot > 0) + { + buf = fsm_readbuf(rel, first_removed_address, false); + if (!BufferIsValid(buf)) + return InvalidBlockNumber; /* nothing to do; the FSM was already + * smaller */ + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + /* NO EREPORT(ERROR) from here till changes are logged */ + START_CRIT_SECTION(); + + fsm_truncate_avail(BufferGetPage(buf), first_removed_slot); + + /* + * Truncation of a relation is WAL-logged at a higher-level, and we + * will be called at WAL replay. But if checksums are enabled, we need + * to still write a WAL record to protect against a torn page, if the + * page is flushed to disk before the truncation WAL record. We cannot + * use MarkBufferDirtyHint here, because that will not dirty the page + * during recovery. + */ + MarkBufferDirty(buf); + if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded()) + log_newpage_buffer(buf, false); + + END_CRIT_SECTION(); + + UnlockReleaseBuffer(buf); + + new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1; + } + else + { + new_nfsmblocks = fsm_logical_to_physical(first_removed_address); + if (smgrnblocks(RelationGetSmgr(rel), FSM_FORKNUM) <= new_nfsmblocks) + return InvalidBlockNumber; /* nothing to do; the FSM was already + * smaller */ + } + + return new_nfsmblocks; +} + +/* + * FreeSpaceMapVacuum - update upper-level pages in the rel's FSM + * + * We assume that the bottom-level pages have already been updated with + * new free-space information. + */ +void +FreeSpaceMapVacuum(Relation rel) +{ + bool dummy; + + /* Recursively scan the tree, starting at the root */ + (void) fsm_vacuum_page(rel, FSM_ROOT_ADDRESS, + (BlockNumber) 0, InvalidBlockNumber, + &dummy); +} + +/* + * FreeSpaceMapVacuumRange - update upper-level pages in the rel's FSM + * + * As above, but assume that only heap pages between start and end-1 inclusive + * have new free-space information, so update only the upper-level slots + * covering that block range. end == InvalidBlockNumber is equivalent to + * "all the rest of the relation". + */ +void +FreeSpaceMapVacuumRange(Relation rel, BlockNumber start, BlockNumber end) +{ + bool dummy; + + /* Recursively scan the tree, starting at the root */ + if (end > start) + (void) fsm_vacuum_page(rel, FSM_ROOT_ADDRESS, start, end, &dummy); +} + +/******** Internal routines ********/ + +/* + * Return category corresponding x bytes of free space + */ +static uint8 +fsm_space_avail_to_cat(Size avail) +{ + int cat; + + Assert(avail < BLCKSZ); + + if (avail >= MaxFSMRequestSize) + return 255; + + cat = avail / FSM_CAT_STEP; + + /* + * The highest category, 255, is reserved for MaxFSMRequestSize bytes or + * more. + */ + if (cat > 254) + cat = 254; + + return (uint8) cat; +} + +/* + * Return the lower bound of the range of free space represented by given + * category. + */ +static Size +fsm_space_cat_to_avail(uint8 cat) +{ + /* The highest category represents exactly MaxFSMRequestSize bytes. */ + if (cat == 255) + return MaxFSMRequestSize; + else + return cat * FSM_CAT_STEP; +} + +/* + * Which category does a page need to have, to accommodate x bytes of data? + * While fsm_space_avail_to_cat() rounds down, this needs to round up. + */ +static uint8 +fsm_space_needed_to_cat(Size needed) +{ + int cat; + + /* Can't ask for more space than the highest category represents */ + if (needed > MaxFSMRequestSize) + elog(ERROR, "invalid FSM request size %zu", needed); + + if (needed == 0) + return 1; + + cat = (needed + FSM_CAT_STEP - 1) / FSM_CAT_STEP; + + if (cat > 255) + cat = 255; + + return (uint8) cat; +} + +/* + * Returns the physical block number of a FSM page + */ +static BlockNumber +fsm_logical_to_physical(FSMAddress addr) +{ + BlockNumber pages; + int leafno; + int l; + + /* + * Calculate the logical page number of the first leaf page below the + * given page. + */ + leafno = addr.logpageno; + for (l = 0; l < addr.level; l++) + leafno *= SlotsPerFSMPage; + + /* Count upper level nodes required to address the leaf page */ + pages = 0; + for (l = 0; l < FSM_TREE_DEPTH; l++) + { + pages += leafno + 1; + leafno /= SlotsPerFSMPage; + } + + /* + * If the page we were asked for wasn't at the bottom level, subtract the + * additional lower level pages we counted above. + */ + pages -= addr.level; + + /* Turn the page count into 0-based block number */ + return pages - 1; +} + +/* + * Return the FSM location corresponding to given heap block. + */ +static FSMAddress +fsm_get_location(BlockNumber heapblk, uint16 *slot) +{ + FSMAddress addr; + + addr.level = FSM_BOTTOM_LEVEL; + addr.logpageno = heapblk / SlotsPerFSMPage; + *slot = heapblk % SlotsPerFSMPage; + + return addr; +} + +/* + * Return the heap block number corresponding to given location in the FSM. + */ +static BlockNumber +fsm_get_heap_blk(FSMAddress addr, uint16 slot) +{ + Assert(addr.level == FSM_BOTTOM_LEVEL); + return ((unsigned int) addr.logpageno) * SlotsPerFSMPage + slot; +} + +/* + * Given a logical address of a child page, get the logical page number of + * the parent, and the slot within the parent corresponding to the child. + */ +static FSMAddress +fsm_get_parent(FSMAddress child, uint16 *slot) +{ + FSMAddress parent; + + Assert(child.level < FSM_ROOT_LEVEL); + + parent.level = child.level + 1; + parent.logpageno = child.logpageno / SlotsPerFSMPage; + *slot = child.logpageno % SlotsPerFSMPage; + + return parent; +} + +/* + * Given a logical address of a parent page and a slot number, get the + * logical address of the corresponding child page. + */ +static FSMAddress +fsm_get_child(FSMAddress parent, uint16 slot) +{ + FSMAddress child; + + Assert(parent.level > FSM_BOTTOM_LEVEL); + + child.level = parent.level - 1; + child.logpageno = parent.logpageno * SlotsPerFSMPage + slot; + + return child; +} + +/* + * Read a FSM page. + * + * If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is + * true, the FSM file is extended. + */ +static Buffer +fsm_readbuf(Relation rel, FSMAddress addr, bool extend) +{ + BlockNumber blkno = fsm_logical_to_physical(addr); + Buffer buf; + SMgrRelation reln; + + /* + * Caution: re-using this smgr pointer could fail if the relcache entry + * gets closed. It's safe as long as we only do smgr-level operations + * between here and the last use of the pointer. + */ + reln = RelationGetSmgr(rel); + + /* + * If we haven't cached the size of the FSM yet, check it first. Also + * recheck if the requested block seems to be past end, since our cached + * value might be stale. (We send smgr inval messages on truncation, but + * not on extension.) + */ + if (reln->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber || + blkno >= reln->smgr_cached_nblocks[FSM_FORKNUM]) + { + /* Invalidate the cache so smgrnblocks asks the kernel. */ + reln->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber; + if (smgrexists(reln, FSM_FORKNUM)) + smgrnblocks(reln, FSM_FORKNUM); + else + reln->smgr_cached_nblocks[FSM_FORKNUM] = 0; + } + + /* + * For reading we use ZERO_ON_ERROR mode, and initialize the page if + * necessary. The FSM information is not accurate anyway, so it's better + * to clear corrupt pages than error out. Since the FSM changes are not + * WAL-logged, the so-called torn page problem on crash can lead to pages + * with corrupt headers, for example. + * + * We use the same path below to initialize pages when extending the + * relation, as a concurrent extension can end up with vm_extend() + * returning an already-initialized page. + */ + if (blkno >= reln->smgr_cached_nblocks[FSM_FORKNUM]) + { + if (extend) + buf = fsm_extend(rel, blkno + 1); + else + return InvalidBuffer; + } + else + buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR, NULL); + + /* + * Initializing the page when needed is trickier than it looks, because of + * the possibility of multiple backends doing this concurrently, and our + * desire to not uselessly take the buffer lock in the normal path where + * the page is OK. We must take the lock to initialize the page, so + * recheck page newness after we have the lock, in case someone else + * already did it. Also, because we initially check PageIsNew with no + * lock, it's possible to fall through and return the buffer while someone + * else is still initializing the page (i.e., we might see pd_upper as set + * but other page header fields are still zeroes). This is harmless for + * callers that will take a buffer lock themselves, but some callers + * inspect the page without any lock at all. The latter is OK only so + * long as it doesn't depend on the page header having correct contents. + * Current usage is safe because PageGetContents() does not require that. + */ + if (PageIsNew(BufferGetPage(buf))) + { + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + if (PageIsNew(BufferGetPage(buf))) + PageInit(BufferGetPage(buf), BLCKSZ, 0); + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + } + return buf; +} + +/* + * Ensure that the FSM fork is at least fsm_nblocks long, extending + * it if necessary with empty pages. And by empty, I mean pages filled + * with zeros, meaning there's no free space. + */ +static Buffer +fsm_extend(Relation rel, BlockNumber fsm_nblocks) +{ + return ExtendBufferedRelTo(BMR_REL(rel), FSM_FORKNUM, NULL, + EB_CREATE_FORK_IF_NEEDED | + EB_CLEAR_SIZE_CACHE, + fsm_nblocks, + RBM_ZERO_ON_ERROR); +} + +/* + * Set value in given FSM page and slot. + * + * If minValue > 0, the updated page is also searched for a page with at + * least minValue of free space. If one is found, its slot number is + * returned, -1 otherwise. + */ +static int +fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, + uint8 newValue, uint8 minValue) +{ + Buffer buf; + Page page; + int newslot = -1; + + buf = fsm_readbuf(rel, addr, true); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buf); + + if (fsm_set_avail(page, slot, newValue)) + MarkBufferDirtyHint(buf, false); + + if (minValue != 0) + { + /* Search while we still hold the lock */ + newslot = fsm_search_avail(buf, minValue, + addr.level == FSM_BOTTOM_LEVEL, + true); + } + + UnlockReleaseBuffer(buf); + + return newslot; +} + +/* + * Search the tree for a heap page with at least min_cat of free space + */ +static BlockNumber +fsm_search(Relation rel, uint8 min_cat) +{ + int restarts = 0; + FSMAddress addr = FSM_ROOT_ADDRESS; + + for (;;) + { + int slot; + Buffer buf; + uint8 max_avail = 0; + + /* Read the FSM page. */ + buf = fsm_readbuf(rel, addr, false); + + /* Search within the page */ + if (BufferIsValid(buf)) + { + LockBuffer(buf, BUFFER_LOCK_SHARE); + slot = fsm_search_avail(buf, min_cat, + (addr.level == FSM_BOTTOM_LEVEL), + false); + if (slot == -1) + max_avail = fsm_get_max_avail(BufferGetPage(buf)); + UnlockReleaseBuffer(buf); + } + else + slot = -1; + + if (slot != -1) + { + /* + * Descend the tree, or return the found block if we're at the + * bottom. + */ + if (addr.level == FSM_BOTTOM_LEVEL) + return fsm_get_heap_blk(addr, slot); + + addr = fsm_get_child(addr, slot); + } + else if (addr.level == FSM_ROOT_LEVEL) + { + /* + * At the root, failure means there's no page with enough free + * space in the FSM. Give up. + */ + return InvalidBlockNumber; + } + else + { + uint16 parentslot; + FSMAddress parent; + + /* + * At lower level, failure can happen if the value in the upper- + * level node didn't reflect the value on the lower page. Update + * the upper node, to avoid falling into the same trap again, and + * start over. + * + * There's a race condition here, if another backend updates this + * page right after we release it, and gets the lock on the parent + * page before us. We'll then update the parent page with the now + * stale information we had. It's OK, because it should happen + * rarely, and will be fixed by the next vacuum. + */ + parent = fsm_get_parent(addr, &parentslot); + fsm_set_and_search(rel, parent, parentslot, max_avail, 0); + + /* + * If the upper pages are badly out of date, we might need to loop + * quite a few times, updating them as we go. Any inconsistencies + * should eventually be corrected and the loop should end. Looping + * indefinitely is nevertheless scary, so provide an emergency + * valve. + */ + if (restarts++ > 10000) + return InvalidBlockNumber; + + /* Start search all over from the root */ + addr = FSM_ROOT_ADDRESS; + } + } +} + + +/* + * Recursive guts of FreeSpaceMapVacuum + * + * Examine the FSM page indicated by addr, as well as its children, updating + * upper-level nodes that cover the heap block range from start to end-1. + * (It's okay if end is beyond the actual end of the map.) + * Return the maximum freespace value on this page. + * + * If addr is past the end of the FSM, set *eof_p to true and return 0. + * + * This traverses the tree in depth-first order. The tree is stored + * physically in depth-first order, so this should be pretty I/O efficient. + */ +static uint8 +fsm_vacuum_page(Relation rel, FSMAddress addr, + BlockNumber start, BlockNumber end, + bool *eof_p) +{ + Buffer buf; + Page page; + uint8 max_avail; + + /* Read the page if it exists, or return EOF */ + buf = fsm_readbuf(rel, addr, false); + if (!BufferIsValid(buf)) + { + *eof_p = true; + return 0; + } + else + *eof_p = false; + + page = BufferGetPage(buf); + + /* + * If we're above the bottom level, recurse into children, and fix the + * information stored about them at this level. + */ + if (addr.level > FSM_BOTTOM_LEVEL) + { + FSMAddress fsm_start, + fsm_end; + uint16 fsm_start_slot, + fsm_end_slot; + int slot, + start_slot, + end_slot; + bool eof = false; + + /* + * Compute the range of slots we need to update on this page, given + * the requested range of heap blocks to consider. The first slot to + * update is the one covering the "start" block, and the last slot is + * the one covering "end - 1". (Some of this work will be duplicated + * in each recursive call, but it's cheap enough to not worry about.) + */ + fsm_start = fsm_get_location(start, &fsm_start_slot); + fsm_end = fsm_get_location(end - 1, &fsm_end_slot); + + while (fsm_start.level < addr.level) + { + fsm_start = fsm_get_parent(fsm_start, &fsm_start_slot); + fsm_end = fsm_get_parent(fsm_end, &fsm_end_slot); + } + Assert(fsm_start.level == addr.level); + + if (fsm_start.logpageno == addr.logpageno) + start_slot = fsm_start_slot; + else if (fsm_start.logpageno > addr.logpageno) + start_slot = SlotsPerFSMPage; /* shouldn't get here... */ + else + start_slot = 0; + + if (fsm_end.logpageno == addr.logpageno) + end_slot = fsm_end_slot; + else if (fsm_end.logpageno > addr.logpageno) + end_slot = SlotsPerFSMPage - 1; + else + end_slot = -1; /* shouldn't get here... */ + + for (slot = start_slot; slot <= end_slot; slot++) + { + int child_avail; + + CHECK_FOR_INTERRUPTS(); + + /* After we hit end-of-file, just clear the rest of the slots */ + if (!eof) + child_avail = fsm_vacuum_page(rel, fsm_get_child(addr, slot), + start, end, + &eof); + else + child_avail = 0; + + /* Update information about the child */ + if (fsm_get_avail(page, slot) != child_avail) + { + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + fsm_set_avail(page, slot, child_avail); + MarkBufferDirtyHint(buf, false); + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + } + } + } + + /* Now get the maximum value on the page, to return to caller */ + max_avail = fsm_get_max_avail(page); + + /* + * Reset the next slot pointer. This encourages the use of low-numbered + * pages, increasing the chances that a later vacuum can truncate the + * relation. We don't bother with a lock here, nor with marking the page + * dirty if it wasn't already, since this is just a hint. + */ + ((FSMPage) PageGetContents(page))->fp_next_slot = 0; + + ReleaseBuffer(buf); + + return max_avail; +} diff --git a/src/backend/storage/freespace/fsmpage.c b/src/backend/storage/freespace/fsmpage.c new file mode 100644 index 0000000..0cfb2ae --- /dev/null +++ b/src/backend/storage/freespace/fsmpage.c @@ -0,0 +1,374 @@ +/*------------------------------------------------------------------------- + * + * fsmpage.c + * routines to search and manipulate one FSM page. + * + * + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/storage/freespace/fsmpage.c + * + * NOTES: + * + * The public functions in this file form an API that hides the internal + * structure of a FSM page. This allows freespace.c to treat each FSM page + * as a black box with SlotsPerPage "slots". fsm_set_avail() and + * fsm_get_avail() let you get/set the value of a slot, and + * fsm_search_avail() lets you search for a slot with value >= X. + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "storage/bufmgr.h" +#include "storage/fsm_internals.h" + +/* Macros to navigate the tree within a page. Root has index zero. */ +#define leftchild(x) (2 * (x) + 1) +#define rightchild(x) (2 * (x) + 2) +#define parentof(x) (((x) - 1) / 2) + +/* + * Find right neighbor of x, wrapping around within the level + */ +static int +rightneighbor(int x) +{ + /* + * Move right. This might wrap around, stepping to the leftmost node at + * the next level. + */ + x++; + + /* + * Check if we stepped to the leftmost node at next level, and correct if + * so. The leftmost nodes at each level are numbered x = 2^level - 1, so + * check if (x + 1) is a power of two, using a standard + * twos-complement-arithmetic trick. + */ + if (((x + 1) & x) == 0) + x = parentof(x); + + return x; +} + +/* + * Sets the value of a slot on page. Returns true if the page was modified. + * + * The caller must hold an exclusive lock on the page. + */ +bool +fsm_set_avail(Page page, int slot, uint8 value) +{ + int nodeno = NonLeafNodesPerPage + slot; + FSMPage fsmpage = (FSMPage) PageGetContents(page); + uint8 oldvalue; + + Assert(slot < LeafNodesPerPage); + + oldvalue = fsmpage->fp_nodes[nodeno]; + + /* If the value hasn't changed, we don't need to do anything */ + if (oldvalue == value && value <= fsmpage->fp_nodes[0]) + return false; + + fsmpage->fp_nodes[nodeno] = value; + + /* + * Propagate up, until we hit the root or a node that doesn't need to be + * updated. + */ + do + { + uint8 newvalue = 0; + int lchild; + int rchild; + + nodeno = parentof(nodeno); + lchild = leftchild(nodeno); + rchild = lchild + 1; + + newvalue = fsmpage->fp_nodes[lchild]; + if (rchild < NodesPerPage) + newvalue = Max(newvalue, + fsmpage->fp_nodes[rchild]); + + oldvalue = fsmpage->fp_nodes[nodeno]; + if (oldvalue == newvalue) + break; + + fsmpage->fp_nodes[nodeno] = newvalue; + } while (nodeno > 0); + + /* + * sanity check: if the new value is (still) higher than the value at the + * top, the tree is corrupt. If so, rebuild. + */ + if (value > fsmpage->fp_nodes[0]) + fsm_rebuild_page(page); + + return true; +} + +/* + * Returns the value of given slot on page. + * + * Since this is just a read-only access of a single byte, the page doesn't + * need to be locked. + */ +uint8 +fsm_get_avail(Page page, int slot) +{ + FSMPage fsmpage = (FSMPage) PageGetContents(page); + + Assert(slot < LeafNodesPerPage); + + return fsmpage->fp_nodes[NonLeafNodesPerPage + slot]; +} + +/* + * Returns the value at the root of a page. + * + * Since this is just a read-only access of a single byte, the page doesn't + * need to be locked. + */ +uint8 +fsm_get_max_avail(Page page) +{ + FSMPage fsmpage = (FSMPage) PageGetContents(page); + + return fsmpage->fp_nodes[0]; +} + +/* + * Searches for a slot with category at least minvalue. + * Returns slot number, or -1 if none found. + * + * The caller must hold at least a shared lock on the page, and this + * function can unlock and lock the page again in exclusive mode if it + * needs to be updated. exclusive_lock_held should be set to true if the + * caller is already holding an exclusive lock, to avoid extra work. + * + * If advancenext is false, fp_next_slot is set to point to the returned + * slot, and if it's true, to the slot after the returned slot. + */ +int +fsm_search_avail(Buffer buf, uint8 minvalue, bool advancenext, + bool exclusive_lock_held) +{ + Page page = BufferGetPage(buf); + FSMPage fsmpage = (FSMPage) PageGetContents(page); + int nodeno; + int target; + uint16 slot; + +restart: + + /* + * Check the root first, and exit quickly if there's no leaf with enough + * free space + */ + if (fsmpage->fp_nodes[0] < minvalue) + return -1; + + /* + * Start search using fp_next_slot. It's just a hint, so check that it's + * sane. (This also handles wrapping around when the prior call returned + * the last slot on the page.) + */ + target = fsmpage->fp_next_slot; + if (target < 0 || target >= LeafNodesPerPage) + target = 0; + target += NonLeafNodesPerPage; + + /*---------- + * Start the search from the target slot. At every step, move one + * node to the right, then climb up to the parent. Stop when we reach + * a node with enough free space (as we must, since the root has enough + * space). + * + * The idea is to gradually expand our "search triangle", that is, all + * nodes covered by the current node, and to be sure we search to the + * right from the start point. At the first step, only the target slot + * is examined. When we move up from a left child to its parent, we are + * adding the right-hand subtree of that parent to the search triangle. + * When we move right then up from a right child, we are dropping the + * current search triangle (which we know doesn't contain any suitable + * page) and instead looking at the next-larger-size triangle to its + * right. So we never look left from our original start point, and at + * each step the size of the search triangle doubles, ensuring it takes + * only log2(N) work to search N pages. + * + * The "move right" operation will wrap around if it hits the right edge + * of the tree, so the behavior is still good if we start near the right. + * Note also that the move-and-climb behavior ensures that we can't end + * up on one of the missing nodes at the right of the leaf level. + * + * For example, consider this tree: + * + * 7 + * 7 6 + * 5 7 6 5 + * 4 5 5 7 2 6 5 2 + * T + * + * Assume that the target node is the node indicated by the letter T, + * and we're searching for a node with value of 6 or higher. The search + * begins at T. At the first iteration, we move to the right, then to the + * parent, arriving at the rightmost 5. At the second iteration, we move + * to the right, wrapping around, then climb up, arriving at the 7 on the + * third level. 7 satisfies our search, so we descend down to the bottom, + * following the path of sevens. This is in fact the first suitable page + * to the right of (allowing for wraparound) our start point. + *---------- + */ + nodeno = target; + while (nodeno > 0) + { + if (fsmpage->fp_nodes[nodeno] >= minvalue) + break; + + /* + * Move to the right, wrapping around on same level if necessary, then + * climb up. + */ + nodeno = parentof(rightneighbor(nodeno)); + } + + /* + * We're now at a node with enough free space, somewhere in the middle of + * the tree. Descend to the bottom, following a path with enough free + * space, preferring to move left if there's a choice. + */ + while (nodeno < NonLeafNodesPerPage) + { + int childnodeno = leftchild(nodeno); + + if (childnodeno < NodesPerPage && + fsmpage->fp_nodes[childnodeno] >= minvalue) + { + nodeno = childnodeno; + continue; + } + childnodeno++; /* point to right child */ + if (childnodeno < NodesPerPage && + fsmpage->fp_nodes[childnodeno] >= minvalue) + { + nodeno = childnodeno; + } + else + { + /* + * Oops. The parent node promised that either left or right child + * has enough space, but neither actually did. This can happen in + * case of a "torn page", IOW if we crashed earlier while writing + * the page to disk, and only part of the page made it to disk. + * + * Fix the corruption and restart. + */ + RelFileLocator rlocator; + ForkNumber forknum; + BlockNumber blknum; + + BufferGetTag(buf, &rlocator, &forknum, &blknum); + elog(DEBUG1, "fixing corrupt FSM block %u, relation %u/%u/%u", + blknum, rlocator.spcOid, rlocator.dbOid, rlocator.relNumber); + + /* make sure we hold an exclusive lock */ + if (!exclusive_lock_held) + { + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + exclusive_lock_held = true; + } + fsm_rebuild_page(page); + MarkBufferDirtyHint(buf, false); + goto restart; + } + } + + /* We're now at the bottom level, at a node with enough space. */ + slot = nodeno - NonLeafNodesPerPage; + + /* + * Update the next-target pointer. Note that we do this even if we're only + * holding a shared lock, on the grounds that it's better to use a shared + * lock and get a garbled next pointer every now and then, than take the + * concurrency hit of an exclusive lock. + * + * Wrap-around is handled at the beginning of this function. + */ + fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0); + + return slot; +} + +/* + * Sets the available space to zero for all slots numbered >= nslots. + * Returns true if the page was modified. + */ +bool +fsm_truncate_avail(Page page, int nslots) +{ + FSMPage fsmpage = (FSMPage) PageGetContents(page); + uint8 *ptr; + bool changed = false; + + Assert(nslots >= 0 && nslots < LeafNodesPerPage); + + /* Clear all truncated leaf nodes */ + ptr = &fsmpage->fp_nodes[NonLeafNodesPerPage + nslots]; + for (; ptr < &fsmpage->fp_nodes[NodesPerPage]; ptr++) + { + if (*ptr != 0) + changed = true; + *ptr = 0; + } + + /* Fix upper nodes. */ + if (changed) + fsm_rebuild_page(page); + + return changed; +} + +/* + * Reconstructs the upper levels of a page. Returns true if the page + * was modified. + */ +bool +fsm_rebuild_page(Page page) +{ + FSMPage fsmpage = (FSMPage) PageGetContents(page); + bool changed = false; + int nodeno; + + /* + * Start from the lowest non-leaf level, at last node, working our way + * backwards, through all non-leaf nodes at all levels, up to the root. + */ + for (nodeno = NonLeafNodesPerPage - 1; nodeno >= 0; nodeno--) + { + int lchild = leftchild(nodeno); + int rchild = lchild + 1; + uint8 newvalue = 0; + + /* The first few nodes we examine might have zero or one child. */ + if (lchild < NodesPerPage) + newvalue = fsmpage->fp_nodes[lchild]; + + if (rchild < NodesPerPage) + newvalue = Max(newvalue, + fsmpage->fp_nodes[rchild]); + + if (fsmpage->fp_nodes[nodeno] != newvalue) + { + fsmpage->fp_nodes[nodeno] = newvalue; + changed = true; + } + } + + return changed; +} diff --git a/src/backend/storage/freespace/indexfsm.c b/src/backend/storage/freespace/indexfsm.c new file mode 100644 index 0000000..fff8f4f --- /dev/null +++ b/src/backend/storage/freespace/indexfsm.c @@ -0,0 +1,74 @@ +/*------------------------------------------------------------------------- + * + * indexfsm.c + * POSTGRES free space map for quickly finding free pages in relations + * + * + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/storage/freespace/indexfsm.c + * + * + * NOTES: + * + * This is similar to the FSM used for heap, in freespace.c, but instead + * of tracking the amount of free space on pages, we only track whether + * pages are completely free or in-use. We use the same FSM implementation + * as for heaps, using BLCKSZ - 1 to denote used pages, and 0 for unused. + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "storage/freespace.h" +#include "storage/indexfsm.h" + +/* + * Exported routines + */ + +/* + * GetFreeIndexPage - return a free page from the FSM + * + * As a side effect, the page is marked as used in the FSM. + */ +BlockNumber +GetFreeIndexPage(Relation rel) +{ + BlockNumber blkno = GetPageWithFreeSpace(rel, BLCKSZ / 2); + + if (blkno != InvalidBlockNumber) + RecordUsedIndexPage(rel, blkno); + + return blkno; +} + +/* + * RecordFreeIndexPage - mark a page as free in the FSM + */ +void +RecordFreeIndexPage(Relation rel, BlockNumber freeBlock) +{ + RecordPageWithFreeSpace(rel, freeBlock, BLCKSZ - 1); +} + + +/* + * RecordUsedIndexPage - mark a page as used in the FSM + */ +void +RecordUsedIndexPage(Relation rel, BlockNumber usedBlock) +{ + RecordPageWithFreeSpace(rel, usedBlock, 0); +} + +/* + * IndexFreeSpaceMapVacuum - scan and fix any inconsistencies in the FSM + */ +void +IndexFreeSpaceMapVacuum(Relation rel) +{ + FreeSpaceMapVacuum(rel); +} diff --git a/src/backend/storage/freespace/meson.build b/src/backend/storage/freespace/meson.build new file mode 100644 index 0000000..4dd8602 --- /dev/null +++ b/src/backend/storage/freespace/meson.build @@ -0,0 +1,7 @@ +# Copyright (c) 2022-2023, PostgreSQL Global Development Group + +backend_sources += files( + 'freespace.c', + 'fsmpage.c', + 'indexfsm.c', +) |