summaryrefslogtreecommitdiffstats
path: root/src/backend/storage/freespace
diff options
context:
space:
mode:
Diffstat (limited to 'src/backend/storage/freespace')
-rw-r--r--src/backend/storage/freespace/Makefile20
-rw-r--r--src/backend/storage/freespace/README196
-rw-r--r--src/backend/storage/freespace/freespace.c893
-rw-r--r--src/backend/storage/freespace/fsmpage.c374
-rw-r--r--src/backend/storage/freespace/indexfsm.c74
5 files changed, 1557 insertions, 0 deletions
diff --git a/src/backend/storage/freespace/Makefile b/src/backend/storage/freespace/Makefile
new file mode 100644
index 0000000..ac0fa8b
--- /dev/null
+++ b/src/backend/storage/freespace/Makefile
@@ -0,0 +1,20 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for storage/freespace
+#
+# IDENTIFICATION
+# src/backend/storage/freespace/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/freespace
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+ freespace.o \
+ fsmpage.o \
+ indexfsm.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/freespace/README b/src/backend/storage/freespace/README
new file mode 100644
index 0000000..e7ff23b
--- /dev/null
+++ b/src/backend/storage/freespace/README
@@ -0,0 +1,196 @@
+src/backend/storage/freespace/README
+
+Free Space Map
+--------------
+
+The purpose of the free space map is to quickly locate a page with enough
+free space to hold a tuple to be stored; or to determine that no such page
+exists and the relation must be extended by one page. As of PostgreSQL 8.4
+each relation has its own, extensible free space map stored in a separate
+"fork" of its relation. This eliminates the disadvantages of the former
+fixed-size FSM.
+
+It is important to keep the map small so that it can be searched rapidly.
+Therefore, we don't attempt to record the exact free space on a page.
+We allocate one map byte to each page, allowing us to record free space
+at a granularity of 1/256th of a page. Another way to say it is that
+the stored value is the free space divided by BLCKSZ/256 (rounding down).
+We assume that the free space must always be less than BLCKSZ, since
+all pages have some overhead; so the maximum map value is 255.
+
+To assist in fast searching, the map isn't simply an array of per-page
+entries, but has a tree structure above those entries. There is a tree
+structure of pages, and a tree structure within each page, as described
+below.
+
+FSM page structure
+------------------
+
+Within each FSM page, we use a binary tree structure where leaf nodes store
+the amount of free space on heap pages (or lower level FSM pages, see
+"Higher-level structure" below), with one leaf node per heap page. A non-leaf
+node stores the max amount of free space on any of its children.
+
+For example:
+
+ 4
+ 4 2
+3 4 0 2 <- This level represents heap pages
+
+We need two basic operations: search and update.
+
+To search for a page with X amount of free space, traverse down the tree
+along a path where n >= X, until you hit the bottom. If both children of a
+node satisfy the condition, you can pick either one arbitrarily.
+
+To update the amount of free space on a page to X, first update the leaf node
+corresponding to the heap page, then "bubble up" the change to upper nodes,
+by walking up to each parent and recomputing its value as the max of its
+two children. Repeat until reaching the root or a parent whose value
+doesn't change.
+
+This data structure has a couple of nice properties:
+- to discover that there is no page with X bytes of free space, you only
+ need to look at the root node
+- by varying which child to traverse to in the search algorithm, when you have
+ a choice, we can implement various strategies, like preferring pages closer
+ to a given page, or spreading the load across the table.
+
+Higher-level routines that use FSM pages access them through the fsm_set_avail()
+and fsm_search_avail() functions. The interface to those functions hides the
+page's internal tree structure, treating the FSM page as a black box that has
+a certain number of "slots" for storing free space information. (However,
+the higher routines have to be aware of the tree structure of the whole map.)
+
+The binary tree is stored on each FSM page as an array. Because the page
+header takes some space on a page, the binary tree isn't perfect. That is,
+a few right-most leaf nodes are missing, and there are some useless non-leaf
+nodes at the right. So the tree looks something like this:
+
+ 0
+ 1 2
+ 3 4 5 6
+7 8 9 A B
+
+where the numbers denote each node's position in the array. Note that the
+tree is guaranteed complete above the leaf level; only some leaf nodes are
+missing. This is reflected in the number of usable "slots" per page not
+being an exact power of 2.
+
+A FSM page also has a next slot pointer, fp_next_slot, that determines where
+to start the next search for free space within that page. The reason for that
+is to spread out the pages that are returned by FSM searches. When several
+backends are concurrently inserting into a relation, contention can be avoided
+by having them insert into different pages. But it is also desirable to fill
+up pages in sequential order, to get the benefit of OS prefetching and batched
+writes. The FSM is responsible for making that happen, and the next slot
+pointer helps provide the desired behavior.
+
+Higher-level structure
+----------------------
+
+To scale up the data structure described above beyond a single page, we
+maintain a similar tree-structure across pages. Leaf nodes in higher level
+pages correspond to lower level FSM pages. The root node within each page
+has the same value as the corresponding leaf node on its parent page.
+
+The root page is always stored at physical block 0.
+
+For example, assuming each FSM page can hold information about 4 pages (in
+reality, it holds (BLCKSZ - headers) / 2, or ~4000 with default BLCKSZ),
+we get a disk layout like this:
+
+ 0 <-- page 0 at level 2 (root page)
+ 0 <-- page 0 at level 1
+ 0 <-- page 0 at level 0
+ 1 <-- page 1 at level 0
+ 2 <-- ...
+ 3
+ 1 <-- page 1 at level 1
+ 4
+ 5
+ 6
+ 7
+ 2
+ 8
+ 9
+ 10
+ 11
+ 3
+ 12
+ 13
+ 14
+ 15
+
+where the numbers are page numbers *at that level*, starting from 0.
+
+To find the physical block # corresponding to leaf page n, we need to
+count the number of leaf and upper-level pages preceding page n.
+This turns out to be
+
+y = n + (n / F + 1) + (n / F^2 + 1) + ... + 1
+
+where F is the fanout (4 in the above example). The first term n is the number
+of preceding leaf pages, the second term is the number of pages at level 1,
+and so forth.
+
+To keep things simple, the tree is always constant height. To cover the
+maximum relation size of 2^32-1 blocks, three levels is enough with the default
+BLCKSZ (4000^3 > 2^32).
+
+Addressing
+----------
+
+The higher-level routines operate on "logical" addresses, consisting of
+- level,
+- logical page number, and
+- slot (if applicable)
+
+Bottom level FSM pages have level of 0, the level above that 1, and root 2.
+As in the diagram above, logical page number is the page number at that level,
+starting from 0.
+
+Locking
+-------
+
+When traversing down to search for free space, only one page is locked at a
+time: the parent page is released before locking the child. If the child page
+is concurrently modified, and there no longer is free space on the child page
+when you land on it, you need to start from scratch (after correcting the
+parent page, so that you don't get into an infinite loop).
+
+We use shared buffer locks when searching, but exclusive buffer lock when
+updating a page. However, the next slot search pointer is updated during
+searches even though we have only a shared lock. fp_next_slot is just a hint
+and we can easily reset it if it gets corrupted; so it seems better to accept
+some risk of that type than to pay the overhead of exclusive locking.
+
+Recovery
+--------
+
+The FSM is not explicitly WAL-logged. Instead, we rely on a bunch of
+self-correcting measures to repair possible corruption. As a result when
+we write to the FSM we treat that as a hint and thus use MarkBufferDirtyHint()
+rather than MarkBufferDirty().
+
+First of all, whenever a value is set on an FSM page, the root node of the
+page is compared against the new value after bubbling up the change is
+finished. It should be greater than or equal to the value just set, or we
+have a corrupted page, with a parent somewhere with too small a value.
+Secondly, if we detect corrupted pages while we search, traversing down
+the tree. That check will notice if a parent node is set to too high a value.
+In both cases, the upper nodes on the page are immediately rebuilt, fixing
+the corruption so far as that page is concerned.
+
+VACUUM updates all the bottom-level FSM pages with the correct amount of free
+space on corresponding heap pages, as it proceeds through the heap. This
+goes through fsm_set_avail(), so that the upper nodes on those pages are
+immediately updated. Periodically, VACUUM calls FreeSpaceMapVacuum[Range]
+to propagate the new free-space info into the upper pages of the FSM tree.
+
+TODO
+----
+
+- fastroot to avoid traversing upper nodes with just 1 child
+- use a different system for tables that fit into one FSM page, with a
+ mechanism to switch to the real thing as it grows.
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
new file mode 100644
index 0000000..8c12dda
--- /dev/null
+++ b/src/backend/storage/freespace/freespace.c
@@ -0,0 +1,893 @@
+/*-------------------------------------------------------------------------
+ *
+ * freespace.c
+ * POSTGRES free space map for quickly finding free space in relations
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/freespace/freespace.c
+ *
+ *
+ * NOTES:
+ *
+ * Free Space Map keeps track of the amount of free space on pages, and
+ * allows quickly searching for a page with enough free space. The FSM is
+ * stored in a dedicated relation fork of all heap relations, and those
+ * index access methods that need it (see also indexfsm.c). See README for
+ * more information.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/xlogutils.h"
+#include "miscadmin.h"
+#include "storage/freespace.h"
+#include "storage/fsm_internals.h"
+#include "storage/lmgr.h"
+#include "storage/smgr.h"
+
+
+/*
+ * We use just one byte to store the amount of free space on a page, so we
+ * divide the amount of free space a page can have into 256 different
+ * categories. The highest category, 255, represents a page with at least
+ * MaxFSMRequestSize bytes of free space, and the second highest category
+ * represents the range from 254 * FSM_CAT_STEP, inclusive, to
+ * MaxFSMRequestSize, exclusive.
+ *
+ * MaxFSMRequestSize depends on the architecture and BLCKSZ, but assuming
+ * default 8k BLCKSZ, and that MaxFSMRequestSize is 8164 bytes, the
+ * categories look like this:
+ *
+ *
+ * Range Category
+ * 0 - 31 0
+ * 32 - 63 1
+ * ... ... ...
+ * 8096 - 8127 253
+ * 8128 - 8163 254
+ * 8164 - 8192 255
+ *
+ * The reason that MaxFSMRequestSize is special is that if MaxFSMRequestSize
+ * isn't equal to a range boundary, a page with exactly MaxFSMRequestSize
+ * bytes of free space wouldn't satisfy a request for MaxFSMRequestSize
+ * bytes. If there isn't more than MaxFSMRequestSize bytes of free space on a
+ * completely empty page, that would mean that we could never satisfy a
+ * request of exactly MaxFSMRequestSize bytes.
+ */
+#define FSM_CATEGORIES 256
+#define FSM_CAT_STEP (BLCKSZ / FSM_CATEGORIES)
+#define MaxFSMRequestSize MaxHeapTupleSize
+
+/*
+ * Depth of the on-disk tree. We need to be able to address 2^32-1 blocks,
+ * and 1626 is the smallest number that satisfies X^3 >= 2^32-1. Likewise,
+ * 216 is the smallest number that satisfies X^4 >= 2^32-1. In practice,
+ * this means that 4096 bytes is the smallest BLCKSZ that we can get away
+ * with a 3-level tree, and 512 is the smallest we support.
+ */
+#define FSM_TREE_DEPTH ((SlotsPerFSMPage >= 1626) ? 3 : 4)
+
+#define FSM_ROOT_LEVEL (FSM_TREE_DEPTH - 1)
+#define FSM_BOTTOM_LEVEL 0
+
+/*
+ * The internal FSM routines work on a logical addressing scheme. Each
+ * level of the tree can be thought of as a separately addressable file.
+ */
+typedef struct
+{
+ int level; /* level */
+ int logpageno; /* page number within the level */
+} FSMAddress;
+
+/* Address of the root page. */
+static const FSMAddress FSM_ROOT_ADDRESS = {FSM_ROOT_LEVEL, 0};
+
+/* functions to navigate the tree */
+static FSMAddress fsm_get_child(FSMAddress parent, uint16 slot);
+static FSMAddress fsm_get_parent(FSMAddress child, uint16 *slot);
+static FSMAddress fsm_get_location(BlockNumber heapblk, uint16 *slot);
+static BlockNumber fsm_get_heap_blk(FSMAddress addr, uint16 slot);
+static BlockNumber fsm_logical_to_physical(FSMAddress addr);
+
+static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend);
+static void fsm_extend(Relation rel, BlockNumber fsm_nblocks);
+
+/* functions to convert amount of free space to a FSM category */
+static uint8 fsm_space_avail_to_cat(Size avail);
+static uint8 fsm_space_needed_to_cat(Size needed);
+static Size fsm_space_cat_to_avail(uint8 cat);
+
+/* workhorse functions for various operations */
+static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
+ uint8 newValue, uint8 minValue);
+static BlockNumber fsm_search(Relation rel, uint8 min_cat);
+static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr,
+ BlockNumber start, BlockNumber end,
+ bool *eof);
+
+
+/******** Public API ********/
+
+/*
+ * GetPageWithFreeSpace - try to find a page in the given relation with
+ * at least the specified amount of free space.
+ *
+ * If successful, return the block number; if not, return InvalidBlockNumber.
+ *
+ * The caller must be prepared for the possibility that the returned page
+ * will turn out to have too little space available by the time the caller
+ * gets a lock on it. In that case, the caller should report the actual
+ * amount of free space available on that page and then try again (see
+ * RecordAndGetPageWithFreeSpace). If InvalidBlockNumber is returned,
+ * extend the relation.
+ */
+BlockNumber
+GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
+{
+ uint8 min_cat = fsm_space_needed_to_cat(spaceNeeded);
+
+ return fsm_search(rel, min_cat);
+}
+
+/*
+ * RecordAndGetPageWithFreeSpace - update info about a page and try again.
+ *
+ * We provide this combo form to save some locking overhead, compared to
+ * separate RecordPageWithFreeSpace + GetPageWithFreeSpace calls. There's
+ * also some effort to return a page close to the old page; if there's a
+ * page with enough free space on the same FSM page where the old one page
+ * is located, it is preferred.
+ */
+BlockNumber
+RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
+ Size oldSpaceAvail, Size spaceNeeded)
+{
+ int old_cat = fsm_space_avail_to_cat(oldSpaceAvail);
+ int search_cat = fsm_space_needed_to_cat(spaceNeeded);
+ FSMAddress addr;
+ uint16 slot;
+ int search_slot;
+
+ /* Get the location of the FSM byte representing the heap block */
+ addr = fsm_get_location(oldPage, &slot);
+
+ search_slot = fsm_set_and_search(rel, addr, slot, old_cat, search_cat);
+
+ /*
+ * If fsm_set_and_search found a suitable new block, return that.
+ * Otherwise, search as usual.
+ */
+ if (search_slot != -1)
+ return fsm_get_heap_blk(addr, search_slot);
+ else
+ return fsm_search(rel, search_cat);
+}
+
+/*
+ * RecordPageWithFreeSpace - update info about a page.
+ *
+ * Note that if the new spaceAvail value is higher than the old value stored
+ * in the FSM, the space might not become visible to searchers until the next
+ * FreeSpaceMapVacuum call, which updates the upper level pages.
+ */
+void
+RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
+{
+ int new_cat = fsm_space_avail_to_cat(spaceAvail);
+ FSMAddress addr;
+ uint16 slot;
+
+ /* Get the location of the FSM byte representing the heap block */
+ addr = fsm_get_location(heapBlk, &slot);
+
+ fsm_set_and_search(rel, addr, slot, new_cat, 0);
+}
+
+/*
+ * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in
+ * WAL replay
+ */
+void
+XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
+ Size spaceAvail)
+{
+ int new_cat = fsm_space_avail_to_cat(spaceAvail);
+ FSMAddress addr;
+ uint16 slot;
+ BlockNumber blkno;
+ Buffer buf;
+ Page page;
+
+ /* Get the location of the FSM byte representing the heap block */
+ addr = fsm_get_location(heapBlk, &slot);
+ blkno = fsm_logical_to_physical(addr);
+
+ /* If the page doesn't exist already, extend */
+ buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ page = BufferGetPage(buf);
+ if (PageIsNew(page))
+ PageInit(page, BLCKSZ, 0);
+
+ if (fsm_set_avail(page, slot, new_cat))
+ MarkBufferDirtyHint(buf, false);
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * GetRecordedFreeSpace - return the amount of free space on a particular page,
+ * according to the FSM.
+ */
+Size
+GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk)
+{
+ FSMAddress addr;
+ uint16 slot;
+ Buffer buf;
+ uint8 cat;
+
+ /* Get the location of the FSM byte representing the heap block */
+ addr = fsm_get_location(heapBlk, &slot);
+
+ buf = fsm_readbuf(rel, addr, false);
+ if (!BufferIsValid(buf))
+ return 0;
+ cat = fsm_get_avail(BufferGetPage(buf), slot);
+ ReleaseBuffer(buf);
+
+ return fsm_space_cat_to_avail(cat);
+}
+
+/*
+ * FreeSpaceMapPrepareTruncateRel - prepare for truncation of a relation.
+ *
+ * nblocks is the new size of the heap.
+ *
+ * Return the number of blocks of new FSM.
+ * If it's InvalidBlockNumber, there is nothing to truncate;
+ * otherwise the caller is responsible for calling smgrtruncate()
+ * to truncate the FSM pages, and FreeSpaceMapVacuumRange()
+ * to update upper-level pages in the FSM.
+ */
+BlockNumber
+FreeSpaceMapPrepareTruncateRel(Relation rel, BlockNumber nblocks)
+{
+ BlockNumber new_nfsmblocks;
+ FSMAddress first_removed_address;
+ uint16 first_removed_slot;
+ Buffer buf;
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * If no FSM has been created yet for this relation, there's nothing to
+ * truncate.
+ */
+ if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ return InvalidBlockNumber;
+
+ /* Get the location in the FSM of the first removed heap block */
+ first_removed_address = fsm_get_location(nblocks, &first_removed_slot);
+
+ /*
+ * Zero out the tail of the last remaining FSM page. If the slot
+ * representing the first removed heap block is at a page boundary, as the
+ * first slot on the FSM page that first_removed_address points to, we can
+ * just truncate that page altogether.
+ */
+ if (first_removed_slot > 0)
+ {
+ buf = fsm_readbuf(rel, first_removed_address, false);
+ if (!BufferIsValid(buf))
+ return InvalidBlockNumber; /* nothing to do; the FSM was already
+ * smaller */
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* NO EREPORT(ERROR) from here till changes are logged */
+ START_CRIT_SECTION();
+
+ fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
+
+ /*
+ * Truncation of a relation is WAL-logged at a higher-level, and we
+ * will be called at WAL replay. But if checksums are enabled, we need
+ * to still write a WAL record to protect against a torn page, if the
+ * page is flushed to disk before the truncation WAL record. We cannot
+ * use MarkBufferDirtyHint here, because that will not dirty the page
+ * during recovery.
+ */
+ MarkBufferDirty(buf);
+ if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+ log_newpage_buffer(buf, false);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(buf);
+
+ new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
+ }
+ else
+ {
+ new_nfsmblocks = fsm_logical_to_physical(first_removed_address);
+ if (smgrnblocks(rel->rd_smgr, FSM_FORKNUM) <= new_nfsmblocks)
+ return InvalidBlockNumber; /* nothing to do; the FSM was already
+ * smaller */
+ }
+
+ return new_nfsmblocks;
+}
+
+/*
+ * FreeSpaceMapVacuum - update upper-level pages in the rel's FSM
+ *
+ * We assume that the bottom-level pages have already been updated with
+ * new free-space information.
+ */
+void
+FreeSpaceMapVacuum(Relation rel)
+{
+ bool dummy;
+
+ /* Recursively scan the tree, starting at the root */
+ (void) fsm_vacuum_page(rel, FSM_ROOT_ADDRESS,
+ (BlockNumber) 0, InvalidBlockNumber,
+ &dummy);
+}
+
+/*
+ * FreeSpaceMapVacuumRange - update upper-level pages in the rel's FSM
+ *
+ * As above, but assume that only heap pages between start and end-1 inclusive
+ * have new free-space information, so update only the upper-level slots
+ * covering that block range. end == InvalidBlockNumber is equivalent to
+ * "all the rest of the relation".
+ */
+void
+FreeSpaceMapVacuumRange(Relation rel, BlockNumber start, BlockNumber end)
+{
+ bool dummy;
+
+ /* Recursively scan the tree, starting at the root */
+ if (end > start)
+ (void) fsm_vacuum_page(rel, FSM_ROOT_ADDRESS, start, end, &dummy);
+}
+
+/******** Internal routines ********/
+
+/*
+ * Return category corresponding x bytes of free space
+ */
+static uint8
+fsm_space_avail_to_cat(Size avail)
+{
+ int cat;
+
+ Assert(avail < BLCKSZ);
+
+ if (avail >= MaxFSMRequestSize)
+ return 255;
+
+ cat = avail / FSM_CAT_STEP;
+
+ /*
+ * The highest category, 255, is reserved for MaxFSMRequestSize bytes or
+ * more.
+ */
+ if (cat > 254)
+ cat = 254;
+
+ return (uint8) cat;
+}
+
+/*
+ * Return the lower bound of the range of free space represented by given
+ * category.
+ */
+static Size
+fsm_space_cat_to_avail(uint8 cat)
+{
+ /* The highest category represents exactly MaxFSMRequestSize bytes. */
+ if (cat == 255)
+ return MaxFSMRequestSize;
+ else
+ return cat * FSM_CAT_STEP;
+}
+
+/*
+ * Which category does a page need to have, to accommodate x bytes of data?
+ * While fsm_space_avail_to_cat() rounds down, this needs to round up.
+ */
+static uint8
+fsm_space_needed_to_cat(Size needed)
+{
+ int cat;
+
+ /* Can't ask for more space than the highest category represents */
+ if (needed > MaxFSMRequestSize)
+ elog(ERROR, "invalid FSM request size %zu", needed);
+
+ if (needed == 0)
+ return 1;
+
+ cat = (needed + FSM_CAT_STEP - 1) / FSM_CAT_STEP;
+
+ if (cat > 255)
+ cat = 255;
+
+ return (uint8) cat;
+}
+
+/*
+ * Returns the physical block number of a FSM page
+ */
+static BlockNumber
+fsm_logical_to_physical(FSMAddress addr)
+{
+ BlockNumber pages;
+ int leafno;
+ int l;
+
+ /*
+ * Calculate the logical page number of the first leaf page below the
+ * given page.
+ */
+ leafno = addr.logpageno;
+ for (l = 0; l < addr.level; l++)
+ leafno *= SlotsPerFSMPage;
+
+ /* Count upper level nodes required to address the leaf page */
+ pages = 0;
+ for (l = 0; l < FSM_TREE_DEPTH; l++)
+ {
+ pages += leafno + 1;
+ leafno /= SlotsPerFSMPage;
+ }
+
+ /*
+ * If the page we were asked for wasn't at the bottom level, subtract the
+ * additional lower level pages we counted above.
+ */
+ pages -= addr.level;
+
+ /* Turn the page count into 0-based block number */
+ return pages - 1;
+}
+
+/*
+ * Return the FSM location corresponding to given heap block.
+ */
+static FSMAddress
+fsm_get_location(BlockNumber heapblk, uint16 *slot)
+{
+ FSMAddress addr;
+
+ addr.level = FSM_BOTTOM_LEVEL;
+ addr.logpageno = heapblk / SlotsPerFSMPage;
+ *slot = heapblk % SlotsPerFSMPage;
+
+ return addr;
+}
+
+/*
+ * Return the heap block number corresponding to given location in the FSM.
+ */
+static BlockNumber
+fsm_get_heap_blk(FSMAddress addr, uint16 slot)
+{
+ Assert(addr.level == FSM_BOTTOM_LEVEL);
+ return ((unsigned int) addr.logpageno) * SlotsPerFSMPage + slot;
+}
+
+/*
+ * Given a logical address of a child page, get the logical page number of
+ * the parent, and the slot within the parent corresponding to the child.
+ */
+static FSMAddress
+fsm_get_parent(FSMAddress child, uint16 *slot)
+{
+ FSMAddress parent;
+
+ Assert(child.level < FSM_ROOT_LEVEL);
+
+ parent.level = child.level + 1;
+ parent.logpageno = child.logpageno / SlotsPerFSMPage;
+ *slot = child.logpageno % SlotsPerFSMPage;
+
+ return parent;
+}
+
+/*
+ * Given a logical address of a parent page and a slot number, get the
+ * logical address of the corresponding child page.
+ */
+static FSMAddress
+fsm_get_child(FSMAddress parent, uint16 slot)
+{
+ FSMAddress child;
+
+ Assert(parent.level > FSM_BOTTOM_LEVEL);
+
+ child.level = parent.level - 1;
+ child.logpageno = parent.logpageno * SlotsPerFSMPage + slot;
+
+ return child;
+}
+
+/*
+ * Read a FSM page.
+ *
+ * If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is
+ * true, the FSM file is extended.
+ */
+static Buffer
+fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
+{
+ BlockNumber blkno = fsm_logical_to_physical(addr);
+ Buffer buf;
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * If we haven't cached the size of the FSM yet, check it first. Also
+ * recheck if the requested block seems to be past end, since our cached
+ * value might be stale. (We send smgr inval messages on truncation, but
+ * not on extension.)
+ */
+ if (rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber ||
+ blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
+ {
+ /* Invalidate the cache so smgrnblocks asks the kernel. */
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
+ if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
+ else
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = 0;
+ }
+
+ /* Handle requests beyond EOF */
+ if (blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM])
+ {
+ if (extend)
+ fsm_extend(rel, blkno + 1);
+ else
+ return InvalidBuffer;
+ }
+
+ /*
+ * Use ZERO_ON_ERROR mode, and initialize the page if necessary. The FSM
+ * information is not accurate anyway, so it's better to clear corrupt
+ * pages than error out. Since the FSM changes are not WAL-logged, the
+ * so-called torn page problem on crash can lead to pages with corrupt
+ * headers, for example.
+ *
+ * The initialize-the-page part is trickier than it looks, because of the
+ * possibility of multiple backends doing this concurrently, and our
+ * desire to not uselessly take the buffer lock in the normal path where
+ * the page is OK. We must take the lock to initialize the page, so
+ * recheck page newness after we have the lock, in case someone else
+ * already did it. Also, because we initially check PageIsNew with no
+ * lock, it's possible to fall through and return the buffer while someone
+ * else is still initializing the page (i.e., we might see pd_upper as set
+ * but other page header fields are still zeroes). This is harmless for
+ * callers that will take a buffer lock themselves, but some callers
+ * inspect the page without any lock at all. The latter is OK only so
+ * long as it doesn't depend on the page header having correct contents.
+ * Current usage is safe because PageGetContents() does not require that.
+ */
+ buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR, NULL);
+ if (PageIsNew(BufferGetPage(buf)))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ if (PageIsNew(BufferGetPage(buf)))
+ PageInit(BufferGetPage(buf), BLCKSZ, 0);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ return buf;
+}
+
+/*
+ * Ensure that the FSM fork is at least fsm_nblocks long, extending
+ * it if necessary with empty pages. And by empty, I mean pages filled
+ * with zeros, meaning there's no free space.
+ */
+static void
+fsm_extend(Relation rel, BlockNumber fsm_nblocks)
+{
+ BlockNumber fsm_nblocks_now;
+ PGAlignedBlock pg;
+
+ PageInit((Page) pg.data, BLCKSZ, 0);
+
+ /*
+ * We use the relation extension lock to lock out other backends trying to
+ * extend the FSM at the same time. It also locks out extension of the
+ * main fork, unnecessarily, but extending the FSM happens seldom enough
+ * that it doesn't seem worthwhile to have a separate lock tag type for
+ * it.
+ *
+ * Note that another backend might have extended or created the relation
+ * by the time we get the lock.
+ */
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ /* Might have to re-open if a cache flush happened */
+ RelationOpenSmgr(rel);
+
+ /*
+ * Create the FSM file first if it doesn't exist. If
+ * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
+ * need for an smgrexists call.
+ */
+ if ((rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
+ !smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
+
+ /* Invalidate cache so that smgrnblocks() asks the kernel. */
+ rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
+ fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);
+
+ while (fsm_nblocks_now < fsm_nblocks)
+ {
+ PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
+
+ smgrextend(rel->rd_smgr, FSM_FORKNUM, fsm_nblocks_now,
+ pg.data, false);
+ fsm_nblocks_now++;
+ }
+
+ UnlockRelationForExtension(rel, ExclusiveLock);
+}
+
+/*
+ * Set value in given FSM page and slot.
+ *
+ * If minValue > 0, the updated page is also searched for a page with at
+ * least minValue of free space. If one is found, its slot number is
+ * returned, -1 otherwise.
+ */
+static int
+fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
+ uint8 newValue, uint8 minValue)
+{
+ Buffer buf;
+ Page page;
+ int newslot = -1;
+
+ buf = fsm_readbuf(rel, addr, true);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ page = BufferGetPage(buf);
+
+ if (fsm_set_avail(page, slot, newValue))
+ MarkBufferDirtyHint(buf, false);
+
+ if (minValue != 0)
+ {
+ /* Search while we still hold the lock */
+ newslot = fsm_search_avail(buf, minValue,
+ addr.level == FSM_BOTTOM_LEVEL,
+ true);
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ return newslot;
+}
+
+/*
+ * Search the tree for a heap page with at least min_cat of free space
+ */
+static BlockNumber
+fsm_search(Relation rel, uint8 min_cat)
+{
+ int restarts = 0;
+ FSMAddress addr = FSM_ROOT_ADDRESS;
+
+ for (;;)
+ {
+ int slot;
+ Buffer buf;
+ uint8 max_avail = 0;
+
+ /* Read the FSM page. */
+ buf = fsm_readbuf(rel, addr, false);
+
+ /* Search within the page */
+ if (BufferIsValid(buf))
+ {
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ slot = fsm_search_avail(buf, min_cat,
+ (addr.level == FSM_BOTTOM_LEVEL),
+ false);
+ if (slot == -1)
+ max_avail = fsm_get_max_avail(BufferGetPage(buf));
+ UnlockReleaseBuffer(buf);
+ }
+ else
+ slot = -1;
+
+ if (slot != -1)
+ {
+ /*
+ * Descend the tree, or return the found block if we're at the
+ * bottom.
+ */
+ if (addr.level == FSM_BOTTOM_LEVEL)
+ return fsm_get_heap_blk(addr, slot);
+
+ addr = fsm_get_child(addr, slot);
+ }
+ else if (addr.level == FSM_ROOT_LEVEL)
+ {
+ /*
+ * At the root, failure means there's no page with enough free
+ * space in the FSM. Give up.
+ */
+ return InvalidBlockNumber;
+ }
+ else
+ {
+ uint16 parentslot;
+ FSMAddress parent;
+
+ /*
+ * At lower level, failure can happen if the value in the upper-
+ * level node didn't reflect the value on the lower page. Update
+ * the upper node, to avoid falling into the same trap again, and
+ * start over.
+ *
+ * There's a race condition here, if another backend updates this
+ * page right after we release it, and gets the lock on the parent
+ * page before us. We'll then update the parent page with the now
+ * stale information we had. It's OK, because it should happen
+ * rarely, and will be fixed by the next vacuum.
+ */
+ parent = fsm_get_parent(addr, &parentslot);
+ fsm_set_and_search(rel, parent, parentslot, max_avail, 0);
+
+ /*
+ * If the upper pages are badly out of date, we might need to loop
+ * quite a few times, updating them as we go. Any inconsistencies
+ * should eventually be corrected and the loop should end. Looping
+ * indefinitely is nevertheless scary, so provide an emergency
+ * valve.
+ */
+ if (restarts++ > 10000)
+ return InvalidBlockNumber;
+
+ /* Start search all over from the root */
+ addr = FSM_ROOT_ADDRESS;
+ }
+ }
+}
+
+
+/*
+ * Recursive guts of FreeSpaceMapVacuum
+ *
+ * Examine the FSM page indicated by addr, as well as its children, updating
+ * upper-level nodes that cover the heap block range from start to end-1.
+ * (It's okay if end is beyond the actual end of the map.)
+ * Return the maximum freespace value on this page.
+ *
+ * If addr is past the end of the FSM, set *eof_p to true and return 0.
+ *
+ * This traverses the tree in depth-first order. The tree is stored
+ * physically in depth-first order, so this should be pretty I/O efficient.
+ */
+static uint8
+fsm_vacuum_page(Relation rel, FSMAddress addr,
+ BlockNumber start, BlockNumber end,
+ bool *eof_p)
+{
+ Buffer buf;
+ Page page;
+ uint8 max_avail;
+
+ /* Read the page if it exists, or return EOF */
+ buf = fsm_readbuf(rel, addr, false);
+ if (!BufferIsValid(buf))
+ {
+ *eof_p = true;
+ return 0;
+ }
+ else
+ *eof_p = false;
+
+ page = BufferGetPage(buf);
+
+ /*
+ * If we're above the bottom level, recurse into children, and fix the
+ * information stored about them at this level.
+ */
+ if (addr.level > FSM_BOTTOM_LEVEL)
+ {
+ FSMAddress fsm_start,
+ fsm_end;
+ uint16 fsm_start_slot,
+ fsm_end_slot;
+ int slot,
+ start_slot,
+ end_slot;
+ bool eof = false;
+
+ /*
+ * Compute the range of slots we need to update on this page, given
+ * the requested range of heap blocks to consider. The first slot to
+ * update is the one covering the "start" block, and the last slot is
+ * the one covering "end - 1". (Some of this work will be duplicated
+ * in each recursive call, but it's cheap enough to not worry about.)
+ */
+ fsm_start = fsm_get_location(start, &fsm_start_slot);
+ fsm_end = fsm_get_location(end - 1, &fsm_end_slot);
+
+ while (fsm_start.level < addr.level)
+ {
+ fsm_start = fsm_get_parent(fsm_start, &fsm_start_slot);
+ fsm_end = fsm_get_parent(fsm_end, &fsm_end_slot);
+ }
+ Assert(fsm_start.level == addr.level);
+
+ if (fsm_start.logpageno == addr.logpageno)
+ start_slot = fsm_start_slot;
+ else if (fsm_start.logpageno > addr.logpageno)
+ start_slot = SlotsPerFSMPage; /* shouldn't get here... */
+ else
+ start_slot = 0;
+
+ if (fsm_end.logpageno == addr.logpageno)
+ end_slot = fsm_end_slot;
+ else if (fsm_end.logpageno > addr.logpageno)
+ end_slot = SlotsPerFSMPage - 1;
+ else
+ end_slot = -1; /* shouldn't get here... */
+
+ for (slot = start_slot; slot <= end_slot; slot++)
+ {
+ int child_avail;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* After we hit end-of-file, just clear the rest of the slots */
+ if (!eof)
+ child_avail = fsm_vacuum_page(rel, fsm_get_child(addr, slot),
+ start, end,
+ &eof);
+ else
+ child_avail = 0;
+
+ /* Update information about the child */
+ if (fsm_get_avail(page, slot) != child_avail)
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ fsm_set_avail(page, slot, child_avail);
+ MarkBufferDirtyHint(buf, false);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ }
+ }
+
+ /* Now get the maximum value on the page, to return to caller */
+ max_avail = fsm_get_max_avail(page);
+
+ /*
+ * Reset the next slot pointer. This encourages the use of low-numbered
+ * pages, increasing the chances that a later vacuum can truncate the
+ * relation. We don't bother with a lock here, nor with marking the page
+ * dirty if it wasn't already, since this is just a hint.
+ */
+ ((FSMPage) PageGetContents(page))->fp_next_slot = 0;
+
+ ReleaseBuffer(buf);
+
+ return max_avail;
+}
diff --git a/src/backend/storage/freespace/fsmpage.c b/src/backend/storage/freespace/fsmpage.c
new file mode 100644
index 0000000..88ae51e
--- /dev/null
+++ b/src/backend/storage/freespace/fsmpage.c
@@ -0,0 +1,374 @@
+/*-------------------------------------------------------------------------
+ *
+ * fsmpage.c
+ * routines to search and manipulate one FSM page.
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/freespace/fsmpage.c
+ *
+ * NOTES:
+ *
+ * The public functions in this file form an API that hides the internal
+ * structure of a FSM page. This allows freespace.c to treat each FSM page
+ * as a black box with SlotsPerPage "slots". fsm_set_avail() and
+ * fsm_get_avail() let you get/set the value of a slot, and
+ * fsm_search_avail() lets you search for a slot with value >= X.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/bufmgr.h"
+#include "storage/fsm_internals.h"
+
+/* Macros to navigate the tree within a page. Root has index zero. */
+#define leftchild(x) (2 * (x) + 1)
+#define rightchild(x) (2 * (x) + 2)
+#define parentof(x) (((x) - 1) / 2)
+
+/*
+ * Find right neighbor of x, wrapping around within the level
+ */
+static int
+rightneighbor(int x)
+{
+ /*
+ * Move right. This might wrap around, stepping to the leftmost node at
+ * the next level.
+ */
+ x++;
+
+ /*
+ * Check if we stepped to the leftmost node at next level, and correct if
+ * so. The leftmost nodes at each level are numbered x = 2^level - 1, so
+ * check if (x + 1) is a power of two, using a standard
+ * twos-complement-arithmetic trick.
+ */
+ if (((x + 1) & x) == 0)
+ x = parentof(x);
+
+ return x;
+}
+
+/*
+ * Sets the value of a slot on page. Returns true if the page was modified.
+ *
+ * The caller must hold an exclusive lock on the page.
+ */
+bool
+fsm_set_avail(Page page, int slot, uint8 value)
+{
+ int nodeno = NonLeafNodesPerPage + slot;
+ FSMPage fsmpage = (FSMPage) PageGetContents(page);
+ uint8 oldvalue;
+
+ Assert(slot < LeafNodesPerPage);
+
+ oldvalue = fsmpage->fp_nodes[nodeno];
+
+ /* If the value hasn't changed, we don't need to do anything */
+ if (oldvalue == value && value <= fsmpage->fp_nodes[0])
+ return false;
+
+ fsmpage->fp_nodes[nodeno] = value;
+
+ /*
+ * Propagate up, until we hit the root or a node that doesn't need to be
+ * updated.
+ */
+ do
+ {
+ uint8 newvalue = 0;
+ int lchild;
+ int rchild;
+
+ nodeno = parentof(nodeno);
+ lchild = leftchild(nodeno);
+ rchild = lchild + 1;
+
+ newvalue = fsmpage->fp_nodes[lchild];
+ if (rchild < NodesPerPage)
+ newvalue = Max(newvalue,
+ fsmpage->fp_nodes[rchild]);
+
+ oldvalue = fsmpage->fp_nodes[nodeno];
+ if (oldvalue == newvalue)
+ break;
+
+ fsmpage->fp_nodes[nodeno] = newvalue;
+ } while (nodeno > 0);
+
+ /*
+ * sanity check: if the new value is (still) higher than the value at the
+ * top, the tree is corrupt. If so, rebuild.
+ */
+ if (value > fsmpage->fp_nodes[0])
+ fsm_rebuild_page(page);
+
+ return true;
+}
+
+/*
+ * Returns the value of given slot on page.
+ *
+ * Since this is just a read-only access of a single byte, the page doesn't
+ * need to be locked.
+ */
+uint8
+fsm_get_avail(Page page, int slot)
+{
+ FSMPage fsmpage = (FSMPage) PageGetContents(page);
+
+ Assert(slot < LeafNodesPerPage);
+
+ return fsmpage->fp_nodes[NonLeafNodesPerPage + slot];
+}
+
+/*
+ * Returns the value at the root of a page.
+ *
+ * Since this is just a read-only access of a single byte, the page doesn't
+ * need to be locked.
+ */
+uint8
+fsm_get_max_avail(Page page)
+{
+ FSMPage fsmpage = (FSMPage) PageGetContents(page);
+
+ return fsmpage->fp_nodes[0];
+}
+
+/*
+ * Searches for a slot with category at least minvalue.
+ * Returns slot number, or -1 if none found.
+ *
+ * The caller must hold at least a shared lock on the page, and this
+ * function can unlock and lock the page again in exclusive mode if it
+ * needs to be updated. exclusive_lock_held should be set to true if the
+ * caller is already holding an exclusive lock, to avoid extra work.
+ *
+ * If advancenext is false, fp_next_slot is set to point to the returned
+ * slot, and if it's true, to the slot after the returned slot.
+ */
+int
+fsm_search_avail(Buffer buf, uint8 minvalue, bool advancenext,
+ bool exclusive_lock_held)
+{
+ Page page = BufferGetPage(buf);
+ FSMPage fsmpage = (FSMPage) PageGetContents(page);
+ int nodeno;
+ int target;
+ uint16 slot;
+
+restart:
+
+ /*
+ * Check the root first, and exit quickly if there's no leaf with enough
+ * free space
+ */
+ if (fsmpage->fp_nodes[0] < minvalue)
+ return -1;
+
+ /*
+ * Start search using fp_next_slot. It's just a hint, so check that it's
+ * sane. (This also handles wrapping around when the prior call returned
+ * the last slot on the page.)
+ */
+ target = fsmpage->fp_next_slot;
+ if (target < 0 || target >= LeafNodesPerPage)
+ target = 0;
+ target += NonLeafNodesPerPage;
+
+ /*----------
+ * Start the search from the target slot. At every step, move one
+ * node to the right, then climb up to the parent. Stop when we reach
+ * a node with enough free space (as we must, since the root has enough
+ * space).
+ *
+ * The idea is to gradually expand our "search triangle", that is, all
+ * nodes covered by the current node, and to be sure we search to the
+ * right from the start point. At the first step, only the target slot
+ * is examined. When we move up from a left child to its parent, we are
+ * adding the right-hand subtree of that parent to the search triangle.
+ * When we move right then up from a right child, we are dropping the
+ * current search triangle (which we know doesn't contain any suitable
+ * page) and instead looking at the next-larger-size triangle to its
+ * right. So we never look left from our original start point, and at
+ * each step the size of the search triangle doubles, ensuring it takes
+ * only log2(N) work to search N pages.
+ *
+ * The "move right" operation will wrap around if it hits the right edge
+ * of the tree, so the behavior is still good if we start near the right.
+ * Note also that the move-and-climb behavior ensures that we can't end
+ * up on one of the missing nodes at the right of the leaf level.
+ *
+ * For example, consider this tree:
+ *
+ * 7
+ * 7 6
+ * 5 7 6 5
+ * 4 5 5 7 2 6 5 2
+ * T
+ *
+ * Assume that the target node is the node indicated by the letter T,
+ * and we're searching for a node with value of 6 or higher. The search
+ * begins at T. At the first iteration, we move to the right, then to the
+ * parent, arriving at the rightmost 5. At the second iteration, we move
+ * to the right, wrapping around, then climb up, arriving at the 7 on the
+ * third level. 7 satisfies our search, so we descend down to the bottom,
+ * following the path of sevens. This is in fact the first suitable page
+ * to the right of (allowing for wraparound) our start point.
+ *----------
+ */
+ nodeno = target;
+ while (nodeno > 0)
+ {
+ if (fsmpage->fp_nodes[nodeno] >= minvalue)
+ break;
+
+ /*
+ * Move to the right, wrapping around on same level if necessary, then
+ * climb up.
+ */
+ nodeno = parentof(rightneighbor(nodeno));
+ }
+
+ /*
+ * We're now at a node with enough free space, somewhere in the middle of
+ * the tree. Descend to the bottom, following a path with enough free
+ * space, preferring to move left if there's a choice.
+ */
+ while (nodeno < NonLeafNodesPerPage)
+ {
+ int childnodeno = leftchild(nodeno);
+
+ if (childnodeno < NodesPerPage &&
+ fsmpage->fp_nodes[childnodeno] >= minvalue)
+ {
+ nodeno = childnodeno;
+ continue;
+ }
+ childnodeno++; /* point to right child */
+ if (childnodeno < NodesPerPage &&
+ fsmpage->fp_nodes[childnodeno] >= minvalue)
+ {
+ nodeno = childnodeno;
+ }
+ else
+ {
+ /*
+ * Oops. The parent node promised that either left or right child
+ * has enough space, but neither actually did. This can happen in
+ * case of a "torn page", IOW if we crashed earlier while writing
+ * the page to disk, and only part of the page made it to disk.
+ *
+ * Fix the corruption and restart.
+ */
+ RelFileNode rnode;
+ ForkNumber forknum;
+ BlockNumber blknum;
+
+ BufferGetTag(buf, &rnode, &forknum, &blknum);
+ elog(DEBUG1, "fixing corrupt FSM block %u, relation %u/%u/%u",
+ blknum, rnode.spcNode, rnode.dbNode, rnode.relNode);
+
+ /* make sure we hold an exclusive lock */
+ if (!exclusive_lock_held)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ exclusive_lock_held = true;
+ }
+ fsm_rebuild_page(page);
+ MarkBufferDirtyHint(buf, false);
+ goto restart;
+ }
+ }
+
+ /* We're now at the bottom level, at a node with enough space. */
+ slot = nodeno - NonLeafNodesPerPage;
+
+ /*
+ * Update the next-target pointer. Note that we do this even if we're only
+ * holding a shared lock, on the grounds that it's better to use a shared
+ * lock and get a garbled next pointer every now and then, than take the
+ * concurrency hit of an exclusive lock.
+ *
+ * Wrap-around is handled at the beginning of this function.
+ */
+ fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
+
+ return slot;
+}
+
+/*
+ * Sets the available space to zero for all slots numbered >= nslots.
+ * Returns true if the page was modified.
+ */
+bool
+fsm_truncate_avail(Page page, int nslots)
+{
+ FSMPage fsmpage = (FSMPage) PageGetContents(page);
+ uint8 *ptr;
+ bool changed = false;
+
+ Assert(nslots >= 0 && nslots < LeafNodesPerPage);
+
+ /* Clear all truncated leaf nodes */
+ ptr = &fsmpage->fp_nodes[NonLeafNodesPerPage + nslots];
+ for (; ptr < &fsmpage->fp_nodes[NodesPerPage]; ptr++)
+ {
+ if (*ptr != 0)
+ changed = true;
+ *ptr = 0;
+ }
+
+ /* Fix upper nodes. */
+ if (changed)
+ fsm_rebuild_page(page);
+
+ return changed;
+}
+
+/*
+ * Reconstructs the upper levels of a page. Returns true if the page
+ * was modified.
+ */
+bool
+fsm_rebuild_page(Page page)
+{
+ FSMPage fsmpage = (FSMPage) PageGetContents(page);
+ bool changed = false;
+ int nodeno;
+
+ /*
+ * Start from the lowest non-leaf level, at last node, working our way
+ * backwards, through all non-leaf nodes at all levels, up to the root.
+ */
+ for (nodeno = NonLeafNodesPerPage - 1; nodeno >= 0; nodeno--)
+ {
+ int lchild = leftchild(nodeno);
+ int rchild = lchild + 1;
+ uint8 newvalue = 0;
+
+ /* The first few nodes we examine might have zero or one child. */
+ if (lchild < NodesPerPage)
+ newvalue = fsmpage->fp_nodes[lchild];
+
+ if (rchild < NodesPerPage)
+ newvalue = Max(newvalue,
+ fsmpage->fp_nodes[rchild]);
+
+ if (fsmpage->fp_nodes[nodeno] != newvalue)
+ {
+ fsmpage->fp_nodes[nodeno] = newvalue;
+ changed = true;
+ }
+ }
+
+ return changed;
+}
diff --git a/src/backend/storage/freespace/indexfsm.c b/src/backend/storage/freespace/indexfsm.c
new file mode 100644
index 0000000..d66e10b
--- /dev/null
+++ b/src/backend/storage/freespace/indexfsm.c
@@ -0,0 +1,74 @@
+/*-------------------------------------------------------------------------
+ *
+ * indexfsm.c
+ * POSTGRES free space map for quickly finding free pages in relations
+ *
+ *
+ * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/freespace/indexfsm.c
+ *
+ *
+ * NOTES:
+ *
+ * This is similar to the FSM used for heap, in freespace.c, but instead
+ * of tracking the amount of free space on pages, we only track whether
+ * pages are completely free or in-use. We use the same FSM implementation
+ * as for heaps, using BLCKSZ - 1 to denote used pages, and 0 for unused.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/freespace.h"
+#include "storage/indexfsm.h"
+
+/*
+ * Exported routines
+ */
+
+/*
+ * GetFreeIndexPage - return a free page from the FSM
+ *
+ * As a side effect, the page is marked as used in the FSM.
+ */
+BlockNumber
+GetFreeIndexPage(Relation rel)
+{
+ BlockNumber blkno = GetPageWithFreeSpace(rel, BLCKSZ / 2);
+
+ if (blkno != InvalidBlockNumber)
+ RecordUsedIndexPage(rel, blkno);
+
+ return blkno;
+}
+
+/*
+ * RecordFreeIndexPage - mark a page as free in the FSM
+ */
+void
+RecordFreeIndexPage(Relation rel, BlockNumber freeBlock)
+{
+ RecordPageWithFreeSpace(rel, freeBlock, BLCKSZ - 1);
+}
+
+
+/*
+ * RecordUsedIndexPage - mark a page as used in the FSM
+ */
+void
+RecordUsedIndexPage(Relation rel, BlockNumber usedBlock)
+{
+ RecordPageWithFreeSpace(rel, usedBlock, 0);
+}
+
+/*
+ * IndexFreeSpaceMapVacuum - scan and fix any inconsistencies in the FSM
+ */
+void
+IndexFreeSpaceMapVacuum(Relation rel)
+{
+ FreeSpaceMapVacuum(rel);
+}