summaryrefslogtreecommitdiffstats
path: root/src/backend/storage/freespace/README
diff options
context:
space:
mode:
Diffstat (limited to 'src/backend/storage/freespace/README')
-rw-r--r--src/backend/storage/freespace/README196
1 files changed, 196 insertions, 0 deletions
diff --git a/src/backend/storage/freespace/README b/src/backend/storage/freespace/README
new file mode 100644
index 0000000..e7ff23b
--- /dev/null
+++ b/src/backend/storage/freespace/README
@@ -0,0 +1,196 @@
+src/backend/storage/freespace/README
+
+Free Space Map
+--------------
+
+The purpose of the free space map is to quickly locate a page with enough
+free space to hold a tuple to be stored; or to determine that no such page
+exists and the relation must be extended by one page. As of PostgreSQL 8.4
+each relation has its own, extensible free space map stored in a separate
+"fork" of its relation. This eliminates the disadvantages of the former
+fixed-size FSM.
+
+It is important to keep the map small so that it can be searched rapidly.
+Therefore, we don't attempt to record the exact free space on a page.
+We allocate one map byte to each page, allowing us to record free space
+at a granularity of 1/256th of a page. Another way to say it is that
+the stored value is the free space divided by BLCKSZ/256 (rounding down).
+We assume that the free space must always be less than BLCKSZ, since
+all pages have some overhead; so the maximum map value is 255.
+
+To assist in fast searching, the map isn't simply an array of per-page
+entries, but has a tree structure above those entries. There is a tree
+structure of pages, and a tree structure within each page, as described
+below.
+
+FSM page structure
+------------------
+
+Within each FSM page, we use a binary tree structure where leaf nodes store
+the amount of free space on heap pages (or lower level FSM pages, see
+"Higher-level structure" below), with one leaf node per heap page. A non-leaf
+node stores the max amount of free space on any of its children.
+
+For example:
+
+ 4
+ 4 2
+3 4 0 2 <- This level represents heap pages
+
+We need two basic operations: search and update.
+
+To search for a page with X amount of free space, traverse down the tree
+along a path where n >= X, until you hit the bottom. If both children of a
+node satisfy the condition, you can pick either one arbitrarily.
+
+To update the amount of free space on a page to X, first update the leaf node
+corresponding to the heap page, then "bubble up" the change to upper nodes,
+by walking up to each parent and recomputing its value as the max of its
+two children. Repeat until reaching the root or a parent whose value
+doesn't change.
+
+This data structure has a couple of nice properties:
+- to discover that there is no page with X bytes of free space, you only
+ need to look at the root node
+- by varying which child to traverse to in the search algorithm, when you have
+ a choice, we can implement various strategies, like preferring pages closer
+ to a given page, or spreading the load across the table.
+
+Higher-level routines that use FSM pages access them through the fsm_set_avail()
+and fsm_search_avail() functions. The interface to those functions hides the
+page's internal tree structure, treating the FSM page as a black box that has
+a certain number of "slots" for storing free space information. (However,
+the higher routines have to be aware of the tree structure of the whole map.)
+
+The binary tree is stored on each FSM page as an array. Because the page
+header takes some space on a page, the binary tree isn't perfect. That is,
+a few right-most leaf nodes are missing, and there are some useless non-leaf
+nodes at the right. So the tree looks something like this:
+
+ 0
+ 1 2
+ 3 4 5 6
+7 8 9 A B
+
+where the numbers denote each node's position in the array. Note that the
+tree is guaranteed complete above the leaf level; only some leaf nodes are
+missing. This is reflected in the number of usable "slots" per page not
+being an exact power of 2.
+
+A FSM page also has a next slot pointer, fp_next_slot, that determines where
+to start the next search for free space within that page. The reason for that
+is to spread out the pages that are returned by FSM searches. When several
+backends are concurrently inserting into a relation, contention can be avoided
+by having them insert into different pages. But it is also desirable to fill
+up pages in sequential order, to get the benefit of OS prefetching and batched
+writes. The FSM is responsible for making that happen, and the next slot
+pointer helps provide the desired behavior.
+
+Higher-level structure
+----------------------
+
+To scale up the data structure described above beyond a single page, we
+maintain a similar tree-structure across pages. Leaf nodes in higher level
+pages correspond to lower level FSM pages. The root node within each page
+has the same value as the corresponding leaf node on its parent page.
+
+The root page is always stored at physical block 0.
+
+For example, assuming each FSM page can hold information about 4 pages (in
+reality, it holds (BLCKSZ - headers) / 2, or ~4000 with default BLCKSZ),
+we get a disk layout like this:
+
+ 0 <-- page 0 at level 2 (root page)
+ 0 <-- page 0 at level 1
+ 0 <-- page 0 at level 0
+ 1 <-- page 1 at level 0
+ 2 <-- ...
+ 3
+ 1 <-- page 1 at level 1
+ 4
+ 5
+ 6
+ 7
+ 2
+ 8
+ 9
+ 10
+ 11
+ 3
+ 12
+ 13
+ 14
+ 15
+
+where the numbers are page numbers *at that level*, starting from 0.
+
+To find the physical block # corresponding to leaf page n, we need to
+count the number of leaf and upper-level pages preceding page n.
+This turns out to be
+
+y = n + (n / F + 1) + (n / F^2 + 1) + ... + 1
+
+where F is the fanout (4 in the above example). The first term n is the number
+of preceding leaf pages, the second term is the number of pages at level 1,
+and so forth.
+
+To keep things simple, the tree is always constant height. To cover the
+maximum relation size of 2^32-1 blocks, three levels is enough with the default
+BLCKSZ (4000^3 > 2^32).
+
+Addressing
+----------
+
+The higher-level routines operate on "logical" addresses, consisting of
+- level,
+- logical page number, and
+- slot (if applicable)
+
+Bottom level FSM pages have level of 0, the level above that 1, and root 2.
+As in the diagram above, logical page number is the page number at that level,
+starting from 0.
+
+Locking
+-------
+
+When traversing down to search for free space, only one page is locked at a
+time: the parent page is released before locking the child. If the child page
+is concurrently modified, and there no longer is free space on the child page
+when you land on it, you need to start from scratch (after correcting the
+parent page, so that you don't get into an infinite loop).
+
+We use shared buffer locks when searching, but exclusive buffer lock when
+updating a page. However, the next slot search pointer is updated during
+searches even though we have only a shared lock. fp_next_slot is just a hint
+and we can easily reset it if it gets corrupted; so it seems better to accept
+some risk of that type than to pay the overhead of exclusive locking.
+
+Recovery
+--------
+
+The FSM is not explicitly WAL-logged. Instead, we rely on a bunch of
+self-correcting measures to repair possible corruption. As a result when
+we write to the FSM we treat that as a hint and thus use MarkBufferDirtyHint()
+rather than MarkBufferDirty().
+
+First of all, whenever a value is set on an FSM page, the root node of the
+page is compared against the new value after bubbling up the change is
+finished. It should be greater than or equal to the value just set, or we
+have a corrupted page, with a parent somewhere with too small a value.
+Secondly, if we detect corrupted pages while we search, traversing down
+the tree. That check will notice if a parent node is set to too high a value.
+In both cases, the upper nodes on the page are immediately rebuilt, fixing
+the corruption so far as that page is concerned.
+
+VACUUM updates all the bottom-level FSM pages with the correct amount of free
+space on corresponding heap pages, as it proceeds through the heap. This
+goes through fsm_set_avail(), so that the upper nodes on those pages are
+immediately updated. Periodically, VACUUM calls FreeSpaceMapVacuum[Range]
+to propagate the new free-space info into the upper pages of the FSM tree.
+
+TODO
+----
+
+- fastroot to avoid traversing upper nodes with just 1 child
+- use a different system for tables that fit into one FSM page, with a
+ mechanism to switch to the real thing as it grows.