From 5e45211a64149b3c659b90ff2de6fa982a5a93ed Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sat, 4 May 2024 14:17:33 +0200
Subject: Adding upstream version 15.5.

Signed-off-by: Daniel Baumann
---
 doc/src/sgml/html/btree-implementation.html | 254 ++++++++++++++++++++++++++++
 1 file changed, 254 insertions(+)
 create mode 100644 doc/src/sgml/html/btree-implementation.html

diff --git a/doc/src/sgml/html/btree-implementation.html b/doc/src/sgml/html/btree-implementation.html
new file mode 100644
index 0000000..d575eb1
--- /dev/null
+++ b/doc/src/sgml/html/btree-implementation.html
@@ -0,0 +1,254 @@

67.4. Implementation

This section covers B-Tree index implementation details that may be of use to
advanced users. See src/backend/access/nbtree/README in the source
distribution for a much more detailed, internals-focused description of the
B-Tree implementation.

67.4.1. B-Tree Structure

PostgreSQL B-Tree indexes are multi-level tree structures, where each level of
the tree can be used as a doubly-linked list of pages. A single metapage is
stored in a fixed position at the start of the first segment file of the
index. All other pages are either leaf pages or internal pages. Leaf pages are
the pages on the lowest level of the tree. All other levels consist of
internal pages. Each leaf page contains tuples that point to table rows. Each
internal page contains tuples that point to the next level down in the tree.
Typically, over 99% of all pages are leaf pages. Both internal pages and leaf
pages use the standard page format described in Section 73.6.
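This on-disk layout can be examined directly with the pageinspect extension.
The following is a minimal sketch, assuming pageinspect is installed and using
a hypothetical index named test_pkey; block 0 is always the metapage, and
bt_page_stats reports whether a given block is a leaf, internal, or root page.

  CREATE EXTENSION IF NOT EXISTS pageinspect;

  -- The metapage records the current root block and the height of the tree.
  SELECT root, level, fastroot, fastlevel FROM bt_metap('test_pkey');

  -- Per-page statistics: type is 'l' for leaf, 'i' for internal, 'r' for root.
  SELECT blkno, type, live_items, free_size FROM bt_page_stats('test_pkey', 1);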

New leaf pages are added to a B-Tree index when an existing leaf page cannot
fit an incoming tuple. A page split operation makes room for items that
originally belonged on the overflowing page by moving a portion of the items
to a new page. Page splits must also insert a new downlink to the new page in
the parent page, which may cause the parent to split in turn. Page splits
cascade upwards in a recursive fashion. When the root page finally cannot fit
a new downlink, a root page split operation takes place. This adds a new level
to the tree structure by creating a new root page that is one level above the
original root page.
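Root page splits can be observed indirectly through the metapage. The sketch
below (hypothetical table and data volumes, pageinspect assumed as above)
shows the reported tree level growing as inserts force cascading page splits;
the exact levels reached depend on tuple widths and fillfactor.

  CREATE TABLE split_demo (id integer PRIMARY KEY);

  INSERT INTO split_demo SELECT generate_series(1, 100);
  SELECT level FROM bt_metap('split_demo_pkey');   -- typically 0: the root is still a leaf

  INSERT INTO split_demo SELECT generate_series(101, 1000000);
  SELECT level FROM bt_metap('split_demo_pkey');   -- typically 2 after repeated splits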

67.4.2. Bottom-up Index Deletion

B-Tree indexes are not directly aware that under MVCC, there might be multiple
extant versions of the same logical table row; to an index, each tuple is an
independent object that needs its own index entry. Version churn tuples may
sometimes accumulate and adversely affect query latency and throughput. This
typically occurs with UPDATE-heavy workloads where most individual updates
cannot apply the HOT optimization. Changing the value of only one column
covered by one index during an UPDATE always necessitates a new set of index
tuples — one for each and every index on the table. Note in particular that
this includes indexes that were not logically modified by the UPDATE. All
indexes will need a successor physical index tuple that points to the latest
version in the table. Each new tuple within each index will generally need to
coexist with the original updated tuple for a short period of time (typically
until shortly after the UPDATE transaction commits).
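The difference between HOT and non-HOT updates can be observed through the
cumulative statistics views. A rough sketch, using a hypothetical accounts
table with an index only on id: updating an unindexed column is HOT-eligible
(when page space permits), whereas updating the indexed column forces new
index tuples.

  CREATE TABLE accounts (id integer PRIMARY KEY, balance numeric, note text);
  INSERT INTO accounts SELECT g, 0, '' FROM generate_series(1, 1000) g;

  UPDATE accounts SET balance = balance + 1;   -- no indexed column changes; HOT-eligible
  UPDATE accounts SET id = id + 1000000;       -- indexed column changes; never HOT

  -- Compare total updates with the subset that used HOT
  -- (statistics are reported with a short delay).
  SELECT n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables WHERE relname = 'accounts';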

B-Tree indexes incrementally delete version churn index tuples by performing
bottom-up index deletion passes. Each deletion pass is triggered in reaction
to an anticipated version churn page split. This only happens with indexes
that are not logically modified by UPDATE statements, where a concentrated
buildup of obsolete versions in particular pages would occur otherwise. A page
split will usually be avoided, though it's possible that certain
implementation-level heuristics will fail to identify and delete even one
garbage index tuple (in which case a page split or deduplication pass resolves
the issue of an incoming new tuple not fitting on a leaf page). The worst-case
number of versions that any index scan must traverse (for any single logical
row) is an important contributor to overall system responsiveness and
throughput. A bottom-up index deletion pass targets suspected garbage tuples
in a single leaf page based on qualitative distinctions involving logical rows
and versions. This contrasts with the top-down index cleanup performed by
autovacuum workers, which is triggered when certain quantitative table-level
thresholds are exceeded (see Section 25.1.6).

Note

Not all deletion operations that are performed within B-Tree indexes are
bottom-up deletion operations. There is a distinct category of index tuple
deletion: simple index tuple deletion. This is a deferred maintenance
operation that deletes index tuples that are known to be safe to delete (those
whose item identifier's LP_DEAD bit is already set). Like bottom-up index
deletion, simple index deletion takes place at the point that a page split is
anticipated as a way of avoiding the split.

Simple deletion is opportunistic in the sense that it can only take place when
recent index scans set the LP_DEAD bits of affected items in passing. Prior to
PostgreSQL 14, the only category of B-Tree deletion was simple deletion. The
main differences between it and bottom-up deletion are that only the former is
opportunistically driven by the activity of passing index scans, while only
the latter specifically targets version churn from UPDATEs that do not
logically modify indexed columns.

Bottom-up index deletion performs the vast majority of all garbage index tuple
cleanup for particular indexes with certain workloads. This is expected with
any B-Tree index that is subject to significant version churn from UPDATEs
that rarely or never logically modify the columns that the index covers. The
average and worst-case number of versions per logical row can be kept low
purely through targeted incremental deletion passes. It's quite possible that
the on-disk size of certain indexes will never increase by even a single
page/block despite constant version churn from UPDATEs. Even then, an
exhaustive clean sweep by a VACUUM operation (typically run in an autovacuum
worker process) will eventually be required as a part of collective cleanup of
the table and each of its indexes.
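One rough way to see this effect is simply to watch index sizes under a
churn-heavy workload. In the sketch below (hypothetical names), the UPDATEs
modify only status, so the primary key index is never logically modified;
bottom-up deletion can often hold it at a near-constant size even though every
non-HOT UPDATE adds new tuples to it.

  CREATE TABLE orders (id integer PRIMARY KEY, status text);
  CREATE INDEX ON orders (status);
  INSERT INTO orders SELECT g, 'new' FROM generate_series(1, 100000) g;

  -- Run many separate transactions like this, then compare sizes over time.
  UPDATE orders SET status = 'done' WHERE id % 100 = 0;

  SELECT pg_size_pretty(pg_relation_size('orders_pkey'))        AS pkey_size,
         pg_size_pretty(pg_relation_size('orders_status_idx'))  AS status_idx_size;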

Unlike VACUUM, bottom-up index deletion does not provide any strong guarantees
about how old the oldest garbage index tuple may be. No index can be permitted
to retain floating garbage index tuples that became dead prior to a
conservative cutoff point shared by the table and all of its indexes
collectively. This fundamental table-level invariant makes it safe to recycle
table TIDs. This is how it is possible for distinct logical rows to reuse the
same table TID over time (though this can never happen with two logical rows
whose lifetimes span the same VACUUM cycle).

67.4.3. Deduplication

A duplicate is a leaf page tuple (a tuple that points to a table row) where
all indexed key columns have values that match corresponding column values
from at least one other leaf page tuple in the same index. Duplicate tuples
are quite common in practice. B-Tree indexes can use a special,
space-efficient representation for duplicates when an optional technique is
enabled: deduplication.

Deduplication works by periodically merging groups of duplicate tuples
together, forming a single posting list tuple for each group. The column key
value(s) only appear once in this representation. This is followed by a sorted
array of TIDs that point to rows in the table. This significantly reduces the
storage size of indexes where each value (or each distinct combination of
column values) appears several times on average. The latency of queries can be
reduced significantly. Overall query throughput may increase significantly.
The overhead of routine index vacuuming may also be reduced significantly.
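The space savings are easy to measure. A sketch with a hypothetical
low-cardinality column, comparing an index built with deduplication (the
default) against one built with it disabled; exact numbers vary with the data
and PostgreSQL version.

  CREATE TABLE events (kind text);
  INSERT INTO events SELECT 'click' FROM generate_series(1, 500000);

  CREATE INDEX events_kind_dedup   ON events (kind);
  CREATE INDEX events_kind_nodedup ON events (kind) WITH (deduplicate_items = off);

  SELECT pg_size_pretty(pg_relation_size('events_kind_dedup'))   AS with_dedup,
         pg_size_pretty(pg_relation_size('events_kind_nodedup')) AS without_dedup;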

Note

B-Tree deduplication is just as effective with duplicates that contain a NULL
value, even though NULL values are never equal to each other according to the
= member of any B-Tree operator class. As far as any part of the
implementation that understands the on-disk B-Tree structure is concerned,
NULL is just another value from the domain of indexed values.

The deduplication process occurs lazily, when a new item is inserted that
cannot fit on an existing leaf page, though only when index tuple deletion
could not free sufficient space for the new item (typically deletion is
briefly considered and then skipped over). Unlike GIN posting list tuples,
B-Tree posting list tuples do not need to expand every time a new duplicate is
inserted; they are merely an alternative physical representation of the
original logical contents of the leaf page. This design prioritizes consistent
performance with mixed read-write workloads. Most client applications will at
least see a moderate performance benefit from using deduplication.
Deduplication is enabled by default.

CREATE INDEX and REINDEX apply deduplication to create posting list tuples,
though the strategy they use is slightly different. Each group of duplicate
ordinary tuples encountered in the sorted input taken from the table is merged
into a posting list tuple before being added to the current pending leaf page.
Individual posting list tuples are packed with as many TIDs as possible. Leaf
pages are written out in the usual way, without any separate deduplication
pass. This strategy is well-suited to CREATE INDEX and REINDEX because they
are one-off batch operations.
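Posting list tuples can be seen directly with pageinspect. A sketch, reusing
the hypothetical events_kind_dedup index from above: on a leaf page (block 1
here; block 0 is the metapage), the tids column is non-NULL for posting list
tuples and lists the heap TIDs merged under a single key value.

  SELECT itemoffset, itemlen, tids
  FROM bt_page_items('events_kind_dedup', 1)
  LIMIT 5;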

Write-heavy workloads that don't benefit from deduplication due to having few
or no duplicate values in indexes will incur a small, fixed performance
penalty (unless deduplication is explicitly disabled). The deduplicate_items
storage parameter can be used to disable deduplication within individual
indexes. There is never any performance penalty with read-only workloads,
since reading posting list tuples is at least as efficient as reading the
standard tuple representation. Disabling deduplication isn't usually helpful.
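For reference, a sketch of both ways to set the storage parameter (table and
index names are hypothetical); turning deduplicate_items off on an existing
index only prevents future deduplication passes, it does not rewrite posting
list tuples that already exist.

  -- Disable deduplication when creating an index.
  CREATE INDEX measurements_ts_idx ON measurements (ts) WITH (deduplicate_items = off);

  -- Disable (or re-enable) it on an existing index.
  ALTER INDEX measurements_ts_idx SET (deduplicate_items = off);
  ALTER INDEX measurements_ts_idx RESET (deduplicate_items);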

It is sometimes possible for unique indexes (as well as unique constraints) to
use deduplication. This allows leaf pages to temporarily absorb extra version
churn duplicates. Deduplication in unique indexes augments bottom-up index
deletion, especially in cases where a long-running transaction holds a
snapshot that blocks garbage collection. The goal is to buy time for the
bottom-up index deletion strategy to become effective again. Delaying page
splits until a single long-running transaction naturally goes away can allow a
bottom-up deletion pass to succeed where an earlier deletion pass failed.

Tip

A special heuristic is applied to determine whether a deduplication pass in a
unique index should take place. It can often skip straight to splitting a leaf
page, avoiding a performance penalty from wasting cycles on unhelpful
deduplication passes. If you're concerned about the overhead of deduplication,
consider setting deduplicate_items = off selectively. Leaving deduplication
enabled in unique indexes has little downside.

Deduplication cannot be used in all cases due to implementation-level
restrictions. Deduplication safety is determined when CREATE INDEX or REINDEX
is run.

Note that deduplication is deemed unsafe and cannot be used in the following
cases involving semantically significant differences among equal datums:


  • text, varchar, and char cannot use deduplication when a nondeterministic
    collation is used. Case and accent differences must be preserved among
    equal datums.

  • numeric cannot use deduplication. Numeric display scale must be preserved
    among equal datums.

  • jsonb cannot use deduplication, since the jsonb B-Tree operator class uses
    numeric internally.

  • float4 and float8 cannot use deduplication. These types have distinct
    representations for -0 and 0, which are nevertheless considered equal.
    This difference must be preserved.


There is one further implementation-level restriction that may be lifted in a
future version of PostgreSQL:

  • Container types (such as composite types, arrays, or range types) cannot
    use deduplication.

There is one further implementation-level restriction that applies regardless
of the operator class or collation used:

  • INCLUDE indexes can never use deduplication.
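Whether deduplication was judged safe for a particular index can be checked
with pageinspect (assumed installed, names hypothetical): the allequalimage
field of the metapage is true when the operator class and collation permit
deduplication, and false in the restricted cases listed above.

  SELECT allequalimage FROM bt_metap('events_kind_dedup');    -- true

  CREATE TABLE prices (amount numeric);
  CREATE INDEX prices_amount_idx ON prices (amount);
  SELECT allequalimage FROM bt_metap('prices_amount_idx');    -- false: numeric is unsafe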

\ No newline at end of file