11.9. Index-Only Scans and Covering Indexes

11.9. Index-Only Scans and Covering Indexes
Prev	Up	Chapter 11. Indexes	Home	Next

+ All indexes in PostgreSQL + are secondary indexes, meaning that each index is + stored separately from the table's main data area (which is called the + table's heap + in PostgreSQL terminology). This means that + in an ordinary index scan, each row retrieval requires fetching data from + both the index and the heap. Furthermore, while the index entries that + match a given indexable WHERE condition are usually + close together in the index, the table rows they reference might be + anywhere in the heap. The heap-access portion of an index scan thus + involves a lot of random access into the heap, which can be slow, + particularly on traditional rotating media. (As described in + Section 11.5, bitmap scans try to alleviate + this cost by doing the heap accesses in sorted order, but that only goes + so far.) +

+ To solve this performance problem, PostgreSQL + supports index-only scans, which can answer + queries from an index alone without any heap access. The basic idea is + to return values directly out of each index entry instead of consulting + the associated heap entry. There are two fundamental restrictions on + when this method can be used: + +

+ The index type must support index-only scans. B-tree indexes always + do. GiST and SP-GiST indexes support index-only scans for some + operator classes but not others. Other index types have no support. + The underlying requirement is that the index must physically store, or + else be able to reconstruct, the original data value for each index + entry. As a counterexample, GIN indexes cannot support index-only + scans because each index entry typically holds only part of the + original data value. +
+ The query must reference only columns stored in the index. For + example, given an index on columns x + and y of a table that also has a + column z, these queries could use index-only scans: +
```
+SELECT x, y FROM tab WHERE x = 'key';
+SELECT x FROM tab WHERE x = 'key' AND y < 42;
+
```
+ but these queries could not: +
```
+SELECT x, z FROM tab WHERE x = 'key';
+SELECT x FROM tab WHERE x = 'key' AND z < 42;
+
```
+ (Expression indexes and partial indexes complicate this rule, + as discussed below.) +

+ If these two fundamental requirements are met, then all the data values + required by the query are available from the index, so an index-only scan + is physically possible. But there is an additional requirement for any + table scan in PostgreSQL: it must verify that + each retrieved row be “visible” to the query's MVCC + snapshot, as discussed in Chapter 13. Visibility information + is not stored in index entries, only in heap entries; so at first glance + it would seem that every row retrieval would require a heap access + anyway. And this is indeed the case, if the table row has been modified + recently. However, for seldom-changing data there is a way around this + problem. PostgreSQL tracks, for each page in + a table's heap, whether all rows stored in that page are old enough to be + visible to all current and future transactions. This information is + stored in a bit in the table's visibility map. An + index-only scan, after finding a candidate index entry, checks the + visibility map bit for the corresponding heap page. If it's set, the row + is known visible and so the data can be returned with no further work. + If it's not set, the heap entry must be visited to find out whether it's + visible, so no performance advantage is gained over a standard index + scan. Even in the successful case, this approach trades visibility map + accesses for heap accesses; but since the visibility map is four orders + of magnitude smaller than the heap it describes, far less physical I/O is + needed to access it. In most situations the visibility map remains + cached in memory all the time. +

+ In short, while an index-only scan is possible given the two fundamental + requirements, it will be a win only if a significant fraction of the + table's heap pages have their all-visible map bits set. But tables in + which a large fraction of the rows are unchanging are common enough to + make this type of scan very useful in practice. +

+ + To make effective use of the index-only scan feature, you might choose to + create a covering index, which is an index + specifically designed to include the columns needed by a particular + type of query that you run frequently. Since queries typically need to + retrieve more columns than just the ones they search + on, PostgreSQL allows you to create an index + in which some columns are just “payload” and are not part + of the search key. This is done by adding an INCLUDE + clause listing the extra columns. For example, if you commonly run + queries like +

+SELECT y FROM tab WHERE x = 'key';
+

+ the traditional approach to speeding up such queries would be to create + an index on x only. However, an index defined as +

+CREATE INDEX tab_x_y ON tab(x) INCLUDE (y);
+

+ could handle these queries as index-only scans, + because y can be obtained from the index without + visiting the heap. +

+ Because column y is not part of the index's search + key, it does not have to be of a data type that the index can handle; + it's merely stored in the index and is not interpreted by the index + machinery. Also, if the index is a unique index, that is +

+CREATE UNIQUE INDEX tab_x_y ON tab(x) INCLUDE (y);
+

+ the uniqueness condition applies to just column x, + not to the combination of x and y. + (An INCLUDE clause can also be written + in UNIQUE and PRIMARY KEY + constraints, providing alternative syntax for setting up an index like + this.) +

+ It's wise to be conservative about adding non-key payload columns to an + index, especially wide columns. If an index tuple exceeds the + maximum size allowed for the index type, data insertion will fail. + In any case, non-key columns duplicate data from the index's table + and bloat the size of the index, thus potentially slowing searches. + And remember that there is little point in including payload columns in an + index unless the table changes slowly enough that an index-only scan is + likely to not need to access the heap. If the heap tuple must be visited + anyway, it costs nothing more to get the column's value from there. + Other restrictions are that expressions are not currently supported as + included columns, and that only B-tree, GiST and SP-GiST indexes currently + support included columns. +

+ Before PostgreSQL had + the INCLUDE feature, people sometimes made covering + indexes by writing the payload columns as ordinary index columns, + that is writing +

+CREATE INDEX tab_x_y ON tab(x, y);
+

+ even though they had no intention of ever using y as + part of a WHERE clause. This works fine as long as + the extra columns are trailing columns; making them be leading columns is + unwise for the reasons explained in Section 11.3. + However, this method doesn't support the case where you want the index to + enforce uniqueness on the key column(s). +

+ Suffix truncation always removes non-key + columns from upper B-Tree levels. As payload columns, they are + never used to guide index scans. The truncation process also + removes one or more trailing key column(s) when the remaining + prefix of key column(s) happens to be sufficient to describe tuples + on the lowest B-Tree level. In practice, covering indexes without + an INCLUDE clause often avoid storing columns + that are effectively payload in the upper levels. However, + explicitly defining payload columns as non-key columns + reliably keeps the tuples in upper levels + small. +

+ In principle, index-only scans can be used with expression indexes. + For example, given an index on f(x) + where x is a table column, it should be possible to + execute +

+SELECT f(x) FROM tab WHERE f(x) < 1;
+

+ as an index-only scan; and this is very attractive + if f() is an expensive-to-compute function. + However, PostgreSQL's planner is currently not + very smart about such cases. It considers a query to be potentially + executable by index-only scan only when all columns + needed by the query are available from the index. In this + example, x is not needed except in the + context f(x), but the planner does not notice that and + concludes that an index-only scan is not possible. If an index-only scan + seems sufficiently worthwhile, this can be worked around by + adding x as an included column, for example +

+CREATE INDEX tab_f_x ON tab (f(x)) INCLUDE (x);
+

+ An additional caveat, if the goal is to avoid + recalculating f(x), is that the planner won't + necessarily match uses of f(x) that aren't in + indexable WHERE clauses to the index column. It will + usually get this right in simple queries such as shown above, but not in + queries that involve joins. These deficiencies may be remedied in future + versions of PostgreSQL. +

+ Partial indexes also have interesting interactions with index-only scans. + Consider the partial index shown in Example 11.3: +

+CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
+    WHERE success;
+

+ In principle, we could do an index-only scan on this index to satisfy a + query like +

+SELECT target FROM tests WHERE subject = 'some-subject' AND success;
+

+ But there's a problem: the WHERE clause refers + to success which is not available as a result column + of the index. Nonetheless, an index-only scan is possible because the + plan does not need to recheck that part of the WHERE + clause at run time: all entries found in the index necessarily + have success = true so this need not be explicitly + checked in the plan. PostgreSQL versions 9.6 + and later will recognize such cases and allow index-only scans to be + generated, but older versions will not. +

Prev	Up	Next
11.8. Partial Indexes	Home	11.10. Operator Classes and Operator Families