diff options
Diffstat (limited to '')
-rw-r--r-- | doc/src/sgml/indices.sgml | 1600 |
1 files changed, 1600 insertions, 0 deletions
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml new file mode 100644 index 0000000..023157d --- /dev/null +++ b/doc/src/sgml/indices.sgml @@ -0,0 +1,1600 @@ +<!-- doc/src/sgml/indices.sgml --> + +<chapter id="indexes"> + <title>Indexes</title> + + <indexterm zone="indexes"> + <primary>index</primary> + </indexterm> + + <para> + Indexes are a common way to enhance database performance. An index + allows the database server to find and retrieve specific rows much + faster than it could do without an index. But indexes also add + overhead to the database system as a whole, so they should be used + sensibly. + </para> + + + <sect1 id="indexes-intro"> + <title>Introduction</title> + + <para> + Suppose we have a table similar to this: +<programlisting> +CREATE TABLE test1 ( + id integer, + content varchar +); +</programlisting> + and the application issues many queries of the form: +<programlisting> +SELECT content FROM test1 WHERE id = <replaceable>constant</replaceable>; +</programlisting> + With no advance preparation, the system would have to scan the entire + <structname>test1</structname> table, row by row, to find all + matching entries. If there are many rows in + <structname>test1</structname> and only a few rows (perhaps zero + or one) that would be returned by such a query, this is clearly an + inefficient method. But if the system has been instructed to maintain an + index on the <structfield>id</structfield> column, it can use a more + efficient method for locating matching rows. For instance, it + might only have to walk a few levels deep into a search tree. + </para> + + <para> + A similar approach is used in most non-fiction books: terms and + concepts that are frequently looked up by readers are collected in + an alphabetic index at the end of the book. The interested reader + can scan the index relatively quickly and flip to the appropriate + page(s), rather than having to read the entire book to find the + material of interest. Just as it is the task of the author to + anticipate the items that readers are likely to look up, + it is the task of the database programmer to foresee which indexes + will be useful. + </para> + + <para> + The following command can be used to create an index on the + <structfield>id</structfield> column, as discussed: +<programlisting> +CREATE INDEX test1_id_index ON test1 (id); +</programlisting> + The name <structname>test1_id_index</structname> can be chosen + freely, but you should pick something that enables you to remember + later what the index was for. + </para> + + <para> + To remove an index, use the <command>DROP INDEX</command> command. + Indexes can be added to and removed from tables at any time. + </para> + + <para> + Once an index is created, no further intervention is required: the + system will update the index when the table is modified, and it will + use the index in queries when it thinks doing so would be more efficient + than a sequential table scan. But you might have to run the + <command>ANALYZE</command> command regularly to update + statistics to allow the query planner to make educated decisions. + See <xref linkend="performance-tips"/> for information about + how to find out whether an index is used and when and why the + planner might choose <emphasis>not</emphasis> to use an index. + </para> + + <para> + Indexes can also benefit <command>UPDATE</command> and + <command>DELETE</command> commands with search conditions. + Indexes can moreover be used in join searches. Thus, + an index defined on a column that is part of a join condition can + also significantly speed up queries with joins. + </para> + + <para> + Creating an index on a large table can take a long time. By default, + <productname>PostgreSQL</productname> allows reads (<command>SELECT</command> statements) to occur + on the table in parallel with index creation, but writes (<command>INSERT</command>, + <command>UPDATE</command>, <command>DELETE</command>) are blocked until the index build is finished. + In production environments this is often unacceptable. + It is possible to allow writes to occur in parallel with index + creation, but there are several caveats to be aware of — + for more information see <xref linkend="sql-createindex-concurrently"/>. + </para> + + <para> + After an index is created, the system has to keep it synchronized with the + table. This adds overhead to data manipulation operations. + Therefore indexes that are seldom or never used in queries + should be removed. + </para> + </sect1> + + + <sect1 id="indexes-types"> + <title>Index Types</title> + + <para> + <productname>PostgreSQL</productname> provides several index types: + B-tree, Hash, GiST, SP-GiST, GIN and BRIN. + Each index type uses a different + algorithm that is best suited to different types of queries. + By default, the <link linkend="sql-createindex"><command>CREATE + INDEX</command></link> command creates + B-tree indexes, which fit the most common situations. + The other index types are selected by writing the keyword + <literal>USING</literal> followed by the index type name. + For example, to create a Hash index: +<programlisting> +CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING HASH (<replaceable>column</replaceable>); +</programlisting> + </para> + + <sect2 id="indexes-types-btree"> + <title>B-Tree</title> + + <indexterm> + <primary>index</primary> + <secondary>B-Tree</secondary> + </indexterm> + <indexterm> + <primary>B-Tree</primary> + <see>index</see> + </indexterm> + + <para> + B-trees can handle equality and range queries on data that can be sorted + into some ordering. + In particular, the <productname>PostgreSQL</productname> query planner + will consider using a B-tree index whenever an indexed column is + involved in a comparison using one of these operators: + +<synopsis> +< <= = >= > +</synopsis> + + Constructs equivalent to combinations of these operators, such as + <literal>BETWEEN</literal> and <literal>IN</literal>, can also be implemented with + a B-tree index search. Also, an <literal>IS NULL</literal> or <literal>IS NOT + NULL</literal> condition on an index column can be used with a B-tree index. + </para> + + <para> + The optimizer can also use a B-tree index for queries involving the + pattern matching operators <literal>LIKE</literal> and <literal>~</literal> + <emphasis>if</emphasis> the pattern is a constant and is anchored to + the beginning of the string — for example, <literal>col LIKE + 'foo%'</literal> or <literal>col ~ '^foo'</literal>, but not + <literal>col LIKE '%bar'</literal>. However, if your database does not + use the C locale you will need to create the index with a special + operator class to support indexing of pattern-matching queries; see + <xref linkend="indexes-opclass"/> below. It is also possible to use + B-tree indexes for <literal>ILIKE</literal> and + <literal>~*</literal>, but only if the pattern starts with + non-alphabetic characters, i.e., characters that are not affected by + upper/lower case conversion. + </para> + + <para> + B-tree indexes can also be used to retrieve data in sorted order. + This is not always faster than a simple scan and sort, but it is + often helpful. + </para> + </sect2> + + <sect2 id="indexes-types-hash"> + <title>Hash</title> + + <indexterm> + <primary>index</primary> + <secondary>hash</secondary> + </indexterm> + <indexterm> + <primary>hash</primary> + <see>index</see> + </indexterm> + + <para> + Hash indexes store a 32-bit hash code derived from the + value of the indexed column. Hence, + such indexes can only handle simple equality comparisons. + The query planner will consider using a hash index whenever an + indexed column is involved in a comparison using the + equal operator: + +<synopsis> += +</synopsis> + </para> + </sect2> + + <sect2 id="indexes-type-gist"> + <title>GiST</title> + + <indexterm> + <primary>index</primary> + <secondary>GiST</secondary> + </indexterm> + <indexterm> + <primary>GiST</primary> + <see>index</see> + </indexterm> + + <para> + GiST indexes are not a single kind of index, but rather an infrastructure + within which many different indexing strategies can be implemented. + Accordingly, the particular operators with which a GiST index can be + used vary depending on the indexing strategy (the <firstterm>operator + class</firstterm>). As an example, the standard distribution of + <productname>PostgreSQL</productname> includes GiST operator classes + for several two-dimensional geometric data types, which support indexed + queries using these operators: + +<synopsis> +<< &< &> >> <<| &<| |&> |>> @> <@ ~= && +</synopsis> + + (See <xref linkend="functions-geometry"/> for the meaning of + these operators.) + The GiST operator classes included in the standard distribution are + documented in <xref linkend="gist-builtin-opclasses-table"/>. + Many other GiST operator + classes are available in the <literal>contrib</literal> collection or as separate + projects. For more information see <xref linkend="gist"/>. + </para> + + <para> + GiST indexes are also capable of optimizing <quote>nearest-neighbor</quote> + searches, such as +<programlisting><![CDATA[ +SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10; +]]> +</programlisting> + which finds the ten places closest to a given target point. The ability + to do this is again dependent on the particular operator class being used. + In <xref linkend="gist-builtin-opclasses-table"/>, operators that can be + used in this way are listed in the column <quote>Ordering Operators</quote>. + </para> + </sect2> + + <sect2 id="indexes-type-spgist"> + <title>SP-GiST</title> + + <indexterm> + <primary>index</primary> + <secondary>SP-GiST</secondary> + </indexterm> + <indexterm> + <primary>SP-GiST</primary> + <see>index</see> + </indexterm> + + <para> + SP-GiST indexes, like GiST indexes, offer an infrastructure that supports + various kinds of searches. SP-GiST permits implementation of a wide range + of different non-balanced disk-based data structures, such as quadtrees, + k-d trees, and radix trees (tries). As an example, the standard distribution of + <productname>PostgreSQL</productname> includes SP-GiST operator classes + for two-dimensional points, which support indexed + queries using these operators: + +<synopsis> +<< >> ~= <@ <<| |>> +</synopsis> + + (See <xref linkend="functions-geometry"/> for the meaning of + these operators.) + The SP-GiST operator classes included in the standard distribution are + documented in <xref linkend="spgist-builtin-opclasses-table"/>. + For more information see <xref linkend="spgist"/>. + </para> + + <para> + Like GiST, SP-GiST supports <quote>nearest-neighbor</quote> searches. + For SP-GiST operator classes that support distance ordering, the + corresponding operator is listed in the <quote>Ordering Operators</quote> + column in <xref linkend="spgist-builtin-opclasses-table"/>. + </para> + </sect2> + + <sect2 id="indexes-types-gin"> + <title>GIN</title> + + <indexterm> + <primary>index</primary> + <secondary>GIN</secondary> + </indexterm> + <indexterm> + <primary>GIN</primary> + <see>index</see> + </indexterm> + + <para> + GIN indexes are <quote>inverted indexes</quote> which are appropriate for + data values that contain multiple component values, such as arrays. An + inverted index contains a separate entry for each component value, and + can efficiently handle queries that test for the presence of specific + component values. + </para> + + <para> + Like GiST and SP-GiST, GIN can support + many different user-defined indexing strategies, and the particular + operators with which a GIN index can be used vary depending on the + indexing strategy. + As an example, the standard distribution of + <productname>PostgreSQL</productname> includes a GIN operator class + for arrays, which supports indexed queries using these operators: + +<synopsis> +<@ @> = && +</synopsis> + + (See <xref linkend="functions-array"/> for the meaning of + these operators.) + The GIN operator classes included in the standard distribution are + documented in <xref linkend="gin-builtin-opclasses-table"/>. + Many other GIN operator + classes are available in the <literal>contrib</literal> collection or as separate + projects. For more information see <xref linkend="gin"/>. + </para> + </sect2> + + <sect2 id="indexes-types-brin"> + <title>BRIN</title> + + <indexterm> + <primary>index</primary> + <secondary>BRIN</secondary> + </indexterm> + <indexterm> + <primary>BRIN</primary> + <see>index</see> + </indexterm> + + <para> + BRIN indexes (a shorthand for Block Range INdexes) store summaries about + the values stored in consecutive physical block ranges of a table. + Thus, they are most effective for columns whose values are well-correlated + with the physical order of the table rows. + Like GiST, SP-GiST and GIN, + BRIN can support many different indexing strategies, + and the particular operators with which a BRIN index can be used + vary depending on the indexing strategy. + For data types that have a linear sort order, the indexed data + corresponds to the minimum and maximum values of the + values in the column for each block range. This supports indexed queries + using these operators: + +<synopsis> +< <= = >= > +</synopsis> + + The BRIN operator classes included in the standard distribution are + documented in <xref linkend="brin-builtin-opclasses-table"/>. + For more information see <xref linkend="brin"/>. + </para> + </sect2> + </sect1> + + + <sect1 id="indexes-multicolumn"> + <title>Multicolumn Indexes</title> + + <indexterm zone="indexes-multicolumn"> + <primary>index</primary> + <secondary>multicolumn</secondary> + </indexterm> + + <para> + An index can be defined on more than one column of a table. For example, if + you have a table of this form: +<programlisting> +CREATE TABLE test2 ( + major int, + minor int, + name varchar +); +</programlisting> + (say, you keep your <filename class="directory">/dev</filename> + directory in a database...) and you frequently issue queries like: +<programlisting> +SELECT name FROM test2 WHERE major = <replaceable>constant</replaceable> AND minor = <replaceable>constant</replaceable>; +</programlisting> + then it might be appropriate to define an index on the columns + <structfield>major</structfield> and + <structfield>minor</structfield> together, e.g.: +<programlisting> +CREATE INDEX test2_mm_idx ON test2 (major, minor); +</programlisting> + </para> + + <para> + Currently, only the B-tree, GiST, GIN, and BRIN index types support + multiple-key-column indexes. Whether there can be multiple key + columns is independent of whether <literal>INCLUDE</literal> columns + can be added to the index. Indexes can have up to 32 columns, + including <literal>INCLUDE</literal> columns. (This limit can be + altered when building <productname>PostgreSQL</productname>; see the + file <filename>pg_config_manual.h</filename>.) + </para> + + <para> + A multicolumn B-tree index can be used with query conditions that + involve any subset of the index's columns, but the index is most + efficient when there are constraints on the leading (leftmost) columns. + The exact rule is that equality constraints on leading columns, plus + any inequality constraints on the first column that does not have an + equality constraint, will be used to limit the portion of the index + that is scanned. Constraints on columns to the right of these columns + are checked in the index, so they save visits to the table proper, but + they do not reduce the portion of the index that has to be scanned. + For example, given an index on <literal>(a, b, c)</literal> and a + query condition <literal>WHERE a = 5 AND b >= 42 AND c < 77</literal>, + the index would have to be scanned from the first entry with + <literal>a</literal> = 5 and <literal>b</literal> = 42 up through the last entry with + <literal>a</literal> = 5. Index entries with <literal>c</literal> >= 77 would be + skipped, but they'd still have to be scanned through. + This index could in principle be used for queries that have constraints + on <literal>b</literal> and/or <literal>c</literal> with no constraint on <literal>a</literal> + — but the entire index would have to be scanned, so in most cases + the planner would prefer a sequential table scan over using the index. + </para> + + <para> + A multicolumn GiST index can be used with query conditions that + involve any subset of the index's columns. Conditions on additional + columns restrict the entries returned by the index, but the condition on + the first column is the most important one for determining how much of + the index needs to be scanned. A GiST index will be relatively + ineffective if its first column has only a few distinct values, even if + there are many distinct values in additional columns. + </para> + + <para> + A multicolumn GIN index can be used with query conditions that + involve any subset of the index's columns. Unlike B-tree or GiST, + index search effectiveness is the same regardless of which index column(s) + the query conditions use. + </para> + + <para> + A multicolumn BRIN index can be used with query conditions that + involve any subset of the index's columns. Like GIN and unlike B-tree or + GiST, index search effectiveness is the same regardless of which index + column(s) the query conditions use. The only reason to have multiple BRIN + indexes instead of one multicolumn BRIN index on a single table is to have + a different <literal>pages_per_range</literal> storage parameter. + </para> + + <para> + Of course, each column must be used with operators appropriate to the index + type; clauses that involve other operators will not be considered. + </para> + + <para> + Multicolumn indexes should be used sparingly. In most situations, + an index on a single column is sufficient and saves space and time. + Indexes with more than three columns are unlikely to be helpful + unless the usage of the table is extremely stylized. See also + <xref linkend="indexes-bitmap-scans"/> and + <xref linkend="indexes-index-only-scans"/> for some discussion of the + merits of different index configurations. + </para> + </sect1> + + + <sect1 id="indexes-ordering"> + <title>Indexes and <literal>ORDER BY</literal></title> + + <indexterm zone="indexes-ordering"> + <primary>index</primary> + <secondary>and <literal>ORDER BY</literal></secondary> + </indexterm> + + <para> + In addition to simply finding the rows to be returned by a query, + an index may be able to deliver them in a specific sorted order. + This allows a query's <literal>ORDER BY</literal> specification to be honored + without a separate sorting step. Of the index types currently + supported by <productname>PostgreSQL</productname>, only B-tree + can produce sorted output — the other index types return + matching rows in an unspecified, implementation-dependent order. + </para> + + <para> + The planner will consider satisfying an <literal>ORDER BY</literal> specification + either by scanning an available index that matches the specification, + or by scanning the table in physical order and doing an explicit + sort. For a query that requires scanning a large fraction of the + table, an explicit sort is likely to be faster than using an index + because it requires + less disk I/O due to following a sequential access pattern. Indexes are + more useful when only a few rows need be fetched. An important + special case is <literal>ORDER BY</literal> in combination with + <literal>LIMIT</literal> <replaceable>n</replaceable>: an explicit sort will have to process + all the data to identify the first <replaceable>n</replaceable> rows, but if there is + an index matching the <literal>ORDER BY</literal>, the first <replaceable>n</replaceable> + rows can be retrieved directly, without scanning the remainder at all. + </para> + + <para> + By default, B-tree indexes store their entries in ascending order + with nulls last (table TID is treated as a tiebreaker column among + otherwise equal entries). This means that a forward scan of an + index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal> + (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>). The + index can also be scanned backward, producing output satisfying + <literal>ORDER BY x DESC</literal> + (or more verbosely, <literal>ORDER BY x DESC NULLS FIRST</literal>, since + <literal>NULLS FIRST</literal> is the default for <literal>ORDER BY DESC</literal>). + </para> + + <para> + You can adjust the ordering of a B-tree index by including the + options <literal>ASC</literal>, <literal>DESC</literal>, <literal>NULLS FIRST</literal>, + and/or <literal>NULLS LAST</literal> when creating the index; for example: +<programlisting> +CREATE INDEX test2_info_nulls_low ON test2 (info NULLS FIRST); +CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST); +</programlisting> + An index stored in ascending order with nulls first can satisfy + either <literal>ORDER BY x ASC NULLS FIRST</literal> or + <literal>ORDER BY x DESC NULLS LAST</literal> depending on which direction + it is scanned in. + </para> + + <para> + You might wonder why bother providing all four options, when two + options together with the possibility of backward scan would cover + all the variants of <literal>ORDER BY</literal>. In single-column indexes + the options are indeed redundant, but in multicolumn indexes they can be + useful. Consider a two-column index on <literal>(x, y)</literal>: this can + satisfy <literal>ORDER BY x, y</literal> if we scan forward, or + <literal>ORDER BY x DESC, y DESC</literal> if we scan backward. + But it might be that the application frequently needs to use + <literal>ORDER BY x ASC, y DESC</literal>. There is no way to get that + ordering from a plain index, but it is possible if the index is defined + as <literal>(x ASC, y DESC)</literal> or <literal>(x DESC, y ASC)</literal>. + </para> + + <para> + Obviously, indexes with non-default sort orderings are a fairly + specialized feature, but sometimes they can produce tremendous + speedups for certain queries. Whether it's worth maintaining such an + index depends on how often you use queries that require a special + sort ordering. + </para> + </sect1> + + + <sect1 id="indexes-bitmap-scans"> + <title>Combining Multiple Indexes</title> + + <indexterm zone="indexes-bitmap-scans"> + <primary>index</primary> + <secondary>combining multiple indexes</secondary> + </indexterm> + + <indexterm zone="indexes-bitmap-scans"> + <primary>bitmap scan</primary> + </indexterm> + + <para> + A single index scan can only use query clauses that use the index's + columns with operators of its operator class and are joined with + <literal>AND</literal>. For example, given an index on <literal>(a, b)</literal> + a query condition like <literal>WHERE a = 5 AND b = 6</literal> could + use the index, but a query like <literal>WHERE a = 5 OR b = 6</literal> could not + directly use the index. + </para> + + <para> + Fortunately, + <productname>PostgreSQL</productname> has the ability to combine multiple indexes + (including multiple uses of the same index) to handle cases that cannot + be implemented by single index scans. The system can form <literal>AND</literal> + and <literal>OR</literal> conditions across several index scans. For example, + a query like <literal>WHERE x = 42 OR x = 47 OR x = 53 OR x = 99</literal> + could be broken down into four separate scans of an index on <literal>x</literal>, + each scan using one of the query clauses. The results of these scans are + then ORed together to produce the result. Another example is that if we + have separate indexes on <literal>x</literal> and <literal>y</literal>, one possible + implementation of a query like <literal>WHERE x = 5 AND y = 6</literal> is to + use each index with the appropriate query clause and then AND together + the index results to identify the result rows. + </para> + + <para> + To combine multiple indexes, the system scans each needed index and + prepares a <firstterm>bitmap</firstterm> in memory giving the locations of + table rows that are reported as matching that index's conditions. + The bitmaps are then ANDed and ORed together as needed by the query. + Finally, the actual table rows are visited and returned. The table rows + are visited in physical order, because that is how the bitmap is laid + out; this means that any ordering of the original indexes is lost, and + so a separate sort step will be needed if the query has an <literal>ORDER + BY</literal> clause. For this reason, and because each additional index scan + adds extra time, the planner will sometimes choose to use a simple index + scan even though additional indexes are available that could have been + used as well. + </para> + + <para> + In all but the simplest applications, there are various combinations of + indexes that might be useful, and the database developer must make + trade-offs to decide which indexes to provide. Sometimes multicolumn + indexes are best, but sometimes it's better to create separate indexes + and rely on the index-combination feature. For example, if your + workload includes a mix of queries that sometimes involve only column + <literal>x</literal>, sometimes only column <literal>y</literal>, and sometimes both + columns, you might choose to create two separate indexes on + <literal>x</literal> and <literal>y</literal>, relying on index combination to + process the queries that use both columns. You could also create a + multicolumn index on <literal>(x, y)</literal>. This index would typically be + more efficient than index combination for queries involving both + columns, but as discussed in <xref linkend="indexes-multicolumn"/>, it + would be almost useless for queries involving only <literal>y</literal>, so it + should not be the only index. A combination of the multicolumn index + and a separate index on <literal>y</literal> would serve reasonably well. For + queries involving only <literal>x</literal>, the multicolumn index could be + used, though it would be larger and hence slower than an index on + <literal>x</literal> alone. The last alternative is to create all three + indexes, but this is probably only reasonable if the table is searched + much more often than it is updated and all three types of query are + common. If one of the types of query is much less common than the + others, you'd probably settle for creating just the two indexes that + best match the common types. + </para> + + </sect1> + + + <sect1 id="indexes-unique"> + <title>Unique Indexes</title> + + <indexterm zone="indexes-unique"> + <primary>index</primary> + <secondary>unique</secondary> + </indexterm> + + <para> + Indexes can also be used to enforce uniqueness of a column's value, + or the uniqueness of the combined values of more than one column. +<synopsis> +CREATE UNIQUE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (<replaceable>column</replaceable> <optional>, ...</optional>); +</synopsis> + Currently, only B-tree indexes can be declared unique. + </para> + + <para> + When an index is declared unique, multiple table rows with equal + indexed values are not allowed. Null values are not considered + equal. A multicolumn unique index will only reject cases where all + indexed columns are equal in multiple rows. + </para> + + <para> + <productname>PostgreSQL</productname> automatically creates a unique + index when a unique constraint or primary key is defined for a table. + The index covers the columns that make up the primary key or unique + constraint (a multicolumn index, if appropriate), and is the mechanism + that enforces the constraint. + </para> + + <note> + <para> + There's no need to manually + create indexes on unique columns; doing so would just duplicate + the automatically-created index. + </para> + </note> + </sect1> + + + <sect1 id="indexes-expressional"> + <title>Indexes on Expressions</title> + + <indexterm zone="indexes-expressional"> + <primary>index</primary> + <secondary sortas="expressions">on expressions</secondary> + </indexterm> + + <para> + An index column need not be just a column of the underlying table, + but can be a function or scalar expression computed from one or + more columns of the table. This feature is useful to obtain fast + access to tables based on the results of computations. + </para> + + <para> + For example, a common way to do case-insensitive comparisons is to + use the <function>lower</function> function: +<programlisting> +SELECT * FROM test1 WHERE lower(col1) = 'value'; +</programlisting> + This query can use an index if one has been + defined on the result of the <literal>lower(col1)</literal> + function: +<programlisting> +CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1)); +</programlisting> + </para> + + <para> + If we were to declare this index <literal>UNIQUE</literal>, it would prevent + creation of rows whose <literal>col1</literal> values differ only in case, + as well as rows whose <literal>col1</literal> values are actually identical. + Thus, indexes on expressions can be used to enforce constraints that + are not definable as simple unique constraints. + </para> + + <para> + As another example, if one often does queries like: +<programlisting> +SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Smith'; +</programlisting> + then it might be worth creating an index like this: +<programlisting> +CREATE INDEX people_names ON people ((first_name || ' ' || last_name)); +</programlisting> + </para> + + <para> + The syntax of the <command>CREATE INDEX</command> command normally requires + writing parentheses around index expressions, as shown in the second + example. The parentheses can be omitted when the expression is just + a function call, as in the first example. + </para> + + <para> + Index expressions are relatively expensive to maintain, because the + derived expression(s) must be computed for each row insertion + and non-HOT update. However, the index expressions are + <emphasis>not</emphasis> recomputed during an indexed search, since they are + already stored in the index. In both examples above, the system + sees the query as just <literal>WHERE indexedcolumn = 'constant'</literal> + and so the speed of the search is equivalent to any other simple index + query. Thus, indexes on expressions are useful when retrieval speed + is more important than insertion and update speed. + </para> + </sect1> + + + <sect1 id="indexes-partial"> + <title>Partial Indexes</title> + + <indexterm zone="indexes-partial"> + <primary>index</primary> + <secondary>partial</secondary> + </indexterm> + + <para> + A <firstterm>partial index</firstterm> is an index built over a + subset of a table; the subset is defined by a conditional + expression (called the <firstterm>predicate</firstterm> of the + partial index). The index contains entries only for those table + rows that satisfy the predicate. Partial indexes are a specialized + feature, but there are several situations in which they are useful. + </para> + + <para> + One major reason for using a partial index is to avoid indexing common + values. Since a query searching for a common value (one that + accounts for more than a few percent of all the table rows) will not + use the index anyway, there is no point in keeping those rows in the + index at all. This reduces the size of the index, which will speed + up those queries that do use the index. It will also speed up many table + update operations because the index does not need to be + updated in all cases. <xref linkend="indexes-partial-ex1"/> shows a + possible application of this idea. + </para> + + <example id="indexes-partial-ex1"> + <title>Setting up a Partial Index to Exclude Common Values</title> + + <para> + Suppose you are storing web server access logs in a database. + Most accesses originate from the IP address range of your organization but + some are from elsewhere (say, employees on dial-up connections). + If your searches by IP are primarily for outside accesses, + you probably do not need to index the IP range that corresponds to your + organization's subnet. + </para> + + <para> + Assume a table like this: +<programlisting> +CREATE TABLE access_log ( + url varchar, + client_ip inet, + ... +); +</programlisting> + </para> + + <para> + To create a partial index that suits our example, use a command + such as this: +<programlisting> +CREATE INDEX access_log_client_ip_ix ON access_log (client_ip) +WHERE NOT (client_ip > inet '192.168.100.0' AND + client_ip < inet '192.168.100.255'); +</programlisting> + </para> + + <para> + A typical query that can use this index would be: +<programlisting> +SELECT * +FROM access_log +WHERE url = '/index.html' AND client_ip = inet '212.78.10.32'; +</programlisting> + Here the query's IP address is covered by the partial index. The + following query cannot use the partial index, as it uses an IP address + that is excluded from the index: +<programlisting> +SELECT * +FROM access_log +WHERE url = '/index.html' AND client_ip = inet '192.168.100.23'; +</programlisting> + </para> + + <para> + Observe that this kind of partial index requires that the common + values be predetermined, so such partial indexes are best used for + data distributions that do not change. Such indexes can be recreated + occasionally to adjust for new data distributions, but this adds + maintenance effort. + </para> + </example> + + <para> + Another possible use for a partial index is to exclude values from the + index that the + typical query workload is not interested in; this is shown in <xref + linkend="indexes-partial-ex2"/>. This results in the same + advantages as listed above, but it prevents the + <quote>uninteresting</quote> values from being accessed via that + index, even if an index scan might be profitable in that + case. Obviously, setting up partial indexes for this kind of + scenario will require a lot of care and experimentation. + </para> + + <example id="indexes-partial-ex2"> + <title>Setting up a Partial Index to Exclude Uninteresting Values</title> + + <para> + If you have a table that contains both billed and unbilled orders, + where the unbilled orders take up a small fraction of the total + table and yet those are the most-accessed rows, you can improve + performance by creating an index on just the unbilled rows. The + command to create the index would look like this: +<programlisting> +CREATE INDEX orders_unbilled_index ON orders (order_nr) + WHERE billed is not true; +</programlisting> + </para> + + <para> + A possible query to use this index would be: +<programlisting> +SELECT * FROM orders WHERE billed is not true AND order_nr < 10000; +</programlisting> + However, the index can also be used in queries that do not involve + <structfield>order_nr</structfield> at all, e.g.: +<programlisting> +SELECT * FROM orders WHERE billed is not true AND amount > 5000.00; +</programlisting> + This is not as efficient as a partial index on the + <structfield>amount</structfield> column would be, since the system has to + scan the entire index. Yet, if there are relatively few unbilled + orders, using this partial index just to find the unbilled orders + could be a win. + </para> + + <para> + Note that this query cannot use this index: +<programlisting> +SELECT * FROM orders WHERE order_nr = 3501; +</programlisting> + The order 3501 might be among the billed or unbilled + orders. + </para> + </example> + + <para> + <xref linkend="indexes-partial-ex2"/> also illustrates that the + indexed column and the column used in the predicate do not need to + match. <productname>PostgreSQL</productname> supports partial + indexes with arbitrary predicates, so long as only columns of the + table being indexed are involved. However, keep in mind that the + predicate must match the conditions used in the queries that + are supposed to benefit from the index. To be precise, a partial + index can be used in a query only if the system can recognize that + the <literal>WHERE</literal> condition of the query mathematically implies + the predicate of the index. + <productname>PostgreSQL</productname> does not have a sophisticated + theorem prover that can recognize mathematically equivalent + expressions that are written in different forms. (Not + only is such a general theorem prover extremely difficult to + create, it would probably be too slow to be of any real use.) + The system can recognize simple inequality implications, for example + <quote>x < 1</quote> implies <quote>x < 2</quote>; otherwise + the predicate condition must exactly match part of the query's + <literal>WHERE</literal> condition + or the index will not be recognized as usable. Matching takes + place at query planning time, not at run time. As a result, + parameterized query clauses do not work with a partial index. For + example a prepared query with a parameter might specify + <quote>x < ?</quote> which will never imply + <quote>x < 2</quote> for all possible values of the parameter. + </para> + + <para> + A third possible use for partial indexes does not require the + index to be used in queries at all. The idea here is to create + a unique index over a subset of a table, as in <xref + linkend="indexes-partial-ex3"/>. This enforces uniqueness + among the rows that satisfy the index predicate, without constraining + those that do not. + </para> + + <example id="indexes-partial-ex3"> + <title>Setting up a Partial Unique Index</title> + + <para> + Suppose that we have a table describing test outcomes. We wish + to ensure that there is only one <quote>successful</quote> entry for + a given subject and target combination, but there might be any number of + <quote>unsuccessful</quote> entries. Here is one way to do it: +<programlisting> +CREATE TABLE tests ( + subject text, + target text, + success boolean, + ... +); + +CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target) + WHERE success; +</programlisting> + This is a particularly efficient approach when there are few + successful tests and many unsuccessful ones. It is also possible to + allow only one null in a column by creating a unique partial index + with an <literal>IS NULL</literal> restriction. + </para> + + </example> + + <para> + Finally, a partial index can also be used to override the system's + query plan choices. Also, data sets with peculiar + distributions might cause the system to use an index when it really + should not. In that case the index can be set up so that it is not + available for the offending query. Normally, + <productname>PostgreSQL</productname> makes reasonable choices about index + usage (e.g., it avoids them when retrieving common values, so the + earlier example really only saves index size, it is not required to + avoid index usage), and grossly incorrect plan choices are cause + for a bug report. + </para> + + <para> + Keep in mind that setting up a partial index indicates that you + know at least as much as the query planner knows, in particular you + know when an index might be profitable. Forming this knowledge + requires experience and understanding of how indexes in + <productname>PostgreSQL</productname> work. In most cases, the + advantage of a partial index over a regular index will be minimal. + There are cases where they are quite counterproductive, as in <xref + linkend="indexes-partial-ex4"/>. + </para> + + <example id="indexes-partial-ex4"> + <title>Do Not Use Partial Indexes as a Substitute for Partitioning</title> + + <para> + You might be tempted to create a large set of non-overlapping partial + indexes, for example + +<programlisting> +CREATE INDEX mytable_cat_1 ON mytable (data) WHERE category = 1; +CREATE INDEX mytable_cat_2 ON mytable (data) WHERE category = 2; +CREATE INDEX mytable_cat_3 ON mytable (data) WHERE category = 3; +... +CREATE INDEX mytable_cat_<replaceable>N</replaceable> ON mytable (data) WHERE category = <replaceable>N</replaceable>; +</programlisting> + + This is a bad idea! Almost certainly, you'll be better off with a + single non-partial index, declared like + +<programlisting> +CREATE INDEX mytable_cat_data ON mytable (category, data); +</programlisting> + + (Put the category column first, for the reasons described in + <xref linkend="indexes-multicolumn"/>.) While a search in this larger + index might have to descend through a couple more tree levels than a + search in a smaller index, that's almost certainly going to be cheaper + than the planner effort needed to select the appropriate one of the + partial indexes. The core of the problem is that the system does not + understand the relationship among the partial indexes, and will + laboriously test each one to see if it's applicable to the current + query. + </para> + + <para> + If your table is large enough that a single index really is a bad idea, + you should look into using partitioning instead (see + <xref linkend="ddl-partitioning"/>). With that mechanism, the system + does understand that the tables and indexes are non-overlapping, so + far better performance is possible. + </para> + </example> + + <para> + More information about partial indexes can be found in <xref + linkend="ston89b"/>, <xref linkend="olson93"/>, and <xref + linkend="seshadri95"/>. + </para> + </sect1> + + + <sect1 id="indexes-index-only-scans"> + <title>Index-Only Scans and Covering Indexes</title> + + <indexterm zone="indexes-index-only-scans"> + <primary>index</primary> + <secondary>index-only scans</secondary> + </indexterm> + <indexterm zone="indexes-index-only-scans"> + <primary>index-only scan</primary> + </indexterm> + <indexterm zone="indexes-index-only-scans"> + <primary>index</primary> + <secondary>covering</secondary> + </indexterm> + <indexterm zone="indexes-index-only-scans"> + <primary>covering index</primary> + </indexterm> + + <para> + All indexes in <productname>PostgreSQL</productname> + are <firstterm>secondary</firstterm> indexes, meaning that each index is + stored separately from the table's main data area (which is called the + table's <firstterm>heap</firstterm> + in <productname>PostgreSQL</productname> terminology). This means that + in an ordinary index scan, each row retrieval requires fetching data from + both the index and the heap. Furthermore, while the index entries that + match a given indexable <literal>WHERE</literal> condition are usually + close together in the index, the table rows they reference might be + anywhere in the heap. The heap-access portion of an index scan thus + involves a lot of random access into the heap, which can be slow, + particularly on traditional rotating media. (As described in + <xref linkend="indexes-bitmap-scans"/>, bitmap scans try to alleviate + this cost by doing the heap accesses in sorted order, but that only goes + so far.) + </para> + + <para> + To solve this performance problem, <productname>PostgreSQL</productname> + supports <firstterm>index-only scans</firstterm>, which can answer + queries from an index alone without any heap access. The basic idea is + to return values directly out of each index entry instead of consulting + the associated heap entry. There are two fundamental restrictions on + when this method can be used: + + <orderedlist> + <listitem> + <para> + The index type must support index-only scans. B-tree indexes always + do. GiST and SP-GiST indexes support index-only scans for some + operator classes but not others. Other index types have no support. + The underlying requirement is that the index must physically store, or + else be able to reconstruct, the original data value for each index + entry. As a counterexample, GIN indexes cannot support index-only + scans because each index entry typically holds only part of the + original data value. + </para> + </listitem> + + <listitem> + <para> + The query must reference only columns stored in the index. For + example, given an index on columns <literal>x</literal> + and <literal>y</literal> of a table that also has a + column <literal>z</literal>, these queries could use index-only scans: +<programlisting> +SELECT x, y FROM tab WHERE x = 'key'; +SELECT x FROM tab WHERE x = 'key' AND y < 42; +</programlisting> + but these queries could not: +<programlisting> +SELECT x, z FROM tab WHERE x = 'key'; +SELECT x FROM tab WHERE x = 'key' AND z < 42; +</programlisting> + (Expression indexes and partial indexes complicate this rule, + as discussed below.) + </para> + </listitem> + </orderedlist> + </para> + + <para> + If these two fundamental requirements are met, then all the data values + required by the query are available from the index, so an index-only scan + is physically possible. But there is an additional requirement for any + table scan in <productname>PostgreSQL</productname>: it must verify that + each retrieved row be <quote>visible</quote> to the query's MVCC + snapshot, as discussed in <xref linkend="mvcc"/>. Visibility information + is not stored in index entries, only in heap entries; so at first glance + it would seem that every row retrieval would require a heap access + anyway. And this is indeed the case, if the table row has been modified + recently. However, for seldom-changing data there is a way around this + problem. <productname>PostgreSQL</productname> tracks, for each page in + a table's heap, whether all rows stored in that page are old enough to be + visible to all current and future transactions. This information is + stored in a bit in the table's <firstterm>visibility map</firstterm>. An + index-only scan, after finding a candidate index entry, checks the + visibility map bit for the corresponding heap page. If it's set, the row + is known visible and so the data can be returned with no further work. + If it's not set, the heap entry must be visited to find out whether it's + visible, so no performance advantage is gained over a standard index + scan. Even in the successful case, this approach trades visibility map + accesses for heap accesses; but since the visibility map is four orders + of magnitude smaller than the heap it describes, far less physical I/O is + needed to access it. In most situations the visibility map remains + cached in memory all the time. + </para> + + <para> + In short, while an index-only scan is possible given the two fundamental + requirements, it will be a win only if a significant fraction of the + table's heap pages have their all-visible map bits set. But tables in + which a large fraction of the rows are unchanging are common enough to + make this type of scan very useful in practice. + </para> + + <para> + <indexterm> + <primary><literal>INCLUDE</literal></primary> + <secondary>in index definitions</secondary> + </indexterm> + To make effective use of the index-only scan feature, you might choose to + create a <firstterm>covering index</firstterm>, which is an index + specifically designed to include the columns needed by a particular + type of query that you run frequently. Since queries typically need to + retrieve more columns than just the ones they search + on, <productname>PostgreSQL</productname> allows you to create an index + in which some columns are just <quote>payload</quote> and are not part + of the search key. This is done by adding an <literal>INCLUDE</literal> + clause listing the extra columns. For example, if you commonly run + queries like +<programlisting> +SELECT y FROM tab WHERE x = 'key'; +</programlisting> + the traditional approach to speeding up such queries would be to create + an index on <literal>x</literal> only. However, an index defined as +<programlisting> +CREATE INDEX tab_x_y ON tab(x) INCLUDE (y); +</programlisting> + could handle these queries as index-only scans, + because <literal>y</literal> can be obtained from the index without + visiting the heap. + </para> + + <para> + Because column <literal>y</literal> is not part of the index's search + key, it does not have to be of a data type that the index can handle; + it's merely stored in the index and is not interpreted by the index + machinery. Also, if the index is a unique index, that is +<programlisting> +CREATE UNIQUE INDEX tab_x_y ON tab(x) INCLUDE (y); +</programlisting> + the uniqueness condition applies to just column <literal>x</literal>, + not to the combination of <literal>x</literal> and <literal>y</literal>. + (An <literal>INCLUDE</literal> clause can also be written + in <literal>UNIQUE</literal> and <literal>PRIMARY KEY</literal> + constraints, providing alternative syntax for setting up an index like + this.) + </para> + + <para> + It's wise to be conservative about adding non-key payload columns to an + index, especially wide columns. If an index tuple exceeds the + maximum size allowed for the index type, data insertion will fail. + In any case, non-key columns duplicate data from the index's table + and bloat the size of the index, thus potentially slowing searches. + And remember that there is little point in including payload columns in an + index unless the table changes slowly enough that an index-only scan is + likely to not need to access the heap. If the heap tuple must be visited + anyway, it costs nothing more to get the column's value from there. + Other restrictions are that expressions are not currently supported as + included columns, and that only B-tree, GiST and SP-GiST indexes currently + support included columns. + </para> + + <para> + Before <productname>PostgreSQL</productname> had + the <literal>INCLUDE</literal> feature, people sometimes made covering + indexes by writing the payload columns as ordinary index columns, + that is writing +<programlisting> +CREATE INDEX tab_x_y ON tab(x, y); +</programlisting> + even though they had no intention of ever using <literal>y</literal> as + part of a <literal>WHERE</literal> clause. This works fine as long as + the extra columns are trailing columns; making them be leading columns is + unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>. + However, this method doesn't support the case where you want the index to + enforce uniqueness on the key column(s). + </para> + + <para> + <firstterm>Suffix truncation</firstterm> always removes non-key + columns from upper B-Tree levels. As payload columns, they are + never used to guide index scans. The truncation process also + removes one or more trailing key column(s) when the remaining + prefix of key column(s) happens to be sufficient to describe tuples + on the lowest B-Tree level. In practice, covering indexes without + an <literal>INCLUDE</literal> clause often avoid storing columns + that are effectively payload in the upper levels. However, + explicitly defining payload columns as non-key columns + <emphasis>reliably</emphasis> keeps the tuples in upper levels + small. + </para> + + <para> + In principle, index-only scans can be used with expression indexes. + For example, given an index on <literal>f(x)</literal> + where <literal>x</literal> is a table column, it should be possible to + execute +<programlisting> +SELECT f(x) FROM tab WHERE f(x) < 1; +</programlisting> + as an index-only scan; and this is very attractive + if <literal>f()</literal> is an expensive-to-compute function. + However, <productname>PostgreSQL</productname>'s planner is currently not + very smart about such cases. It considers a query to be potentially + executable by index-only scan only when all <emphasis>columns</emphasis> + needed by the query are available from the index. In this + example, <literal>x</literal> is not needed except in the + context <literal>f(x)</literal>, but the planner does not notice that and + concludes that an index-only scan is not possible. If an index-only scan + seems sufficiently worthwhile, this can be worked around by + adding <literal>x</literal> as an included column, for example +<programlisting> +CREATE INDEX tab_f_x ON tab (f(x)) INCLUDE (x); +</programlisting> + An additional caveat, if the goal is to avoid + recalculating <literal>f(x)</literal>, is that the planner won't + necessarily match uses of <literal>f(x)</literal> that aren't in + indexable <literal>WHERE</literal> clauses to the index column. It will + usually get this right in simple queries such as shown above, but not in + queries that involve joins. These deficiencies may be remedied in future + versions of <productname>PostgreSQL</productname>. + </para> + + <para> + Partial indexes also have interesting interactions with index-only scans. + Consider the partial index shown in <xref linkend="indexes-partial-ex3"/>: +<programlisting> +CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target) + WHERE success; +</programlisting> + In principle, we could do an index-only scan on this index to satisfy a + query like +<programlisting> +SELECT target FROM tests WHERE subject = 'some-subject' AND success; +</programlisting> + But there's a problem: the <literal>WHERE</literal> clause refers + to <literal>success</literal> which is not available as a result column + of the index. Nonetheless, an index-only scan is possible because the + plan does not need to recheck that part of the <literal>WHERE</literal> + clause at run time: all entries found in the index necessarily + have <literal>success = true</literal> so this need not be explicitly + checked in the plan. <productname>PostgreSQL</productname> versions 9.6 + and later will recognize such cases and allow index-only scans to be + generated, but older versions will not. + </para> + </sect1> + + + <sect1 id="indexes-opclass"> + <title>Operator Classes and Operator Families</title> + + <indexterm zone="indexes-opclass"> + <primary>operator class</primary> + </indexterm> + + <indexterm zone="indexes-opclass"> + <primary>operator family</primary> + </indexterm> + + <para> + An index definition can specify an <firstterm>operator + class</firstterm> for each column of an index. +<synopsis> +CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (<replaceable>column</replaceable> <replaceable>opclass</replaceable> [ ( <replaceable>opclass_options</replaceable> ) ] <optional><replaceable>sort options</replaceable></optional> <optional>, ...</optional>); +</synopsis> + The operator class identifies the operators to be used by the index + for that column. For example, a B-tree index on the type <type>int4</type> + would use the <literal>int4_ops</literal> class; this operator + class includes comparison functions for values of type <type>int4</type>. + In practice the default operator class for the column's data type is + usually sufficient. The main reason for having operator classes is + that for some data types, there could be more than one meaningful + index behavior. For example, we might want to sort a complex-number data + type either by absolute value or by real part. We could do this by + defining two operator classes for the data type and then selecting + the proper class when making an index. The operator class determines + the basic sort ordering (which can then be modified by adding sort options + <literal>COLLATE</literal>, + <literal>ASC</literal>/<literal>DESC</literal> and/or + <literal>NULLS FIRST</literal>/<literal>NULLS LAST</literal>). + </para> + + <para> + There are also some built-in operator classes besides the default ones: + + <itemizedlist> + <listitem> + <para> + The operator classes <literal>text_pattern_ops</literal>, + <literal>varchar_pattern_ops</literal>, and + <literal>bpchar_pattern_ops</literal> support B-tree indexes on + the types <type>text</type>, <type>varchar</type>, and + <type>char</type> respectively. The + difference from the default operator classes is that the values + are compared strictly character by character rather than + according to the locale-specific collation rules. This makes + these operator classes suitable for use by queries involving + pattern matching expressions (<literal>LIKE</literal> or POSIX + regular expressions) when the database does not use the standard + <quote>C</quote> locale. As an example, you might index a + <type>varchar</type> column like this: +<programlisting> +CREATE INDEX test_index ON test_table (col varchar_pattern_ops); +</programlisting> + Note that you should also create an index with the default operator + class if you want queries involving ordinary <literal><</literal>, + <literal><=</literal>, <literal>></literal>, or <literal>>=</literal> comparisons + to use an index. Such queries cannot use the + <literal><replaceable>xxx</replaceable>_pattern_ops</literal> + operator classes. (Ordinary equality comparisons can use these + operator classes, however.) It is possible to create multiple + indexes on the same column with different operator classes. + If you do use the C locale, you do not need the + <literal><replaceable>xxx</replaceable>_pattern_ops</literal> + operator classes, because an index with the default operator class + is usable for pattern-matching queries in the C locale. + </para> + </listitem> + </itemizedlist> + </para> + + <para> + The following query shows all defined operator classes: + +<programlisting> +SELECT am.amname AS index_method, + opc.opcname AS opclass_name, + opc.opcintype::regtype AS indexed_type, + opc.opcdefault AS is_default + FROM pg_am am, pg_opclass opc + WHERE opc.opcmethod = am.oid + ORDER BY index_method, opclass_name; +</programlisting> + </para> + + <para> + An operator class is actually just a subset of a larger structure called an + <firstterm>operator family</firstterm>. In cases where several data types have + similar behaviors, it is frequently useful to define cross-data-type + operators and allow these to work with indexes. To do this, the operator + classes for each of the types must be grouped into the same operator + family. The cross-type operators are members of the family, but are not + associated with any single class within the family. + </para> + + <para> + This expanded version of the previous query shows the operator family + each operator class belongs to: +<programlisting> +SELECT am.amname AS index_method, + opc.opcname AS opclass_name, + opf.opfname AS opfamily_name, + opc.opcintype::regtype AS indexed_type, + opc.opcdefault AS is_default + FROM pg_am am, pg_opclass opc, pg_opfamily opf + WHERE opc.opcmethod = am.oid AND + opc.opcfamily = opf.oid + ORDER BY index_method, opclass_name; +</programlisting> + </para> + + <para> + This query shows all defined operator families and all + the operators included in each family: +<programlisting> +SELECT am.amname AS index_method, + opf.opfname AS opfamily_name, + amop.amopopr::regoperator AS opfamily_operator + FROM pg_am am, pg_opfamily opf, pg_amop amop + WHERE opf.opfmethod = am.oid AND + amop.amopfamily = opf.oid + ORDER BY index_method, opfamily_name, opfamily_operator; +</programlisting> + </para> + + <tip> + <para> + <xref linkend="app-psql"/> has + commands <command>\dAc</command>, <command>\dAf</command>, + and <command>\dAo</command>, which provide slightly more sophisticated + versions of these queries. + </para> + </tip> + </sect1> + + + <sect1 id="indexes-collations"> + <title>Indexes and Collations</title> + + <para> + An index can support only one collation per index column. + If multiple collations are of interest, multiple indexes may be needed. + </para> + + <para> + Consider these statements: +<programlisting> +CREATE TABLE test1c ( + id integer, + content varchar COLLATE "x" +); + +CREATE INDEX test1c_content_index ON test1c (content); +</programlisting> + The index automatically uses the collation of the + underlying column. So a query of the form +<programlisting> +SELECT * FROM test1c WHERE content > <replaceable>constant</replaceable>; +</programlisting> + could use the index, because the comparison will by default use the + collation of the column. However, this index cannot accelerate queries + that involve some other collation. So if queries of the form, say, +<programlisting> +SELECT * FROM test1c WHERE content > <replaceable>constant</replaceable> COLLATE "y"; +</programlisting> + are also of interest, an additional index could be created that supports + the <literal>"y"</literal> collation, like this: +<programlisting> +CREATE INDEX test1c_content_y_index ON test1c (content COLLATE "y"); +</programlisting> + </para> + </sect1> + + + <sect1 id="indexes-examine"> + <title>Examining Index Usage</title> + + <indexterm zone="indexes-examine"> + <primary>index</primary> + <secondary>examining usage</secondary> + </indexterm> + + <para> + Although indexes in <productname>PostgreSQL</productname> do not need + maintenance or tuning, it is still important to check + which indexes are actually used by the real-life query workload. + Examining index usage for an individual query is done with the + <xref linkend="sql-explain"/> + command; its application for this purpose is + illustrated in <xref linkend="using-explain"/>. + It is also possible to gather overall statistics about index usage + in a running server, as described in <xref linkend="monitoring-stats"/>. + </para> + + <para> + It is difficult to formulate a general procedure for determining + which indexes to create. There are a number of typical cases that + have been shown in the examples throughout the previous sections. + A good deal of experimentation is often necessary. + The rest of this section gives some tips for that: + </para> + + <itemizedlist> + <listitem> + <para> + Always run <xref linkend="sql-analyze"/> + first. This command + collects statistics about the distribution of the values in the + table. This information is required to estimate the number of rows + returned by a query, which is needed by the planner to assign + realistic costs to each possible query plan. In absence of any + real statistics, some default values are assumed, which are + almost certain to be inaccurate. Examining an application's + index usage without having run <command>ANALYZE</command> is + therefore a lost cause. + See <xref linkend="vacuum-for-statistics"/> + and <xref linkend="autovacuum"/> for more information. + </para> + </listitem> + + <listitem> + <para> + Use real data for experimentation. Using test data for setting + up indexes will tell you what indexes you need for the test data, + but that is all. + </para> + + <para> + It is especially fatal to use very small test data sets. + While selecting 1000 out of 100000 rows could be a candidate for + an index, selecting 1 out of 100 rows will hardly be, because the + 100 rows probably fit within a single disk page, and there + is no plan that can beat sequentially fetching 1 disk page. + </para> + + <para> + Also be careful when making up test data, which is often + unavoidable when the application is not yet in production. + Values that are very similar, completely random, or inserted in + sorted order will skew the statistics away from the distribution + that real data would have. + </para> + </listitem> + + <listitem> + <para> + When indexes are not used, it can be useful for testing to force + their use. There are run-time parameters that can turn off + various plan types (see <xref linkend="runtime-config-query-enable"/>). + For instance, turning off sequential scans + (<varname>enable_seqscan</varname>) and nested-loop joins + (<varname>enable_nestloop</varname>), which are the most basic plans, + will force the system to use a different plan. If the system + still chooses a sequential scan or nested-loop join then there is + probably a more fundamental reason why the index is not being + used; for example, the query condition does not match the index. + (What kind of query can use what kind of index is explained in + the previous sections.) + </para> + </listitem> + + <listitem> + <para> + If forcing index usage does use the index, then there are two + possibilities: Either the system is right and using the index is + indeed not appropriate, or the cost estimates of the query plans + are not reflecting reality. So you should time your query with + and without indexes. The <command>EXPLAIN ANALYZE</command> + command can be useful here. + </para> + </listitem> + + <listitem> + <para> + If it turns out that the cost estimates are wrong, there are, + again, two possibilities. The total cost is computed from the + per-row costs of each plan node times the selectivity estimate of + the plan node. The costs estimated for the plan nodes can be adjusted + via run-time parameters (described in <xref + linkend="runtime-config-query-constants"/>). + An inaccurate selectivity estimate is due to + insufficient statistics. It might be possible to improve this by + tuning the statistics-gathering parameters (see + <xref linkend="sql-altertable"/>). + </para> + + <para> + If you do not succeed in adjusting the costs to be more + appropriate, then you might have to resort to forcing index usage + explicitly. You might also want to contact the + <productname>PostgreSQL</productname> developers to examine the issue. + </para> + </listitem> + </itemizedlist> + </sect1> +</chapter> |