diff options
Diffstat (limited to 'doc/src/sgml/html/planner-stats.html')
-rw-r--r-- | doc/src/sgml/html/planner-stats.html | 336 |
1 files changed, 336 insertions, 0 deletions
diff --git a/doc/src/sgml/html/planner-stats.html b/doc/src/sgml/html/planner-stats.html new file mode 100644 index 0000000..27cf1c8 --- /dev/null +++ b/doc/src/sgml/html/planner-stats.html @@ -0,0 +1,336 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>14.2. Statistics Used by the Planner</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="using-explain.html" title="14.1. Using EXPLAIN" /><link rel="next" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">14.2. Statistics Used by the Planner</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><th width="60%" align="center">Chapter 14. Performance Tips</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr></table><hr /></div><div class="sect1" id="PLANNER-STATS"><div class="titlepage"><div><div><h2 class="title" style="clear: both">14.2. Statistics Used by the Planner</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="planner-stats.html#id-1.5.13.5.3">14.2.1. Single-Column Statistics</a></span></dt><dt><span class="sect2"><a href="planner-stats.html#PLANNER-STATS-EXTENDED">14.2.2. Extended Statistics</a></span></dt></dl></div><a id="id-1.5.13.5.2" class="indexterm"></a><div class="sect2" id="id-1.5.13.5.3"><div class="titlepage"><div><div><h3 class="title">14.2.1. Single-Column Statistics</h3></div></div></div><p> + As we saw in the previous section, the query planner needs to estimate + the number of rows retrieved by a query in order to make good choices + of query plans. This section provides a quick look at the statistics + that the system uses for these estimates. + </p><p> + One component of the statistics is the total number of entries in + each table and index, as well as the number of disk blocks occupied + by each table and index. This information is kept in the table + <a class="link" href="catalog-pg-class.html" title="53.11. pg_class"><code class="structname">pg_class</code></a>, + in the columns <code class="structfield">reltuples</code> and + <code class="structfield">relpages</code>. We can look at it with + queries similar to this one: + +</p><pre class="screen"> +SELECT relname, relkind, reltuples, relpages +FROM pg_class +WHERE relname LIKE 'tenk1%'; + + relname | relkind | reltuples | relpages +----------------------+---------+-----------+---------- + tenk1 | r | 10000 | 358 + tenk1_hundred | i | 10000 | 30 + tenk1_thous_tenthous | i | 10000 | 30 + tenk1_unique1 | i | 10000 | 30 + tenk1_unique2 | i | 10000 | 30 +(5 rows) +</pre><p> + + Here we can see that <code class="structname">tenk1</code> contains 10000 + rows, as do its indexes, but the indexes are (unsurprisingly) much + smaller than the table. + </p><p> + For efficiency reasons, <code class="structfield">reltuples</code> + and <code class="structfield">relpages</code> are not updated on-the-fly, + and so they usually contain somewhat out-of-date values. + They are updated by <code class="command">VACUUM</code>, <code class="command">ANALYZE</code>, and a + few DDL commands such as <code class="command">CREATE INDEX</code>. A <code class="command">VACUUM</code> + or <code class="command">ANALYZE</code> operation that does not scan the entire table + (which is commonly the case) will incrementally update the + <code class="structfield">reltuples</code> count on the basis of the part + of the table it did scan, resulting in an approximate value. + In any case, the planner + will scale the values it finds in <code class="structname">pg_class</code> + to match the current physical table size, thus obtaining a closer + approximation. + </p><a id="id-1.5.13.5.3.5" class="indexterm"></a><p> + Most queries retrieve only a fraction of the rows in a table, due + to <code class="literal">WHERE</code> clauses that restrict the rows to be + examined. The planner thus needs to make an estimate of the + <em class="firstterm">selectivity</em> of <code class="literal">WHERE</code> clauses, that is, + the fraction of rows that match each condition in the + <code class="literal">WHERE</code> clause. The information used for this task is + stored in the + <a class="link" href="catalog-pg-statistic.html" title="53.51. pg_statistic"><code class="structname">pg_statistic</code></a> + system catalog. Entries in <code class="structname">pg_statistic</code> + are updated by the <code class="command">ANALYZE</code> and <code class="command">VACUUM + ANALYZE</code> commands, and are always approximate even when freshly + updated. + </p><a id="id-1.5.13.5.3.7" class="indexterm"></a><p> + Rather than look at <code class="structname">pg_statistic</code> directly, + it's better to look at its view + <a class="link" href="view-pg-stats.html" title="54.27. pg_stats"><code class="structname">pg_stats</code></a> + when examining the statistics manually. <code class="structname">pg_stats</code> + is designed to be more easily readable. Furthermore, + <code class="structname">pg_stats</code> is readable by all, whereas + <code class="structname">pg_statistic</code> is only readable by a superuser. + (This prevents unprivileged users from learning something about + the contents of other people's tables from the statistics. The + <code class="structname">pg_stats</code> view is restricted to show only + rows about tables that the current user can read.) + For example, we might do: + +</p><pre class="screen"> +SELECT attname, inherited, n_distinct, + array_to_string(most_common_vals, E'\n') as most_common_vals +FROM pg_stats +WHERE tablename = 'road'; + + attname | inherited | n_distinct | most_common_vals +---------+-----------+------------+------------------------------------ + name | f | -0.363388 | I- 580 Ramp+ + | | | I- 880 Ramp+ + | | | Sp Railroad + + | | | I- 580 + + | | | I- 680 Ramp + name | t | -0.284859 | I- 880 Ramp+ + | | | I- 580 Ramp+ + | | | I- 680 Ramp+ + | | | I- 580 + + | | | State Hwy 13 Ramp +(2 rows) +</pre><p> + + Note that two rows are displayed for the same column, one corresponding + to the complete inheritance hierarchy starting at the + <code class="literal">road</code> table (<code class="literal">inherited</code>=<code class="literal">t</code>), + and another one including only the <code class="literal">road</code> table itself + (<code class="literal">inherited</code>=<code class="literal">f</code>). + </p><p> + The amount of information stored in <code class="structname">pg_statistic</code> + by <code class="command">ANALYZE</code>, in particular the maximum number of entries in the + <code class="structfield">most_common_vals</code> and <code class="structfield">histogram_bounds</code> + arrays for each column, can be set on a + column-by-column basis using the <code class="command">ALTER TABLE SET STATISTICS</code> + command, or globally by setting the + <a class="xref" href="runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET">default_statistics_target</a> configuration variable. + The default limit is presently 100 entries. Raising the limit + might allow more accurate planner estimates to be made, particularly for + columns with irregular data distributions, at the price of consuming + more space in <code class="structname">pg_statistic</code> and slightly more + time to compute the estimates. Conversely, a lower limit might be + sufficient for columns with simple data distributions. + </p><p> + Further details about the planner's use of statistics can be found in + <a class="xref" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Chapter 75</a>. + </p></div><div class="sect2" id="PLANNER-STATS-EXTENDED"><div class="titlepage"><div><div><h3 class="title">14.2.2. Extended Statistics</h3></div></div></div><a id="id-1.5.13.5.4.2" class="indexterm"></a><a id="id-1.5.13.5.4.3" class="indexterm"></a><a id="id-1.5.13.5.4.4" class="indexterm"></a><a id="id-1.5.13.5.4.5" class="indexterm"></a><p> + It is common to see slow queries running bad execution plans because + multiple columns used in the query clauses are correlated. + The planner normally assumes that multiple conditions + are independent of each other, + an assumption that does not hold when column values are correlated. + Regular statistics, because of their per-individual-column nature, + cannot capture any knowledge about cross-column correlation. + However, <span class="productname">PostgreSQL</span> has the ability to compute + <em class="firstterm">multivariate statistics</em>, which can capture + such information. + </p><p> + Because the number of possible column combinations is very large, + it's impractical to compute multivariate statistics automatically. + Instead, <em class="firstterm">extended statistics objects</em>, more often + called just <em class="firstterm">statistics objects</em>, can be created to instruct + the server to obtain statistics across interesting sets of columns. + </p><p> + Statistics objects are created using the + <a class="link" href="sql-createstatistics.html" title="CREATE STATISTICS"><code class="command">CREATE STATISTICS</code></a> command. + Creation of such an object merely creates a catalog entry expressing + interest in the statistics. Actual data collection is performed + by <code class="command">ANALYZE</code> (either a manual command, or background + auto-analyze). The collected values can be examined in the + <a class="link" href="catalog-pg-statistic-ext-data.html" title="53.53. pg_statistic_ext_data"><code class="structname">pg_statistic_ext_data</code></a> + catalog. + </p><p> + <code class="command">ANALYZE</code> computes extended statistics based on the same + sample of table rows that it takes for computing regular single-column + statistics. Since the sample size is increased by increasing the + statistics target for the table or any of its columns (as described in + the previous section), a larger statistics target will normally result in + more accurate extended statistics, as well as more time spent calculating + them. + </p><p> + The following subsections describe the kinds of extended statistics + that are currently supported. + </p><div class="sect3" id="id-1.5.13.5.4.11"><div class="titlepage"><div><div><h4 class="title">14.2.2.1. Functional Dependencies</h4></div></div></div><p> + The simplest kind of extended statistics tracks <em class="firstterm">functional + dependencies</em>, a concept used in definitions of database normal forms. + We say that column <code class="structfield">b</code> is functionally dependent on + column <code class="structfield">a</code> if knowledge of the value of + <code class="structfield">a</code> is sufficient to determine the value + of <code class="structfield">b</code>, that is there are no two rows having the same value + of <code class="structfield">a</code> but different values of <code class="structfield">b</code>. + In a fully normalized database, functional dependencies should exist + only on primary keys and superkeys. However, in practice many data sets + are not fully normalized for various reasons; intentional + denormalization for performance reasons is a common example. + Even in a fully normalized database, there may be partial correlation + between some columns, which can be expressed as partial functional + dependency. + </p><p> + The existence of functional dependencies directly affects the accuracy + of estimates in certain queries. If a query contains conditions on + both the independent and the dependent column(s), the + conditions on the dependent columns do not further reduce the result + size; but without knowledge of the functional dependency, the query + planner will assume that the conditions are independent, resulting + in underestimating the result size. + </p><p> + To inform the planner about functional dependencies, <code class="command">ANALYZE</code> + can collect measurements of cross-column dependency. Assessing the + degree of dependency between all sets of columns would be prohibitively + expensive, so data collection is limited to those groups of columns + appearing together in a statistics object defined with + the <code class="literal">dependencies</code> option. It is advisable to create + <code class="literal">dependencies</code> statistics only for column groups that are + strongly correlated, to avoid unnecessary overhead in both + <code class="command">ANALYZE</code> and later query planning. + </p><p> + Here is an example of collecting functional-dependency statistics: +</p><pre class="programlisting"> +CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes; + +ANALYZE zipcodes; + +SELECT stxname, stxkeys, stxddependencies + FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) + WHERE stxname = 'stts'; + stxname | stxkeys | stxddependencies +---------+---------+------------------------------------------ + stts | 1 5 | {"1 => 5": 1.000000, "5 => 1": 0.423130} +(1 row) +</pre><p> + Here it can be seen that column 1 (zip code) fully determines column + 5 (city) so the coefficient is 1.0, while city only determines zip code + about 42% of the time, meaning that there are many cities (58%) that are + represented by more than a single ZIP code. + </p><p> + When computing the selectivity for a query involving functionally + dependent columns, the planner adjusts the per-condition selectivity + estimates using the dependency coefficients so as not to produce + an underestimate. + </p><div class="sect4" id="id-1.5.13.5.4.11.7"><div class="titlepage"><div><div><h5 class="title">14.2.2.1.1. Limitations of Functional Dependencies</h5></div></div></div><p> + Functional dependencies are currently only applied when considering + simple equality conditions that compare columns to constant values, + and <code class="literal">IN</code> clauses with constant values. + They are not used to improve estimates for equality conditions + comparing two columns or comparing a column to an expression, nor for + range clauses, <code class="literal">LIKE</code> or any other type of condition. + </p><p> + When estimating with functional dependencies, the planner assumes that + conditions on the involved columns are compatible and hence redundant. + If they are incompatible, the correct estimate would be zero rows, but + that possibility is not considered. For example, given a query like +</p><pre class="programlisting"> +SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105'; +</pre><p> + the planner will disregard the <code class="structfield">city</code> clause as not + changing the selectivity, which is correct. However, it will make + the same assumption about +</p><pre class="programlisting"> +SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210'; +</pre><p> + even though there will really be zero rows satisfying this query. + Functional dependency statistics do not provide enough information + to conclude that, however. + </p><p> + In many practical situations, this assumption is usually satisfied; + for example, there might be a GUI in the application that only allows + selecting compatible city and ZIP code values to use in a query. + But if that's not the case, functional dependencies may not be a viable + option. + </p></div></div><div class="sect3" id="id-1.5.13.5.4.12"><div class="titlepage"><div><div><h4 class="title">14.2.2.2. Multivariate N-Distinct Counts</h4></div></div></div><p> + Single-column statistics store the number of distinct values in each + column. Estimates of the number of distinct values when combining more + than one column (for example, for <code class="literal">GROUP BY a, b</code>) are + frequently wrong when the planner only has single-column statistical + data, causing it to select bad plans. + </p><p> + To improve such estimates, <code class="command">ANALYZE</code> can collect n-distinct + statistics for groups of columns. As before, it's impractical to do + this for every possible column grouping, so data is collected only for + those groups of columns appearing together in a statistics object + defined with the <code class="literal">ndistinct</code> option. Data will be collected + for each possible combination of two or more columns from the set of + listed columns. + </p><p> + Continuing the previous example, the n-distinct counts in a + table of ZIP codes might look like the following: +</p><pre class="programlisting"> +CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes; + +ANALYZE zipcodes; + +SELECT stxkeys AS k, stxdndistinct AS nd + FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) + WHERE stxname = 'stts2'; +-[ RECORD 1 ]-------------------------------------------------------- +k | 1 2 5 +nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178} +(1 row) +</pre><p> + This indicates that there are three combinations of columns that + have 33178 distinct values: ZIP code and state; ZIP code and city; + and ZIP code, city and state (the fact that they are all equal is + expected given that ZIP code alone is unique in this table). On the + other hand, the combination of city and state has only 27435 distinct + values. + </p><p> + It's advisable to create <code class="literal">ndistinct</code> statistics objects only + on combinations of columns that are actually used for grouping, and + for which misestimation of the number of groups is resulting in bad + plans. Otherwise, the <code class="command">ANALYZE</code> cycles are just wasted. + </p></div><div class="sect3" id="id-1.5.13.5.4.13"><div class="titlepage"><div><div><h4 class="title">14.2.2.3. Multivariate MCV Lists</h4></div></div></div><p> + Another type of statistic stored for each column are most-common value + lists. This allows very accurate estimates for individual columns, but + may result in significant misestimates for queries with conditions on + multiple columns. + </p><p> + To improve such estimates, <code class="command">ANALYZE</code> can collect MCV + lists on combinations of columns. Similarly to functional dependencies + and n-distinct coefficients, it's impractical to do this for every + possible column grouping. Even more so in this case, as the MCV list + (unlike functional dependencies and n-distinct coefficients) does store + the common column values. So data is collected only for those groups + of columns appearing together in a statistics object defined with the + <code class="literal">mcv</code> option. + </p><p> + Continuing the previous example, the MCV list for a table of ZIP codes + might look like the following (unlike for simpler types of statistics, + a function is required for inspection of MCV contents): + +</p><pre class="programlisting"> +CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes; + +ANALYZE zipcodes; + +SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid), + pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3'; + + index | values | nulls | frequency | base_frequency +-------+------------------------+-------+-----------+---------------- + 0 | {Washington, DC} | {f,f} | 0.003467 | 2.7e-05 + 1 | {Apo, AE} | {f,f} | 0.003067 | 1.9e-05 + 2 | {Houston, TX} | {f,f} | 0.002167 | 0.000133 + 3 | {El Paso, TX} | {f,f} | 0.002 | 0.000113 + 4 | {New York, NY} | {f,f} | 0.001967 | 0.000114 + 5 | {Atlanta, GA} | {f,f} | 0.001633 | 3.3e-05 + 6 | {Sacramento, CA} | {f,f} | 0.001433 | 7.8e-05 + 7 | {Miami, FL} | {f,f} | 0.0014 | 6e-05 + 8 | {Dallas, TX} | {f,f} | 0.001367 | 8.8e-05 + 9 | {Chicago, IL} | {f,f} | 0.001333 | 5.1e-05 + ... +(99 rows) +</pre><p> + This indicates that the most common combination of city and state is + Washington in DC, with actual frequency (in the sample) about 0.35%. + The base frequency of the combination (as computed from the simple + per-column frequencies) is only 0.0027%, resulting in two orders of + magnitude under-estimates. + </p><p> + It's advisable to create <acronym class="acronym">MCV</acronym> statistics objects only + on combinations of columns that are actually used in conditions together, + and for which misestimation of the number of groups is resulting in bad + plans. Otherwise, the <code class="command">ANALYZE</code> and planning cycles + are just wasted. + </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr><tr><td width="40%" align="left" valign="top">14.1. Using <code class="command">EXPLAIN</code> </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 14.3. Controlling the Planner with Explicit <code class="literal">JOIN</code> Clauses</td></tr></table></div></body></html>
\ No newline at end of file |