1 files changed, 336 insertions, 0 deletions
diff --git a/doc/src/sgml/html/planner-stats.html b/doc/src/sgml/html/planner-stats.html
new file mode 100644
index 0000000..27cf1c8
--- /dev/null
+++ b/doc/src/sgml/html/planner-stats.html
@@ -0,0 +1,336 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>14.2. Statistics Used by the Planner</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="using-explain.html" title="14.1. Using EXPLAIN" /><link rel="next" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">14.2. Statistics Used by the Planner</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><th width="60%" align="center">Chapter 14. Performance Tips</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr></table><hr /></div><div class="sect1" id="PLANNER-STATS"><div class="titlepage"><div><div><h2 class="title" style="clear: both">14.2. Statistics Used by the Planner</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="planner-stats.html#id-1.5.13.5.3">14.2.1. Single-Column Statistics</a></span></dt><dt><span class="sect2"><a href="planner-stats.html#PLANNER-STATS-EXTENDED">14.2.2. Extended Statistics</a></span></dt></dl></div><a id="id-1.5.13.5.2" class="indexterm"></a><div class="sect2" id="id-1.5.13.5.3"><div class="titlepage"><div><div><h3 class="title">14.2.1. Single-Column Statistics</h3></div></div></div><p>
+   As we saw in the previous section, the query planner needs to estimate
+   the number of rows retrieved by a query in order to make good choices
+   of query plans.  This section provides a quick look at the statistics
+   that the system uses for these estimates.
+  </p><p>
+   One component of the statistics is the total number of entries in
+   each table and index, as well as the number of disk blocks occupied
+   by each table and index.  This information is kept in the table
+   <a class="link" href="catalog-pg-class.html" title="53.11. pg_class"><code class="structname">pg_class</code></a>,
+   in the columns <code class="structfield">reltuples</code> and
+   <code class="structfield">relpages</code>.  We can look at it with
+   queries similar to this one:
+
+</p><pre class="screen">
+SELECT relname, relkind, reltuples, relpages
+FROM pg_class
+WHERE relname LIKE 'tenk1%';
+
+       relname        | relkind | reltuples | relpages
+----------------------+---------+-----------+----------
+ tenk1                | r       |     10000 |      358
+ tenk1_hundred        | i       |     10000 |       30
+ tenk1_thous_tenthous | i       |     10000 |       30
+ tenk1_unique1        | i       |     10000 |       30
+ tenk1_unique2        | i       |     10000 |       30
+(5 rows)
+</pre><p>
+
+   Here we can see that <code class="structname">tenk1</code> contains 10000
+   rows, as do its indexes, but the indexes are (unsurprisingly) much
+   smaller than the table.
+  </p><p>
+   For efficiency reasons, <code class="structfield">reltuples</code>
+   and <code class="structfield">relpages</code> are not updated on-the-fly,
+   and so they usually contain somewhat out-of-date values.
+   They are updated by <code class="command">VACUUM</code>, <code class="command">ANALYZE</code>, and a
+   few DDL commands such as <code class="command">CREATE INDEX</code>.  A <code class="command">VACUUM</code>
+   or <code class="command">ANALYZE</code> operation that does not scan the entire table
+   (which is commonly the case) will incrementally update the
+   <code class="structfield">reltuples</code> count on the basis of the part
+   of the table it did scan, resulting in an approximate value.
+   In any case, the planner
+   will scale the values it finds in <code class="structname">pg_class</code>
+   to match the current physical table size, thus obtaining a closer
+   approximation.
+  </p><a id="id-1.5.13.5.3.5" class="indexterm"></a><p>
+   Most queries retrieve only a fraction of the rows in a table, due
+   to <code class="literal">WHERE</code> clauses that restrict the rows to be
+   examined.  The planner thus needs to make an estimate of the
+   <em class="firstterm">selectivity</em> of <code class="literal">WHERE</code> clauses, that is,
+   the fraction of rows that match each condition in the
+   <code class="literal">WHERE</code> clause.  The information used for this task is
+   stored in the
+   <a class="link" href="catalog-pg-statistic.html" title="53.51. pg_statistic"><code class="structname">pg_statistic</code></a>
+   system catalog.  Entries in <code class="structname">pg_statistic</code>
+   are updated by the <code class="command">ANALYZE</code> and <code class="command">VACUUM
+   ANALYZE</code> commands, and are always approximate even when freshly
+   updated.
+  </p><a id="id-1.5.13.5.3.7" class="indexterm"></a><p>
+   Rather than look at <code class="structname">pg_statistic</code> directly,
+   it's better to look at its view
+   <a class="link" href="view-pg-stats.html" title="54.27. pg_stats"><code class="structname">pg_stats</code></a>
+   when examining the statistics manually.  <code class="structname">pg_stats</code>
+   is designed to be more easily readable.  Furthermore,
+   <code class="structname">pg_stats</code> is readable by all, whereas
+   <code class="structname">pg_statistic</code> is only readable by a superuser.
+   (This prevents unprivileged users from learning something about
+   the contents of other people's tables from the statistics.  The
+   <code class="structname">pg_stats</code> view is restricted to show only
+   rows about tables that the current user can read.)
+   For example, we might do:
+
+</p><pre class="screen">
+SELECT attname, inherited, n_distinct,
+       array_to_string(most_common_vals, E'\n') as most_common_vals
+FROM pg_stats
+WHERE tablename = 'road';
+
+ attname | inherited | n_distinct |          most_common_vals
+---------+-----------+------------+------------------------------------
+ name    | f         |  -0.363388 | I- 580                        Ramp+
+         |           |            | I- 880                        Ramp+
+         |           |            | Sp Railroad                       +
+         |           |            | I- 580                            +
+         |           |            | I- 680                        Ramp
+ name    | t         |  -0.284859 | I- 880                        Ramp+
+         |           |            | I- 580                        Ramp+
+         |           |            | I- 680                        Ramp+
+         |           |            | I- 580                            +
+         |           |            | State Hwy 13                  Ramp
+(2 rows)
+</pre><p>
+
+   Note that two rows are displayed for the same column, one corresponding
+   to the complete inheritance hierarchy starting at the
+   <code class="literal">road</code> table (<code class="literal">inherited</code>=<code class="literal">t</code>),
+   and another one including only the <code class="literal">road</code> table itself
+   (<code class="literal">inherited</code>=<code class="literal">f</code>).
+  </p><p>
+   The amount of information stored in <code class="structname">pg_statistic</code>
+   by <code class="command">ANALYZE</code>, in particular the maximum number of entries in the
+   <code class="structfield">most_common_vals</code> and <code class="structfield">histogram_bounds</code>
+   arrays for each column, can be set on a
+   column-by-column basis using the <code class="command">ALTER TABLE SET STATISTICS</code>
+   command, or globally by setting the
+   <a class="xref" href="runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET">default_statistics_target</a> configuration variable.
+   The default limit is presently 100 entries.  Raising the limit
+   might allow more accurate planner estimates to be made, particularly for
+   columns with irregular data distributions, at the price of consuming
+   more space in <code class="structname">pg_statistic</code> and slightly more
+   time to compute the estimates.  Conversely, a lower limit might be
+   sufficient for columns with simple data distributions.
+  </p><p>
+   Further details about the planner's use of statistics can be found in
+   <a class="xref" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Chapter 75</a>.
+  </p></div><div class="sect2" id="PLANNER-STATS-EXTENDED"><div class="titlepage"><div><div><h3 class="title">14.2.2. Extended Statistics</h3></div></div></div><a id="id-1.5.13.5.4.2" class="indexterm"></a><a id="id-1.5.13.5.4.3" class="indexterm"></a><a id="id-1.5.13.5.4.4" class="indexterm"></a><a id="id-1.5.13.5.4.5" class="indexterm"></a><p>
+    It is common to see slow queries running bad execution plans because
+    multiple columns used in the query clauses are correlated.
+    The planner normally assumes that multiple conditions
+    are independent of each other,
+    an assumption that does not hold when column values are correlated.
+    Regular statistics, because of their per-individual-column nature,
+    cannot capture any knowledge about cross-column correlation.
+    However, <span class="productname">PostgreSQL</span> has the ability to compute
+    <em class="firstterm">multivariate statistics</em>, which can capture
+    such information.
+   </p><p>
+    Because the number of possible column combinations is very large,
+    it's impractical to compute multivariate statistics automatically.
+    Instead, <em class="firstterm">extended statistics objects</em>, more often
+    called just <em class="firstterm">statistics objects</em>, can be created to instruct
+    the server to obtain statistics across interesting sets of columns.
+   </p><p>
+    Statistics objects are created using the
+    <a class="link" href="sql-createstatistics.html" title="CREATE STATISTICS"><code class="command">CREATE STATISTICS</code></a> command.
+    Creation of such an object merely creates a catalog entry expressing
+    interest in the statistics.  Actual data collection is performed
+    by <code class="command">ANALYZE</code> (either a manual command, or background
+    auto-analyze).  The collected values can be examined in the
+    <a class="link" href="catalog-pg-statistic-ext-data.html" title="53.53. pg_statistic_ext_data"><code class="structname">pg_statistic_ext_data</code></a>
+    catalog.
+   </p><p>
+    <code class="command">ANALYZE</code> computes extended statistics based on the same
+    sample of table rows that it takes for computing regular single-column
+    statistics.  Since the sample size is increased by increasing the
+    statistics target for the table or any of its columns (as described in
+    the previous section), a larger statistics target will normally result in
+    more accurate extended statistics, as well as more time spent calculating
+    them.
+   </p><p>
+    The following subsections describe the kinds of extended statistics
+    that are currently supported.
+   </p><div class="sect3" id="id-1.5.13.5.4.11"><div class="titlepage"><div><div><h4 class="title">14.2.2.1. Functional Dependencies</h4></div></div></div><p>
+     The simplest kind of extended statistics tracks <em class="firstterm">functional
+     dependencies</em>, a concept used in definitions of database normal forms.
+     We say that column <code class="structfield">b</code> is functionally dependent on
+     column <code class="structfield">a</code> if knowledge of the value of
+     <code class="structfield">a</code> is sufficient to determine the value
+     of <code class="structfield">b</code>, that is there are no two rows having the same value
+     of <code class="structfield">a</code> but different values of <code class="structfield">b</code>.
+     In a fully normalized database, functional dependencies should exist
+     only on primary keys and superkeys. However, in practice many data sets
+     are not fully normalized for various reasons; intentional
+     denormalization for performance reasons is a common example.
+     Even in a fully normalized database, there may be partial correlation
+     between some columns, which can be expressed as partial functional
+     dependency.
+    </p><p>
+     The existence of functional dependencies directly affects the accuracy
+     of estimates in certain queries.  If a query contains conditions on
+     both the independent and the dependent column(s), the
+     conditions on the dependent columns do not further reduce the result
+     size; but without knowledge of the functional dependency, the query
+     planner will assume that the conditions are independent, resulting
+     in underestimating the result size.
+    </p><p>
+     To inform the planner about functional dependencies, <code class="command">ANALYZE</code>
+     can collect measurements of cross-column dependency. Assessing the
+     degree of dependency between all sets of columns would be prohibitively
+     expensive, so data collection is limited to those groups of columns
+     appearing together in a statistics object defined with
+     the <code class="literal">dependencies</code> option.  It is advisable to create
+     <code class="literal">dependencies</code> statistics only for column groups that are
+     strongly correlated, to avoid unnecessary overhead in both
+     <code class="command">ANALYZE</code> and later query planning.
+    </p><p>
+     Here is an example of collecting functional-dependency statistics:
+</p><pre class="programlisting">
+CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
+
+ANALYZE zipcodes;
+
+SELECT stxname, stxkeys, stxddependencies
+  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
+  WHERE stxname = 'stts';
+ stxname | stxkeys |             stxddependencies
+---------+---------+------------------------------------------
+ stts    | 1 5     | {"1 =&gt; 5": 1.000000, "5 =&gt; 1": 0.423130}
+(1 row)
+</pre><p>
+     Here it can be seen that column 1 (zip code) fully determines column
+     5 (city) so the coefficient is 1.0, while city only determines zip code
+     about 42% of the time, meaning that there are many cities (58%) that are
+     represented by more than a single ZIP code.
+    </p><p>
+     When computing the selectivity for a query involving functionally
+     dependent columns, the planner adjusts the per-condition selectivity
+     estimates using the dependency coefficients so as not to produce
+     an underestimate.
+    </p><div class="sect4" id="id-1.5.13.5.4.11.7"><div class="titlepage"><div><div><h5 class="title">14.2.2.1.1. Limitations of Functional Dependencies</h5></div></div></div><p>
+      Functional dependencies are currently only applied when considering
+      simple equality conditions that compare columns to constant values,
+      and <code class="literal">IN</code> clauses with constant values.
+      They are not used to improve estimates for equality conditions
+      comparing two columns or comparing a column to an expression, nor for
+      range clauses, <code class="literal">LIKE</code> or any other type of condition.
+     </p><p>
+      When estimating with functional dependencies, the planner assumes that
+      conditions on the involved columns are compatible and hence redundant.
+      If they are incompatible, the correct estimate would be zero rows, but
+      that possibility is not considered.  For example, given a query like
+</p><pre class="programlisting">
+SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
+</pre><p>
+      the planner will disregard the <code class="structfield">city</code> clause as not
+      changing the selectivity, which is correct.  However, it will make
+      the same assumption about
+</p><pre class="programlisting">
+SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
+</pre><p>
+      even though there will really be zero rows satisfying this query.
+      Functional dependency statistics do not provide enough information
+      to conclude that, however.
+     </p><p>
+      In many practical situations, this assumption is usually satisfied;
+      for example, there might be a GUI in the application that only allows
+      selecting compatible city and ZIP code values to use in a query.
+      But if that's not the case, functional dependencies may not be a viable
+      option.
+     </p></div></div><div class="sect3" id="id-1.5.13.5.4.12"><div class="titlepage"><div><div><h4 class="title">14.2.2.2. Multivariate N-Distinct Counts</h4></div></div></div><p>
+     Single-column statistics store the number of distinct values in each
+     column.  Estimates of the number of distinct values when combining more
+     than one column (for example, for <code class="literal">GROUP BY a, b</code>) are
+     frequently wrong when the planner only has single-column statistical
+     data, causing it to select bad plans.
+    </p><p>
+     To improve such estimates, <code class="command">ANALYZE</code> can collect n-distinct
+     statistics for groups of columns.  As before, it's impractical to do
+     this for every possible column grouping, so data is collected only for
+     those groups of columns appearing together in a statistics object
+     defined with the <code class="literal">ndistinct</code> option.  Data will be collected
+     for each possible combination of two or more columns from the set of
+     listed columns.
+    </p><p>
+     Continuing the previous example, the n-distinct counts in a
+     table of ZIP codes might look like the following:
+</p><pre class="programlisting">
+CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
+
+ANALYZE zipcodes;
+
+SELECT stxkeys AS k, stxdndistinct AS nd
+  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
+  WHERE stxname = 'stts2';
+-[ RECORD 1 ]--------------------------------------------------------
+k  | 1 2 5
+nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
+(1 row)
+</pre><p>
+     This indicates that there are three combinations of columns that
+     have 33178 distinct values: ZIP code and state; ZIP code and city;
+     and ZIP code, city and state (the fact that they are all equal is
+     expected given that ZIP code alone is unique in this table).  On the
+     other hand, the combination of city and state has only 27435 distinct
+     values.
+    </p><p>
+     It's advisable to create <code class="literal">ndistinct</code> statistics objects only
+     on combinations of columns that are actually used for grouping, and
+     for which misestimation of the number of groups is resulting in bad
+     plans.  Otherwise, the <code class="command">ANALYZE</code> cycles are just wasted.
+    </p></div><div class="sect3" id="id-1.5.13.5.4.13"><div class="titlepage"><div><div><h4 class="title">14.2.2.3. Multivariate MCV Lists</h4></div></div></div><p>
+     Another type of statistic stored for each column are most-common value
+     lists.  This allows very accurate estimates for individual columns, but
+     may result in significant misestimates for queries with conditions on
+     multiple columns.
+    </p><p>
+     To improve such estimates, <code class="command">ANALYZE</code> can collect MCV
+     lists on combinations of columns.  Similarly to functional dependencies
+     and n-distinct coefficients, it's impractical to do this for every
+     possible column grouping.  Even more so in this case, as the MCV list
+     (unlike functional dependencies and n-distinct coefficients) does store
+     the common column values.  So data is collected only for those groups
+     of columns appearing together in a statistics object defined with the
+     <code class="literal">mcv</code> option.
+    </p><p>
+     Continuing the previous example, the MCV list for a table of ZIP codes
+     might look like the following (unlike for simpler types of statistics,
+     a function is required for inspection of MCV contents):
+
+</p><pre class="programlisting">
+CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;
+
+ANALYZE zipcodes;
+
+SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
+                pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
+
+ index |         values         | nulls | frequency | base_frequency
+-------+------------------------+-------+-----------+----------------
+     0 | {Washington, DC}       | {f,f} |  0.003467 |        2.7e-05
+     1 | {Apo, AE}              | {f,f} |  0.003067 |        1.9e-05
+     2 | {Houston, TX}          | {f,f} |  0.002167 |       0.000133
+     3 | {El Paso, TX}          | {f,f} |     0.002 |       0.000113
+     4 | {New York, NY}         | {f,f} |  0.001967 |       0.000114
+     5 | {Atlanta, GA}          | {f,f} |  0.001633 |        3.3e-05
+     6 | {Sacramento, CA}       | {f,f} |  0.001433 |        7.8e-05
+     7 | {Miami, FL}            | {f,f} |    0.0014 |          6e-05
+     8 | {Dallas, TX}           | {f,f} |  0.001367 |        8.8e-05
+     9 | {Chicago, IL}          | {f,f} |  0.001333 |        5.1e-05
+   ...
+(99 rows)
+</pre><p>
+     This indicates that the most common combination of city and state is
+     Washington in DC, with actual frequency (in the sample) about 0.35%.
+     The base frequency of the combination (as computed from the simple
+     per-column frequencies) is only 0.0027%, resulting in two orders of
+     magnitude under-estimates.
+    </p><p>
+     It's advisable to create <acronym class="acronym">MCV</acronym> statistics objects only
+     on combinations of columns that are actually used in conditions together,
+     and for which misestimation of the number of groups is resulting in bad
+     plans.  Otherwise, the <code class="command">ANALYZE</code> and planning cycles
+     are just wasted.
+    </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr><tr><td width="40%" align="left" valign="top">14.1. Using <code class="command">EXPLAIN</code> </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 14.3. Controlling the Planner with Explicit <code class="literal">JOIN</code> Clauses</td></tr></table></div></body></html>
+\ No newline at end of file