doc/src/sgml/html/planner-stats.html


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>14.2. Statistics Used by the Planner</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="using-explain.html" title="14.1. Using EXPLAIN" /><link rel="next" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">14.2. Statistics Used by the Planner</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><th width="60%" align="center">Chapter 14. Performance Tips</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.7 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr></table><hr /></div><div class="sect1" id="PLANNER-STATS"><div class="titlepage"><div><div><h2 class="title" style="clear: both">14.2. Statistics Used by the Planner</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="planner-stats.html#id-1.5.13.5.3">14.2.1. Single-Column Statistics</a></span></dt><dt><span class="sect2"><a href="planner-stats.html#PLANNER-STATS-EXTENDED">14.2.2. Extended Statistics</a></span></dt></dl></div><a id="id-1.5.13.5.2" class="indexterm"></a><div class="sect2" id="id-1.5.13.5.3"><div class="titlepage"><div><div><h3 class="title">14.2.1. Single-Column Statistics</h3></div></div></div><p>
   As we saw in the previous section, the query planner needs to estimate
   the number of rows retrieved by a query in order to make good choices
   of query plans.  This section provides a quick look at the statistics
   that the system uses for these estimates.
  </p><p>
   One component of the statistics is the total number of entries in
   each table and index, as well as the number of disk blocks occupied
   by each table and index.  This information is kept in the table
   <a class="link" href="catalog-pg-class.html" title="53.11. pg_class"><code class="structname">pg_class</code></a>,
   in the columns <code class="structfield">reltuples</code> and
   <code class="structfield">relpages</code>.  We can look at it with
   queries similar to this one:

</p><pre class="screen">
SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE 'tenk1%';

       relname        | relkind | reltuples | relpages
----------------------+---------+-----------+----------
 tenk1                | r       |     10000 |      358
 tenk1_hundred        | i       |     10000 |       30
 tenk1_thous_tenthous | i       |     10000 |       30
 tenk1_unique1        | i       |     10000 |       30
 tenk1_unique2        | i       |     10000 |       30
(5 rows)
</pre><p>

   Here we can see that <code class="structname">tenk1</code> contains 10000
   rows, as do its indexes, but the indexes are (unsurprisingly) much
   smaller than the table.
  </p><p>
   For efficiency reasons, <code class="structfield">reltuples</code>
   and <code class="structfield">relpages</code> are not updated on-the-fly,
   and so they usually contain somewhat out-of-date values.
   They are updated by <code class="command">VACUUM</code>, <code class="command">ANALYZE</code>, and a
   few DDL commands such as <code class="command">CREATE INDEX</code>.  A <code class="command">VACUUM</code>
   or <code class="command">ANALYZE</code> operation that does not scan the entire table
   (which is commonly the case) will incrementally update the
   <code class="structfield">reltuples</code> count on the basis of the part
   of the table it did scan, resulting in an approximate value.
   In any case, the planner
   will scale the values it finds in <code class="structname">pg_class</code>
   to match the current physical table size, thus obtaining a closer
   approximation.
  </p><a id="id-1.5.13.5.3.5" class="indexterm"></a><p>
   Most queries retrieve only a fraction of the rows in a table, due
   to <code class="literal">WHERE</code> clauses that restrict the rows to be
   examined.  The planner thus needs to make an estimate of the
   <em class="firstterm">selectivity</em> of <code class="literal">WHERE</code> clauses, that is,
   the fraction of rows that match each condition in the
   <code class="literal">WHERE</code> clause.  The information used for this task is
   stored in the
   <a class="link" href="catalog-pg-statistic.html" title="53.51. pg_statistic"><code class="structname">pg_statistic</code></a>
   system catalog.  Entries in <code class="structname">pg_statistic</code>
   are updated by the <code class="command">ANALYZE</code> and <code class="command">VACUUM
   ANALYZE</code> commands, and are always approximate even when freshly
   updated.
  </p><a id="id-1.5.13.5.3.7" class="indexterm"></a><p>
   Rather than look at <code class="structname">pg_statistic</code> directly,
   it's better to look at its view
   <a class="link" href="view-pg-stats.html" title="54.27. pg_stats"><code class="structname">pg_stats</code></a>
   when examining the statistics manually.  <code class="structname">pg_stats</code>
   is designed to be more easily readable.  Furthermore,
   <code class="structname">pg_stats</code> is readable by all, whereas
   <code class="structname">pg_statistic</code> is only readable by a superuser.
   (This prevents unprivileged users from learning something about
   the contents of other people's tables from the statistics.  The
   <code class="structname">pg_stats</code> view is restricted to show only
   rows about tables that the current user can read.)
   For example, we might do:

</p><pre class="screen">
SELECT attname, inherited, n_distinct,
       array_to_string(most_common_vals, E'\n') as most_common_vals
FROM pg_stats
WHERE tablename = 'road';

 attname | inherited | n_distinct |          most_common_vals
---------+-----------+------------+------------------------------------
 name    | f         |  -0.363388 | I- 580                        Ramp+
         |           |            | I- 880                        Ramp+
         |           |            | Sp Railroad                       +
         |           |            | I- 580                            +
         |           |            | I- 680                        Ramp
 name    | t         |  -0.284859 | I- 880                        Ramp+
         |           |            | I- 580                        Ramp+
         |           |            | I- 680                        Ramp+
         |           |            | I- 580                            +
         |           |            | State Hwy 13                  Ramp
(2 rows)
</pre><p>

   Note that two rows are displayed for the same column, one corresponding
   to the complete inheritance hierarchy starting at the
   <code class="literal">road</code> table (<code class="literal">inherited</code>=<code class="literal">t</code>),
   and another one including only the <code class="literal">road</code> table itself
   (<code class="literal">inherited</code>=<code class="literal">f</code>).
  </p><p>
   The amount of information stored in <code class="structname">pg_statistic</code>
   by <code class="command">ANALYZE</code>, in particular the maximum number of entries in the
   <code class="structfield">most_common_vals</code> and <code class="structfield">histogram_bounds</code>
   arrays for each column, can be set on a
   column-by-column basis using the <code class="command">ALTER TABLE SET STATISTICS</code>
   command, or globally by setting the
   <a class="xref" href="runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET">default_statistics_target</a> configuration variable.
   The default limit is presently 100 entries.  Raising the limit
   might allow more accurate planner estimates to be made, particularly for
   columns with irregular data distributions, at the price of consuming
   more space in <code class="structname">pg_statistic</code> and slightly more
   time to compute the estimates.  Conversely, a lower limit might be
   sufficient for columns with simple data distributions.
  </p><p>
   Further details about the planner's use of statistics can be found in
   <a class="xref" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Chapter 75</a>.
  </p></div><div class="sect2" id="PLANNER-STATS-EXTENDED"><div class="titlepage"><div><div><h3 class="title">14.2.2. Extended Statistics</h3></div></div></div><a id="id-1.5.13.5.4.2" class="indexterm"></a><a id="id-1.5.13.5.4.3" class="indexterm"></a><a id="id-1.5.13.5.4.4" class="indexterm"></a><a id="id-1.5.13.5.4.5" class="indexterm"></a><p>
    It is common to see slow queries running bad execution plans because
    multiple columns used in the query clauses are correlated.
    The planner normally assumes that multiple conditions
    are independent of each other,
    an assumption that does not hold when column values are correlated.
    Regular statistics, because of their per-individual-column nature,
    cannot capture any knowledge about cross-column correlation.
    However, <span class="productname">PostgreSQL</span> has the ability to compute
    <em class="firstterm">multivariate statistics</em>, which can capture
    such information.
   </p><p>
    Because the number of possible column combinations is very large,
    it's impractical to compute multivariate statistics automatically.
    Instead, <em class="firstterm">extended statistics objects</em>, more often
    called just <em class="firstterm">statistics objects</em>, can be created to instruct
    the server to obtain statistics across interesting sets of columns.
   </p><p>
    Statistics objects are created using the
    <a class="link" href="sql-createstatistics.html" title="CREATE STATISTICS"><code class="command">CREATE STATISTICS</code></a> command.
    Creation of such an object merely creates a catalog entry expressing
    interest in the statistics.  Actual data collection is performed
    by <code class="command">ANALYZE</code> (either a manual command, or background
    auto-analyze).  The collected values can be examined in the
    <a class="link" href="catalog-pg-statistic-ext-data.html" title="53.53. pg_statistic_ext_data"><code class="structname">pg_statistic_ext_data</code></a>
    catalog.
   </p><p>
    <code class="command">ANALYZE</code> computes extended statistics based on the same
    sample of table rows that it takes for computing regular single-column
    statistics.  Since the sample size is increased by increasing the
    statistics target for the table or any of its columns (as described in
    the previous section), a larger statistics target will normally result in
    more accurate extended statistics, as well as more time spent calculating
    them.
   </p><p>
    The following subsections describe the kinds of extended statistics
    that are currently supported.
   </p><div class="sect3" id="id-1.5.13.5.4.11"><div class="titlepage"><div><div><h4 class="title">14.2.2.1. Functional Dependencies</h4></div></div></div><p>
     The simplest kind of extended statistics tracks <em class="firstterm">functional
     dependencies</em>, a concept used in definitions of database normal forms.
     We say that column <code class="structfield">b</code> is functionally dependent on
     column <code class="structfield">a</code> if knowledge of the value of
     <code class="structfield">a</code> is sufficient to determine the value
     of <code class="structfield">b</code>, that is there are no two rows having the same value
     of <code class="structfield">a</code> but different values of <code class="structfield">b</code>.
     In a fully normalized database, functional dependencies should exist
     only on primary keys and superkeys. However, in practice many data sets
     are not fully normalized for various reasons; intentional
     denormalization for performance reasons is a common example.
     Even in a fully normalized database, there may be partial correlation
     between some columns, which can be expressed as partial functional
     dependency.
    </p><p>
     The existence of functional dependencies directly affects the accuracy
     of estimates in certain queries.  If a query contains conditions on
     both the independent and the dependent column(s), the
     conditions on the dependent columns do not further reduce the result
     size; but without knowledge of the functional dependency, the query
     planner will assume that the conditions are independent, resulting
     in underestimating the result size.
    </p><p>
     To inform the planner about functional dependencies, <code class="command">ANALYZE</code>
     can collect measurements of cross-column dependency. Assessing the
     degree of dependency between all sets of columns would be prohibitively
     expensive, so data collection is limited to those groups of columns
     appearing together in a statistics object defined with
     the <code class="literal">dependencies</code> option.  It is advisable to create
     <code class="literal">dependencies</code> statistics only for column groups that are
     strongly correlated, to avoid unnecessary overhead in both
     <code class="command">ANALYZE</code> and later query planning.
    </p><p>
     Here is an example of collecting functional-dependency statistics:
</p><pre class="programlisting">
CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;

ANALYZE zipcodes;

SELECT stxname, stxkeys, stxddependencies
  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  WHERE stxname = 'stts';
 stxname | stxkeys |             stxddependencies
---------+---------+------------------------------------------
 stts    | 1 5     | {"1 =&gt; 5": 1.000000, "5 =&gt; 1": 0.423130}
(1 row)
</pre><p>
     Here it can be seen that column 1 (zip code) fully determines column
     5 (city) so the coefficient is 1.0, while city only determines zip code
     about 42% of the time, meaning that there are many cities (58%) that are
     represented by more than a single ZIP code.
    </p><p>
     When computing the selectivity for a query involving functionally
     dependent columns, the planner adjusts the per-condition selectivity
     estimates using the dependency coefficients so as not to produce
     an underestimate.
    </p><div class="sect4" id="id-1.5.13.5.4.11.7"><div class="titlepage"><div><div><h5 class="title">14.2.2.1.1. Limitations of Functional Dependencies</h5></div></div></div><p>
      Functional dependencies are currently only applied when considering
      simple equality conditions that compare columns to constant values,
      and <code class="literal">IN</code> clauses with constant values.
      They are not used to improve estimates for equality conditions
      comparing two columns or comparing a column to an expression, nor for
      range clauses, <code class="literal">LIKE</code> or any other type of condition.
     </p><p>
      When estimating with functional dependencies, the planner assumes that
      conditions on the involved columns are compatible and hence redundant.
      If they are incompatible, the correct estimate would be zero rows, but
      that possibility is not considered.  For example, given a query like
</p><pre class="programlisting">
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
</pre><p>
      the planner will disregard the <code class="structfield">city</code> clause as not
      changing the selectivity, which is correct.  However, it will make
      the same assumption about
</p><pre class="programlisting">
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
</pre><p>
      even though there will really be zero rows satisfying this query.
      Functional dependency statistics do not provide enough information
      to conclude that, however.
     </p><p>
      In many practical situations, this assumption is usually satisfied;
      for example, there might be a GUI in the application that only allows
      selecting compatible city and ZIP code values to use in a query.
      But if that's not the case, functional dependencies may not be a viable
      option.
     </p></div></div><div class="sect3" id="id-1.5.13.5.4.12"><div class="titlepage"><div><div><h4 class="title">14.2.2.2. Multivariate N-Distinct Counts</h4></div></div></div><p>
     Single-column statistics store the number of distinct values in each
     column.  Estimates of the number of distinct values when combining more
     than one column (for example, for <code class="literal">GROUP BY a, b</code>) are
     frequently wrong when the planner only has single-column statistical
     data, causing it to select bad plans.
    </p><p>
     To improve such estimates, <code class="command">ANALYZE</code> can collect n-distinct
     statistics for groups of columns.  As before, it's impractical to do
     this for every possible column grouping, so data is collected only for
     those groups of columns appearing together in a statistics object
     defined with the <code class="literal">ndistinct</code> option.  Data will be collected
     for each possible combination of two or more columns from the set of
     listed columns.
    </p><p>
     Continuing the previous example, the n-distinct counts in a
     table of ZIP codes might look like the following:
</p><pre class="programlisting">
CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;

ANALYZE zipcodes;

SELECT stxkeys AS k, stxdndistinct AS nd
  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  WHERE stxname = 'stts2';
-[ RECORD 1 ]--------------------------------------------------------
k  | 1 2 5
nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
(1 row)
</pre><p>
     This indicates that there are three combinations of columns that
     have 33178 distinct values: ZIP code and state; ZIP code and city;
     and ZIP code, city and state (the fact that they are all equal is
     expected given that ZIP code alone is unique in this table).  On the
     other hand, the combination of city and state has only 27435 distinct
     values.
    </p><p>
     It's advisable to create <code class="literal">ndistinct</code> statistics objects only
     on combinations of columns that are actually used for grouping, and
     for which misestimation of the number of groups is resulting in bad
     plans.  Otherwise, the <code class="command">ANALYZE</code> cycles are just wasted.
    </p></div><div class="sect3" id="id-1.5.13.5.4.13"><div class="titlepage"><div><div><h4 class="title">14.2.2.3. Multivariate MCV Lists</h4></div></div></div><p>
     Another type of statistic stored for each column are most-common value
     lists.  This allows very accurate estimates for individual columns, but
     may result in significant misestimates for queries with conditions on
     multiple columns.
    </p><p>
     To improve such estimates, <code class="command">ANALYZE</code> can collect MCV
     lists on combinations of columns.  Similarly to functional dependencies
     and n-distinct coefficients, it's impractical to do this for every
     possible column grouping.  Even more so in this case, as the MCV list
     (unlike functional dependencies and n-distinct coefficients) does store
     the common column values.  So data is collected only for those groups
     of columns appearing together in a statistics object defined with the
     <code class="literal">mcv</code> option.
    </p><p>
     Continuing the previous example, the MCV list for a table of ZIP codes
     might look like the following (unlike for simpler types of statistics,
     a function is required for inspection of MCV contents):

</p><pre class="programlisting">
CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;

ANALYZE zipcodes;

SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
                pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';

 index |         values         | nulls | frequency | base_frequency
-------+------------------------+-------+-----------+----------------
     0 | {Washington, DC}       | {f,f} |  0.003467 |        2.7e-05
     1 | {Apo, AE}              | {f,f} |  0.003067 |        1.9e-05
     2 | {Houston, TX}          | {f,f} |  0.002167 |       0.000133
     3 | {El Paso, TX}          | {f,f} |     0.002 |       0.000113
     4 | {New York, NY}         | {f,f} |  0.001967 |       0.000114
     5 | {Atlanta, GA}          | {f,f} |  0.001633 |        3.3e-05
     6 | {Sacramento, CA}       | {f,f} |  0.001433 |        7.8e-05
     7 | {Miami, FL}            | {f,f} |    0.0014 |          6e-05
     8 | {Dallas, TX}           | {f,f} |  0.001367 |        8.8e-05
     9 | {Chicago, IL}          | {f,f} |  0.001333 |        5.1e-05
   ...
(99 rows)
</pre><p>
     This indicates that the most common combination of city and state is
     Washington in DC, with actual frequency (in the sample) about 0.35%.
     The base frequency of the combination (as computed from the simple
     per-column frequencies) is only 0.0027%, resulting in two orders of
     magnitude under-estimates.
    </p><p>
     It's advisable to create <acronym class="acronym">MCV</acronym> statistics objects only
     on combinations of columns that are actually used in conditions together,
     and for which misestimation of the number of groups is resulting in bad
     plans.  Otherwise, the <code class="command">ANALYZE</code> and planning cycles
     are just wasted.
    </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="using-explain.html" title="14.1. Using EXPLAIN">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="performance-tips.html" title="Chapter 14. Performance Tips">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="explicit-joins.html" title="14.3. Controlling the Planner with Explicit JOIN Clauses">Next</a></td></tr><tr><td width="40%" align="left" valign="top">14.1. Using <code class="command">EXPLAIN</code> </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.7 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 14.3. Controlling the Planner with Explicit <code class="literal">JOIN</code> Clauses</td></tr></table></div></body></html>