summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/html/row-estimation-examples.html
blob: 641970fb1dcdbbf8b1d7af16139f61bc45f0c5a5 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>75.1. Row Estimation Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics" /><link rel="next" href="multivariate-statistics-examples.html" title="75.2. Multivariate Statistics Examples" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">75.1. Row Estimation Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 75. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.5 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="75.2. Multivariate Statistics Examples">Next</a></td></tr></table><hr /></div><div class="sect1" id="ROW-ESTIMATION-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">75.1. Row Estimation Examples</h2></div></div></div><a id="id-1.10.26.4.2" class="indexterm"></a><p>
   The examples shown below use tables in the <span class="productname">PostgreSQL</span>
   regression test database.
   The outputs shown are taken from version 8.3.
   The behavior of earlier (or later) versions might vary.
   Note also that since <code class="command">ANALYZE</code> uses random sampling
   while producing statistics, the results will change slightly after
   any new <code class="command">ANALYZE</code>.
  </p><p>
   Let's start with a very simple query:

</p><pre class="programlisting">
EXPLAIN SELECT * FROM tenk1;

                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
</pre><p>

   How the planner determines the cardinality of <code class="structname">tenk1</code>
   is covered in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, but is repeated here for
   completeness. The number of pages and rows is looked up in
   <code class="structname">pg_class</code>:

</p><pre class="programlisting">
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';

 relpages | reltuples
----------+-----------
      358 |     10000
</pre><p>

    These numbers are current as of the last <code class="command">VACUUM</code> or
    <code class="command">ANALYZE</code> on the table.  The planner then fetches the
    actual current number of pages in the table (this is a cheap operation,
    not requiring a table scan).  If that is different from
    <code class="structfield">relpages</code> then
    <code class="structfield">reltuples</code> is scaled accordingly to
    arrive at a current number-of-rows estimate.  In the example above, the value of
    <code class="structfield">relpages</code> is up-to-date so the rows estimate is
    the same as <code class="structfield">reltuples</code>.
  </p><p>
   Let's move on to an example with a range condition in its
   <code class="literal">WHERE</code> clause:

</p><pre class="programlisting">
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000;

                                   QUERY PLAN
-------------------------------------------------------------------​-------------
 Bitmap Heap Scan on tenk1  (cost=24.06..394.64 rows=1007 width=244)
   Recheck Cond: (unique1 &lt; 1000)
   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
         Index Cond: (unique1 &lt; 1000)
</pre><p>

   The planner examines the <code class="literal">WHERE</code> clause condition
   and looks up the selectivity function for the operator
   <code class="literal">&lt;</code> in <code class="structname">pg_operator</code>.
   This is held in the column <code class="structfield">oprrest</code>,
   and the entry in this case is <code class="function">scalarltsel</code>.
   The <code class="function">scalarltsel</code> function retrieves the histogram for
   <code class="structfield">unique1</code> from
   <code class="structname">pg_statistic</code>.  For manual queries it is more
   convenient to look in the simpler <code class="structname">pg_stats</code>
   view:

</p><pre class="programlisting">
SELECT histogram_bounds FROM pg_stats
WHERE tablename='tenk1' AND attname='unique1';

                   histogram_bounds
------------------------------------------------------
 {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
</pre><p>

   Next the fraction of the histogram occupied by <span class="quote"><span class="quote">&lt; 1000</span></span>
   is worked out. This is the selectivity. The histogram divides the range
   into equal frequency buckets, so all we have to do is locate the bucket
   that our value is in and count <span class="emphasis"><em>part</em></span> of it and
   <span class="emphasis"><em>all</em></span> of the ones before. The value 1000 is clearly in
   the second bucket (993–1997).  Assuming a linear distribution of
   values inside each bucket, we can calculate the selectivity as:

</p><pre class="programlisting">
selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
            = (1 + (1000 - 993)/(1997 - 993))/10
            = 0.100697
</pre><p>

   that is, one whole bucket plus a linear fraction of the second, divided by
   the number of buckets. The estimated number of rows can now be calculated as
   the product of the selectivity and the cardinality of
   <code class="structname">tenk1</code>:

</p><pre class="programlisting">
rows = rel_cardinality * selectivity
     = 10000 * 0.100697
     = 1007  (rounding off)
</pre><p>
  </p><p>
   Next let's consider an example with an equality condition in its
   <code class="literal">WHERE</code> clause:

</p><pre class="programlisting">
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';

                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=30 width=244)
   Filter: (stringu1 = 'CRAAAA'::name)
</pre><p>

   Again the planner examines the <code class="literal">WHERE</code> clause condition
   and looks up the selectivity function for <code class="literal">=</code>, which is
   <code class="function">eqsel</code>.  For equality estimation the histogram is
   not useful; instead the list of <em class="firstterm">most
   common values</em> (<acronym class="acronym">MCV</acronym>s) is used to determine the
   selectivity. Let's have a look at the MCVs, with some additional columns
   that will be useful later:

</p><pre class="programlisting">
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';

null_frac         | 0
n_distinct        | 676
most_common_vals  | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,​JOAAAA,MCAAAA,NAAAAA,WGAAAA}
most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,​0.003,0.003,0.003,0.003}

</pre><p>

   Since <code class="literal">CRAAAA</code> appears in the list of MCVs, the selectivity is
   merely the corresponding entry in the list of most common frequencies
   (<acronym class="acronym">MCF</acronym>s):

</p><pre class="programlisting">
selectivity = mcf[3]
            = 0.003
</pre><p>

   As before, the estimated number of rows is just the product of this with the
   cardinality of <code class="structname">tenk1</code>:

</p><pre class="programlisting">
rows = 10000 * 0.003
     = 30
</pre><p>
  </p><p>
   Now consider the same query, but with a constant that is not in the
   <acronym class="acronym">MCV</acronym> list:

</p><pre class="programlisting">
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';

                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=15 width=244)
   Filter: (stringu1 = 'xxx'::name)
</pre><p>

   This is quite a different problem: how to estimate the selectivity when the
   value is <span class="emphasis"><em>not</em></span> in the <acronym class="acronym">MCV</acronym> list.
   The approach is to use the fact that the value is not in the list,
   combined with the knowledge of the frequencies for all of the
   <acronym class="acronym">MCV</acronym>s:

</p><pre class="programlisting">
selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
            = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
                    0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
            = 0.0014559
</pre><p>

   That is, add up all the frequencies for the <acronym class="acronym">MCV</acronym>s and
   subtract them from one, then
   divide by the number of <span class="emphasis"><em>other</em></span> distinct values.
   This amounts to assuming that the fraction of the column that is not any
   of the MCVs is evenly distributed among all the other distinct values.
   Notice that there are no null values so we don't have to worry about those
   (otherwise we'd subtract the null fraction from the numerator as well).
   The estimated number of rows is then calculated as usual:

</p><pre class="programlisting">
rows = 10000 * 0.0014559
     = 15  (rounding off)
</pre><p>
  </p><p>
   The previous example with <code class="literal">unique1 &lt; 1000</code> was an
   oversimplification of what <code class="function">scalarltsel</code> really does;
   now that we have seen an example of the use of MCVs, we can fill in some
   more detail.  The example was correct as far as it went, because since
   <code class="structfield">unique1</code> is a unique column it has no MCVs (obviously, no
   value is any more common than any other value).  For a non-unique
   column, there will normally be both a histogram and an MCV list, and
   <span class="emphasis"><em>the histogram does not include the portion of the column
   population represented by the MCVs</em></span>.  We do things this way because
   it allows more precise estimation.  In this situation
   <code class="function">scalarltsel</code> directly applies the condition (e.g.,
   <span class="quote"><span class="quote">&lt; 1000</span></span>) to each value of the MCV list, and adds up the
   frequencies of the MCVs for which the condition is true.  This gives
   an exact estimate of the selectivity within the portion of the table
   that is MCVs.  The histogram is then used in the same way as above
   to estimate the selectivity in the portion of the table that is not
   MCVs, and then the two numbers are combined to estimate the overall
   selectivity.  For example, consider

</p><pre class="programlisting">
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 &lt; 'IAAAAA';

                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=3077 width=244)
   Filter: (stringu1 &lt; 'IAAAAA'::name)
</pre><p>

   We already saw the MCV information for <code class="structfield">stringu1</code>,
   and here is its histogram:

</p><pre class="programlisting">
SELECT histogram_bounds FROM pg_stats
WHERE tablename='tenk1' AND attname='stringu1';

                                histogram_bounds
-------------------------------------------------------------------​-------------
 {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,​XLAAAA,ZZAAAA}
</pre><p>

   Checking the MCV list, we find that the condition <code class="literal">stringu1 &lt;
   'IAAAAA'</code> is satisfied by the first six entries and not the last four,
   so the selectivity within the MCV part of the population is

</p><pre class="programlisting">
selectivity = sum(relevant mvfs)
            = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
            = 0.01833333
</pre><p>

   Summing all the MCFs also tells us that the total fraction of the
   population represented by MCVs is 0.03033333, and therefore the
   fraction represented by the histogram is 0.96966667 (again, there
   are no nulls, else we'd have to exclude them here).  We can see
   that the value <code class="literal">IAAAAA</code> falls nearly at the end of the
   third histogram bucket.  Using some rather cheesy assumptions
   about the frequency of different characters, the planner arrives
   at the estimate 0.298387 for the portion of the histogram population
   that is less than <code class="literal">IAAAAA</code>.  We then combine the estimates
   for the MCV and non-MCV populations:

</p><pre class="programlisting">
selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
            = 0.01833333 + 0.298387 * 0.96966667
            = 0.307669

rows        = 10000 * 0.307669
            = 3077  (rounding off)
</pre><p>

   In this particular example, the correction from the MCV list is fairly
   small, because the column distribution is actually quite flat (the
   statistics showing these particular values as being more common than
   others are mostly due to sampling error).  In a more typical case where
   some values are significantly more common than others, this complicated
   process gives a useful improvement in accuracy because the selectivity
   for the most common values is found exactly.
  </p><p>
   Now let's consider a case with more than one
   condition in the <code class="literal">WHERE</code> clause:

</p><pre class="programlisting">
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000 AND stringu1 = 'xxx';

                                   QUERY PLAN
-------------------------------------------------------------------​-------------
 Bitmap Heap Scan on tenk1  (cost=23.80..396.91 rows=1 width=244)
   Recheck Cond: (unique1 &lt; 1000)
   Filter: (stringu1 = 'xxx'::name)
   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
         Index Cond: (unique1 &lt; 1000)
</pre><p>

   The planner assumes that the two conditions are independent, so that
   the individual selectivities of the clauses can be multiplied together:

</p><pre class="programlisting">
selectivity = selectivity(unique1 &lt; 1000) * selectivity(stringu1 = 'xxx')
            = 0.100697 * 0.0014559
            = 0.0001466

rows        = 10000 * 0.0001466
            = 1  (rounding off)
</pre><p>

   Notice that the number of rows estimated to be returned from the bitmap
   index scan reflects only the condition used with the index; this is
   important since it affects the cost estimate for the subsequent heap
   fetches.
  </p><p>
   Finally we will examine a query that involves a join:

</p><pre class="programlisting">
EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 &lt; 50 AND t1.unique2 = t2.unique2;

                                      QUERY PLAN
-------------------------------------------------------------------​-------------------
 Nested Loop  (cost=4.64..456.23 rows=50 width=488)
   -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.64..142.17 rows=50 width=244)
         Recheck Cond: (unique1 &lt; 50)
         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.63 rows=50 width=0)
               Index Cond: (unique1 &lt; 50)
   -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..6.27 rows=1 width=244)
         Index Cond: (unique2 = t1.unique2)
</pre><p>

   The restriction on <code class="structname">tenk1</code>,
   <code class="literal">unique1 &lt; 50</code>,
   is evaluated before the nested-loop join.
   This is handled analogously to the previous range example.  This time the
   value 50 falls into the first bucket of the
   <code class="structfield">unique1</code> histogram:

</p><pre class="programlisting">
selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
            = (0 + (50 - 0)/(993 - 0))/10
            = 0.005035

rows        = 10000 * 0.005035
            = 50  (rounding off)
</pre><p>

   The restriction for the join is <code class="literal">t2.unique2 = t1.unique2</code>.
   The operator is just
   our familiar <code class="literal">=</code>, however the selectivity function is
   obtained from the <code class="structfield">oprjoin</code> column of
   <code class="structname">pg_operator</code>, and is <code class="function">eqjoinsel</code>.
   <code class="function">eqjoinsel</code> looks up the statistical information for both
   <code class="structname">tenk2</code> and <code class="structname">tenk1</code>:

</p><pre class="programlisting">
SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';

tablename  | null_frac | n_distinct | most_common_vals
-----------+-----------+------------+------------------
 tenk1     |         0 |         -1 |
 tenk2     |         0 |         -1 |
</pre><p>

   In this case there is no <acronym class="acronym">MCV</acronym> information for
   <code class="structfield">unique2</code> because all the values appear to be
   unique, so we use an algorithm that relies only on the number of
   distinct values for both relations together with their null fractions:

</p><pre class="programlisting">
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
            = (1 - 0) * (1 - 0) / max(10000, 10000)
            = 0.0001
</pre><p>

   This is, subtract the null fraction from one for each of the relations,
   and divide by the maximum of the numbers of distinct values.
   The number of rows
   that the join is likely to emit is calculated as the cardinality of the
   Cartesian product of the two inputs, multiplied by the
   selectivity:

</p><pre class="programlisting">
rows = (outer_cardinality * inner_cardinality) * selectivity
     = (50 * 10000) * 0.0001
     = 50
</pre><p>
  </p><p>
   Had there been MCV lists for the two columns,
   <code class="function">eqjoinsel</code> would have used direct comparison of the MCV
   lists to determine the join selectivity within the part of the column
   populations represented by the MCVs.  The estimate for the remainder of the
   populations follows the same approach shown here.
  </p><p>
   Notice that we showed <code class="literal">inner_cardinality</code> as 10000, that is,
   the unmodified size of <code class="structname">tenk2</code>.  It might appear from
   inspection of the <code class="command">EXPLAIN</code> output that the estimate of
   join rows comes from 50 * 1, that is, the number of outer rows times
   the estimated number of rows obtained by each inner index scan on
   <code class="structname">tenk2</code>.  But this is not the case: the join relation size
   is estimated before any particular join plan has been considered.  If
   everything is working well then the two ways of estimating the join
   size will produce about the same answer, but due to round-off error and
   other factors they sometimes diverge significantly.
  </p><p>
   For those interested in further details, estimation of the size of
   a table (before any <code class="literal">WHERE</code> clauses) is done in
   <code class="filename">src/backend/optimizer/util/plancat.c</code>. The generic
   logic for clause selectivities is in
   <code class="filename">src/backend/optimizer/path/clausesel.c</code>.  The
   operator-specific selectivity functions are mostly found
   in <code class="filename">src/backend/utils/adt/selfuncs.c</code>.
  </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="75.2. Multivariate Statistics Examples">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 75. How the Planner Uses Statistics </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.5 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 75.2. Multivariate Statistics Examples</td></tr></table></div></body></html>