diff options
Diffstat (limited to 'doc/src/sgml/html/row-estimation-examples.html')
-rw-r--r-- | doc/src/sgml/html/row-estimation-examples.html | 399 |
1 files changed, 399 insertions, 0 deletions
diff --git a/doc/src/sgml/html/row-estimation-examples.html b/doc/src/sgml/html/row-estimation-examples.html new file mode 100644 index 0000000..3352502 --- /dev/null +++ b/doc/src/sgml/html/row-estimation-examples.html @@ -0,0 +1,399 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>72.1. Row Estimation Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="planner-stats-details.html" title="Chapter 72. How the Planner Uses Statistics" /><link rel="next" href="multivariate-statistics-examples.html" title="72.2. Multivariate Statistics Examples" /></head><body id="docContent" class="container-fluid col-10"><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">72.1. Row Estimation Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 72. How the Planner Uses Statistics">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 72. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 72. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 14.5 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="72.2. Multivariate Statistics Examples">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="ROW-ESTIMATION-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">72.1. Row Estimation Examples</h2></div></div></div><a id="id-1.10.24.4.2" class="indexterm"></a><p> + The examples shown below use tables in the <span class="productname">PostgreSQL</span> + regression test database. + The outputs shown are taken from version 8.3. + The behavior of earlier (or later) versions might vary. + Note also that since <code class="command">ANALYZE</code> uses random sampling + while producing statistics, the results will change slightly after + any new <code class="command">ANALYZE</code>. + </p><p> + Let's start with a very simple query: + +</p><pre class="programlisting"> +EXPLAIN SELECT * FROM tenk1; + + QUERY PLAN +------------------------------------------------------------- + Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244) +</pre><p> + + How the planner determines the cardinality of <code class="structname">tenk1</code> + is covered in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, but is repeated here for + completeness. The number of pages and rows is looked up in + <code class="structname">pg_class</code>: + +</p><pre class="programlisting"> +SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1'; + + relpages | reltuples +----------+----------- + 358 | 10000 +</pre><p> + + These numbers are current as of the last <code class="command">VACUUM</code> or + <code class="command">ANALYZE</code> on the table. The planner then fetches the + actual current number of pages in the table (this is a cheap operation, + not requiring a table scan). If that is different from + <code class="structfield">relpages</code> then + <code class="structfield">reltuples</code> is scaled accordingly to + arrive at a current number-of-rows estimate. In the example above, the value of + <code class="structfield">relpages</code> is up-to-date so the rows estimate is + the same as <code class="structfield">reltuples</code>. + </p><p> + Let's move on to an example with a range condition in its + <code class="literal">WHERE</code> clause: + +</p><pre class="programlisting"> +EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000; + + QUERY PLAN +-------------------------------------------------------------------------------- + Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244) + Recheck Cond: (unique1 < 1000) + -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0) + Index Cond: (unique1 < 1000) +</pre><p> + + The planner examines the <code class="literal">WHERE</code> clause condition + and looks up the selectivity function for the operator + <code class="literal"><</code> in <code class="structname">pg_operator</code>. + This is held in the column <code class="structfield">oprrest</code>, + and the entry in this case is <code class="function">scalarltsel</code>. + The <code class="function">scalarltsel</code> function retrieves the histogram for + <code class="structfield">unique1</code> from + <code class="structname">pg_statistic</code>. For manual queries it is more + convenient to look in the simpler <code class="structname">pg_stats</code> + view: + +</p><pre class="programlisting"> +SELECT histogram_bounds FROM pg_stats +WHERE tablename='tenk1' AND attname='unique1'; + + histogram_bounds +------------------------------------------------------ + {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995} +</pre><p> + + Next the fraction of the histogram occupied by <span class="quote">“<span class="quote">< 1000</span>”</span> + is worked out. This is the selectivity. The histogram divides the range + into equal frequency buckets, so all we have to do is locate the bucket + that our value is in and count <span class="emphasis"><em>part</em></span> of it and + <span class="emphasis"><em>all</em></span> of the ones before. The value 1000 is clearly in + the second bucket (993–1997). Assuming a linear distribution of + values inside each bucket, we can calculate the selectivity as: + +</p><pre class="programlisting"> +selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets + = (1 + (1000 - 993)/(1997 - 993))/10 + = 0.100697 +</pre><p> + + that is, one whole bucket plus a linear fraction of the second, divided by + the number of buckets. The estimated number of rows can now be calculated as + the product of the selectivity and the cardinality of + <code class="structname">tenk1</code>: + +</p><pre class="programlisting"> +rows = rel_cardinality * selectivity + = 10000 * 0.100697 + = 1007 (rounding off) +</pre><p> + </p><p> + Next let's consider an example with an equality condition in its + <code class="literal">WHERE</code> clause: + +</p><pre class="programlisting"> +EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA'; + + QUERY PLAN +---------------------------------------------------------- + Seq Scan on tenk1 (cost=0.00..483.00 rows=30 width=244) + Filter: (stringu1 = 'CRAAAA'::name) +</pre><p> + + Again the planner examines the <code class="literal">WHERE</code> clause condition + and looks up the selectivity function for <code class="literal">=</code>, which is + <code class="function">eqsel</code>. For equality estimation the histogram is + not useful; instead the list of <em class="firstterm">most + common values</em> (<acronym class="acronym">MCV</acronym>s) is used to determine the + selectivity. Let's have a look at the MCVs, with some additional columns + that will be useful later: + +</p><pre class="programlisting"> +SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats +WHERE tablename='tenk1' AND attname='stringu1'; + +null_frac | 0 +n_distinct | 676 +most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA} +most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003} + +</pre><p> + + Since <code class="literal">CRAAAA</code> appears in the list of MCVs, the selectivity is + merely the corresponding entry in the list of most common frequencies + (<acronym class="acronym">MCF</acronym>s): + +</p><pre class="programlisting"> +selectivity = mcf[3] + = 0.003 +</pre><p> + + As before, the estimated number of rows is just the product of this with the + cardinality of <code class="structname">tenk1</code>: + +</p><pre class="programlisting"> +rows = 10000 * 0.003 + = 30 +</pre><p> + </p><p> + Now consider the same query, but with a constant that is not in the + <acronym class="acronym">MCV</acronym> list: + +</p><pre class="programlisting"> +EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx'; + + QUERY PLAN +---------------------------------------------------------- + Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244) + Filter: (stringu1 = 'xxx'::name) +</pre><p> + + This is quite a different problem: how to estimate the selectivity when the + value is <span class="emphasis"><em>not</em></span> in the <acronym class="acronym">MCV</acronym> list. + The approach is to use the fact that the value is not in the list, + combined with the knowledge of the frequencies for all of the + <acronym class="acronym">MCV</acronym>s: + +</p><pre class="programlisting"> +selectivity = (1 - sum(mvf))/(num_distinct - num_mcv) + = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + + 0.003 + 0.003 + 0.003 + 0.003))/(676 - 10) + = 0.0014559 +</pre><p> + + That is, add up all the frequencies for the <acronym class="acronym">MCV</acronym>s and + subtract them from one, then + divide by the number of <span class="emphasis"><em>other</em></span> distinct values. + This amounts to assuming that the fraction of the column that is not any + of the MCVs is evenly distributed among all the other distinct values. + Notice that there are no null values so we don't have to worry about those + (otherwise we'd subtract the null fraction from the numerator as well). + The estimated number of rows is then calculated as usual: + +</p><pre class="programlisting"> +rows = 10000 * 0.0014559 + = 15 (rounding off) +</pre><p> + </p><p> + The previous example with <code class="literal">unique1 < 1000</code> was an + oversimplification of what <code class="function">scalarltsel</code> really does; + now that we have seen an example of the use of MCVs, we can fill in some + more detail. The example was correct as far as it went, because since + <code class="structfield">unique1</code> is a unique column it has no MCVs (obviously, no + value is any more common than any other value). For a non-unique + column, there will normally be both a histogram and an MCV list, and + <span class="emphasis"><em>the histogram does not include the portion of the column + population represented by the MCVs</em></span>. We do things this way because + it allows more precise estimation. In this situation + <code class="function">scalarltsel</code> directly applies the condition (e.g., + <span class="quote">“<span class="quote">< 1000</span>”</span>) to each value of the MCV list, and adds up the + frequencies of the MCVs for which the condition is true. This gives + an exact estimate of the selectivity within the portion of the table + that is MCVs. The histogram is then used in the same way as above + to estimate the selectivity in the portion of the table that is not + MCVs, and then the two numbers are combined to estimate the overall + selectivity. For example, consider + +</p><pre class="programlisting"> +EXPLAIN SELECT * FROM tenk1 WHERE stringu1 < 'IAAAAA'; + + QUERY PLAN +------------------------------------------------------------ + Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244) + Filter: (stringu1 < 'IAAAAA'::name) +</pre><p> + + We already saw the MCV information for <code class="structfield">stringu1</code>, + and here is its histogram: + +</p><pre class="programlisting"> +SELECT histogram_bounds FROM pg_stats +WHERE tablename='tenk1' AND attname='stringu1'; + + histogram_bounds +-------------------------------------------------------------------------------- + {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA} +</pre><p> + + Checking the MCV list, we find that the condition <code class="literal">stringu1 < + 'IAAAAA'</code> is satisfied by the first six entries and not the last four, + so the selectivity within the MCV part of the population is + +</p><pre class="programlisting"> +selectivity = sum(relevant mvfs) + = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 + = 0.01833333 +</pre><p> + + Summing all the MCFs also tells us that the total fraction of the + population represented by MCVs is 0.03033333, and therefore the + fraction represented by the histogram is 0.96966667 (again, there + are no nulls, else we'd have to exclude them here). We can see + that the value <code class="literal">IAAAAA</code> falls nearly at the end of the + third histogram bucket. Using some rather cheesy assumptions + about the frequency of different characters, the planner arrives + at the estimate 0.298387 for the portion of the histogram population + that is less than <code class="literal">IAAAAA</code>. We then combine the estimates + for the MCV and non-MCV populations: + +</p><pre class="programlisting"> +selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction + = 0.01833333 + 0.298387 * 0.96966667 + = 0.307669 + +rows = 10000 * 0.307669 + = 3077 (rounding off) +</pre><p> + + In this particular example, the correction from the MCV list is fairly + small, because the column distribution is actually quite flat (the + statistics showing these particular values as being more common than + others are mostly due to sampling error). In a more typical case where + some values are significantly more common than others, this complicated + process gives a useful improvement in accuracy because the selectivity + for the most common values is found exactly. + </p><p> + Now let's consider a case with more than one + condition in the <code class="literal">WHERE</code> clause: + +</p><pre class="programlisting"> +EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx'; + + QUERY PLAN +-------------------------------------------------------------------------------- + Bitmap Heap Scan on tenk1 (cost=23.80..396.91 rows=1 width=244) + Recheck Cond: (unique1 < 1000) + Filter: (stringu1 = 'xxx'::name) + -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0) + Index Cond: (unique1 < 1000) +</pre><p> + + The planner assumes that the two conditions are independent, so that + the individual selectivities of the clauses can be multiplied together: + +</p><pre class="programlisting"> +selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx') + = 0.100697 * 0.0014559 + = 0.0001466 + +rows = 10000 * 0.0001466 + = 1 (rounding off) +</pre><p> + + Notice that the number of rows estimated to be returned from the bitmap + index scan reflects only the condition used with the index; this is + important since it affects the cost estimate for the subsequent heap + fetches. + </p><p> + Finally we will examine a query that involves a join: + +</p><pre class="programlisting"> +EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 +WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2; + + QUERY PLAN +-------------------------------------------------------------------------------------- + Nested Loop (cost=4.64..456.23 rows=50 width=488) + -> Bitmap Heap Scan on tenk1 t1 (cost=4.64..142.17 rows=50 width=244) + Recheck Cond: (unique1 < 50) + -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.63 rows=50 width=0) + Index Cond: (unique1 < 50) + -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..6.27 rows=1 width=244) + Index Cond: (unique2 = t1.unique2) +</pre><p> + + The restriction on <code class="structname">tenk1</code>, + <code class="literal">unique1 < 50</code>, + is evaluated before the nested-loop join. + This is handled analogously to the previous range example. This time the + value 50 falls into the first bucket of the + <code class="structfield">unique1</code> histogram: + +</p><pre class="programlisting"> +selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets + = (0 + (50 - 0)/(993 - 0))/10 + = 0.005035 + +rows = 10000 * 0.005035 + = 50 (rounding off) +</pre><p> + + The restriction for the join is <code class="literal">t2.unique2 = t1.unique2</code>. + The operator is just + our familiar <code class="literal">=</code>, however the selectivity function is + obtained from the <code class="structfield">oprjoin</code> column of + <code class="structname">pg_operator</code>, and is <code class="function">eqjoinsel</code>. + <code class="function">eqjoinsel</code> looks up the statistical information for both + <code class="structname">tenk2</code> and <code class="structname">tenk1</code>: + +</p><pre class="programlisting"> +SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats +WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2'; + +tablename | null_frac | n_distinct | most_common_vals +-----------+-----------+------------+------------------ + tenk1 | 0 | -1 | + tenk2 | 0 | -1 | +</pre><p> + + In this case there is no <acronym class="acronym">MCV</acronym> information for + <code class="structfield">unique2</code> because all the values appear to be + unique, so we use an algorithm that relies only on the number of + distinct values for both relations together with their null fractions: + +</p><pre class="programlisting"> +selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2) + = (1 - 0) * (1 - 0) / max(10000, 10000) + = 0.0001 +</pre><p> + + This is, subtract the null fraction from one for each of the relations, + and divide by the maximum of the numbers of distinct values. + The number of rows + that the join is likely to emit is calculated as the cardinality of the + Cartesian product of the two inputs, multiplied by the + selectivity: + +</p><pre class="programlisting"> +rows = (outer_cardinality * inner_cardinality) * selectivity + = (50 * 10000) * 0.0001 + = 50 +</pre><p> + </p><p> + Had there been MCV lists for the two columns, + <code class="function">eqjoinsel</code> would have used direct comparison of the MCV + lists to determine the join selectivity within the part of the column + populations represented by the MCVs. The estimate for the remainder of the + populations follows the same approach shown here. + </p><p> + Notice that we showed <code class="literal">inner_cardinality</code> as 10000, that is, + the unmodified size of <code class="structname">tenk2</code>. It might appear from + inspection of the <code class="command">EXPLAIN</code> output that the estimate of + join rows comes from 50 * 1, that is, the number of outer rows times + the estimated number of rows obtained by each inner index scan on + <code class="structname">tenk2</code>. But this is not the case: the join relation size + is estimated before any particular join plan has been considered. If + everything is working well then the two ways of estimating the join + size will produce about the same answer, but due to round-off error and + other factors they sometimes diverge significantly. + </p><p> + For those interested in further details, estimation of the size of + a table (before any <code class="literal">WHERE</code> clauses) is done in + <code class="filename">src/backend/optimizer/util/plancat.c</code>. The generic + logic for clause selectivities is in + <code class="filename">src/backend/optimizer/path/clausesel.c</code>. The + operator-specific selectivity functions are mostly found + in <code class="filename">src/backend/utils/adt/selfuncs.c</code>. + </p></div><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navfooter"><hr></hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 72. How the Planner Uses Statistics">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html" title="Chapter 72. How the Planner Uses Statistics">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="72.2. Multivariate Statistics Examples">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 72. How the Planner Uses Statistics </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 14.5 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 72.2. Multivariate Statistics Examples</td></tr></table></div></body></html>
\ No newline at end of file |