summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/html/row-estimation-examples.html
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 12:15:05 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 12:15:05 +0000
commit46651ce6fe013220ed397add242004d764fc0153 (patch)
tree6e5299f990f88e60174a1d3ae6e48eedd2688b2b /doc/src/sgml/html/row-estimation-examples.html
parentInitial commit. (diff)
downloadpostgresql-14-46651ce6fe013220ed397add242004d764fc0153.tar.xz
postgresql-14-46651ce6fe013220ed397add242004d764fc0153.zip
Adding upstream version 14.5.upstream/14.5upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/src/sgml/html/row-estimation-examples.html')
-rw-r--r--doc/src/sgml/html/row-estimation-examples.html399
1 files changed, 399 insertions, 0 deletions
diff --git a/doc/src/sgml/html/row-estimation-examples.html b/doc/src/sgml/html/row-estimation-examples.html
new file mode 100644
index 0000000..3352502
--- /dev/null
+++ b/doc/src/sgml/html/row-estimation-examples.html
@@ -0,0 +1,399 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>72.1. Row Estimation Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="planner-stats-details.html" title="Chapter 72. How the Planner Uses Statistics" /><link rel="next" href="multivariate-statistics-examples.html" title="72.2. Multivariate Statistics Examples" /></head><body id="docContent" class="container-fluid col-10"><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">72.1. Row Estimation Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 72. How the Planner Uses Statistics">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 72. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 72. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 14.5 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="72.2. Multivariate Statistics Examples">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="ROW-ESTIMATION-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">72.1. Row Estimation Examples</h2></div></div></div><a id="id-1.10.24.4.2" class="indexterm"></a><p>
+ The examples shown below use tables in the <span class="productname">PostgreSQL</span>
+ regression test database.
+ The outputs shown are taken from version 8.3.
+ The behavior of earlier (or later) versions might vary.
+ Note also that since <code class="command">ANALYZE</code> uses random sampling
+ while producing statistics, the results will change slightly after
+ any new <code class="command">ANALYZE</code>.
+ </p><p>
+ Let's start with a very simple query:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1;
+
+ QUERY PLAN
+-------------------------------------------------------------
+ Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
+</pre><p>
+
+ How the planner determines the cardinality of <code class="structname">tenk1</code>
+ is covered in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, but is repeated here for
+ completeness. The number of pages and rows is looked up in
+ <code class="structname">pg_class</code>:
+
+</p><pre class="programlisting">
+SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
+
+ relpages | reltuples
+----------+-----------
+ 358 | 10000
+</pre><p>
+
+ These numbers are current as of the last <code class="command">VACUUM</code> or
+ <code class="command">ANALYZE</code> on the table. The planner then fetches the
+ actual current number of pages in the table (this is a cheap operation,
+ not requiring a table scan). If that is different from
+ <code class="structfield">relpages</code> then
+ <code class="structfield">reltuples</code> is scaled accordingly to
+ arrive at a current number-of-rows estimate. In the example above, the value of
+ <code class="structfield">relpages</code> is up-to-date so the rows estimate is
+ the same as <code class="structfield">reltuples</code>.
+ </p><p>
+ Let's move on to an example with a range condition in its
+ <code class="literal">WHERE</code> clause:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000;
+
+ QUERY PLAN
+-------------------------------------------------------------------​-------------
+ Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244)
+ Recheck Cond: (unique1 &lt; 1000)
+ -&gt; Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
+ Index Cond: (unique1 &lt; 1000)
+</pre><p>
+
+ The planner examines the <code class="literal">WHERE</code> clause condition
+ and looks up the selectivity function for the operator
+ <code class="literal">&lt;</code> in <code class="structname">pg_operator</code>.
+ This is held in the column <code class="structfield">oprrest</code>,
+ and the entry in this case is <code class="function">scalarltsel</code>.
+ The <code class="function">scalarltsel</code> function retrieves the histogram for
+ <code class="structfield">unique1</code> from
+ <code class="structname">pg_statistic</code>. For manual queries it is more
+ convenient to look in the simpler <code class="structname">pg_stats</code>
+ view:
+
+</p><pre class="programlisting">
+SELECT histogram_bounds FROM pg_stats
+WHERE tablename='tenk1' AND attname='unique1';
+
+ histogram_bounds
+------------------------------------------------------
+ {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
+</pre><p>
+
+ Next the fraction of the histogram occupied by <span class="quote">“<span class="quote">&lt; 1000</span>”</span>
+ is worked out. This is the selectivity. The histogram divides the range
+ into equal frequency buckets, so all we have to do is locate the bucket
+ that our value is in and count <span class="emphasis"><em>part</em></span> of it and
+ <span class="emphasis"><em>all</em></span> of the ones before. The value 1000 is clearly in
+ the second bucket (993–1997). Assuming a linear distribution of
+ values inside each bucket, we can calculate the selectivity as:
+
+</p><pre class="programlisting">
+selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
+ = (1 + (1000 - 993)/(1997 - 993))/10
+ = 0.100697
+</pre><p>
+
+ that is, one whole bucket plus a linear fraction of the second, divided by
+ the number of buckets. The estimated number of rows can now be calculated as
+ the product of the selectivity and the cardinality of
+ <code class="structname">tenk1</code>:
+
+</p><pre class="programlisting">
+rows = rel_cardinality * selectivity
+ = 10000 * 0.100697
+ = 1007 (rounding off)
+</pre><p>
+ </p><p>
+ Next let's consider an example with an equality condition in its
+ <code class="literal">WHERE</code> clause:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';
+
+ QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on tenk1 (cost=0.00..483.00 rows=30 width=244)
+ Filter: (stringu1 = 'CRAAAA'::name)
+</pre><p>
+
+ Again the planner examines the <code class="literal">WHERE</code> clause condition
+ and looks up the selectivity function for <code class="literal">=</code>, which is
+ <code class="function">eqsel</code>. For equality estimation the histogram is
+ not useful; instead the list of <em class="firstterm">most
+ common values</em> (<acronym class="acronym">MCV</acronym>s) is used to determine the
+ selectivity. Let's have a look at the MCVs, with some additional columns
+ that will be useful later:
+
+</p><pre class="programlisting">
+SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
+WHERE tablename='tenk1' AND attname='stringu1';
+
+null_frac | 0
+n_distinct | 676
+most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,​JOAAAA,MCAAAA,NAAAAA,WGAAAA}
+most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,​0.003,0.003,0.003,0.003}
+
+</pre><p>
+
+ Since <code class="literal">CRAAAA</code> appears in the list of MCVs, the selectivity is
+ merely the corresponding entry in the list of most common frequencies
+ (<acronym class="acronym">MCF</acronym>s):
+
+</p><pre class="programlisting">
+selectivity = mcf[3]
+ = 0.003
+</pre><p>
+
+ As before, the estimated number of rows is just the product of this with the
+ cardinality of <code class="structname">tenk1</code>:
+
+</p><pre class="programlisting">
+rows = 10000 * 0.003
+ = 30
+</pre><p>
+ </p><p>
+ Now consider the same query, but with a constant that is not in the
+ <acronym class="acronym">MCV</acronym> list:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
+
+ QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244)
+ Filter: (stringu1 = 'xxx'::name)
+</pre><p>
+
+ This is quite a different problem: how to estimate the selectivity when the
+ value is <span class="emphasis"><em>not</em></span> in the <acronym class="acronym">MCV</acronym> list.
+ The approach is to use the fact that the value is not in the list,
+ combined with the knowledge of the frequencies for all of the
+ <acronym class="acronym">MCV</acronym>s:
+
+</p><pre class="programlisting">
+selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
+ = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
+ 0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
+ = 0.0014559
+</pre><p>
+
+ That is, add up all the frequencies for the <acronym class="acronym">MCV</acronym>s and
+ subtract them from one, then
+ divide by the number of <span class="emphasis"><em>other</em></span> distinct values.
+ This amounts to assuming that the fraction of the column that is not any
+ of the MCVs is evenly distributed among all the other distinct values.
+ Notice that there are no null values so we don't have to worry about those
+ (otherwise we'd subtract the null fraction from the numerator as well).
+ The estimated number of rows is then calculated as usual:
+
+</p><pre class="programlisting">
+rows = 10000 * 0.0014559
+ = 15 (rounding off)
+</pre><p>
+ </p><p>
+ The previous example with <code class="literal">unique1 &lt; 1000</code> was an
+ oversimplification of what <code class="function">scalarltsel</code> really does;
+ now that we have seen an example of the use of MCVs, we can fill in some
+ more detail. The example was correct as far as it went, because since
+ <code class="structfield">unique1</code> is a unique column it has no MCVs (obviously, no
+ value is any more common than any other value). For a non-unique
+ column, there will normally be both a histogram and an MCV list, and
+ <span class="emphasis"><em>the histogram does not include the portion of the column
+ population represented by the MCVs</em></span>. We do things this way because
+ it allows more precise estimation. In this situation
+ <code class="function">scalarltsel</code> directly applies the condition (e.g.,
+ <span class="quote">“<span class="quote">&lt; 1000</span>”</span>) to each value of the MCV list, and adds up the
+ frequencies of the MCVs for which the condition is true. This gives
+ an exact estimate of the selectivity within the portion of the table
+ that is MCVs. The histogram is then used in the same way as above
+ to estimate the selectivity in the portion of the table that is not
+ MCVs, and then the two numbers are combined to estimate the overall
+ selectivity. For example, consider
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 &lt; 'IAAAAA';
+
+ QUERY PLAN
+------------------------------------------------------------
+ Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244)
+ Filter: (stringu1 &lt; 'IAAAAA'::name)
+</pre><p>
+
+ We already saw the MCV information for <code class="structfield">stringu1</code>,
+ and here is its histogram:
+
+</p><pre class="programlisting">
+SELECT histogram_bounds FROM pg_stats
+WHERE tablename='tenk1' AND attname='stringu1';
+
+ histogram_bounds
+-------------------------------------------------------------------​-------------
+ {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,​XLAAAA,ZZAAAA}
+</pre><p>
+
+ Checking the MCV list, we find that the condition <code class="literal">stringu1 &lt;
+ 'IAAAAA'</code> is satisfied by the first six entries and not the last four,
+ so the selectivity within the MCV part of the population is
+
+</p><pre class="programlisting">
+selectivity = sum(relevant mvfs)
+ = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
+ = 0.01833333
+</pre><p>
+
+ Summing all the MCFs also tells us that the total fraction of the
+ population represented by MCVs is 0.03033333, and therefore the
+ fraction represented by the histogram is 0.96966667 (again, there
+ are no nulls, else we'd have to exclude them here). We can see
+ that the value <code class="literal">IAAAAA</code> falls nearly at the end of the
+ third histogram bucket. Using some rather cheesy assumptions
+ about the frequency of different characters, the planner arrives
+ at the estimate 0.298387 for the portion of the histogram population
+ that is less than <code class="literal">IAAAAA</code>. We then combine the estimates
+ for the MCV and non-MCV populations:
+
+</p><pre class="programlisting">
+selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
+ = 0.01833333 + 0.298387 * 0.96966667
+ = 0.307669
+
+rows = 10000 * 0.307669
+ = 3077 (rounding off)
+</pre><p>
+
+ In this particular example, the correction from the MCV list is fairly
+ small, because the column distribution is actually quite flat (the
+ statistics showing these particular values as being more common than
+ others are mostly due to sampling error). In a more typical case where
+ some values are significantly more common than others, this complicated
+ process gives a useful improvement in accuracy because the selectivity
+ for the most common values is found exactly.
+ </p><p>
+ Now let's consider a case with more than one
+ condition in the <code class="literal">WHERE</code> clause:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000 AND stringu1 = 'xxx';
+
+ QUERY PLAN
+-------------------------------------------------------------------​-------------
+ Bitmap Heap Scan on tenk1 (cost=23.80..396.91 rows=1 width=244)
+ Recheck Cond: (unique1 &lt; 1000)
+ Filter: (stringu1 = 'xxx'::name)
+ -&gt; Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
+ Index Cond: (unique1 &lt; 1000)
+</pre><p>
+
+ The planner assumes that the two conditions are independent, so that
+ the individual selectivities of the clauses can be multiplied together:
+
+</p><pre class="programlisting">
+selectivity = selectivity(unique1 &lt; 1000) * selectivity(stringu1 = 'xxx')
+ = 0.100697 * 0.0014559
+ = 0.0001466
+
+rows = 10000 * 0.0001466
+ = 1 (rounding off)
+</pre><p>
+
+ Notice that the number of rows estimated to be returned from the bitmap
+ index scan reflects only the condition used with the index; this is
+ important since it affects the cost estimate for the subsequent heap
+ fetches.
+ </p><p>
+ Finally we will examine a query that involves a join:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
+WHERE t1.unique1 &lt; 50 AND t1.unique2 = t2.unique2;
+
+ QUERY PLAN
+-------------------------------------------------------------------​-------------------
+ Nested Loop (cost=4.64..456.23 rows=50 width=488)
+ -&gt; Bitmap Heap Scan on tenk1 t1 (cost=4.64..142.17 rows=50 width=244)
+ Recheck Cond: (unique1 &lt; 50)
+ -&gt; Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.63 rows=50 width=0)
+ Index Cond: (unique1 &lt; 50)
+ -&gt; Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..6.27 rows=1 width=244)
+ Index Cond: (unique2 = t1.unique2)
+</pre><p>
+
+ The restriction on <code class="structname">tenk1</code>,
+ <code class="literal">unique1 &lt; 50</code>,
+ is evaluated before the nested-loop join.
+ This is handled analogously to the previous range example. This time the
+ value 50 falls into the first bucket of the
+ <code class="structfield">unique1</code> histogram:
+
+</p><pre class="programlisting">
+selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
+ = (0 + (50 - 0)/(993 - 0))/10
+ = 0.005035
+
+rows = 10000 * 0.005035
+ = 50 (rounding off)
+</pre><p>
+
+ The restriction for the join is <code class="literal">t2.unique2 = t1.unique2</code>.
+ The operator is just
+ our familiar <code class="literal">=</code>, however the selectivity function is
+ obtained from the <code class="structfield">oprjoin</code> column of
+ <code class="structname">pg_operator</code>, and is <code class="function">eqjoinsel</code>.
+ <code class="function">eqjoinsel</code> looks up the statistical information for both
+ <code class="structname">tenk2</code> and <code class="structname">tenk1</code>:
+
+</p><pre class="programlisting">
+SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
+WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
+
+tablename | null_frac | n_distinct | most_common_vals
+-----------+-----------+------------+------------------
+ tenk1 | 0 | -1 |
+ tenk2 | 0 | -1 |
+</pre><p>
+
+ In this case there is no <acronym class="acronym">MCV</acronym> information for
+ <code class="structfield">unique2</code> because all the values appear to be
+ unique, so we use an algorithm that relies only on the number of
+ distinct values for both relations together with their null fractions:
+
+</p><pre class="programlisting">
+selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
+ = (1 - 0) * (1 - 0) / max(10000, 10000)
+ = 0.0001
+</pre><p>
+
+ This is, subtract the null fraction from one for each of the relations,
+ and divide by the maximum of the numbers of distinct values.
+ The number of rows
+ that the join is likely to emit is calculated as the cardinality of the
+ Cartesian product of the two inputs, multiplied by the
+ selectivity:
+
+</p><pre class="programlisting">
+rows = (outer_cardinality * inner_cardinality) * selectivity
+ = (50 * 10000) * 0.0001
+ = 50
+</pre><p>
+ </p><p>
+ Had there been MCV lists for the two columns,
+ <code class="function">eqjoinsel</code> would have used direct comparison of the MCV
+ lists to determine the join selectivity within the part of the column
+ populations represented by the MCVs. The estimate for the remainder of the
+ populations follows the same approach shown here.
+ </p><p>
+ Notice that we showed <code class="literal">inner_cardinality</code> as 10000, that is,
+ the unmodified size of <code class="structname">tenk2</code>. It might appear from
+ inspection of the <code class="command">EXPLAIN</code> output that the estimate of
+ join rows comes from 50 * 1, that is, the number of outer rows times
+ the estimated number of rows obtained by each inner index scan on
+ <code class="structname">tenk2</code>. But this is not the case: the join relation size
+ is estimated before any particular join plan has been considered. If
+ everything is working well then the two ways of estimating the join
+ size will produce about the same answer, but due to round-off error and
+ other factors they sometimes diverge significantly.
+ </p><p>
+ For those interested in further details, estimation of the size of
+ a table (before any <code class="literal">WHERE</code> clauses) is done in
+ <code class="filename">src/backend/optimizer/util/plancat.c</code>. The generic
+ logic for clause selectivities is in
+ <code class="filename">src/backend/optimizer/path/clausesel.c</code>. The
+ operator-specific selectivity functions are mostly found
+ in <code class="filename">src/backend/utils/adt/selfuncs.c</code>.
+ </p></div><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navfooter"><hr></hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 72. How the Planner Uses Statistics">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html" title="Chapter 72. How the Planner Uses Statistics">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="72.2. Multivariate Statistics Examples">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 72. How the Planner Uses Statistics </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 14.5 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 72.2. Multivariate Statistics Examples</td></tr></table></div></body></html> \ No newline at end of file