1 files changed, 399 insertions, 0 deletions
diff --git a/doc/src/sgml/html/row-estimation-examples.html b/doc/src/sgml/html/row-estimation-examples.html
new file mode 100644
index 0000000..641970f
--- /dev/null
+++ b/doc/src/sgml/html/row-estimation-examples.html
@@ -0,0 +1,399 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>75.1. Row Estimation Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics" /><link rel="next" href="multivariate-statistics-examples.html" title="75.2. Multivariate Statistics Examples" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">75.1. Row Estimation Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 75. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.5 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="75.2. Multivariate Statistics Examples">Next</a></td></tr></table><hr /></div><div class="sect1" id="ROW-ESTIMATION-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">75.1. Row Estimation Examples</h2></div></div></div><a id="id-1.10.26.4.2" class="indexterm"></a><p>
+   The examples shown below use tables in the <span class="productname">PostgreSQL</span>
+   regression test database.
+   The outputs shown are taken from version 8.3.
+   The behavior of earlier (or later) versions might vary.
+   Note also that since <code class="command">ANALYZE</code> uses random sampling
+   while producing statistics, the results will change slightly after
+   any new <code class="command">ANALYZE</code>.
+  </p><p>
+   Let's start with a very simple query:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1;
+
+                         QUERY PLAN
+-------------------------------------------------------------
+ Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
+</pre><p>
+
+   How the planner determines the cardinality of <code class="structname">tenk1</code>
+   is covered in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, but is repeated here for
+   completeness. The number of pages and rows is looked up in
+   <code class="structname">pg_class</code>:
+
+</p><pre class="programlisting">
+SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
+
+ relpages | reltuples
+----------+-----------
+      358 |     10000
+</pre><p>
+
+    These numbers are current as of the last <code class="command">VACUUM</code> or
+    <code class="command">ANALYZE</code> on the table.  The planner then fetches the
+    actual current number of pages in the table (this is a cheap operation,
+    not requiring a table scan).  If that is different from
+    <code class="structfield">relpages</code> then
+    <code class="structfield">reltuples</code> is scaled accordingly to
+    arrive at a current number-of-rows estimate.  In the example above, the value of
+    <code class="structfield">relpages</code> is up-to-date so the rows estimate is
+    the same as <code class="structfield">reltuples</code>.
+  </p><p>
+   Let's move on to an example with a range condition in its
+   <code class="literal">WHERE</code> clause:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000;
+
+                                   QUERY PLAN
+--------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1  (cost=24.06..394.64 rows=1007 width=244)
+   Recheck Cond: (unique1 &lt; 1000)
+   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
+         Index Cond: (unique1 &lt; 1000)
+</pre><p>
+
+   The planner examines the <code class="literal">WHERE</code> clause condition
+   and looks up the selectivity function for the operator
+   <code class="literal">&lt;</code> in <code class="structname">pg_operator</code>.
+   This is held in the column <code class="structfield">oprrest</code>,
+   and the entry in this case is <code class="function">scalarltsel</code>.
+   The <code class="function">scalarltsel</code> function retrieves the histogram for
+   <code class="structfield">unique1</code> from
+   <code class="structname">pg_statistic</code>.  For manual queries it is more
+   convenient to look in the simpler <code class="structname">pg_stats</code>
+   view:
+
+</p><pre class="programlisting">
+SELECT histogram_bounds FROM pg_stats
+WHERE tablename='tenk1' AND attname='unique1';
+
+                   histogram_bounds
+------------------------------------------------------
+ {0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
+</pre><p>
+
+   Next the fraction of the histogram occupied by <span class="quote">“<span class="quote">&lt; 1000</span>”</span>
+   is worked out. This is the selectivity. The histogram divides the range
+   into equal frequency buckets, so all we have to do is locate the bucket
+   that our value is in and count <span class="emphasis"><em>part</em></span> of it and
+   <span class="emphasis"><em>all</em></span> of the ones before. The value 1000 is clearly in
+   the second bucket (993–1997).  Assuming a linear distribution of
+   values inside each bucket, we can calculate the selectivity as:
+
+</p><pre class="programlisting">
+selectivity = (1 + (1000 - bucket[2].min)/(bucket[2].max - bucket[2].min))/num_buckets
+            = (1 + (1000 - 993)/(1997 - 993))/10
+            = 0.100697
+</pre><p>
+
+   that is, one whole bucket plus a linear fraction of the second, divided by
+   the number of buckets. The estimated number of rows can now be calculated as
+   the product of the selectivity and the cardinality of
+   <code class="structname">tenk1</code>:
+
+</p><pre class="programlisting">
+rows = rel_cardinality * selectivity
+     = 10000 * 0.100697
+     = 1007  (rounding off)
+</pre><p>
+  </p><p>
+   Next let's consider an example with an equality condition in its
+   <code class="literal">WHERE</code> clause:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA';
+
+                        QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on tenk1  (cost=0.00..483.00 rows=30 width=244)
+   Filter: (stringu1 = 'CRAAAA'::name)
+</pre><p>
+
+   Again the planner examines the <code class="literal">WHERE</code> clause condition
+   and looks up the selectivity function for <code class="literal">=</code>, which is
+   <code class="function">eqsel</code>.  For equality estimation the histogram is
+   not useful; instead the list of <em class="firstterm">most
+   common values</em> (<acronym class="acronym">MCV</acronym>s) is used to determine the
+   selectivity. Let's have a look at the MCVs, with some additional columns
+   that will be useful later:
+
+</p><pre class="programlisting">
+SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
+WHERE tablename='tenk1' AND attname='stringu1';
+
+null_frac         | 0
+n_distinct        | 676
+most_common_vals  | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}
+most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
+
+</pre><p>
+
+   Since <code class="literal">CRAAAA</code> appears in the list of MCVs, the selectivity is
+   merely the corresponding entry in the list of most common frequencies
+   (<acronym class="acronym">MCF</acronym>s):
+
+</p><pre class="programlisting">
+selectivity = mcf[3]
+            = 0.003
+</pre><p>
+
+   As before, the estimated number of rows is just the product of this with the
+   cardinality of <code class="structname">tenk1</code>:
+
+</p><pre class="programlisting">
+rows = 10000 * 0.003
+     = 30
+</pre><p>
+  </p><p>
+   Now consider the same query, but with a constant that is not in the
+   <acronym class="acronym">MCV</acronym> list:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
+
+                        QUERY PLAN
+----------------------------------------------------------
+ Seq Scan on tenk1  (cost=0.00..483.00 rows=15 width=244)
+   Filter: (stringu1 = 'xxx'::name)
+</pre><p>
+
+   This is quite a different problem: how to estimate the selectivity when the
+   value is <span class="emphasis"><em>not</em></span> in the <acronym class="acronym">MCV</acronym> list.
+   The approach is to use the fact that the value is not in the list,
+   combined with the knowledge of the frequencies for all of the
+   <acronym class="acronym">MCV</acronym>s:
+
+</p><pre class="programlisting">
+selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
+            = (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
+                    0.003 + 0.003 + 0.003 + 0.003))/(676 - 10)
+            = 0.0014559
+</pre><p>
+
+   That is, add up all the frequencies for the <acronym class="acronym">MCV</acronym>s and
+   subtract them from one, then
+   divide by the number of <span class="emphasis"><em>other</em></span> distinct values.
+   This amounts to assuming that the fraction of the column that is not any
+   of the MCVs is evenly distributed among all the other distinct values.
+   Notice that there are no null values so we don't have to worry about those
+   (otherwise we'd subtract the null fraction from the numerator as well).
+   The estimated number of rows is then calculated as usual:
+
+</p><pre class="programlisting">
+rows = 10000 * 0.0014559
+     = 15  (rounding off)
+</pre><p>
+  </p><p>
+   The previous example with <code class="literal">unique1 &lt; 1000</code> was an
+   oversimplification of what <code class="function">scalarltsel</code> really does;
+   now that we have seen an example of the use of MCVs, we can fill in some
+   more detail.  The example was correct as far as it went, because since
+   <code class="structfield">unique1</code> is a unique column it has no MCVs (obviously, no
+   value is any more common than any other value).  For a non-unique
+   column, there will normally be both a histogram and an MCV list, and
+   <span class="emphasis"><em>the histogram does not include the portion of the column
+   population represented by the MCVs</em></span>.  We do things this way because
+   it allows more precise estimation.  In this situation
+   <code class="function">scalarltsel</code> directly applies the condition (e.g.,
+   <span class="quote">“<span class="quote">&lt; 1000</span>”</span>) to each value of the MCV list, and adds up the
+   frequencies of the MCVs for which the condition is true.  This gives
+   an exact estimate of the selectivity within the portion of the table
+   that is MCVs.  The histogram is then used in the same way as above
+   to estimate the selectivity in the portion of the table that is not
+   MCVs, and then the two numbers are combined to estimate the overall
+   selectivity.  For example, consider
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 WHERE stringu1 &lt; 'IAAAAA';
+
+                         QUERY PLAN
+------------------------------------------------------------
+ Seq Scan on tenk1  (cost=0.00..483.00 rows=3077 width=244)
+   Filter: (stringu1 &lt; 'IAAAAA'::name)
+</pre><p>
+
+   We already saw the MCV information for <code class="structfield">stringu1</code>,
+   and here is its histogram:
+
+</p><pre class="programlisting">
+SELECT histogram_bounds FROM pg_stats
+WHERE tablename='tenk1' AND attname='stringu1';
+
+                                histogram_bounds
+--------------------------------------------------------------------------------
+ {AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA}
+</pre><p>
+
+   Checking the MCV list, we find that the condition <code class="literal">stringu1 &lt;
+   'IAAAAA'</code> is satisfied by the first six entries and not the last four,
+   so the selectivity within the MCV part of the population is
+
+</p><pre class="programlisting">
+selectivity = sum(relevant mvfs)
+            = 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
+            = 0.01833333
+</pre><p>
+
+   Summing all the MCFs also tells us that the total fraction of the
+   population represented by MCVs is 0.03033333, and therefore the
+   fraction represented by the histogram is 0.96966667 (again, there
+   are no nulls, else we'd have to exclude them here).  We can see
+   that the value <code class="literal">IAAAAA</code> falls nearly at the end of the
+   third histogram bucket.  Using some rather cheesy assumptions
+   about the frequency of different characters, the planner arrives
+   at the estimate 0.298387 for the portion of the histogram population
+   that is less than <code class="literal">IAAAAA</code>.  We then combine the estimates
+   for the MCV and non-MCV populations:
+
+</p><pre class="programlisting">
+selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
+            = 0.01833333 + 0.298387 * 0.96966667
+            = 0.307669
+
+rows        = 10000 * 0.307669
+            = 3077  (rounding off)
+</pre><p>
+
+   In this particular example, the correction from the MCV list is fairly
+   small, because the column distribution is actually quite flat (the
+   statistics showing these particular values as being more common than
+   others are mostly due to sampling error).  In a more typical case where
+   some values are significantly more common than others, this complicated
+   process gives a useful improvement in accuracy because the selectivity
+   for the most common values is found exactly.
+  </p><p>
+   Now let's consider a case with more than one
+   condition in the <code class="literal">WHERE</code> clause:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000 AND stringu1 = 'xxx';
+
+                                   QUERY PLAN
+--------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1  (cost=23.80..396.91 rows=1 width=244)
+   Recheck Cond: (unique1 &lt; 1000)
+   Filter: (stringu1 = 'xxx'::name)
+   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..23.80 rows=1007 width=0)
+         Index Cond: (unique1 &lt; 1000)
+</pre><p>
+
+   The planner assumes that the two conditions are independent, so that
+   the individual selectivities of the clauses can be multiplied together:
+
+</p><pre class="programlisting">
+selectivity = selectivity(unique1 &lt; 1000) * selectivity(stringu1 = 'xxx')
+            = 0.100697 * 0.0014559
+            = 0.0001466
+
+rows        = 10000 * 0.0001466
+            = 1  (rounding off)
+</pre><p>
+
+   Notice that the number of rows estimated to be returned from the bitmap
+   index scan reflects only the condition used with the index; this is
+   important since it affects the cost estimate for the subsequent heap
+   fetches.
+  </p><p>
+   Finally we will examine a query that involves a join:
+
+</p><pre class="programlisting">
+EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
+WHERE t1.unique1 &lt; 50 AND t1.unique2 = t2.unique2;
+
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Nested Loop  (cost=4.64..456.23 rows=50 width=488)
+   -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.64..142.17 rows=50 width=244)
+         Recheck Cond: (unique1 &lt; 50)
+         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.63 rows=50 width=0)
+               Index Cond: (unique1 &lt; 50)
+   -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..6.27 rows=1 width=244)
+         Index Cond: (unique2 = t1.unique2)
+</pre><p>
+
+   The restriction on <code class="structname">tenk1</code>,
+   <code class="literal">unique1 &lt; 50</code>,
+   is evaluated before the nested-loop join.
+   This is handled analogously to the previous range example.  This time the
+   value 50 falls into the first bucket of the
+   <code class="structfield">unique1</code> histogram:
+
+</p><pre class="programlisting">
+selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
+            = (0 + (50 - 0)/(993 - 0))/10
+            = 0.005035
+
+rows        = 10000 * 0.005035
+            = 50  (rounding off)
+</pre><p>
+
+   The restriction for the join is <code class="literal">t2.unique2 = t1.unique2</code>.
+   The operator is just
+   our familiar <code class="literal">=</code>, however the selectivity function is
+   obtained from the <code class="structfield">oprjoin</code> column of
+   <code class="structname">pg_operator</code>, and is <code class="function">eqjoinsel</code>.
+   <code class="function">eqjoinsel</code> looks up the statistical information for both
+   <code class="structname">tenk2</code> and <code class="structname">tenk1</code>:
+
+</p><pre class="programlisting">
+SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
+WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
+
+tablename  | null_frac | n_distinct | most_common_vals
+-----------+-----------+------------+------------------
+ tenk1     |         0 |         -1 |
+ tenk2     |         0 |         -1 |
+</pre><p>
+
+   In this case there is no <acronym class="acronym">MCV</acronym> information for
+   <code class="structfield">unique2</code> because all the values appear to be
+   unique, so we use an algorithm that relies only on the number of
+   distinct values for both relations together with their null fractions:
+
+</p><pre class="programlisting">
+selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
+            = (1 - 0) * (1 - 0) / max(10000, 10000)
+            = 0.0001
+</pre><p>
+
+   This is, subtract the null fraction from one for each of the relations,
+   and divide by the maximum of the numbers of distinct values.
+   The number of rows
+   that the join is likely to emit is calculated as the cardinality of the
+   Cartesian product of the two inputs, multiplied by the
+   selectivity:
+
+</p><pre class="programlisting">
+rows = (outer_cardinality * inner_cardinality) * selectivity
+     = (50 * 10000) * 0.0001
+     = 50
+</pre><p>
+  </p><p>
+   Had there been MCV lists for the two columns,
+   <code class="function">eqjoinsel</code> would have used direct comparison of the MCV
+   lists to determine the join selectivity within the part of the column
+   populations represented by the MCVs.  The estimate for the remainder of the
+   populations follows the same approach shown here.
+  </p><p>
+   Notice that we showed <code class="literal">inner_cardinality</code> as 10000, that is,
+   the unmodified size of <code class="structname">tenk2</code>.  It might appear from
+   inspection of the <code class="command">EXPLAIN</code> output that the estimate of
+   join rows comes from 50 * 1, that is, the number of outer rows times
+   the estimated number of rows obtained by each inner index scan on
+   <code class="structname">tenk2</code>.  But this is not the case: the join relation size
+   is estimated before any particular join plan has been considered.  If
+   everything is working well then the two ways of estimating the join
+   size will produce about the same answer, but due to round-off error and
+   other factors they sometimes diverge significantly.
+  </p><p>
+   For those interested in further details, estimation of the size of
+   a table (before any <code class="literal">WHERE</code> clauses) is done in
+   <code class="filename">src/backend/optimizer/util/plancat.c</code>. The generic
+   logic for clause selectivities is in
+   <code class="filename">src/backend/optimizer/path/clausesel.c</code>.  The
+   operator-specific selectivity functions are mostly found
+   in <code class="filename">src/backend/utils/adt/selfuncs.c</code>.
+  </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="planner-stats-details.html" title="Chapter 75. How the Planner Uses Statistics">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="multivariate-statistics-examples.html" title="75.2. Multivariate Statistics Examples">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 75. How the Planner Uses Statistics </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.5 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 75.2. Multivariate Statistics Examples</td></tr></table></div></body></html>
+\ No newline at end of file