summaryrefslogtreecommitdiffstats
path: root/src/backend/statistics/README.mcv
diff options
context:
space:
mode:
Diffstat (limited to 'src/backend/statistics/README.mcv')
-rw-r--r--src/backend/statistics/README.mcv103
1 files changed, 103 insertions, 0 deletions
diff --git a/src/backend/statistics/README.mcv b/src/backend/statistics/README.mcv
new file mode 100644
index 0000000..8455b0d
--- /dev/null
+++ b/src/backend/statistics/README.mcv
@@ -0,0 +1,103 @@
+MCV lists
+=========
+
+Multivariate MCV (most-common values) lists are a straightforward extension of
+regular MCV list, tracking most frequent combinations of values for a group of
+attributes.
+
+This works particularly well for columns with a small number of distinct values,
+as the list may include all the combinations and approximate the distribution
+very accurately.
+
+For columns with a large number of distinct values (e.g. those with continuous
+domains), the list will only track the most frequent combinations. If the
+distribution is mostly uniform (all combinations about equally frequent), the
+MCV list will be empty.
+
+Estimates of some clauses (e.g. equality) based on MCV lists are more accurate
+than when using histograms.
+
+Also, MCV lists don't necessarily require sorting of the values (the fact that
+we use sorting when building them is implementation detail), but even more
+importantly the ordering is not built into the approximation (while histograms
+are built on ordering). So MCV lists work well even for attributes where the
+ordering of the data type is disconnected from the meaning of the data. For
+example we know how to sort strings, but it's unlikely to make much sense for
+city names (or other label-like attributes).
+
+
+Selectivity estimation
+----------------------
+
+The estimation, implemented in mcv_clauselist_selectivity(), is quite simple
+in principle - we need to identify MCV items matching all the clauses and sum
+frequencies of all those items.
+
+Currently MCV lists support estimation of the following clause types:
+
+ (a) equality clauses WHERE (a = 1) AND (b = 2)
+ (b) inequality clauses WHERE (a < 1) AND (b >= 2)
+ (c) NULL clauses WHERE (a IS NULL) AND (b IS NOT NULL)
+ (d) OR clauses WHERE (a < 1) OR (b >= 2)
+
+It's possible to add support for additional clauses, for example:
+
+ (e) multi-var clauses WHERE (a > b)
+
+and possibly others. These are tasks for the future, not yet implemented.
+
+
+Hashed MCV (not yet implemented)
+--------------------------------
+
+Regular MCV lists have to include actual values for each item, so if those items
+are large the list may be quite large. This is especially true for multivariate
+MCV lists, although the current implementation partially mitigates this by
+performing de-duplicating the values before storing them on disk.
+
+It's possible to only store hashes (32-bit values) instead of the actual values,
+significantly reducing the space requirements. Obviously, this would only make
+the MCV lists useful for estimating equality conditions (assuming the 32-bit
+hashes make the collisions rare enough).
+
+This might also complicate matching the columns to available stats.
+
+
+TODO Consider implementing hashed MCV list, storing just 32-bit hashes instead
+ of the actual values. This type of MCV list will be useful only for
+ estimating equality clauses, and will reduce space requirements for large
+ varlena types (in such cases we usually only want equality anyway).
+
+
+Inspecting the MCV list
+-----------------------
+
+Inspecting the regular (per-attribute) MCV lists is trivial, as it's enough
+to select the columns from pg_stats. The data is encoded as anyarrays, and
+all the items have the same data type, so anyarray provides a simple way to
+get a text representation.
+
+With multivariate MCV lists the columns may use different data types, making
+it impossible to use anyarrays. It might be possible to produce a similar
+array-like representation, but that would complicate further processing and
+analysis of the MCV list.
+
+So instead the MCV lists are stored in a custom data type (pg_mcv_list),
+which however makes it more difficult to inspect the contents. To make that
+easier, there's a SRF returning detailed information about the MCV lists.
+
+ SELECT m.* FROM pg_statistic_ext s,
+ pg_statistic_ext_data d,
+ pg_mcv_list_items(stxdmcv) m
+ WHERE s.stxname = 'stts2'
+ AND d.stxoid = s.oid;
+
+It accepts one parameter - a pg_mcv_list value (which can only be obtained
+from pg_statistic_ext_data catalog, to defend against malicious input), and
+returns these columns:
+
+ - item index (0, ..., (nitems-1))
+ - values (string array)
+ - nulls only (boolean array)
+ - frequency (double precision)
+ - base_frequency (double precision)