diff options
Diffstat (limited to 'src/backend/statistics/README.mcv')
-rw-r--r-- | src/backend/statistics/README.mcv | 103 |
1 files changed, 103 insertions, 0 deletions
diff --git a/src/backend/statistics/README.mcv b/src/backend/statistics/README.mcv new file mode 100644 index 0000000..8455b0d --- /dev/null +++ b/src/backend/statistics/README.mcv @@ -0,0 +1,103 @@ +MCV lists +========= + +Multivariate MCV (most-common values) lists are a straightforward extension of +regular MCV list, tracking most frequent combinations of values for a group of +attributes. + +This works particularly well for columns with a small number of distinct values, +as the list may include all the combinations and approximate the distribution +very accurately. + +For columns with a large number of distinct values (e.g. those with continuous +domains), the list will only track the most frequent combinations. If the +distribution is mostly uniform (all combinations about equally frequent), the +MCV list will be empty. + +Estimates of some clauses (e.g. equality) based on MCV lists are more accurate +than when using histograms. + +Also, MCV lists don't necessarily require sorting of the values (the fact that +we use sorting when building them is implementation detail), but even more +importantly the ordering is not built into the approximation (while histograms +are built on ordering). So MCV lists work well even for attributes where the +ordering of the data type is disconnected from the meaning of the data. For +example we know how to sort strings, but it's unlikely to make much sense for +city names (or other label-like attributes). + + +Selectivity estimation +---------------------- + +The estimation, implemented in mcv_clauselist_selectivity(), is quite simple +in principle - we need to identify MCV items matching all the clauses and sum +frequencies of all those items. + +Currently MCV lists support estimation of the following clause types: + + (a) equality clauses WHERE (a = 1) AND (b = 2) + (b) inequality clauses WHERE (a < 1) AND (b >= 2) + (c) NULL clauses WHERE (a IS NULL) AND (b IS NOT NULL) + (d) OR clauses WHERE (a < 1) OR (b >= 2) + +It's possible to add support for additional clauses, for example: + + (e) multi-var clauses WHERE (a > b) + +and possibly others. These are tasks for the future, not yet implemented. + + +Hashed MCV (not yet implemented) +-------------------------------- + +Regular MCV lists have to include actual values for each item, so if those items +are large the list may be quite large. This is especially true for multivariate +MCV lists, although the current implementation partially mitigates this by +performing de-duplicating the values before storing them on disk. + +It's possible to only store hashes (32-bit values) instead of the actual values, +significantly reducing the space requirements. Obviously, this would only make +the MCV lists useful for estimating equality conditions (assuming the 32-bit +hashes make the collisions rare enough). + +This might also complicate matching the columns to available stats. + + +TODO Consider implementing hashed MCV list, storing just 32-bit hashes instead + of the actual values. This type of MCV list will be useful only for + estimating equality clauses, and will reduce space requirements for large + varlena types (in such cases we usually only want equality anyway). + + +Inspecting the MCV list +----------------------- + +Inspecting the regular (per-attribute) MCV lists is trivial, as it's enough +to select the columns from pg_stats. The data is encoded as anyarrays, and +all the items have the same data type, so anyarray provides a simple way to +get a text representation. + +With multivariate MCV lists the columns may use different data types, making +it impossible to use anyarrays. It might be possible to produce a similar +array-like representation, but that would complicate further processing and +analysis of the MCV list. + +So instead the MCV lists are stored in a custom data type (pg_mcv_list), +which however makes it more difficult to inspect the contents. To make that +easier, there's a SRF returning detailed information about the MCV lists. + + SELECT m.* FROM pg_statistic_ext s, + pg_statistic_ext_data d, + pg_mcv_list_items(stxdmcv) m + WHERE s.stxname = 'stts2' + AND d.stxoid = s.oid; + +It accepts one parameter - a pg_mcv_list value (which can only be obtained +from pg_statistic_ext_data catalog, to defend against malicious input), and +returns these columns: + + - item index (0, ..., (nitems-1)) + - values (string array) + - nulls only (boolean array) + - frequency (double precision) + - base_frequency (double precision) |