author     Daniel Baumann <daniel.baumann@progress-linux.org>  2024-05-04 12:15:05 +0000
committer  Daniel Baumann <daniel.baumann@progress-linux.org>  2024-05-04 12:15:05 +0000
commit     46651ce6fe013220ed397add242004d764fc0153 (patch)
tree       6e5299f990f88e60174a1d3ae6e48eedd2688b2b /src/backend/statistics/README
parent     Initial commit. (diff)
Adding upstream version 14.5.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r--  src/backend/statistics/README               101
-rw-r--r--  src/backend/statistics/README.dependencies  116
-rw-r--r--  src/backend/statistics/README.mcv           103

3 files changed, 320 insertions, 0 deletions
diff --git a/src/backend/statistics/README b/src/backend/statistics/README
new file mode 100644
index 0000000..7fda13e
--- /dev/null
+++ b/src/backend/statistics/README
@@ -0,0 +1,101 @@

Extended statistics
===================

When estimating various quantities (e.g. condition selectivities), the default
approach relies on the assumption of independence. In practice that's often
not true, resulting in estimation errors.

Extended statistics track different types of dependencies between the columns,
hopefully improving the estimates and producing better plans.


Types of statistics
-------------------

There are currently three kinds of extended statistics:

    (a) ndistinct coefficients

    (b) soft functional dependencies (README.dependencies)

    (c) MCV lists (README.mcv)


Compatible clause types
-----------------------

Each type of statistics may be used to estimate some subset of clause types.

    (a) functional dependencies - equality clauses (AND), possibly IS NULL

    (b) MCV lists - equality and inequality clauses (AND, OR, NOT),
        IS [NOT] NULL

(ndistinct coefficients are not used to estimate clause selectivities, but to
estimate the number of distinct values, e.g. for GROUP BY.)

Currently, only OpExprs in the form Var op Const or Const op Var are
supported. However, it's feasible to extend the code later to also estimate
selectivities of clauses such as Var op Var.


Complex clauses
---------------

We also support estimating more complex clauses - essentially AND/OR clauses
with (Var op Const) leaves, as long as all the referenced attributes are
covered by a single statistics object.

For example this condition

    (a=1) AND ((b=2) OR ((c=3) AND (d=4)))

may be estimated using statistics on (a,b,c,d). If we only have statistics on
(b,c,d), we may estimate the second part and estimate (a=1) using simple
per-column stats.

If we only have statistics on (a,b,c), we can't apply them at all at this
point, but clauselist_selectivity() works recursively, so when it handles the
second part (the OR clause) we'll be able to apply the statistics there.

Note: The multi-statistics estimation patch also makes it possible to pass
some clauses as 'conditions' into the deeper parts of the expression tree.
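For the planner to use such statistics, a statistics object covering the
referenced columns has to exist. A minimal sketch for the (a,b,c,d) condition
above (the table name 't1' and statistics name 's_abcd' are hypothetical):

    -- hypothetical table with the four columns referenced above
    CREATE TABLE t1 (a int, b int, c int, d int);

    -- build statistics on (a,b,c,d); with the kind list omitted, all
    -- supported statistics kinds are built, or a list such as (mcv)
    -- may be given explicitly
    CREATE STATISTICS s_abcd ON a, b, c, d FROM t1;

    ANALYZE t1;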

Selectivity estimation
----------------------

Throughout the planner, clauselist_selectivity() remains in charge of most
selectivity estimate requests. clauselist_selectivity() can be instructed to
try to make use of any extended statistics on the given RelOptInfo, which it
will do if:

    (a) An actual valid RelOptInfo was given. Join relations are passed in as
        NULL and are therefore invalid.

    (b) The given relation has extended statistics defined, and those
        statistics have actually been built.

When the above conditions are met, clauselist_selectivity() first attempts to
pass the clause list off to the extended statistics selectivity estimation
function. This function may not find any clauses it can estimate; such
clauses are simply ignored. When actual estimation work is performed, the
function is expected to mark the clauses it has estimated, so that any other
function performing estimations knows which clauses to skip.


Size of sample in ANALYZE
-------------------------

When performing ANALYZE, the number of rows to sample is determined as

    (300 * statistics_target)

so with the default statistics target of 100, that's 30,000 rows. That works
reasonably well for statistics on individual columns, but perhaps it's not
enough for extended statistics. Papers analyzing estimation errors all use
samples proportional to the table size (usually finding that 1-3% of the
table is enough to build accurate stats).

The requested accuracy (number of MCV items or histogram bins) should also
be considered when determining the sample size, and in extended statistics
those are not necessarily limited by statistics_target.

This however merits further discussion, because collecting the sample is
quite expensive, and increasing it further would make ANALYZE even more
painful. Judging by experiments with the current implementation, the fixed
size seems to work reasonably well for now, so we leave this as future work.

diff --git a/src/backend/statistics/README.dependencies b/src/backend/statistics/README.dependencies
new file mode 100644
index 0000000..6c446bd
--- /dev/null
+++ b/src/backend/statistics/README.dependencies
@@ -0,0 +1,116 @@

Soft functional dependencies
============================

Functional dependencies are a concept well described in relational theory,
particularly in the definition of normalization and "normal forms". Wikipedia
has a nice definition of a functional dependency [1]:

    In a given table, an attribute Y is said to have a functional dependency
    on a set of attributes X (written X -> Y) if and only if each X value is
    associated with precisely one Y value. For example, in an "Employee"
    table that includes the attributes "Employee ID" and "Employee Date of
    Birth", the functional dependency

        {Employee ID} -> {Employee Date of Birth}

    would hold. It follows from the previous two sentences that each
    {Employee ID} is associated with precisely one {Employee Date of Birth}.

    [1] https://en.wikipedia.org/wiki/Functional_dependency

In practical terms, a functional dependency means that a value in one column
determines the value in another column. Consider for example this trivial
table with two integer columns:

    CREATE TABLE t (a INT, b INT)
        AS SELECT i, i/10 FROM generate_series(1,100000) s(i);

Clearly, knowledge of the value in column 'a' is sufficient to determine the
value in column 'b', as it's simply (a/10). A more practical example is
addresses, where knowledge of a ZIP code (usually) determines the city.
Larger cities may have multiple ZIP codes, so the dependency can't be
reversed.

Many datasets could be normalized so as not to contain such dependencies,
but often that's not practical for various reasons. In some cases it's
actually a conscious design choice to model the dataset in a denormalized
way, either because of performance or to make querying easier.


Soft dependencies
-----------------

Real-world data sets often contain data errors, either because of data entry
mistakes (a user mistyping the ZIP code) or issues in generating the data
(e.g. a ZIP code mistakenly assigned to two cities in different states).

A strict implementation would either ignore dependencies in such cases,
rendering the approach mostly useless even for slightly noisy data sets, or
result in sudden changes in behavior depending on minor differences between
samples provided to ANALYZE.

For this reason, extended statistics implement "soft" functional
dependencies, associating each functional dependency with a degree of
validity (a number between 0 and 1). This degree is then used to combine
selectivities in a smooth manner.
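For the trivial table above, the dependency (a => b) can be built and
inspected like this (a sketch; the statistics name 's_dep' is hypothetical,
and the exact degree depends on the sample collected by ANALYZE):

    CREATE STATISTICS s_dep (dependencies) ON a, b FROM t;
    ANALYZE t;

    -- show the mined dependencies and their degrees of validity;
    -- for this noise-free table the degree should be (close to) 1.0
    SELECT s.stxname, d.stxddependencies
      FROM pg_statistic_ext s
      JOIN pg_statistic_ext_data d ON (d.stxoid = s.oid)
     WHERE s.stxname = 's_dep';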

Mining dependencies (ANALYZE)
-----------------------------

The current algorithm is fairly simple - generate all possible functional
dependencies, and for each one count the number of rows consistent with it.
Then use the fraction of rows (supporting/total) as the degree.

To count the rows consistent with the dependency (a => b):

    (a) Sort the data lexicographically, i.e. first by 'a', then by 'b'.

    (b) For each group of rows with the same 'a' value, count the number of
        distinct values in 'b'.

    (c) If there's a single distinct value in 'b', the rows are consistent
        with the functional dependency; otherwise they contradict it.


Clause reduction (planner/optimizer)
------------------------------------

Applying the functional dependencies is fairly simple: given a list of
equality clauses, we compute the selectivity of each clause and then combine
them using the degree in this formula

    P(a=?,b=?) = P(a=?) * (d + (1-d) * P(b=?))

where 'd' is the degree of the functional dependency (a => b).

With more than two equality clauses, this process happens recursively. For
example for (a,b,c) we first use (a,b => c) to break the computation into

    P(a=?,b=?,c=?) = P(a=?,b=?) * (e + (1-e) * P(c=?))

where 'e' is the degree of the functional dependency (a,b => c); then we can
apply (a => b) the same way on P(a=?,b=?).
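As a worked example (with made-up numbers): assume d = 0.8, P(a=?) = 0.01 and
P(b=?) = 0.1. The formula then gives

    P(a=?,b=?) = 0.01 * (0.8 + 0.2 * 0.1) = 0.0082

i.e. close to P(a=?) alone, reflecting that 'b' is mostly determined by 'a'.
The independence assumption would instead estimate 0.01 * 0.1 = 0.001, an
8x under-estimate.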

Consistency of clauses
----------------------

Functional dependencies only express general dependencies between columns,
without referencing particular values. This assumes that the equality clauses
are in fact consistent with the functional dependency, i.e. that given a
dependency (a => b), the value in the (b=?) clause is the value determined by
(a=?). If that's not the case, the clauses are "inconsistent" with the
functional dependency and the result will be an over-estimate.

This may happen, for example, when using conditions on the ZIP code and city
name with mismatching values (a ZIP code for a different city), etc. In such
a case the result set will be empty, but we'll estimate the selectivity
using the ZIP code condition.

In this case, the default estimation based on the attribute value
independence assumption happens to work better, but mostly by chance.

This issue is the price for the simplicity of functional dependencies. If the
application frequently constructs queries with clauses inconsistent with the
functional dependencies present in the data, the best solution is not to use
functional dependencies, but one of the more complex types of statistics.

diff --git a/src/backend/statistics/README.mcv b/src/backend/statistics/README.mcv
new file mode 100644
index 0000000..8455b0d
--- /dev/null
+++ b/src/backend/statistics/README.mcv
@@ -0,0 +1,103 @@

MCV lists
=========

Multivariate MCV (most-common values) lists are a straightforward extension
of regular per-column MCV lists, tracking the most frequent combinations of
values for a group of attributes.

This works particularly well for columns with a small number of distinct
values, as the list may include all the combinations and approximate the
distribution very accurately.

For columns with a large number of distinct values (e.g. those with
continuous domains), the list will only track the most frequent combinations.
If the distribution is mostly uniform (all combinations about equally
frequent), the MCV list will be empty.

Estimates of some clauses (e.g. equality) based on MCV lists are more
accurate than those based on histograms.

Also, MCV lists don't necessarily require sorting of the values (the fact
that we use sorting when building them is an implementation detail), but more
importantly, no ordering is built into the approximation (while histograms
are built on ordering). So MCV lists work well even for attributes where the
ordering of the data type is disconnected from the meaning of the data. For
example, we know how to sort strings, but the ordering is unlikely to mean
much for city names (or other label-like attributes).


Selectivity estimation
----------------------

The estimation, implemented in mcv_clauselist_selectivity(), is quite simple
in principle - we need to identify the MCV items matching all the clauses and
sum the frequencies of those items.

Currently, MCV lists support estimation of the following clause types:

    (a) equality clauses      WHERE (a = 1) AND (b = 2)
    (b) inequality clauses    WHERE (a < 1) AND (b >= 2)
    (c) NULL clauses          WHERE (a IS NULL) AND (b IS NOT NULL)
    (d) OR clauses            WHERE (a < 1) OR (b >= 2)

It's possible to add support for additional clause types, for example:

    (e) multi-var clauses     WHERE (a > b)

and possibly others. These are tasks for the future, not yet implemented.
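As an illustration of the basic use case (a sketch; the table 't2' and
statistics name 's_mcv' are hypothetical), consider two perfectly correlated
columns:

    CREATE TABLE t2 (a int, b int);
    INSERT INTO t2 SELECT mod(i,100), mod(i,100)
      FROM generate_series(1,100000) s(i);

    CREATE STATISTICS s_mcv (mcv) ON a, b FROM t2;
    ANALYZE t2;

    -- the estimate should now come from the frequency of the (0,0) MCV item
    -- (about 1% of the table), not from multiplying the per-column
    -- selectivities (which would give about 0.01%)
    EXPLAIN SELECT * FROM t2 WHERE (a = 0) AND (b = 0);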

Hashed MCV (not yet implemented)
--------------------------------

Regular MCV lists have to include the actual values for each item, so if
those values are large, the list may be quite large. This is especially true
for multivariate MCV lists, although the current implementation partially
mitigates this by de-duplicating the values before storing them on disk.

It's possible to store only hashes (32-bit values) instead of the actual
values, significantly reducing the space requirements. Obviously, this would
only make the MCV lists useful for estimating equality conditions (assuming
the 32-bit hashes make collisions rare enough).

This might also complicate matching the columns to available stats.

TODO Consider implementing a hashed MCV list, storing just 32-bit hashes
     instead of the actual values. This type of MCV list will be useful only
     for estimating equality clauses, and will reduce space requirements for
     large varlena types (in such cases we usually only want equality
     anyway).


Inspecting the MCV list
-----------------------

Inspecting the regular (per-attribute) MCV lists is trivial, as it's enough
to select the columns from pg_stats. The data is encoded as anyarrays, and
all the items have the same data type, so anyarray provides a simple way to
get a text representation.

With multivariate MCV lists the columns may use different data types, making
it impossible to use anyarrays. It might be possible to produce a similar
array-like representation, but that would complicate further processing and
analysis of the MCV list.

So instead the MCV lists are stored in a custom data type (pg_mcv_list),
which however makes it more difficult to inspect the contents. To make that
easier, there's a SRF returning detailed information about the MCV lists.

    SELECT m.* FROM pg_statistic_ext s,
                    pg_statistic_ext_data d,
                    pg_mcv_list_items(stxdmcv) m
     WHERE s.stxname = 'stts2'
       AND d.stxoid = s.oid;

It accepts one parameter - a pg_mcv_list value (which can only be obtained
from the pg_statistic_ext_data catalog, to defend against malicious input) -
and returns these columns:

    - item index (0, ..., (nitems-1))
    - values (string array)
    - nulls only (boolean array)
    - frequency (double precision)
    - base_frequency (double precision)
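Since base_frequency is the frequency the item would have if the columns were
independent, comparing it with the actual frequency shows which combinations
the MCV list corrects the most. A sketch (reusing the 'stts2' statistics name
from the example above):

    -- items where the actual frequency deviates most from the
    -- independence-based estimate; NULLIF avoids division by zero
    SELECT m.index, m.values, m.frequency, m.base_frequency
      FROM pg_statistic_ext s,
           pg_statistic_ext_data d,
           pg_mcv_list_items(d.stxdmcv) m
     WHERE s.stxname = 'stts2'
       AND d.stxoid = s.oid
     ORDER BY m.frequency / NULLIF(m.base_frequency, 0) DESC NULLS LAST;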