diff options
Diffstat (limited to 'src/backend/statistics/README.dependencies')
-rw-r--r-- | src/backend/statistics/README.dependencies | 116 |
1 files changed, 116 insertions, 0 deletions
diff --git a/src/backend/statistics/README.dependencies b/src/backend/statistics/README.dependencies new file mode 100644 index 0000000..6c446bd --- /dev/null +++ b/src/backend/statistics/README.dependencies @@ -0,0 +1,116 @@ +Soft functional dependencies +============================ + +Functional dependencies are a concept well described in relational theory, +particularly in the definition of normalization and "normal forms". Wikipedia +has a nice definition of a functional dependency [1]: + + In a given table, an attribute Y is said to have a functional dependency + on a set of attributes X (written X -> Y) if and only if each X value is + associated with precisely one Y value. For example, in an "Employee" + table that includes the attributes "Employee ID" and "Employee Date of + Birth", the functional dependency + + {Employee ID} -> {Employee Date of Birth} + + would hold. It follows from the previous two sentences that each + {Employee ID} is associated with precisely one {Employee Date of Birth}. + + [1] https://en.wikipedia.org/wiki/Functional_dependency + +In practical terms, functional dependencies mean that a value in one column +determines values in some other column. Consider for example this trivial +table with two integer columns: + + CREATE TABLE t (a INT, b INT) + AS SELECT i, i/10 FROM generate_series(1,100000) s(i); + +Clearly, knowledge of the value in column 'a' is sufficient to determine the +value in column 'b', as it's simply (a/10). A more practical example may be +addresses, where the knowledge of a ZIP code (usually) determines city. Larger +cities may have multiple ZIP codes, so the dependency can't be reversed. + +Many datasets might be normalized not to contain such dependencies, but often +it's not practical for various reasons. In some cases, it's actually a conscious +design choice to model the dataset in a denormalized way, either because of +performance or to make querying easier. + + +Soft dependencies +----------------- + +Real-world data sets often contain data errors, either because of data entry +mistakes (user mistyping the ZIP code) or perhaps issues in generating the +data (e.g. a ZIP code mistakenly assigned to two cities in different states). + +A strict implementation would either ignore dependencies in such cases, +rendering the approach mostly useless even for slightly noisy data sets, or +result in sudden changes in behavior depending on minor differences between +samples provided to ANALYZE. + +For this reason, extended statistics implement "soft" functional dependencies, +associating each functional dependency with a degree of validity (a number +between 0 and 1). This degree is then used to combine selectivities in a +smooth manner. + + +Mining dependencies (ANALYZE) +----------------------------- + +The current algorithm is fairly simple - generate all possible functional +dependencies, and for each one count the number of rows consistent with it. +Then use the fraction of rows (supporting/total) as the degree. + +To count the rows consistent with the dependency (a => b): + + (a) Sort the data lexicographically, i.e. first by 'a' then 'b'. + + (b) For each group of rows with the same 'a' value, count the number of + distinct values in 'b'. + + (c) If there's a single distinct value in 'b', the rows are consistent with + the functional dependency, otherwise they contradict it. + + +Clause reduction (planner/optimizer) +------------------------------------ + +Applying the functional dependencies is fairly simple: given a list of +equality clauses, we compute selectivities of each clause and then use the +degree to combine them using this formula + + P(a=?,b=?) = P(a=?) * (d + (1-d) * P(b=?)) + +Where 'd' is the degree of functional dependency (a => b). + +With more than two equality clauses, this process happens recursively. For +example for (a,b,c) we first use (a,b => c) to break the computation into + + P(a=?,b=?,c=?) = P(a=?,b=?) * (e + (1-e) * P(c=?)) + +where 'e' is the degree of functional dependency (a,b => c); then we can +apply (a=>b) the same way on P(a=?,b=?). + + +Consistency of clauses +---------------------- + +Functional dependencies only express general dependencies between columns, +without referencing particular values. This assumes that the equality clauses +are in fact consistent with the functional dependency, i.e. that given a +dependency (a=>b), the value in (b=?) clause is the value determined by (a=?). +If that's not the case, the clauses are "inconsistent" with the functional +dependency and the result will be over-estimation. + +This may happen, for example, when using conditions on the ZIP code and city +name with mismatching values (ZIP code for a different city), etc. In such a +case, the result set will be empty, but we'll estimate the selectivity using +the ZIP code condition. + +In this case, the default estimation based on AVIA principle happens to work +better, but mostly by chance. + +This issue is the price for the simplicity of functional dependencies. If the +application frequently constructs queries with clauses inconsistent with +functional dependencies present in the data, the best solution is not to +use functional dependencies, but one of the more complex types of statistics. |