62.3. Genetic Query Optimization (GEQO) in PostgreSQL

62.3. Genetic Query Optimization (GEQO) in PostgreSQL
Prev	Up	Chapter 62. Genetic Query Optimizer	Home	Next

62.3. Genetic Query Optimization (GEQO) in PostgreSQL #

62.3.1. Generating Possible Plans with GEQO
62.3.2. Future Implementation Tasks for + PostgreSQL GEQO

+ The GEQO module approaches the query + optimization problem as though it were the well-known traveling salesman + problem (TSP). + Possible query plans are encoded as integer strings. Each string + represents the join order from one relation of the query to the next. + For example, the join tree +

+   /\
+  /\ 2
+ /\ 3
+4  1
+

+ is encoded by the integer string '4-1-3-2', + which means, first join relation '4' and '1', then '3', and + then '2', where 1, 2, 3, 4 are relation IDs within the + PostgreSQL optimizer. +

+ Specific characteristics of the GEQO + implementation in PostgreSQL + are: + +

+ Usage of a steady state GA (replacement of the least fit + individuals in a population, not whole-generational replacement) + allows fast convergence towards improved query plans. This is + essential for query handling with reasonable time; +
+ Usage of edge recombination crossover + which is especially suited to keep edge losses low for the + solution of the TSP by means of a + GA; +
+ Mutation as genetic operator is deprecated so that no repair + mechanisms are needed to generate legal TSP tours. +

+ Parts of the GEQO module are adapted from D. Whitley's + Genitor algorithm. +

+ The GEQO module allows + the PostgreSQL query optimizer to + support large join queries effectively through + non-exhaustive search. +

62.3.1. Generating Possible Plans with GEQO #

+ The GEQO planning process uses the standard planner + code to generate plans for scans of individual relations. Then join + plans are developed using the genetic approach. As shown above, each + candidate join plan is represented by a sequence in which to join + the base relations. In the initial stage, the GEQO + code simply generates some possible join sequences at random. For each + join sequence considered, the standard planner code is invoked to + estimate the cost of performing the query using that join sequence. + (For each step of the join sequence, all three possible join strategies + are considered; and all the initially-determined relation scan plans + are available. The estimated cost is the cheapest of these + possibilities.) Join sequences with lower estimated cost are considered + “more fit” than those with higher cost. The genetic algorithm + discards the least fit candidates. Then new candidates are generated + by combining genes of more-fit candidates — that is, by using + randomly-chosen portions of known low-cost join sequences to create + new sequences for consideration. This process is repeated until a + preset number of join sequences have been considered; then the best + one found at any time during the search is used to generate the finished + plan. +

+ This process is inherently nondeterministic, because of the randomized + choices made during both the initial population selection and subsequent + “mutation” of the best candidates. To avoid surprising changes + of the selected plan, each run of the GEQO algorithm restarts its + random number generator with the current geqo_seed + parameter setting. As long as geqo_seed and the other + GEQO parameters are kept fixed, the same plan will be generated for a + given query (and other planner inputs such as statistics). To experiment + with different search paths, try changing geqo_seed. +

62.3.2. Future Implementation Tasks for + PostgreSQL GEQO #

+ Work is still needed to improve the genetic algorithm parameter + settings. + In file src/backend/optimizer/geqo/geqo_main.c, + routines + gimme_pool_size and gimme_number_generations, + we have to find a compromise for the parameter settings + to satisfy two competing demands: +

+ Optimality of the query plan +
+ Computing time +

+ In the current implementation, the fitness of each candidate join + sequence is estimated by running the standard planner's join selection + and cost estimation code from scratch. To the extent that different + candidates use similar sub-sequences of joins, a great deal of work + will be repeated. This could be made significantly faster by retaining + cost estimates for sub-joins. The problem is to avoid expending + unreasonable amounts of memory on retaining that state. +

+ At a more basic level, it is not clear that solving query optimization + with a GA algorithm designed for TSP is appropriate. In the TSP case, + the cost associated with any substring (partial tour) is independent + of the rest of the tour, but this is certainly not true for query + optimization. Thus it is questionable whether edge recombination + crossover is the most effective mutation procedure. +

Prev	Up	Next
62.2. Genetic Algorithms	Home	62.4. Further Reading