summaryrefslogtreecommitdiffstats
path: root/src/backend/optimizer/README
blob: 41c120e0cdf6f9d2de24f2adb6b1c284eda2e2e0 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
src/backend/optimizer/README

Optimizer
=========

These directories take the Query structure returned by the parser, and
generate a plan used by the executor.  The /plan directory generates the
actual output plan, the /path code generates all possible ways to join the
tables, and /prep handles various preprocessing steps for special cases.
/util is utility stuff.  /geqo is the separate "genetic optimization" planner
--- it does a semi-random search through the join tree space, rather than
exhaustively considering all possible join trees.  (But each join considered
by /geqo is given to /path to create paths for, so we consider all possible
implementation paths for each specific join pair even in GEQO mode.)


Paths and Join Pairs
--------------------

During the planning/optimizing process, we build "Path" trees representing
the different ways of doing a query.  We select the cheapest Path that
generates the desired relation and turn it into a Plan to pass to the
executor.  (There is pretty nearly a one-to-one correspondence between the
Path and Plan trees, but Path nodes omit info that won't be needed during
planning, and include info needed for planning that won't be needed by the
executor.)

The optimizer builds a RelOptInfo structure for each base relation used in
the query.  Base rels are either primitive tables, or subquery subselects
that are planned via a separate recursive invocation of the planner.  A
RelOptInfo is also built for each join relation that is considered during
planning.  A join rel is simply a combination of base rels.  There is only
one join RelOptInfo for any given set of baserels --- for example, the join
{A B C} is represented by the same RelOptInfo no matter whether we build it
by joining A and B first and then adding C, or joining B and C first and
then adding A, etc.  These different means of building the joinrel are
represented as Paths.  For each RelOptInfo we build a list of Paths that
represent plausible ways to implement the scan or join of that relation.
Once we've considered all the plausible Paths for a rel, we select the one
that is cheapest according to the planner's cost estimates.  The final plan
is derived from the cheapest Path for the RelOptInfo that includes all the
base rels of the query.

Possible Paths for a primitive table relation include plain old sequential
scan, plus index scans for any indexes that exist on the table, plus bitmap
index scans using one or more indexes.  Specialized RTE types, such as
function RTEs, may have only one possible Path.

Joins always occur using two RelOptInfos.  One is outer, the other inner.
Outers drive lookups of values in the inner.  In a nested loop, lookups of
values in the inner occur by scanning the inner path once per outer tuple
to find each matching inner row.  In a mergejoin, inner and outer rows are
ordered, and are accessed in order, so only one scan is required to perform
the entire join: both inner and outer paths are scanned in-sync.  (There's
not a lot of difference between inner and outer in a mergejoin...)  In a
hashjoin, the inner is scanned first and all its rows are entered in a
hashtable, then the outer is scanned and for each row we lookup the join
key in the hashtable.

A Path for a join relation is actually a tree structure, with the topmost
Path node representing the last-applied join method.  It has left and right
subpaths that represent the scan or join methods used for the two input
relations.


Join Tree Construction
----------------------

The optimizer generates optimal query plans by doing a more-or-less
exhaustive search through the ways of executing the query.  The best Path
tree is found by a recursive process:

1) Take each base relation in the query, and make a RelOptInfo structure
for it.  Find each potentially useful way of accessing the relation,
including sequential and index scans, and make Paths representing those
ways.  All the Paths made for a given relation are placed in its
RelOptInfo.pathlist.  (Actually, we discard Paths that are obviously
inferior alternatives before they ever get into the pathlist --- what
ends up in the pathlist is the cheapest way of generating each potentially
useful sort ordering and parameterization of the relation.)  Also create a
RelOptInfo.joininfo list including all the join clauses that involve this
relation.  For example, the WHERE clause "tab1.col1 = tab2.col1" generates
entries in both tab1 and tab2's joininfo lists.

If we have only a single base relation in the query, we are done.
Otherwise we have to figure out how to join the base relations into a
single join relation.

2) Normally, any explicit JOIN clauses are "flattened" so that we just
have a list of relations to join.  However, FULL OUTER JOIN clauses are
never flattened, and other kinds of JOIN might not be either, if the
flattening process is stopped by join_collapse_limit or from_collapse_limit
restrictions.  Therefore, we end up with a planning problem that contains
lists of relations to be joined in any order, where any individual item
might be a sub-list that has to be joined together before we can consider
joining it to its siblings.  We process these sub-problems recursively,
bottom up.  Note that the join list structure constrains the possible join
orders, but it doesn't constrain the join implementation method at each
join (nestloop, merge, hash), nor does it say which rel is considered outer
or inner at each join.  We consider all these possibilities in building
Paths. We generate a Path for each feasible join method, and select the
cheapest Path.

For each planning problem, therefore, we will have a list of relations
that are either base rels or joinrels constructed per sub-join-lists.
We can join these rels together in any order the planner sees fit.
The standard (non-GEQO) planner does this as follows:

Consider joining each RelOptInfo to each other RelOptInfo for which there
is a usable joinclause, and generate a Path for each possible join method
for each such pair.  (If we have a RelOptInfo with no join clauses, we have
no choice but to generate a clauseless Cartesian-product join; so we
consider joining that rel to each other available rel.  But in the presence
of join clauses we will only consider joins that use available join
clauses.  Note that join-order restrictions induced by outer joins and
IN/EXISTS clauses are also checked, to ensure that we find a workable join
order in cases where those restrictions force a clauseless join to be done.)

If we only had two relations in the list, we are done: we just pick
the cheapest path for the join RelOptInfo.  If we had more than two, we now
need to consider ways of joining join RelOptInfos to each other to make
join RelOptInfos that represent more than two list items.

The join tree is constructed using a "dynamic programming" algorithm:
in the first pass (already described) we consider ways to create join rels
representing exactly two list items.  The second pass considers ways
to make join rels that represent exactly three list items; the next pass,
four items, etc.  The last pass considers how to make the final join
relation that includes all list items --- obviously there can be only one
join rel at this top level, whereas there can be more than one join rel
at lower levels.  At each level we use joins that follow available join
clauses, if possible, just as described for the first level.

For example:

    SELECT  *
    FROM    tab1, tab2, tab3, tab4
    WHERE   tab1.col = tab2.col AND
        tab2.col = tab3.col AND
        tab3.col = tab4.col

    Tables 1, 2, 3, and 4 are joined as:
    {1 2},{2 3},{3 4}
    {1 2 3},{2 3 4}
    {1 2 3 4}
    (other possibilities will be excluded for lack of join clauses)

    SELECT  *
    FROM    tab1, tab2, tab3, tab4
    WHERE   tab1.col = tab2.col AND
        tab1.col = tab3.col AND
        tab1.col = tab4.col

    Tables 1, 2, 3, and 4 are joined as:
    {1 2},{1 3},{1 4}
    {1 2 3},{1 3 4},{1 2 4}
    {1 2 3 4}

We consider left-handed plans (the outer rel of an upper join is a joinrel,
but the inner is always a single list item); right-handed plans (outer rel
is always a single item); and bushy plans (both inner and outer can be
joins themselves).  For example, when building {1 2 3 4} we consider
joining {1 2 3} to {4} (left-handed), {4} to {1 2 3} (right-handed), and
{1 2} to {3 4} (bushy), among other choices.  Although the jointree
scanning code produces these potential join combinations one at a time,
all the ways to produce the same set of joined base rels will share the
same RelOptInfo, so the paths produced from different join combinations
that produce equivalent joinrels will compete in add_path().

The dynamic-programming approach has an important property that's not
immediately obvious: we will finish constructing all paths for a given
relation before we construct any paths for relations containing that rel.
This means that we can reliably identify the "cheapest path" for each rel
before higher-level relations need to know that.  Also, we can safely
discard a path when we find that another path for the same rel is better,
without worrying that maybe there is already a reference to that path in
some higher-level join path.  Without this, memory management for paths
would be much more complicated.

Once we have built the final join rel, we use either the cheapest path
for it or the cheapest path with the desired ordering (if that's cheaper
than applying a sort to the cheapest other path).

If the query contains one-sided outer joins (LEFT or RIGHT joins), or
IN or EXISTS WHERE clauses that were converted to semijoins or antijoins,
then some of the possible join orders may be illegal.  These are excluded
by having join_is_legal consult a side list of such "special" joins to see
whether a proposed join is illegal.  (The same consultation allows it to
see which join style should be applied for a valid join, ie, JOIN_INNER,
JOIN_LEFT, etc.)


Valid OUTER JOIN Optimizations
------------------------------

The planner's treatment of outer join reordering is based on the following
identities:

1.	(A leftjoin B on (Pab)) innerjoin C on (Pac)
	= (A innerjoin C on (Pac)) leftjoin B on (Pab)

where Pac is a predicate referencing A and C, etc (in this case, clearly
Pac cannot reference B, or the transformation is nonsensical).

2.	(A leftjoin B on (Pab)) leftjoin C on (Pac)
	= (A leftjoin C on (Pac)) leftjoin B on (Pab)

3.	(A leftjoin B on (Pab)) leftjoin C on (Pbc)
	= A leftjoin (B leftjoin C on (Pbc)) on (Pab)

Identity 3 only holds if predicate Pbc must fail for all-null B rows
(that is, Pbc is strict for at least one column of B).  If Pbc is not
strict, the first form might produce some rows with nonnull C columns
where the second form would make those entries null.

RIGHT JOIN is equivalent to LEFT JOIN after switching the two input
tables, so the same identities work for right joins.

An example of a case that does *not* work is moving an innerjoin into or
out of the nullable side of an outer join:

	A leftjoin (B join C on (Pbc)) on (Pab)
	!= (A leftjoin B on (Pab)) join C on (Pbc)

SEMI joins work a little bit differently.  A semijoin can be reassociated
into or out of the lefthand side of another semijoin, left join, or
antijoin, but not into or out of the righthand side.  Likewise, an inner
join, left join, or antijoin can be reassociated into or out of the
lefthand side of a semijoin, but not into or out of the righthand side.

ANTI joins work approximately like LEFT joins, except that identity 3
fails if the join to C is an antijoin (even if Pbc is strict, and in
both the cases where the other join is a leftjoin and where it is an
antijoin).  So we can't reorder antijoins into or out of the RHS of a
leftjoin or antijoin, even if the relevant clause is strict.

The current code does not attempt to re-order FULL JOINs at all.
FULL JOIN ordering is enforced by not collapsing FULL JOIN nodes when
translating the jointree to "joinlist" representation.  Other types of
JOIN nodes are normally collapsed so that they participate fully in the
join order search.  To avoid generating illegal join orders, the planner
creates a SpecialJoinInfo node for each non-inner join, and join_is_legal
checks this list to decide if a proposed join is legal.

What we store in SpecialJoinInfo nodes are the minimum sets of Relids
required on each side of the join to form the outer join.  Note that
these are minimums; there's no explicit maximum, since joining other
rels to the OJ's syntactic rels may be legal.  Per identities 1 and 2,
non-FULL joins can be freely associated into the lefthand side of an
OJ, but in some cases they can't be associated into the righthand side.
So the restriction enforced by join_is_legal is that a proposed join
can't join a rel within or partly within an RHS boundary to one outside
the boundary, unless the proposed join is a LEFT join that can associate
into the SpecialJoinInfo's RHS using identity 3.

The use of minimum Relid sets has some pitfalls; consider a query like
	A leftjoin (B leftjoin (C innerjoin D) on (Pbcd)) on Pa
where Pa doesn't mention B/C/D at all.  In this case a naive computation
would give the upper leftjoin's min LHS as {A} and min RHS as {C,D} (since
we know that the innerjoin can't associate out of the leftjoin's RHS, and
enforce that by including its relids in the leftjoin's min RHS).  And the
lower leftjoin has min LHS of {B} and min RHS of {C,D}.  Given such
information, join_is_legal would think it's okay to associate the upper
join into the lower join's RHS, transforming the query to
	B leftjoin (A leftjoin (C innerjoin D) on Pa) on (Pbcd)
which yields totally wrong answers.  We prevent that by forcing the min RHS
for the upper join to include B.  This is perhaps overly restrictive, but
such cases don't arise often so it's not clear that it's worth developing a
more complicated system.


Pulling Up Subqueries
---------------------

As we described above, a subquery appearing in the range table is planned
independently and treated as a "black box" during planning of the outer
query.  This is necessary when the subquery uses features such as
aggregates, GROUP, or DISTINCT.  But if the subquery is just a simple
scan or join, treating the subquery as a black box may produce a poor plan
compared to considering it as part of the entire plan search space.
Therefore, at the start of the planning process the planner looks for
simple subqueries and pulls them up into the main query's jointree.

Pulling up a subquery may result in FROM-list joins appearing below the top
of the join tree.  Each FROM-list is planned using the dynamic-programming
search method described above.

If pulling up a subquery produces a FROM-list as a direct child of another
FROM-list, then we can merge the two FROM-lists together.  Once that's
done, the subquery is an absolutely integral part of the outer query and
will not constrain the join tree search space at all.  However, that could
result in unpleasant growth of planning time, since the dynamic-programming
search has runtime exponential in the number of FROM-items considered.
Therefore, we don't merge FROM-lists if the result would have too many
FROM-items in one list.


Optimizer Functions
-------------------

The primary entry point is planner().

planner()
set up for recursive handling of subqueries
-subquery_planner()
 pull up sublinks and subqueries from rangetable, if possible
 canonicalize qual
     Attempt to simplify WHERE clause to the most useful form; this includes
     flattening nested AND/ORs and detecting clauses that are duplicated in
     different branches of an OR.
 simplify constant expressions
 process sublinks
 convert Vars of outer query levels into Params
--grouping_planner()
  preprocess target list for non-SELECT queries
  handle UNION/INTERSECT/EXCEPT, GROUP BY, HAVING, aggregates,
	ORDER BY, DISTINCT, LIMIT
---query_planner()
   make list of base relations used in query
   split up the qual into restrictions (a=1) and joins (b=c)
   find qual clauses that enable merge and hash joins
----make_one_rel()
     set_base_rel_pathlists()
      find seqscan and all index paths for each base relation
      find selectivity of columns used in joins
     make_rel_from_joinlist()
      hand off join subproblems to a plugin, GEQO, or standard_join_search()
------standard_join_search()
      call join_search_one_level() for each level of join tree needed
      join_search_one_level():
        For each joinrel of the prior level, do make_rels_by_clause_joins()
        if it has join clauses, or make_rels_by_clauseless_joins() if not.
        Also generate "bushy plan" joins between joinrels of lower levels.
      Back at standard_join_search(), generate gather paths if needed for
      each newly constructed joinrel, then apply set_cheapest() to extract
      the cheapest path for it.
      Loop back if this wasn't the top join level.
  Back at grouping_planner:
  do grouping (GROUP BY) and aggregation
  do window functions
  make unique (DISTINCT)
  do sorting (ORDER BY)
  do limit (LIMIT/OFFSET)
Back at planner():
convert finished Path tree into a Plan tree
do final cleanup after planning


Optimizer Data Structures
-------------------------

PlannerGlobal   - global information for a single planner invocation

PlannerInfo     - information for planning a particular Query (we make
                  a separate PlannerInfo node for each sub-Query)

RelOptInfo      - a relation or joined relations

 RestrictInfo   - WHERE clauses, like "x = 3" or "y = z"
                  (note the same structure is used for restriction and
                   join clauses)

 Path           - every way to generate a RelOptInfo(sequential,index,joins)
  A plain Path node can represent several simple plans, per its pathtype:
    T_SeqScan   - sequential scan
    T_SampleScan - tablesample scan
    T_FunctionScan - function-in-FROM scan
    T_TableFuncScan - table function scan
    T_ValuesScan - VALUES scan
    T_CteScan   - CTE (WITH) scan
    T_NamedTuplestoreScan - ENR scan
    T_WorkTableScan - scan worktable of a recursive CTE
    T_Result    - childless Result plan node (used for FROM-less SELECT)
  IndexPath     - index scan
  BitmapHeapPath - top of a bitmapped index scan
  TidPath       - scan by CTID
  TidRangePath  - scan a contiguous range of CTIDs
  SubqueryScanPath - scan a subquery-in-FROM
  ForeignPath   - scan a foreign table, foreign join or foreign upper-relation
  CustomPath    - for custom scan providers
  AppendPath    - append multiple subpaths together
  MergeAppendPath - merge multiple subpaths, preserving their common sort order
  GroupResultPath - childless Result plan node (used for degenerate grouping)
  MaterialPath  - a Material plan node
  MemoizePath   - a Memoize plan node for caching tuples from sub-paths
  UniquePath    - remove duplicate rows (either by hashing or sorting)
  GatherPath    - collect the results of parallel workers
  GatherMergePath - collect parallel results, preserving their common sort order
  ProjectionPath - a Result plan node with child (used for projection)
  ProjectSetPath - a ProjectSet plan node applied to some sub-path
  SortPath      - a Sort plan node applied to some sub-path
  IncrementalSortPath - an IncrementalSort plan node applied to some sub-path
  GroupPath     - a Group plan node applied to some sub-path
  UpperUniquePath - a Unique plan node applied to some sub-path
  AggPath       - an Agg plan node applied to some sub-path
  GroupingSetsPath - an Agg plan node used to implement GROUPING SETS
  MinMaxAggPath - a Result plan node with subplans performing MIN/MAX
  WindowAggPath - a WindowAgg plan node applied to some sub-path
  SetOpPath     - a SetOp plan node applied to some sub-path
  RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths
  LockRowsPath  - a LockRows plan node applied to some sub-path
  ModifyTablePath - a ModifyTable plan node applied to some sub-path(s)
  LimitPath     - a Limit plan node applied to some sub-path
  NestPath      - nested-loop joins
  MergePath     - merge joins
  HashPath      - hash joins

 EquivalenceClass - a data structure representing a set of values known equal

 PathKey        - a data structure representing the sort ordering of a path

The optimizer spends a good deal of its time worrying about the ordering
of the tuples returned by a path.  The reason this is useful is that by
knowing the sort ordering of a path, we may be able to use that path as
the left or right input of a mergejoin and avoid an explicit sort step.
Nestloops and hash joins don't really care what the order of their inputs
is, but mergejoin needs suitably ordered inputs.  Therefore, all paths
generated during the optimization process are marked with their sort order
(to the extent that it is known) for possible use by a higher-level merge.

It is also possible to avoid an explicit sort step to implement a user's
ORDER BY clause if the final path has the right ordering already, so the
sort ordering is of interest even at the top level.  grouping_planner() will
look for the cheapest path with a sort order matching the desired order,
then compare its cost to the cost of using the cheapest-overall path and
doing an explicit sort on that.

When we are generating paths for a particular RelOptInfo, we discard a path
if it is more expensive than another known path that has the same or better
sort order.  We will never discard a path that is the only known way to
achieve a given sort order (without an explicit sort, that is).  In this
way, the next level up will have the maximum freedom to build mergejoins
without sorting, since it can pick from any of the paths retained for its
inputs.


EquivalenceClasses
------------------

During the deconstruct_jointree() scan of the query's qual clauses, we look
for mergejoinable equality clauses A = B whose applicability is not delayed
by an outer join; these are called "equivalence clauses".  When we find
one, we create an EquivalenceClass containing the expressions A and B to
record this knowledge.  If we later find another equivalence clause B = C,
we add C to the existing EquivalenceClass for {A B}; this may require
merging two existing EquivalenceClasses.  At the end of the scan, we have
sets of values that are known all transitively equal to each other.  We can
therefore use a comparison of any pair of the values as a restriction or
join clause (when these values are available at the scan or join, of
course); furthermore, we need test only one such comparison, not all of
them.  Therefore, equivalence clauses are removed from the standard qual
distribution process.  Instead, when preparing a restriction or join clause
list, we examine each EquivalenceClass to see if it can contribute a
clause, and if so we select an appropriate pair of values to compare.  For
example, if we are trying to join A's relation to C's, we can generate the
clause A = C, even though this appeared nowhere explicitly in the original
query.  This may allow us to explore join paths that otherwise would have
been rejected as requiring Cartesian-product joins.

Sometimes an EquivalenceClass may contain a pseudo-constant expression
(i.e., one not containing Vars or Aggs of the current query level, nor
volatile functions).  In this case we do not follow the policy of
dynamically generating join clauses: instead, we dynamically generate
restriction clauses "var = const" wherever one of the variable members of
the class can first be computed.  For example, if we have A = B and B = 42,
we effectively generate the restriction clauses A = 42 and B = 42, and then
we need not bother with explicitly testing the join clause A = B when the
relations are joined.  In effect, all the class members can be tested at
relation-scan level and there's never a need for join tests.

The precise technical interpretation of an EquivalenceClass is that it
asserts that at any plan node where more than one of its member values
can be computed, output rows in which the values are not all equal may
be discarded without affecting the query result.  (We require all levels
of the plan to enforce EquivalenceClasses, hence a join need not recheck
equality of values that were computable by one of its children.)  For an
ordinary EquivalenceClass that is "valid everywhere", we can further infer
that the values are all non-null, because all mergejoinable operators are
strict.  However, we also allow equivalence clauses that appear below the
nullable side of an outer join to form EquivalenceClasses; for these
classes, the interpretation is that either all the values are equal, or
all (except pseudo-constants) have gone to null.  (This requires a
limitation that non-constant members be strict, else they might not go
to null when the other members do.)  Consider for example

	SELECT *
	  FROM a LEFT JOIN
	       (SELECT * FROM b JOIN c ON b.y = c.z WHERE b.y = 10) ss
	       ON a.x = ss.y
	  WHERE a.x = 42;

We can form the below-outer-join EquivalenceClass {b.y c.z 10} and thereby
apply c.z = 10 while scanning c.  (The reason we disallow outerjoin-delayed
clauses from forming EquivalenceClasses is exactly that we want to be able
to push any derived clauses as far down as possible.)  But once above the
outer join it's no longer necessarily the case that b.y = 10, and thus we
cannot use such EquivalenceClasses to conclude that sorting is unnecessary
(see discussion of PathKeys below).

In this example, notice also that a.x = ss.y (really a.x = b.y) is not an
equivalence clause because its applicability to b is delayed by the outer
join; thus we do not try to insert b.y into the equivalence class {a.x 42}.
But since we see that a.x has been equated to 42 above the outer join, we
are able to form a below-outer-join class {b.y 42}; this restriction can be
added because no b/c row not having b.y = 42 can contribute to the result
of the outer join, and so we need not compute such rows.  Now this class
will get merged with {b.y c.z 10}, leading to the contradiction 10 = 42,
which lets the planner deduce that the b/c join need not be computed at all
because none of its rows can contribute to the outer join.  (This gets
implemented as a gating Result filter, since more usually the potential
contradiction involves Param values rather than just Consts, and thus has
to be checked at runtime.)

To aid in determining the sort ordering(s) that can work with a mergejoin,
we mark each mergejoinable clause with the EquivalenceClasses of its left
and right inputs.  For an equivalence clause, these are of course the same
EquivalenceClass.  For a non-equivalence mergejoinable clause (such as an
outer-join qualification), we generate two separate EquivalenceClasses for
the left and right inputs.  This may result in creating single-item
equivalence "classes", though of course these are still subject to merging
if other equivalence clauses are later found to bear on the same
expressions.

Another way that we may form a single-item EquivalenceClass is in creation
of a PathKey to represent a desired sort order (see below).  This is a bit
different from the above cases because such an EquivalenceClass might
contain an aggregate function or volatile expression.  (A clause containing
a volatile function will never be considered mergejoinable, even if its top
operator is mergejoinable, so there is no way for a volatile expression to
get into EquivalenceClasses otherwise.  Aggregates are disallowed in WHERE
altogether, so will never be found in a mergejoinable clause.)  This is just
a convenience to maintain a uniform PathKey representation: such an
EquivalenceClass will never be merged with any other.  Note in particular
that a single-item EquivalenceClass {a.x} is *not* meant to imply an
assertion that a.x = a.x; the practical effect of this is that a.x could
be NULL.

An EquivalenceClass also contains a list of btree opfamily OIDs, which
determines what the equalities it represents actually "mean".  All the
equivalence clauses that contribute to an EquivalenceClass must have
equality operators that belong to the same set of opfamilies.  (Note: most
of the time, a particular equality operator belongs to only one family, but
it's possible that it belongs to more than one.  We keep track of all the
families to ensure that we can make use of an index belonging to any one of
the families for mergejoin purposes.)

An EquivalenceClass can contain "em_is_child" members, which are copies
of members that contain appendrel parent relation Vars, transposed to
contain the equivalent child-relation variables or expressions.  These
members are *not* full-fledged members of the EquivalenceClass and do not
affect the class's overall properties at all.  They are kept only to
simplify matching of child-relation expressions to EquivalenceClasses.
Most operations on EquivalenceClasses should ignore child members.


PathKeys
--------

The PathKeys data structure represents what is known about the sort order
of the tuples generated by a particular Path.  A path's pathkeys field is a
list of PathKey nodes, where the n'th item represents the n'th sort key of
the result.  Each PathKey contains these fields:

	* a reference to an EquivalenceClass
	* a btree opfamily OID (must match one of those in the EC)
	* a sort direction (ascending or descending)
	* a nulls-first-or-last flag

The EquivalenceClass represents the value being sorted on.  Since the
various members of an EquivalenceClass are known equal according to the
opfamily, we can consider a path sorted by any one of them to be sorted by
any other too; this is what justifies referencing the whole
EquivalenceClass rather than just one member of it.

In single/base relation RelOptInfo's, the Paths represent various ways
of scanning the relation and the resulting ordering of the tuples.
Sequential scan Paths have NIL pathkeys, indicating no known ordering.
Index scans have Path.pathkeys that represent the chosen index's ordering,
if any.  A single-key index would create a single-PathKey list, while a
multi-column index generates a list with one element per key index column.
Non-key columns specified in the INCLUDE clause of covering indexes don't
have corresponding PathKeys in the list, because the have no influence on
index ordering.  (Actually, since an index can be scanned either forward or
backward, there are two possible sort orders and two possible PathKey lists
it can generate.)

Note that a bitmap scan has NIL pathkeys since we can say nothing about
the overall order of its result.  Also, an indexscan on an unordered type
of index generates NIL pathkeys.  However, we can always create a pathkey
by doing an explicit sort.  The pathkeys for a Sort plan's output just
represent the sort key fields and the ordering operators used.

Things get more interesting when we consider joins.  Suppose we do a
mergejoin between A and B using the mergeclause A.X = B.Y.  The output
of the mergejoin is sorted by X --- but it is also sorted by Y.  Again,
this can be represented by a PathKey referencing an EquivalenceClass
containing both X and Y.

With a little further thought, it becomes apparent that nestloop joins
can also produce sorted output.  For example, if we do a nestloop join
between outer relation A and inner relation B, then any pathkeys relevant
to A are still valid for the join result: we have not altered the order of
the tuples from A.  Even more interesting, if there was an equivalence clause
A.X=B.Y, and A.X was a pathkey for the outer relation A, then we can assert
that B.Y is a pathkey for the join result; X was ordered before and still
is, and the joined values of Y are equal to the joined values of X, so Y
must now be ordered too.  This is true even though we used neither an
explicit sort nor a mergejoin on Y.  (Note: hash joins cannot be counted
on to preserve the order of their outer relation, because the executor
might decide to "batch" the join, so we always set pathkeys to NIL for
a hashjoin path.)  Exception: a RIGHT or FULL join doesn't preserve the
ordering of its outer relation, because it might insert nulls at random
points in the ordering.

In general, we can justify using EquivalenceClasses as the basis for
pathkeys because, whenever we scan a relation containing multiple
EquivalenceClass members or join two relations each containing
EquivalenceClass members, we apply restriction or join clauses derived from
the EquivalenceClass.  This guarantees that any two values listed in the
EquivalenceClass are in fact equal in all tuples emitted by the scan or
join, and therefore that if the tuples are sorted by one of the values,
they can be considered sorted by any other as well.  It does not matter
whether the test clause is used as a mergeclause, or merely enforced
after-the-fact as a qpqual filter.

Note that there is no particular difficulty in labeling a path's sort
order with a PathKey referencing an EquivalenceClass that contains
variables not yet joined into the path's output.  We can simply ignore
such entries as not being relevant (yet).  This makes it possible to
use the same EquivalenceClasses throughout the join planning process.
In fact, by being careful not to generate multiple identical PathKey
objects, we can reduce comparison of EquivalenceClasses and PathKeys
to simple pointer comparison, which is a huge savings because add_path
has to make a large number of PathKey comparisons in deciding whether
competing Paths are equivalently sorted.

Pathkeys are also useful to represent an ordering that we wish to achieve,
since they are easily compared to the pathkeys of a potential candidate
path.  So, SortGroupClause lists are turned into pathkeys lists for use
inside the optimizer.

An additional refinement we can make is to insist that canonical pathkey
lists (sort orderings) do not mention the same EquivalenceClass more than
once.  For example, in all these cases the second sort column is redundant,
because it cannot distinguish values that are the same according to the
first sort column:
	SELECT ... ORDER BY x, x
	SELECT ... ORDER BY x, x DESC
	SELECT ... WHERE x = y ORDER BY x, y
Although a user probably wouldn't write "ORDER BY x,x" directly, such
redundancies are more probable once equivalence classes have been
considered.  Also, the system may generate redundant pathkey lists when
computing the sort ordering needed for a mergejoin.  By eliminating the
redundancy, we save time and improve planning, since the planner will more
easily recognize equivalent orderings as being equivalent.

Another interesting property is that if the underlying EquivalenceClass
contains a constant and is not below an outer join, then the pathkey is
completely redundant and need not be sorted by at all!  Every row must
contain the same constant value, so there's no need to sort.  (If the EC is
below an outer join, we still have to sort, since some of the rows might
have gone to null and others not.  In this case we must be careful to pick
a non-const member to sort by.  The assumption that all the non-const
members go to null at the same plan level is critical here, else they might
not produce the same sort order.)  This might seem pointless because users
are unlikely to write "... WHERE x = 42 ORDER BY x", but it allows us to
recognize when particular index columns are irrelevant to the sort order:
if we have "... WHERE x = 42 ORDER BY y", scanning an index on (x,y)
produces correctly ordered data without a sort step.  We used to have very
ugly ad-hoc code to recognize that in limited contexts, but discarding
constant ECs from pathkeys makes it happen cleanly and automatically.

You might object that a below-outer-join EquivalenceClass doesn't always
represent the same values at every level of the join tree, and so using
it to uniquely identify a sort order is dubious.  This is true, but we
can avoid dealing with the fact explicitly because we always consider that
an outer join destroys any ordering of its nullable inputs.  Thus, even
if a path was sorted by {a.x} below an outer join, we'll re-sort if that
sort ordering was important; and so using the same PathKey for both sort
orderings doesn't create any real problem.


Order of processing for EquivalenceClasses and PathKeys
-------------------------------------------------------

As alluded to above, there is a specific sequence of phases in the
processing of EquivalenceClasses and PathKeys during planning.  During the
initial scanning of the query's quals (deconstruct_jointree followed by
reconsider_outer_join_clauses), we construct EquivalenceClasses based on
mergejoinable clauses found in the quals.  At the end of this process,
we know all we can know about equivalence of different variables, so
subsequently there will be no further merging of EquivalenceClasses.
At that point it is possible to consider the EquivalenceClasses as
"canonical" and build canonical PathKeys that reference them.  At this
time we construct PathKeys for the query's ORDER BY and related clauses.
(Any ordering expressions that do not appear elsewhere will result in
the creation of new EquivalenceClasses, but this cannot result in merging
existing classes, so canonical-ness is not lost.)

Because all the EquivalenceClasses are known before we begin path
generation, we can use them as a guide to which indexes are of interest:
if an index's column is not mentioned in any EquivalenceClass then that
index's sort order cannot possibly be helpful for the query.  This allows
short-circuiting of much of the processing of create_index_paths() for
irrelevant indexes.

There are some cases where planner.c constructs additional
EquivalenceClasses and PathKeys after query_planner has completed.
In these cases, the extra ECs/PKs are needed to represent sort orders
that were not considered during query_planner.  Such situations should be
minimized since it is impossible for query_planner to return a plan
producing such a sort order, meaning an explicit sort will always be needed.
Currently this happens only for queries involving multiple window functions
with different orderings, for which extra sorts are needed anyway.


Parameterized Paths
-------------------

The naive way to join two relations using a clause like WHERE A.X = B.Y
is to generate a nestloop plan like this:

	NestLoop
		Filter: A.X = B.Y
		-> Seq Scan on A
		-> Seq Scan on B

We can make this better by using a merge or hash join, but it still
requires scanning all of both input relations.  If A is very small and B is
very large, but there is an index on B.Y, it can be enormously better to do
something like this:

	NestLoop
		-> Seq Scan on A
		-> Index Scan using B_Y_IDX on B
			Index Condition: B.Y = A.X

Here, we are expecting that for each row scanned from A, the nestloop
plan node will pass down the current value of A.X into the scan of B.
That allows the indexscan to treat A.X as a constant for any one
invocation, and thereby use it as an index key.  This is the only plan type
that can avoid fetching all of B, and for small numbers of rows coming from
A, that will dominate every other consideration.  (As A gets larger, this
gets less attractive, and eventually a merge or hash join will win instead.
So we have to cost out all the alternatives to decide what to do.)

It can be useful for the parameter value to be passed down through
intermediate layers of joins, for example:

	NestLoop
		-> Seq Scan on A
		Hash Join
			Join Condition: B.Y = C.W
			-> Seq Scan on B
			-> Index Scan using C_Z_IDX on C
				Index Condition: C.Z = A.X

If all joins are plain inner joins then this is usually unnecessary,
because it's possible to reorder the joins so that a parameter is used
immediately below the nestloop node that provides it.  But in the
presence of outer joins, such join reordering may not be possible.

Also, the bottom-level scan might require parameters from more than one
other relation.  In principle we could join the other relations first
so that all the parameters are supplied from a single nestloop level.
But if those other relations have no join clause in common (which is
common in star-schema queries for instance), the planner won't consider
joining them directly to each other.  In such a case we need to be able
to create a plan like

    NestLoop
        -> Seq Scan on SmallTable1 A
        NestLoop
            -> Seq Scan on SmallTable2 B
            -> Index Scan using XYIndex on LargeTable C
                 Index Condition: C.X = A.AID and C.Y = B.BID

so we should be willing to pass down A.AID through a join even though
there is no join order constraint forcing the plan to look like this.

Before version 9.2, Postgres used ad-hoc methods for planning and
executing nestloop queries of this kind, and those methods could not
handle passing parameters down through multiple join levels.

To plan such queries, we now use a notion of a "parameterized path",
which is a path that makes use of a join clause to a relation that's not
scanned by the path.  In the example two above, we would construct a
path representing the possibility of doing this:

	-> Index Scan using C_Z_IDX on C
		Index Condition: C.Z = A.X

This path will be marked as being parameterized by relation A.  (Note that
this is only one of the possible access paths for C; we'd still have a
plain unparameterized seqscan, and perhaps other possibilities.)  The
parameterization marker does not prevent joining the path to B, so one of
the paths generated for the joinrel {B C} will represent

	Hash Join
		Join Condition: B.Y = C.W
		-> Seq Scan on B
		-> Index Scan using C_Z_IDX on C
			Index Condition: C.Z = A.X

This path is still marked as being parameterized by A.  When we attempt to
join {B C} to A to form the complete join tree, such a path can only be
used as the inner side of a nestloop join: it will be ignored for other
possible join types.  So we will form a join path representing the query
plan shown above, and it will compete in the usual way with paths built
from non-parameterized scans.

While all ordinary paths for a particular relation generate the same set
of rows (since they must all apply the same set of restriction clauses),
parameterized paths typically generate fewer rows than less-parameterized
paths, since they have additional clauses to work with.  This means we
must consider the number of rows generated as an additional figure of
merit.  A path that costs more than another, but generates fewer rows,
must be kept since the smaller number of rows might save work at some
intermediate join level.  (It would not save anything if joined
immediately to the source of the parameters.)

To keep cost estimation rules relatively simple, we make an implementation
restriction that all paths for a given relation of the same parameterization
(i.e., the same set of outer relations supplying parameters) must have the
same rowcount estimate.  This is justified by insisting that each such path
apply *all* join clauses that are available with the named outer relations.
Different paths might, for instance, choose different join clauses to use
as index clauses; but they must then apply any other join clauses available
from the same outer relations as filter conditions, so that the set of rows
returned is held constant.  This restriction doesn't degrade the quality of
the finished plan: it amounts to saying that we should always push down
movable join clauses to the lowest possible evaluation level, which is a
good thing anyway.  The restriction is useful in particular to support
pre-filtering of join paths in add_path_precheck.  Without this rule we
could never reject a parameterized path in advance of computing its rowcount
estimate, which would greatly reduce the value of the pre-filter mechanism.

To limit planning time, we have to avoid generating an unreasonably large
number of parameterized paths.  We do this by only generating parameterized
relation scan paths for index scans, and then only for indexes for which
suitable join clauses are available.  There are also heuristics in join
planning that try to limit the number of parameterized paths considered.

In particular, there's been a deliberate policy decision to favor hash
joins over merge joins for parameterized join steps (those occurring below
a nestloop that provides parameters to the lower join's inputs).  While we
do not ignore merge joins entirely, joinpath.c does not fully explore the
space of potential merge joins with parameterized inputs.  Also, add_path
treats parameterized paths as having no pathkeys, so that they compete
only on cost and rowcount; they don't get preference for producing a
special sort order.  This creates additional bias against merge joins,
since we might discard a path that could have been useful for performing
a merge without an explicit sort step.  Since a parameterized path must
ultimately be used on the inside of a nestloop, where its sort order is
uninteresting, these choices do not affect any requirement for the final
output order of a query --- they only make it harder to use a merge join
at a lower level.  The savings in planning work justifies that.

Similarly, parameterized paths do not normally get preference in add_path
for having cheap startup cost; that's seldom of much value when on the
inside of a nestloop, so it seems not worth keeping extra paths solely for
that.  An exception occurs for parameterized paths for the RHS relation of
a SEMI or ANTI join: in those cases, we can stop the inner scan after the
first match, so it's primarily startup not total cost that we care about.


LATERAL subqueries
------------------

As of 9.3 we support SQL-standard LATERAL references from subqueries in
FROM (and also functions in FROM).  The planner implements these by
generating parameterized paths for any RTE that contains lateral
references.  In such cases, *all* paths for that relation will be
parameterized by at least the set of relations used in its lateral
references.  (And in turn, join relations including such a subquery might
not have any unparameterized paths.)  All the other comments made above for
parameterized paths still apply, though; in particular, each such path is
still expected to enforce any join clauses that can be pushed down to it,
so that all paths of the same parameterization have the same rowcount.

We also allow LATERAL subqueries to be flattened (pulled up into the parent
query) by the optimizer, but only when this does not introduce lateral
references into JOIN/ON quals that would refer to relations outside the
lowest outer join at/above that qual.  The semantics of such a qual would
be unclear.  Note that even with this restriction, pullup of a LATERAL
subquery can result in creating PlaceHolderVars that contain lateral
references to relations outside their syntactic scope.  We still evaluate
such PHVs at their syntactic location or lower, but the presence of such a
PHV in the quals or targetlist of a plan node requires that node to appear
on the inside of a nestloop join relative to the rel(s) supplying the
lateral reference.  (Perhaps now that that stuff works, we could relax the
pullup restriction?)


Security-level constraints on qual clauses
------------------------------------------

To support row-level security and security-barrier views efficiently,
we mark qual clauses (RestrictInfo nodes) with a "security_level" field.
The basic concept is that a qual with a lower security_level must be
evaluated before one with a higher security_level.  This ensures that
"leaky" quals that might expose sensitive data are not evaluated until
after the security barrier quals that are supposed to filter out
security-sensitive rows.  However, many qual conditions are "leakproof",
that is we trust the functions they use to not expose data.  To avoid
unnecessarily inefficient plans, a leakproof qual is not delayed by
security-level considerations, even if it has a higher syntactic
security_level than another qual.

In a query that contains no use of RLS or security-barrier views, all
quals will have security_level zero, so that none of these restrictions
kick in; we don't even need to check leakproofness of qual conditions.

If there are security-barrier quals, they get security_level zero (and
possibly higher, if there are multiple layers of barriers).  Regular quals
coming from the query text get a security_level one more than the highest
level used for barrier quals.

When new qual clauses are generated by EquivalenceClass processing,
they must be assigned a security_level.  This is trickier than it seems.
One's first instinct is that it would be safe to use the largest level
found among the source quals for the EquivalenceClass, but that isn't
safe at all, because it allows unwanted delays of security-barrier quals.
Consider a barrier qual "t.x = t.y" plus a query qual "t.x = constant",
and suppose there is another query qual "leaky_function(t.z)" that
we mustn't evaluate before the barrier qual has been checked.
We will have an EC {t.x, t.y, constant} which will lead us to replace
the EC quals with "t.x = constant AND t.y = constant".  (We do not want
to give up that behavior, either, since the latter condition could allow
use of an index on t.y, which we would never discover from the original
quals.)  If these generated quals are assigned the same security_level as
the query quals, then it's possible for the leaky_function qual to be
evaluated first, allowing leaky_function to see data from rows that
possibly don't pass the barrier condition.

Instead, our handling of security levels with ECs works like this:
* Quals are not accepted as source clauses for ECs in the first place
unless they are leakproof or have security_level zero.
* EC-derived quals are assigned the minimum (not maximum) security_level
found among the EC's source clauses.
* If the maximum security_level found among the EC's source clauses is
above zero, then the equality operators selected for derived quals must
be leakproof.  When no such operator can be found, the EC is treated as
"broken" and we fall back to emitting its source clauses without any
additional derived quals.

These rules together ensure that an untrusted qual clause (one with
security_level above zero) cannot cause an EC to generate a leaky derived
clause.  This makes it safe to use the minimum not maximum security_level
for derived clauses.  The rules could result in poor plans due to not
being able to generate derived clauses at all, but the risk of that is
small in practice because most btree equality operators are leakproof.
Also, by making exceptions for level-zero quals, we ensure that there is
no plan degradation when no barrier quals are present.

Once we have security levels assigned to all clauses, enforcement
of barrier-qual ordering restrictions boils down to two rules:

* Table scan plan nodes must not select quals for early execution
(for example, use them as index qualifiers in an indexscan) unless
they are leakproof or have security_level no higher than any other
qual that is due to be executed at the same plan node.  (Use the
utility function restriction_is_securely_promotable() to check
whether it's okay to select a qual for early execution.)

* Normal execution of a list of quals must execute them in an order
that satisfies the same security rule, ie higher security_levels must
be evaluated later unless leakproof.  (This is handled in a single place
by order_qual_clauses() in createplan.c.)

order_qual_clauses() uses a heuristic to decide exactly what to do with
leakproof clauses.  Normally it sorts clauses by security_level then cost,
being careful that the sort is stable so that we don't reorder clauses
without a clear reason.  But this could result in a very expensive qual
being done before a cheaper one that is of higher security_level.
If the cheaper qual is leaky we have no choice, but if it is leakproof
we could put it first.  We choose to sort leakproof quals as if they
have security_level zero, but only when their cost is less than 10X
cpu_operator_cost; that restriction alleviates the opposite problem of
doing expensive quals first just because they're leakproof.

Additional rules will be needed to support safe handling of join quals
when there is a mix of security levels among join quals; for example, it
will be necessary to prevent leaky higher-security-level quals from being
evaluated at a lower join level than other quals of lower security level.
Currently there is no need to consider that since security-prioritized
quals can only be single-table restriction quals coming from RLS policies
or security-barrier views, and security-barrier view subqueries are never
flattened into the parent query.  Hence enforcement of security-prioritized
quals only happens at the table scan level.  With extra rules for safe
handling of security levels among join quals, it should be possible to let
security-barrier views be flattened into the parent query, allowing more
flexibility of planning while still preserving required ordering of qual
evaluation.  But that will come later.


Post scan/join planning
-----------------------

So far we have discussed only scan/join planning, that is, implementation
of the FROM and WHERE clauses of a SQL query.  But the planner must also
determine how to deal with GROUP BY, aggregation, and other higher-level
features of queries; and in many cases there are multiple ways to do these
steps and thus opportunities for optimization choices.  These steps, like
scan/join planning, are handled by constructing Paths representing the
different ways to do a step, then choosing the cheapest Path.

Since all Paths require a RelOptInfo as "parent", we create RelOptInfos
representing the outputs of these upper-level processing steps.  These
RelOptInfos are mostly dummy, but their pathlist lists hold all the Paths
considered useful for each step.  Currently, we may create these types of
additional RelOptInfos during upper-level planning:

UPPERREL_SETOP		result of UNION/INTERSECT/EXCEPT, if any
UPPERREL_PARTIAL_GROUP_AGG	result of partial grouping/aggregation, if any
UPPERREL_GROUP_AGG	result of grouping/aggregation, if any
UPPERREL_WINDOW		result of window functions, if any
UPPERREL_PARTIAL_DISTINCT	result of partial "SELECT DISTINCT", if any
UPPERREL_DISTINCT	result of "SELECT DISTINCT", if any
UPPERREL_ORDERED	result of ORDER BY, if any
UPPERREL_FINAL		result of any remaining top-level actions

UPPERREL_FINAL is used to represent any final processing steps, currently
LockRows (SELECT FOR UPDATE), LIMIT/OFFSET, and ModifyTable.  There is no
flexibility about the order in which these steps are done, and thus no need
to subdivide this stage more finely.

These "upper relations" are identified by the UPPERREL enum values shown
above, plus a relids set, which allows there to be more than one upperrel
of the same kind.  We use NULL for the relids if there's no need for more
than one upperrel of the same kind.  Currently, in fact, the relids set
is vestigial because it's always NULL, but that's expected to change in
the future.  For example, in planning set operations, we might need the
relids to denote which subset of the leaf SELECTs has been combined in a
particular group of Paths that are competing with each other.

The result of subquery_planner() is always returned as a set of Paths
stored in the UPPERREL_FINAL rel with NULL relids.  The other types of
upperrels are created only if needed for the particular query.


Parallel Query and Partial Paths
--------------------------------

Parallel query involves dividing up the work that needs to be performed
either by an entire query or some portion of the query in such a way that
some of that work can be done by one or more worker processes, which are
called parallel workers.  Parallel workers are a subtype of dynamic
background workers; see src/backend/access/transam/README.parallel for a
fuller description.  The academic literature on parallel query suggests
that parallel execution strategies can be divided into essentially two
categories: pipelined parallelism, where the execution of the query is
divided into multiple stages and each stage is handled by a separate
process; and partitioning parallelism, where the data is split between
multiple processes and each process handles a subset of it.  The
literature, however, suggests that gains from pipeline parallelism are
often very limited due to the difficulty of avoiding pipeline stalls.
Consequently, we do not currently attempt to generate query plans that
use this technique.

Instead, we focus on partitioning parallelism, which does not require
that the underlying table be partitioned.  It only requires that (1)
there is some method of dividing the data from at least one of the base
tables involved in the relation across multiple processes, (2) allowing
each process to handle its own portion of the data, and then (3)
collecting the results.  Requirements (2) and (3) are satisfied by the
executor node Gather (or GatherMerge), which launches any number of worker
processes and executes its single child plan in all of them, and perhaps
in the leader also, if the children aren't generating enough data to keep
the leader busy.  Requirement (1) is handled by the table scan node: when
invoked with parallel_aware = true, this node will, in effect, partition
the table on a block by block basis, returning a subset of the tuples from
the relation in each worker where that scan node is executed.

Just as we do for non-parallel access methods, we build Paths to
represent access strategies that can be used in a parallel plan.  These
are, in essence, the same strategies that are available in the
non-parallel plan, but there is an important difference: a path that
will run beneath a Gather node returns only a subset of the query
results in each worker, not all of them.  To form a path that can
actually be executed, the (rather large) cost of the Gather node must be
accounted for.  For this reason among others, paths intended to run
beneath a Gather node - which we call "partial" paths since they return
only a subset of the results in each worker - must be kept separate from
ordinary paths (see RelOptInfo's partial_pathlist and the function
add_partial_path).

One of the keys to making parallel query effective is to run as much of
the query in parallel as possible.  Therefore, we expect it to generally
be desirable to postpone the Gather stage until as near to the top of the
plan as possible.  Expanding the range of cases in which more work can be
pushed below the Gather (and costing them accurately) is likely to keep us
busy for a long time to come.

Partitionwise joins
-------------------

A join between two similarly partitioned tables can be broken down into joins
between their matching partitions if there exists an equi-join condition
between the partition keys of the joining tables. The equi-join between
partition keys implies that all join partners for a given row in one
partitioned table must be in the corresponding partition of the other
partitioned table. Because of this the join between partitioned tables to be
broken into joins between the matching partitions. The resultant join is
partitioned in the same way as the joining relations, thus allowing an N-way
join between similarly partitioned tables having equi-join condition between
their partition keys to be broken down into N-way joins between their matching
partitions. This technique of breaking down a join between partitioned tables
into joins between their partitions is called partitionwise join. We will use
term "partitioned relation" for either a partitioned table or a join between
compatibly partitioned tables.

Even if the joining relations don't have exactly the same partition bounds,
partitionwise join can still be applied by using an advanced
partition-matching algorithm.  For both the joining relations, the algorithm
checks whether every partition of one joining relation only matches one
partition of the other joining relation at most.  In such a case the join
between the joining relations can be broken down into joins between the
matching partitions.  The join relation can then be considered partitioned.
The algorithm produces the pairs of the matching partitions, plus the
partition bounds for the join relation, to allow partitionwise join for
computing the join.  The algorithm is implemented in partition_bounds_merge().
For an N-way join relation considered partitioned this way, not every pair of
joining relations can use partitionwise join.  For example:

	(A leftjoin B on (Pab)) innerjoin C on (Pac)

where A, B, and C are partitioned tables, and A has an extra partition
compared to B and C.  When considering partitionwise join for the join {A B},
the extra partition of A doesn't have a matching partition on the nullable
side, which is the case that the current implementation of partitionwise join
can't handle.  So {A B} is not considered partitioned, and the pair of {A B}
and C considered for the 3-way join can't use partitionwise join.  On the
other hand, the pair of {A C} and B can use partitionwise join because {A C}
is considered partitioned by eliminating the extra partition (see identity 1
on outer join reordering).  Whether an N-way join can use partitionwise join
is determined based on the first pair of joining relations that are both
partitioned and can use partitionwise join.

The partitioning properties of a partitioned relation are stored in its
RelOptInfo.  The information about data types of partition keys are stored in
PartitionSchemeData structure. The planner maintains a list of canonical
partition schemes (distinct PartitionSchemeData objects) so that RelOptInfo of
any two partitioned relations with same partitioning scheme point to the same
PartitionSchemeData object.  This reduces memory consumed by
PartitionSchemeData objects and makes it easy to compare the partition schemes
of joining relations.

Partitionwise aggregates/grouping
---------------------------------

If the GROUP BY clause contains all of the partition keys, all the rows
that belong to a given group must come from a single partition; therefore,
aggregation can be done completely separately for each partition. Otherwise,
partial aggregates can be computed for each partition, and then finalized
after appending the results from the individual partitions.  This technique of
breaking down aggregation or grouping over a partitioned relation into
aggregation or grouping over its partitions is called partitionwise
aggregation.  Especially when the partition keys match the GROUP BY clause,
this can be significantly faster than the regular method.