1 files changed, 598 insertions, 0 deletions
diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
new file mode 100644
index 0000000..5acc953
--- /dev/null
+++ b/doc/src/sgml/parallel.sgml
@@ -0,0 +1,598 @@
+<!-- doc/src/sgml/parallel.sgml -->
+
+ <chapter id="parallel-query">
+  <title>Parallel Query</title>
+
+  <indexterm zone="parallel-query">
+   <primary>parallel query</primary>
+  </indexterm>
+
+  <para>
+   <productname>PostgreSQL</productname> can devise query plans that can leverage
+   multiple CPUs in order to answer queries faster.  This feature is known
+   as parallel query.  Many queries cannot benefit from parallel query, either
+   due to limitations of the current implementation or because there is no
+   imaginable query plan that is any faster than the serial query plan.
+   However, for queries that can benefit, the speedup from parallel query
+   is often very significant.  Many queries can run more than twice as fast
+   when using parallel query, and some queries can run four times faster or
+   even more.  Queries that touch a large amount of data but return only a
+   few rows to the user will typically benefit most.  This chapter explains
+   some details of how parallel query works and in which situations it can be
+   used so that users who wish to make use of it can understand what to expect.
+  </para>
+
+ <sect1 id="how-parallel-query-works">
+  <title>How Parallel Query Works</title>
+
+   <para>
+    When the optimizer determines that parallel query is the fastest execution
+    strategy for a particular query, it will create a query plan that includes
+    a <firstterm>Gather</firstterm> or <firstterm>Gather Merge</firstterm>
+    node.  Here is a simple example:
+
+<screen>
+EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
+                                     QUERY PLAN
+-------------------------------------------------------------------&zwsp;------------------
+ Gather  (cost=1000.00..217018.43 rows=1 width=97)
+   Workers Planned: 2
+   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..216018.33 rows=1 width=97)
+         Filter: (filler ~~ '%x%'::text)
+(4 rows)
+</screen>
+   </para>
+
+   <para>
+    In all cases, the <literal>Gather</literal> or
+    <literal>Gather Merge</literal> node will have exactly one
+    child plan, which is the portion of the plan that will be executed in
+    parallel.  If the <literal>Gather</literal> or <literal>Gather Merge</literal> node is
+    at the very top of the plan tree, then the entire query will execute in
+    parallel.  If it is somewhere else in the plan tree, then only the portion
+    of the plan below it will run in parallel.  In the example above, the
+    query accesses only one table, so there is only one plan node other than
+    the <literal>Gather</literal> node itself; since that plan node is a child of the
+    <literal>Gather</literal> node, it will run in parallel.
+   </para>
+
+   <para>
+    <link linkend="using-explain">Using EXPLAIN</link>, you can see the number of
+    workers chosen by the planner.  When the <literal>Gather</literal> node is reached
+    during query execution, the process that is implementing the user's
+    session will request a number of <link linkend="bgworker">background
+    worker processes</link> equal to the number
+    of workers chosen by the planner.  The number of background workers that
+    the planner will consider using is limited to at most
+    <xref linkend="guc-max-parallel-workers-per-gather"/>.  The total number
+    of background workers that can exist at any one time is limited by both
+    <xref linkend="guc-max-worker-processes"/> and
+    <xref linkend="guc-max-parallel-workers"/>.  Therefore, it is possible for a
+    parallel query to run with fewer workers than planned, or even with
+    no workers at all.  The optimal plan may depend on the number of workers
+    that are available, so this can result in poor query performance.  If this
+    occurrence is frequent, consider increasing
+    <varname>max_worker_processes</varname> and <varname>max_parallel_workers</varname>
+    so that more workers can be run simultaneously or alternatively reducing
+    <varname>max_parallel_workers_per_gather</varname> so that the planner
+    requests fewer workers.
+   </para>
+
+   <para>
+    Every background worker process that is successfully started for a given
+    parallel query will execute the parallel portion of the plan.  The leader
+    will also execute that portion of the plan, but it has an additional
+    responsibility: it must also read all of the tuples generated by the
+    workers.  When the parallel portion of the plan generates only a small
+    number of tuples, the leader will often behave very much like an additional
+    worker, speeding up query execution.  Conversely, when the parallel portion
+    of the plan generates a large number of tuples, the leader may be almost
+    entirely occupied with reading the tuples generated by the workers and
+    performing any further processing steps that are required by plan nodes
+    above the level of the <literal>Gather</literal> node or
+    <literal>Gather Merge</literal> node.  In such cases, the leader will
+    do very little of the work of executing the parallel portion of the plan.
+   </para>
+
+   <para>
+    When the node at the top of the parallel portion of the plan is
+    <literal>Gather Merge</literal> rather than <literal>Gather</literal>, it indicates that
+    each process executing the parallel portion of the plan is producing
+    tuples in sorted order, and that the leader is performing an
+    order-preserving merge.  In contrast, <literal>Gather</literal> reads tuples
+    from the workers in whatever order is convenient, destroying any sort
+    order that may have existed.
+   </para>
+ </sect1>
+
+ <sect1 id="when-can-parallel-query-be-used">
+  <title>When Can Parallel Query Be Used?</title>
+
+  <para>
+    There are several settings that can cause the query planner not to
+    generate a parallel query plan under any circumstances.  In order for
+    any parallel query plans whatsoever to be generated, the following
+    settings must be configured as indicated.
+  </para>
+
+  <itemizedlist>
+    <listitem>
+      <para>
+        <xref linkend="guc-max-parallel-workers-per-gather"/> must be set to a
+        value that is greater than zero. This is a special case of the more
+        general principle that no more workers should be used than the number
+        configured via <varname>max_parallel_workers_per_gather</varname>.
+      </para>
+    </listitem>
+  </itemizedlist>
+
+  <para>
+    In addition, the system must not be running in single-user mode.  Since
+    the entire database system is running as a single process in this situation,
+    no background workers will be available.
+  </para>
+
+  <para>
+    Even when it is in general possible for parallel query plans to be
+    generated, the planner will not generate them for a given query
+    if any of the following are true:
+  </para>
+
+  <itemizedlist>
+    <listitem>
+      <para>
+        The query writes any data or locks any database rows. If a query
+        contains a data-modifying operation either at the top level or within
+        a CTE, no parallel plans for that query will be generated.  As an
+        exception, the following commands, which create a new table and populate
+        it, can use a parallel plan for the underlying <literal>SELECT</literal>
+        part of the query:
+
+        <itemizedlist>
+          <listitem>
+            <para><command>CREATE TABLE ... AS</command></para>
+          </listitem>
+          <listitem>
+            <para><command>SELECT INTO</command></para>
+          </listitem>
+          <listitem>
+            <para><command>CREATE MATERIALIZED VIEW</command></para>
+          </listitem>
+          <listitem>
+            <para><command>REFRESH MATERIALIZED VIEW</command></para>
+          </listitem>
+        </itemizedlist>
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        The query might be suspended during execution. In any situation in
+        which the system thinks that partial or incremental execution might
+        occur, no parallel plan is generated. For example, a cursor created
+        using <link linkend="sql-declare">DECLARE CURSOR</link> will never use
+        a parallel plan. Similarly, a PL/pgSQL loop of the form
+        <literal>FOR x IN query LOOP .. END LOOP</literal> will never use a
+        parallel plan, because the parallel query system is unable to verify
+        that the code in the loop is safe to execute while parallel query is
+        active.
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        The query uses any function marked <literal>PARALLEL UNSAFE</literal>.
+        Most system-defined functions are <literal>PARALLEL SAFE</literal>,
+        but user-defined functions are marked <literal>PARALLEL
+        UNSAFE</literal> by default. See the discussion of
+        <xref linkend="parallel-safety"/>.
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        The query is running inside of another query that is already parallel.
+        For example, if a function called by a parallel query issues an SQL
+        query itself, that query will never use a parallel plan. This is a
+        limitation of the current implementation, but it may not be desirable
+        to remove this limitation, since it could result in a single query
+        using a very large number of processes.
+      </para>
+    </listitem>
+  </itemizedlist>
+
+  <para>
+    Even when parallel query plan is generated for a particular query, there
+    are several circumstances under which it will be impossible to execute
+    that plan in parallel at execution time.  If this occurs, the leader
+    will execute the portion of the plan below the <literal>Gather</literal>
+    node entirely by itself, almost as if the <literal>Gather</literal> node were
+    not present.  This will happen if any of the following conditions are met:
+  </para>
+
+  <itemizedlist>
+    <listitem>
+      <para>
+        No background workers can be obtained because of the limitation that
+        the total number of background workers cannot exceed
+        <xref linkend="guc-max-worker-processes"/>.
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        No background workers can be obtained because of the limitation that
+        the total number of background workers launched for purposes of
+        parallel query cannot exceed <xref linkend="guc-max-parallel-workers"/>.
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        The client sends an Execute message with a non-zero fetch count.
+        See the discussion of the
+        <link linkend="protocol-flow-ext-query">extended query protocol</link>.
+        Since <link linkend="libpq">libpq</link> currently provides no way to
+        send such a message, this can only occur when using a client that
+        does not rely on libpq.  If this is a frequent
+        occurrence, it may be a good idea to set
+        <xref linkend="guc-max-parallel-workers-per-gather"/> to zero in
+        sessions where it is likely, so as to avoid generating query plans
+        that may be suboptimal when run serially.
+      </para>
+    </listitem>
+  </itemizedlist>
+ </sect1>
+
+ <sect1 id="parallel-plans">
+  <title>Parallel Plans</title>
+
+  <para>
+    Because each worker executes the parallel portion of the plan to
+    completion, it is not possible to simply take an ordinary query plan
+    and run it using multiple workers.  Each worker would produce a full
+    copy of the output result set, so the query would not run any faster
+    than normal but would produce incorrect results.  Instead, the parallel
+    portion of the plan must be what is known internally to the query
+    optimizer as a <firstterm>partial plan</firstterm>; that is, it must be constructed
+    so that each process that executes the plan will generate only a
+    subset of the output rows in such a way that each required output row
+    is guaranteed to be generated by exactly one of the cooperating processes.
+    Generally, this means that the scan on the driving table of the query
+    must be a parallel-aware scan.
+  </para>
+
+ <sect2 id="parallel-scans">
+  <title>Parallel Scans</title>
+
+  <para>
+    The following types of parallel-aware table scans are currently supported.
+
+  <itemizedlist>
+    <listitem>
+      <para>
+        In a <emphasis>parallel sequential scan</emphasis>, the table's blocks will
+        be divided into ranges and shared among the cooperating processes.  Each
+        worker process will complete the scanning of its given range of blocks before
+        requesting an additional range of blocks.
+      </para>
+    </listitem>
+    <listitem>
+      <para>
+        In a <emphasis>parallel bitmap heap scan</emphasis>, one process is chosen
+        as the leader.  That process performs a scan of one or more indexes
+        and builds a bitmap indicating which table blocks need to be visited.
+        These blocks are then divided among the cooperating processes as in
+        a parallel sequential scan.  In other words, the heap scan is performed
+        in parallel, but the underlying index scan is not.
+      </para>
+    </listitem>
+    <listitem>
+      <para>
+        In a <emphasis>parallel index scan</emphasis> or <emphasis>parallel index-only
+        scan</emphasis>, the cooperating processes take turns reading data from the
+        index.  Currently, parallel index scans are supported only for
+        btree indexes.  Each process will claim a single index block and will
+        scan and return all tuples referenced by that block; other processes can
+        at the same time be returning tuples from a different index block.
+        The results of a parallel btree scan are returned in sorted order
+        within each worker process.
+      </para>
+    </listitem>
+  </itemizedlist>
+
+    Other scan types, such as scans of non-btree indexes, may support
+    parallel scans in the future.
+  </para>
+ </sect2>
+
+ <sect2 id="parallel-joins">
+  <title>Parallel Joins</title>
+
+  <para>
+    Just as in a non-parallel plan, the driving table may be joined to one or
+    more other tables using a nested loop, hash join, or merge join.  The
+    inner side of the join may be any kind of non-parallel plan that is
+    otherwise supported by the planner provided that it is safe to run within
+    a parallel worker.  Depending on the join type, the inner side may also be
+    a parallel plan.
+  </para>
+
+  <itemizedlist>
+    <listitem>
+      <para>
+        In a <emphasis>nested loop join</emphasis>, the inner side is always
+        non-parallel.  Although it is executed in full, this is efficient if
+        the inner side is an index scan, because the outer tuples and thus
+        the loops that look up values in the index are divided over the
+        cooperating processes.
+      </para>
+    </listitem>
+    <listitem>
+      <para>
+        In a <emphasis>merge join</emphasis>, the inner side is always
+        a non-parallel plan and therefore executed in full.  This may be
+        inefficient, especially if a sort must be performed, because the work
+        and resulting data are duplicated in every cooperating process.
+      </para>
+    </listitem>
+    <listitem>
+      <para>
+        In a <emphasis>hash join</emphasis> (without the "parallel" prefix),
+        the inner side is executed in full by every cooperating process
+        to build identical copies of the hash table.  This may be inefficient
+        if the hash table is large or the plan is expensive.  In a
+        <emphasis>parallel hash join</emphasis>, the inner side is a
+        <emphasis>parallel hash</emphasis> that divides the work of building
+        a shared hash table over the cooperating processes.
+      </para>
+    </listitem>
+  </itemizedlist>
+ </sect2>
+
+ <sect2 id="parallel-aggregation">
+  <title>Parallel Aggregation</title>
+  <para>
+    <productname>PostgreSQL</productname> supports parallel aggregation by aggregating in
+    two stages.  First, each process participating in the parallel portion of
+    the query performs an aggregation step, producing a partial result for
+    each group of which that process is aware.  This is reflected in the plan
+    as a <literal>Partial Aggregate</literal> node.  Second, the partial results are
+    transferred to the leader via <literal>Gather</literal> or <literal>Gather
+    Merge</literal>.  Finally, the leader re-aggregates the results across all
+    workers in order to produce the final result.  This is reflected in the
+    plan as a <literal>Finalize Aggregate</literal> node.
+  </para>
+
+  <para>
+    Because the <literal>Finalize Aggregate</literal> node runs on the leader
+    process, queries that produce a relatively large number of groups in
+    comparison to the number of input rows will appear less favorable to the
+    query planner. For example, in the worst-case scenario the number of
+    groups seen by the <literal>Finalize Aggregate</literal> node could be as many as
+    the number of input rows that were seen by all worker processes in the
+    <literal>Partial Aggregate</literal> stage. For such cases, there is clearly
+    going to be no performance benefit to using parallel aggregation. The
+    query planner takes this into account during the planning process and is
+    unlikely to choose parallel aggregate in this scenario.
+  </para>
+
+  <para>
+    Parallel aggregation is not supported in all situations.  Each aggregate
+    must be <link linkend="parallel-safety">safe</link> for parallelism and must
+    have a combine function.  If the aggregate has a transition state of type
+    <literal>internal</literal>, it must have serialization and deserialization
+    functions.  See <xref linkend="sql-createaggregate"/> for more details.
+    Parallel aggregation is not supported if any aggregate function call
+    contains <literal>DISTINCT</literal> or <literal>ORDER BY</literal> clause and is also
+    not supported for ordered set aggregates or when  the query involves
+    <literal>GROUPING SETS</literal>.  It can only be used when all joins involved in
+    the query are also part of the parallel portion of the plan.
+  </para>
+
+ </sect2>
+
+ <sect2 id="parallel-append">
+  <title>Parallel Append</title>
+
+  <para>
+    Whenever <productname>PostgreSQL</productname> needs to combine rows
+    from multiple sources into a single result set, it uses an
+    <literal>Append</literal> or <literal>MergeAppend</literal> plan node.
+    This commonly happens when implementing <literal>UNION ALL</literal> or
+    when scanning a partitioned table.  Such nodes can be used in parallel
+    plans just as they can in any other plan.  However, in a parallel plan,
+    the planner may instead use a <literal>Parallel Append</literal> node.
+  </para>
+
+  <para>
+    When an <literal>Append</literal> node is used in a parallel plan, each
+    process will execute the child plans in the order in which they appear,
+    so that all participating processes cooperate to execute the first child
+    plan until it is complete and then move to the second plan at around the
+    same time.  When a <literal>Parallel Append</literal> is used instead, the
+    executor will instead spread out the participating processes as evenly as
+    possible across its child plans, so that multiple child plans are executed
+    simultaneously.  This avoids contention, and also avoids paying the startup
+    cost of a child plan in those processes that never execute it.
+  </para>
+
+  <para>
+    Also, unlike a regular <literal>Append</literal> node, which can only have
+    partial children when used within a parallel plan, a <literal>Parallel
+    Append</literal> node can have both partial and non-partial child plans.
+    Non-partial children will be scanned by only a single process, since
+    scanning them more than once would produce duplicate results.  Plans that
+    involve appending multiple results sets can therefore achieve
+    coarse-grained parallelism even when efficient partial plans are not
+    available.  For example, consider a query against a partitioned table
+    that can only be implemented efficiently by using an index that does
+    not support parallel scans.  The planner might choose a <literal>Parallel
+    Append</literal> of regular <literal>Index Scan</literal> plans; each
+    individual index scan would have to be executed to completion by a single
+    process, but different scans could be performed at the same time by
+    different processes.
+  </para>
+
+  <para>
+    <xref linkend="guc-enable-parallel-append" /> can be used to disable
+    this feature.
+  </para>
+ </sect2>
+
+ <sect2 id="parallel-plan-tips">
+  <title>Parallel Plan Tips</title>
+
+  <para>
+    If a query that is expected to do so does not produce a parallel plan,
+    you can try reducing <xref linkend="guc-parallel-setup-cost"/> or
+    <xref linkend="guc-parallel-tuple-cost"/>.  Of course, this plan may turn
+    out to be slower than the serial plan that the planner preferred, but
+    this will not always be the case.  If you don't get a parallel
+    plan even with very small values of these settings (e.g., after setting
+    them both to zero), there may be some reason why the query planner is
+    unable to generate a parallel plan for your query.  See
+    <xref linkend="when-can-parallel-query-be-used"/> and
+    <xref linkend="parallel-safety"/> for information on why this may be
+    the case.
+  </para>
+
+  <para>
+    When executing a parallel plan, you can use <literal>EXPLAIN (ANALYZE,
+    VERBOSE)</literal> to display per-worker statistics for each plan node.
+    This may be useful in determining whether the work is being evenly
+    distributed between all plan nodes and more generally in understanding the
+    performance characteristics of the plan.
+  </para>
+
+ </sect2>
+ </sect1>
+
+ <sect1 id="parallel-safety">
+  <title>Parallel Safety</title>
+
+  <para>
+    The planner classifies operations involved in a query as either
+    <firstterm>parallel safe</firstterm>, <firstterm>parallel restricted</firstterm>,
+    or <firstterm>parallel unsafe</firstterm>.  A parallel safe operation is one that
+    does not conflict with the use of parallel query.  A parallel restricted
+    operation is one that cannot be performed in a parallel worker, but that
+    can be performed in the leader while parallel query is in use.  Therefore,
+    parallel restricted operations can never occur below a <literal>Gather</literal>
+    or <literal>Gather Merge</literal> node, but can occur elsewhere in a plan that
+    contains such a node.  A parallel unsafe operation is one that cannot
+    be performed while parallel query is in use, not even in the leader.
+    When a query contains anything that is parallel unsafe, parallel query
+    is completely disabled for that query.
+  </para>
+
+  <para>
+    The following operations are always parallel restricted:
+  </para>
+
+  <itemizedlist>
+    <listitem>
+      <para>
+        Scans of common table expressions (CTEs).
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        Scans of temporary tables.
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        Scans of foreign tables, unless the foreign data wrapper has
+        an <literal>IsForeignScanParallelSafe</literal> API that indicates otherwise.
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        Plan nodes to which an <literal>InitPlan</literal> is attached.
+      </para>
+    </listitem>
+
+    <listitem>
+      <para>
+        Plan nodes that reference a correlated <literal>SubPlan</literal>.
+      </para>
+    </listitem>
+  </itemizedlist>
+
+ <sect2 id="parallel-labeling">
+  <title>Parallel Labeling for Functions and Aggregates</title>
+
+  <para>
+    The planner cannot automatically determine whether a user-defined
+    function or aggregate is parallel safe, parallel restricted, or parallel
+    unsafe, because this would require predicting every operation that the
+    function could possibly perform.  In general, this is equivalent to the
+    Halting Problem and therefore impossible.  Even for simple functions
+    where it could conceivably be done, we do not try, since this would be expensive
+    and error-prone.  Instead, all user-defined functions are assumed to
+    be parallel unsafe unless otherwise marked.  When using
+    <xref linkend="sql-createfunction"/> or
+    <xref linkend="sql-alterfunction"/>, markings can be set by specifying
+    <literal>PARALLEL SAFE</literal>, <literal>PARALLEL RESTRICTED</literal>, or
+    <literal>PARALLEL UNSAFE</literal> as appropriate.  When using
+    <xref linkend="sql-createaggregate"/>, the
+    <literal>PARALLEL</literal> option can be specified with <literal>SAFE</literal>,
+    <literal>RESTRICTED</literal>, or <literal>UNSAFE</literal> as the corresponding value.
+  </para>
+
+  <para>
+    Functions and aggregates must be marked <literal>PARALLEL UNSAFE</literal> if
+    they write to the database, access sequences, change the transaction state
+    even temporarily (e.g., a PL/pgSQL function that establishes an
+    <literal>EXCEPTION</literal> block to catch errors), or make persistent changes to
+    settings.  Similarly, functions must be marked <literal>PARALLEL
+    RESTRICTED</literal> if they access temporary tables, client connection state,
+    cursors, prepared statements, or miscellaneous backend-local state that
+    the system cannot synchronize across workers. For example,
+    <literal>setseed</literal> and <literal>random</literal> are parallel restricted for
+    this last reason.
+  </para>
+
+  <para>
+    In general, if a function is labeled as being safe when it is restricted or
+    unsafe, or if it is labeled as being restricted when it is in fact unsafe,
+    it may throw errors or produce wrong answers when used in a parallel query.
+    C-language functions could in theory exhibit totally undefined behavior if
+    mislabeled, since there is no way for the system to protect itself against
+    arbitrary C code, but in most likely cases the result will be no worse than
+    for any other function. If in doubt, it is probably best to label functions
+    as <literal>UNSAFE</literal>.
+  </para>
+
+  <para>
+    If a function executed within a parallel worker acquires locks that are
+    not held by the leader, for example by querying a table not referenced in
+    the query, those locks will be released at worker exit, not end of
+    transaction. If you write a function that does this, and this behavior
+    difference is important to you, mark such functions as
+    <literal>PARALLEL RESTRICTED</literal>
+    to ensure that they execute only in the leader.
+  </para>
+
+  <para>
+    Note that the query planner does not consider deferring the evaluation of
+    parallel-restricted functions or aggregates involved in the query in
+    order to obtain a superior plan.  So, for example, if a <literal>WHERE</literal>
+    clause applied to a particular table is parallel restricted, the query
+    planner will not consider performing a scan of that table in the parallel
+    portion of a plan.  In some cases, it would be
+    possible (and perhaps even efficient) to include the scan of that table in
+    the parallel portion of the query and defer the evaluation of the
+    <literal>WHERE</literal> clause so that it happens above the <literal>Gather</literal>
+    node.  However, the planner does not do this.
+  </para>
+
+ </sect2>
+
+ </sect1>
+
+ </chapter>