diff --git a/doc/src/sgml/arch-dev.sgml b/doc/src/sgml/arch-dev.sgml
new file mode 100644
index 0000000..8aeac02
--- /dev/null
+++ b/doc/src/sgml/arch-dev.sgml
@@ -0,0 +1,575 @@
+<!-- doc/src/sgml/arch-dev.sgml -->
+
+ <chapter id="overview">
+ <title>Overview of PostgreSQL Internals</title>
+
+ <note>
+ <title>Author</title>
+ <para>
+ This chapter originated as part of
+ <xref linkend="sim98"/>, Stefan Simkovics'
+ Master's Thesis, prepared at Vienna University of Technology under the direction
+ of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
+ </para>
+ </note>
+
+ <para>
+ This chapter gives an overview of the internal structure of the
+ backend of <productname>PostgreSQL</productname>. After having
+ read the following sections you should have an idea of how a query
+ is processed. This chapter is intended to help the reader
+ understand the general sequence of operations that occur within the
+ backend from the point at which a query is received, to the point
+ at which the results are returned to the client.
+ </para>
+
+ <sect1 id="query-path">
+ <title>The Path of a Query</title>
+
+ <para>
+ Here we give a short overview of the stages a query has to pass
+ to obtain a result.
+ </para>
+
+ <procedure>
+ <step>
+ <para>
+ A connection from an application program to the <productname>PostgreSQL</productname>
+ server has to be established. The application program transmits a
+ query to the server and waits to receive the results sent back by the
+ server.
+ </para>
+ </step>
+
+ <step>
+ <para>
+ The <firstterm>parser stage</firstterm> checks the query
+ transmitted by the application
+ program for correct syntax and creates
+ a <firstterm>query tree</firstterm>.
+ </para>
+ </step>
+
+ <step>
+ <para>
+ The <firstterm>rewrite system</firstterm> takes
+ the query tree created by the parser stage and looks for
+ any <firstterm>rules</firstterm> (stored in the
+ <firstterm>system catalogs</firstterm>) to apply to
+ the query tree. It performs the
+ transformations given in the <firstterm>rule bodies</firstterm>.
+ </para>
+
+ <para>
+ One application of the rewrite system is in the realization of
+ <firstterm>views</firstterm>.
+ Whenever a query against a view
+ (i.e., a <firstterm>virtual table</firstterm>) is made,
+ the rewrite system rewrites the user's query to
+ a query that accesses the <firstterm>base tables</firstterm> given in
+ the <firstterm>view definition</firstterm> instead.
+ </para>
+ </step>
+
+ <step>
+ <para>
+ The <firstterm>planner/optimizer</firstterm> takes
+ the (rewritten) query tree and creates a
+ <firstterm>query plan</firstterm> that will be the input to the
+ <firstterm>executor</firstterm>.
+ </para>
+
+ <para>
+ It does so by first creating all possible <firstterm>paths</firstterm>
+ leading to the same result. For example, if there is an index on a
+ relation to be scanned, there are two paths for the
+ scan. One possibility is a simple sequential scan and the other
+ possibility is to use the index. Next the cost for the execution of
+ each path is estimated and the cheapest path is chosen. The cheapest
+ path is expanded into a complete plan that the executor can use.
+ </para>
+ </step>
+
+ <step>
+ <para>
+ The executor recursively steps through
+ the <firstterm>plan tree</firstterm> and
+ retrieves rows in the way represented by the plan.
+ The executor makes use of the
+ <firstterm>storage system</firstterm> while scanning
+ relations, performs <firstterm>sorts</firstterm> and <firstterm>joins</firstterm>,
+ evaluates <firstterm>qualifications</firstterm> and finally hands back the rows derived.
+ </para>
+ </step>
+ </procedure>
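The stages above can be pictured as a simple pipeline of functions. The following sketch is a deliberately simplified toy: the function names and data shapes are illustrative only, not PostgreSQL's actual internal APIs.

```python
# Hypothetical sketch of the backend's query pipeline. All names and
# structures are invented for illustration; they are not PostgreSQL's code.

def parse(sql):
    # Parser stage: check syntax and build a (mock) query tree.
    return {"kind": "query", "text": sql}

def rewrite(query_tree):
    # Rewrite system: apply any rules (e.g., expand views); identity here.
    return query_tree

def plan(query_tree):
    # Planner/optimizer: choose the cheapest path, emit a plan tree.
    return {"kind": "plan", "source": query_tree}

def execute(plan_tree):
    # Executor: walk the plan tree and produce rows.
    return [("row", 1), ("row", 2)]

def run_query(sql):
    # The stages compose: parse -> rewrite -> plan -> execute.
    return execute(plan(rewrite(parse(sql))))

rows = run_query("SELECT 1")
```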
+
+ <para>
+ In the following sections we will cover each of the above listed items
+ in more detail to give a better understanding of <productname>PostgreSQL</productname>'s internal
+ control and data structures.
+ </para>
+ </sect1>
+
+ <sect1 id="connect-estab">
+ <title>How Connections Are Established</title>
+
+ <para>
+ <productname>PostgreSQL</productname> implements a
+ <quote>process per user</quote> client/server model.
+ In this model, every
+ <glossterm linkend="glossary-client">client process</glossterm>
+ connects to exactly one
+ <glossterm linkend="glossary-backend">backend process</glossterm>.
+ As we do not know ahead of time how many connections will be made,
+ we have to use a <quote>supervisor process</quote> that spawns a new
+ backend process every time a connection is requested. This supervisor
+ process is called
+ <glossterm linkend="glossary-postmaster">postmaster</glossterm>
+ and listens at a specified TCP/IP port for incoming connections.
+ Whenever it detects a request for a connection, it spawns a new
+ backend process. Those backend processes communicate with each
+ other and with other processes of the
+ <glossterm linkend="glossary-instance">instance</glossterm>
+ using <firstterm>semaphores</firstterm> and
+ <glossterm linkend="glossary-shared-memory">shared memory</glossterm>
+ to ensure data integrity throughout concurrent data access.
+ </para>
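The supervisor-plus-backends arrangement can be modeled in a few lines. PostgreSQL forks a real operating-system process per connection; for a portable, testable sketch this toy uses threads instead, which is an intentional simplification.

```python
# Toy model of the process-per-connection idea: a "postmaster" spawns one
# worker per incoming connection. Real PostgreSQL forks OS processes; this
# sketch substitutes threads purely for portability.
import queue
import threading

results = queue.Queue()

def backend(client_id, sql):
    # Each "backend" serves exactly one client connection.
    results.put((client_id, f"processed: {sql}"))

def postmaster(connection_requests):
    # The supervisor spawns a new backend per connection request.
    workers = []
    for client_id, sql in connection_requests:
        t = threading.Thread(target=backend, args=(client_id, sql))
        t.start()
        workers.append(t)
    for t in workers:
        t.join()

postmaster([(1, "SELECT 1"), (2, "SELECT 2")])
served = sorted(results.queue)
```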
+
+ <para>
+ The client process can be any program that understands the
+ <productname>PostgreSQL</productname> protocol described in
+ <xref linkend="protocol"/>. Many clients are based on the
+ C-language library <application>libpq</application>, but several independent
+ implementations of the protocol exist, such as the Java
+ <application>JDBC</application> driver.
+ </para>
+
+ <para>
+ Once a connection is established, the client process can send a query
+ to the backend process it is connected to. The query is transmitted using
+ plain text, i.e., there is no parsing done in the client. The backend
+ process parses the query, creates an <firstterm>execution plan</firstterm>,
+ executes the plan, and returns the retrieved rows to the client
+ by transmitting them over the established connection.
+ </para>
+ </sect1>
+
+ <sect1 id="parser-stage">
+ <title>The Parser Stage</title>
+
+ <para>
+ The <firstterm>parser stage</firstterm> consists of two parts:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The <firstterm>parser</firstterm> defined in
+ <filename>gram.y</filename> and <filename>scan.l</filename> is
+ built using the Unix tools <application>bison</application>
+ and <application>flex</application>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ The <firstterm>transformation process</firstterm>
+ modifies and augments the data structures returned by the parser.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <sect2>
+ <title>Parser</title>
+
+ <para>
+ The parser has to check the query string (which arrives as plain
+ text) for valid syntax. If the syntax is correct a
+ <firstterm>parse tree</firstterm> is built up and handed back;
+ otherwise an error is returned. The parser and lexer are
+ implemented using the well-known Unix tools <application>bison</application>
+ and <application>flex</application>.
+ </para>
+
+ <para>
+ The <firstterm>lexer</firstterm> is defined in the file
+ <filename>scan.l</filename> and is responsible
+ for recognizing <firstterm>identifiers</firstterm>,
+ the <firstterm>SQL key words</firstterm> etc. For
+ every key word or identifier that is found, a <firstterm>token</firstterm>
+ is generated and handed to the parser.
+ </para>
+
+ <para>
+ The parser is defined in the file <filename>gram.y</filename> and
+ consists of a set of <firstterm>grammar rules</firstterm> and
+ <firstterm>actions</firstterm> that are executed whenever a rule
+ is fired. The code of the actions (which is actually C code) is
+ used to build up the parse tree.
+ </para>
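The division of labor between <filename>scan.l</filename> and <filename>gram.y</filename> can be sketched as follows: a lexer turns the query string into tokens, and a grammar "action" builds a parse-tree node. This is an illustrative toy for one statement form, not PostgreSQL's actual grammar.

```python
# Minimal lexer/parser sketch in the spirit of scan.l and gram.y.
# Keywords, tokens, and tree shapes are invented for illustration.
import re

KEYWORDS = {"select", "from"}

def lex(sql):
    # Emit (token_type, value) pairs, like the flex-generated lexer.
    tokens = []
    for word in re.findall(r"\w+|\*|,", sql):
        if word.lower() in KEYWORDS:
            tokens.append(("KEYWORD", word.lower()))
        else:
            tokens.append(("IDENT", word))
    return tokens

def parse(tokens):
    # Grammar rule: SELECT target_list FROM table_name.
    assert tokens[0] == ("KEYWORD", "select"), "expected SELECT"
    from_idx = tokens.index(("KEYWORD", "from"))
    targets = [t[1] for t in tokens[1:from_idx] if t[1] != ","]
    table = tokens[from_idx + 1][1]
    # The "action" part: build a parse-tree node.
    return {"type": "SelectStmt", "targets": targets, "from": table}

tree = parse(lex("SELECT a, b FROM t"))
```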
+
+ <para>
+ The file <filename>scan.l</filename> is transformed to the C
+ source file <filename>scan.c</filename> using the program
+ <application>flex</application> and <filename>gram.y</filename> is
+ transformed to <filename>gram.c</filename> using
+ <application>bison</application>. After these transformations
+ have taken place a normal C compiler can be used to create the
+ parser. Never make any changes to the generated C files as they
+ will be overwritten the next time <application>flex</application>
+ or <application>bison</application> is called.
+
+ <note>
+ <para>
+ The mentioned transformations and compilations are normally done
+ automatically using the <firstterm>makefiles</firstterm>
+ shipped with the <productname>PostgreSQL</productname>
+ source distribution.
+ </para>
+ </note>
+ </para>
+
+ <para>
+ A detailed description of <application>bison</application> or
+ the grammar rules given in <filename>gram.y</filename> would be
+ beyond the scope of this manual. There are many books and
+ documents dealing with <application>flex</application> and
+ <application>bison</application>. You should be familiar with
+ <application>bison</application> before you start to study the
+ grammar given in <filename>gram.y</filename>; otherwise you won't
+ understand what happens there.
+ </para>
+
+ </sect2>
+
+ <sect2>
+ <title>Transformation Process</title>
+
+ <para>
+ The parser stage creates a parse tree using only fixed rules about
+ the syntactic structure of SQL. It does not make any lookups in the
+ system catalogs, so it cannot understand the detailed
+ semantics of the requested operations. After the parser completes,
+ the <firstterm>transformation process</firstterm> takes the tree handed
+ back by the parser as input and does the semantic interpretation needed
+ to understand which tables, functions, and operators are referenced by
+ the query. The data structure that is built to represent this
+ information is called the <firstterm>query tree</firstterm>.
+ </para>
+
+ <para>
+ The reason for separating raw parsing from semantic analysis is that
+ system catalog lookups can only be done within a transaction, and we
+ do not wish to start a transaction immediately upon receiving a query
+ string. The raw parsing stage is sufficient to identify the transaction
+ control commands (<command>BEGIN</command>, <command>ROLLBACK</command>, etc.), and
+ these can then be correctly executed without any further analysis.
+ Once we know that we are dealing with an actual query (such as
+ <command>SELECT</command> or <command>UPDATE</command>), it is okay to
+ start a transaction if we're not already in one. Only then can the
+ transformation process be invoked.
+ </para>
+
+ <para>
+ The query tree created by the transformation process is structurally
+ similar to the raw parse tree in most places, but it has many differences
+ in detail. For example, a <structname>FuncCall</structname> node in the
+ parse tree represents something that looks syntactically like a function
+ call. This might be transformed to either a <structname>FuncExpr</structname>
+ or <structname>Aggref</structname> node depending on whether the referenced
+ name turns out to be an ordinary function or an aggregate function.
+ Also, information about the actual data types of columns and expression
+ results is added to the query tree.
+ </para>
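The <structname>FuncCall</structname> resolution described above can be sketched with a mock catalog lookup. The catalog contents and node shapes here are invented for illustration; real transformation consults <structname>pg_proc</structname> and related catalogs.

```python
# Sketch of the transformation step: a syntactic FuncCall node is resolved
# against a (mock) catalog into a FuncExpr or an Aggref node, and result
# type information is attached. The catalog entries are invented.
MOCK_CATALOG = {
    "lower": {"is_aggregate": False, "result_type": "text"},
    "count": {"is_aggregate": True, "result_type": "bigint"},
}

def transform_funccall(funccall):
    # Semantic lookup: only now do we learn what the name refers to.
    entry = MOCK_CATALOG[funccall["name"]]
    node_type = "Aggref" if entry["is_aggregate"] else "FuncExpr"
    # Type information is added during transformation.
    return {"type": node_type, "name": funccall["name"],
            "result_type": entry["result_type"]}

plain = transform_funccall({"type": "FuncCall", "name": "lower"})
agg = transform_funccall({"type": "FuncCall", "name": "count"})
```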
+ </sect2>
+ </sect1>
+
+ <sect1 id="rule-system">
+ <title>The <productname>PostgreSQL</productname> Rule System</title>
+
+ <para>
+ <productname>PostgreSQL</productname> supports a powerful
+ <firstterm>rule system</firstterm> for the specification
+ of <firstterm>views</firstterm> and ambiguous <firstterm>view updates</firstterm>.
+ Originally the <productname>PostgreSQL</productname>
+ rule system consisted of two implementations:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ The first one worked using <firstterm>row level</firstterm> processing and was
+ implemented deep in the <firstterm>executor</firstterm>. The rule system was
+ called whenever an individual row had been accessed. This
+ implementation was removed in 1995 when the last official release
+ of the <productname>Berkeley Postgres</productname> project was
+ transformed into <productname>Postgres95</productname>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The second implementation of the rule system is a technique
+ called <firstterm>query rewriting</firstterm>.
+ The <firstterm>rewrite system</firstterm> is a module
+ that exists between the <firstterm>parser stage</firstterm> and the
+ <firstterm>planner/optimizer</firstterm>. This technique is still implemented.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <para>
+ The query rewriter is discussed in some detail in
+ <xref linkend="rules"/>, so there is no need to cover it here.
+ We will only point out that both the input and the output of the
+ rewriter are query trees, that is, there is no change in the
+ representation or level of semantic detail in the trees. Rewriting
+ can be thought of as a form of macro expansion.
+ </para>
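The macro-expansion view of rewriting can be sketched concretely: a query tree referencing a view comes in, and a query tree with the view's defining subquery spliced in comes out. The tree shapes below are invented for illustration.

```python
# Toy query rewriter: query tree in, query tree out. A reference to a view
# is replaced by the view's defining subquery, like macro expansion.
# The view definition and tree layout are illustrative only.
VIEW_DEFS = {
    "active_users": {"type": "Query", "from": "users",
                     "where": "active = true"},
}

def rewrite(query_tree):
    # If the FROM item names a view, splice in its definition.
    target = query_tree["from"]
    if target in VIEW_DEFS:
        expanded = dict(query_tree)
        expanded["from"] = {"subquery": VIEW_DEFS[target]}
        return expanded
    return query_tree

rewritten = rewrite({"type": "Query", "from": "active_users",
                     "targets": ["id"]})
```

Note that the output is still a query tree at the same level of semantic detail, which is exactly why the planner can consume it unchanged.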
+
+ </sect1>
+
+ <sect1 id="planner-optimizer">
+ <title>Planner/Optimizer</title>
+
+ <para>
+ The task of the <firstterm>planner/optimizer</firstterm> is to
+ create an optimal execution plan. A given SQL query (and hence, a
+ query tree) can be actually executed in a wide variety of
+ different ways, each of which will produce the same set of
+ results. If it is computationally feasible, the query optimizer
+ will examine each of these possible execution plans, ultimately
+ selecting the execution plan that is expected to run the fastest.
+ </para>
+
+ <note>
+ <para>
+ In some situations, examining each possible way in which a query
+ can be executed would take an excessive amount of time and memory.
+ In particular, this occurs when executing queries
+ involving large numbers of join operations. In order to determine
+ a reasonable (not necessarily optimal) query plan in a reasonable amount
+ of time, <productname>PostgreSQL</productname> uses a <firstterm>Genetic
+ Query Optimizer</firstterm> (see <xref linkend="geqo"/>) when the number of joins
+ exceeds a threshold (see <xref linkend="guc-geqo-threshold"/>).
+ </para>
+ </note>
+
+ <para>
+ The planner's search procedure actually works with data structures
+ called <firstterm>paths</firstterm>, which are simply cut-down representations of
+ plans containing only as much information as the planner needs to make
+ its decisions. After the cheapest path is determined, a full-fledged
+ <firstterm>plan tree</firstterm> is built to pass to the executor. This represents
+ the desired execution plan in sufficient detail for the executor to run it.
+ In the rest of this section we'll ignore the distinction between paths
+ and plans.
+ </para>
+
+ <sect2>
+ <title>Generating Possible Plans</title>
+
+ <para>
+ The planner/optimizer starts by generating plans for scanning each
+ individual relation (table) used in the query. The possible plans
+ are determined by the available indexes on each relation.
+ There is always the possibility of performing a
+ sequential scan on a relation, so a sequential scan plan is always
+ created. Assume an index is defined on a
+ relation (for example a B-tree index) and a query contains the
+ restriction
+ <literal>relation.attribute OPR constant</literal>. If
+ <literal>relation.attribute</literal> happens to match the key of the B-tree
+ index and <literal>OPR</literal> is one of the operators listed in
+ the index's <firstterm>operator class</firstterm>, another plan is created using
+ the B-tree index to scan the relation. If there are further indexes
+ present and the restrictions in the query happen to match a key of an
+ index, further plans will be considered. Index scan plans are also
+ generated for indexes that have a sort ordering that can match the
+ query's <literal>ORDER BY</literal> clause (if any), or a sort ordering that
+ might be useful for merge joining (see below).
+ </para>
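Scan-path generation and costing can be sketched as follows. The cost formulas are invented stand-ins, not PostgreSQL's real cost model; the point is only that a sequential-scan path always exists, an index path is added when an index matches, and the cheapest estimate wins.

```python
# Toy scan-path costing. A sequential scan is always possible; an index
# scan is added when an index matches the restriction. Cost formulas here
# are invented for illustration, not PostgreSQL's actual model.
def scan_paths(n_pages, n_rows, selectivity, has_matching_index):
    paths = [("SeqScan", n_pages + 0.01 * n_rows)]  # always created
    if has_matching_index:
        # An index scan touches only the selected fraction of the table,
        # plus a small fixed cost for descending the index.
        paths.append(("IndexScan", 2.0 + selectivity * (n_pages + n_rows)))
    # The cheapest estimated path is chosen.
    return min(paths, key=lambda p: p[1])

selective = scan_paths(1000, 100000, 0.001, has_matching_index=True)
full_scan = scan_paths(1000, 100000, 0.9, has_matching_index=True)
```

Under these toy numbers a highly selective restriction favors the index, while retrieving most of the table favors the sequential scan.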
+
+ <para>
+ If the query requires joining two or more relations,
+ plans for joining relations are considered
+ after all feasible plans have been found for scanning single relations.
+ The three available join strategies are:
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <firstterm>nested loop join</firstterm>: The right relation is scanned
+ once for every row found in the left relation. This strategy
+ is easy to implement but can be very time consuming. (However,
+ if the right relation can be scanned with an index scan, this can
+ be a good strategy. It is possible to use values from the current
+ row of the left relation as keys for the index scan of the right.)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <firstterm>merge join</firstterm>: Each relation is sorted on the join
+ attributes before the join starts. Then the two relations are
+ scanned in parallel, and matching rows are combined to form
+ join rows. This kind of join is
+ attractive because each relation has to be scanned only once.
+ The required sorting might be achieved either by an explicit sort
+ step, or by scanning the relation in the proper order using an
+ index on the join key.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <firstterm>hash join</firstterm>: The right relation is first scanned
+ and loaded into a hash table, using its join attributes as hash keys.
+ Next the left relation is scanned and the
+ appropriate values of every row found are used as hash keys to
+ locate the matching rows in the table.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
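Two of the strategies above can be sketched over plain Python lists. These toy versions join on a single attribute and ignore costing; they are illustrative only.

```python
# Toy join strategies over lists of dict "rows", joining on one attribute.
def nested_loop_join(left, right, key):
    # Rescan the right input once per left row.
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    # Build a hash table on the right input, then probe it with left rows,
    # so each input is scanned only once.
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    return [(l, r) for l in left for r in table.get(l[key], [])]

left = [{"id": 1}, {"id": 2}]
right = [{"id": 2}, {"id": 3}]
nl = nested_loop_join(left, right, "id")
hj = hash_join(left, right, "id")
```

Both strategies produce the same join rows; they differ only in how much scanning and auxiliary work they do, which is what the planner's cost estimates compare.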
+
+ <para>
+ When the query involves more than two relations, the final result
+ must be built up by a tree of join steps, each with two inputs.
+ The planner examines different possible join sequences to find the
+ cheapest one.
+ </para>
+
+ <para>
+ If the query uses fewer than <xref linkend="guc-geqo-threshold"/>
+ relations, a near-exhaustive search is conducted to find the best
+ join sequence. The planner preferentially considers joins between any
+ two relations for which there exists a corresponding join clause in the
+ <literal>WHERE</literal> qualification (i.e., for
+ which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
+ exists). Join pairs with no join clause are considered only when there
+ is no other choice, that is, a particular relation has no available
+ join clauses to any other relation. All possible plans are generated for
+ every join pair considered by the planner, and the one that is
+ (estimated to be) the cheapest is chosen.
+ </para>
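The near-exhaustive search can be sketched as enumerating join orders while penalizing pairs that lack a join clause. The costs below are invented for illustration; the real planner builds join relations bottom-up with far more sophisticated estimates.

```python
# Toy join-order search: enumerate all orders, strongly penalize joining
# relations with no connecting join clause, keep the cheapest. The cost
# numbers are invented; this is not PostgreSQL's actual search.
from itertools import permutations

def cheapest_join_order(rels, sizes, join_clauses):
    best = None
    for order in permutations(rels):
        cost, joined = 0, {order[0]}
        for rel in order[1:]:
            # Prefer pairs connected by a WHERE join clause.
            connected = any(frozenset((rel, j)) in join_clauses
                            for j in joined)
            cost += sizes[rel] * (1 if connected else 100)
            joined.add(rel)
        if best is None or cost < best[1]:
            best = (order, cost)
    return best[0]

order = cheapest_join_order(
    ["a", "b", "c"], {"a": 10, "b": 20, "c": 30},
    {frozenset(("a", "b")), frozenset(("b", "c"))})
```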
+
+ <para>
+ When <varname>geqo_threshold</varname> is exceeded, the join
+ sequences considered are determined by heuristics, as described
+ in <xref linkend="geqo"/>. Otherwise the process is the same.
+ </para>
+
+ <para>
+ The finished plan tree consists of sequential or index scans of
+ the base relations, plus nested-loop, merge, or hash join nodes as
+ needed, plus any auxiliary steps needed, such as sort nodes or
+ aggregate-function calculation nodes. Most of these plan node
+ types have the additional ability to do <firstterm>selection</firstterm>
+ (discarding rows that do not meet a specified Boolean condition)
+ and <firstterm>projection</firstterm> (computation of a derived column set
+ based on given column values, that is, evaluation of scalar
+ expressions where needed). One of the responsibilities of the
+ planner is to attach selection conditions from the
+ <literal>WHERE</literal> clause and computation of required
+ output expressions to the most appropriate nodes of the plan
+ tree.
+ </para>
+ </sect2>
+ </sect1>
+
+ <sect1 id="executor">
+ <title>Executor</title>
+
+ <para>
+ The <firstterm>executor</firstterm> takes the plan created by the
+ planner/optimizer and recursively processes it to extract the required set
+ of rows. This is essentially a demand-pull pipeline mechanism.
+ Each time a plan node is called, it must deliver one more row, or
+ report that it is done delivering rows.
+ </para>
+
+ <para>
+ To provide a concrete example, assume that the top
+ node is a <literal>MergeJoin</literal> node.
+ Before any merge can be done two rows have to be fetched (one from
+ each subplan). So the executor recursively calls itself to
+ process the subplans (it starts with the subplan attached to
+ <literal>lefttree</literal>). The new top node (the top node of the left
+ subplan) is, let's say, a
+ <literal>Sort</literal> node and again recursion is needed to obtain
+ an input row. The child node of the <literal>Sort</literal> might
+ be a <literal>SeqScan</literal> node, representing actual reading of a table.
+ Execution of this node causes the executor to fetch a row from the
+ table and return it up to the calling node. The <literal>Sort</literal>
+ node will repeatedly call its child to obtain all the rows to be sorted.
+ When the input is exhausted (as indicated by the child node returning
+ a NULL instead of a row), the <literal>Sort</literal> code performs
+ the sort, and finally is able to return its first output row, namely
+ the first one in sorted order. It keeps the remaining rows stored so
+ that it can deliver them in sorted order in response to later demands.
+ </para>
+
+ <para>
+ The <literal>MergeJoin</literal> node similarly demands the first row
+ from its right subplan. Then it compares the two rows to see if they
+ can be joined; if so, it returns a join row to its caller. On the next
+ call, or immediately if it cannot join the current pair of inputs,
+ it advances to the next row of one table
+ or the other (depending on how the comparison came out), and again
+ checks for a match. Eventually, one subplan or the other is exhausted,
+ and the <literal>MergeJoin</literal> node returns NULL to indicate that
+ no more join rows can be formed.
+ </para>
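The demand-pull pipeline just described maps naturally onto Python generators: each plan node yields its next row only when asked. The node names below mirror the text; everything else is an illustrative simplification (in particular, this merge join advances both inputs on a match, so it ignores duplicate join keys).

```python
# Generator-based sketch of the demand-pull executor: SeqScan feeds Sort,
# and two sorted streams feed MergeJoin. Simplified for illustration.
def seq_scan(table):
    # Leaf node: read rows from a (mock) table, one per demand.
    for row in table:
        yield row

def sort_node(child, key):
    # Blocking node: drain the child completely, then emit in order.
    yield from sorted(child, key=key)

def merge_join(left, right, key):
    # Both inputs must arrive sorted on the join key.
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if key(l) == key(r):
            yield (l, r)
            l, r = next(left, None), next(right, None)
        elif key(l) < key(r):
            l = next(left, None)
        else:
            r = next(right, None)

left_table = [3, 1, 2]
right_table = [2, 4, 3]
plan = merge_join(sort_node(seq_scan(left_table), key=lambda x: x),
                  sort_node(seq_scan(right_table), key=lambda x: x),
                  key=lambda x: x)
joined = list(plan)
```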
+
+ <para>
+ Complex queries can involve many levels of plan nodes, but the general
+ approach is the same: each node computes and returns its next output
+ row each time it is called. Each node is also responsible for applying
+ any selection or projection expressions that were assigned to it by
+ the planner.
+ </para>
+
+ <para>
+ The executor mechanism is used to evaluate all five basic SQL query
+ types: <command>SELECT</command>, <command>INSERT</command>,
+ <command>UPDATE</command>, <command>DELETE</command>, and
+ <command>MERGE</command>.
+ For <command>SELECT</command>, the top-level executor code
+ only needs to send each row returned by the query plan tree
+ off to the client. <command>INSERT ... SELECT</command>,
+ <command>UPDATE</command>, <command>DELETE</command>, and
+ <command>MERGE</command>
+ are effectively <command>SELECT</command>s under a special
+ top-level plan node called <literal>ModifyTable</literal>.
+ </para>
+
+ <para>
+ <command>INSERT ... SELECT</command> feeds the rows up
+ to <literal>ModifyTable</literal> for insertion. For
+ <command>UPDATE</command>, the planner arranges that each
+ computed row includes all the updated column values, plus the
+ <firstterm>TID</firstterm> (tuple ID, or row ID) of the original
+ target row; this data is fed up to the <literal>ModifyTable</literal>
+ node, which uses the information to create a new updated row and
+ mark the old row deleted. For <command>DELETE</command>, the only
+ column that is actually returned by the plan is the TID, and the
+ <literal>ModifyTable</literal> node simply uses the TID to visit each
+ target row and mark it deleted. For <command>MERGE</command>, the
+ planner joins the source and target relations, and includes all
+ column values required by any of the <literal>WHEN</literal> clauses,
+ plus the TID of the target row; this data is fed up to the
+ <literal>ModifyTable</literal> node, which uses the information to
+ work out which <literal>WHEN</literal> clause to execute, and then
+ inserts, updates or deletes the target row, as required.
+ </para>
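The TID-driven behavior of <literal>ModifyTable</literal> can be modeled in miniature. Here a TID is just a list index and a <literal>dead</literal> flag stands in for marking a row version deleted; real TIDs are (block, offset) pairs, so this is a loose simplification, not PostgreSQL's storage format.

```python
# Toy model of TID-driven UPDATE and DELETE under ModifyTable. A TID is
# modeled as a list index; "dead" marks an old row version. Real tuple IDs
# and row versioning are more complex; this is illustrative only.
table = [{"id": 1, "name": "a", "dead": False},
         {"id": 2, "name": "b", "dead": False}]

def modify_table_delete(tid):
    # DELETE: the plan returns only the TID; mark that row dead.
    table[tid]["dead"] = True

def modify_table_update(tid, new_values):
    # UPDATE: create a new row version and mark the old one dead.
    new_row = {**table[tid], **new_values, "dead": False}
    table[tid]["dead"] = True
    table.append(new_row)

modify_table_delete(0)
modify_table_update(1, {"name": "b2"})
live = [r for r in table if not r["dead"]]
```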
+
+ <para>
+ A simple <command>INSERT ... VALUES</command> command creates a
+ trivial plan tree consisting of a single <literal>Result</literal>
+ node, which computes just one result row, feeding that up
+ to <literal>ModifyTable</literal> to perform the insertion.
+ </para>
+
+ </sect1>
+
+ </chapter>