diff options
Diffstat (limited to 'doc/src/sgml/arch-dev.sgml')
-rw-r--r-- | doc/src/sgml/arch-dev.sgml | 575 |
1 files changed, 575 insertions, 0 deletions
diff --git a/doc/src/sgml/arch-dev.sgml b/doc/src/sgml/arch-dev.sgml new file mode 100644 index 0000000..8aeac02 --- /dev/null +++ b/doc/src/sgml/arch-dev.sgml @@ -0,0 +1,575 @@ +<!-- doc/src/sgml/arch-dev.sgml --> + + <chapter id="overview"> + <title>Overview of PostgreSQL Internals</title> + + <note> + <title>Author</title> + <para> + This chapter originated as part of + <xref linkend="sim98"/> Stefan Simkovics' + Master's Thesis prepared at Vienna University of Technology under the direction + of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr. + </para> + </note> + + <para> + This chapter gives an overview of the internal structure of the + backend of <productname>PostgreSQL</productname>. After having + read the following sections you should have an idea of how a query + is processed. This chapter is intended to help the reader + understand the general sequence of operations that occur within the + backend from the point at which a query is received, to the point + at which the results are returned to the client. + </para> + + <sect1 id="query-path"> + <title>The Path of a Query</title> + + <para> + Here we give a short overview of the stages a query has to pass + to obtain a result. + </para> + + <procedure> + <step> + <para> + A connection from an application program to the <productname>PostgreSQL</productname> + server has to be established. The application program transmits a + query to the server and waits to receive the results sent back by the + server. + </para> + </step> + + <step> + <para> + The <firstterm>parser stage</firstterm> checks the query + transmitted by the application + program for correct syntax and creates + a <firstterm>query tree</firstterm>. + </para> + </step> + + <step> + <para> + The <firstterm>rewrite system</firstterm> takes + the query tree created by the parser stage and looks for + any <firstterm>rules</firstterm> (stored in the + <firstterm>system catalogs</firstterm>) to apply to + the query tree. It performs the + transformations given in the <firstterm>rule bodies</firstterm>. + </para> + + <para> + One application of the rewrite system is in the realization of + <firstterm>views</firstterm>. + Whenever a query against a view + (i.e., a <firstterm>virtual table</firstterm>) is made, + the rewrite system rewrites the user's query to + a query that accesses the <firstterm>base tables</firstterm> given in + the <firstterm>view definition</firstterm> instead. + </para> + </step> + + <step> + <para> + The <firstterm>planner/optimizer</firstterm> takes + the (rewritten) query tree and creates a + <firstterm>query plan</firstterm> that will be the input to the + <firstterm>executor</firstterm>. + </para> + + <para> + It does so by first creating all possible <firstterm>paths</firstterm> + leading to the same result. For example if there is an index on a + relation to be scanned, there are two paths for the + scan. One possibility is a simple sequential scan and the other + possibility is to use the index. Next the cost for the execution of + each path is estimated and the cheapest path is chosen. The cheapest + path is expanded into a complete plan that the executor can use. + </para> + </step> + + <step> + <para> + The executor recursively steps through + the <firstterm>plan tree</firstterm> and + retrieves rows in the way represented by the plan. + The executor makes use of the + <firstterm>storage system</firstterm> while scanning + relations, performs <firstterm>sorts</firstterm> and <firstterm>joins</firstterm>, + evaluates <firstterm>qualifications</firstterm> and finally hands back the rows derived. + </para> + </step> + </procedure> + + <para> + In the following sections we will cover each of the above listed items + in more detail to give a better understanding of <productname>PostgreSQL</productname>'s internal + control and data structures. + </para> + </sect1> + + <sect1 id="connect-estab"> + <title>How Connections Are Established</title> + + <para> + <productname>PostgreSQL</productname> implements a + <quote>process per user</quote> client/server model. + In this model, every + <glossterm linkend="glossary-client">client process</glossterm> + connects to exactly one + <glossterm linkend="glossary-backend">backend process</glossterm>. + As we do not know ahead of time how many connections will be made, + we have to use a <quote>supervisor process</quote> that spawns a new + backend process every time a connection is requested. This supervisor + process is called + <glossterm linkend="glossary-postmaster">postmaster</glossterm> + and listens at a specified TCP/IP port for incoming connections. + Whenever it detects a request for a connection, it spawns a new + backend process. Those backend processes communicate with each + other and with other processes of the + <glossterm linkend="glossary-instance">instance</glossterm> + using <firstterm>semaphores</firstterm> and + <glossterm linkend="glossary-shared-memory">shared memory</glossterm> + to ensure data integrity throughout concurrent data access. + </para> + + <para> + The client process can be any program that understands the + <productname>PostgreSQL</productname> protocol described in + <xref linkend="protocol"/>. Many clients are based on the + C-language library <application>libpq</application>, but several independent + implementations of the protocol exist, such as the Java + <application>JDBC</application> driver. + </para> + + <para> + Once a connection is established, the client process can send a query + to the backend process it's connected to. The query is transmitted using + plain text, i.e., there is no parsing done in the client. The backend + process parses the query, creates an <firstterm>execution plan</firstterm>, + executes the plan, and returns the retrieved rows to the client + by transmitting them over the established connection. + </para> + </sect1> + + <sect1 id="parser-stage"> + <title>The Parser Stage</title> + + <para> + The <firstterm>parser stage</firstterm> consists of two parts: + + <itemizedlist> + <listitem> + <para> + The <firstterm>parser</firstterm> defined in + <filename>gram.y</filename> and <filename>scan.l</filename> is + built using the Unix tools <application>bison</application> + and <application>flex</application>. + </para> + </listitem> + <listitem> + <para> + The <firstterm>transformation process</firstterm> does + modifications and augmentations to the data structures returned by the parser. + </para> + </listitem> + </itemizedlist> + </para> + + <sect2> + <title>Parser</title> + + <para> + The parser has to check the query string (which arrives as plain + text) for valid syntax. If the syntax is correct a + <firstterm>parse tree</firstterm> is built up and handed back; + otherwise an error is returned. The parser and lexer are + implemented using the well-known Unix tools <application>bison</application> + and <application>flex</application>. + </para> + + <para> + The <firstterm>lexer</firstterm> is defined in the file + <filename>scan.l</filename> and is responsible + for recognizing <firstterm>identifiers</firstterm>, + the <firstterm>SQL key words</firstterm> etc. For + every key word or identifier that is found, a <firstterm>token</firstterm> + is generated and handed to the parser. + </para> + + <para> + The parser is defined in the file <filename>gram.y</filename> and + consists of a set of <firstterm>grammar rules</firstterm> and + <firstterm>actions</firstterm> that are executed whenever a rule + is fired. The code of the actions (which is actually C code) is + used to build up the parse tree. + </para> + + <para> + The file <filename>scan.l</filename> is transformed to the C + source file <filename>scan.c</filename> using the program + <application>flex</application> and <filename>gram.y</filename> is + transformed to <filename>gram.c</filename> using + <application>bison</application>. After these transformations + have taken place a normal C compiler can be used to create the + parser. Never make any changes to the generated C files as they + will be overwritten the next time <application>flex</application> + or <application>bison</application> is called. + + <note> + <para> + The mentioned transformations and compilations are normally done + automatically using the <firstterm>makefiles</firstterm> + shipped with the <productname>PostgreSQL</productname> + source distribution. + </para> + </note> + </para> + + <para> + A detailed description of <application>bison</application> or + the grammar rules given in <filename>gram.y</filename> would be + beyond the scope of this manual. There are many books and + documents dealing with <application>flex</application> and + <application>bison</application>. You should be familiar with + <application>bison</application> before you start to study the + grammar given in <filename>gram.y</filename> otherwise you won't + understand what happens there. + </para> + + </sect2> + + <sect2> + <title>Transformation Process</title> + + <para> + The parser stage creates a parse tree using only fixed rules about + the syntactic structure of SQL. It does not make any lookups in the + system catalogs, so there is no possibility to understand the detailed + semantics of the requested operations. After the parser completes, + the <firstterm>transformation process</firstterm> takes the tree handed + back by the parser as input and does the semantic interpretation needed + to understand which tables, functions, and operators are referenced by + the query. The data structure that is built to represent this + information is called the <firstterm>query tree</firstterm>. + </para> + + <para> + The reason for separating raw parsing from semantic analysis is that + system catalog lookups can only be done within a transaction, and we + do not wish to start a transaction immediately upon receiving a query + string. The raw parsing stage is sufficient to identify the transaction + control commands (<command>BEGIN</command>, <command>ROLLBACK</command>, etc.), and + these can then be correctly executed without any further analysis. + Once we know that we are dealing with an actual query (such as + <command>SELECT</command> or <command>UPDATE</command>), it is okay to + start a transaction if we're not already in one. Only then can the + transformation process be invoked. + </para> + + <para> + The query tree created by the transformation process is structurally + similar to the raw parse tree in most places, but it has many differences + in detail. For example, a <structname>FuncCall</structname> node in the + parse tree represents something that looks syntactically like a function + call. This might be transformed to either a <structname>FuncExpr</structname> + or <structname>Aggref</structname> node depending on whether the referenced + name turns out to be an ordinary function or an aggregate function. + Also, information about the actual data types of columns and expression + results is added to the query tree. + </para> + </sect2> + </sect1> + + <sect1 id="rule-system"> + <title>The <productname>PostgreSQL</productname> Rule System</title> + + <para> + <productname>PostgreSQL</productname> supports a powerful + <firstterm>rule system</firstterm> for the specification + of <firstterm>views</firstterm> and ambiguous <firstterm>view updates</firstterm>. + Originally the <productname>PostgreSQL</productname> + rule system consisted of two implementations: + + <itemizedlist> + <listitem> + <para> + The first one worked using <firstterm>row level</firstterm> processing and was + implemented deep in the <firstterm>executor</firstterm>. The rule system was + called whenever an individual row had been accessed. This + implementation was removed in 1995 when the last official release + of the <productname>Berkeley Postgres</productname> project was + transformed into <productname>Postgres95</productname>. + </para> + </listitem> + + <listitem> + <para> + The second implementation of the rule system is a technique + called <firstterm>query rewriting</firstterm>. + The <firstterm>rewrite system</firstterm> is a module + that exists between the <firstterm>parser stage</firstterm> and the + <firstterm>planner/optimizer</firstterm>. This technique is still implemented. + </para> + </listitem> + </itemizedlist> + </para> + + <para> + The query rewriter is discussed in some detail in + <xref linkend="rules"/>, so there is no need to cover it here. + We will only point out that both the input and the output of the + rewriter are query trees, that is, there is no change in the + representation or level of semantic detail in the trees. Rewriting + can be thought of as a form of macro expansion. + </para> + + </sect1> + + <sect1 id="planner-optimizer"> + <title>Planner/Optimizer</title> + + <para> + The task of the <firstterm>planner/optimizer</firstterm> is to + create an optimal execution plan. A given SQL query (and hence, a + query tree) can be actually executed in a wide variety of + different ways, each of which will produce the same set of + results. If it is computationally feasible, the query optimizer + will examine each of these possible execution plans, ultimately + selecting the execution plan that is expected to run the fastest. + </para> + + <note> + <para> + In some situations, examining each possible way in which a query + can be executed would take an excessive amount of time and memory. + In particular, this occurs when executing queries + involving large numbers of join operations. In order to determine + a reasonable (not necessarily optimal) query plan in a reasonable amount + of time, <productname>PostgreSQL</productname> uses a <firstterm>Genetic + Query Optimizer</firstterm> (see <xref linkend="geqo"/>) when the number of joins + exceeds a threshold (see <xref linkend="guc-geqo-threshold"/>). + </para> + </note> + + <para> + The planner's search procedure actually works with data structures + called <firstterm>paths</firstterm>, which are simply cut-down representations of + plans containing only as much information as the planner needs to make + its decisions. After the cheapest path is determined, a full-fledged + <firstterm>plan tree</firstterm> is built to pass to the executor. This represents + the desired execution plan in sufficient detail for the executor to run it. + In the rest of this section we'll ignore the distinction between paths + and plans. + </para> + + <sect2> + <title>Generating Possible Plans</title> + + <para> + The planner/optimizer starts by generating plans for scanning each + individual relation (table) used in the query. The possible plans + are determined by the available indexes on each relation. + There is always the possibility of performing a + sequential scan on a relation, so a sequential scan plan is always + created. Assume an index is defined on a + relation (for example a B-tree index) and a query contains the + restriction + <literal>relation.attribute OPR constant</literal>. If + <literal>relation.attribute</literal> happens to match the key of the B-tree + index and <literal>OPR</literal> is one of the operators listed in + the index's <firstterm>operator class</firstterm>, another plan is created using + the B-tree index to scan the relation. If there are further indexes + present and the restrictions in the query happen to match a key of an + index, further plans will be considered. Index scan plans are also + generated for indexes that have a sort ordering that can match the + query's <literal>ORDER BY</literal> clause (if any), or a sort ordering that + might be useful for merge joining (see below). + </para> + + <para> + If the query requires joining two or more relations, + plans for joining relations are considered + after all feasible plans have been found for scanning single relations. + The three available join strategies are: + + <itemizedlist> + <listitem> + <para> + <firstterm>nested loop join</firstterm>: The right relation is scanned + once for every row found in the left relation. This strategy + is easy to implement but can be very time consuming. (However, + if the right relation can be scanned with an index scan, this can + be a good strategy. It is possible to use values from the current + row of the left relation as keys for the index scan of the right.) + </para> + </listitem> + + <listitem> + <para> + <firstterm>merge join</firstterm>: Each relation is sorted on the join + attributes before the join starts. Then the two relations are + scanned in parallel, and matching rows are combined to form + join rows. This kind of join is + attractive because each relation has to be scanned only once. + The required sorting might be achieved either by an explicit sort + step, or by scanning the relation in the proper order using an + index on the join key. + </para> + </listitem> + + <listitem> + <para> + <firstterm>hash join</firstterm>: the right relation is first scanned + and loaded into a hash table, using its join attributes as hash keys. + Next the left relation is scanned and the + appropriate values of every row found are used as hash keys to + locate the matching rows in the table. + </para> + </listitem> + </itemizedlist> + </para> + + <para> + When the query involves more than two relations, the final result + must be built up by a tree of join steps, each with two inputs. + The planner examines different possible join sequences to find the + cheapest one. + </para> + + <para> + If the query uses fewer than <xref linkend="guc-geqo-threshold"/> + relations, a near-exhaustive search is conducted to find the best + join sequence. The planner preferentially considers joins between any + two relations for which there exists a corresponding join clause in the + <literal>WHERE</literal> qualification (i.e., for + which a restriction like <literal>where rel1.attr1=rel2.attr2</literal> + exists). Join pairs with no join clause are considered only when there + is no other choice, that is, a particular relation has no available + join clauses to any other relation. All possible plans are generated for + every join pair considered by the planner, and the one that is + (estimated to be) the cheapest is chosen. + </para> + + <para> + When <varname>geqo_threshold</varname> is exceeded, the join + sequences considered are determined by heuristics, as described + in <xref linkend="geqo"/>. Otherwise the process is the same. + </para> + + <para> + The finished plan tree consists of sequential or index scans of + the base relations, plus nested-loop, merge, or hash join nodes as + needed, plus any auxiliary steps needed, such as sort nodes or + aggregate-function calculation nodes. Most of these plan node + types have the additional ability to do <firstterm>selection</firstterm> + (discarding rows that do not meet a specified Boolean condition) + and <firstterm>projection</firstterm> (computation of a derived column set + based on given column values, that is, evaluation of scalar + expressions where needed). One of the responsibilities of the + planner is to attach selection conditions from the + <literal>WHERE</literal> clause and computation of required + output expressions to the most appropriate nodes of the plan + tree. + </para> + </sect2> + </sect1> + + <sect1 id="executor"> + <title>Executor</title> + + <para> + The <firstterm>executor</firstterm> takes the plan created by the + planner/optimizer and recursively processes it to extract the required set + of rows. This is essentially a demand-pull pipeline mechanism. + Each time a plan node is called, it must deliver one more row, or + report that it is done delivering rows. + </para> + + <para> + To provide a concrete example, assume that the top + node is a <literal>MergeJoin</literal> node. + Before any merge can be done two rows have to be fetched (one from + each subplan). So the executor recursively calls itself to + process the subplans (it starts with the subplan attached to + <literal>lefttree</literal>). The new top node (the top node of the left + subplan) is, let's say, a + <literal>Sort</literal> node and again recursion is needed to obtain + an input row. The child node of the <literal>Sort</literal> might + be a <literal>SeqScan</literal> node, representing actual reading of a table. + Execution of this node causes the executor to fetch a row from the + table and return it up to the calling node. The <literal>Sort</literal> + node will repeatedly call its child to obtain all the rows to be sorted. + When the input is exhausted (as indicated by the child node returning + a NULL instead of a row), the <literal>Sort</literal> code performs + the sort, and finally is able to return its first output row, namely + the first one in sorted order. It keeps the remaining rows stored so + that it can deliver them in sorted order in response to later demands. + </para> + + <para> + The <literal>MergeJoin</literal> node similarly demands the first row + from its right subplan. Then it compares the two rows to see if they + can be joined; if so, it returns a join row to its caller. On the next + call, or immediately if it cannot join the current pair of inputs, + it advances to the next row of one table + or the other (depending on how the comparison came out), and again + checks for a match. Eventually, one subplan or the other is exhausted, + and the <literal>MergeJoin</literal> node returns NULL to indicate that + no more join rows can be formed. + </para> + + <para> + Complex queries can involve many levels of plan nodes, but the general + approach is the same: each node computes and returns its next output + row each time it is called. Each node is also responsible for applying + any selection or projection expressions that were assigned to it by + the planner. + </para> + + <para> + The executor mechanism is used to evaluate all five basic SQL query + types: <command>SELECT</command>, <command>INSERT</command>, + <command>UPDATE</command>, <command>DELETE</command>, and + <command>MERGE</command>. + For <command>SELECT</command>, the top-level executor code + only needs to send each row returned by the query plan tree + off to the client. <command>INSERT ... SELECT</command>, + <command>UPDATE</command>, <command>DELETE</command>, and + <command>MERGE</command> + are effectively <command>SELECT</command>s under a special + top-level plan node called <literal>ModifyTable</literal>. + </para> + + <para> + <command>INSERT ... SELECT</command> feeds the rows up + to <literal>ModifyTable</literal> for insertion. For + <command>UPDATE</command>, the planner arranges that each + computed row includes all the updated column values, plus the + <firstterm>TID</firstterm> (tuple ID, or row ID) of the original + target row; this data is fed up to the <literal>ModifyTable</literal> + node, which uses the information to create a new updated row and + mark the old row deleted. For <command>DELETE</command>, the only + column that is actually returned by the plan is the TID, and the + <literal>ModifyTable</literal> node simply uses the TID to visit each + target row and mark it deleted. For <command>MERGE</command>, the + planner joins the source and target relations, and includes all + column values required by any of the <literal>WHEN</literal> clauses, + plus the TID of the target row; this data is fed up to the + <literal>ModifyTable</literal> node, which uses the information to + work out which <literal>WHEN</literal> clause to execute, and then + inserts, updates or deletes the target row, as required. + </para> + + <para> + A simple <command>INSERT ... VALUES</command> command creates a + trivial plan tree consisting of a single <literal>Result</literal> + node, which computes just one result row, feeding that up + to <literal>ModifyTable</literal> to perform the insertion. + </para> + + </sect1> + + </chapter> |