diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-13 13:44:03 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-13 13:44:03 +0000 |
commit | 293913568e6a7a86fd1479e1cff8e2ecb58d6568 (patch) | |
tree | fc3b469a3ec5ab71b36ea97cc7aaddb838423a0c /doc/src/sgml/queries.sgml | |
parent | Initial commit. (diff) | |
download | postgresql-16-293913568e6a7a86fd1479e1cff8e2ecb58d6568.tar.xz postgresql-16-293913568e6a7a86fd1479e1cff8e2ecb58d6568.zip |
Adding upstream version 16.2.upstream/16.2
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r-- | doc/src/sgml/queries.sgml | 2746 |
1 files changed, 2746 insertions, 0 deletions
diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml new file mode 100644 index 0000000..3f95849 --- /dev/null +++ b/doc/src/sgml/queries.sgml @@ -0,0 +1,2746 @@ +<!-- doc/src/sgml/queries.sgml --> + +<chapter id="queries"> + <title>Queries</title> + + <indexterm zone="queries"> + <primary>query</primary> + </indexterm> + + <indexterm zone="queries"> + <primary>SELECT</primary> + </indexterm> + + <para> + The previous chapters explained how to create tables, how to fill + them with data, and how to manipulate that data. Now we finally + discuss how to retrieve the data from the database. + </para> + + + <sect1 id="queries-overview"> + <title>Overview</title> + + <para> + The process of retrieving or the command to retrieve data from a + database is called a <firstterm>query</firstterm>. In SQL the + <link linkend="sql-select"><command>SELECT</command></link> command is + used to specify queries. The general syntax of the + <command>SELECT</command> command is +<synopsis> +<optional>WITH <replaceable>with_queries</replaceable></optional> SELECT <replaceable>select_list</replaceable> FROM <replaceable>table_expression</replaceable> <optional><replaceable>sort_specification</replaceable></optional> +</synopsis> + The following sections describe the details of the select list, the + table expression, and the sort specification. <literal>WITH</literal> + queries are treated last since they are an advanced feature. + </para> + + <para> + A simple kind of query has the form: +<programlisting> +SELECT * FROM table1; +</programlisting> + Assuming that there is a table called <literal>table1</literal>, + this command would retrieve all rows and all user-defined columns from + <literal>table1</literal>. (The method of retrieval depends on the + client application. For example, the + <application>psql</application> program will display an ASCII-art + table on the screen, while client libraries will offer functions to + extract individual values from the query result.) The select list + specification <literal>*</literal> means all columns that the table + expression happens to provide. A select list can also select a + subset of the available columns or make calculations using the + columns. For example, if + <literal>table1</literal> has columns named <literal>a</literal>, + <literal>b</literal>, and <literal>c</literal> (and perhaps others) you can make + the following query: +<programlisting> +SELECT a, b + c FROM table1; +</programlisting> + (assuming that <literal>b</literal> and <literal>c</literal> are of a numerical + data type). + See <xref linkend="queries-select-lists"/> for more details. + </para> + + <para> + <literal>FROM table1</literal> is a simple kind of + table expression: it reads just one table. In general, table + expressions can be complex constructs of base tables, joins, and + subqueries. But you can also omit the table expression entirely and + use the <command>SELECT</command> command as a calculator: +<programlisting> +SELECT 3 * 4; +</programlisting> + This is more useful if the expressions in the select list return + varying results. For example, you could call a function this way: +<programlisting> +SELECT random(); +</programlisting> + </para> + </sect1> + + + <sect1 id="queries-table-expressions"> + <title>Table Expressions</title> + + <indexterm zone="queries-table-expressions"> + <primary>table expression</primary> + </indexterm> + + <para> + A <firstterm>table expression</firstterm> computes a table. The + table expression contains a <literal>FROM</literal> clause that is + optionally followed by <literal>WHERE</literal>, <literal>GROUP BY</literal>, and + <literal>HAVING</literal> clauses. Trivial table expressions simply refer + to a table on disk, a so-called base table, but more complex + expressions can be used to modify or combine base tables in various + ways. + </para> + + <para> + The optional <literal>WHERE</literal>, <literal>GROUP BY</literal>, and + <literal>HAVING</literal> clauses in the table expression specify a + pipeline of successive transformations performed on the table + derived in the <literal>FROM</literal> clause. All these transformations + produce a virtual table that provides the rows that are passed to + the select list to compute the output rows of the query. + </para> + + <sect2 id="queries-from"> + <title>The <literal>FROM</literal> Clause</title> + + <para> + The <link linkend="sql-from"><literal>FROM</literal></link> clause derives a + table from one or more other tables given in a comma-separated + table reference list. +<synopsis> +FROM <replaceable>table_reference</replaceable> <optional>, <replaceable>table_reference</replaceable> <optional>, ...</optional></optional> +</synopsis> + + A table reference can be a table name (possibly schema-qualified), + or a derived table such as a subquery, a <literal>JOIN</literal> construct, or + complex combinations of these. If more than one table reference is + listed in the <literal>FROM</literal> clause, the tables are cross-joined + (that is, the Cartesian product of their rows is formed; see below). + The result of the <literal>FROM</literal> list is an intermediate virtual + table that can then be subject to + transformations by the <literal>WHERE</literal>, <literal>GROUP BY</literal>, + and <literal>HAVING</literal> clauses and is finally the result of the + overall table expression. + </para> + + <indexterm> + <primary>ONLY</primary> + </indexterm> + + <para> + When a table reference names a table that is the parent of a + table inheritance hierarchy, the table reference produces rows of + not only that table but all of its descendant tables, unless the + key word <literal>ONLY</literal> precedes the table name. However, the + reference produces only the columns that appear in the named table + — any columns added in subtables are ignored. + </para> + + <para> + Instead of writing <literal>ONLY</literal> before the table name, you can write + <literal>*</literal> after the table name to explicitly specify that descendant + tables are included. There is no real reason to use this syntax any more, + because searching descendant tables is now always the default behavior. + However, it is supported for compatibility with older releases. + </para> + + <sect3 id="queries-join"> + <title>Joined Tables</title> + + <indexterm zone="queries-join"> + <primary>join</primary> + </indexterm> + + <para> + A joined table is a table derived from two other (real or + derived) tables according to the rules of the particular join + type. Inner, outer, and cross-joins are available. + The general syntax of a joined table is +<synopsis> +<replaceable>T1</replaceable> <replaceable>join_type</replaceable> <replaceable>T2</replaceable> <optional> <replaceable>join_condition</replaceable> </optional> +</synopsis> + Joins of all types can be chained together, or nested: either or + both <replaceable>T1</replaceable> and + <replaceable>T2</replaceable> can be joined tables. Parentheses + can be used around <literal>JOIN</literal> clauses to control the join + order. In the absence of parentheses, <literal>JOIN</literal> clauses + nest left-to-right. + </para> + + <variablelist> + <title>Join Types</title> + + <varlistentry> + <term>Cross join + <indexterm> + <primary>join</primary> + <secondary>cross</secondary> + </indexterm> + + <indexterm> + <primary>cross join</primary> + </indexterm> + </term> + + <listitem> +<synopsis> +<replaceable>T1</replaceable> CROSS JOIN <replaceable>T2</replaceable> +</synopsis> + + <para> + For every possible combination of rows from + <replaceable>T1</replaceable> and + <replaceable>T2</replaceable> (i.e., a Cartesian product), + the joined table will contain a + row consisting of all columns in <replaceable>T1</replaceable> + followed by all columns in <replaceable>T2</replaceable>. If + the tables have N and M rows respectively, the joined + table will have N * M rows. + </para> + + <para> + <literal>FROM <replaceable>T1</replaceable> CROSS JOIN + <replaceable>T2</replaceable></literal> is equivalent to + <literal>FROM <replaceable>T1</replaceable> INNER JOIN + <replaceable>T2</replaceable> ON TRUE</literal> (see below). + It is also equivalent to + <literal>FROM <replaceable>T1</replaceable>, + <replaceable>T2</replaceable></literal>. + <note> + <para> + This latter equivalence does not hold exactly when more than two + tables appear, because <literal>JOIN</literal> binds more tightly than + comma. For example + <literal>FROM <replaceable>T1</replaceable> CROSS JOIN + <replaceable>T2</replaceable> INNER JOIN <replaceable>T3</replaceable> + ON <replaceable>condition</replaceable></literal> + is not the same as + <literal>FROM <replaceable>T1</replaceable>, + <replaceable>T2</replaceable> INNER JOIN <replaceable>T3</replaceable> + ON <replaceable>condition</replaceable></literal> + because the <replaceable>condition</replaceable> can + reference <replaceable>T1</replaceable> in the first case but not + the second. + </para> + </note> + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>Qualified joins + <indexterm> + <primary>join</primary> + <secondary>outer</secondary> + </indexterm> + + <indexterm> + <primary>outer join</primary> + </indexterm> + </term> + + <listitem> +<synopsis> +<replaceable>T1</replaceable> { <optional>INNER</optional> | { LEFT | RIGHT | FULL } <optional>OUTER</optional> } JOIN <replaceable>T2</replaceable> ON <replaceable>boolean_expression</replaceable> +<replaceable>T1</replaceable> { <optional>INNER</optional> | { LEFT | RIGHT | FULL } <optional>OUTER</optional> } JOIN <replaceable>T2</replaceable> USING ( <replaceable>join column list</replaceable> ) +<replaceable>T1</replaceable> NATURAL { <optional>INNER</optional> | { LEFT | RIGHT | FULL } <optional>OUTER</optional> } JOIN <replaceable>T2</replaceable> +</synopsis> + + <para> + The words <literal>INNER</literal> and + <literal>OUTER</literal> are optional in all forms. + <literal>INNER</literal> is the default; + <literal>LEFT</literal>, <literal>RIGHT</literal>, and + <literal>FULL</literal> imply an outer join. + </para> + + <para> + The <firstterm>join condition</firstterm> is specified in the + <literal>ON</literal> or <literal>USING</literal> clause, or implicitly by + the word <literal>NATURAL</literal>. The join condition determines + which rows from the two source tables are considered to + <quote>match</quote>, as explained in detail below. + </para> + + <para> + The possible types of qualified join are: + + <variablelist> + <varlistentry> + <term><literal>INNER JOIN</literal></term> + + <listitem> + <para> + For each row R1 of T1, the joined table has a row for each + row in T2 that satisfies the join condition with R1. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><literal>LEFT OUTER JOIN</literal> + <indexterm> + <primary>join</primary> + <secondary>left</secondary> + </indexterm> + + <indexterm> + <primary>left join</primary> + </indexterm> + </term> + + <listitem> + <para> + First, an inner join is performed. Then, for each row in + T1 that does not satisfy the join condition with any row in + T2, a joined row is added with null values in columns of + T2. Thus, the joined table always has at least + one row for each row in T1. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><literal>RIGHT OUTER JOIN</literal> + <indexterm> + <primary>join</primary> + <secondary>right</secondary> + </indexterm> + + <indexterm> + <primary>right join</primary> + </indexterm> + </term> + + <listitem> + <para> + First, an inner join is performed. Then, for each row in + T2 that does not satisfy the join condition with any row in + T1, a joined row is added with null values in columns of + T1. This is the converse of a left join: the result table + will always have a row for each row in T2. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><literal>FULL OUTER JOIN</literal></term> + + <listitem> + <para> + First, an inner join is performed. Then, for each row in + T1 that does not satisfy the join condition with any row in + T2, a joined row is added with null values in columns of + T2. Also, for each row of T2 that does not satisfy the + join condition with any row in T1, a joined row with null + values in the columns of T1 is added. + </para> + </listitem> + </varlistentry> + </variablelist> + </para> + + <para> + The <literal>ON</literal> clause is the most general kind of join + condition: it takes a Boolean value expression of the same + kind as is used in a <literal>WHERE</literal> clause. A pair of rows + from <replaceable>T1</replaceable> and <replaceable>T2</replaceable> match if the + <literal>ON</literal> expression evaluates to true. + </para> + + <para> + The <literal>USING</literal> clause is a shorthand that allows you to take + advantage of the specific situation where both sides of the join use + the same name for the joining column(s). It takes a + comma-separated list of the shared column names + and forms a join condition that includes an equality comparison + for each one. For example, joining <replaceable>T1</replaceable> + and <replaceable>T2</replaceable> with <literal>USING (a, b)</literal> produces + the join condition <literal>ON <replaceable>T1</replaceable>.a + = <replaceable>T2</replaceable>.a AND <replaceable>T1</replaceable>.b + = <replaceable>T2</replaceable>.b</literal>. + </para> + + <para> + Furthermore, the output of <literal>JOIN USING</literal> suppresses + redundant columns: there is no need to print both of the matched + columns, since they must have equal values. While <literal>JOIN + ON</literal> produces all columns from <replaceable>T1</replaceable> followed by all + columns from <replaceable>T2</replaceable>, <literal>JOIN USING</literal> produces one + output column for each of the listed column pairs (in the listed + order), followed by any remaining columns from <replaceable>T1</replaceable>, + followed by any remaining columns from <replaceable>T2</replaceable>. + </para> + + <para> + <indexterm> + <primary>join</primary> + <secondary>natural</secondary> + </indexterm> + <indexterm> + <primary>natural join</primary> + </indexterm> + Finally, <literal>NATURAL</literal> is a shorthand form of + <literal>USING</literal>: it forms a <literal>USING</literal> list + consisting of all column names that appear in both + input tables. As with <literal>USING</literal>, these columns appear + only once in the output table. If there are no common + column names, <literal>NATURAL JOIN</literal> behaves like + <literal>JOIN ... ON TRUE</literal>, producing a cross-product join. + </para> + + <note> + <para> + <literal>USING</literal> is reasonably safe from column changes + in the joined relations since only the listed columns + are combined. <literal>NATURAL</literal> is considerably more risky since + any schema changes to either relation that cause a new matching + column name to be present will cause the join to combine that new + column as well. + </para> + </note> + </listitem> + </varlistentry> + </variablelist> + + <para> + To put this together, assume we have tables <literal>t1</literal>: +<programlisting> + num | name +-----+------ + 1 | a + 2 | b + 3 | c +</programlisting> + and <literal>t2</literal>: +<programlisting> + num | value +-----+------- + 1 | xxx + 3 | yyy + 5 | zzz +</programlisting> + then we get the following results for the various joins: +<screen> +<prompt>=></prompt> <userinput>SELECT * FROM t1 CROSS JOIN t2;</userinput> + num | name | num | value +-----+------+-----+------- + 1 | a | 1 | xxx + 1 | a | 3 | yyy + 1 | a | 5 | zzz + 2 | b | 1 | xxx + 2 | b | 3 | yyy + 2 | b | 5 | zzz + 3 | c | 1 | xxx + 3 | c | 3 | yyy + 3 | c | 5 | zzz +(9 rows) + +<prompt>=></prompt> <userinput>SELECT * FROM t1 INNER JOIN t2 ON t1.num = t2.num;</userinput> + num | name | num | value +-----+------+-----+------- + 1 | a | 1 | xxx + 3 | c | 3 | yyy +(2 rows) + +<prompt>=></prompt> <userinput>SELECT * FROM t1 INNER JOIN t2 USING (num);</userinput> + num | name | value +-----+------+------- + 1 | a | xxx + 3 | c | yyy +(2 rows) + +<prompt>=></prompt> <userinput>SELECT * FROM t1 NATURAL INNER JOIN t2;</userinput> + num | name | value +-----+------+------- + 1 | a | xxx + 3 | c | yyy +(2 rows) + +<prompt>=></prompt> <userinput>SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num;</userinput> + num | name | num | value +-----+------+-----+------- + 1 | a | 1 | xxx + 2 | b | | + 3 | c | 3 | yyy +(3 rows) + +<prompt>=></prompt> <userinput>SELECT * FROM t1 LEFT JOIN t2 USING (num);</userinput> + num | name | value +-----+------+------- + 1 | a | xxx + 2 | b | + 3 | c | yyy +(3 rows) + +<prompt>=></prompt> <userinput>SELECT * FROM t1 RIGHT JOIN t2 ON t1.num = t2.num;</userinput> + num | name | num | value +-----+------+-----+------- + 1 | a | 1 | xxx + 3 | c | 3 | yyy + | | 5 | zzz +(3 rows) + +<prompt>=></prompt> <userinput>SELECT * FROM t1 FULL JOIN t2 ON t1.num = t2.num;</userinput> + num | name | num | value +-----+------+-----+------- + 1 | a | 1 | xxx + 2 | b | | + 3 | c | 3 | yyy + | | 5 | zzz +(4 rows) +</screen> + </para> + + <para> + The join condition specified with <literal>ON</literal> can also contain + conditions that do not relate directly to the join. This can + prove useful for some queries but needs to be thought out + carefully. For example: +<screen> +<prompt>=></prompt> <userinput>SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num AND t2.value = 'xxx';</userinput> + num | name | num | value +-----+------+-----+------- + 1 | a | 1 | xxx + 2 | b | | + 3 | c | | +(3 rows) +</screen> + Notice that placing the restriction in the <literal>WHERE</literal> clause + produces a different result: +<screen> +<prompt>=></prompt> <userinput>SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx';</userinput> + num | name | num | value +-----+------+-----+------- + 1 | a | 1 | xxx +(1 row) +</screen> + This is because a restriction placed in the <literal>ON</literal> + clause is processed <emphasis>before</emphasis> the join, while + a restriction placed in the <literal>WHERE</literal> clause is processed + <emphasis>after</emphasis> the join. + That does not matter with inner joins, but it matters a lot with outer + joins. + </para> + </sect3> + + <sect3 id="queries-table-aliases"> + <title>Table and Column Aliases</title> + + <indexterm zone="queries-table-aliases"> + <primary>alias</primary> + <secondary>in the FROM clause</secondary> + </indexterm> + + <indexterm> + <primary>label</primary> + <see>alias</see> + </indexterm> + + <para> + A temporary name can be given to tables and complex table + references to be used for references to the derived table in + the rest of the query. This is called a <firstterm>table + alias</firstterm>. + </para> + + <para> + To create a table alias, write +<synopsis> +FROM <replaceable>table_reference</replaceable> AS <replaceable>alias</replaceable> +</synopsis> + or +<synopsis> +FROM <replaceable>table_reference</replaceable> <replaceable>alias</replaceable> +</synopsis> + The <literal>AS</literal> key word is optional noise. + <replaceable>alias</replaceable> can be any identifier. + </para> + + <para> + A typical application of table aliases is to assign short + identifiers to long table names to keep the join clauses + readable. For example: +<programlisting> +SELECT * FROM some_very_long_table_name s JOIN another_fairly_long_name a ON s.id = a.num; +</programlisting> + </para> + + <para> + The alias becomes the new name of the table reference so far as the + current query is concerned — it is not allowed to refer to the + table by the original name elsewhere in the query. Thus, this is not + valid: +<programlisting> +SELECT * FROM my_table AS m WHERE my_table.a > 5; -- wrong +</programlisting> + </para> + + <para> + Table aliases are mainly for notational convenience, but it is + necessary to use them when joining a table to itself, e.g.: +<programlisting> +SELECT * FROM people AS mother JOIN people AS child ON mother.id = child.mother_id; +</programlisting> + </para> + + <para> + Parentheses are used to resolve ambiguities. In the following example, + the first statement assigns the alias <literal>b</literal> to the second + instance of <literal>my_table</literal>, but the second statement assigns the + alias to the result of the join: +<programlisting> +SELECT * FROM my_table AS a CROSS JOIN my_table AS b ... +SELECT * FROM (my_table AS a CROSS JOIN my_table) AS b ... +</programlisting> + </para> + + <para> + Another form of table aliasing gives temporary names to the columns of + the table, as well as the table itself: +<synopsis> +FROM <replaceable>table_reference</replaceable> <optional>AS</optional> <replaceable>alias</replaceable> ( <replaceable>column1</replaceable> <optional>, <replaceable>column2</replaceable> <optional>, ...</optional></optional> ) +</synopsis> + If fewer column aliases are specified than the actual table has + columns, the remaining columns are not renamed. This syntax is + especially useful for self-joins or subqueries. + </para> + + <para> + When an alias is applied to the output of a <literal>JOIN</literal> + clause, the alias hides the original + name(s) within the <literal>JOIN</literal>. For example: +<programlisting> +SELECT a.* FROM my_table AS a JOIN your_table AS b ON ... +</programlisting> + is valid SQL, but: +<programlisting> +SELECT a.* FROM (my_table AS a JOIN your_table AS b ON ...) AS c +</programlisting> + is not valid; the table alias <literal>a</literal> is not visible + outside the alias <literal>c</literal>. + </para> + </sect3> + + <sect3 id="queries-subqueries"> + <title>Subqueries</title> + + <indexterm zone="queries-subqueries"> + <primary>subquery</primary> + </indexterm> + + <para> + Subqueries specifying a derived table must be enclosed in + parentheses. They may be assigned a table alias name, and optionally + column alias names (as in <xref linkend="queries-table-aliases"/>). + For example: +<programlisting> +FROM (SELECT * FROM table1) AS alias_name +</programlisting> + </para> + + <para> + This example is equivalent to <literal>FROM table1 AS + alias_name</literal>. More interesting cases, which cannot be + reduced to a plain join, arise when the subquery involves + grouping or aggregation. + </para> + + <para> + A subquery can also be a <command>VALUES</command> list: +<programlisting> +FROM (VALUES ('anne', 'smith'), ('bob', 'jones'), ('joe', 'blow')) + AS names(first, last) +</programlisting> + Again, a table alias is optional. Assigning alias names to the columns + of the <command>VALUES</command> list is optional, but is good practice. + For more information see <xref linkend="queries-values"/>. + </para> + + <para> + According to the SQL standard, a table alias name must be supplied + for a subquery. <productname>PostgreSQL</productname> + allows <literal>AS</literal> and the alias to be omitted, but + writing one is good practice in SQL code that might be ported to + another system. + </para> + </sect3> + + <sect3 id="queries-tablefunctions"> + <title>Table Functions</title> + + <indexterm zone="queries-tablefunctions"><primary>table function</primary></indexterm> + + <indexterm zone="queries-tablefunctions"> + <primary>function</primary> + <secondary>in the FROM clause</secondary> + </indexterm> + + <para> + Table functions are functions that produce a set of rows, made up + of either base data types (scalar types) or composite data types + (table rows). They are used like a table, view, or subquery in + the <literal>FROM</literal> clause of a query. Columns returned by table + functions can be included in <literal>SELECT</literal>, + <literal>JOIN</literal>, or <literal>WHERE</literal> clauses in the same manner + as columns of a table, view, or subquery. + </para> + + <para> + Table functions may also be combined using the <literal>ROWS FROM</literal> + syntax, with the results returned in parallel columns; the number of + result rows in this case is that of the largest function result, with + smaller results padded with null values to match. + </para> + +<synopsis> +<replaceable>function_call</replaceable> <optional>WITH ORDINALITY</optional> <optional><optional>AS</optional> <replaceable>table_alias</replaceable> <optional>(<replaceable>column_alias</replaceable> <optional>, ... </optional>)</optional></optional> +ROWS FROM( <replaceable>function_call</replaceable> <optional>, ... </optional> ) <optional>WITH ORDINALITY</optional> <optional><optional>AS</optional> <replaceable>table_alias</replaceable> <optional>(<replaceable>column_alias</replaceable> <optional>, ... </optional>)</optional></optional> +</synopsis> + + <para> + If the <literal>WITH ORDINALITY</literal> clause is specified, an + additional column of type <type>bigint</type> will be added to the + function result columns. This column numbers the rows of the function + result set, starting from 1. (This is a generalization of the + SQL-standard syntax for <literal>UNNEST ... WITH ORDINALITY</literal>.) + By default, the ordinal column is called <literal>ordinality</literal>, but + a different column name can be assigned to it using + an <literal>AS</literal> clause. + </para> + + <para> + The special table function <literal>UNNEST</literal> may be called with + any number of array parameters, and it returns a corresponding number of + columns, as if <literal>UNNEST</literal> + (<xref linkend="functions-array"/>) had been called on each parameter + separately and combined using the <literal>ROWS FROM</literal> construct. + </para> + +<synopsis> +UNNEST( <replaceable>array_expression</replaceable> <optional>, ... </optional> ) <optional>WITH ORDINALITY</optional> <optional><optional>AS</optional> <replaceable>table_alias</replaceable> <optional>(<replaceable>column_alias</replaceable> <optional>, ... </optional>)</optional></optional> +</synopsis> + + <para> + If no <replaceable>table_alias</replaceable> is specified, the function + name is used as the table name; in the case of a <literal>ROWS FROM()</literal> + construct, the first function's name is used. + </para> + + <para> + If column aliases are not supplied, then for a function returning a base + data type, the column name is also the same as the function name. For a + function returning a composite type, the result columns get the names + of the individual attributes of the type. + </para> + + <para> + Some examples: +<programlisting> +CREATE TABLE foo (fooid int, foosubid int, fooname text); + +CREATE FUNCTION getfoo(int) RETURNS SETOF foo AS $$ + SELECT * FROM foo WHERE fooid = $1; +$$ LANGUAGE SQL; + +SELECT * FROM getfoo(1) AS t1; + +SELECT * FROM foo + WHERE foosubid IN ( + SELECT foosubid + FROM getfoo(foo.fooid) z + WHERE z.fooid = foo.fooid + ); + +CREATE VIEW vw_getfoo AS SELECT * FROM getfoo(1); + +SELECT * FROM vw_getfoo; +</programlisting> + </para> + + <para> + In some cases it is useful to define table functions that can + return different column sets depending on how they are invoked. + To support this, the table function can be declared as returning + the pseudo-type <type>record</type> with no <literal>OUT</literal> + parameters. When such a function is used in + a query, the expected row structure must be specified in the + query itself, so that the system can know how to parse and plan + the query. This syntax looks like: + </para> + +<synopsis> +<replaceable>function_call</replaceable> <optional>AS</optional> <replaceable>alias</replaceable> (<replaceable>column_definition</replaceable> <optional>, ... </optional>) +<replaceable>function_call</replaceable> AS <optional><replaceable>alias</replaceable></optional> (<replaceable>column_definition</replaceable> <optional>, ... </optional>) +ROWS FROM( ... <replaceable>function_call</replaceable> AS (<replaceable>column_definition</replaceable> <optional>, ... </optional>) <optional>, ... </optional> ) +</synopsis> + + <para> + When not using the <literal>ROWS FROM()</literal> syntax, + the <replaceable>column_definition</replaceable> list replaces the column + alias list that could otherwise be attached to the <literal>FROM</literal> + item; the names in the column definitions serve as column aliases. + When using the <literal>ROWS FROM()</literal> syntax, + a <replaceable>column_definition</replaceable> list can be attached to + each member function separately; or if there is only one member function + and no <literal>WITH ORDINALITY</literal> clause, + a <replaceable>column_definition</replaceable> list can be written in + place of a column alias list following <literal>ROWS FROM()</literal>. + </para> + + <para> + Consider this example: +<programlisting> +SELECT * + FROM dblink('dbname=mydb', 'SELECT proname, prosrc FROM pg_proc') + AS t1(proname name, prosrc text) + WHERE proname LIKE 'bytea%'; +</programlisting> + The <xref linkend="contrib-dblink-function"/> function + (part of the <xref linkend="dblink"/> module) executes + a remote query. It is declared to return + <type>record</type> since it might be used for any kind of query. + The actual column set must be specified in the calling query so + that the parser knows, for example, what <literal>*</literal> should + expand to. + </para> + + <para> + This example uses <literal>ROWS FROM</literal>: +<programlisting> +SELECT * +FROM ROWS FROM + ( + json_to_recordset('[{"a":40,"b":"foo"},{"a":"100","b":"bar"}]') + AS (a INTEGER, b TEXT), + generate_series(1, 3) + ) AS x (p, q, s) +ORDER BY p; + + p | q | s +-----+-----+--- + 40 | foo | 1 + 100 | bar | 2 + | | 3 +</programlisting> + It joins two functions into a single <literal>FROM</literal> + target. <function>json_to_recordset()</function> is instructed + to return two columns, the first <type>integer</type> + and the second <type>text</type>. The result of + <function>generate_series()</function> is used directly. + The <literal>ORDER BY</literal> clause sorts the column values + as integers. + </para> + </sect3> + + <sect3 id="queries-lateral"> + <title><literal>LATERAL</literal> Subqueries</title> + + <indexterm zone="queries-lateral"> + <primary>LATERAL</primary> + <secondary>in the FROM clause</secondary> + </indexterm> + + <para> + Subqueries appearing in <literal>FROM</literal> can be + preceded by the key word <literal>LATERAL</literal>. This allows them to + reference columns provided by preceding <literal>FROM</literal> items. + (Without <literal>LATERAL</literal>, each subquery is + evaluated independently and so cannot cross-reference any other + <literal>FROM</literal> item.) + </para> + + <para> + Table functions appearing in <literal>FROM</literal> can also be + preceded by the key word <literal>LATERAL</literal>, but for functions the + key word is optional; the function's arguments can contain references + to columns provided by preceding <literal>FROM</literal> items in any case. + </para> + + <para> + A <literal>LATERAL</literal> item can appear at the top level in the + <literal>FROM</literal> list, or within a <literal>JOIN</literal> tree. In the latter + case it can also refer to any items that are on the left-hand side of a + <literal>JOIN</literal> that it is on the right-hand side of. + </para> + + <para> + When a <literal>FROM</literal> item contains <literal>LATERAL</literal> + cross-references, evaluation proceeds as follows: for each row of the + <literal>FROM</literal> item providing the cross-referenced column(s), or + set of rows of multiple <literal>FROM</literal> items providing the + columns, the <literal>LATERAL</literal> item is evaluated using that + row or row set's values of the columns. The resulting row(s) are + joined as usual with the rows they were computed from. This is + repeated for each row or set of rows from the column source table(s). + </para> + + <para> + A trivial example of <literal>LATERAL</literal> is +<programlisting> +SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss; +</programlisting> + This is not especially useful since it has exactly the same result as + the more conventional +<programlisting> +SELECT * FROM foo, bar WHERE bar.id = foo.bar_id; +</programlisting> + <literal>LATERAL</literal> is primarily useful when the cross-referenced + column is necessary for computing the row(s) to be joined. A common + application is providing an argument value for a set-returning function. + For example, supposing that <function>vertices(polygon)</function> returns the + set of vertices of a polygon, we could identify close-together vertices + of polygons stored in a table with: +<programlisting> +SELECT p1.id, p2.id, v1, v2 +FROM polygons p1, polygons p2, + LATERAL vertices(p1.poly) v1, + LATERAL vertices(p2.poly) v2 +WHERE (v1 <-> v2) < 10 AND p1.id != p2.id; +</programlisting> + This query could also be written +<programlisting> +SELECT p1.id, p2.id, v1, v2 +FROM polygons p1 CROSS JOIN LATERAL vertices(p1.poly) v1, + polygons p2 CROSS JOIN LATERAL vertices(p2.poly) v2 +WHERE (v1 <-> v2) < 10 AND p1.id != p2.id; +</programlisting> + or in several other equivalent formulations. (As already mentioned, + the <literal>LATERAL</literal> key word is unnecessary in this example, but + we use it for clarity.) + </para> + + <para> + It is often particularly handy to <literal>LEFT JOIN</literal> to a + <literal>LATERAL</literal> subquery, so that source rows will appear in + the result even if the <literal>LATERAL</literal> subquery produces no + rows for them. For example, if <function>get_product_names()</function> returns + the names of products made by a manufacturer, but some manufacturers in + our table currently produce no products, we could find out which ones + those are like this: +<programlisting> +SELECT m.name +FROM manufacturers m LEFT JOIN LATERAL get_product_names(m.id) pname ON true +WHERE pname IS NULL; +</programlisting> + </para> + </sect3> + </sect2> + + <sect2 id="queries-where"> + <title>The <literal>WHERE</literal> Clause</title> + + <indexterm zone="queries-where"> + <primary>WHERE</primary> + </indexterm> + + <para> + The syntax of the <link linkend="sql-where"><literal>WHERE</literal></link> + clause is +<synopsis> +WHERE <replaceable>search_condition</replaceable> +</synopsis> + where <replaceable>search_condition</replaceable> is any value + expression (see <xref linkend="sql-expressions"/>) that + returns a value of type <type>boolean</type>. + </para> + + <para> + After the processing of the <literal>FROM</literal> clause is done, each + row of the derived virtual table is checked against the search + condition. If the result of the condition is true, the row is + kept in the output table, otherwise (i.e., if the result is + false or null) it is discarded. The search condition typically + references at least one column of the table generated in the + <literal>FROM</literal> clause; this is not required, but otherwise the + <literal>WHERE</literal> clause will be fairly useless. + </para> + + <note> + <para> + The join condition of an inner join can be written either in + the <literal>WHERE</literal> clause or in the <literal>JOIN</literal> clause. + For example, these table expressions are equivalent: +<programlisting> +FROM a, b WHERE a.id = b.id AND b.val > 5 +</programlisting> + and: +<programlisting> +FROM a INNER JOIN b ON (a.id = b.id) WHERE b.val > 5 +</programlisting> + or perhaps even: +<programlisting> +FROM a NATURAL JOIN b WHERE b.val > 5 +</programlisting> + Which one of these you use is mainly a matter of style. The + <literal>JOIN</literal> syntax in the <literal>FROM</literal> clause is + probably not as portable to other SQL database management systems, + even though it is in the SQL standard. For + outer joins there is no choice: they must be done in + the <literal>FROM</literal> clause. The <literal>ON</literal> or <literal>USING</literal> + clause of an outer join is <emphasis>not</emphasis> equivalent to a + <literal>WHERE</literal> condition, because it results in the addition + of rows (for unmatched input rows) as well as the removal of rows + in the final result. + </para> + </note> + + <para> + Here are some examples of <literal>WHERE</literal> clauses: +<programlisting> +SELECT ... FROM fdt WHERE c1 > 5 + +SELECT ... FROM fdt WHERE c1 IN (1, 2, 3) + +SELECT ... FROM fdt WHERE c1 IN (SELECT c1 FROM t2) + +SELECT ... FROM fdt WHERE c1 IN (SELECT c3 FROM t2 WHERE c2 = fdt.c1 + 10) + +SELECT ... FROM fdt WHERE c1 BETWEEN (SELECT c3 FROM t2 WHERE c2 = fdt.c1 + 10) AND 100 + +SELECT ... FROM fdt WHERE EXISTS (SELECT c1 FROM t2 WHERE c2 > fdt.c1) +</programlisting> + <literal>fdt</literal> is the table derived in the + <literal>FROM</literal> clause. Rows that do not meet the search + condition of the <literal>WHERE</literal> clause are eliminated from + <literal>fdt</literal>. Notice the use of scalar subqueries as + value expressions. Just like any other query, the subqueries can + employ complex table expressions. Notice also how + <literal>fdt</literal> is referenced in the subqueries. + Qualifying <literal>c1</literal> as <literal>fdt.c1</literal> is only necessary + if <literal>c1</literal> is also the name of a column in the derived + input table of the subquery. But qualifying the column name adds + clarity even when it is not needed. This example shows how the column + naming scope of an outer query extends into its inner queries. + </para> + </sect2> + + + <sect2 id="queries-group"> + <title>The <literal>GROUP BY</literal> and <literal>HAVING</literal> Clauses</title> + + <indexterm zone="queries-group"> + <primary>GROUP BY</primary> + </indexterm> + + <indexterm zone="queries-group"> + <primary>grouping</primary> + </indexterm> + + <para> + After passing the <literal>WHERE</literal> filter, the derived input + table might be subject to grouping, using the <literal>GROUP BY</literal> + clause, and elimination of group rows using the <literal>HAVING</literal> + clause. + </para> + +<synopsis> +SELECT <replaceable>select_list</replaceable> + FROM ... + <optional>WHERE ...</optional> + GROUP BY <replaceable>grouping_column_reference</replaceable> <optional>, <replaceable>grouping_column_reference</replaceable></optional>... +</synopsis> + + <para> + The <link linkend="sql-groupby"><literal>GROUP BY</literal></link> clause is + used to group together those rows in a table that have the same + values in all the columns listed. The order in which the columns + are listed does not matter. The effect is to combine each set + of rows having common values into one group row that + represents all rows in the group. This is done to + eliminate redundancy in the output and/or compute aggregates that + apply to these groups. For instance: +<screen> +<prompt>=></prompt> <userinput>SELECT * FROM test1;</userinput> + x | y +---+--- + a | 3 + c | 2 + b | 5 + a | 1 +(4 rows) + +<prompt>=></prompt> <userinput>SELECT x FROM test1 GROUP BY x;</userinput> + x +--- + a + b + c +(3 rows) +</screen> + </para> + + <para> + In the second query, we could not have written <literal>SELECT * + FROM test1 GROUP BY x</literal>, because there is no single value + for the column <literal>y</literal> that could be associated with each + group. The grouped-by columns can be referenced in the select list since + they have a single value in each group. + </para> + + <para> + In general, if a table is grouped, columns that are not + listed in <literal>GROUP BY</literal> cannot be referenced except in aggregate + expressions. An example with aggregate expressions is: +<screen> +<prompt>=></prompt> <userinput>SELECT x, sum(y) FROM test1 GROUP BY x;</userinput> + x | sum +---+----- + a | 4 + b | 5 + c | 2 +(3 rows) +</screen> + Here <literal>sum</literal> is an aggregate function that + computes a single value over the entire group. More information + about the available aggregate functions can be found in <xref + linkend="functions-aggregate"/>. + </para> + + <tip> + <para> + Grouping without aggregate expressions effectively calculates the + set of distinct values in a column. This can also be achieved + using the <literal>DISTINCT</literal> clause (see <xref + linkend="queries-distinct"/>). + </para> + </tip> + + <para> + Here is another example: it calculates the total sales for each + product (rather than the total sales of all products): +<programlisting> +SELECT product_id, p.name, (sum(s.units) * p.price) AS sales + FROM products p LEFT JOIN sales s USING (product_id) + GROUP BY product_id, p.name, p.price; +</programlisting> + In this example, the columns <literal>product_id</literal>, + <literal>p.name</literal>, and <literal>p.price</literal> must be + in the <literal>GROUP BY</literal> clause since they are referenced in + the query select list (but see below). The column + <literal>s.units</literal> does not have to be in the <literal>GROUP + BY</literal> list since it is only used in an aggregate expression + (<literal>sum(...)</literal>), which represents the sales + of a product. For each product, the query returns a summary row about + all sales of the product. + </para> + + <indexterm><primary>functional dependency</primary></indexterm> + + <para> + If the products table is set up so that, say, + <literal>product_id</literal> is the primary key, then it would be + enough to group by <literal>product_id</literal> in the above example, + since name and price would be <firstterm>functionally + dependent</firstterm> on the product ID, and so there would be no + ambiguity about which name and price value to return for each product + ID group. + </para> + + <para> + In strict SQL, <literal>GROUP BY</literal> can only group by columns of + the source table but <productname>PostgreSQL</productname> extends + this to also allow <literal>GROUP BY</literal> to group by columns in the + select list. Grouping by value expressions instead of simple + column names is also allowed. + </para> + + <indexterm> + <primary>HAVING</primary> + </indexterm> + + <para> + If a table has been grouped using <literal>GROUP BY</literal>, + but only certain groups are of interest, the + <literal>HAVING</literal> clause can be used, much like a + <literal>WHERE</literal> clause, to eliminate groups from the result. + The syntax is: +<synopsis> +SELECT <replaceable>select_list</replaceable> FROM ... <optional>WHERE ...</optional> GROUP BY ... HAVING <replaceable>boolean_expression</replaceable> +</synopsis> + Expressions in the <literal>HAVING</literal> clause can refer both to + grouped expressions and to ungrouped expressions (which necessarily + involve an aggregate function). + </para> + + <para> + Example: +<screen> +<prompt>=></prompt> <userinput>SELECT x, sum(y) FROM test1 GROUP BY x HAVING sum(y) > 3;</userinput> + x | sum +---+----- + a | 4 + b | 5 +(2 rows) + +<prompt>=></prompt> <userinput>SELECT x, sum(y) FROM test1 GROUP BY x HAVING x < 'c';</userinput> + x | sum +---+----- + a | 4 + b | 5 +(2 rows) +</screen> + </para> + + <para> + Again, a more realistic example: +<programlisting> +SELECT product_id, p.name, (sum(s.units) * (p.price - p.cost)) AS profit + FROM products p LEFT JOIN sales s USING (product_id) + WHERE s.date > CURRENT_DATE - INTERVAL '4 weeks' + GROUP BY product_id, p.name, p.price, p.cost + HAVING sum(p.price * s.units) > 5000; +</programlisting> + In the example above, the <literal>WHERE</literal> clause is selecting + rows by a column that is not grouped (the expression is only true for + sales during the last four weeks), while the <literal>HAVING</literal> + clause restricts the output to groups with total gross sales over + 5000. Note that the aggregate expressions do not necessarily need + to be the same in all parts of the query. + </para> + + <para> + If a query contains aggregate function calls, but no <literal>GROUP BY</literal> + clause, grouping still occurs: the result is a single group row (or + perhaps no rows at all, if the single row is then eliminated by + <literal>HAVING</literal>). + The same is true if it contains a <literal>HAVING</literal> clause, even + without any aggregate function calls or <literal>GROUP BY</literal> clause. + </para> + </sect2> + + <sect2 id="queries-grouping-sets"> + <title><literal>GROUPING SETS</literal>, <literal>CUBE</literal>, and <literal>ROLLUP</literal></title> + + <indexterm zone="queries-grouping-sets"> + <primary>GROUPING SETS</primary> + </indexterm> + <indexterm zone="queries-grouping-sets"> + <primary>CUBE</primary> + </indexterm> + <indexterm zone="queries-grouping-sets"> + <primary>ROLLUP</primary> + </indexterm> + + <para> + More complex grouping operations than those described above are possible + using the concept of <firstterm>grouping sets</firstterm>. The data selected by + the <literal>FROM</literal> and <literal>WHERE</literal> clauses is grouped separately + by each specified grouping set, aggregates computed for each group just as + for simple <literal>GROUP BY</literal> clauses, and then the results returned. + For example: +<screen> +<prompt>=></prompt> <userinput>SELECT * FROM items_sold;</userinput> + brand | size | sales +-------+------+------- + Foo | L | 10 + Foo | M | 20 + Bar | M | 15 + Bar | L | 5 +(4 rows) + +<prompt>=></prompt> <userinput>SELECT brand, size, sum(sales) FROM items_sold GROUP BY GROUPING SETS ((brand), (size), ());</userinput> + brand | size | sum +-------+------+----- + Foo | | 30 + Bar | | 20 + | L | 15 + | M | 35 + | | 50 +(5 rows) +</screen> + </para> + + <para> + Each sublist of <literal>GROUPING SETS</literal> may specify zero or more columns + or expressions and is interpreted the same way as though it were directly + in the <literal>GROUP BY</literal> clause. An empty grouping set means that all + rows are aggregated down to a single group (which is output even if no + input rows were present), as described above for the case of aggregate + functions with no <literal>GROUP BY</literal> clause. + </para> + + <para> + References to the grouping columns or expressions are replaced + by null values in result rows for grouping sets in which those + columns do not appear. To distinguish which grouping a particular output + row resulted from, see <xref linkend="functions-grouping-table"/>. + </para> + + <para> + A shorthand notation is provided for specifying two common types of grouping set. + A clause of the form +<programlisting> +ROLLUP ( <replaceable>e1</replaceable>, <replaceable>e2</replaceable>, <replaceable>e3</replaceable>, ... ) +</programlisting> + represents the given list of expressions and all prefixes of the list including + the empty list; thus it is equivalent to +<programlisting> +GROUPING SETS ( + ( <replaceable>e1</replaceable>, <replaceable>e2</replaceable>, <replaceable>e3</replaceable>, ... ), + ... + ( <replaceable>e1</replaceable>, <replaceable>e2</replaceable> ), + ( <replaceable>e1</replaceable> ), + ( ) +) +</programlisting> + This is commonly used for analysis over hierarchical data; e.g., total + salary by department, division, and company-wide total. + </para> + + <para> + A clause of the form +<programlisting> +CUBE ( <replaceable>e1</replaceable>, <replaceable>e2</replaceable>, ... ) +</programlisting> + represents the given list and all of its possible subsets (i.e., the power + set). Thus +<programlisting> +CUBE ( a, b, c ) +</programlisting> + is equivalent to +<programlisting> +GROUPING SETS ( + ( a, b, c ), + ( a, b ), + ( a, c ), + ( a ), + ( b, c ), + ( b ), + ( c ), + ( ) +) +</programlisting> + </para> + + <para> + The individual elements of a <literal>CUBE</literal> or <literal>ROLLUP</literal> + clause may be either individual expressions, or sublists of elements in + parentheses. In the latter case, the sublists are treated as single + units for the purposes of generating the individual grouping sets. + For example: +<programlisting> +CUBE ( (a, b), (c, d) ) +</programlisting> + is equivalent to +<programlisting> +GROUPING SETS ( + ( a, b, c, d ), + ( a, b ), + ( c, d ), + ( ) +) +</programlisting> + and +<programlisting> +ROLLUP ( a, (b, c), d ) +</programlisting> + is equivalent to +<programlisting> +GROUPING SETS ( + ( a, b, c, d ), + ( a, b, c ), + ( a ), + ( ) +) +</programlisting> + </para> + + <para> + The <literal>CUBE</literal> and <literal>ROLLUP</literal> constructs can be used either + directly in the <literal>GROUP BY</literal> clause, or nested inside a + <literal>GROUPING SETS</literal> clause. If one <literal>GROUPING SETS</literal> clause + is nested inside another, the effect is the same as if all the elements of + the inner clause had been written directly in the outer clause. + </para> + + <para> + If multiple grouping items are specified in a single <literal>GROUP BY</literal> + clause, then the final list of grouping sets is the cross product of the + individual items. For example: +<programlisting> +GROUP BY a, CUBE (b, c), GROUPING SETS ((d), (e)) +</programlisting> + is equivalent to +<programlisting> +GROUP BY GROUPING SETS ( + (a, b, c, d), (a, b, c, e), + (a, b, d), (a, b, e), + (a, c, d), (a, c, e), + (a, d), (a, e) +) +</programlisting> + </para> + + <para> + <indexterm zone="queries-grouping-sets"> + <primary>ALL</primary> + <secondary>GROUP BY ALL</secondary> + </indexterm> + <indexterm zone="queries-grouping-sets"> + <primary>DISTINCT</primary> + <secondary>GROUP BY DISTINCT</secondary> + </indexterm> + When specifying multiple grouping items together, the final set of grouping + sets might contain duplicates. For example: +<programlisting> +GROUP BY ROLLUP (a, b), ROLLUP (a, c) +</programlisting> + is equivalent to +<programlisting> +GROUP BY GROUPING SETS ( + (a, b, c), + (a, b), + (a, b), + (a, c), + (a), + (a), + (a, c), + (a), + () +) +</programlisting> + If these duplicates are undesirable, they can be removed using the + <literal>DISTINCT</literal> clause directly on the <literal>GROUP BY</literal>. + Therefore: +<programlisting> +GROUP BY <emphasis>DISTINCT</emphasis> ROLLUP (a, b), ROLLUP (a, c) +</programlisting> + is equivalent to +<programlisting> +GROUP BY GROUPING SETS ( + (a, b, c), + (a, b), + (a, c), + (a), + () +) +</programlisting> + This is not the same as using <literal>SELECT DISTINCT</literal> because the output + rows may still contain duplicates. If any of the ungrouped columns contains NULL, + it will be indistinguishable from the NULL used when that same column is grouped. + </para> + + <note> + <para> + The construct <literal>(a, b)</literal> is normally recognized in expressions as + a <link linkend="sql-syntax-row-constructors">row constructor</link>. + Within the <literal>GROUP BY</literal> clause, this does not apply at the top + levels of expressions, and <literal>(a, b)</literal> is parsed as a list of + expressions as described above. If for some reason you <emphasis>need</emphasis> + a row constructor in a grouping expression, use <literal>ROW(a, b)</literal>. + </para> + </note> + </sect2> + + <sect2 id="queries-window"> + <title>Window Function Processing</title> + + <indexterm zone="queries-window"> + <primary>window function</primary> + <secondary>order of execution</secondary> + </indexterm> + + <para> + If the query contains any window functions (see + <xref linkend="tutorial-window"/>, + <xref linkend="functions-window"/> and + <xref linkend="syntax-window-functions"/>), these functions are evaluated + after any grouping, aggregation, and <literal>HAVING</literal> filtering is + performed. That is, if the query uses any aggregates, <literal>GROUP + BY</literal>, or <literal>HAVING</literal>, then the rows seen by the window functions + are the group rows instead of the original table rows from + <literal>FROM</literal>/<literal>WHERE</literal>. + </para> + + <para> + When multiple window functions are used, all the window functions having + syntactically equivalent <literal>PARTITION BY</literal> and <literal>ORDER BY</literal> + clauses in their window definitions are guaranteed to be evaluated in a + single pass over the data. Therefore they will see the same sort ordering, + even if the <literal>ORDER BY</literal> does not uniquely determine an ordering. + However, no guarantees are made about the evaluation of functions having + different <literal>PARTITION BY</literal> or <literal>ORDER BY</literal> specifications. + (In such cases a sort step is typically required between the passes of + window function evaluations, and the sort is not guaranteed to preserve + ordering of rows that its <literal>ORDER BY</literal> sees as equivalent.) + </para> + + <para> + Currently, window functions always require presorted data, and so the + query output will be ordered according to one or another of the window + functions' <literal>PARTITION BY</literal>/<literal>ORDER BY</literal> clauses. + It is not recommended to rely on this, however. Use an explicit + top-level <literal>ORDER BY</literal> clause if you want to be sure the + results are sorted in a particular way. + </para> + </sect2> + </sect1> + + + <sect1 id="queries-select-lists"> + <title>Select Lists</title> + + <indexterm> + <primary>SELECT</primary> + <secondary>select list</secondary> + </indexterm> + + <para> + As shown in the previous section, + the table expression in the <command>SELECT</command> command + constructs an intermediate virtual table by possibly combining + tables, views, eliminating rows, grouping, etc. This table is + finally passed on to processing by the <firstterm>select list</firstterm>. The select + list determines which <emphasis>columns</emphasis> of the + intermediate table are actually output. + </para> + + <sect2 id="queries-select-list-items"> + <title>Select-List Items</title> + + <indexterm> + <primary>*</primary> + </indexterm> + + <para> + The simplest kind of select list is <literal>*</literal> which + emits all columns that the table expression produces. Otherwise, + a select list is a comma-separated list of value expressions (as + defined in <xref linkend="sql-expressions"/>). For instance, it + could be a list of column names: +<programlisting> +SELECT a, b, c FROM ... +</programlisting> + The columns names <literal>a</literal>, <literal>b</literal>, and <literal>c</literal> + are either the actual names of the columns of tables referenced + in the <literal>FROM</literal> clause, or the aliases given to them as + explained in <xref linkend="queries-table-aliases"/>. The name + space available in the select list is the same as in the + <literal>WHERE</literal> clause, unless grouping is used, in which case + it is the same as in the <literal>HAVING</literal> clause. + </para> + + <para> + If more than one table has a column of the same name, the table + name must also be given, as in: +<programlisting> +SELECT tbl1.a, tbl2.a, tbl1.b FROM ... +</programlisting> + When working with multiple tables, it can also be useful to ask for + all the columns of a particular table: +<programlisting> +SELECT tbl1.*, tbl2.a FROM ... +</programlisting> + See <xref linkend="rowtypes-usage"/> for more about + the <replaceable>table_name</replaceable><literal>.*</literal> notation. + </para> + + <para> + If an arbitrary value expression is used in the select list, it + conceptually adds a new virtual column to the returned table. The + value expression is evaluated once for each result row, with + the row's values substituted for any column references. But the + expressions in the select list do not have to reference any + columns in the table expression of the <literal>FROM</literal> clause; + they can be constant arithmetic expressions, for instance. + </para> + </sect2> + + <sect2 id="queries-column-labels"> + <title>Column Labels</title> + + <indexterm zone="queries-column-labels"> + <primary>alias</primary> + <secondary>in the select list</secondary> + </indexterm> + + <para> + The entries in the select list can be assigned names for subsequent + processing, such as for use in an <literal>ORDER BY</literal> clause + or for display by the client application. For example: +<programlisting> +SELECT a AS value, b + c AS sum FROM ... +</programlisting> + </para> + + <para> + If no output column name is specified using <literal>AS</literal>, + the system assigns a default column name. For simple column references, + this is the name of the referenced column. For function + calls, this is the name of the function. For complex expressions, + the system will generate a generic name. + </para> + + <para> + The <literal>AS</literal> key word is usually optional, but in some + cases where the desired column name matches a + <productname>PostgreSQL</productname> key word, you must write + <literal>AS</literal> or double-quote the column name in order to + avoid ambiguity. + (<xref linkend="sql-keywords-appendix"/> shows which key words + require <literal>AS</literal> to be used as a column label.) + For example, <literal>FROM</literal> is one such key word, so this + does not work: +<programlisting> +SELECT a from, b + c AS sum FROM ... +</programlisting> + but either of these do: +<programlisting> +SELECT a AS from, b + c AS sum FROM ... +SELECT a "from", b + c AS sum FROM ... +</programlisting> + For greatest safety against possible + future key word additions, it is recommended that you always either + write <literal>AS</literal> or double-quote the output column name. + </para> + + <note> + <para> + The naming of output columns here is different from that done in + the <literal>FROM</literal> clause (see <xref + linkend="queries-table-aliases"/>). It is possible + to rename the same column twice, but the name assigned in + the select list is the one that will be passed on. + </para> + </note> + </sect2> + + <sect2 id="queries-distinct"> + <title><literal>DISTINCT</literal></title> + + <indexterm zone="queries-distinct"> + <primary>ALL</primary> + <secondary>SELECT ALL</secondary> + </indexterm> + <indexterm zone="queries-distinct"> + <primary>DISTINCT</primary> + <secondary>SELECT DISTINCT</secondary> + </indexterm> + + <indexterm zone="queries-distinct"> + <primary>duplicates</primary> + </indexterm> + + <para> + After the select list has been processed, the result table can + optionally be subject to the elimination of duplicate rows. The + <literal>DISTINCT</literal> key word is written directly after + <literal>SELECT</literal> to specify this: +<synopsis> +SELECT DISTINCT <replaceable>select_list</replaceable> ... +</synopsis> + (Instead of <literal>DISTINCT</literal> the key word <literal>ALL</literal> + can be used to specify the default behavior of retaining all rows.) + </para> + + <indexterm> + <primary>null value</primary> + <secondary sortas="DISTINCT">in DISTINCT</secondary> + </indexterm> + + <para> + Obviously, two rows are considered distinct if they differ in at + least one column value. Null values are considered equal in this + comparison. + </para> + + <para> + Alternatively, an arbitrary expression can determine what rows are + to be considered distinct: +<synopsis> +SELECT DISTINCT ON (<replaceable>expression</replaceable> <optional>, <replaceable>expression</replaceable> ...</optional>) <replaceable>select_list</replaceable> ... +</synopsis> + Here <replaceable>expression</replaceable> is an arbitrary value + expression that is evaluated for all rows. A set of rows for + which all the expressions are equal are considered duplicates, and + only the first row of the set is kept in the output. Note that + the <quote>first row</quote> of a set is unpredictable unless the + query is sorted on enough columns to guarantee a unique ordering + of the rows arriving at the <literal>DISTINCT</literal> filter. + (<literal>DISTINCT ON</literal> processing occurs after <literal>ORDER + BY</literal> sorting.) + </para> + + <para> + The <literal>DISTINCT ON</literal> clause is not part of the SQL standard + and is sometimes considered bad style because of the potentially + indeterminate nature of its results. With judicious use of + <literal>GROUP BY</literal> and subqueries in <literal>FROM</literal>, this + construct can be avoided, but it is often the most convenient + alternative. + </para> + </sect2> + </sect1> + + + <sect1 id="queries-union"> + <title>Combining Queries (<literal>UNION</literal>, <literal>INTERSECT</literal>, <literal>EXCEPT</literal>)</title> + + <indexterm zone="queries-union"> + <primary>UNION</primary> + </indexterm> + <indexterm zone="queries-union"> + <primary>INTERSECT</primary> + </indexterm> + <indexterm zone="queries-union"> + <primary>EXCEPT</primary> + </indexterm> + <indexterm zone="queries-union"> + <primary>set union</primary> + </indexterm> + <indexterm zone="queries-union"> + <primary>set intersection</primary> + </indexterm> + <indexterm zone="queries-union"> + <primary>set difference</primary> + </indexterm> + <indexterm zone="queries-union"> + <primary>set operation</primary> + </indexterm> + + <para> + The results of two queries can be combined using the set operations + union, intersection, and difference. The syntax is +<synopsis> +<replaceable>query1</replaceable> UNION <optional>ALL</optional> <replaceable>query2</replaceable> +<replaceable>query1</replaceable> INTERSECT <optional>ALL</optional> <replaceable>query2</replaceable> +<replaceable>query1</replaceable> EXCEPT <optional>ALL</optional> <replaceable>query2</replaceable> +</synopsis> + where <replaceable>query1</replaceable> and + <replaceable>query2</replaceable> are queries that can use any of + the features discussed up to this point. + </para> + + <para> + <literal>UNION</literal> effectively appends the result of + <replaceable>query2</replaceable> to the result of + <replaceable>query1</replaceable> (although there is no guarantee + that this is the order in which the rows are actually returned). + Furthermore, it eliminates duplicate rows from its result, in the same + way as <literal>DISTINCT</literal>, unless <literal>UNION ALL</literal> is used. + </para> + + <para> + <literal>INTERSECT</literal> returns all rows that are both in the result + of <replaceable>query1</replaceable> and in the result of + <replaceable>query2</replaceable>. Duplicate rows are eliminated + unless <literal>INTERSECT ALL</literal> is used. + </para> + + <para> + <literal>EXCEPT</literal> returns all rows that are in the result of + <replaceable>query1</replaceable> but not in the result of + <replaceable>query2</replaceable>. (This is sometimes called the + <firstterm>difference</firstterm> between two queries.) Again, duplicates + are eliminated unless <literal>EXCEPT ALL</literal> is used. + </para> + + <para> + In order to calculate the union, intersection, or difference of two + queries, the two queries must be <quote>union compatible</quote>, + which means that they return the same number of columns and + the corresponding columns have compatible data types, as + described in <xref linkend="typeconv-union-case"/>. + </para> + + <para> + Set operations can be combined, for example +<synopsis> +<replaceable>query1</replaceable> UNION <replaceable>query2</replaceable> EXCEPT <replaceable>query3</replaceable> +</synopsis> + which is equivalent to +<synopsis> +(<replaceable>query1</replaceable> UNION <replaceable>query2</replaceable>) EXCEPT <replaceable>query3</replaceable> +</synopsis> + As shown here, you can use parentheses to control the order of + evaluation. Without parentheses, <literal>UNION</literal> + and <literal>EXCEPT</literal> associate left-to-right, + but <literal>INTERSECT</literal> binds more tightly than those two + operators. Thus +<synopsis> +<replaceable>query1</replaceable> UNION <replaceable>query2</replaceable> INTERSECT <replaceable>query3</replaceable> +</synopsis> + means +<synopsis> +<replaceable>query1</replaceable> UNION (<replaceable>query2</replaceable> INTERSECT <replaceable>query3</replaceable>) +</synopsis> + You can also surround an individual <replaceable>query</replaceable> + with parentheses. This is important if + the <replaceable>query</replaceable> needs to use any of the clauses + discussed in following sections, such as <literal>LIMIT</literal>. + Without parentheses, you'll get a syntax error, or else the clause will + be understood as applying to the output of the set operation rather + than one of its inputs. For example, +<synopsis> +SELECT a FROM b UNION SELECT x FROM y LIMIT 10 +</synopsis> + is accepted, but it means +<synopsis> +(SELECT a FROM b UNION SELECT x FROM y) LIMIT 10 +</synopsis> + not +<synopsis> +SELECT a FROM b UNION (SELECT x FROM y LIMIT 10) +</synopsis> + </para> + </sect1> + + + <sect1 id="queries-order"> + <title>Sorting Rows (<literal>ORDER BY</literal>)</title> + + <indexterm zone="queries-order"> + <primary>sorting</primary> + </indexterm> + + <indexterm zone="queries-order"> + <primary>ORDER BY</primary> + </indexterm> + + <para> + After a query has produced an output table (after the select list + has been processed) it can optionally be sorted. If sorting is not + chosen, the rows will be returned in an unspecified order. The actual + order in that case will depend on the scan and join plan types and + the order on disk, but it must not be relied on. A particular + output ordering can only be guaranteed if the sort step is explicitly + chosen. + </para> + + <para> + The <literal>ORDER BY</literal> clause specifies the sort order: +<synopsis> +SELECT <replaceable>select_list</replaceable> + FROM <replaceable>table_expression</replaceable> + ORDER BY <replaceable>sort_expression1</replaceable> <optional>ASC | DESC</optional> <optional>NULLS { FIRST | LAST }</optional> + <optional>, <replaceable>sort_expression2</replaceable> <optional>ASC | DESC</optional> <optional>NULLS { FIRST | LAST }</optional> ...</optional> +</synopsis> + The sort expression(s) can be any expression that would be valid in the + query's select list. An example is: +<programlisting> +SELECT a, b FROM table1 ORDER BY a + b, c; +</programlisting> + When more than one expression is specified, + the later values are used to sort rows that are equal according to the + earlier values. Each expression can be followed by an optional + <literal>ASC</literal> or <literal>DESC</literal> keyword to set the sort direction to + ascending or descending. <literal>ASC</literal> order is the default. + Ascending order puts smaller values first, where + <quote>smaller</quote> is defined in terms of the + <literal><</literal> operator. Similarly, descending order is + determined with the <literal>></literal> operator. + <footnote> + <para> + Actually, <productname>PostgreSQL</productname> uses the <firstterm>default B-tree + operator class</firstterm> for the expression's data type to determine the sort + ordering for <literal>ASC</literal> and <literal>DESC</literal>. Conventionally, + data types will be set up so that the <literal><</literal> and + <literal>></literal> operators correspond to this sort ordering, + but a user-defined data type's designer could choose to do something + different. + </para> + </footnote> + </para> + + <para> + The <literal>NULLS FIRST</literal> and <literal>NULLS LAST</literal> options can be + used to determine whether nulls appear before or after non-null values + in the sort ordering. By default, null values sort as if larger than any + non-null value; that is, <literal>NULLS FIRST</literal> is the default for + <literal>DESC</literal> order, and <literal>NULLS LAST</literal> otherwise. + </para> + + <para> + Note that the ordering options are considered independently for each + sort column. For example <literal>ORDER BY x, y DESC</literal> means + <literal>ORDER BY x ASC, y DESC</literal>, which is not the same as + <literal>ORDER BY x DESC, y DESC</literal>. + </para> + + <para> + A <replaceable>sort_expression</replaceable> can also be the column label or number + of an output column, as in: +<programlisting> +SELECT a + b AS sum, c FROM table1 ORDER BY sum; +SELECT a, max(b) FROM table1 GROUP BY a ORDER BY 1; +</programlisting> + both of which sort by the first output column. Note that an output + column name has to stand alone, that is, it cannot be used in an expression + — for example, this is <emphasis>not</emphasis> correct: +<programlisting> +SELECT a + b AS sum, c FROM table1 ORDER BY sum + c; -- wrong +</programlisting> + This restriction is made to reduce ambiguity. There is still + ambiguity if an <literal>ORDER BY</literal> item is a simple name that + could match either an output column name or a column from the table + expression. The output column is used in such cases. This would + only cause confusion if you use <literal>AS</literal> to rename an output + column to match some other table column's name. + </para> + + <para> + <literal>ORDER BY</literal> can be applied to the result of a + <literal>UNION</literal>, <literal>INTERSECT</literal>, or <literal>EXCEPT</literal> + combination, but in this case it is only permitted to sort by + output column names or numbers, not by expressions. + </para> + </sect1> + + + <sect1 id="queries-limit"> + <title><literal>LIMIT</literal> and <literal>OFFSET</literal></title> + + <indexterm zone="queries-limit"> + <primary>LIMIT</primary> + </indexterm> + + <indexterm zone="queries-limit"> + <primary>OFFSET</primary> + </indexterm> + + <para> + <literal>LIMIT</literal> and <literal>OFFSET</literal> allow you to retrieve just + a portion of the rows that are generated by the rest of the query: +<synopsis> +SELECT <replaceable>select_list</replaceable> + FROM <replaceable>table_expression</replaceable> + <optional> ORDER BY ... </optional> + <optional> LIMIT { <replaceable>number</replaceable> | ALL } </optional> <optional> OFFSET <replaceable>number</replaceable> </optional> +</synopsis> + </para> + + <para> + If a limit count is given, no more than that many rows will be + returned (but possibly fewer, if the query itself yields fewer rows). + <literal>LIMIT ALL</literal> is the same as omitting the <literal>LIMIT</literal> + clause, as is <literal>LIMIT</literal> with a NULL argument. + </para> + + <para> + <literal>OFFSET</literal> says to skip that many rows before beginning to + return rows. <literal>OFFSET 0</literal> is the same as omitting the + <literal>OFFSET</literal> clause, as is <literal>OFFSET</literal> with a NULL argument. + </para> + + <para> + If both <literal>OFFSET</literal> + and <literal>LIMIT</literal> appear, then <literal>OFFSET</literal> rows are + skipped before starting to count the <literal>LIMIT</literal> rows that + are returned. + </para> + + <para> + When using <literal>LIMIT</literal>, it is important to use an + <literal>ORDER BY</literal> clause that constrains the result rows into a + unique order. Otherwise you will get an unpredictable subset of + the query's rows. You might be asking for the tenth through + twentieth rows, but tenth through twentieth in what ordering? The + ordering is unknown, unless you specified <literal>ORDER BY</literal>. + </para> + + <para> + The query optimizer takes <literal>LIMIT</literal> into account when + generating query plans, so you are very likely to get different + plans (yielding different row orders) depending on what you give + for <literal>LIMIT</literal> and <literal>OFFSET</literal>. Thus, using + different <literal>LIMIT</literal>/<literal>OFFSET</literal> values to select + different subsets of a query result <emphasis>will give + inconsistent results</emphasis> unless you enforce a predictable + result ordering with <literal>ORDER BY</literal>. This is not a bug; it + is an inherent consequence of the fact that SQL does not promise to + deliver the results of a query in any particular order unless + <literal>ORDER BY</literal> is used to constrain the order. + </para> + + <para> + The rows skipped by an <literal>OFFSET</literal> clause still have to be + computed inside the server; therefore a large <literal>OFFSET</literal> + might be inefficient. + </para> + </sect1> + + + <sect1 id="queries-values"> + <title><literal>VALUES</literal> Lists</title> + + <indexterm zone="queries-values"> + <primary>VALUES</primary> + </indexterm> + + <para> + <literal>VALUES</literal> provides a way to generate a <quote>constant table</quote> + that can be used in a query without having to actually create and populate + a table on-disk. The syntax is +<synopsis> +VALUES ( <replaceable class="parameter">expression</replaceable> [, ...] ) [, ...] +</synopsis> + Each parenthesized list of expressions generates a row in the table. + The lists must all have the same number of elements (i.e., the number + of columns in the table), and corresponding entries in each list must + have compatible data types. The actual data type assigned to each column + of the result is determined using the same rules as for <literal>UNION</literal> + (see <xref linkend="typeconv-union-case"/>). + </para> + + <para> + As an example: +<programlisting> +VALUES (1, 'one'), (2, 'two'), (3, 'three'); +</programlisting> + + will return a table of two columns and three rows. It's effectively + equivalent to: +<programlisting> +SELECT 1 AS column1, 'one' AS column2 +UNION ALL +SELECT 2, 'two' +UNION ALL +SELECT 3, 'three'; +</programlisting> + + By default, <productname>PostgreSQL</productname> assigns the names + <literal>column1</literal>, <literal>column2</literal>, etc. to the columns of a + <literal>VALUES</literal> table. The column names are not specified by the + SQL standard and different database systems do it differently, so + it's usually better to override the default names with a table alias + list, like this: +<programlisting> +=> SELECT * FROM (VALUES (1, 'one'), (2, 'two'), (3, 'three')) AS t (num,letter); + num | letter +-----+-------- + 1 | one + 2 | two + 3 | three +(3 rows) +</programlisting> + </para> + + <para> + Syntactically, <literal>VALUES</literal> followed by expression lists is + treated as equivalent to: +<synopsis> +SELECT <replaceable>select_list</replaceable> FROM <replaceable>table_expression</replaceable> +</synopsis> + and can appear anywhere a <literal>SELECT</literal> can. For example, you can + use it as part of a <literal>UNION</literal>, or attach a + <replaceable>sort_specification</replaceable> (<literal>ORDER BY</literal>, + <literal>LIMIT</literal>, and/or <literal>OFFSET</literal>) to it. <literal>VALUES</literal> + is most commonly used as the data source in an <command>INSERT</command> command, + and next most commonly as a subquery. + </para> + + <para> + For more information see <xref linkend="sql-values"/>. + </para> + + </sect1> + + + <sect1 id="queries-with"> + <title><literal>WITH</literal> Queries (Common Table Expressions)</title> + + <indexterm zone="queries-with"> + <primary>WITH</primary> + <secondary>in SELECT</secondary> + </indexterm> + + <indexterm> + <primary>common table expression</primary> + <see>WITH</see> + </indexterm> + + <para> + <literal>WITH</literal> provides a way to write auxiliary statements for use in a + larger query. These statements, which are often referred to as Common + Table Expressions or <acronym>CTE</acronym>s, can be thought of as defining + temporary tables that exist just for one query. Each auxiliary statement + in a <literal>WITH</literal> clause can be a <command>SELECT</command>, + <command>INSERT</command>, <command>UPDATE</command>, or <command>DELETE</command>; and the + <literal>WITH</literal> clause itself is attached to a primary statement that can + be a <command>SELECT</command>, <command>INSERT</command>, <command>UPDATE</command>, + <command>DELETE</command>, or <command>MERGE</command>. + </para> + + <sect2 id="queries-with-select"> + <title><command>SELECT</command> in <literal>WITH</literal></title> + + <para> + The basic value of <command>SELECT</command> in <literal>WITH</literal> is to + break down complicated queries into simpler parts. An example is: + +<programlisting> +WITH regional_sales AS ( + SELECT region, SUM(amount) AS total_sales + FROM orders + GROUP BY region +), top_regions AS ( + SELECT region + FROM regional_sales + WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales) +) +SELECT region, + product, + SUM(quantity) AS product_units, + SUM(amount) AS product_sales +FROM orders +WHERE region IN (SELECT region FROM top_regions) +GROUP BY region, product; +</programlisting> + + which displays per-product sales totals in only the top sales regions. + The <literal>WITH</literal> clause defines two auxiliary statements named + <structname>regional_sales</structname> and <structname>top_regions</structname>, + where the output of <structname>regional_sales</structname> is used in + <structname>top_regions</structname> and the output of <structname>top_regions</structname> + is used in the primary <command>SELECT</command> query. + This example could have been written without <literal>WITH</literal>, + but we'd have needed two levels of nested sub-<command>SELECT</command>s. It's a bit + easier to follow this way. + </para> + </sect2> + + <sect2 id="queries-with-recursive"> + <title>Recursive Queries</title> + + <para> + <indexterm> + <primary>RECURSIVE</primary> + <secondary>in common table expressions</secondary> + </indexterm> + The optional <literal>RECURSIVE</literal> modifier changes <literal>WITH</literal> + from a mere syntactic convenience into a feature that accomplishes + things not otherwise possible in standard SQL. Using + <literal>RECURSIVE</literal>, a <literal>WITH</literal> query can refer to its own + output. A very simple example is this query to sum the integers from 1 + through 100: + +<programlisting> +WITH RECURSIVE t(n) AS ( + VALUES (1) + UNION ALL + SELECT n+1 FROM t WHERE n < 100 +) +SELECT sum(n) FROM t; +</programlisting> + + The general form of a recursive <literal>WITH</literal> query is always a + <firstterm>non-recursive term</firstterm>, then <literal>UNION</literal> (or + <literal>UNION ALL</literal>), then a + <firstterm>recursive term</firstterm>, where only the recursive term can contain + a reference to the query's own output. Such a query is executed as + follows: + </para> + + <procedure> + <title>Recursive Query Evaluation</title> + + <step performance="required"> + <para> + Evaluate the non-recursive term. For <literal>UNION</literal> (but not + <literal>UNION ALL</literal>), discard duplicate rows. Include all remaining + rows in the result of the recursive query, and also place them in a + temporary <firstterm>working table</firstterm>. + </para> + </step> + + <step performance="required"> + <para> + So long as the working table is not empty, repeat these steps: + </para> + <substeps> + <step performance="required"> + <para> + Evaluate the recursive term, substituting the current contents of + the working table for the recursive self-reference. + For <literal>UNION</literal> (but not <literal>UNION ALL</literal>), discard + duplicate rows and rows that duplicate any previous result row. + Include all remaining rows in the result of the recursive query, and + also place them in a temporary <firstterm>intermediate table</firstterm>. + </para> + </step> + + <step performance="required"> + <para> + Replace the contents of the working table with the contents of the + intermediate table, then empty the intermediate table. + </para> + </step> + </substeps> + </step> + </procedure> + + <note> + <para> + While <literal>RECURSIVE</literal> allows queries to be specified + recursively, internally such queries are evaluated iteratively. + </para> + </note> + + <para> + In the example above, the working table has just a single row in each step, + and it takes on the values from 1 through 100 in successive steps. In + the 100th step, there is no output because of the <literal>WHERE</literal> + clause, and so the query terminates. + </para> + + <para> + Recursive queries are typically used to deal with hierarchical or + tree-structured data. A useful example is this query to find all the + direct and indirect sub-parts of a product, given only a table that + shows immediate inclusions: + +<programlisting> +WITH RECURSIVE included_parts(sub_part, part, quantity) AS ( + SELECT sub_part, part, quantity FROM parts WHERE part = 'our_product' + UNION ALL + SELECT p.sub_part, p.part, p.quantity * pr.quantity + FROM included_parts pr, parts p + WHERE p.part = pr.sub_part +) +SELECT sub_part, SUM(quantity) as total_quantity +FROM included_parts +GROUP BY sub_part +</programlisting> + </para> + + <sect3 id="queries-with-search"> + <title>Search Order</title> + + <para> + When computing a tree traversal using a recursive query, you might want to + order the results in either depth-first or breadth-first order. This can + be done by computing an ordering column alongside the other data columns + and using that to sort the results at the end. Note that this does not + actually control in which order the query evaluation visits the rows; that + is as always in SQL implementation-dependent. This approach merely + provides a convenient way to order the results afterwards. + </para> + + <para> + To create a depth-first order, we compute for each result row an array of + rows that we have visited so far. For example, consider the following + query that searches a table <structname>tree</structname> using a + <structfield>link</structfield> field: + +<programlisting> +WITH RECURSIVE search_tree(id, link, data) AS ( + SELECT t.id, t.link, t.data + FROM tree t + UNION ALL + SELECT t.id, t.link, t.data + FROM tree t, search_tree st + WHERE t.id = st.link +) +SELECT * FROM search_tree; +</programlisting> + + To add depth-first ordering information, you can write this: + +<programlisting> +WITH RECURSIVE search_tree(id, link, data, <emphasis>path</emphasis>) AS ( + SELECT t.id, t.link, t.data, <emphasis>ARRAY[t.id]</emphasis> + FROM tree t + UNION ALL + SELECT t.id, t.link, t.data, <emphasis>path || t.id</emphasis> + FROM tree t, search_tree st + WHERE t.id = st.link +) +SELECT * FROM search_tree <emphasis>ORDER BY path</emphasis>; +</programlisting> + </para> + + <para> + In the general case where more than one field needs to be used to identify + a row, use an array of rows. For example, if we needed to track fields + <structfield>f1</structfield> and <structfield>f2</structfield>: + +<programlisting> +WITH RECURSIVE search_tree(id, link, data, <emphasis>path</emphasis>) AS ( + SELECT t.id, t.link, t.data, <emphasis>ARRAY[ROW(t.f1, t.f2)]</emphasis> + FROM tree t + UNION ALL + SELECT t.id, t.link, t.data, <emphasis>path || ROW(t.f1, t.f2)</emphasis> + FROM tree t, search_tree st + WHERE t.id = st.link +) +SELECT * FROM search_tree <emphasis>ORDER BY path</emphasis>; +</programlisting> + </para> + + <tip> + <para> + Omit the <literal>ROW()</literal> syntax in the common case where only one + field needs to be tracked. This allows a simple array rather than a + composite-type array to be used, gaining efficiency. + </para> + </tip> + + <para> + To create a breadth-first order, you can add a column that tracks the depth + of the search, for example: + +<programlisting> +WITH RECURSIVE search_tree(id, link, data, <emphasis>depth</emphasis>) AS ( + SELECT t.id, t.link, t.data, <emphasis>0</emphasis> + FROM tree t + UNION ALL + SELECT t.id, t.link, t.data, <emphasis>depth + 1</emphasis> + FROM tree t, search_tree st + WHERE t.id = st.link +) +SELECT * FROM search_tree <emphasis>ORDER BY depth</emphasis>; +</programlisting> + + To get a stable sort, add data columns as secondary sorting columns. + </para> + + <tip> + <para> + The recursive query evaluation algorithm produces its output in + breadth-first search order. However, this is an implementation detail and + it is perhaps unsound to rely on it. The order of the rows within each + level is certainly undefined, so some explicit ordering might be desired + in any case. + </para> + </tip> + + <para> + There is built-in syntax to compute a depth- or breadth-first sort column. + For example: + +<programlisting> +WITH RECURSIVE search_tree(id, link, data) AS ( + SELECT t.id, t.link, t.data + FROM tree t + UNION ALL + SELECT t.id, t.link, t.data + FROM tree t, search_tree st + WHERE t.id = st.link +) <emphasis>SEARCH DEPTH FIRST BY id SET ordercol</emphasis> +SELECT * FROM search_tree ORDER BY ordercol; + +WITH RECURSIVE search_tree(id, link, data) AS ( + SELECT t.id, t.link, t.data + FROM tree t + UNION ALL + SELECT t.id, t.link, t.data + FROM tree t, search_tree st + WHERE t.id = st.link +) <emphasis>SEARCH BREADTH FIRST BY id SET ordercol</emphasis> +SELECT * FROM search_tree ORDER BY ordercol; +</programlisting> + This syntax is internally expanded to something similar to the above + hand-written forms. The <literal>SEARCH</literal> clause specifies whether + depth- or breadth first search is wanted, the list of columns to track for + sorting, and a column name that will contain the result data that can be + used for sorting. That column will implicitly be added to the output rows + of the CTE. + </para> + </sect3> + + <sect3 id="queries-with-cycle"> + <title>Cycle Detection</title> + + <para> + When working with recursive queries it is important to be sure that + the recursive part of the query will eventually return no tuples, + or else the query will loop indefinitely. Sometimes, using + <literal>UNION</literal> instead of <literal>UNION ALL</literal> can accomplish this + by discarding rows that duplicate previous output rows. However, often a + cycle does not involve output rows that are completely duplicate: it may be + necessary to check just one or a few fields to see if the same point has + been reached before. The standard method for handling such situations is + to compute an array of the already-visited values. For example, consider again + the following query that searches a table <structname>graph</structname> using a + <structfield>link</structfield> field: + +<programlisting> +WITH RECURSIVE search_graph(id, link, data, depth) AS ( + SELECT g.id, g.link, g.data, 0 + FROM graph g + UNION ALL + SELECT g.id, g.link, g.data, sg.depth + 1 + FROM graph g, search_graph sg + WHERE g.id = sg.link +) +SELECT * FROM search_graph; +</programlisting> + + This query will loop if the <structfield>link</structfield> relationships contain + cycles. Because we require a <quote>depth</quote> output, just changing + <literal>UNION ALL</literal> to <literal>UNION</literal> would not eliminate the looping. + Instead we need to recognize whether we have reached the same row again + while following a particular path of links. We add two columns + <structfield>is_cycle</structfield> and <structfield>path</structfield> to the loop-prone query: + +<programlisting> +WITH RECURSIVE search_graph(id, link, data, depth, <emphasis>is_cycle, path</emphasis>) AS ( + SELECT g.id, g.link, g.data, 0, + <emphasis>false, + ARRAY[g.id]</emphasis> + FROM graph g + UNION ALL + SELECT g.id, g.link, g.data, sg.depth + 1, + <emphasis>g.id = ANY(path), + path || g.id</emphasis> + FROM graph g, search_graph sg + WHERE g.id = sg.link <emphasis>AND NOT is_cycle</emphasis> +) +SELECT * FROM search_graph; +</programlisting> + + Aside from preventing cycles, the array value is often useful in its own + right as representing the <quote>path</quote> taken to reach any particular row. + </para> + + <para> + In the general case where more than one field needs to be checked to + recognize a cycle, use an array of rows. For example, if we needed to + compare fields <structfield>f1</structfield> and <structfield>f2</structfield>: + +<programlisting> +WITH RECURSIVE search_graph(id, link, data, depth, <emphasis>is_cycle, path</emphasis>) AS ( + SELECT g.id, g.link, g.data, 0, + <emphasis>false, + ARRAY[ROW(g.f1, g.f2)]</emphasis> + FROM graph g + UNION ALL + SELECT g.id, g.link, g.data, sg.depth + 1, + <emphasis>ROW(g.f1, g.f2) = ANY(path), + path || ROW(g.f1, g.f2)</emphasis> + FROM graph g, search_graph sg + WHERE g.id = sg.link <emphasis>AND NOT is_cycle</emphasis> +) +SELECT * FROM search_graph; +</programlisting> + </para> + + <tip> + <para> + Omit the <literal>ROW()</literal> syntax in the common case where only one field + needs to be checked to recognize a cycle. This allows a simple array + rather than a composite-type array to be used, gaining efficiency. + </para> + </tip> + + <para> + There is built-in syntax to simplify cycle detection. The above query can + also be written like this: +<programlisting> +WITH RECURSIVE search_graph(id, link, data, depth) AS ( + SELECT g.id, g.link, g.data, 1 + FROM graph g + UNION ALL + SELECT g.id, g.link, g.data, sg.depth + 1 + FROM graph g, search_graph sg + WHERE g.id = sg.link +) <emphasis>CYCLE id SET is_cycle USING path</emphasis> +SELECT * FROM search_graph; +</programlisting> + and it will be internally rewritten to the above form. The + <literal>CYCLE</literal> clause specifies first the list of columns to + track for cycle detection, then a column name that will show whether a + cycle has been detected, and finally the name of another column that will track the + path. The cycle and path columns will implicitly be added to the output + rows of the CTE. + </para> + + <tip> + <para> + The cycle path column is computed in the same way as the depth-first + ordering column show in the previous section. A query can have both a + <literal>SEARCH</literal> and a <literal>CYCLE</literal> clause, but a + depth-first search specification and a cycle detection specification would + create redundant computations, so it's more efficient to just use the + <literal>CYCLE</literal> clause and order by the path column. If + breadth-first ordering is wanted, then specifying both + <literal>SEARCH</literal> and <literal>CYCLE</literal> can be useful. + </para> + </tip> + + <para> + A helpful trick for testing queries + when you are not certain if they might loop is to place a <literal>LIMIT</literal> + in the parent query. For example, this query would loop forever without + the <literal>LIMIT</literal>: + +<programlisting> +WITH RECURSIVE t(n) AS ( + SELECT 1 + UNION ALL + SELECT n+1 FROM t +) +SELECT n FROM t <emphasis>LIMIT 100</emphasis>; +</programlisting> + + This works because <productname>PostgreSQL</productname>'s implementation + evaluates only as many rows of a <literal>WITH</literal> query as are actually + fetched by the parent query. Using this trick in production is not + recommended, because other systems might work differently. Also, it + usually won't work if you make the outer query sort the recursive query's + results or join them to some other table, because in such cases the + outer query will usually try to fetch all of the <literal>WITH</literal> query's + output anyway. + </para> + </sect3> + </sect2> + + <sect2 id="queries-with-cte-materialization"> + <title>Common Table Expression Materialization</title> + + <para> + A useful property of <literal>WITH</literal> queries is that they are + normally evaluated only once per execution of the parent query, even if + they are referred to more than once by the parent query or + sibling <literal>WITH</literal> queries. + Thus, expensive calculations that are needed in multiple places can be + placed within a <literal>WITH</literal> query to avoid redundant work. Another + possible application is to prevent unwanted multiple evaluations of + functions with side-effects. + However, the other side of this coin is that the optimizer is not able to + push restrictions from the parent query down into a multiply-referenced + <literal>WITH</literal> query, since that might affect all uses of the + <literal>WITH</literal> query's output when it should affect only one. + The multiply-referenced <literal>WITH</literal> query will be + evaluated as written, without suppression of rows that the parent query + might discard afterwards. (But, as mentioned above, evaluation might stop + early if the reference(s) to the query demand only a limited number of + rows.) + </para> + + <para> + However, if a <literal>WITH</literal> query is non-recursive and + side-effect-free (that is, it is a <literal>SELECT</literal> containing + no volatile functions) then it can be folded into the parent query, + allowing joint optimization of the two query levels. By default, this + happens if the parent query references the <literal>WITH</literal> query + just once, but not if it references the <literal>WITH</literal> query + more than once. You can override that decision by + specifying <literal>MATERIALIZED</literal> to force separate calculation + of the <literal>WITH</literal> query, or by specifying <literal>NOT + MATERIALIZED</literal> to force it to be merged into the parent query. + The latter choice risks duplicate computation of + the <literal>WITH</literal> query, but it can still give a net savings if + each usage of the <literal>WITH</literal> query needs only a small part + of the <literal>WITH</literal> query's full output. + </para> + + <para> + A simple example of these rules is +<programlisting> +WITH w AS ( + SELECT * FROM big_table +) +SELECT * FROM w WHERE key = 123; +</programlisting> + This <literal>WITH</literal> query will be folded, producing the same + execution plan as +<programlisting> +SELECT * FROM big_table WHERE key = 123; +</programlisting> + In particular, if there's an index on <structfield>key</structfield>, + it will probably be used to fetch just the rows having <literal>key = + 123</literal>. On the other hand, in +<programlisting> +WITH w AS ( + SELECT * FROM big_table +) +SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref +WHERE w2.key = 123; +</programlisting> + the <literal>WITH</literal> query will be materialized, producing a + temporary copy of <structname>big_table</structname> that is then + joined with itself — without benefit of any index. This query + will be executed much more efficiently if written as +<programlisting> +WITH w AS NOT MATERIALIZED ( + SELECT * FROM big_table +) +SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref +WHERE w2.key = 123; +</programlisting> + so that the parent query's restrictions can be applied directly + to scans of <structname>big_table</structname>. + </para> + + <para> + An example where <literal>NOT MATERIALIZED</literal> could be + undesirable is +<programlisting> +WITH w AS ( + SELECT key, very_expensive_function(val) as f FROM some_table +) +SELECT * FROM w AS w1 JOIN w AS w2 ON w1.f = w2.f; +</programlisting> + Here, materialization of the <literal>WITH</literal> query ensures + that <function>very_expensive_function</function> is evaluated only + once per table row, not twice. + </para> + + <para> + The examples above only show <literal>WITH</literal> being used with + <command>SELECT</command>, but it can be attached in the same way to + <command>INSERT</command>, <command>UPDATE</command>, + <command>DELETE</command>, or <command>MERGE</command>. + In each case it effectively provides temporary table(s) that can + be referred to in the main command. + </para> + </sect2> + + <sect2 id="queries-with-modifying"> + <title>Data-Modifying Statements in <literal>WITH</literal></title> + + <para> + You can use most data-modifying statements (<command>INSERT</command>, + <command>UPDATE</command>, or <command>DELETE</command>, but not + <command>MERGE</command>) in <literal>WITH</literal>. This + allows you to perform several different operations in the same query. + An example is: + +<programlisting> +WITH moved_rows AS ( + DELETE FROM products + WHERE + "date" >= '2010-10-01' AND + "date" < '2010-11-01' + RETURNING * +) +INSERT INTO products_log +SELECT * FROM moved_rows; +</programlisting> + + This query effectively moves rows from <structname>products</structname> to + <structname>products_log</structname>. The <command>DELETE</command> in <literal>WITH</literal> + deletes the specified rows from <structname>products</structname>, returning their + contents by means of its <literal>RETURNING</literal> clause; and then the + primary query reads that output and inserts it into + <structname>products_log</structname>. + </para> + + <para> + A fine point of the above example is that the <literal>WITH</literal> clause is + attached to the <command>INSERT</command>, not the sub-<command>SELECT</command> within + the <command>INSERT</command>. This is necessary because data-modifying + statements are only allowed in <literal>WITH</literal> clauses that are attached + to the top-level statement. However, normal <literal>WITH</literal> visibility + rules apply, so it is possible to refer to the <literal>WITH</literal> + statement's output from the sub-<command>SELECT</command>. + </para> + + <para> + Data-modifying statements in <literal>WITH</literal> usually have + <literal>RETURNING</literal> clauses (see <xref linkend="dml-returning"/>), + as shown in the example above. + It is the output of the <literal>RETURNING</literal> clause, <emphasis>not</emphasis> the + target table of the data-modifying statement, that forms the temporary + table that can be referred to by the rest of the query. If a + data-modifying statement in <literal>WITH</literal> lacks a <literal>RETURNING</literal> + clause, then it forms no temporary table and cannot be referred to in + the rest of the query. Such a statement will be executed nonetheless. + A not-particularly-useful example is: + +<programlisting> +WITH t AS ( + DELETE FROM foo +) +DELETE FROM bar; +</programlisting> + + This example would remove all rows from tables <structname>foo</structname> and + <structname>bar</structname>. The number of affected rows reported to the client + would only include rows removed from <structname>bar</structname>. + </para> + + <para> + Recursive self-references in data-modifying statements are not + allowed. In some cases it is possible to work around this limitation by + referring to the output of a recursive <literal>WITH</literal>, for example: + +<programlisting> +WITH RECURSIVE included_parts(sub_part, part) AS ( + SELECT sub_part, part FROM parts WHERE part = 'our_product' + UNION ALL + SELECT p.sub_part, p.part + FROM included_parts pr, parts p + WHERE p.part = pr.sub_part +) +DELETE FROM parts + WHERE part IN (SELECT part FROM included_parts); +</programlisting> + + This query would remove all direct and indirect subparts of a product. + </para> + + <para> + Data-modifying statements in <literal>WITH</literal> are executed exactly once, + and always to completion, independently of whether the primary query + reads all (or indeed any) of their output. Notice that this is different + from the rule for <command>SELECT</command> in <literal>WITH</literal>: as stated in the + previous section, execution of a <command>SELECT</command> is carried only as far + as the primary query demands its output. + </para> + + <para> + The sub-statements in <literal>WITH</literal> are executed concurrently with + each other and with the main query. Therefore, when using data-modifying + statements in <literal>WITH</literal>, the order in which the specified updates + actually happen is unpredictable. All the statements are executed with + the same <firstterm>snapshot</firstterm> (see <xref linkend="mvcc"/>), so they + cannot <quote>see</quote> one another's effects on the target tables. This + alleviates the effects of the unpredictability of the actual order of row + updates, and means that <literal>RETURNING</literal> data is the only way to + communicate changes between different <literal>WITH</literal> sub-statements and + the main query. An example of this is that in + +<programlisting> +WITH t AS ( + UPDATE products SET price = price * 1.05 + RETURNING * +) +SELECT * FROM products; +</programlisting> + + the outer <command>SELECT</command> would return the original prices before the + action of the <command>UPDATE</command>, while in + +<programlisting> +WITH t AS ( + UPDATE products SET price = price * 1.05 + RETURNING * +) +SELECT * FROM t; +</programlisting> + + the outer <command>SELECT</command> would return the updated data. + </para> + + <para> + Trying to update the same row twice in a single statement is not + supported. Only one of the modifications takes place, but it is not easy + (and sometimes not possible) to reliably predict which one. This also + applies to deleting a row that was already updated in the same statement: + only the update is performed. Therefore you should generally avoid trying + to modify a single row twice in a single statement. In particular avoid + writing <literal>WITH</literal> sub-statements that could affect the same rows + changed by the main statement or a sibling sub-statement. The effects + of such a statement will not be predictable. + </para> + + <para> + At present, any table used as the target of a data-modifying statement in + <literal>WITH</literal> must not have a conditional rule, nor an <literal>ALSO</literal> + rule, nor an <literal>INSTEAD</literal> rule that expands to multiple statements. + </para> + + </sect2> + + </sect1> + +</chapter> |