diff options
Diffstat (limited to 'doc/src/sgml/xoper.sgml')
-rw-r--r-- | doc/src/sgml/xoper.sgml | 486 |
1 files changed, 486 insertions, 0 deletions
diff --git a/doc/src/sgml/xoper.sgml b/doc/src/sgml/xoper.sgml new file mode 100644 index 0000000..56b0849 --- /dev/null +++ b/doc/src/sgml/xoper.sgml @@ -0,0 +1,486 @@ +<!-- doc/src/sgml/xoper.sgml --> + + <sect1 id="xoper"> + <title>User-Defined Operators</title> + + <indexterm zone="xoper"> + <primary>operator</primary> + <secondary>user-defined</secondary> + </indexterm> + + <para> + Every operator is <quote>syntactic sugar</quote> for a call to an + underlying function that does the real work; so you must + first create the underlying function before you can create + the operator. However, an operator is <emphasis>not merely</emphasis> + syntactic sugar, because it carries additional information + that helps the query planner optimize queries that use the + operator. The next section will be devoted to explaining + that additional information. + </para> + + <para> + <productname>PostgreSQL</productname> supports left unary, right + unary, and binary operators. Operators can be + overloaded;<indexterm><primary>overloading</primary><secondary>operators</secondary></indexterm> + that is, the same operator name can be used for different operators + that have different numbers and types of operands. When a query is + executed, the system determines the operator to call from the + number and types of the provided operands. + </para> + + <para> + Here is an example of creating an operator for adding two complex + numbers. We assume we've already created the definition of type + <type>complex</type> (see <xref linkend="xtypes"/>). First we need a + function that does the work, then we can define the operator: + +<programlisting> +CREATE FUNCTION complex_add(complex, complex) + RETURNS complex + AS '<replaceable>filename</replaceable>', 'complex_add' + LANGUAGE C IMMUTABLE STRICT; + +CREATE OPERATOR + ( + leftarg = complex, + rightarg = complex, + function = complex_add, + commutator = + +); +</programlisting> + </para> + + <para> + Now we could execute a query like this: + +<screen> +SELECT (a + b) AS c FROM test_complex; + + c +----------------- + (5.2,6.05) + (133.42,144.95) +</screen> + </para> + + <para> + We've shown how to create a binary operator here. To create unary + operators, just omit one of <literal>leftarg</literal> (for left unary) or + <literal>rightarg</literal> (for right unary). The <literal>function</literal> + clause and the argument clauses are the only required items in + <command>CREATE OPERATOR</command>. The <literal>commutator</literal> + clause shown in the example is an optional hint to the query + optimizer. Further details about <literal>commutator</literal> and other + optimizer hints appear in the next section. + </para> + </sect1> + + <sect1 id="xoper-optimization"> + <title>Operator Optimization Information</title> + + <indexterm zone="xoper-optimization"> + <primary>optimization information</primary> + <secondary>for operators</secondary> + </indexterm> + + <para> + A <productname>PostgreSQL</productname> operator definition can include + several optional clauses that tell the system useful things about how + the operator behaves. These clauses should be provided whenever + appropriate, because they can make for considerable speedups in execution + of queries that use the operator. But if you provide them, you must be + sure that they are right! Incorrect use of an optimization clause can + result in slow queries, subtly wrong output, or other Bad Things. + You can always leave out an optimization clause if you are not sure + about it; the only consequence is that queries might run slower than + they need to. + </para> + + <para> + Additional optimization clauses might be added in future versions of + <productname>PostgreSQL</productname>. The ones described here are all + the ones that release &version; understands. + </para> + + <para> + It is also possible to attach a planner support function to the function + that underlies an operator, providing another way of telling the system + about the behavior of the operator. + See <xref linkend="xfunc-optimization"/> for more information. + </para> + + <sect2> + <title><literal>COMMUTATOR</literal></title> + + <para> + The <literal>COMMUTATOR</literal> clause, if provided, names an operator that is the + commutator of the operator being defined. We say that operator A is the + commutator of operator B if (x A y) equals (y B x) for all possible input + values x, y. Notice that B is also the commutator of A. For example, + operators <literal><</literal> and <literal>></literal> for a particular data type are usually each others' + commutators, and operator <literal>+</literal> is usually commutative with itself. + But operator <literal>-</literal> is usually not commutative with anything. + </para> + + <para> + The left operand type of a commutable operator is the same as the + right operand type of its commutator, and vice versa. So the name of + the commutator operator is all that <productname>PostgreSQL</productname> + needs to be given to look up the commutator, and that's all that needs to + be provided in the <literal>COMMUTATOR</literal> clause. + </para> + + <para> + It's critical to provide commutator information for operators that + will be used in indexes and join clauses, because this allows the + query optimizer to <quote>flip around</quote> such a clause to the forms + needed for different plan types. For example, consider a query with + a WHERE clause like <literal>tab1.x = tab2.y</literal>, where <literal>tab1.x</literal> + and <literal>tab2.y</literal> are of a user-defined type, and suppose that + <literal>tab2.y</literal> is indexed. The optimizer cannot generate an + index scan unless it can determine how to flip the clause around to + <literal>tab2.y = tab1.x</literal>, because the index-scan machinery expects + to see the indexed column on the left of the operator it is given. + <productname>PostgreSQL</productname> will <emphasis>not</emphasis> simply + assume that this is a valid transformation — the creator of the + <literal>=</literal> operator must specify that it is valid, by marking the + operator with commutator information. + </para> + + <para> + When you are defining a self-commutative operator, you just do it. + When you are defining a pair of commutative operators, things are + a little trickier: how can the first one to be defined refer to the + other one, which you haven't defined yet? There are two solutions + to this problem: + + <itemizedlist> + <listitem> + <para> + One way is to omit the <literal>COMMUTATOR</literal> clause in the first operator that + you define, and then provide one in the second operator's definition. + Since <productname>PostgreSQL</productname> knows that commutative + operators come in pairs, when it sees the second definition it will + automatically go back and fill in the missing <literal>COMMUTATOR</literal> clause in + the first definition. + </para> + </listitem> + + <listitem> + <para> + The other, more straightforward way is just to include <literal>COMMUTATOR</literal> clauses + in both definitions. When <productname>PostgreSQL</productname> processes + the first definition and realizes that <literal>COMMUTATOR</literal> refers to a nonexistent + operator, the system will make a dummy entry for that operator in the + system catalog. This dummy entry will have valid data only + for the operator name, left and right operand types, and result type, + since that's all that <productname>PostgreSQL</productname> can deduce + at this point. The first operator's catalog entry will link to this + dummy entry. Later, when you define the second operator, the system + updates the dummy entry with the additional information from the second + definition. If you try to use the dummy operator before it's been filled + in, you'll just get an error message. + </para> + </listitem> + </itemizedlist> + </para> + </sect2> + + <sect2> + <title><literal>NEGATOR</literal></title> + + <para> + The <literal>NEGATOR</literal> clause, if provided, names an operator that is the + negator of the operator being defined. We say that operator A + is the negator of operator B if both return Boolean results and + (x A y) equals NOT (x B y) for all possible inputs x, y. + Notice that B is also the negator of A. + For example, <literal><</literal> and <literal>>=</literal> are a negator pair for most data types. + An operator can never validly be its own negator. + </para> + + <para> + Unlike commutators, a pair of unary operators could validly be marked + as each other's negators; that would mean (A x) equals NOT (B x) + for all x, or the equivalent for right unary operators. + </para> + + <para> + An operator's negator must have the same left and/or right operand types + as the operator to be defined, so just as with <literal>COMMUTATOR</literal>, only the operator + name need be given in the <literal>NEGATOR</literal> clause. + </para> + + <para> + Providing a negator is very helpful to the query optimizer since + it allows expressions like <literal>NOT (x = y)</literal> to be simplified into + <literal>x <> y</literal>. This comes up more often than you might think, because + <literal>NOT</literal> operations can be inserted as a consequence of other rearrangements. + </para> + + <para> + Pairs of negator operators can be defined using the same methods + explained above for commutator pairs. + </para> + + </sect2> + + <sect2> + <title><literal>RESTRICT</literal></title> + + <para> + The <literal>RESTRICT</literal> clause, if provided, names a restriction selectivity + estimation function for the operator. (Note that this is a function + name, not an operator name.) <literal>RESTRICT</literal> clauses only make sense for + binary operators that return <type>boolean</type>. The idea behind a restriction + selectivity estimator is to guess what fraction of the rows in a + table will satisfy a <literal>WHERE</literal>-clause condition of the form: +<programlisting> +column OP constant +</programlisting> + for the current operator and a particular constant value. + This assists the optimizer by + giving it some idea of how many rows will be eliminated by <literal>WHERE</literal> + clauses that have this form. (What happens if the constant is on + the left, you might be wondering? Well, that's one of the things that + <literal>COMMUTATOR</literal> is for...) + </para> + + <para> + Writing new restriction selectivity estimation functions is far beyond + the scope of this chapter, but fortunately you can usually just use + one of the system's standard estimators for many of your own operators. + These are the standard restriction estimators: + <simplelist> + <member><function>eqsel</function> for <literal>=</literal></member> + <member><function>neqsel</function> for <literal><></literal></member> + <member><function>scalarltsel</function> for <literal><</literal></member> + <member><function>scalarlesel</function> for <literal><=</literal></member> + <member><function>scalargtsel</function> for <literal>></literal></member> + <member><function>scalargesel</function> for <literal>>=</literal></member> + </simplelist> + </para> + + <para> + You can frequently get away with using either <function>eqsel</function> or <function>neqsel</function> for + operators that have very high or very low selectivity, even if they + aren't really equality or inequality. For example, the + approximate-equality geometric operators use <function>eqsel</function> on the assumption that + they'll usually only match a small fraction of the entries in a table. + </para> + + <para> + You can use <function>scalarltsel</function>, <function>scalarlesel</function>, + <function>scalargtsel</function> and <function>scalargesel</function> for comparisons on + data types that have some sensible means of being converted into numeric + scalars for range comparisons. If possible, add the data type to those + understood by the function <function>convert_to_scalar()</function> in + <filename>src/backend/utils/adt/selfuncs.c</filename>. + (Eventually, this function should be replaced by per-data-type functions + identified through a column of the <classname>pg_type</classname> system catalog; but that hasn't happened + yet.) If you do not do this, things will still work, but the optimizer's + estimates won't be as good as they could be. + </para> + + <para> + Another useful built-in selectivity estimation function + is <function>matchingsel</function>, which will work for almost any + binary operator, if standard MCV and/or histogram statistics are + collected for the input data type(s). Its default estimate is set to + twice the default estimate used in <function>eqsel</function>, making + it most suitable for comparison operators that are somewhat less + strict than equality. (Or you could call the + underlying <function>generic_restriction_selectivity</function> + function, providing a different default estimate.) + </para> + + <para> + There are additional selectivity estimation functions designed for geometric + operators in <filename>src/backend/utils/adt/geo_selfuncs.c</filename>: <function>areasel</function>, <function>positionsel</function>, + and <function>contsel</function>. At this writing these are just stubs, but you might want + to use them (or even better, improve them) anyway. + </para> + </sect2> + + <sect2> + <title><literal>JOIN</literal></title> + + <para> + The <literal>JOIN</literal> clause, if provided, names a join selectivity + estimation function for the operator. (Note that this is a function + name, not an operator name.) <literal>JOIN</literal> clauses only make sense for + binary operators that return <type>boolean</type>. The idea behind a join + selectivity estimator is to guess what fraction of the rows in a + pair of tables will satisfy a <literal>WHERE</literal>-clause condition of the form: +<programlisting> +table1.column1 OP table2.column2 +</programlisting> + for the current operator. As with the <literal>RESTRICT</literal> clause, this helps + the optimizer very substantially by letting it figure out which + of several possible join sequences is likely to take the least work. + </para> + + <para> + As before, this chapter will make no attempt to explain how to write + a join selectivity estimator function, but will just suggest that + you use one of the standard estimators if one is applicable: + <simplelist> + <member><function>eqjoinsel</function> for <literal>=</literal></member> + <member><function>neqjoinsel</function> for <literal><></literal></member> + <member><function>scalarltjoinsel</function> for <literal><</literal></member> + <member><function>scalarlejoinsel</function> for <literal><=</literal></member> + <member><function>scalargtjoinsel</function> for <literal>></literal></member> + <member><function>scalargejoinsel</function> for <literal>>=</literal></member> + <member><function>matchingjoinsel</function> for generic matching operators</member> + <member><function>areajoinsel</function> for 2D area-based comparisons</member> + <member><function>positionjoinsel</function> for 2D position-based comparisons</member> + <member><function>contjoinsel</function> for 2D containment-based comparisons</member> + </simplelist> + </para> + </sect2> + + <sect2> + <title><literal>HASHES</literal></title> + + <para> + The <literal>HASHES</literal> clause, if present, tells the system that + it is permissible to use the hash join method for a join based on this + operator. <literal>HASHES</literal> only makes sense for a binary operator that + returns <literal>boolean</literal>, and in practice the operator must represent + equality for some data type or pair of data types. + </para> + + <para> + The assumption underlying hash join is that the join operator can + only return true for pairs of left and right values that hash to the + same hash code. If two values get put in different hash buckets, the + join will never compare them at all, implicitly assuming that the + result of the join operator must be false. So it never makes sense + to specify <literal>HASHES</literal> for operators that do not represent + some form of equality. In most cases it is only practical to support + hashing for operators that take the same data type on both sides. + However, sometimes it is possible to design compatible hash functions + for two or more data types; that is, functions that will generate the + same hash codes for <quote>equal</quote> values, even though the values + have different representations. For example, it's fairly simple + to arrange this property when hashing integers of different widths. + </para> + + <para> + To be marked <literal>HASHES</literal>, the join operator must appear + in a hash index operator family. This is not enforced when you create + the operator, since of course the referencing operator family couldn't + exist yet. But attempts to use the operator in hash joins will fail + at run time if no such operator family exists. The system needs the + operator family to find the data-type-specific hash function(s) for the + operator's input data type(s). Of course, you must also create suitable + hash functions before you can create the operator family. + </para> + + <para> + Care should be exercised when preparing a hash function, because there + are machine-dependent ways in which it might fail to do the right thing. + For example, if your data type is a structure in which there might be + uninteresting pad bits, you cannot simply pass the whole structure to + <function>hash_any</function>. (Unless you write your other operators and + functions to ensure that the unused bits are always zero, which is the + recommended strategy.) + Another example is that on machines that meet the <acronym>IEEE</acronym> + floating-point standard, negative zero and positive zero are different + values (different bit patterns) but they are defined to compare equal. + If a float value might contain negative zero then extra steps are needed + to ensure it generates the same hash value as positive zero. + </para> + + <para> + A hash-joinable operator must have a commutator (itself if the two + operand data types are the same, or a related equality operator + if they are different) that appears in the same operator family. + If this is not the case, planner errors might occur when the operator + is used. Also, it is a good idea (but not strictly required) for + a hash operator family that supports multiple data types to provide + equality operators for every combination of the data types; this + allows better optimization. + </para> + + <note> + <para> + The function underlying a hash-joinable operator must be marked + immutable or stable. If it is volatile, the system will never + attempt to use the operator for a hash join. + </para> + </note> + + <note> + <para> + If a hash-joinable operator has an underlying function that is marked + strict, the + function must also be complete: that is, it should return true or + false, never null, for any two nonnull inputs. If this rule is + not followed, hash-optimization of <literal>IN</literal> operations might + generate wrong results. (Specifically, <literal>IN</literal> might return + false where the correct answer according to the standard would be null; + or it might yield an error complaining that it wasn't prepared for a + null result.) + </para> + </note> + + </sect2> + + <sect2> + <title><literal>MERGES</literal></title> + + <para> + The <literal>MERGES</literal> clause, if present, tells the system that + it is permissible to use the merge-join method for a join based on this + operator. <literal>MERGES</literal> only makes sense for a binary operator that + returns <literal>boolean</literal>, and in practice the operator must represent + equality for some data type or pair of data types. + </para> + + <para> + Merge join is based on the idea of sorting the left- and right-hand tables + into order and then scanning them in parallel. So, both data types must + be capable of being fully ordered, and the join operator must be one + that can only succeed for pairs of values that fall at the + <quote>same place</quote> + in the sort order. In practice this means that the join operator must + behave like equality. But it is possible to merge-join two + distinct data types so long as they are logically compatible. For + example, the <type>smallint</type>-versus-<type>integer</type> + equality operator is merge-joinable. + We only need sorting operators that will bring both data types into a + logically compatible sequence. + </para> + + <para> + To be marked <literal>MERGES</literal>, the join operator must appear + as an equality member of a <literal>btree</literal> index operator family. + This is not enforced when you create + the operator, since of course the referencing operator family couldn't + exist yet. But the operator will not actually be used for merge joins + unless a matching operator family can be found. The + <literal>MERGES</literal> flag thus acts as a hint to the planner that + it's worth looking for a matching operator family. + </para> + + <para> + A merge-joinable operator must have a commutator (itself if the two + operand data types are the same, or a related equality operator + if they are different) that appears in the same operator family. + If this is not the case, planner errors might occur when the operator + is used. Also, it is a good idea (but not strictly required) for + a <literal>btree</literal> operator family that supports multiple data types to provide + equality operators for every combination of the data types; this + allows better optimization. + </para> + + <note> + <para> + The function underlying a merge-joinable operator must be marked + immutable or stable. If it is volatile, the system will never + attempt to use the operator for a merge join. + </para> + </note> + </sect2> + </sect1> |