summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/xtypes.sgml
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src/sgml/xtypes.sgml')
-rw-r--r--doc/src/sgml/xtypes.sgml376
1 files changed, 376 insertions, 0 deletions
diff --git a/doc/src/sgml/xtypes.sgml b/doc/src/sgml/xtypes.sgml
new file mode 100644
index 0000000..e67e5bd
--- /dev/null
+++ b/doc/src/sgml/xtypes.sgml
@@ -0,0 +1,376 @@
+<!-- doc/src/sgml/xtypes.sgml -->
+
+ <sect1 id="xtypes">
+ <title>User-Defined Types</title>
+
+ <indexterm zone="xtypes">
+ <primary>data type</primary>
+ <secondary>user-defined</secondary>
+ </indexterm>
+
+ <para>
+ As described in <xref linkend="extend-type-system"/>,
+ <productname>PostgreSQL</productname> can be extended to support new
+ data types. This section describes how to define new base types,
+ which are data types defined below the level of the <acronym>SQL</acronym>
+ language. Creating a new base type requires implementing functions
+ to operate on the type in a low-level language, usually C.
+ </para>
+
+ <para>
+ The examples in this section can be found in
+ <filename>complex.sql</filename> and <filename>complex.c</filename>
+ in the <filename>src/tutorial</filename> directory of the source distribution.
+ See the <filename>README</filename> file in that directory for instructions
+ about running the examples.
+ </para>
+
+ <para>
+ <indexterm>
+ <primary>input function</primary>
+ </indexterm>
+ <indexterm>
+ <primary>output function</primary>
+ </indexterm>
+ A user-defined type must always have input and output functions.
+ These functions determine how the type appears in strings (for input
+ by the user and output to the user) and how the type is organized in
+ memory. The input function takes a null-terminated character string
+ as its argument and returns the internal (in memory) representation
+ of the type. The output function takes the internal representation
+ of the type as argument and returns a null-terminated character
+ string. If we want to do anything more with the type than merely
+ store it, we must provide additional functions to implement whatever
+ operations we'd like to have for the type.
+ </para>
+
+ <para>
+ Suppose we want to define a type <type>complex</type> that represents
+ complex numbers. A natural way to represent a complex number in
+ memory would be the following C structure:
+
+<programlisting>
+typedef struct Complex {
+ double x;
+ double y;
+} Complex;
+</programlisting>
+
+ We will need to make this a pass-by-reference type, since it's too
+ large to fit into a single <type>Datum</type> value.
+ </para>
+
+ <para>
+ As the external string representation of the type, we choose a
+ string of the form <literal>(x,y)</literal>.
+ </para>
+
+ <para>
+ The input and output functions are usually not hard to write,
+ especially the output function. But when defining the external
+ string representation of the type, remember that you must eventually
+ write a complete and robust parser for that representation as your
+ input function. For instance:
+
+<programlisting><![CDATA[
+PG_FUNCTION_INFO_V1(complex_in);
+
+Datum
+complex_in(PG_FUNCTION_ARGS)
+{
+ char *str = PG_GETARG_CSTRING(0);
+ double x,
+ y;
+ Complex *result;
+
+ if (sscanf(str, " ( %lf , %lf )", &x, &y) != 2)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for type %s: \"%s\"",
+ "complex", str)));
+
+ result = (Complex *) palloc(sizeof(Complex));
+ result->x = x;
+ result->y = y;
+ PG_RETURN_POINTER(result);
+}
+]]>
+</programlisting>
+
+ The output function can simply be:
+
+<programlisting><![CDATA[
+PG_FUNCTION_INFO_V1(complex_out);
+
+Datum
+complex_out(PG_FUNCTION_ARGS)
+{
+ Complex *complex = (Complex *) PG_GETARG_POINTER(0);
+ char *result;
+
+ result = psprintf("(%g,%g)", complex->x, complex->y);
+ PG_RETURN_CSTRING(result);
+}
+]]>
+</programlisting>
+ </para>
+
+ <para>
+ You should be careful to make the input and output functions inverses of
+ each other. If you do not, you will have severe problems when you
+ need to dump your data into a file and then read it back in. This
+ is a particularly common problem when floating-point numbers are
+ involved.
+ </para>
+
+ <para>
+ Optionally, a user-defined type can provide binary input and output
+ routines. Binary I/O is normally faster but less portable than textual
+ I/O. As with textual I/O, it is up to you to define exactly what the
+ external binary representation is. Most of the built-in data types
+ try to provide a machine-independent binary representation. For
+ <type>complex</type>, we will piggy-back on the binary I/O converters
+ for type <type>float8</type>:
+
+<programlisting><![CDATA[
+PG_FUNCTION_INFO_V1(complex_recv);
+
+Datum
+complex_recv(PG_FUNCTION_ARGS)
+{
+ StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
+ Complex *result;
+
+ result = (Complex *) palloc(sizeof(Complex));
+ result->x = pq_getmsgfloat8(buf);
+ result->y = pq_getmsgfloat8(buf);
+ PG_RETURN_POINTER(result);
+}
+
+PG_FUNCTION_INFO_V1(complex_send);
+
+Datum
+complex_send(PG_FUNCTION_ARGS)
+{
+ Complex *complex = (Complex *) PG_GETARG_POINTER(0);
+ StringInfoData buf;
+
+ pq_begintypsend(&buf);
+ pq_sendfloat8(&buf, complex->x);
+ pq_sendfloat8(&buf, complex->y);
+ PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+}
+]]>
+</programlisting>
+ </para>
+
+ <para>
+ Once we have written the I/O functions and compiled them into a shared
+ library, we can define the <type>complex</type> type in SQL.
+ First we declare it as a shell type:
+
+<programlisting>
+CREATE TYPE complex;
+</programlisting>
+
+ This serves as a placeholder that allows us to reference the type while
+ defining its I/O functions. Now we can define the I/O functions:
+
+<programlisting>
+CREATE FUNCTION complex_in(cstring)
+ RETURNS complex
+ AS '<replaceable>filename</replaceable>'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION complex_out(complex)
+ RETURNS cstring
+ AS '<replaceable>filename</replaceable>'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION complex_recv(internal)
+ RETURNS complex
+ AS '<replaceable>filename</replaceable>'
+ LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION complex_send(complex)
+ RETURNS bytea
+ AS '<replaceable>filename</replaceable>'
+ LANGUAGE C IMMUTABLE STRICT;
+</programlisting>
+ </para>
+
+ <para>
+ Finally, we can provide the full definition of the data type:
+<programlisting>
+CREATE TYPE complex (
+ internallength = 16,
+ input = complex_in,
+ output = complex_out,
+ receive = complex_recv,
+ send = complex_send,
+ alignment = double
+);
+</programlisting>
+ </para>
+
+ <para>
+ <indexterm>
+ <primary>array</primary>
+ <secondary>of user-defined type</secondary>
+ </indexterm>
+ When you define a new base type,
+ <productname>PostgreSQL</productname> automatically provides support
+ for arrays of that type. The array type typically
+ has the same name as the base type with the underscore character
+ (<literal>_</literal>) prepended.
+ </para>
+
+ <para>
+ Once the data type exists, we can declare additional functions to
+ provide useful operations on the data type. Operators can then be
+ defined atop the functions, and if needed, operator classes can be
+ created to support indexing of the data type. These additional
+ layers are discussed in following sections.
+ </para>
+
+ <para>
+ If the internal representation of the data type is variable-length, the
+ internal representation must follow the standard layout for variable-length
+ data: the first four bytes must be a <type>char[4]</type> field which is
+ never accessed directly (customarily named <structfield>vl_len_</structfield>). You
+ must use the <function>SET_VARSIZE()</function> macro to store the total
+ size of the datum (including the length field itself) in this field
+ and <function>VARSIZE()</function> to retrieve it. (These macros exist
+ because the length field may be encoded depending on platform.)
+ </para>
+
+ <para>
+ For further details see the description of the
+ <xref linkend="sql-createtype"/> command.
+ </para>
+
+ <sect2 id="xtypes-toast">
+ <title>TOAST Considerations</title>
+ <indexterm>
+ <primary>TOAST</primary>
+ <secondary>and user-defined types</secondary>
+ </indexterm>
+
+ <para>
+ If the values of your data type vary in size (in internal form), it's
+ usually desirable to make the data type <acronym>TOAST</acronym>-able (see <xref
+ linkend="storage-toast"/>). You should do this even if the values are always
+ too small to be compressed or stored externally, because
+ <acronym>TOAST</acronym> can save space on small data too, by reducing header
+ overhead.
+ </para>
+
+ <para>
+ To support <acronym>TOAST</acronym> storage, the C functions operating on the data
+ type must always be careful to unpack any toasted values they are handed
+ by using <function>PG_DETOAST_DATUM</function>. (This detail is customarily hidden
+ by defining type-specific <function>GETARG_DATATYPE_P</function> macros.)
+ Then, when running the <command>CREATE TYPE</command> command, specify the
+ internal length as <literal>variable</literal> and select some appropriate storage
+ option other than <literal>plain</literal>.
+ </para>
+
+ <para>
+ If data alignment is unimportant (either just for a specific function or
+ because the data type specifies byte alignment anyway) then it's possible
+ to avoid some of the overhead of <function>PG_DETOAST_DATUM</function>. You can use
+ <function>PG_DETOAST_DATUM_PACKED</function> instead (customarily hidden by
+ defining a <function>GETARG_DATATYPE_PP</function> macro) and using the macros
+ <function>VARSIZE_ANY_EXHDR</function> and <function>VARDATA_ANY</function> to access
+ a potentially-packed datum.
+ Again, the data returned by these macros is not aligned even if the data
+ type definition specifies an alignment. If the alignment is important you
+ must go through the regular <function>PG_DETOAST_DATUM</function> interface.
+ </para>
+
+ <note>
+ <para>
+ Older code frequently declares <structfield>vl_len_</structfield> as an
+ <type>int32</type> field instead of <type>char[4]</type>. This is OK as long as
+ the struct definition has other fields that have at least <type>int32</type>
+ alignment. But it is dangerous to use such a struct definition when
+ working with a potentially unaligned datum; the compiler may take it as
+ license to assume the datum actually is aligned, leading to core dumps on
+ architectures that are strict about alignment.
+ </para>
+ </note>
+
+ <para>
+ Another feature that's enabled by <acronym>TOAST</acronym> support is the
+ possibility of having an <firstterm>expanded</firstterm> in-memory data
+ representation that is more convenient to work with than the format that
+ is stored on disk. The regular or <quote>flat</quote> varlena storage format
+ is ultimately just a blob of bytes; it cannot for example contain
+ pointers, since it may get copied to other locations in memory.
+ For complex data types, the flat format may be quite expensive to work
+ with, so <productname>PostgreSQL</productname> provides a way to <quote>expand</quote>
+ the flat format into a representation that is more suited to computation,
+ and then pass that format in-memory between functions of the data type.
+ </para>
+
+ <para>
+ To use expanded storage, a data type must define an expanded format that
+ follows the rules given in <filename>src/include/utils/expandeddatum.h</filename>,
+ and provide functions to <quote>expand</quote> a flat varlena value into
+ expanded format and <quote>flatten</quote> the expanded format back to the
+ regular varlena representation. Then ensure that all C functions for
+ the data type can accept either representation, possibly by converting
+ one into the other immediately upon receipt. This does not require fixing
+ all existing functions for the data type at once, because the standard
+ <function>PG_DETOAST_DATUM</function> macro is defined to convert expanded inputs
+ into regular flat format. Therefore, existing functions that work with
+ the flat varlena format will continue to work, though slightly
+ inefficiently, with expanded inputs; they need not be converted until and
+ unless better performance is important.
+ </para>
+
+ <para>
+ C functions that know how to work with an expanded representation
+ typically fall into two categories: those that can only handle expanded
+ format, and those that can handle either expanded or flat varlena inputs.
+ The former are easier to write but may be less efficient overall, because
+ converting a flat input to expanded form for use by a single function may
+ cost more than is saved by operating on the expanded format.
+ When only expanded format need be handled, conversion of flat inputs to
+ expanded form can be hidden inside an argument-fetching macro, so that
+ the function appears no more complex than one working with traditional
+ varlena input.
+ To handle both types of input, write an argument-fetching function that
+ will detoast external, short-header, and compressed varlena inputs, but
+ not expanded inputs. Such a function can be defined as returning a
+ pointer to a union of the flat varlena format and the expanded format.
+ Callers can use the <function>VARATT_IS_EXPANDED_HEADER()</function> macro to
+ determine which format they received.
+ </para>
+
+ <para>
+ The <acronym>TOAST</acronym> infrastructure not only allows regular varlena
+ values to be distinguished from expanded values, but also
+ distinguishes <quote>read-write</quote> and <quote>read-only</quote> pointers to
+ expanded values. C functions that only need to examine an expanded
+ value, or will only change it in safe and non-semantically-visible ways,
+ need not care which type of pointer they receive. C functions that
+ produce a modified version of an input value are allowed to modify an
+ expanded input value in-place if they receive a read-write pointer, but
+ must not modify the input if they receive a read-only pointer; in that
+ case they have to copy the value first, producing a new value to modify.
+ A C function that has constructed a new expanded value should always
+ return a read-write pointer to it. Also, a C function that is modifying
+ a read-write expanded value in-place should take care to leave the value
+ in a sane state if it fails partway through.
+ </para>
+
+ <para>
+ For examples of working with expanded values, see the standard array
+ infrastructure, particularly
+ <filename>src/backend/utils/adt/array_expanded.c</filename>.
+ </para>
+
+ </sect2>
+
+</sect1>