summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/citext.sgml
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 12:17:33 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 12:17:33 +0000
commit5e45211a64149b3c659b90ff2de6fa982a5a93ed (patch)
tree739caf8c461053357daa9f162bef34516c7bf452 /doc/src/sgml/citext.sgml
parentInitial commit. (diff)
downloadpostgresql-15-5e45211a64149b3c659b90ff2de6fa982a5a93ed.tar.xz
postgresql-15-5e45211a64149b3c659b90ff2de6fa982a5a93ed.zip
Adding upstream version 15.5.upstream/15.5
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/src/sgml/citext.sgml')
-rw-r--r--doc/src/sgml/citext.sgml293
1 files changed, 293 insertions, 0 deletions
diff --git a/doc/src/sgml/citext.sgml b/doc/src/sgml/citext.sgml
new file mode 100644
index 0000000..5986601
--- /dev/null
+++ b/doc/src/sgml/citext.sgml
@@ -0,0 +1,293 @@
+<!-- doc/src/sgml/citext.sgml -->
+
+<sect1 id="citext" xreflabel="citext">
+ <title>citext</title>
+
+ <indexterm zone="citext">
+ <primary>citext</primary>
+ </indexterm>
+
+ <para>
+ The <filename>citext</filename> module provides a case-insensitive
+ character string type, <type>citext</type>. Essentially, it internally calls
+ <function>lower</function> when comparing values. Otherwise, it behaves almost
+ exactly like <type>text</type>.
+ </para>
+
+ <tip>
+ <para>
+ Consider using <firstterm>nondeterministic collations</firstterm> (see
+ <xref linkend="collation-nondeterministic"/>) instead of this module. They
+ can be used for case-insensitive comparisons, accent-insensitive
+ comparisons, and other combinations, and they handle more Unicode special
+ cases correctly.
+ </para>
+ </tip>
+
+ <para>
+ This module is considered <quote>trusted</quote>, that is, it can be
+ installed by non-superusers who have <literal>CREATE</literal> privilege
+ on the current database.
+ </para>
+
+ <sect2>
+ <title>Rationale</title>
+
+ <para>
+ The standard approach to doing case-insensitive matches
+ in <productname>PostgreSQL</productname> has been to use the <function>lower</function>
+ function when comparing values, for example
+
+<programlisting>
+SELECT * FROM tab WHERE lower(col) = LOWER(?);
+</programlisting>
+ </para>
+
+ <para>
+ This works reasonably well, but has a number of drawbacks:
+ </para>
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ It makes your SQL statements verbose, and you always have to remember to
+ use <function>lower</function> on both the column and the query value.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It won't use an index, unless you create a functional index using
+ <function>lower</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ If you declare a column as <literal>UNIQUE</literal> or <literal>PRIMARY
+ KEY</literal>, the implicitly generated index is case-sensitive. So it's
+ useless for case-insensitive searches, and it won't enforce
+ uniqueness case-insensitively.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ The <type>citext</type> data type allows you to eliminate calls
+ to <function>lower</function> in SQL queries, and allows a primary key to
+ be case-insensitive. <type>citext</type> is locale-aware, just
+ like <type>text</type>, which means that the matching of upper case and
+ lower case characters is dependent on the rules of
+ the database's <literal>LC_CTYPE</literal> setting. Again, this behavior is
+ identical to the use of <function>lower</function> in queries. But because it's
+ done transparently by the data type, you don't have to remember to do
+ anything special in your queries.
+ </para>
+
+ </sect2>
+
+ <sect2>
+ <title>How to Use It</title>
+
+ <para>
+ Here's a simple example of usage:
+
+<programlisting>
+CREATE TABLE users (
+ nick CITEXT PRIMARY KEY,
+ pass TEXT NOT NULL
+);
+
+INSERT INTO users VALUES ( 'larry', sha256(random()::text::bytea) );
+INSERT INTO users VALUES ( 'Tom', sha256(random()::text::bytea) );
+INSERT INTO users VALUES ( 'Damian', sha256(random()::text::bytea) );
+INSERT INTO users VALUES ( 'NEAL', sha256(random()::text::bytea) );
+INSERT INTO users VALUES ( 'Bj&oslash;rn', sha256(random()::text::bytea) );
+
+SELECT * FROM users WHERE nick = 'Larry';
+</programlisting>
+
+ The <command>SELECT</command> statement will return one tuple, even though
+ the <structfield>nick</structfield> column was set to <literal>larry</literal> and the query
+ was for <literal>Larry</literal>.
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>String Comparison Behavior</title>
+
+ <para>
+ <type>citext</type> performs comparisons by converting each string to lower
+ case (as though <function>lower</function> were called) and then comparing the
+ results normally. Thus, for example, two strings are considered equal
+ if <function>lower</function> would produce identical results for them.
+ </para>
+
+ <para>
+ In order to emulate a case-insensitive collation as closely as possible,
+ there are <type>citext</type>-specific versions of a number of string-processing
+ operators and functions. So, for example, the regular expression
+ operators <literal>~</literal> and <literal>~*</literal> exhibit the same behavior when
+ applied to <type>citext</type>: they both match case-insensitively.
+ The same is true
+ for <literal>!~</literal> and <literal>!~*</literal>, as well as for the
+ <literal>LIKE</literal> operators <literal>~~</literal> and <literal>~~*</literal>, and
+ <literal>!~~</literal> and <literal>!~~*</literal>. If you'd like to match
+ case-sensitively, you can cast the operator's arguments to <type>text</type>.
+ </para>
+
+ <para>
+ Similarly, all of the following functions perform matching
+ case-insensitively if their arguments are <type>citext</type>:
+ </para>
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <function>regexp_match()</function>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <function>regexp_matches()</function>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <function>regexp_replace()</function>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <function>regexp_split_to_array()</function>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <function>regexp_split_to_table()</function>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <function>replace()</function>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <function>split_part()</function>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <function>strpos()</function>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <function>translate()</function>
+ </para>
+ </listitem>
+ </itemizedlist>
+
+ <para>
+ For the regexp functions, if you want to match case-sensitively, you can
+ specify the <quote>c</quote> flag to force a case-sensitive match. Otherwise,
+ you must cast to <type>text</type> before using one of these functions if
+ you want case-sensitive behavior.
+ </para>
+
+ </sect2>
+
+ <sect2>
+ <title>Limitations</title>
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ <type>citext</type>'s case-folding behavior depends on
+ the <literal>LC_CTYPE</literal> setting of your database. How it compares
+ values is therefore determined when the database is created.
+ It is not truly
+ case-insensitive in the terms defined by the Unicode standard.
+ Effectively, what this means is that, as long as you're happy with your
+ collation, you should be happy with <type>citext</type>'s comparisons. But
+ if you have data in different languages stored in your database, users
+ of one language may find their query results are not as expected if the
+ collation is for another language.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ As of <productname>PostgreSQL</productname> 9.1, you can attach a
+ <literal>COLLATE</literal> specification to <type>citext</type> columns or data
+ values. Currently, <type>citext</type> operators will honor a non-default
+ <literal>COLLATE</literal> specification while comparing case-folded strings,
+ but the initial folding to lower case is always done according to the
+ database's <literal>LC_CTYPE</literal> setting (that is, as though
+ <literal>COLLATE "default"</literal> were given). This may be changed in a
+ future release so that both steps follow the input <literal>COLLATE</literal>
+ specification.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <type>citext</type> is not as efficient as <type>text</type> because the
+ operator functions and the B-tree comparison functions must make copies
+ of the data and convert it to lower case for comparisons. Also, only
+ <type>text</type> can support B-Tree deduplication. However,
+ <type>citext</type> is slightly more efficient than using
+ <function>lower</function> to get case-insensitive matching.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <type>citext</type> doesn't help much if you need data to compare
+ case-sensitively in some contexts and case-insensitively in other
+ contexts. The standard answer is to use the <type>text</type> type and
+ manually use the <function>lower</function> function when you need to compare
+ case-insensitively; this works all right if case-insensitive comparison
+ is needed only infrequently. If you need case-insensitive behavior most
+ of the time and case-sensitive infrequently, consider storing the data
+ as <type>citext</type> and explicitly casting the column to <type>text</type>
+ when you want case-sensitive comparison. In either situation, you will
+ need two indexes if you want both types of searches to be fast.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The schema containing the <type>citext</type> operators must be
+ in the current <varname>search_path</varname> (typically <literal>public</literal>);
+ if it is not, the normal case-sensitive <type>text</type> operators
+ will be invoked instead.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The approach of lower-casing strings for comparison does not handle some
+ Unicode special cases correctly, for example when one upper-case letter
+ has two lower-case letter equivalents. Unicode distinguishes between
+ <firstterm>case mapping</firstterm> and <firstterm>case
+ folding</firstterm> for this reason. Use nondeterministic collations
+ instead of <type>citext</type> to handle that correctly.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </sect2>
+
+ <sect2>
+ <title>Author</title>
+
+ <para>
+ David E. Wheeler <email>david@kineticode.com</email>
+ </para>
+
+ <para>
+ Inspired by the original <type>citext</type> module by Donald Fraser.
+ </para>
+
+ </sect2>
+
+</sect1>