diff options
Diffstat (limited to 'doc/src/sgml/citext.sgml')
-rw-r--r-- | doc/src/sgml/citext.sgml | 293 |
1 files changed, 293 insertions, 0 deletions
diff --git a/doc/src/sgml/citext.sgml b/doc/src/sgml/citext.sgml new file mode 100644 index 0000000..5986601 --- /dev/null +++ b/doc/src/sgml/citext.sgml @@ -0,0 +1,293 @@ +<!-- doc/src/sgml/citext.sgml --> + +<sect1 id="citext" xreflabel="citext"> + <title>citext</title> + + <indexterm zone="citext"> + <primary>citext</primary> + </indexterm> + + <para> + The <filename>citext</filename> module provides a case-insensitive + character string type, <type>citext</type>. Essentially, it internally calls + <function>lower</function> when comparing values. Otherwise, it behaves almost + exactly like <type>text</type>. + </para> + + <tip> + <para> + Consider using <firstterm>nondeterministic collations</firstterm> (see + <xref linkend="collation-nondeterministic"/>) instead of this module. They + can be used for case-insensitive comparisons, accent-insensitive + comparisons, and other combinations, and they handle more Unicode special + cases correctly. + </para> + </tip> + + <para> + This module is considered <quote>trusted</quote>, that is, it can be + installed by non-superusers who have <literal>CREATE</literal> privilege + on the current database. + </para> + + <sect2> + <title>Rationale</title> + + <para> + The standard approach to doing case-insensitive matches + in <productname>PostgreSQL</productname> has been to use the <function>lower</function> + function when comparing values, for example + +<programlisting> +SELECT * FROM tab WHERE lower(col) = LOWER(?); +</programlisting> + </para> + + <para> + This works reasonably well, but has a number of drawbacks: + </para> + + <itemizedlist> + <listitem> + <para> + It makes your SQL statements verbose, and you always have to remember to + use <function>lower</function> on both the column and the query value. + </para> + </listitem> + <listitem> + <para> + It won't use an index, unless you create a functional index using + <function>lower</function>. + </para> + </listitem> + <listitem> + <para> + If you declare a column as <literal>UNIQUE</literal> or <literal>PRIMARY + KEY</literal>, the implicitly generated index is case-sensitive. So it's + useless for case-insensitive searches, and it won't enforce + uniqueness case-insensitively. + </para> + </listitem> + </itemizedlist> + + <para> + The <type>citext</type> data type allows you to eliminate calls + to <function>lower</function> in SQL queries, and allows a primary key to + be case-insensitive. <type>citext</type> is locale-aware, just + like <type>text</type>, which means that the matching of upper case and + lower case characters is dependent on the rules of + the database's <literal>LC_CTYPE</literal> setting. Again, this behavior is + identical to the use of <function>lower</function> in queries. But because it's + done transparently by the data type, you don't have to remember to do + anything special in your queries. + </para> + + </sect2> + + <sect2> + <title>How to Use It</title> + + <para> + Here's a simple example of usage: + +<programlisting> +CREATE TABLE users ( + nick CITEXT PRIMARY KEY, + pass TEXT NOT NULL +); + +INSERT INTO users VALUES ( 'larry', sha256(random()::text::bytea) ); +INSERT INTO users VALUES ( 'Tom', sha256(random()::text::bytea) ); +INSERT INTO users VALUES ( 'Damian', sha256(random()::text::bytea) ); +INSERT INTO users VALUES ( 'NEAL', sha256(random()::text::bytea) ); +INSERT INTO users VALUES ( 'Bjørn', sha256(random()::text::bytea) ); + +SELECT * FROM users WHERE nick = 'Larry'; +</programlisting> + + The <command>SELECT</command> statement will return one tuple, even though + the <structfield>nick</structfield> column was set to <literal>larry</literal> and the query + was for <literal>Larry</literal>. + </para> + </sect2> + + <sect2> + <title>String Comparison Behavior</title> + + <para> + <type>citext</type> performs comparisons by converting each string to lower + case (as though <function>lower</function> were called) and then comparing the + results normally. Thus, for example, two strings are considered equal + if <function>lower</function> would produce identical results for them. + </para> + + <para> + In order to emulate a case-insensitive collation as closely as possible, + there are <type>citext</type>-specific versions of a number of string-processing + operators and functions. So, for example, the regular expression + operators <literal>~</literal> and <literal>~*</literal> exhibit the same behavior when + applied to <type>citext</type>: they both match case-insensitively. + The same is true + for <literal>!~</literal> and <literal>!~*</literal>, as well as for the + <literal>LIKE</literal> operators <literal>~~</literal> and <literal>~~*</literal>, and + <literal>!~~</literal> and <literal>!~~*</literal>. If you'd like to match + case-sensitively, you can cast the operator's arguments to <type>text</type>. + </para> + + <para> + Similarly, all of the following functions perform matching + case-insensitively if their arguments are <type>citext</type>: + </para> + + <itemizedlist> + <listitem> + <para> + <function>regexp_match()</function> + </para> + </listitem> + <listitem> + <para> + <function>regexp_matches()</function> + </para> + </listitem> + <listitem> + <para> + <function>regexp_replace()</function> + </para> + </listitem> + <listitem> + <para> + <function>regexp_split_to_array()</function> + </para> + </listitem> + <listitem> + <para> + <function>regexp_split_to_table()</function> + </para> + </listitem> + <listitem> + <para> + <function>replace()</function> + </para> + </listitem> + <listitem> + <para> + <function>split_part()</function> + </para> + </listitem> + <listitem> + <para> + <function>strpos()</function> + </para> + </listitem> + <listitem> + <para> + <function>translate()</function> + </para> + </listitem> + </itemizedlist> + + <para> + For the regexp functions, if you want to match case-sensitively, you can + specify the <quote>c</quote> flag to force a case-sensitive match. Otherwise, + you must cast to <type>text</type> before using one of these functions if + you want case-sensitive behavior. + </para> + + </sect2> + + <sect2> + <title>Limitations</title> + + <itemizedlist> + <listitem> + <para> + <type>citext</type>'s case-folding behavior depends on + the <literal>LC_CTYPE</literal> setting of your database. How it compares + values is therefore determined when the database is created. + It is not truly + case-insensitive in the terms defined by the Unicode standard. + Effectively, what this means is that, as long as you're happy with your + collation, you should be happy with <type>citext</type>'s comparisons. But + if you have data in different languages stored in your database, users + of one language may find their query results are not as expected if the + collation is for another language. + </para> + </listitem> + + <listitem> + <para> + As of <productname>PostgreSQL</productname> 9.1, you can attach a + <literal>COLLATE</literal> specification to <type>citext</type> columns or data + values. Currently, <type>citext</type> operators will honor a non-default + <literal>COLLATE</literal> specification while comparing case-folded strings, + but the initial folding to lower case is always done according to the + database's <literal>LC_CTYPE</literal> setting (that is, as though + <literal>COLLATE "default"</literal> were given). This may be changed in a + future release so that both steps follow the input <literal>COLLATE</literal> + specification. + </para> + </listitem> + + <listitem> + <para> + <type>citext</type> is not as efficient as <type>text</type> because the + operator functions and the B-tree comparison functions must make copies + of the data and convert it to lower case for comparisons. Also, only + <type>text</type> can support B-Tree deduplication. However, + <type>citext</type> is slightly more efficient than using + <function>lower</function> to get case-insensitive matching. + </para> + </listitem> + + <listitem> + <para> + <type>citext</type> doesn't help much if you need data to compare + case-sensitively in some contexts and case-insensitively in other + contexts. The standard answer is to use the <type>text</type> type and + manually use the <function>lower</function> function when you need to compare + case-insensitively; this works all right if case-insensitive comparison + is needed only infrequently. If you need case-insensitive behavior most + of the time and case-sensitive infrequently, consider storing the data + as <type>citext</type> and explicitly casting the column to <type>text</type> + when you want case-sensitive comparison. In either situation, you will + need two indexes if you want both types of searches to be fast. + </para> + </listitem> + + <listitem> + <para> + The schema containing the <type>citext</type> operators must be + in the current <varname>search_path</varname> (typically <literal>public</literal>); + if it is not, the normal case-sensitive <type>text</type> operators + will be invoked instead. + </para> + </listitem> + + <listitem> + <para> + The approach of lower-casing strings for comparison does not handle some + Unicode special cases correctly, for example when one upper-case letter + has two lower-case letter equivalents. Unicode distinguishes between + <firstterm>case mapping</firstterm> and <firstterm>case + folding</firstterm> for this reason. Use nondeterministic collations + instead of <type>citext</type> to handle that correctly. + </para> + </listitem> + </itemizedlist> + </sect2> + + <sect2> + <title>Author</title> + + <para> + David E. Wheeler <email>david@kineticode.com</email> + </para> + + <para> + Inspired by the original <type>citext</type> module by Donald Fraser. + </para> + + </sect2> + +</sect1> |