author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-13 13:44:03 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-13 13:44:03 +0000 |
commit | 293913568e6a7a86fd1479e1cff8e2ecb58d6568 (patch) | |
tree | fc3b469a3ec5ab71b36ea97cc7aaddb838423a0c /doc/src/sgml/charset.sgml | |
parent | Initial commit. (diff) | |
Adding upstream version 16.2. (upstream/16.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/src/sgml/charset.sgml')
-rw-r--r-- | doc/src/sgml/charset.sgml | 3318 |
1 files changed, 3318 insertions, 0 deletions
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml new file mode 100644 index 0000000..975b9dc --- /dev/null +++ b/doc/src/sgml/charset.sgml @@ -0,0 +1,3318 @@ +<!-- doc/src/sgml/charset.sgml --> + +<chapter id="charset"> + <title>Localization</title> + + <para> + This chapter describes the available localization features from the + point of view of the administrator. + <productname>PostgreSQL</productname> supports two localization + facilities: + + <itemizedlist> + <listitem> + <para> + Using the locale features of the operating system to provide + locale-specific collation order, number formatting, translated + messages, and other aspects. + This is covered in <xref linkend="locale"/> and + <xref linkend="collation"/>. + </para> + </listitem> + + <listitem> + <para> + Providing a number of different character sets to support storing text + in all kinds of languages, and providing character set translation + between client and server. + This is covered in <xref linkend="multibyte"/>. + </para> + </listitem> + </itemizedlist> + </para> + + + <sect1 id="locale"> + <title>Locale Support</title> + + <indexterm zone="locale"><primary>locale</primary></indexterm> + + <para> + <firstterm>Locale</firstterm> support refers to an application respecting + cultural preferences regarding alphabets, sorting, number + formatting, etc. <productname>PostgreSQL</productname> uses the standard ISO + C and <acronym>POSIX</acronym> locale facilities provided by the server operating + system. For additional information refer to the documentation of your + system. + </para> + + <sect2 id="locale-overview"> + <title>Overview</title> + + <para> + Locale support is automatically initialized when a database + cluster is created using <command>initdb</command>. 
+ <command>initdb</command> will initialize the database cluster + with the locale setting of its execution environment by default, + so if your system is already set to use the locale that you want + in your database cluster then there is nothing else you need to + do. If you want to use a different locale (or you are not sure + which locale your system is set to), you can instruct + <command>initdb</command> exactly which locale to use by + specifying the <option>--locale</option> option. For example: +<screen> +initdb --locale=sv_SE +</screen> + </para> + + <para> + This example for Unix systems sets the locale to Swedish + (<literal>sv</literal>) as spoken + in Sweden (<literal>SE</literal>). Other possibilities might include + <literal>en_US</literal> (U.S. English) and <literal>fr_CA</literal> (French + Canadian). If more than one character set can be used for a + locale then the specifications can take the form + <replaceable>language_territory.codeset</replaceable>. For example, + <literal>fr_BE.UTF-8</literal> represents the French language (fr) as + spoken in Belgium (BE), with a <acronym>UTF-8</acronym> character set + encoding. + </para> + + <para> + What locales are available on your + system under what names depends on what was provided by the operating + system vendor and what was installed. On most Unix systems, the command + <literal>locale -a</literal> will provide a list of available locales. + Windows uses more verbose locale names, such as <literal>German_Germany</literal> + or <literal>Swedish_Sweden.1252</literal>, but the principles are the same. + </para> + + <para> + Occasionally it is useful to mix rules from several locales, e.g., + use English collation rules but Spanish messages. 
To support that, a + set of locale subcategories exist that control only certain + aspects of the localization rules: + + <informaltable> + <tgroup cols="2"> + <colspec colname="col1" colwidth="1*"/> + <colspec colname="col2" colwidth="3*"/> + <tbody> + <row> + <entry><envar>LC_COLLATE</envar></entry> + <entry>String sort order</entry> + </row> + <row> + <entry><envar>LC_CTYPE</envar></entry> + <entry>Character classification (What is a letter? Its upper-case equivalent?)</entry> + </row> + <row> + <entry><envar>LC_MESSAGES</envar></entry> + <entry>Language of messages</entry> + </row> + <row> + <entry><envar>LC_MONETARY</envar></entry> + <entry>Formatting of currency amounts</entry> + </row> + <row> + <entry><envar>LC_NUMERIC</envar></entry> + <entry>Formatting of numbers</entry> + </row> + <row> + <entry><envar>LC_TIME</envar></entry> + <entry>Formatting of dates and times</entry> + </row> + </tbody> + </tgroup> + </informaltable> + + The category names translate into names of + <command>initdb</command> options to override the locale choice + for a specific category. For instance, to set the locale to + French Canadian, but use U.S. rules for formatting currency, use + <literal>initdb --locale=fr_CA --lc-monetary=en_US</literal>. + </para> + + <para> + If you want the system to behave as if it had no locale support, + use the special locale name <literal>C</literal>, or equivalently + <literal>POSIX</literal>. + </para> + + <para> + Some locale categories must have their values + fixed when the database is created. You can use different settings + for different databases, but once a database is created, you cannot + change them for that database anymore. <literal>LC_COLLATE</literal> + and <literal>LC_CTYPE</literal> are these categories. They affect + the sort order of indexes, so they must be kept fixed, or indexes on + text columns would become corrupt. + (But you can alleviate this restriction using collations, as discussed + in <xref linkend="collation"/>.) 
+ The default values for these + categories are determined when <command>initdb</command> is run, and + those values are used when new databases are created, unless + specified otherwise in the <command>CREATE DATABASE</command> command. + </para> + + <para> + The other locale categories can be changed whenever desired + by setting the server configuration parameters + that have the same name as the locale categories (see <xref + linkend="runtime-config-client-format"/> for details). The values + that are chosen by <command>initdb</command> are actually only written + into the configuration file <filename>postgresql.conf</filename> to + serve as defaults when the server is started. If you remove these + assignments from <filename>postgresql.conf</filename> then the + server will inherit the settings from its execution environment. + </para> + + <para> + Note that the locale behavior of the server is determined by the + environment variables seen by the server, not by the environment + of any client. Therefore, be careful to configure the correct locale settings + before starting the server. A consequence of this is that if + client and server are set up in different locales, messages might + appear in different languages depending on where they originated. + </para> + + <note> + <para> + When we speak of inheriting the locale from the execution + environment, this means the following on most operating systems: + For a given locale category, say the collation, the following + environment variables are consulted in this order until one is + found to be set: <envar>LC_ALL</envar>, <envar>LC_COLLATE</envar> + (or the variable corresponding to the respective category), + <envar>LANG</envar>. If none of these environment variables are + set then the locale defaults to <literal>C</literal>. 
+ </para> + + <para> + Some message localization libraries also look at the environment + variable <envar>LANGUAGE</envar> which overrides all other locale + settings for the purpose of setting the language of messages. If + in doubt, please refer to the documentation of your operating + system, in particular the documentation about + <application>gettext</application>. + </para> + </note> + + <para> + To enable messages to be translated to the user's preferred language, + <acronym>NLS</acronym> must have been selected at build time + (<literal>configure --enable-nls</literal>). All other locale support is + built in automatically. + </para> + </sect2> + + <sect2 id="locale-behavior"> + <title>Behavior</title> + + <para> + The locale settings influence the following SQL features: + + <itemizedlist> + <listitem> + <para> + Sort order in queries using <literal>ORDER BY</literal> or the standard + comparison operators on textual data + <indexterm><primary>ORDER BY</primary><secondary>and locales</secondary></indexterm> + </para> + </listitem> + + <listitem> + <para> + The <function>upper</function>, <function>lower</function>, and <function>initcap</function> + functions + <indexterm><primary>upper</primary><secondary>and locales</secondary></indexterm> + <indexterm><primary>lower</primary><secondary>and locales</secondary></indexterm> + </para> + </listitem> + + <listitem> + <para> + Pattern matching operators (<literal>LIKE</literal>, <literal>SIMILAR TO</literal>, + and POSIX-style regular expressions); locales affect both case + insensitive matching and the classification of characters by + character-class regular expressions + <indexterm><primary>LIKE</primary><secondary>and locales</secondary></indexterm> + <indexterm><primary>regular expressions</primary><secondary>and locales</secondary></indexterm> + </para> + </listitem> + + <listitem> + <para> + The <function>to_char</function> family of functions + <indexterm><primary>to_char</primary><secondary>and 
locales</secondary></indexterm> + </para> + </listitem> + + <listitem> + <para> + The ability to use indexes with <literal>LIKE</literal> clauses + </para> + </listitem> + </itemizedlist> + </para> + + <para> + The drawback of using locales other than <literal>C</literal> or + <literal>POSIX</literal> in <productname>PostgreSQL</productname> is its performance + impact. It slows character handling and prevents ordinary indexes + from being used by <literal>LIKE</literal>. For this reason use locales + only if you actually need them. + </para> + + <para> + As a workaround to allow <productname>PostgreSQL</productname> to use indexes + with <literal>LIKE</literal> clauses under a non-C locale, several custom + operator classes exist. These allow the creation of an index that + performs a strict character-by-character comparison, ignoring + locale comparison rules. Refer to <xref linkend="indexes-opclass"/> + for more information. Another approach is to create indexes using + the <literal>C</literal> collation, as discussed in + <xref linkend="collation"/>. + </para> + </sect2> + + <sect2 id="locale-selecting-locales"> + <title>Selecting Locales</title> + + <para> + Locales can be selected in different scopes depending on requirements. + The above overview showed how locales are specified using + <command>initdb</command> to set the defaults for the entire cluster. The + following list shows where locales can be selected. Each item provides + the defaults for the subsequent items, and each lower item allows + overriding the defaults on a finer granularity. + </para> + + <orderedlist> + <listitem> + <para> + As explained above, the environment of the operating system provides the + defaults for the locales of a newly initialized database cluster. In + many cases, this is enough: If the operating system is configured for + the desired language/territory, then + <productname>PostgreSQL</productname> will by default also behave + according to that locale. 
+ </para> + </listitem> + + <listitem> + <para> + As shown above, command-line options for <command>initdb</command> + specify the locale settings for a newly initialized database cluster. + Use this if the operating system does not have the locale configuration + you want for your database system. + </para> + </listitem> + + <listitem> + <para> + A locale can be selected separately for each database. The SQL command + <command>CREATE DATABASE</command> and its command-line equivalent + <command>createdb</command> have options for that. Use this for example + if a database cluster houses databases for multiple tenants with + different requirements. + </para> + </listitem> + + <listitem> + <para> + Locale settings can be made for individual table columns. This uses an + SQL object called <firstterm>collation</firstterm> and is explained in + <xref linkend="collation"/>. Use this for example to sort data in + different languages or customize the sort order of a particular table. + </para> + </listitem> + + <listitem> + <para> + Finally, locales can be selected for an individual query. Again, this + uses SQL collation objects. This could be used to change the sort order + based on run-time choices or for ad-hoc experimentation. + </para> + </listitem> + </orderedlist> + </sect2> + + <sect2 id="locale-providers"> + <title>Locale Providers</title> + + <para> + <productname>PostgreSQL</productname> supports multiple <firstterm>locale + providers</firstterm>. This specifies which library supplies the locale + data. One standard provider name is <literal>libc</literal>, which uses + the locales provided by the operating system C library. These are the + locales used by most tools provided by the operating system. Another + provider is <literal>icu</literal>, which uses the external + ICU<indexterm><primary>ICU</primary></indexterm> library. ICU locales can + only be used if support for ICU was configured when PostgreSQL was built. 
+ </para> + + <para> + The commands and tools that select the locale settings, as described + above, each have an option to select the locale provider. The examples + shown earlier all use the <literal>libc</literal> provider, which is the + default. Here is an example to initialize a database cluster using the + ICU provider: +<programlisting> +initdb --locale-provider=icu --icu-locale=en +</programlisting> + See the description of the respective commands and programs for + details. Note that you can mix locale providers at different + granularities, for example use <literal>libc</literal> by default for the + cluster but have one database that uses the <literal>icu</literal> + provider, and then have collation objects using either provider within + those databases. + </para> + + <para> + Which locale provider to use depends on individual requirements. For most + basic uses, either provider will give adequate results. For the libc + provider, it depends on what the operating system offers; some operating + systems are better than others. For advanced uses, ICU offers more locale + variants and customization options. + </para> + </sect2> + + <sect2 id="icu-locales"> + <title>ICU Locales</title> + + <sect3 id="icu-locale-names"> + <title>ICU Locale Names</title> + + <para> + The ICU format for the locale name is a <link + linkend="icu-language-tag">Language Tag</link>. + +<programlisting> +CREATE COLLATION mycollation1 (provider = icu, locale = 'ja-JP'); +CREATE COLLATION mycollation2 (provider = icu, locale = 'fr'); +</programlisting> + </para> + </sect3> + + <sect3 id="icu-canonicalization"> + <title>Locale Canonicalization and Validation</title> + <para> + When defining a new ICU collation object or database with ICU as the + provider, the given locale name is transformed ("canonicalized") into a + language tag if not already in that form. 
For instance, + +<screen> +CREATE COLLATION mycollation3 (provider = icu, locale = 'en-US-u-kn-true'); +NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true" +CREATE COLLATION mycollation4 (provider = icu, locale = 'de_DE.utf8'); +NOTICE: using standard form "de-DE" for locale "de_DE.utf8" +</screen> + + If you see this notice, ensure that the <symbol>provider</symbol> and + <symbol>locale</symbol> are the expected result. For consistent results + when using the ICU provider, specify the canonical <link + linkend="icu-language-tag">language tag</link> instead of relying on the + transformation. + </para> + + <para> + A locale with no language name, or the special language name + <literal>root</literal>, is transformed to have the language + <literal>und</literal> ("undefined"). + </para> + + <para> + ICU can transform most libc locale names, as well as some other formats, + into language tags for easier transition to ICU. If a libc locale name is + used in ICU, it may not have precisely the same behavior as in libc. + </para> + + <para> + If there is a problem interpreting the locale name, or if the locale name + represents a language or region that ICU does not recognize, you will see + the following warning: + +<screen> +CREATE COLLATION nonsense (provider = icu, locale = 'nonsense'); +WARNING: ICU locale "nonsense" has unknown language "nonsense" +HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. +CREATE COLLATION +</screen> + + <xref linkend="guc-icu-validation-level"/> controls how the message is + reported. Unless set to <literal>ERROR</literal>, the collation will + still be created, but the behavior may not be what the user intended. + </para> + </sect3> + + <sect3 id="icu-language-tag"> + <title>Language Tag</title> + + <para> + A language tag, defined in BCP 47, is a standardized identifier used to + identify languages, regions, and other information about a locale. 
+ </para> + + <para> + Basic language tags are simply + <replaceable>language</replaceable><literal>-</literal><replaceable>region</replaceable>; + or even just <replaceable>language</replaceable>. The + <replaceable>language</replaceable> is a language code + (e.g. <literal>fr</literal> for French), and + <replaceable>region</replaceable> is a region code + (e.g. <literal>CA</literal> for Canada). Examples: + <literal>ja-JP</literal>, <literal>de</literal>, or + <literal>fr-CA</literal>. + </para> + + <para> + Collation settings may be included in the language tag to customize + collation behavior. ICU allows extensive customization, such as + sensitivity (or insensitivity) to accents, case, and punctuation; + treatment of digits within text; and many other options to satisfy a + variety of uses. + </para> + + <para> + To include this additional collation information in a language tag, + append <literal>-u</literal>, which indicates there are additional + collation settings, followed by one or more + <literal>-</literal><replaceable>key</replaceable><literal>-</literal><replaceable>value</replaceable> + pairs. The <replaceable>key</replaceable> is the key for a <link + linkend="icu-collation-settings">collation setting</link> and + <replaceable>value</replaceable> is a valid value for that setting. For + boolean settings, the <literal>-</literal><replaceable>key</replaceable> + may be specified without a corresponding + <literal>-</literal><replaceable>value</replaceable>, which implies a + value of <literal>true</literal>. + </para> + + <para> + For example, the language tag <literal>en-US-u-kn-ks-level2</literal> + means the locale with the English language in the US region, with + collation settings <literal>kn</literal> set to <literal>true</literal> + and <literal>ks</literal> set to <literal>level2</literal>. 
Those + settings mean the collation will be case-insensitive and treat a sequence + of digits as a single number: + +<screen> +CREATE COLLATION mycollation5 (provider = icu, deterministic = false, locale = 'en-US-u-kn-ks-level2'); +SELECT 'aB' = 'Ab' COLLATE mycollation5 as result; + result +-------- + t +(1 row) + +SELECT 'N-45' &lt; 'N-123' COLLATE mycollation5 as result; + result +-------- + t +(1 row) +</screen> + </para> + + <para> + See <xref linkend="icu-custom-collations"/> for details and additional + examples of using language tags with custom collation information for the + locale. + </para> + </sect3> + </sect2> + + <sect2 id="locale-problems"> + <title>Problems</title> + + <para> + If locale support doesn't work according to the explanation above, + check that the locale support in your operating system is + correctly configured. To check what locales are installed on your + system, you can use the command <literal>locale -a</literal> if + your operating system provides it. + </para> + + <para> + Check that <productname>PostgreSQL</productname> is actually using the locale + that you think it is. The <envar>LC_COLLATE</envar> and <envar>LC_CTYPE</envar> + settings are determined when a database is created, and cannot be + changed except by creating a new database. Other locale + settings including <envar>LC_MESSAGES</envar> and <envar>LC_MONETARY</envar> + are initially determined by the environment the server is started + in, but can be changed on-the-fly. You can check the active locale + settings using the <command>SHOW</command> command. + </para> + + <para> + The directory <filename>src/test/locale</filename> in the source + distribution contains a test suite for + <productname>PostgreSQL</productname>'s locale support. + </para> + + <para> + Client applications that handle server-side errors by parsing the + text of the error message will obviously have problems when the + server's messages are in a different language. 
Authors of such + applications are advised to make use of the error code scheme + instead. + </para> + + <para> + Maintaining catalogs of message translations requires the on-going + efforts of many volunteers that want to see + <productname>PostgreSQL</productname> speak their preferred language well. + If messages in your language are currently not available or not fully + translated, your assistance would be appreciated. If you want to + help, refer to <xref linkend="nls"/> or write to the developers' + mailing list. + </para> + </sect2> + </sect1> + + + <sect1 id="collation"> + <title>Collation Support</title> + + <indexterm zone="collation"><primary>collation</primary></indexterm> + + <para> + The collation feature allows specifying the sort order and character + classification behavior of data per-column, or even per-operation. + This alleviates the restriction that the + <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol> settings + of a database cannot be changed after its creation. + </para> + + <sect2 id="collation-concepts"> + <title>Concepts</title> + + <para> + Conceptually, every expression of a collatable data type has a + collation. (The built-in collatable data types are + <type>text</type>, <type>varchar</type>, and <type>char</type>. + User-defined base types can also be marked collatable, and of course + a <glossterm linkend="glossary-domain">domain</glossterm> over a + collatable data type is collatable.) If the + expression is a column reference, the collation of the expression is the + defined collation of the column. If the expression is a constant, the + collation is the default collation of the data type of the + constant. The collation of a more complex expression is derived + from the collations of its inputs, as described below. + </para> + + <para> + The collation of an expression can be the <quote>default</quote> + collation, which means the locale settings defined for the + database. 
It is also possible for an expression's collation to be + indeterminate. In such cases, ordering operations and other + operations that need to know the collation will fail. + </para> + + <para> + When the database system has to perform an ordering or a character + classification, it uses the collation of the input expression. This + happens, for example, with <literal>ORDER BY</literal> clauses + and function or operator calls such as <literal>&lt;</literal>. + The collation to apply for an <literal>ORDER BY</literal> clause + is simply the collation of the sort key. The collation to apply for a + function or operator call is derived from the arguments, as described + below. In addition to comparison operators, collations are taken into + account by functions that convert between lower and upper case + letters, such as <function>lower</function>, <function>upper</function>, and + <function>initcap</function>; by pattern matching operators; and by + <function>to_char</function> and related functions. + </para> + + <para> + For a function or operator call, the collation that is derived by + examining the argument collations is used at run time for performing + the specified operation. If the result of the function or operator + call is of a collatable data type, the collation is also used at parse + time as the defined collation of the function or operator expression, + in case there is a surrounding expression that requires knowledge of + its collation. + </para> + + <para> + The <firstterm>collation derivation</firstterm> of an expression can be + implicit or explicit. This distinction affects how collations are + combined when multiple different collations appear in an + expression. An explicit collation derivation occurs when a + <literal>COLLATE</literal> clause is used; all other collation + derivations are implicit. 
When multiple collations need to be + combined, for example in a function call, the following rules are + used: + + <orderedlist> + <listitem> + <para> + If any input expression has an explicit collation derivation, then + all explicitly derived collations among the input expressions must be + the same, otherwise an error is raised. If any explicitly + derived collation is present, that is the result of the + collation combination. + </para> + </listitem> + + <listitem> + <para> + Otherwise, all input expressions must have the same implicit + collation derivation or the default collation. If any non-default + collation is present, that is the result of the collation combination. + Otherwise, the result is the default collation. + </para> + </listitem> + + <listitem> + <para> + If there are conflicting non-default implicit collations among the + input expressions, then the combination is deemed to have indeterminate + collation. This is not an error condition unless the particular + function being invoked requires knowledge of the collation it should + apply. If it does, an error will be raised at run-time. + </para> + </listitem> + </orderedlist> + + For example, consider this table definition: +<programlisting> +CREATE TABLE test1 ( + a text COLLATE "de_DE", + b text COLLATE "es_ES", + ... +); +</programlisting> + + Then in +<programlisting> +SELECT a &lt; 'foo' FROM test1; +</programlisting> + the <literal>&lt;</literal> comparison is performed according to + <literal>de_DE</literal> rules, because the expression combines an + implicitly derived collation with the default collation. But in +<programlisting> +SELECT a &lt; ('foo' COLLATE "fr_FR") FROM test1; +</programlisting> + the comparison is performed using <literal>fr_FR</literal> rules, + because the explicit collation derivation overrides the implicit one. 
+ Furthermore, given +<programlisting> +SELECT a &lt; b FROM test1; +</programlisting> + the parser cannot determine which collation to apply, since the + <structfield>a</structfield> and <structfield>b</structfield> columns have conflicting + implicit collations. Since the <literal>&lt;</literal> operator + does need to know which collation to use, this will result in an + error. The error can be resolved by attaching an explicit collation + specifier to either input expression, thus: +<programlisting> +SELECT a &lt; b COLLATE "de_DE" FROM test1; +</programlisting> + or equivalently +<programlisting> +SELECT a COLLATE "de_DE" &lt; b FROM test1; +</programlisting> + On the other hand, the structurally similar case +<programlisting> +SELECT a || b FROM test1; +</programlisting> + does not result in an error, because the <literal>||</literal> operator + does not care about collations: its result is the same regardless + of the collation. + </para> + + <para> + The collation assigned to a function or operator's combined input + expressions is also considered to apply to the function or operator's + result, if the function or operator delivers a result of a collatable + data type. So, in +<programlisting> +SELECT * FROM test1 ORDER BY a || 'foo'; +</programlisting> + the ordering will be done according to <literal>de_DE</literal> rules. + But this query: +<programlisting> +SELECT * FROM test1 ORDER BY a || b; +</programlisting> + results in an error, because even though the <literal>||</literal> operator + doesn't need to know a collation, the <literal>ORDER BY</literal> clause does. 
+ As before, the conflict can be resolved with an explicit collation + specifier: +<programlisting> +SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR"; +</programlisting> + </para> + </sect2> + + <sect2 id="collation-managing"> + <title>Managing Collations</title> + + <para> + A collation is an SQL schema object that maps an SQL name to locales + provided by libraries installed in the operating system. A collation + definition has a <firstterm>provider</firstterm> that specifies which + library supplies the locale data. One standard provider name + is <literal>libc</literal>, which uses the locales provided by the + operating system C library. These are the locales used by most tools + provided by the operating system. Another provider + is <literal>icu</literal>, which uses the external + ICU<indexterm><primary>ICU</primary></indexterm> library. ICU locales can only be + used if support for ICU was configured when PostgreSQL was built. + </para> + + <para> + A collation object provided by <literal>libc</literal> maps to a + combination of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol> + settings, as accepted by the <literal>setlocale()</literal> system library call. (As + the name would suggest, the main purpose of a collation is to set + <symbol>LC_COLLATE</symbol>, which controls the sort order. But + it is rarely necessary in practice to have an + <symbol>LC_CTYPE</symbol> setting that is different from + <symbol>LC_COLLATE</symbol>, so it is more convenient to collect + these under one concept than to create another infrastructure for + setting <symbol>LC_CTYPE</symbol> per expression.) Also, + a <literal>libc</literal> collation + is tied to a character set encoding (see <xref linkend="multibyte"/>). + The same collation name may exist for different encodings. + </para> + + <para> + A collation object provided by <literal>icu</literal> maps to a named + collator provided by the ICU library. 
ICU does not support + separate <quote>collate</quote> and <quote>ctype</quote> settings, so + they are always the same. Also, ICU collations are independent of the + encoding, so there is always only one ICU collation of a given name in + a database. + </para> + + <sect3 id="collation-managing-standard"> + <title>Standard Collations</title> + + <para> + On all platforms, the collations named <literal>default</literal>, + <literal>C</literal>, and <literal>POSIX</literal> are available. Additional + collations may be available depending on operating system support. + The <literal>default</literal> collation selects the <symbol>LC_COLLATE</symbol> + and <symbol>LC_CTYPE</symbol> values specified at database creation time. + The <literal>C</literal> and <literal>POSIX</literal> collations both specify + <quote>traditional C</quote> behavior, in which only the ASCII letters + <quote><literal>A</literal></quote> through <quote><literal>Z</literal></quote> + are treated as letters, and sorting is done strictly by character + code byte values. + </para> + + <note> + <para> + The <literal>C</literal> and <literal>POSIX</literal> locales may behave + differently depending on the database encoding. + </para> + </note> + + <para> + Additionally, two SQL standard collation names are available: + + <variablelist> + <varlistentry> + <term><literal>unicode</literal></term> + <listitem> + <para> + This collation sorts using the Unicode Collation Algorithm with the + Default Unicode Collation Element Table. It is available in all + encodings. ICU support is required to use this collation. (This + collation has the same behavior as the ICU root locale; see <xref + linkend="collation-managing-predefined-icu-und-x-icu"/>.) + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><literal>ucs_basic</literal></term> + <listitem> + <para> + This collation sorts by Unicode code point. It is only available for + encoding <literal>UTF8</literal>. 
(This collation has the same + behavior as the libc locale specification <literal>C</literal> in + <literal>UTF8</literal> encoding.) + </para> + </listitem> + </varlistentry> + </variablelist> + </para> + </sect3> + + <sect3 id="collation-managing-predefined"> + <title>Predefined Collations</title> + + <para> + If the operating system provides support for using multiple locales + within a single program (<function>newlocale</function> and related functions), + or if support for ICU is configured, + then when a database cluster is initialized, <command>initdb</command> + populates the system catalog <literal>pg_collation</literal> with + collations based on all the locales it finds in the operating + system at the time. + </para> + + <para> + To inspect the currently available locales, use the query <literal>SELECT + * FROM pg_collation</literal>, or the command <command>\dOS+</command> + in <application>psql</application>. + </para> + + <sect4 id="collation-managing-predefined-libc"> + <title>libc Collations</title> + + <para> + For example, the operating system might + provide a locale named <literal>de_DE.utf8</literal>. + <command>initdb</command> would then create a collation named + <literal>de_DE.utf8</literal> for encoding <literal>UTF8</literal> + that has both <symbol>LC_COLLATE</symbol> and + <symbol>LC_CTYPE</symbol> set to <literal>de_DE.utf8</literal>. + It will also create a collation with the <literal>.utf8</literal> + tag stripped off the name. So you could also use the collation + under the name <literal>de_DE</literal>, which is less cumbersome + to write and makes the name less encoding-dependent. Note that, + nevertheless, the initial set of collation names is + platform-dependent. + </para> + + <para> + The default set of collations provided by <literal>libc</literal> map + directly to the locales installed in the operating system, which can be + listed using the command <literal>locale -a</literal>. 
In case + a <literal>libc</literal> collation is needed that has different values + for <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>, or if new + locales are installed in the operating system after the database system + was initialized, then a new collation may be created using + the <xref linkend="sql-createcollation"/> command. + New operating system locales can also be imported en masse using + the <link linkend="functions-admin-collation"><function>pg_import_system_collations()</function></link> function. + </para> + + <para> + Within any particular database, only collations that use that + database's encoding are of interest. Other entries in + <literal>pg_collation</literal> are ignored. Thus, a stripped collation + name such as <literal>de_DE</literal> can be considered unique + within a given database even though it would not be unique globally. + Use of the stripped collation names is recommended, since it will + make one fewer thing you need to change if you decide to change to + another database encoding. Note however that the <literal>default</literal>, + <literal>C</literal>, and <literal>POSIX</literal> collations can be used regardless of + the database encoding. + </para> + + <para> + <productname>PostgreSQL</productname> considers distinct collation + objects to be incompatible even when they have identical properties. + Thus for example, +<programlisting> +SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1; +</programlisting> + will draw an error even though the <literal>C</literal> and <literal>POSIX</literal> + collations have identical behaviors. Mixing stripped and non-stripped + collation names is therefore not recommended. + </para> + </sect4> + + <sect4 id="collation-managing-predefined-icu"> + <title>ICU Collations</title> + + <para> + With ICU, it is not sensible to enumerate all possible locale names. 
ICU + uses a particular naming system for locales, but there are many more ways + to name a locale than there are actually distinct locales. + <command>initdb</command> uses the ICU APIs to extract a set of distinct + locales to populate the initial set of collations. Collations provided by + ICU are created in the SQL environment with names in BCP 47 language tag + format, with a <quote>private use</quote> + extension <literal>-x-icu</literal> appended, to distinguish them from + libc locales. + </para> + + <para> + Here are some example collations that might be created: + + <variablelist> + <varlistentry id="collation-managing-predefined-icu-de-x-icu"> + <term><literal>de-x-icu</literal></term> + <listitem> + <para>German collation, default variant</para> + </listitem> + </varlistentry> + + <varlistentry id="collation-managing-predefined-icu-de-at-x-icu"> + <term><literal>de-AT-x-icu</literal></term> + <listitem> + <para>German collation for Austria, default variant</para> + <para> + (There are also, say, <literal>de-DE-x-icu</literal> + or <literal>de-CH-x-icu</literal>, but as of this writing, they are + equivalent to <literal>de-x-icu</literal>.) + </para> + </listitem> + </varlistentry> + + <varlistentry id="collation-managing-predefined-icu-und-x-icu"> + <term><literal>und-x-icu</literal> (for <quote>undefined</quote>)</term> + <listitem> + <para> + ICU <quote>root</quote> collation. Use this to get a reasonable + language-agnostic sort order. + </para> + </listitem> + </varlistentry> + </variablelist> + </para> + + <para> + Some (less frequently used) encodings are not supported by ICU. When the + database encoding is one of these, ICU collation entries + in <literal>pg_collation</literal> are ignored. Attempting to use one + will draw an error along the lines of <quote>collation "de-x-icu" for + encoding "WIN874" does not exist</quote>. 
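+ As a rough illustration, the ICU-provided entries that exist in the
+ catalog can be listed directly. This sketch assumes only the
+ <literal>collprovider</literal> column of <literal>pg_collation</literal>,
+ in which the value <literal>i</literal> denotes ICU:
+<programlisting>
+-- list the ICU collations recorded in the catalog
+SELECT collname, collencoding
+FROM pg_collation
+WHERE collprovider = 'i'
+ORDER BY collname;
+</programlisting>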
+ </para> + </sect4> + </sect3> + + <sect3 id="collation-create"> + <title>Creating New Collation Objects</title> + + <para> + If the standard and predefined collations are not sufficient, users can + create their own collation objects using the SQL + command <xref linkend="sql-createcollation"/>. + </para> + + <para> + The standard and predefined collations are in the + schema <literal>pg_catalog</literal>, like all predefined objects. + User-defined collations should be created in user schemas. This also + ensures that they are saved by <command>pg_dump</command>. + </para> + + <sect4 id="collation-managing-create-libc"> + <title>libc Collations</title> + + <para> + New libc collations can be created like this: +<programlisting> +CREATE COLLATION german (provider = libc, locale = 'de_DE'); +</programlisting> + The exact values that are acceptable for the <literal>locale</literal> + clause in this command depend on the operating system. On Unix-like + systems, the command <literal>locale -a</literal> will show a list. + </para> + + <para> + Since the predefined libc collations already include all collations + defined in the operating system when the database instance is + initialized, it is not often necessary to manually create new ones. + Reasons might be if a different naming system is desired (in which case + see also <xref linkend="collation-copy"/>) or if the operating system has + been upgraded to provide new locale definitions (in which case see + also <link linkend="functions-admin-collation"><function>pg_import_system_collations()</function></link>). + </para> + </sect4> + + <sect4 id="collation-managing-create-icu"> + <title>ICU Collations</title> + + <para> + ICU collations can be created like: + +<programlisting> +CREATE COLLATION german (provider = icu, locale = 'de-DE'); +</programlisting> + + ICU locales are specified as a BCP 47 <link + linkend="icu-language-tag">Language Tag</link>, but can also accept most + libc-style locale names. 
If possible, libc-style locale names are + transformed into language tags. + </para> + <para> + New ICU collations can customize collation behavior extensively by + including collation attributes in the language tag. See <xref + linkend="icu-custom-collations"/> for details and examples. + </para> + </sect4> + <sect4 id="collation-copy"> + <title>Copying Collations</title> + + <para> + The command <xref linkend="sql-createcollation"/> can also be used to + create a new collation from an existing collation, which can be useful to + be able to use operating-system-independent collation names in + applications, create compatibility names, or use an ICU-provided collation + under a more readable name. For example: +<programlisting> +CREATE COLLATION german FROM "de_DE"; +CREATE COLLATION french FROM "fr-x-icu"; +</programlisting> + </para> + </sect4> + </sect3> + + <sect3 id="collation-nondeterministic"> + <title>Nondeterministic Collations</title> + + <para> + A collation is either <firstterm>deterministic</firstterm> or + <firstterm>nondeterministic</firstterm>. A deterministic collation uses + deterministic comparisons, which means that it considers strings to be + equal only if they consist of the same byte sequence. Nondeterministic + comparison may determine strings to be equal even if they consist of + different bytes. Typical situations include case-insensitive comparison, + accent-insensitive comparison, as well as comparison of strings in + different Unicode normal forms. It is up to the collation provider to + actually implement such insensitive comparisons; the deterministic flag + only determines whether ties are to be broken using bytewise comparison. + See also <ulink url="https://www.unicode.org/reports/tr10">Unicode Technical + Standard 10</ulink> for more information on the terminology. 
+ </para> + + <para> + To create a nondeterministic collation, specify the property + <literal>deterministic = false</literal> to <command>CREATE + COLLATION</command>, for example: +<programlisting> +CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false); +</programlisting> + This example would use the standard Unicode collation in a + nondeterministic way. In particular, this would allow strings in + different normal forms to be compared correctly. More interesting + examples make use of the ICU customization facilities explained in + <xref linkend="icu-custom-collations"/>. + For example: +<programlisting> +CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false); +CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false); +</programlisting> + </para> + + <para> + All standard and predefined collations are deterministic, and + user-defined collations are deterministic by default. While + nondeterministic collations give a more <quote>correct</quote> behavior, + especially when considering the full power of Unicode and its many + special cases, they also have some drawbacks. Foremost, their use leads + to a performance penalty. Note, in particular, that B-tree cannot use + deduplication with indexes that use a nondeterministic collation. Also, + certain operations are not possible with nondeterministic collations, + such as pattern matching operations. Therefore, they should be used + only in cases where they are specifically wanted. + </para> + + <tip> + <para> + To deal with text in different Unicode normalization forms, it is also + an option to use the functions/expressions + <function>normalize</function> and <literal>is normalized</literal> to + preprocess or check the strings, instead of using nondeterministic + collations. There are different trade-offs for each approach. 
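+ A minimal sketch of the normalization approach, assuming a
+ <literal>UTF8</literal> database (which these functions require) and the
+ default deterministic collation; the decomposed and precomposed forms of
+ <literal>'á'</literal> compare equal once the decomposed form is
+ normalized:
+<programlisting>
+-- U+0061 followed by U+0301 is the decomposed form, so it is not in NFC
+SELECT U&'\0061\0301' IS NFC NORMALIZED;            -- false
+
+-- after normalization to NFC it matches the single code point U+00E1
+SELECT normalize(U&'\0061\0301', NFC) = U&'\00E1';  -- true
+</programlisting>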
+ </para> + </tip> + </sect3> + </sect2> + + <sect2 id="icu-custom-collations"> + <title>ICU Custom Collations</title> + + <para> + ICU allows extensive control over collation behavior by defining new + collations with collation settings as a part of the language tag. These + settings can modify the collation order to suit a variety of needs. For + instance: + +<programlisting> +-- ignore differences in accents and case +CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1'); +SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true +SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true + +-- upper case letters sort before lower case. +CREATE COLLATION upper_first (provider = icu, locale = 'und-u-kf-upper'); +SELECT 'B' < 'b' COLLATE upper_first; -- true + +-- treat digits numerically and ignore punctuation +CREATE COLLATION num_ignore_punct (provider = icu, deterministic = false, locale = 'und-u-ka-shifted-kn'); +SELECT 'id-45' < 'id-123' COLLATE num_ignore_punct; -- true +SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true +</programlisting> + + Many of the available options are described in <xref + linkend="icu-collation-settings"/>, or see <xref + linkend="icu-external-references"/> for more details. + </para> + + <sect3 id="icu-collation-comparison-levels"> + <title>ICU Comparison Levels</title> + + <para> + Comparison of two strings (collation) in ICU is determined by a + multi-level process, where textual features are grouped into + "levels". Treatment of each level is controlled by the <link + linkend="icu-collation-settings-table">collation settings</link>. Higher + levels correspond to finer textual features. + </para> + + <para> + <xref linkend="icu-collation-levels"/> shows which textual feature + differences are considered significant when determining equality at the + given level. 
The Unicode character <literal>U+2063</literal> is an + invisible separator, and as seen in the table, is ignored at all + levels of comparison less than <literal>identic</literal>. + </para> + + <table id="icu-collation-levels"> + <title>ICU Collation Levels</title> + <tgroup cols="8"> + <colspec colname="col1" colwidth="1*"/> + <colspec colname="col2" colwidth="1.25*"/> + <colspec colname="col3" colwidth="1*"/> + <colspec colname="col4" colwidth="1*"/> + <colspec colname="col5" colwidth="1*"/> + <colspec colname="col6" colwidth="1*"/> + <colspec colname="col7" colwidth="1*"/> + <colspec colname="col8" colwidth="1*"/> + + <thead> + <row> + <entry>Level</entry> + <entry>Description</entry> + <entry><literal>'f' = 'f'</literal></entry> + <entry><literal>'ab' = U&'a\2063b'</literal></entry> + <entry><literal>'x-y' = 'x_y'</literal></entry> + <entry><literal>'g' = 'G'</literal></entry> + <entry><literal>'n' = 'ñ'</literal></entry> + <entry><literal>'y' = 'z'</literal></entry> + </row> + </thead> + + <tbody> + <row> + <entry>level1</entry> + <entry>Base Character</entry> + <entry><literal>true</literal></entry> + <entry><literal>true</literal></entry> + <entry><literal>true</literal></entry> + <entry><literal>true</literal></entry> + <entry><literal>true</literal></entry> + <entry><literal>false</literal></entry> + </row> + <row> + <entry>level2</entry> + <entry>Accents</entry> + <entry><literal>true</literal></entry> + <entry><literal>true</literal></entry> + <entry><literal>true</literal></entry> + <entry><literal>true</literal></entry> + <entry><literal>false</literal></entry> + <entry><literal>false</literal></entry> + </row> + <row> + <entry>level3</entry> + <entry>Case/Variants</entry> + <entry><literal>true</literal></entry> + <entry><literal>true</literal></entry> + <entry><literal>true</literal></entry> + <entry><literal>false</literal></entry> + <entry><literal>false</literal></entry> + <entry><literal>false</literal></entry> + </row> + <row> + 
<entry>level4</entry> + <entry>Punctuation</entry> + <entry><literal>true</literal></entry> + <entry><literal>true</literal></entry> + <entry><literal>false</literal></entry> + <entry><literal>false</literal></entry> + <entry><literal>false</literal></entry> + <entry><literal>false</literal></entry> + </row> + <row> + <entry>identic</entry> + <entry>All</entry> + <entry><literal>true</literal></entry> + <entry><literal>false</literal></entry> + <entry><literal>false</literal></entry> + <entry><literal>false</literal></entry> + <entry><literal>false</literal></entry> + <entry><literal>false</literal></entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + At every level, even with full normalization off, basic normalization is + performed. For example, <literal>'á'</literal> may be composed of the + code points <literal>U&'\0061\0301'</literal> or the single code + point <literal>U&'\00E1'</literal>, and those sequences will be + considered equal even at the <literal>identic</literal> level. To treat + any difference in code point representation as distinct, use a collation + created with <symbol>deterministic</symbol> set to + <literal>true</literal>. 
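+ A minimal sketch of this difference, assuming a <literal>UTF8</literal>
+ database with ICU support; the collation name
+ <literal>identic_nd</literal> is hypothetical and chosen only for
+ illustration, while <literal>und-x-icu</literal> is the predefined
+ deterministic root collation:
+<programlisting>
+-- hypothetical collation name, for illustration only
+CREATE COLLATION identic_nd (provider = icu, deterministic = false, locale = 'und-u-ks-identic');
+
+-- basic normalization applies even at the identic level
+SELECT U&'\0061\0301' = U&'\00E1' COLLATE identic_nd;   -- true
+
+-- a deterministic collation treats the differing code points as distinct
+SELECT U&'\0061\0301' = U&'\00E1' COLLATE "und-x-icu";  -- false
+</programlisting>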
+ </para> + + <sect4 id="icu-collation-level-examples"> + <title>Collation Level Examples</title> + +<programlisting> +CREATE COLLATION level3 (provider = icu, deterministic = false, locale = 'und-u-ka-shifted-ks-level3'); +CREATE COLLATION level4 (provider = icu, deterministic = false, locale = 'und-u-ka-shifted-ks-level4'); +CREATE COLLATION identic (provider = icu, deterministic = false, locale = 'und-u-ka-shifted-ks-identic'); + +-- invisible separator ignored at all levels except identic +SELECT 'ab' = U&'a\2063b' COLLATE level4; -- true +SELECT 'ab' = U&'a\2063b' COLLATE identic; -- false + +-- punctuation ignored at level3 but not at level 4 +SELECT 'x-y' = 'x_y' COLLATE level3; -- true +SELECT 'x-y' = 'x_y' COLLATE level4; -- false +</programlisting> + + </sect4> + </sect3> + + <sect3 id="icu-collation-settings"> + <title>Collation Settings for an ICU Locale</title> + + <para> + <xref linkend="icu-collation-settings-table"/> shows the available + collation settings, which can be used as part of a language tag to + customize a collation. + </para> + + <table id="icu-collation-settings-table"> + <title>ICU Collation Settings</title> + <tgroup cols="4"> + <colspec colname="col1" colwidth="1*"/> + <colspec colname="col2" colwidth="2*"/> + <colspec colname="col3" colwidth="2*"/> + <colspec colname="col4" colwidth="5*"/> + + <thead> + <row> + <entry>Key</entry> + <entry>Values</entry> + <entry>Default</entry> + <entry>Description</entry> + </row> + </thead> + + <tbody> + <row> + <entry><literal>co</literal></entry> + <entry><literal>emoji</literal>, <literal>phonebk</literal>, <literal>standard</literal>, <replaceable>...</replaceable></entry> + <entry><literal>standard</literal></entry> + <entry> + Collation type. See <xref linkend="icu-external-references"/> for additional options and details. 
+ </entry> + </row> + + <row> + <entry><literal>ka</literal></entry> + <entry><literal>noignore</literal>, <literal>shifted</literal></entry> + <entry><literal>noignore</literal></entry> + <entry> + If set to <literal>shifted</literal>, causes some characters + (e.g. punctuation or space) to be ignored in comparison. Key + <literal>ks</literal> must be set to <literal>level3</literal> or + lower to take effect. Set key <literal>kv</literal> to control which + character classes are ignored. + </entry> + </row> + + <row> + <entry><literal>kb</literal></entry> + <entry><literal>true</literal>, <literal>false</literal></entry> + <entry><literal>false</literal></entry> + <entry> + Backwards comparison for the level 2 differences. For example, + locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal> + before <literal>'aé'</literal>. + </entry> + </row> + + <row> + <entry><literal>kc</literal></entry> + <entry><literal>true</literal>, <literal>false</literal></entry> + <entry><literal>false</literal></entry> + <entry> + <para> + Separates case into a "level 2.5" that falls between accents and + other level 3 features. + </para> + <para> + If set to <literal>true</literal> and <literal>ks</literal> is set + to <literal>level1</literal>, will ignore accents but take case + into account. + </para> + </entry> + </row> + + <row> + <entry><literal>kf</literal></entry> + <entry> + <literal>upper</literal>, <literal>lower</literal>, + <literal>false</literal> + </entry> + <entry><literal>false</literal></entry> + <entry> + If set to <literal>upper</literal>, upper case sorts before lower + case. If set to <literal>lower</literal>, lower case sorts before + upper case. If set to <literal>false</literal>, the sort depends on + the rules of the locale. 
+ </entry> + </row> + + <row> + <entry><literal>kn</literal></entry> + <entry><literal>true</literal>, <literal>false</literal></entry> + <entry><literal>false</literal></entry> + <entry> + If set to <literal>true</literal>, numbers within a string are + treated as a single numeric value rather than a sequence of + digits. For example, <literal>'id-45'</literal> sorts before + <literal>'id-123'</literal>. + </entry> + </row> + + <row> + <entry><literal>kk</literal></entry> + <entry><literal>true</literal>, <literal>false</literal></entry> + <entry><literal>false</literal></entry> + <entry> + <para> + Enable full normalization; may affect performance. Basic + normalization is performed even when set to + <literal>false</literal>. Locales for languages that require full + normalization typically enable it by default. + </para> + <para> + Full normalization is important in some cases, such as when + multiple accents are applied to a single character. For example, + the code point sequences <literal>U&'\0065\0323\0302'</literal> + and <literal>U&'\0065\0302\0323'</literal> represent + an <literal>e</literal> with circumflex and dot-below accents + applied in different orders. With full normalization + on, these code point sequences are treated as equal; otherwise they + are unequal. + </para> + </entry> + </row> + + <row> + <entry><literal>kr</literal></entry> + <entry> + <literal>space</literal>, <literal>punct</literal>, + <literal>symbol</literal>, <literal>currency</literal>, + <literal>digit</literal>, <replaceable>script-id</replaceable> + </entry> + <entry></entry> + <entry> + <para> + Set to one or more of the valid values, or any BCP 47 + <replaceable>script-id</replaceable>, e.g. <literal>latn</literal> + ("Latin") or <literal>grek</literal> ("Greek"). Multiple values are + separated by "<literal>-</literal>". 
+ </para> + <para> + Redefines the ordering of classes of characters; those characters + belonging to a class earlier in the list sort before characters + belonging to a class later in the list. For instance, the value + <literal>digit-currency-space</literal> (as part of a language tag + like <literal>und-u-kr-digit-currency-space</literal>) sorts + digits before currency symbols, and currency symbols before spaces. + </para> + </entry> + </row> + + <row> + <entry><literal>ks</literal></entry> + <entry><literal>level1</literal>, <literal>level2</literal>, <literal>level3</literal>, <literal>level4</literal>, <literal>identic</literal></entry> + <entry><literal>level3</literal></entry> + <entry> + Sensitivity (or "strength") when determining equality, with + <literal>level1</literal> the least sensitive to differences and + <literal>identic</literal> the most sensitive to differences. See + <xref linkend="icu-collation-levels"/> for details. + </entry> + </row> + + <row> + <entry><literal>kv</literal></entry> + <entry> + <literal>space</literal>, <literal>punct</literal>, + <literal>symbol</literal>, <literal>currency</literal> + </entry> + <entry><literal>punct</literal></entry> + <entry> + Classes of characters ignored during comparison at level 3. Setting + to a later value includes earlier values; + e.g. <literal>symbol</literal> also includes + <literal>punct</literal> and <literal>space</literal> in the + characters to be ignored. Key <literal>ka</literal> must be set to + <literal>shifted</literal> and key <literal>ks</literal> must be set + to <literal>level3</literal> or lower to take effect. + </entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + Defaults may depend on locale. The above table is not meant to be + complete. See <xref linkend="icu-external-references"/> for additional + options and details. 
+ </para> + + <note> + <para> + For many collation settings, you must create the collation with + <option>deterministic</option> set to <literal>false</literal> for the + setting to have the desired effect (see <xref + linkend="collation-nondeterministic"/>). Additionally, some settings + only take effect when the key <literal>ka</literal> is set to + <literal>shifted</literal> (see <xref + linkend="icu-collation-settings-table"/>). + </para> + </note> + </sect3> + + <sect3 id="icu-locale-examples"> + <title>Collation Settings Examples</title> + + <variablelist> + <varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu"> + <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term> + <listitem> + <para>German collation with phone book collation type</para> + </listitem> + </varlistentry> + + <varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu"> + <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term> + <listitem> + <para> + Root collation with Emoji collation type, per Unicode Technical Standard #51 + </para> + </listitem> + </varlistentry> + + <varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn"> + <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term> + <listitem> + <para> + Sort Greek letters before Latin ones. (The default is Latin before Greek.) + </para> + </listitem> + </varlistentry> + + <varlistentry id="collation-managing-create-icu-en-u-kf-upper"> + <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term> + <listitem> + <para> + Sort upper-case letters before lower-case letters. (The default is + lower-case letters first.) 
+ </para> + </listitem> + </varlistentry> + + <varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn"> + <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term> + <listitem> + <para> + Combines both of the above options. + </para> + </listitem> + </varlistentry> + </variablelist> + </sect3> + + <sect3 id="icu-tailoring-rules"> + <title>ICU Tailoring Rules</title> + + <para> + If the options provided by the collation settings shown above are not + sufficient, the order of collation elements can be changed with tailoring + rules, whose syntax is detailed at <ulink + url="https://unicode-org.github.io/icu/userguide/collation/customization/"></ulink>. + </para> + + <para> + This small example creates a collation based on the root locale with a + tailoring rule: +<programlisting> +<![CDATA[CREATE COLLATION custom (provider = icu, locale = 'und', rules = '&V << w <<< W');]]> +</programlisting> + With this rule, the letter <quote>W</quote> is sorted after + <quote>V</quote>, but is treated as a secondary difference similar to an + accent. Rules like this are contained in the locale definitions of some + languages. (Of course, if a locale definition already contains the + desired rules, then they don't need to be specified again explicitly.) + </para> + + <para> + Here is a more complex example. The following statement sets up a + collation named <literal>ebcdic</literal> with rules to sort US-ASCII + characters in the order of the EBCDIC encoding. + +<programlisting> +<![CDATA[CREATE COLLATION ebcdic (provider = icu, locale = 'und', +rules = $$ +& ' ' < '.' < '<' < '(' < '+' < \| +< '&' < '!' < '$' < '*' < ')' < ';' +< '-' < '/' < ',' < '%' < '_' < '>' < '?' 
+< '`' < ':' < '#' < '@' < \' < '=' < '"' +<*a-r < '~' <*s-z < '^' < '[' < ']' +< '{' <*A-I < '}' <*J-R < '\' <*S-Z <*0-9 +$$);]]> + +SELECT c +FROM (VALUES ('a'), ('b'), ('A'), ('B'), ('1'), ('2'), ('!'), ('^')) AS x(c) +ORDER BY c COLLATE ebcdic; + c +--- + ! + a + b + ^ + A + B + 1 + 2 +</programlisting> + </para> + </sect3> + + <sect3 id="icu-external-references"> + <title>External References for ICU</title> + + <para> + This section (<xref linkend="icu-custom-collations"/>) is only a brief + overview of ICU behavior and language tags. Refer to the following + documents for technical details, additional options, and new behavior: + </para> + + <itemizedlist> + <listitem> + <para> + <ulink url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode Technical Standard #35</ulink> + </para> + </listitem> + <listitem> + <para> + <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink> + </para> + </listitem> + <listitem> + <para> + <ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR repository</ulink> + </para> + </listitem> + <listitem> + <para> + <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink> + </para> + </listitem> + <listitem> + <para> + <ulink url="https://unicode-org.github.io/icu/userguide/collation/"></ulink> + </para> + </listitem> + </itemizedlist> + </sect3> + </sect2> + </sect1> + + <sect1 id="multibyte"> + <title>Character Set Support</title> + + <indexterm zone="multibyte"><primary>character set</primary></indexterm> + + <para> + The character set support in <productname>PostgreSQL</productname> + allows you to store text in a variety of character sets (also called + encodings), including + single-byte character sets such as the ISO 8859 series and + multiple-byte character sets such as <acronym>EUC</acronym> (Extended Unix + Code), UTF-8, and Mule internal code. 
All supported character sets + can be used transparently by clients, but a few are not supported + for use within the server (that is, as a server-side encoding). + The default character set is selected while + initializing your <productname>PostgreSQL</productname> database + cluster using <command>initdb</command>. It can be overridden when you + create a database, so you can have multiple + databases each with a different character set. + </para> + + <para> + An important restriction, however, is that each database's character set + must be compatible with the database's <envar>LC_CTYPE</envar> (character + classification) and <envar>LC_COLLATE</envar> (string sort order) locale + settings. For <literal>C</literal> or + <literal>POSIX</literal> locale, any character set is allowed, but for other + libc-provided locales there is only one character set that will work + correctly. + (On Windows, however, UTF-8 encoding can be used with any locale.) + If you have ICU support configured, ICU-provided locales can be used + with most but not all server-side encodings. + </para> + + <sect2 id="multibyte-charset-supported"> + <title>Supported Character Sets</title> + + <para> + <xref linkend="charset-table"/> shows the character sets available + for use in <productname>PostgreSQL</productname>. 
+ </para> + + <table id="charset-table"> + <title><productname>PostgreSQL</productname> Character Sets</title> + <tgroup cols="7"> + <colspec colname="col1" colwidth="3*"/> + <colspec colname="col2" colwidth="2*"/> + <colspec colname="col3" colwidth="2*"/> + <colspec colname="col4" colwidth="1.25*"/> + <colspec colname="col5" colwidth="1*"/> + <colspec colname="col6" colwidth="1*"/> + <colspec colname="col7" colwidth="2*"/> + <thead> + <row> + <entry>Name</entry> + <entry>Description</entry> + <entry>Language</entry> + <entry>Server?</entry> + <entry>ICU?</entry> + <!-- + The Bytes/Char field is populated by looking at the values returned + by pg_wchar_table.mblen function for each encoding. + --> + <entry>Bytes/&zwsp;Char</entry> + <entry>Aliases</entry> + </row> + </thead> + <tbody> + <row> + <entry><literal>BIG5</literal></entry> + <entry>Big Five</entry> + <entry>Traditional Chinese</entry> + <entry>No</entry> + <entry>No</entry> + <entry>1–2</entry> + <entry><literal>WIN950</literal>, <literal>Windows950</literal></entry> + </row> + <row> + <entry><literal>EUC_CN</literal></entry> + <entry>Extended UNIX Code-CN</entry> + <entry>Simplified Chinese</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1–3</entry> + <entry></entry> + </row> + <row> + <entry><literal>EUC_JP</literal></entry> + <entry>Extended UNIX Code-JP</entry> + <entry>Japanese</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1–3</entry> + <entry></entry> + </row> + <row> + <entry><literal>EUC_JIS_2004</literal></entry> + <entry>Extended UNIX Code-JP, JIS X 0213</entry> + <entry>Japanese</entry> + <entry>Yes</entry> + <entry>No</entry> + <entry>1–3</entry> + <entry></entry> + </row> + <row> + <entry><literal>EUC_KR</literal></entry> + <entry>Extended UNIX Code-KR</entry> + <entry>Korean</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1–3</entry> + <entry></entry> + </row> + <row> + <entry><literal>EUC_TW</literal></entry> + <entry>Extended UNIX Code-TW</entry> + 
<entry>Traditional Chinese, Taiwanese</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1–3</entry> + <entry></entry> + </row> + <row> + <entry><literal>GB18030</literal></entry> + <entry>National Standard</entry> + <entry>Chinese</entry> + <entry>No</entry> + <entry>No</entry> + <entry>1–4</entry> + <entry></entry> + </row> + <row> + <entry><literal>GBK</literal></entry> + <entry>Extended National Standard</entry> + <entry>Simplified Chinese</entry> + <entry>No</entry> + <entry>No</entry> + <entry>1–2</entry> + <entry><literal>WIN936</literal>, <literal>Windows936</literal></entry> + </row> + <row> + <entry><literal>ISO_8859_5</literal></entry> + <entry>ISO 8859-5, <acronym>ECMA</acronym> 113</entry> + <entry>Latin/Cyrillic</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>ISO_8859_6</literal></entry> + <entry>ISO 8859-6, <acronym>ECMA</acronym> 114</entry> + <entry>Latin/Arabic</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>ISO_8859_7</literal></entry> + <entry>ISO 8859-7, <acronym>ECMA</acronym> 118</entry> + <entry>Latin/Greek</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>ISO_8859_8</literal></entry> + <entry>ISO 8859-8, <acronym>ECMA</acronym> 121</entry> + <entry>Latin/Hebrew</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>JOHAB</literal></entry> + <entry><acronym>JOHAB</acronym></entry> + <entry>Korean (Hangul)</entry> + <entry>No</entry> + <entry>No</entry> + <entry>1–3</entry> + <entry></entry> + </row> + <row> + <entry><literal>KOI8R</literal></entry> + <entry><acronym>KOI</acronym>8-R</entry> + <entry>Cyrillic (Russian)</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>KOI8</literal></entry> + </row> + <row> + 
<entry><literal>KOI8U</literal></entry> + <entry><acronym>KOI</acronym>8-U</entry> + <entry>Cyrillic (Ukrainian)</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>LATIN1</literal></entry> + <entry>ISO 8859-1, <acronym>ECMA</acronym> 94</entry> + <entry>Western European</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ISO88591</literal></entry> + </row> + <row> + <entry><literal>LATIN2</literal></entry> + <entry>ISO 8859-2, <acronym>ECMA</acronym> 94</entry> + <entry>Central European</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ISO88592</literal></entry> + </row> + <row> + <entry><literal>LATIN3</literal></entry> + <entry>ISO 8859-3, <acronym>ECMA</acronym> 94</entry> + <entry>South European</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ISO88593</literal></entry> + </row> + <row> + <entry><literal>LATIN4</literal></entry> + <entry>ISO 8859-4, <acronym>ECMA</acronym> 94</entry> + <entry>North European</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ISO88594</literal></entry> + </row> + <row> + <entry><literal>LATIN5</literal></entry> + <entry>ISO 8859-9, <acronym>ECMA</acronym> 128</entry> + <entry>Turkish</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ISO88599</literal></entry> + </row> + <row> + <entry><literal>LATIN6</literal></entry> + <entry>ISO 8859-10, <acronym>ECMA</acronym> 144</entry> + <entry>Nordic</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ISO885910</literal></entry> + </row> + <row> + <entry><literal>LATIN7</literal></entry> + <entry>ISO 8859-13</entry> + <entry>Baltic</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ISO885913</literal></entry> + </row> + <row> + <entry><literal>LATIN8</literal></entry> + 
<entry>ISO 8859-14</entry> + <entry>Celtic</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ISO885914</literal></entry> + </row> + <row> + <entry><literal>LATIN9</literal></entry> + <entry>ISO 8859-15</entry> + <entry>LATIN1 with Euro and accents</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ISO885915</literal></entry> + </row> + <row> + <entry><literal>LATIN10</literal></entry> + <entry>ISO 8859-16, <acronym>ASRO</acronym> SR 14111</entry> + <entry>Romanian</entry> + <entry>Yes</entry> + <entry>No</entry> + <entry>1</entry> + <entry><literal>ISO885916</literal></entry> + </row> + <row> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry>Mule internal code</entry> + <entry>Multilingual Emacs</entry> + <entry>Yes</entry> + <entry>No</entry> + <entry>1–4</entry> + <entry></entry> + </row> + <row> + <entry><literal>SJIS</literal></entry> + <entry>Shift JIS</entry> + <entry>Japanese</entry> + <entry>No</entry> + <entry>No</entry> + <entry>1–2</entry> + <entry><literal>Mskanji</literal>, <literal>ShiftJIS</literal>, <literal>WIN932</literal>, <literal>Windows932</literal></entry> + </row> + <row> + <entry><literal>SHIFT_JIS_2004</literal></entry> + <entry>Shift JIS, JIS X 0213</entry> + <entry>Japanese</entry> + <entry>No</entry> + <entry>No</entry> + <entry>1–2</entry> + <entry></entry> + </row> + <row> + <entry><literal>SQL_ASCII</literal></entry> + <entry>unspecified (see text)</entry> + <entry><emphasis>any</emphasis></entry> + <entry>Yes</entry> + <entry>No</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>UHC</literal></entry> + <entry>Unified Hangul Code</entry> + <entry>Korean</entry> + <entry>No</entry> + <entry>No</entry> + <entry>1–2</entry> + <entry><literal>WIN949</literal>, <literal>Windows949</literal></entry> + </row> + <row> + <entry><literal>UTF8</literal></entry> + <entry>Unicode, 8-bit</entry> + <entry><emphasis>all</emphasis></entry> + 
<entry>Yes</entry> + <entry>Yes</entry> + <entry>1–4</entry> + <entry><literal>Unicode</literal></entry> + </row> + <row> + <entry><literal>WIN866</literal></entry> + <entry>Windows CP866</entry> + <entry>Cyrillic</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ALT</literal></entry> + </row> + <row> + <entry><literal>WIN874</literal></entry> + <entry>Windows CP874</entry> + <entry>Thai</entry> + <entry>Yes</entry> + <entry>No</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>WIN1250</literal></entry> + <entry>Windows CP1250</entry> + <entry>Central European</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>WIN1251</literal></entry> + <entry>Windows CP1251</entry> + <entry>Cyrillic</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>WIN</literal></entry> + </row> + <row> + <entry><literal>WIN1252</literal></entry> + <entry>Windows CP1252</entry> + <entry>Western European</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>WIN1253</literal></entry> + <entry>Windows CP1253</entry> + <entry>Greek</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>WIN1254</literal></entry> + <entry>Windows CP1254</entry> + <entry>Turkish</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>WIN1255</literal></entry> + <entry>Windows CP1255</entry> + <entry>Hebrew</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>WIN1256</literal></entry> + <entry>Windows CP1256</entry> + <entry>Arabic</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>WIN1257</literal></entry> + 
<entry>Windows CP1257</entry> + <entry>Baltic</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry></entry> + </row> + <row> + <entry><literal>WIN1258</literal></entry> + <entry>Windows CP1258</entry> + <entry>Vietnamese</entry> + <entry>Yes</entry> + <entry>Yes</entry> + <entry>1</entry> + <entry><literal>ABC</literal>, <literal>TCVN</literal>, <literal>TCVN5712</literal>, <literal>VSCII</literal></entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + Not all client <acronym>API</acronym>s support all the listed character sets. For example, the + <productname>PostgreSQL</productname> + JDBC driver does not support <literal>MULE_INTERNAL</literal>, <literal>LATIN6</literal>, + <literal>LATIN8</literal>, and <literal>LATIN10</literal>. + </para> + + <para> + The <literal>SQL_ASCII</literal> setting behaves considerably differently + from the other settings. When the server character set is + <literal>SQL_ASCII</literal>, the server interprets byte values 0–127 + according to the ASCII standard, while byte values 128–255 are taken + as uninterpreted characters. No encoding conversion will be done when + the setting is <literal>SQL_ASCII</literal>. Thus, this setting is not so + much a declaration that a specific encoding is in use, as a declaration + of ignorance about the encoding. In most cases, if you are + working with any non-ASCII data, it is unwise to use the + <literal>SQL_ASCII</literal> setting because + <productname>PostgreSQL</productname> will be unable to help you by + converting or validating non-ASCII characters. + </para> + </sect2> + + <sect2 id="multibyte-setting"> + <title>Setting the Character Set</title> + + <para> + <command>initdb</command> defines the default character set (encoding) + for a <productname>PostgreSQL</productname> cluster. For example, + +<screen> +initdb -E EUC_JP +</screen> + + sets the default character set to + <literal>EUC_JP</literal> (Extended Unix Code for Japanese). 
You + can use <option>--encoding</option> instead of + <option>-E</option> if you prefer longer option strings. + If no <option>-E</option> or <option>--encoding</option> option is + given, <command>initdb</command> attempts to determine the appropriate + encoding to use based on the specified or default locale. + </para> + + <para> + You can specify a non-default encoding at database creation time, + provided that the encoding is compatible with the selected locale: + +<screen> +createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean +</screen> + + This will create a database named <literal>korean</literal> that + uses the character set <literal>EUC_KR</literal>, and locale <literal>ko_KR</literal>. + Another way to accomplish this is to use this SQL command: + +<programlisting> +CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0; +</programlisting> + + Notice that the above commands specify copying the <literal>template0</literal> + database. When copying any other database, the encoding and locale + settings cannot be changed from those of the source database, because + that might result in corrupt data. For more information see + <xref linkend="manage-ag-templatedbs"/>. + </para> + + <para> + The encoding for a database is stored in the system catalog + <literal>pg_database</literal>. You can see it by using the + <command>psql</command> <option>-l</option> option or the + <command>\l</command> command. 
+ +<screen> +$ <userinput>psql -l</userinput> + List of databases + Name | Owner | Encoding | Collation | Ctype | Access Privileges +-----------+----------+-----------+-------------+-------------+------------------------------------- + clocaledb | hlinnaka | SQL_ASCII | C | C | + englishdb | hlinnaka | UTF8 | en_GB.UTF8 | en_GB.UTF8 | + japanese | hlinnaka | UTF8 | ja_JP.UTF8 | ja_JP.UTF8 | + korean | hlinnaka | EUC_KR | ko_KR.euckr | ko_KR.euckr | + postgres | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | + template0 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka} + template1 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka} +(7 rows) +</screen> + </para> + + <important> + <para> + On most modern operating systems, <productname>PostgreSQL</productname> + can determine which character set is implied by the <envar>LC_CTYPE</envar> + setting, and it will enforce that only the matching database encoding is + used. On older systems it is your responsibility to ensure that you use + the encoding expected by the locale you have selected. A mistake in + this area is likely to lead to strange behavior of locale-dependent + operations such as sorting. + </para> + + <para> + <productname>PostgreSQL</productname> will allow superusers to create + databases with <literal>SQL_ASCII</literal> encoding even when + <envar>LC_CTYPE</envar> is not <literal>C</literal> or <literal>POSIX</literal>. As noted + above, <literal>SQL_ASCII</literal> does not enforce that the data stored in + the database has any particular encoding, and so this choice poses risks + of locale-dependent misbehavior. Using this combination of settings is + deprecated and may someday be forbidden altogether. 
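+    As a rough check, and only as a sketch that is not part of the upstream
+    text, a query along the following lines lists databases that combine
+    <literal>SQL_ASCII</literal> with a locale other than <literal>C</literal>
+    or <literal>POSIX</literal>. It assumes that the
+    <structfield>datctype</structfield> column of
+    <structname>pg_database</structname> holds the
+    <envar>LC_CTYPE</envar> setting of interest:
+
+<programlisting>
+-- Databases whose encoding is SQL_ASCII but whose LC_CTYPE is not C or POSIX.
+SELECT datname,
+       pg_encoding_to_char(encoding) AS encoding,
+       datcollate,
+       datctype
+FROM pg_database
+WHERE pg_encoding_to_char(encoding) = 'SQL_ASCII'
+  AND datctype NOT IN ('C', 'POSIX');
+</programlisting>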
+ </para> + </important> + </sect2> + + <sect2 id="multibyte-automatic-conversion"> + <title>Automatic Character Set Conversion Between Server and Client</title> + + <para> + <productname>PostgreSQL</productname> supports automatic character + set conversion between server and client for many combinations of + character sets (<xref linkend="multibyte-conversions-supported"/> + shows which ones). + </para> + + <para> + To enable automatic character set conversion, you have to + tell <productname>PostgreSQL</productname> the character set + (encoding) you would like to use in the client. There are several + ways to accomplish this: + + <itemizedlist> + <listitem> + <para> + Using the <command>\encoding</command> command in + <application>psql</application>. + <command>\encoding</command> allows you to change client + encoding on the fly. For + example, to change the encoding to <literal>SJIS</literal>, type: + +<programlisting> +\encoding SJIS +</programlisting> + </para> + </listitem> + + <listitem> + <para> + Using the <application>libpq</application> functions that control the client encoding (see <xref linkend="libpq-control"/>). + </para> + </listitem> + + <listitem> + <para> + Using <command>SET client_encoding TO</command>. + + Setting the client encoding can be done with this SQL command: + +<programlisting> +SET CLIENT_ENCODING TO '<replaceable>value</replaceable>'; +</programlisting> + + You can also use the standard SQL syntax <literal>SET NAMES</literal> + for this purpose: + +<programlisting> +SET NAMES '<replaceable>value</replaceable>'; +</programlisting> + + To query the current client encoding: + +<programlisting> +SHOW client_encoding; +</programlisting> + + To return to the default encoding: + +<programlisting> +RESET client_encoding; +</programlisting> + </para> + </listitem> + + <listitem> + <para> + Using <envar>PGCLIENTENCODING</envar>.
If the environment variable + <envar>PGCLIENTENCODING</envar> is defined in the client's + environment, that client encoding is automatically selected + when a connection to the server is made. (This can + subsequently be overridden using any of the other methods + mentioned above.) + </para> + </listitem> + + <listitem> + <para> + Using the configuration variable <xref + linkend="guc-client-encoding"/>. If the + <varname>client_encoding</varname> variable is set, that client + encoding is automatically selected when a connection to the + server is made. (This can subsequently be overridden using any + of the other methods mentioned above.) + </para> + </listitem> + + </itemizedlist> + </para> + + <para> + If the conversion of a particular character is not possible + — suppose you chose <literal>EUC_JP</literal> for the + server and <literal>LATIN1</literal> for the client, and some + Japanese characters are returned that do not have a representation in + <literal>LATIN1</literal> — an error is reported. + </para> + + <para> + If the client character set is defined as <literal>SQL_ASCII</literal>, + encoding conversion is disabled, regardless of the server's character + set. (However, if the server's character set is + not <literal>SQL_ASCII</literal>, the server will still check that + incoming data is valid for that encoding; so the net effect is as + though the client character set were the same as the server's.) + Just as for the server, use of <literal>SQL_ASCII</literal> is unwise + unless you are working with all-ASCII data. + </para> + </sect2> + + <sect2 id="multibyte-conversions-supported"> + <title>Available Character Set Conversions</title> + + <para> + <productname>PostgreSQL</productname> allows conversion between any + two character sets for which a conversion function is listed in the + <link linkend="catalog-pg-conversion"><structname>pg_conversion</structname></link> + system catalog. 
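+    To see which conversions a given installation actually provides, the
+    catalog can be queried directly. The following is a minimal sketch rather
+    than upstream text; the column aliases are arbitrary:
+
+<programlisting>
+-- List every conversion with readable encoding names.
+-- condefault marks the conversions eligible for automatic client/server use.
+SELECT conname,
+       pg_encoding_to_char(conforencoding) AS source_encoding,
+       pg_encoding_to_char(contoencoding)  AS destination_encoding,
+       condefault
+FROM pg_conversion
+ORDER BY conname;
+</programlisting>
+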
<productname>PostgreSQL</productname> comes with + some predefined conversions, as summarized in + <xref linkend="multibyte-translation-table"/> and shown in more + detail in <xref linkend="builtin-conversions-table"/>. You can + create a new conversion using the SQL command + <xref linkend="sql-createconversion"/>. (To be used for automatic + client/server conversions, a conversion must be marked + as <quote>default</quote> for its character set pair.) + </para> + + <table id="multibyte-translation-table"> + <title>Built-in Client/Server Character Set Conversions</title> + <tgroup cols="2"> + <colspec colname="col1" colwidth="1*"/> + <colspec colname="col2" colwidth="3*"/> + <thead> + <row> + <entry>Server Character Set</entry> + <entry>Available Client Character Sets</entry> + </row> + </thead> + <tbody> + <row> + <entry><literal>BIG5</literal></entry> + <entry><emphasis>not supported as a server encoding</emphasis> + </entry> + </row> + <row> + <entry><literal>EUC_CN</literal></entry> + <entry><emphasis>EUC_CN</emphasis>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>EUC_JP</literal></entry> + <entry><emphasis>EUC_JP</emphasis>, + <literal>MULE_INTERNAL</literal>, + <literal>SJIS</literal>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>EUC_JIS_2004</literal></entry> + <entry><emphasis>EUC_JIS_2004</emphasis>, + <literal>SHIFT_JIS_2004</literal>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>EUC_KR</literal></entry> + <entry><emphasis>EUC_KR</emphasis>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>EUC_TW</literal></entry> + <entry><emphasis>EUC_TW</emphasis>, + <literal>BIG5</literal>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>GB18030</literal></entry> + <entry><emphasis>not supported as a server encoding</emphasis> + 
</entry> + </row> + <row> + <entry><literal>GBK</literal></entry> + <entry><emphasis>not supported as a server encoding</emphasis> + </entry> + </row> + <row> + <entry><literal>ISO_8859_5</literal></entry> + <entry><emphasis>ISO_8859_5</emphasis>, + <literal>KOI8R</literal>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal>, + <literal>WIN866</literal>, + <literal>WIN1251</literal> + </entry> + </row> + <row> + <entry><literal>ISO_8859_6</literal></entry> + <entry><emphasis>ISO_8859_6</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>ISO_8859_7</literal></entry> + <entry><emphasis>ISO_8859_7</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>ISO_8859_8</literal></entry> + <entry><emphasis>ISO_8859_8</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>JOHAB</literal></entry> + <entry><emphasis>not supported as a server encoding</emphasis> + </entry> + </row> + <row> + <entry><literal>KOI8R</literal></entry> + <entry><emphasis>KOI8R</emphasis>, + <literal>ISO_8859_5</literal>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal>, + <literal>WIN866</literal>, + <literal>WIN1251</literal> + </entry> + </row> + <row> + <entry><literal>KOI8U</literal></entry> + <entry><emphasis>KOI8U</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>LATIN1</literal></entry> + <entry><emphasis>LATIN1</emphasis>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>LATIN2</literal></entry> + <entry><emphasis>LATIN2</emphasis>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal>, + <literal>WIN1250</literal> + </entry> + </row> + <row> + <entry><literal>LATIN3</literal></entry> + <entry><emphasis>LATIN3</emphasis>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>LATIN4</literal></entry> + 
<entry><emphasis>LATIN4</emphasis>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>LATIN5</literal></entry> + <entry><emphasis>LATIN5</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>LATIN6</literal></entry> + <entry><emphasis>LATIN6</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>LATIN7</literal></entry> + <entry><emphasis>LATIN7</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>LATIN8</literal></entry> + <entry><emphasis>LATIN8</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>LATIN9</literal></entry> + <entry><emphasis>LATIN9</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>LATIN10</literal></entry> + <entry><emphasis>LATIN10</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><emphasis>MULE_INTERNAL</emphasis>, + <literal>BIG5</literal>, + <literal>EUC_CN</literal>, + <literal>EUC_JP</literal>, + <literal>EUC_KR</literal>, + <literal>EUC_TW</literal>, + <literal>ISO_8859_5</literal>, + <literal>KOI8R</literal>, + <literal>LATIN1</literal> to <literal>LATIN4</literal>, + <literal>SJIS</literal>, + <literal>WIN866</literal>, + <literal>WIN1250</literal>, + <literal>WIN1251</literal> + </entry> + </row> + <row> + <entry><literal>SJIS</literal></entry> + <entry><emphasis>not supported as a server encoding</emphasis> + </entry> + </row> + <row> + <entry><literal>SHIFT_JIS_2004</literal></entry> + <entry><emphasis>not supported as a server encoding</emphasis> + </entry> + </row> + <row> + <entry><literal>SQL_ASCII</literal></entry> + <entry><emphasis>any (no conversion will be performed)</emphasis> + </entry> + </row> + <row> + <entry><literal>UHC</literal></entry> + <entry><emphasis>not supported as a server encoding</emphasis> + </entry> + </row> + <row> + 
<entry><literal>UTF8</literal></entry> + <entry><emphasis>all supported encodings</emphasis> + </entry> + </row> + <row> + <entry><literal>WIN866</literal></entry> + <entry><emphasis>WIN866</emphasis>, + <literal>ISO_8859_5</literal>, + <literal>KOI8R</literal>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal>, + <literal>WIN1251</literal> + </entry> + </row> + <row> + <entry><literal>WIN874</literal></entry> + <entry><emphasis>WIN874</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>WIN1250</literal></entry> + <entry><emphasis>WIN1250</emphasis>, + <literal>LATIN2</literal>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>WIN1251</literal></entry> + <entry><emphasis>WIN1251</emphasis>, + <literal>ISO_8859_5</literal>, + <literal>KOI8R</literal>, + <literal>MULE_INTERNAL</literal>, + <literal>UTF8</literal>, + <literal>WIN866</literal> + </entry> + </row> + <row> + <entry><literal>WIN1252</literal></entry> + <entry><emphasis>WIN1252</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>WIN1253</literal></entry> + <entry><emphasis>WIN1253</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>WIN1254</literal></entry> + <entry><emphasis>WIN1254</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>WIN1255</literal></entry> + <entry><emphasis>WIN1255</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>WIN1256</literal></entry> + <entry><emphasis>WIN1256</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>WIN1257</literal></entry> + <entry><emphasis>WIN1257</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + <row> + <entry><literal>WIN1258</literal></entry> + <entry><emphasis>WIN1258</emphasis>, + <literal>UTF8</literal> + </entry> + </row> + </tbody> + </tgroup> + </table> + + <table 
id="builtin-conversions-table"> + <title>All Built-in Character Set Conversions</title> + <tgroup cols="3"> + <colspec colname="col1" colwidth="2*"/> + <colspec colname="col2" colwidth="1*"/> + <colspec colname="col3" colwidth="1*"/> + <thead> + <row> + <entry>Conversion Name + <footnote> + <para> + The conversion names follow a standard naming scheme: The + official name of the source encoding with all + non-alphanumeric characters replaced by underscores, followed + by <literal>_to_</literal>, followed by the similarly processed + destination encoding name. Therefore, these names sometimes + deviate from the customary encoding names shown in + <xref linkend="charset-table"/>. + </para> + </footnote> + </entry> + <entry>Source Encoding</entry> + <entry>Destination Encoding</entry> + </row> + </thead> + + <tbody> + <row> + <entry><literal>big5_to_euc_tw</literal></entry> + <entry><literal>BIG5</literal></entry> + <entry><literal>EUC_TW</literal></entry> + </row> + <row> + <entry><literal>big5_to_mic</literal></entry> + <entry><literal>BIG5</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>big5_to_utf8</literal></entry> + <entry><literal>BIG5</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>euc_cn_to_mic</literal></entry> + <entry><literal>EUC_CN</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>euc_cn_to_utf8</literal></entry> + <entry><literal>EUC_CN</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>euc_jp_to_mic</literal></entry> + <entry><literal>EUC_JP</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>euc_jp_to_sjis</literal></entry> + <entry><literal>EUC_JP</literal></entry> + <entry><literal>SJIS</literal></entry> + </row> + <row> + <entry><literal>euc_jp_to_utf8</literal></entry> + 
<entry><literal>EUC_JP</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>euc_kr_to_mic</literal></entry> + <entry><literal>EUC_KR</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>euc_kr_to_utf8</literal></entry> + <entry><literal>EUC_KR</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>euc_tw_to_big5</literal></entry> + <entry><literal>EUC_TW</literal></entry> + <entry><literal>BIG5</literal></entry> + </row> + <row> + <entry><literal>euc_tw_to_mic</literal></entry> + <entry><literal>EUC_TW</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>euc_tw_to_utf8</literal></entry> + <entry><literal>EUC_TW</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>gb18030_to_utf8</literal></entry> + <entry><literal>GB18030</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>gbk_to_utf8</literal></entry> + <entry><literal>GBK</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_10_to_utf8</literal></entry> + <entry><literal>LATIN6</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_13_to_utf8</literal></entry> + <entry><literal>LATIN7</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_14_to_utf8</literal></entry> + <entry><literal>LATIN8</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_15_to_utf8</literal></entry> + <entry><literal>LATIN9</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_16_to_utf8</literal></entry> + <entry><literal>LATIN10</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + 
<entry><literal>iso_8859_1_to_mic</literal></entry> + <entry><literal>LATIN1</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>iso_8859_1_to_utf8</literal></entry> + <entry><literal>LATIN1</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_2_to_mic</literal></entry> + <entry><literal>LATIN2</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>iso_8859_2_to_utf8</literal></entry> + <entry><literal>LATIN2</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_2_to_windows_1250</literal></entry> + <entry><literal>LATIN2</literal></entry> + <entry><literal>WIN1250</literal></entry> + </row> + <row> + <entry><literal>iso_8859_3_to_mic</literal></entry> + <entry><literal>LATIN3</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>iso_8859_3_to_utf8</literal></entry> + <entry><literal>LATIN3</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_4_to_mic</literal></entry> + <entry><literal>LATIN4</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>iso_8859_4_to_utf8</literal></entry> + <entry><literal>LATIN4</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_5_to_koi8_r</literal></entry> + <entry><literal>ISO_8859_5</literal></entry> + <entry><literal>KOI8R</literal></entry> + </row> + <row> + <entry><literal>iso_8859_5_to_mic</literal></entry> + <entry><literal>ISO_8859_5</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>iso_8859_5_to_utf8</literal></entry> + <entry><literal>ISO_8859_5</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_5_to_windows_1251</literal></entry> 
+ <entry><literal>ISO_8859_5</literal></entry> + <entry><literal>WIN1251</literal></entry> + </row> + <row> + <entry><literal>iso_8859_5_to_windows_866</literal></entry> + <entry><literal>ISO_8859_5</literal></entry> + <entry><literal>WIN866</literal></entry> + </row> + <row> + <entry><literal>iso_8859_6_to_utf8</literal></entry> + <entry><literal>ISO_8859_6</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_7_to_utf8</literal></entry> + <entry><literal>ISO_8859_7</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_8_to_utf8</literal></entry> + <entry><literal>ISO_8859_8</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>iso_8859_9_to_utf8</literal></entry> + <entry><literal>LATIN5</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>johab_to_utf8</literal></entry> + <entry><literal>JOHAB</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>koi8_r_to_iso_8859_5</literal></entry> + <entry><literal>KOI8R</literal></entry> + <entry><literal>ISO_8859_5</literal></entry> + </row> + <row> + <entry><literal>koi8_r_to_mic</literal></entry> + <entry><literal>KOI8R</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>koi8_r_to_utf8</literal></entry> + <entry><literal>KOI8R</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>koi8_r_to_windows_1251</literal></entry> + <entry><literal>KOI8R</literal></entry> + <entry><literal>WIN1251</literal></entry> + </row> + <row> + <entry><literal>koi8_r_to_windows_866</literal></entry> + <entry><literal>KOI8R</literal></entry> + <entry><literal>WIN866</literal></entry> + </row> + <row> + <entry><literal>koi8_u_to_utf8</literal></entry> + <entry><literal>KOI8U</literal></entry> + <entry><literal>UTF8</literal></entry> + 
</row> + <row> + <entry><literal>mic_to_big5</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>BIG5</literal></entry> + </row> + <row> + <entry><literal>mic_to_euc_cn</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>EUC_CN</literal></entry> + </row> + <row> + <entry><literal>mic_to_euc_jp</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>EUC_JP</literal></entry> + </row> + <row> + <entry><literal>mic_to_euc_kr</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>EUC_KR</literal></entry> + </row> + <row> + <entry><literal>mic_to_euc_tw</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>EUC_TW</literal></entry> + </row> + <row> + <entry><literal>mic_to_iso_8859_1</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>LATIN1</literal></entry> + </row> + <row> + <entry><literal>mic_to_iso_8859_2</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>LATIN2</literal></entry> + </row> + <row> + <entry><literal>mic_to_iso_8859_3</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>LATIN3</literal></entry> + </row> + <row> + <entry><literal>mic_to_iso_8859_4</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>LATIN4</literal></entry> + </row> + <row> + <entry><literal>mic_to_iso_8859_5</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>ISO_8859_5</literal></entry> + </row> + <row> + <entry><literal>mic_to_koi8_r</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>KOI8R</literal></entry> + </row> + <row> + <entry><literal>mic_to_sjis</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>SJIS</literal></entry> + </row> + <row> + 
<entry><literal>mic_to_windows_1250</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>WIN1250</literal></entry> + </row> + <row> + <entry><literal>mic_to_windows_1251</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>WIN1251</literal></entry> + </row> + <row> + <entry><literal>mic_to_windows_866</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + <entry><literal>WIN866</literal></entry> + </row> + <row> + <entry><literal>sjis_to_euc_jp</literal></entry> + <entry><literal>SJIS</literal></entry> + <entry><literal>EUC_JP</literal></entry> + </row> + <row> + <entry><literal>sjis_to_mic</literal></entry> + <entry><literal>SJIS</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>sjis_to_utf8</literal></entry> + <entry><literal>SJIS</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>windows_1258_to_utf8</literal></entry> + <entry><literal>WIN1258</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>uhc_to_utf8</literal></entry> + <entry><literal>UHC</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>utf8_to_big5</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>BIG5</literal></entry> + </row> + <row> + <entry><literal>utf8_to_euc_cn</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>EUC_CN</literal></entry> + </row> + <row> + <entry><literal>utf8_to_euc_jp</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>EUC_JP</literal></entry> + </row> + <row> + <entry><literal>utf8_to_euc_kr</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>EUC_KR</literal></entry> + </row> + <row> + <entry><literal>utf8_to_euc_tw</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>EUC_TW</literal></entry> + 
</row> + <row> + <entry><literal>utf8_to_gb18030</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>GB18030</literal></entry> + </row> + <row> + <entry><literal>utf8_to_gbk</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>GBK</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_1</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>LATIN1</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_10</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>LATIN6</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_13</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>LATIN7</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_14</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>LATIN8</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_15</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>LATIN9</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_16</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>LATIN10</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_2</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>LATIN2</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_3</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>LATIN3</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_4</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>LATIN4</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_5</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>ISO_8859_5</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_6</literal></entry> + <entry><literal>UTF8</literal></entry> + 
<entry><literal>ISO_8859_6</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_7</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>ISO_8859_7</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_8</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>ISO_8859_8</literal></entry> + </row> + <row> + <entry><literal>utf8_to_iso_8859_9</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>LATIN5</literal></entry> + </row> + <row> + <entry><literal>utf8_to_johab</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>JOHAB</literal></entry> + </row> + <row> + <entry><literal>utf8_to_koi8_r</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>KOI8R</literal></entry> + </row> + <row> + <entry><literal>utf8_to_koi8_u</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>KOI8U</literal></entry> + </row> + <row> + <entry><literal>utf8_to_sjis</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>SJIS</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_1258</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>WIN1258</literal></entry> + </row> + <row> + <entry><literal>utf8_to_uhc</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>UHC</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_1250</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>WIN1250</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_1251</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>WIN1251</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_1252</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>WIN1252</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_1253</literal></entry> + 
<entry><literal>UTF8</literal></entry> + <entry><literal>WIN1253</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_1254</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>WIN1254</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_1255</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>WIN1255</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_1256</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>WIN1256</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_1257</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>WIN1257</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_866</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>WIN866</literal></entry> + </row> + <row> + <entry><literal>utf8_to_windows_874</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>WIN874</literal></entry> + </row> + <row> + <entry><literal>windows_1250_to_iso_8859_2</literal></entry> + <entry><literal>WIN1250</literal></entry> + <entry><literal>LATIN2</literal></entry> + </row> + <row> + <entry><literal>windows_1250_to_mic</literal></entry> + <entry><literal>WIN1250</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>windows_1250_to_utf8</literal></entry> + <entry><literal>WIN1250</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>windows_1251_to_iso_8859_5</literal></entry> + <entry><literal>WIN1251</literal></entry> + <entry><literal>ISO_8859_5</literal></entry> + </row> + <row> + <entry><literal>windows_1251_to_koi8_r</literal></entry> + <entry><literal>WIN1251</literal></entry> + <entry><literal>KOI8R</literal></entry> + </row> + <row> + <entry><literal>windows_1251_to_mic</literal></entry> + <entry><literal>WIN1251</literal></entry> + 
<entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>windows_1251_to_utf8</literal></entry> + <entry><literal>WIN1251</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>windows_1251_to_windows_866</literal></entry> + <entry><literal>WIN1251</literal></entry> + <entry><literal>WIN866</literal></entry> + </row> + <row> + <entry><literal>windows_1252_to_utf8</literal></entry> + <entry><literal>WIN1252</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>windows_1256_to_utf8</literal></entry> + <entry><literal>WIN1256</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>windows_866_to_iso_8859_5</literal></entry> + <entry><literal>WIN866</literal></entry> + <entry><literal>ISO_8859_5</literal></entry> + </row> + <row> + <entry><literal>windows_866_to_koi8_r</literal></entry> + <entry><literal>WIN866</literal></entry> + <entry><literal>KOI8R</literal></entry> + </row> + <row> + <entry><literal>windows_866_to_mic</literal></entry> + <entry><literal>WIN866</literal></entry> + <entry><literal>MULE_INTERNAL</literal></entry> + </row> + <row> + <entry><literal>windows_866_to_utf8</literal></entry> + <entry><literal>WIN866</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>windows_866_to_windows_1251</literal></entry> + <entry><literal>WIN866</literal></entry> + <entry><literal>WIN1251</literal></entry> + </row> + <row> + <entry><literal>windows_874_to_utf8</literal></entry> + <entry><literal>WIN874</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>euc_jis_2004_to_utf8</literal></entry> + <entry><literal>EUC_JIS_2004</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>utf8_to_euc_jis_2004</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>EUC_JIS_2004</literal></entry> + 
</row> + <row> + <entry><literal>shift_jis_2004_to_utf8</literal></entry> + <entry><literal>SHIFT_JIS_2004</literal></entry> + <entry><literal>UTF8</literal></entry> + </row> + <row> + <entry><literal>utf8_to_shift_jis_2004</literal></entry> + <entry><literal>UTF8</literal></entry> + <entry><literal>SHIFT_JIS_2004</literal></entry> + </row> + <row> + <entry><literal>euc_jis_2004_to_shift_jis_2004</literal></entry> + <entry><literal>EUC_JIS_2004</literal></entry> + <entry><literal>SHIFT_JIS_2004</literal></entry> + </row> + <row> + <entry><literal>shift_jis_2004_to_euc_jis_2004</literal></entry> + <entry><literal>SHIFT_JIS_2004</literal></entry> + <entry><literal>EUC_JIS_2004</literal></entry> + </row> + </tbody> + </tgroup> + </table> + </sect2> + + <sect2 id="multibyte-further-reading"> + <title>Further Reading</title> + + <para> + These are good sources to start learning about various kinds of encoding + systems. + + <variablelist> + <varlistentry> + <term><citetitle>CJKV Information Processing: Chinese, Japanese, Korean &amp; Vietnamese Computing</citetitle></term> + + <listitem> + <para> + Contains detailed explanations of <literal>EUC_JP</literal>, + <literal>EUC_CN</literal>, <literal>EUC_KR</literal>, + <literal>EUC_TW</literal>. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><ulink url="https://www.unicode.org/"></ulink></term> + + <listitem> + <para> + The web site of the Unicode Consortium. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><ulink url="https://tools.ietf.org/html/rfc3629">RFC 3629</ulink></term> + + <listitem> + <para> + <acronym>UTF</acronym>-8 (8-bit UCS/Unicode Transformation + Format) is defined here. + </para> + </listitem> + </varlistentry> + </variablelist> + </para> + </sect2> + + </sect1> + +</chapter> |