From 5e45211a64149b3c659b90ff2de6fa982a5a93ed Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sat, 4 May 2024 14:17:33 +0200 Subject: Adding upstream version 15.5. Signed-off-by: Daniel Baumann --- doc/src/sgml/charset.sgml | 2741 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 2741 insertions(+) create mode 100644 doc/src/sgml/charset.sgml (limited to 'doc/src/sgml/charset.sgml') diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml new file mode 100644 index 0000000..445fd17 --- /dev/null +++ b/doc/src/sgml/charset.sgml @@ -0,0 +1,2741 @@ + + + + Localization + + + This chapter describes the available localization features from the + point of view of the administrator. + PostgreSQL supports two localization + facilities: + + + + + Using the locale features of the operating system to provide + locale-specific collation order, number formatting, translated + messages, and other aspects. + This is covered in and + . + + + + + + Providing a number of different character sets to support storing text + in all kinds of languages, and providing character set translation + between client and server. + This is covered in . + + + + + + + + Locale Support + + locale + + + Locale support refers to an application respecting + cultural preferences regarding alphabets, sorting, number + formatting, etc. PostgreSQL uses the standard ISO + C and POSIX locale facilities provided by the server operating + system. For additional information refer to the documentation of your + system. + + + + Overview + + + Locale support is automatically initialized when a database + cluster is created using initdb. + initdb will initialize the database cluster + with the locale setting of its execution environment by default, + so if your system is already set to use the locale that you want + in your database cluster then there is nothing else you need to + do. If you want to use a different locale (or you are not sure + which locale your system is set to), you can instruct + initdb exactly which locale to use by + specifying the option. For example: + +initdb --locale=sv_SE + + + + + This example for Unix systems sets the locale to Swedish + (sv) as spoken + in Sweden (SE). Other possibilities might include + en_US (U.S. English) and fr_CA (French + Canadian). If more than one character set can be used for a + locale then the specifications can take the form + language_territory.codeset. For example, + fr_BE.UTF-8 represents the French language (fr) as + spoken in Belgium (BE), with a UTF-8 character set + encoding. + + + + What locales are available on your + system under what names depends on what was provided by the operating + system vendor and what was installed. On most Unix systems, the command + locale -a will provide a list of available locales. + Windows uses more verbose locale names, such as German_Germany + or Swedish_Sweden.1252, but the principles are the same. + + + + Occasionally it is useful to mix rules from several locales, e.g., + use English collation rules but Spanish messages. To support that, a + set of locale subcategories exist that control only certain + aspects of the localization rules: + + + + + + + + LC_COLLATE + String sort order + + + LC_CTYPE + Character classification (What is a letter? Its upper-case equivalent?) + + + LC_MESSAGES + Language of messages + + + LC_MONETARY + Formatting of currency amounts + + + LC_NUMERIC + Formatting of numbers + + + LC_TIME + Formatting of dates and times + + + + + + The category names translate into names of + initdb options to override the locale choice + for a specific category. For instance, to set the locale to + French Canadian, but use U.S. rules for formatting currency, use + initdb --locale=fr_CA --lc-monetary=en_US. + + + + If you want the system to behave as if it had no locale support, + use the special locale name C, or equivalently + POSIX. + + + + Some locale categories must have their values + fixed when the database is created. You can use different settings + for different databases, but once a database is created, you cannot + change them for that database anymore. LC_COLLATE + and LC_CTYPE are these categories. They affect + the sort order of indexes, so they must be kept fixed, or indexes on + text columns would become corrupt. + (But you can alleviate this restriction using collations, as discussed + in .) + The default values for these + categories are determined when initdb is run, and + those values are used when new databases are created, unless + specified otherwise in the CREATE DATABASE command. + + + + The other locale categories can be changed whenever desired + by setting the server configuration parameters + that have the same name as the locale categories (see for details). The values + that are chosen by initdb are actually only written + into the configuration file postgresql.conf to + serve as defaults when the server is started. If you remove these + assignments from postgresql.conf then the + server will inherit the settings from its execution environment. + + + + Note that the locale behavior of the server is determined by the + environment variables seen by the server, not by the environment + of any client. Therefore, be careful to configure the correct locale settings + before starting the server. A consequence of this is that if + client and server are set up in different locales, messages might + appear in different languages depending on where they originated. + + + + + When we speak of inheriting the locale from the execution + environment, this means the following on most operating systems: + For a given locale category, say the collation, the following + environment variables are consulted in this order until one is + found to be set: LC_ALL, LC_COLLATE + (or the variable corresponding to the respective category), + LANG. If none of these environment variables are + set then the locale defaults to C. + + + + Some message localization libraries also look at the environment + variable LANGUAGE which overrides all other locale + settings for the purpose of setting the language of messages. If + in doubt, please refer to the documentation of your operating + system, in particular the documentation about + gettext. + + + + + To enable messages to be translated to the user's preferred language, + NLS must have been selected at build time + (configure --enable-nls). All other locale support is + built in automatically. + + + + + Behavior + + + The locale settings influence the following SQL features: + + + + + Sort order in queries using ORDER BY or the standard + comparison operators on textual data + ORDER BYand locales + + + + + + The upper, lower, and initcap + functions + upperand locales + lowerand locales + + + + + + Pattern matching operators (LIKE, SIMILAR TO, + and POSIX-style regular expressions); locales affect both case + insensitive matching and the classification of characters by + character-class regular expressions + LIKEand locales + regular expressionsand locales + + + + + + The to_char family of functions + to_charand locales + + + + + + The ability to use indexes with LIKE clauses + + + + + + + The drawback of using locales other than C or + POSIX in PostgreSQL is its performance + impact. It slows character handling and prevents ordinary indexes + from being used by LIKE. For this reason use locales + only if you actually need them. + + + + As a workaround to allow PostgreSQL to use indexes + with LIKE clauses under a non-C locale, several custom + operator classes exist. These allow the creation of an index that + performs a strict character-by-character comparison, ignoring + locale comparison rules. Refer to + for more information. Another approach is to create indexes using + the C collation, as discussed in + . + + + + + Selecting Locales + + + Locales can be selected in different scopes depending on requirements. + The above overview showed how locales are specified using + initdb to set the defaults for the entire cluster. The + following list shows where locales can be selected. Each item provides + the defaults for the subsequent items, and each lower item allows + overriding the defaults on a finer granularity. + + + + + + As explained above, the environment of the operating system provides the + defaults for the locales of a newly initialized database cluster. In + many cases, this is enough: If the operating system is configured for + the desired language/territory, then + PostgreSQL will by default also behave + according to that locale. + + + + + + As shown above, command-line options for initdb + specify the locale settings for a newly initialized database cluster. + Use this if the operating system does not have the locale configuration + you want for your database system. + + + + + + A locale can be selected separately for each database. The SQL command + CREATE DATABASE and its command-line equivalent + createdb have options for that. Use this for example + if a database cluster houses databases for multiple tenants with + different requirements. + + + + + + Locale settings can be made for individual table columns. This uses an + SQL object called collation and is explained in + . Use this for example to sort data in + different languages or customize the sort order of a particular table. + + + + + + Finally, locales can be selected for an individual query. Again, this + uses SQL collation objects. This could be used to change the sort order + based on run-time choices or for ad-hoc experimentation. + + + + + + + Locale Providers + + + PostgreSQL supports multiple locale + providers. This specifies which library supplies the locale + data. One standard provider name is libc, which uses + the locales provided by the operating system C library. These are the + locales used by most tools provided by the operating system. Another + provider is icu, which uses the external + ICUICU library. ICU locales can + only be used if support for ICU was configured when PostgreSQL was built. + + + + The commands and tools that select the locale settings, as described + above, each have an option to select the locale provider. The examples + shown earlier all use the libc provider, which is the + default. Here is an example to initialize a database cluster using the + ICU provider: + +initdb --locale-provider=icu --icu-locale=en + + See the description of the respective commands and programs for + details. Note that you can mix locale providers at different + granularities, for example use libc by default for the + cluster but have one database that uses the icu + provider, and then have collation objects using either provider within + those databases. + + + + Which locale provider to use depends on individual requirements. For most + basic uses, either provider will give adequate results. For the libc + provider, it depends on what the operating system offers; some operating + systems are better than others. For advanced uses, ICU offers more locale + variants and customization options. + + + + + Problems + + + If locale support doesn't work according to the explanation above, + check that the locale support in your operating system is + correctly configured. To check what locales are installed on your + system, you can use the command locale -a if + your operating system provides it. + + + + Check that PostgreSQL is actually using the locale + that you think it is. The LC_COLLATE and LC_CTYPE + settings are determined when a database is created, and cannot be + changed except by creating a new database. Other locale + settings including LC_MESSAGES and LC_MONETARY + are initially determined by the environment the server is started + in, but can be changed on-the-fly. You can check the active locale + settings using the SHOW command. + + + + The directory src/test/locale in the source + distribution contains a test suite for + PostgreSQL's locale support. + + + + Client applications that handle server-side errors by parsing the + text of the error message will obviously have problems when the + server's messages are in a different language. Authors of such + applications are advised to make use of the error code scheme + instead. + + + + Maintaining catalogs of message translations requires the on-going + efforts of many volunteers that want to see + PostgreSQL speak their preferred language well. + If messages in your language are currently not available or not fully + translated, your assistance would be appreciated. If you want to + help, refer to or write to the developers' + mailing list. + + + + + + + Collation Support + + collation + + + The collation feature allows specifying the sort order and character + classification behavior of data per-column, or even per-operation. + This alleviates the restriction that the + LC_COLLATE and LC_CTYPE settings + of a database cannot be changed after its creation. + + + + Concepts + + + Conceptually, every expression of a collatable data type has a + collation. (The built-in collatable data types are + text, varchar, and char. + User-defined base types can also be marked collatable, and of course + a domain over a + collatable data type is collatable.) If the + expression is a column reference, the collation of the expression is the + defined collation of the column. If the expression is a constant, the + collation is the default collation of the data type of the + constant. The collation of a more complex expression is derived + from the collations of its inputs, as described below. + + + + The collation of an expression can be the default + collation, which means the locale settings defined for the + database. It is also possible for an expression's collation to be + indeterminate. In such cases, ordering operations and other + operations that need to know the collation will fail. + + + + When the database system has to perform an ordering or a character + classification, it uses the collation of the input expression. This + happens, for example, with ORDER BY clauses + and function or operator calls such as <. + The collation to apply for an ORDER BY clause + is simply the collation of the sort key. The collation to apply for a + function or operator call is derived from the arguments, as described + below. In addition to comparison operators, collations are taken into + account by functions that convert between lower and upper case + letters, such as lower, upper, and + initcap; by pattern matching operators; and by + to_char and related functions. + + + + For a function or operator call, the collation that is derived by + examining the argument collations is used at run time for performing + the specified operation. If the result of the function or operator + call is of a collatable data type, the collation is also used at parse + time as the defined collation of the function or operator expression, + in case there is a surrounding expression that requires knowledge of + its collation. + + + + The collation derivation of an expression can be + implicit or explicit. This distinction affects how collations are + combined when multiple different collations appear in an + expression. An explicit collation derivation occurs when a + COLLATE clause is used; all other collation + derivations are implicit. When multiple collations need to be + combined, for example in a function call, the following rules are + used: + + + + + If any input expression has an explicit collation derivation, then + all explicitly derived collations among the input expressions must be + the same, otherwise an error is raised. If any explicitly + derived collation is present, that is the result of the + collation combination. + + + + + + Otherwise, all input expressions must have the same implicit + collation derivation or the default collation. If any non-default + collation is present, that is the result of the collation combination. + Otherwise, the result is the default collation. + + + + + + If there are conflicting non-default implicit collations among the + input expressions, then the combination is deemed to have indeterminate + collation. This is not an error condition unless the particular + function being invoked requires knowledge of the collation it should + apply. If it does, an error will be raised at run-time. + + + + + For example, consider this table definition: + +CREATE TABLE test1 ( + a text COLLATE "de_DE", + b text COLLATE "es_ES", + ... +); + + + Then in + +SELECT a < 'foo' FROM test1; + + the < comparison is performed according to + de_DE rules, because the expression combines an + implicitly derived collation with the default collation. But in + +SELECT a < ('foo' COLLATE "fr_FR") FROM test1; + + the comparison is performed using fr_FR rules, + because the explicit collation derivation overrides the implicit one. + Furthermore, given + +SELECT a < b FROM test1; + + the parser cannot determine which collation to apply, since the + a and b columns have conflicting + implicit collations. Since the < operator + does need to know which collation to use, this will result in an + error. The error can be resolved by attaching an explicit collation + specifier to either input expression, thus: + +SELECT a < b COLLATE "de_DE" FROM test1; + + or equivalently + +SELECT a COLLATE "de_DE" < b FROM test1; + + On the other hand, the structurally similar case + +SELECT a || b FROM test1; + + does not result in an error, because the || operator + does not care about collations: its result is the same regardless + of the collation. + + + + The collation assigned to a function or operator's combined input + expressions is also considered to apply to the function or operator's + result, if the function or operator delivers a result of a collatable + data type. So, in + +SELECT * FROM test1 ORDER BY a || 'foo'; + + the ordering will be done according to de_DE rules. + But this query: + +SELECT * FROM test1 ORDER BY a || b; + + results in an error, because even though the || operator + doesn't need to know a collation, the ORDER BY clause does. + As before, the conflict can be resolved with an explicit collation + specifier: + +SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR"; + + + + + + Managing Collations + + + A collation is an SQL schema object that maps an SQL name to locales + provided by libraries installed in the operating system. A collation + definition has a provider that specifies which + library supplies the locale data. One standard provider name + is libc, which uses the locales provided by the + operating system C library. These are the locales used by most tools + provided by the operating system. Another provider + is icu, which uses the external + ICUICU library. ICU locales can only be + used if support for ICU was configured when PostgreSQL was built. + + + + A collation object provided by libc maps to a + combination of LC_COLLATE and LC_CTYPE + settings, as accepted by the setlocale() system library call. (As + the name would suggest, the main purpose of a collation is to set + LC_COLLATE, which controls the sort order. But + it is rarely necessary in practice to have an + LC_CTYPE setting that is different from + LC_COLLATE, so it is more convenient to collect + these under one concept than to create another infrastructure for + setting LC_CTYPE per expression.) Also, + a libc collation + is tied to a character set encoding (see ). + The same collation name may exist for different encodings. + + + + A collation object provided by icu maps to a named + collator provided by the ICU library. ICU does not support + separate collate and ctype settings, so + they are always the same. Also, ICU collations are independent of the + encoding, so there is always only one ICU collation of a given name in + a database. + + + + Standard Collations + + + On all platforms, the collations named default, + C, and POSIX are available. Additional + collations may be available depending on operating system support. + The default collation selects the LC_COLLATE + and LC_CTYPE values specified at database creation time. + The C and POSIX collations both specify + traditional C behavior, in which only the ASCII letters + A through Z + are treated as letters, and sorting is done strictly by character + code byte values. + + + + Additionally, the SQL standard collation name ucs_basic + is available for encoding UTF8. It is equivalent + to C and sorts by Unicode code point. + + + + + Predefined Collations + + + If the operating system provides support for using multiple locales + within a single program (newlocale and related functions), + or if support for ICU is configured, + then when a database cluster is initialized, initdb + populates the system catalog pg_collation with + collations based on all the locales it finds in the operating + system at the time. + + + + To inspect the currently available locales, use the query SELECT + * FROM pg_collation, or the command \dOS+ + in psql. + + + + libc Collations + + + For example, the operating system might + provide a locale named de_DE.utf8. + initdb would then create a collation named + de_DE.utf8 for encoding UTF8 + that has both LC_COLLATE and + LC_CTYPE set to de_DE.utf8. + It will also create a collation with the .utf8 + tag stripped off the name. So you could also use the collation + under the name de_DE, which is less cumbersome + to write and makes the name less encoding-dependent. Note that, + nevertheless, the initial set of collation names is + platform-dependent. + + + + The default set of collations provided by libc map + directly to the locales installed in the operating system, which can be + listed using the command locale -a. In case + a libc collation is needed that has different values + for LC_COLLATE and LC_CTYPE, or if new + locales are installed in the operating system after the database system + was initialized, then a new collation may be created using + the command. + New operating system locales can also be imported en masse using + the pg_import_system_collations() function. + + + + Within any particular database, only collations that use that + database's encoding are of interest. Other entries in + pg_collation are ignored. Thus, a stripped collation + name such as de_DE can be considered unique + within a given database even though it would not be unique globally. + Use of the stripped collation names is recommended, since it will + make one fewer thing you need to change if you decide to change to + another database encoding. Note however that the default, + C, and POSIX collations can be used regardless of + the database encoding. + + + + PostgreSQL considers distinct collation + objects to be incompatible even when they have identical properties. + Thus for example, + +SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1; + + will draw an error even though the C and POSIX + collations have identical behaviors. Mixing stripped and non-stripped + collation names is therefore not recommended. + + + + + ICU Collations + + + With ICU, it is not sensible to enumerate all possible locale names. ICU + uses a particular naming system for locales, but there are many more ways + to name a locale than there are actually distinct locales. + initdb uses the ICU APIs to extract a set of distinct + locales to populate the initial set of collations. Collations provided by + ICU are created in the SQL environment with names in BCP 47 language tag + format, with a private use + extension -x-icu appended, to distinguish them from + libc locales. + + + + Here are some example collations that might be created: + + + + de-x-icu + + German collation, default variant + + + + + de-AT-x-icu + + German collation for Austria, default variant + + (There are also, say, de-DE-x-icu + or de-CH-x-icu, but as of this writing, they are + equivalent to de-x-icu.) + + + + + + und-x-icu (for undefined) + + + ICU root collation. Use this to get a reasonable + language-agnostic sort order. + + + + + + + + Some (less frequently used) encodings are not supported by ICU. When the + database encoding is one of these, ICU collation entries + in pg_collation are ignored. Attempting to use one + will draw an error along the lines of collation "de-x-icu" for + encoding "WIN874" does not exist. + + + + + + Creating New Collation Objects + + + If the standard and predefined collations are not sufficient, users can + create their own collation objects using the SQL + command . + + + + The standard and predefined collations are in the + schema pg_catalog, like all predefined objects. + User-defined collations should be created in user schemas. This also + ensures that they are saved by pg_dump. + + + + libc Collations + + + New libc collations can be created like this: + +CREATE COLLATION german (provider = libc, locale = 'de_DE'); + + The exact values that are acceptable for the locale + clause in this command depend on the operating system. On Unix-like + systems, the command locale -a will show a list. + + + + Since the predefined libc collations already include all collations + defined in the operating system when the database instance is + initialized, it is not often necessary to manually create new ones. + Reasons might be if a different naming system is desired (in which case + see also ) or if the operating system has + been upgraded to provide new locale definitions (in which case see + also pg_import_system_collations()). + + + + + ICU Collations + + + ICU allows collations to be customized beyond the basic language+country + set that is preloaded by initdb. Users are encouraged + to define their own collation objects that make use of these facilities to + suit the sorting behavior to their requirements. + See + and for + information on ICU locale naming. The set of acceptable names and + attributes depends on the particular ICU version. + + + + Here are some examples: + + + + CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk'); + CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook'); + + German collation with phone book collation type + + The first example selects the ICU locale using a language + tag per BCP 47. The second example uses the traditional + ICU-specific locale syntax. The first style is preferred going + forward, but it is not supported by older ICU versions. + + + Note that you can name the collation objects in the SQL environment + anything you want. In this example, we follow the naming style that + the predefined collations use, which in turn also follow BCP 47, but + that is not required for user-defined collations. + + + + + + CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji'); + CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji'); + + + Root collation with Emoji collation type, per Unicode Technical Standard #51 + + + Observe how in the traditional ICU locale naming system, the root + locale is selected by an empty string. + + + + + + CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn'); + CREATE COLLATION latinlast (provider = icu, locale = 'en@colReorder=grek-latn'); + + + Sort Greek letters before Latin ones. (The default is Latin before Greek.) + + + + + + CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper'); + CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper'); + + + Sort upper-case letters before lower-case letters. (The default is + lower-case letters first.) + + + + + + CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn'); + CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=grek-latn'); + + + Combines both of the above options. + + + + + + CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true'); + CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes'); + + + Numeric ordering, sorts sequences of digits by their numeric value, + for example: A-21 < A-123 + (also known as natural sort). + + + + + + See Unicode + Technical Standard #35 + and BCP 47 for + details. The list of possible collation types (co + subtag) can be found in + the CLDR + repository. + + + + Note that while this system allows creating collations that ignore + case or ignore accents or similar (using the + ks key), in order for such collations to act in a + truly case- or accent-insensitive manner, they also need to be declared as not + deterministic in CREATE COLLATION; + see . + Otherwise, any strings that compare equal according to the collation but + are not byte-wise equal will be sorted according to their byte values. + + + + + By design, ICU will accept almost any string as a locale name and match + it to the closest locale it can provide, using the fallback procedure + described in its documentation. Thus, there will be no direct feedback + if a collation specification is composed using features that the given + ICU installation does not actually support. It is therefore recommended + to create application-level test cases to check that the collation + definitions satisfy one's requirements. + + + + + + Copying Collations + + + The command can also be used to + create a new collation from an existing collation, which can be useful to + be able to use operating-system-independent collation names in + applications, create compatibility names, or use an ICU-provided collation + under a more readable name. For example: + +CREATE COLLATION german FROM "de_DE"; +CREATE COLLATION french FROM "fr-x-icu"; + + + + + + + Nondeterministic Collations + + + A collation is either deterministic or + nondeterministic. A deterministic collation uses + deterministic comparisons, which means that it considers strings to be + equal only if they consist of the same byte sequence. Nondeterministic + comparison may determine strings to be equal even if they consist of + different bytes. Typical situations include case-insensitive comparison, + accent-insensitive comparison, as well as comparison of strings in + different Unicode normal forms. It is up to the collation provider to + actually implement such insensitive comparisons; the deterministic flag + only determines whether ties are to be broken using bytewise comparison. + See also Unicode Technical + Standard 10 for more information on the terminology. + + + + To create a nondeterministic collation, specify the property + deterministic = false to CREATE + COLLATION, for example: + +CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false); + + This example would use the standard Unicode collation in a + nondeterministic way. In particular, this would allow strings in + different normal forms to be compared correctly. More interesting + examples make use of the ICU customization facilities explained above. + For example: + +CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false); +CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false); + + + + + All standard and predefined collations are deterministic, all + user-defined collations are deterministic by default. While + nondeterministic collations give a more correct behavior, + especially when considering the full power of Unicode and its many + special cases, they also have some drawbacks. Foremost, their use leads + to a performance penalty. Note, in particular, that B-tree cannot use + deduplication with indexes that use a nondeterministic collation. Also, + certain operations are not possible with nondeterministic collations, + such as pattern matching operations. Therefore, they should be used + only in cases where they are specifically wanted. + + + + + To deal with text in different Unicode normalization forms, it is also + an option to use the functions/expressions + normalize and is normalized to + preprocess or check the strings, instead of using nondeterministic + collations. There are different trade-offs for each approach. + + + + + + + + Character Set Support + + character set + + + The character set support in PostgreSQL + allows you to store text in a variety of character sets (also called + encodings), including + single-byte character sets such as the ISO 8859 series and + multiple-byte character sets such as EUC (Extended Unix + Code), UTF-8, and Mule internal code. All supported character sets + can be used transparently by clients, but a few are not supported + for use within the server (that is, as a server-side encoding). + The default character set is selected while + initializing your PostgreSQL database + cluster using initdb. It can be overridden when you + create a database, so you can have multiple + databases each with a different character set. + + + + An important restriction, however, is that each database's character set + must be compatible with the database's LC_CTYPE (character + classification) and LC_COLLATE (string sort order) locale + settings. For C or + POSIX locale, any character set is allowed, but for other + libc-provided locales there is only one character set that will work + correctly. + (On Windows, however, UTF-8 encoding can be used with any locale.) + If you have ICU support configured, ICU-provided locales can be used + with most but not all server-side encodings. + + + + Supported Character Sets + + + shows the character sets available + for use in PostgreSQL. + + + + <productname>PostgreSQL</productname> Character Sets + + + + + + + + + + + Name + Description + Language + Server? + ICU? + + Bytes/&zwsp;Char + Aliases + + + + + BIG5 + Big Five + Traditional Chinese + No + No + 1–2 + WIN950, Windows950 + + + EUC_CN + Extended UNIX Code-CN + Simplified Chinese + Yes + Yes + 1–3 + + + + EUC_JP + Extended UNIX Code-JP + Japanese + Yes + Yes + 1–3 + + + + EUC_JIS_2004 + Extended UNIX Code-JP, JIS X 0213 + Japanese + Yes + No + 1–3 + + + + EUC_KR + Extended UNIX Code-KR + Korean + Yes + Yes + 1–3 + + + + EUC_TW + Extended UNIX Code-TW + Traditional Chinese, Taiwanese + Yes + Yes + 1–3 + + + + GB18030 + National Standard + Chinese + No + No + 1–4 + + + + GBK + Extended National Standard + Simplified Chinese + No + No + 1–2 + WIN936, Windows936 + + + ISO_8859_5 + ISO 8859-5, ECMA 113 + Latin/Cyrillic + Yes + Yes + 1 + + + + ISO_8859_6 + ISO 8859-6, ECMA 114 + Latin/Arabic + Yes + Yes + 1 + + + + ISO_8859_7 + ISO 8859-7, ECMA 118 + Latin/Greek + Yes + Yes + 1 + + + + ISO_8859_8 + ISO 8859-8, ECMA 121 + Latin/Hebrew + Yes + Yes + 1 + + + + JOHAB + JOHAB + Korean (Hangul) + No + No + 1–3 + + + + KOI8R + KOI8-R + Cyrillic (Russian) + Yes + Yes + 1 + KOI8 + + + KOI8U + KOI8-U + Cyrillic (Ukrainian) + Yes + Yes + 1 + + + + LATIN1 + ISO 8859-1, ECMA 94 + Western European + Yes + Yes + 1 + ISO88591 + + + LATIN2 + ISO 8859-2, ECMA 94 + Central European + Yes + Yes + 1 + ISO88592 + + + LATIN3 + ISO 8859-3, ECMA 94 + South European + Yes + Yes + 1 + ISO88593 + + + LATIN4 + ISO 8859-4, ECMA 94 + North European + Yes + Yes + 1 + ISO88594 + + + LATIN5 + ISO 8859-9, ECMA 128 + Turkish + Yes + Yes + 1 + ISO88599 + + + LATIN6 + ISO 8859-10, ECMA 144 + Nordic + Yes + Yes + 1 + ISO885910 + + + LATIN7 + ISO 8859-13 + Baltic + Yes + Yes + 1 + ISO885913 + + + LATIN8 + ISO 8859-14 + Celtic + Yes + Yes + 1 + ISO885914 + + + LATIN9 + ISO 8859-15 + LATIN1 with Euro and accents + Yes + Yes + 1 + ISO885915 + + + LATIN10 + ISO 8859-16, ASRO SR 14111 + Romanian + Yes + No + 1 + ISO885916 + + + MULE_INTERNAL + Mule internal code + Multilingual Emacs + Yes + No + 1–4 + + + + SJIS + Shift JIS + Japanese + No + No + 1–2 + Mskanji, ShiftJIS, WIN932, Windows932 + + + SHIFT_JIS_2004 + Shift JIS, JIS X 0213 + Japanese + No + No + 1–2 + + + + SQL_ASCII + unspecified (see text) + any + Yes + No + 1 + + + + UHC + Unified Hangul Code + Korean + No + No + 1–2 + WIN949, Windows949 + + + UTF8 + Unicode, 8-bit + all + Yes + Yes + 1–4 + Unicode + + + WIN866 + Windows CP866 + Cyrillic + Yes + Yes + 1 + ALT + + + WIN874 + Windows CP874 + Thai + Yes + No + 1 + + + + WIN1250 + Windows CP1250 + Central European + Yes + Yes + 1 + + + + WIN1251 + Windows CP1251 + Cyrillic + Yes + Yes + 1 + WIN + + + WIN1252 + Windows CP1252 + Western European + Yes + Yes + 1 + + + + WIN1253 + Windows CP1253 + Greek + Yes + Yes + 1 + + + + WIN1254 + Windows CP1254 + Turkish + Yes + Yes + 1 + + + + WIN1255 + Windows CP1255 + Hebrew + Yes + Yes + 1 + + + + WIN1256 + Windows CP1256 + Arabic + Yes + Yes + 1 + + + + WIN1257 + Windows CP1257 + Baltic + Yes + Yes + 1 + + + + WIN1258 + Windows CP1258 + Vietnamese + Yes + Yes + 1 + ABC, TCVN, TCVN5712, VSCII + + + +
+ + + Not all client APIs support all the listed character sets. For example, the + PostgreSQL + JDBC driver does not support MULE_INTERNAL, LATIN6, + LATIN8, and LATIN10. + + + + The SQL_ASCII setting behaves considerably differently + from the other settings. When the server character set is + SQL_ASCII, the server interprets byte values 0–127 + according to the ASCII standard, while byte values 128–255 are taken + as uninterpreted characters. No encoding conversion will be done when + the setting is SQL_ASCII. Thus, this setting is not so + much a declaration that a specific encoding is in use, as a declaration + of ignorance about the encoding. In most cases, if you are + working with any non-ASCII data, it is unwise to use the + SQL_ASCII setting because + PostgreSQL will be unable to help you by + converting or validating non-ASCII characters. + +
+ + + Setting the Character Set + + + initdb defines the default character set (encoding) + for a PostgreSQL cluster. For example, + + +initdb -E EUC_JP + + + sets the default character set to + EUC_JP (Extended Unix Code for Japanese). You + can use instead of + if you prefer longer option strings. + If no or option is + given, initdb attempts to determine the appropriate + encoding to use based on the specified or default locale. + + + + You can specify a non-default encoding at database creation time, + provided that the encoding is compatible with the selected locale: + + +createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean + + + This will create a database named korean that + uses the character set EUC_KR, and locale ko_KR. + Another way to accomplish this is to use this SQL command: + + +CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0; + + + Notice that the above commands specify copying the template0 + database. When copying any other database, the encoding and locale + settings cannot be changed from those of the source database, because + that might result in corrupt data. For more information see + . + + + + The encoding for a database is stored in the system catalog + pg_database. You can see it by using the + psql option or the + \l command. + + +$ psql -l + List of databases + Name | Owner | Encoding | Collation | Ctype | Access Privileges +-----------+----------+-----------+-------------+-------------+------------------------------------- + clocaledb | hlinnaka | SQL_ASCII | C | C | + englishdb | hlinnaka | UTF8 | en_GB.UTF8 | en_GB.UTF8 | + japanese | hlinnaka | UTF8 | ja_JP.UTF8 | ja_JP.UTF8 | + korean | hlinnaka | EUC_KR | ko_KR.euckr | ko_KR.euckr | + postgres | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | + template0 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka} + template1 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka} +(7 rows) + + + + + + On most modern operating systems, PostgreSQL + can determine which character set is implied by the LC_CTYPE + setting, and it will enforce that only the matching database encoding is + used. On older systems it is your responsibility to ensure that you use + the encoding expected by the locale you have selected. A mistake in + this area is likely to lead to strange behavior of locale-dependent + operations such as sorting. + + + + PostgreSQL will allow superusers to create + databases with SQL_ASCII encoding even when + LC_CTYPE is not C or POSIX. As noted + above, SQL_ASCII does not enforce that the data stored in + the database has any particular encoding, and so this choice poses risks + of locale-dependent misbehavior. Using this combination of settings is + deprecated and may someday be forbidden altogether. + + + + + + Automatic Character Set Conversion Between Server and Client + + + PostgreSQL supports automatic character + set conversion between server and client for many combinations of + character sets ( + shows which ones). + + + + To enable automatic character set conversion, you have to + tell PostgreSQL the character set + (encoding) you would like to use in the client. There are several + ways to accomplish this: + + + + + Using the \encoding command in + psql. + \encoding allows you to change client + encoding on the fly. For + example, to change the encoding to SJIS, type: + + +\encoding SJIS + + + + + + + libpq () has functions to control the client encoding. + + + + + + Using SET client_encoding TO. + + Setting the client encoding can be done with this SQL command: + + +SET CLIENT_ENCODING TO 'value'; + + + Also you can use the standard SQL syntax SET NAMES + for this purpose: + + +SET NAMES 'value'; + + + To query the current client encoding: + + +SHOW client_encoding; + + + To return to the default encoding: + + +RESET client_encoding; + + + + + + + Using PGCLIENTENCODING. If the environment variable + PGCLIENTENCODING is defined in the client's + environment, that client encoding is automatically selected + when a connection to the server is made. (This can + subsequently be overridden using any of the other methods + mentioned above.) + + + + + + Using the configuration variable . If the + client_encoding variable is set, that client + encoding is automatically selected when a connection to the + server is made. (This can subsequently be overridden using any + of the other methods mentioned above.) + + + + + + + + If the conversion of a particular character is not possible + — suppose you chose EUC_JP for the + server and LATIN1 for the client, and some + Japanese characters are returned that do not have a representation in + LATIN1 — an error is reported. + + + + If the client character set is defined as SQL_ASCII, + encoding conversion is disabled, regardless of the server's character + set. (However, if the server's character set is + not SQL_ASCII, the server will still check that + incoming data is valid for that encoding; so the net effect is as + though the client character set were the same as the server's.) + Just as for the server, use of SQL_ASCII is unwise + unless you are working with all-ASCII data. + + + + + Available Character Set Conversions + + + PostgreSQL allows conversion between any + two character sets for which a conversion function is listed in the + pg_conversion + system catalog. PostgreSQL comes with + some predefined conversions, as summarized in + and shown in more + detail in . You can + create a new conversion using the SQL command + . (To be used for automatic + client/server conversions, a conversion must be marked + as default for its character set pair.) + + + + Built-in Client/Server Character Set Conversions + + + + + + Server Character Set + Available Client Character Sets + + + + + BIG5 + not supported as a server encoding + + + + EUC_CN + EUC_CN, + MULE_INTERNAL, + UTF8 + + + + EUC_JP + EUC_JP, + MULE_INTERNAL, + SJIS, + UTF8 + + + + EUC_JIS_2004 + EUC_JIS_2004, + SHIFT_JIS_2004, + UTF8 + + + + EUC_KR + EUC_KR, + MULE_INTERNAL, + UTF8 + + + + EUC_TW + EUC_TW, + BIG5, + MULE_INTERNAL, + UTF8 + + + + GB18030 + not supported as a server encoding + + + + GBK + not supported as a server encoding + + + + ISO_8859_5 + ISO_8859_5, + KOI8R, + MULE_INTERNAL, + UTF8, + WIN866, + WIN1251 + + + + ISO_8859_6 + ISO_8859_6, + UTF8 + + + + ISO_8859_7 + ISO_8859_7, + UTF8 + + + + ISO_8859_8 + ISO_8859_8, + UTF8 + + + + JOHAB + not supported as a server encoding + + + + KOI8R + KOI8R, + ISO_8859_5, + MULE_INTERNAL, + UTF8, + WIN866, + WIN1251 + + + + KOI8U + KOI8U, + UTF8 + + + + LATIN1 + LATIN1, + MULE_INTERNAL, + UTF8 + + + + LATIN2 + LATIN2, + MULE_INTERNAL, + UTF8, + WIN1250 + + + + LATIN3 + LATIN3, + MULE_INTERNAL, + UTF8 + + + + LATIN4 + LATIN4, + MULE_INTERNAL, + UTF8 + + + + LATIN5 + LATIN5, + UTF8 + + + + LATIN6 + LATIN6, + UTF8 + + + + LATIN7 + LATIN7, + UTF8 + + + + LATIN8 + LATIN8, + UTF8 + + + + LATIN9 + LATIN9, + UTF8 + + + + LATIN10 + LATIN10, + UTF8 + + + + MULE_INTERNAL + MULE_INTERNAL, + BIG5, + EUC_CN, + EUC_JP, + EUC_KR, + EUC_TW, + ISO_8859_5, + KOI8R, + LATIN1 to LATIN4, + SJIS, + WIN866, + WIN1250, + WIN1251 + + + + SJIS + not supported as a server encoding + + + + SHIFT_JIS_2004 + not supported as a server encoding + + + + SQL_ASCII + any (no conversion will be performed) + + + + UHC + not supported as a server encoding + + + + UTF8 + all supported encodings + + + + WIN866 + WIN866, + ISO_8859_5, + KOI8R, + MULE_INTERNAL, + UTF8, + WIN1251 + + + + WIN874 + WIN874, + UTF8 + + + + WIN1250 + WIN1250, + LATIN2, + MULE_INTERNAL, + UTF8 + + + + WIN1251 + WIN1251, + ISO_8859_5, + KOI8R, + MULE_INTERNAL, + UTF8, + WIN866 + + + + WIN1252 + WIN1252, + UTF8 + + + + WIN1253 + WIN1253, + UTF8 + + + + WIN1254 + WIN1254, + UTF8 + + + + WIN1255 + WIN1255, + UTF8 + + + + WIN1256 + WIN1256, + UTF8 + + + + WIN1257 + WIN1257, + UTF8 + + + + WIN1258 + WIN1258, + UTF8 + + + + +
+ + + All Built-in Character Set Conversions + + + + + + + Conversion Name + + + The conversion names follow a standard naming scheme: The + official name of the source encoding with all + non-alphanumeric characters replaced by underscores, followed + by _to_, followed by the similarly processed + destination encoding name. Therefore, these names sometimes + deviate from the customary encoding names shown in + . + + + + Source Encoding + Destination Encoding + + + + + + big5_to_euc_tw + BIG5 + EUC_TW + + + big5_to_mic + BIG5 + MULE_INTERNAL + + + big5_to_utf8 + BIG5 + UTF8 + + + euc_cn_to_mic + EUC_CN + MULE_INTERNAL + + + euc_cn_to_utf8 + EUC_CN + UTF8 + + + euc_jp_to_mic + EUC_JP + MULE_INTERNAL + + + euc_jp_to_sjis + EUC_JP + SJIS + + + euc_jp_to_utf8 + EUC_JP + UTF8 + + + euc_kr_to_mic + EUC_KR + MULE_INTERNAL + + + euc_kr_to_utf8 + EUC_KR + UTF8 + + + euc_tw_to_big5 + EUC_TW + BIG5 + + + euc_tw_to_mic + EUC_TW + MULE_INTERNAL + + + euc_tw_to_utf8 + EUC_TW + UTF8 + + + gb18030_to_utf8 + GB18030 + UTF8 + + + gbk_to_utf8 + GBK + UTF8 + + + iso_8859_10_to_utf8 + LATIN6 + UTF8 + + + iso_8859_13_to_utf8 + LATIN7 + UTF8 + + + iso_8859_14_to_utf8 + LATIN8 + UTF8 + + + iso_8859_15_to_utf8 + LATIN9 + UTF8 + + + iso_8859_16_to_utf8 + LATIN10 + UTF8 + + + iso_8859_1_to_mic + LATIN1 + MULE_INTERNAL + + + iso_8859_1_to_utf8 + LATIN1 + UTF8 + + + iso_8859_2_to_mic + LATIN2 + MULE_INTERNAL + + + iso_8859_2_to_utf8 + LATIN2 + UTF8 + + + iso_8859_2_to_windows_1250 + LATIN2 + WIN1250 + + + iso_8859_3_to_mic + LATIN3 + MULE_INTERNAL + + + iso_8859_3_to_utf8 + LATIN3 + UTF8 + + + iso_8859_4_to_mic + LATIN4 + MULE_INTERNAL + + + iso_8859_4_to_utf8 + LATIN4 + UTF8 + + + iso_8859_5_to_koi8_r + ISO_8859_5 + KOI8R + + + iso_8859_5_to_mic + ISO_8859_5 + MULE_INTERNAL + + + iso_8859_5_to_utf8 + ISO_8859_5 + UTF8 + + + iso_8859_5_to_windows_1251 + ISO_8859_5 + WIN1251 + + + iso_8859_5_to_windows_866 + ISO_8859_5 + WIN866 + + + iso_8859_6_to_utf8 + ISO_8859_6 + UTF8 + + + iso_8859_7_to_utf8 + ISO_8859_7 + UTF8 + + + iso_8859_8_to_utf8 + ISO_8859_8 + UTF8 + + + iso_8859_9_to_utf8 + LATIN5 + UTF8 + + + johab_to_utf8 + JOHAB + UTF8 + + + koi8_r_to_iso_8859_5 + KOI8R + ISO_8859_5 + + + koi8_r_to_mic + KOI8R + MULE_INTERNAL + + + koi8_r_to_utf8 + KOI8R + UTF8 + + + koi8_r_to_windows_1251 + KOI8R + WIN1251 + + + koi8_r_to_windows_866 + KOI8R + WIN866 + + + koi8_u_to_utf8 + KOI8U + UTF8 + + + mic_to_big5 + MULE_INTERNAL + BIG5 + + + mic_to_euc_cn + MULE_INTERNAL + EUC_CN + + + mic_to_euc_jp + MULE_INTERNAL + EUC_JP + + + mic_to_euc_kr + MULE_INTERNAL + EUC_KR + + + mic_to_euc_tw + MULE_INTERNAL + EUC_TW + + + mic_to_iso_8859_1 + MULE_INTERNAL + LATIN1 + + + mic_to_iso_8859_2 + MULE_INTERNAL + LATIN2 + + + mic_to_iso_8859_3 + MULE_INTERNAL + LATIN3 + + + mic_to_iso_8859_4 + MULE_INTERNAL + LATIN4 + + + mic_to_iso_8859_5 + MULE_INTERNAL + ISO_8859_5 + + + mic_to_koi8_r + MULE_INTERNAL + KOI8R + + + mic_to_sjis + MULE_INTERNAL + SJIS + + + mic_to_windows_1250 + MULE_INTERNAL + WIN1250 + + + mic_to_windows_1251 + MULE_INTERNAL + WIN1251 + + + mic_to_windows_866 + MULE_INTERNAL + WIN866 + + + sjis_to_euc_jp + SJIS + EUC_JP + + + sjis_to_mic + SJIS + MULE_INTERNAL + + + sjis_to_utf8 + SJIS + UTF8 + + + windows_1258_to_utf8 + WIN1258 + UTF8 + + + uhc_to_utf8 + UHC + UTF8 + + + utf8_to_big5 + UTF8 + BIG5 + + + utf8_to_euc_cn + UTF8 + EUC_CN + + + utf8_to_euc_jp + UTF8 + EUC_JP + + + utf8_to_euc_kr + UTF8 + EUC_KR + + + utf8_to_euc_tw + UTF8 + EUC_TW + + + utf8_to_gb18030 + UTF8 + GB18030 + + + utf8_to_gbk + UTF8 + GBK + + + utf8_to_iso_8859_1 + UTF8 + LATIN1 + + + utf8_to_iso_8859_10 + UTF8 + LATIN6 + + + utf8_to_iso_8859_13 + UTF8 + LATIN7 + + + utf8_to_iso_8859_14 + UTF8 + LATIN8 + + + utf8_to_iso_8859_15 + UTF8 + LATIN9 + + + utf8_to_iso_8859_16 + UTF8 + LATIN10 + + + utf8_to_iso_8859_2 + UTF8 + LATIN2 + + + utf8_to_iso_8859_3 + UTF8 + LATIN3 + + + utf8_to_iso_8859_4 + UTF8 + LATIN4 + + + utf8_to_iso_8859_5 + UTF8 + ISO_8859_5 + + + utf8_to_iso_8859_6 + UTF8 + ISO_8859_6 + + + utf8_to_iso_8859_7 + UTF8 + ISO_8859_7 + + + utf8_to_iso_8859_8 + UTF8 + ISO_8859_8 + + + utf8_to_iso_8859_9 + UTF8 + LATIN5 + + + utf8_to_johab + UTF8 + JOHAB + + + utf8_to_koi8_r + UTF8 + KOI8R + + + utf8_to_koi8_u + UTF8 + KOI8U + + + utf8_to_sjis + UTF8 + SJIS + + + utf8_to_windows_1258 + UTF8 + WIN1258 + + + utf8_to_uhc + UTF8 + UHC + + + utf8_to_windows_1250 + UTF8 + WIN1250 + + + utf8_to_windows_1251 + UTF8 + WIN1251 + + + utf8_to_windows_1252 + UTF8 + WIN1252 + + + utf8_to_windows_1253 + UTF8 + WIN1253 + + + utf8_to_windows_1254 + UTF8 + WIN1254 + + + utf8_to_windows_1255 + UTF8 + WIN1255 + + + utf8_to_windows_1256 + UTF8 + WIN1256 + + + utf8_to_windows_1257 + UTF8 + WIN1257 + + + utf8_to_windows_866 + UTF8 + WIN866 + + + utf8_to_windows_874 + UTF8 + WIN874 + + + windows_1250_to_iso_8859_2 + WIN1250 + LATIN2 + + + windows_1250_to_mic + WIN1250 + MULE_INTERNAL + + + windows_1250_to_utf8 + WIN1250 + UTF8 + + + windows_1251_to_iso_8859_5 + WIN1251 + ISO_8859_5 + + + windows_1251_to_koi8_r + WIN1251 + KOI8R + + + windows_1251_to_mic + WIN1251 + MULE_INTERNAL + + + windows_1251_to_utf8 + WIN1251 + UTF8 + + + windows_1251_to_windows_866 + WIN1251 + WIN866 + + + windows_1252_to_utf8 + WIN1252 + UTF8 + + + windows_1256_to_utf8 + WIN1256 + UTF8 + + + windows_866_to_iso_8859_5 + WIN866 + ISO_8859_5 + + + windows_866_to_koi8_r + WIN866 + KOI8R + + + windows_866_to_mic + WIN866 + MULE_INTERNAL + + + windows_866_to_utf8 + WIN866 + UTF8 + + + windows_866_to_windows_1251 + WIN866 + WIN + + + windows_874_to_utf8 + WIN874 + UTF8 + + + euc_jis_2004_to_utf8 + EUC_JIS_2004 + UTF8 + + + utf8_to_euc_jis_2004 + UTF8 + EUC_JIS_2004 + + + shift_jis_2004_to_utf8 + SHIFT_JIS_2004 + UTF8 + + + utf8_to_shift_jis_2004 + UTF8 + SHIFT_JIS_2004 + + + euc_jis_2004_to_shift_jis_2004 + EUC_JIS_2004 + SHIFT_JIS_2004 + + + shift_jis_2004_to_euc_jis_2004 + SHIFT_JIS_2004 + EUC_JIS_2004 + + + +
+
+ + + Further Reading + + + These are good sources to start learning about various kinds of encoding + systems. + + + + CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing + + + + Contains detailed explanations of EUC_JP, + EUC_CN, EUC_KR, + EUC_TW. + + + + + + + + + + The web site of the Unicode Consortium. + + + + + + RFC 3629 + + + + UTF-8 (8-bit UCS/Unicode Transformation + Format) is defined here. + + + + + + + +
+ +
-- cgit v1.2.3