diff options
Diffstat (limited to 'man7/charsets.7')
-rw-r--r-- | man7/charsets.7 | 335 |
1 files changed, 335 insertions, 0 deletions
diff --git a/man7/charsets.7 b/man7/charsets.7 new file mode 100644 index 0000000..0692d8d --- /dev/null +++ b/man7/charsets.7 @@ -0,0 +1,335 @@ +.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com> +.\" and Copyright (c) Andries Brouwer <aeb@cwi.nl> +.\" +.\" SPDX-License-Identifier: GPL-2.0-or-later +.\" +.\" This is combined from many sources, including notes by aeb and +.\" research by esr. Portions derive from a writeup by Roman Czyborra. +.\" +.\" Changes also by David Starner <dstarner98@aasaa.ofe.org>. +.\" +.TH charsets 7 2023-03-12 "Linux man-pages 6.05.01" +.SH NAME +charsets \- character set standards and internationalization +.SH DESCRIPTION +This manual page gives an overview on different character set standards +and how they were used on Linux before Unicode became ubiquitous. +Some of this information is still helpful for people working with legacy +systems and documents. +.PP +Standards discussed include such as +ASCII, GB 2312, ISO 8859, JIS, KOI8-R, KS, and Unicode. +.PP +The primary emphasis is on character sets that were actually used by +locale character sets, not the myriad others that could be found in data +from other systems. +.SS ASCII +ASCII (American Standard Code For Information Interchange) is the original +7-bit character set, originally designed for American English. +Also known as US-ASCII. +It is currently described by the ISO 646:1991 IRV +(International Reference Version) standard. +.PP +Various ASCII variants replacing the dollar sign with other currency +symbols and replacing punctuation with non-English alphabetic +characters to cover German, French, Spanish, and others in 7 bits +emerged. +All are deprecated; +glibc does not support locales whose character sets are not true +supersets of ASCII. +.PP +As Unicode, when using UTF-8, is ASCII-compatible, plain ASCII text +still renders properly on modern UTF-8 using systems. +.SS ISO 8859 +ISO 8859 is a series of 15 8-bit character sets, all of which have ASCII +in their low (7-bit) half, invisible control characters in positions +128 to 159, and 96 fixed-width graphics in positions 160\[en]255. +.PP +Of these, the most important is ISO 8859-1 +("Latin Alphabet No. 1" / Latin-1). +It was widely adopted and supported by different systems, +and is gradually being replaced with Unicode. +The ISO 8859-1 characters are also the first 256 characters of Unicode. +.PP +Console support for the other 8859 character sets is available under +Linux through user-mode utilities (such as +.BR setfont (8)) +that modify keyboard bindings and the EGA graphics +table and employ the "user mapping" font table in the console +driver. +.PP +Here are brief descriptions of each character set: +.TP +8859-1 (Latin-1) +Latin-1 covers many European languages such as Albanian, Basque, +Danish, English, Faroese, Galician, Icelandic, Irish, Italian, +Norwegian, Portuguese, Spanish, and Swedish. +The lack of the ligatures +Dutch IJ/ij, +French œ, +and „German“ quotation marks +was considered tolerable. +.TP +8859-2 (Latin-2) +Latin-2 supports many Latin-written Central and East European +languages such as Bosnian, Croatian, Czech, German, Hungarian, Polish, +Slovak, and Slovene. +Replacing Romanian ș/ț with ş/ţ +was considered tolerable. +.TP +8859-3 (Latin-3) +Latin-3 was designed to cover of Esperanto, Maltese, and Turkish, but +8859-9 later superseded it for Turkish. +.TP +8859-4 (Latin-4) +Latin-4 introduced letters for North European languages such as +Estonian, Latvian, and Lithuanian, but was superseded by 8859-10 and +8859-13. +.TP +8859-5 +Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian, +Russian, Serbian, and (almost completely) Ukrainian. +It was never widely used, see the discussion of KOI8-R/KOI8-U below. +.TP +8859-6 +Was created for Arabic. +The 8859-6 glyph table is a fixed font of separate +letter forms, but a proper display engine should combine these +using the proper initial, medial, and final forms. +.TP +8859-7 +Was created for Modern Greek in 1987, updated in 2003. +.TP +8859-8 +Supports Modern Hebrew without niqud (punctuation signs). +Niqud and full-fledged Biblical Hebrew were outside the scope of this +character set. +.TP +8859-9 (Latin-5) +This is a variant of Latin-1 that replaces Icelandic letters with +Turkish ones. +.TP +8859-10 (Latin-6) +Latin-6 added the Inuit (Greenlandic) and Sami (Lappish) letters that were +missing in Latin-4 to cover the entire Nordic area. +.TP +8859-11 +Supports the Thai alphabet and is nearly identical to the TIS-620 +standard. +.TP +8859-12 +This character set does not exist. +.TP +8859-13 (Latin-7) +Supports the Baltic Rim languages; in particular, it includes Latvian +characters not found in Latin-4. +.TP +8859-14 (Latin-8) +This is the Celtic character set, covering Old Irish, Manx, Gaelic, +Welsh, Cornish, and Breton. +.TP +8859-15 (Latin-9) +Latin-9 is similar to the widely used Latin-1 but replaces some less +common symbols with the Euro sign and French and Finnish letters that +were missing in Latin-1. +.TP +8859-16 (Latin-10) +This character set covers many Southeast European languages, +and most importantly supports Romanian more completely than Latin-2. +.SS KOI8-R / KOI8-U +KOI8-R is a non-ISO character set popular in Russia before Unicode. +The lower half is ASCII; +the upper is a Cyrillic character set somewhat better designed than +ISO 8859-5. +KOI8-U, based on KOI8-R, has better support for Ukrainian. +Neither of these sets are ISO-2022 compatible, +unlike the ISO 8859 series. +.PP +Console support for KOI8-R is available under Linux through user-mode +utilities that modify keyboard bindings and the EGA graphics table, +and employ the "user mapping" font table in the console driver. +.SS GB 2312 +GB 2312 is a mainland Chinese national standard character set used +to express simplified Chinese. +Just like JIS X 0208, characters are +mapped into a 94x94 two-byte matrix used to construct EUC-CN. +EUC-CN +is the most important encoding for Linux and includes ASCII and +GB 2312. +Note that EUC-CN is often called as GB, GB 2312, or CN-GB. +.SS Big5 +Big5 was a popular character set in Taiwan to express traditional +Chinese. +(Big5 is both a character set and an encoding.) +It is a superset of ASCII. +Non-ASCII characters are expressed in two bytes. +Bytes 0xa1\[en]0xfe are used as leading bytes for two-byte characters. +Big5 and its extension were widely used in Taiwan and Hong Kong. +It is not ISO 2022 compliant. +.\" Thanks to Tomohiro KUBOTA for the following sections about +.\" national standards. +.SS JIS X 0208 +JIS X 0208 is a Japanese national standard character set. +Though there are some more Japanese national standard character sets (like +JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one. +Characters are mapped into a 94x94 two-byte matrix, +whose each byte is in the range 0x21\[en]0x7e. +Note that JIS X 0208 is a character set, not an encoding. +This means that JIS X 0208 +itself is not used for expressing text data. +JIS X 0208 is used +as a component to construct encodings such as EUC-JP, Shift_JIS, +and ISO-2022-JP. +EUC-JP is the most important encoding for Linux +and includes ASCII and JIS X 0208. +In EUC-JP, JIS X 0208 +characters are expressed in two bytes, each of which is the +JIS X 0208 code plus 0x80. +.SS KS X 1001 +KS X 1001 is a Korean national standard character set. +Just as +JIS X 0208, characters are mapped into a 94x94 two-byte matrix. +KS X 1001 is used like JIS X 0208, as a component +to construct encodings such as EUC-KR, Johab, and ISO-2022-KR. +EUC-KR is the most important encoding for Linux and includes +ASCII and KS X 1001. +KS C 5601 is an older name for KS X 1001. +.SS ISO 2022 and ISO 4873 +The ISO 2022 and 4873 standards describe a font-control model +based on VT100 practice. +This model is (partially) supported +by the Linux kernel and by +.BR xterm (1). +Several ISO 2022-based character encodings have been defined, +especially for Japanese. +.PP +There are 4 graphic character sets, called G0, G1, G2, and G3, +and one of them is the current character set for codes with +high bit zero (initially G0), and one of them is the current +character set for codes with high bit one (initially G1). +Each graphic character set has 94 or 96 characters, and is +essentially a 7-bit character set. +It uses codes either +040\[en]0177 (041\[en]0176) or 0240\[en]0377 (0241\[en]0376). +G0 always has size 94 and uses codes 041\[en]0176. +.PP +Switching between character sets is done using the shift functions +\fB\[ha]N\fP (SO or LS1), \fB\[ha]O\fP (SI or LS0), ESC n (LS2), ESC o (LS3), +ESC N (SS2), ESC O (SS3), ESC \[ti] (LS1R), ESC } (LS2R), ESC | (LS3R). +The function LS\fIn\fP makes character set G\fIn\fP the current one +for codes with high bit zero. +The function LS\fIn\fPR makes character set G\fIn\fP the current one +for codes with high bit one. +The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3) +the current one for the next character only (regardless of the value +of its high order bit). +.PP +A 94-character set is designated as G\fIn\fP character set +by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1), +ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol +or a pair of symbols found in the ISO 2375 International +Register of Coded Character Sets. +For example, ESC ( @ selects the ISO 646 character set as G0, +ESC ( A selects the UK standard character set (with pound +instead of number sign), ESC ( B selects ASCII (with dollar +instead of currency sign), ESC ( M selects a character set +for African languages, ESC ( ! A selects the Cuban character +set, and so on. +.PP +A 96-character set is designated as G\fIn\fP character set +by an escape sequence ESC \- xx (for G1), ESC . xx (for G2) +or ESC / xx (for G3). +For example, ESC \- G selects the Hebrew alphabet as G1. +.PP +A multibyte character set is designated as G\fIn\fP character set +by an escape sequence ESC $ xx or ESC $ ( xx (for G0), +ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3). +For example, ESC $ ( C selects the Korean character set for G0. +The Japanese character set selected by ESC $ B has a more +recent version selected by ESC & @ ESC $ B. +.PP +ISO 4873 stipulates a narrower use of character sets, where G0 +is fixed (always ASCII), so that G1, G2, and G3 +can be invoked only for codes with the high order bit set. +In particular, \fB\[ha]N\fP and \fB\[ha]O\fP are not used anymore, ESC ( xx +can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx +are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively. +.SS TIS-620 +TIS-620 is a Thai national standard character set and a superset +of ASCII. +In the same fashion as the ISO 8859 series, Thai characters are mapped into +0xa1\[en]0xfe. +.SS Unicode +Unicode (ISO/IEC 10646) is a standard which aims to unambiguously represent +every character in every human language. +Unicode's structure permits 20.1 bits to encode every character. +Since most computers don't include 20.1-bit integers, Unicode is +usually encoded as 32-bit integers internally and either a series of +16-bit integers (UTF-16) (needing two 16-bit integers only when +encoding certain rare characters) or a series of 8-bit bytes (UTF-8). +.PP +Linux represents Unicode using the 8-bit Unicode Transformation Format +(UTF-8). +UTF-8 is a variable length encoding of Unicode. +It uses 1 +byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes +for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits. +.PP +Let 0,1,x stand for a zero, one, or arbitrary bit. +A byte 0xxxxxxx +stands for the Unicode 00000000 0xxxxxxx which codes the same symbol +as the ASCII 0xxxxxxx. +Thus, ASCII goes unchanged into UTF-8, and +people using only ASCII do not notice any change: not in code, and not +in file size. +.PP +A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy +is assembled into 00000xxx xxyyyyyy. +A byte 1110xxxx is the start +of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled +into xxxxyyyy yyzzzzzz. +(When UTF-8 is used to code the 31-bit ISO/IEC 10646 +then this progression continues up to 6-byte codes.) +.PP +For most texts in ISO 8859 character sets, this means that the +characters outside of ASCII are now coded with two bytes. +This tends +to expand ordinary text files by only one or two percent. +For Russian +or Greek texts, this expands ordinary text files by 100%, since text in +those languages is mostly outside of ASCII. +For Japanese users this means +that the 16-bit codes now in common use will take three bytes. +While there are algorithmic conversions from some character sets +(especially ISO 8859-1) to Unicode, general conversion requires +carrying around conversion tables, which can be quite large for 16-bit +codes. +.PP +Note that UTF-8 is self-synchronizing: +10xxxxxx is a tail, +any other byte is the head of a code. +Note that the only way ASCII bytes occur in a UTF-8 stream, +is as themselves. +In particular, +there are no embedded NULs (\[aq]\e0\[aq]) or \[aq]/\[aq]s +that form part of some larger code. +.PP +Since ASCII, and, in particular, NUL and \[aq]/\[aq], are unchanged, the +kernel does not notice that UTF-8 is being used. +It does not care at +all what the bytes it is handling stand for. +.PP +Rendering of Unicode data streams is typically handled through +"subfont" tables which map a subset of Unicode to glyphs. +Internally +the kernel uses Unicode to describe the subfont loaded in video RAM. +This means that in the Linux console in UTF-8 mode, one can use a character +set with 512 different symbols. +This is not enough for Japanese, Chinese, and +Korean, but it is enough for most other purposes. +.SH SEE ALSO +.BR iconv (1), +.BR ascii (7), +.BR iso_8859\-1 (7), +.BR unicode (7), +.BR utf\-8 (7) |