Diffstat (limited to 'man7/charsets.7')
-rw-r--r-- | man7/charsets.7 | 335 |
1 files changed, 0 insertions, 335 deletions
diff --git a/man7/charsets.7 b/man7/charsets.7
deleted file mode 100644
index eb9f8f8..0000000
--- a/man7/charsets.7
+++ /dev/null
@@ -1,335 +0,0 @@
-.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com>
-.\" and Copyright (c) Andries Brouwer <aeb@cwi.nl>
-.\"
-.\" SPDX-License-Identifier: GPL-2.0-or-later
-.\"
-.\" This is combined from many sources, including notes by aeb and
-.\" research by esr. Portions derive from a writeup by Roman Czyborra.
-.\"
-.\" Changes also by David Starner <dstarner98@aasaa.ofe.org>.
-.\"
-.TH charsets 7 2024-01-28 "Linux man-pages 6.7"
-.SH NAME
-charsets \- character set standards and internationalization
-.SH DESCRIPTION
-This manual page gives an overview on different character set standards
-and how they were used on Linux before Unicode became ubiquitous.
-Some of this information is still helpful for people working with legacy
-systems and documents.
-.P
-Standards discussed include such as
-ASCII, GB 2312, ISO/IEC\~8859, JIS, KOI8-R, KS, and Unicode.
-.P
-The primary emphasis is on character sets that were actually used by
-locale character sets, not the myriad others that could be found in data
-from other systems.
-.SS ASCII
-ASCII (American Standard Code For Information Interchange) is the original
-7-bit character set, originally designed for American English.
-Also known as US-ASCII.
-It is currently described by the ISO/IEC\~646:1991 IRV
-(International Reference Version) standard.
-.P
-Various ASCII variants replacing the dollar sign with other currency
-symbols and replacing punctuation with non-English alphabetic
-characters to cover German, French, Spanish, and others in 7 bits
-emerged.
-All are deprecated;
-glibc does not support locales whose character sets are not true
-supersets of ASCII.
-.P
-As Unicode, when using UTF-8, is ASCII-compatible, plain ASCII text
-still renders properly on modern UTF-8 using systems.
-.SS ISO/IEC\~8859
-ISO/IEC\~8859 is a series of 15 8-bit character sets, all of which have ASCII
-in their low (7-bit) half, invisible control characters in positions
-128 to 159, and 96 fixed-width graphics in positions 160\[en]255.
-.P
-Of these, the most important is ISO/IEC\~8859-1
-("Latin Alphabet No. 1" / Latin-1).
-It was widely adopted and supported by different systems,
-and is gradually being replaced with Unicode.
-The ISO/IEC\~8859-1 characters are also the first 256 characters of Unicode.
-.P
-Console support for the other ISO/IEC\~8859 character sets is available under
-Linux through user-mode utilities (such as
-.BR setfont (8))
-that modify keyboard bindings and the EGA graphics
-table and employ the "user mapping" font table in the console
-driver.
-.P
-Here are brief descriptions of each character set:
-.TP
-ISO/IEC\~8859-1 (Latin-1)
-Latin-1 covers many European languages such as Albanian, Basque,
-Danish, English, Faroese, Galician, Icelandic, Irish, Italian,
-Norwegian, Portuguese, Spanish, and Swedish.
-The lack of the ligatures
-Dutch IJ/ij,
-French œ,
-and „German“ quotation marks
-was considered tolerable.
-.TP
-ISO/IEC\~8859-2 (Latin-2)
-Latin-2 supports many Latin-written Central and East European
-languages such as Bosnian, Croatian, Czech, German, Hungarian, Polish,
-Slovak, and Slovene.
-Replacing Romanian ș/ț with ş/ţ
-was considered tolerable.
-.TP
-ISO/IEC\~8859-3 (Latin-3)
-Latin-3 was designed to cover of Esperanto, Maltese, and Turkish, but
-ISO/IEC\~8859-9 later superseded it for Turkish.
-.TP
-ISO/IEC\~8859-4 (Latin-4)
-Latin-4 introduced letters for North European languages such as
-Estonian, Latvian, and Lithuanian, but was superseded by ISO/IEC\~8859-10 and
-ISO/IEC\~8859-13.
-.TP
-ISO/IEC\~8859-5
-Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
-Russian, Serbian, and (almost completely) Ukrainian.
-It was never widely used, see the discussion of KOI8-R/KOI8-U below.
-.TP
-ISO/IEC\~8859-6
-Was created for Arabic.
-The ISO/IEC\~8859-6 glyph table is a fixed font of separate
-letter forms, but a proper display engine should combine these
-using the proper initial, medial, and final forms.
-.TP
-ISO/IEC\~8859-7
-Was created for Modern Greek in 1987, updated in 2003.
-.TP
-ISO/IEC\~8859-8
-Supports Modern Hebrew without niqud (punctuation signs).
-Niqud and full-fledged Biblical Hebrew were outside the scope of this
-character set.
-.TP
-ISO/IEC\~8859-9 (Latin-5)
-This is a variant of Latin-1 that replaces Icelandic letters with
-Turkish ones.
-.TP
-ISO/IEC\~8859-10 (Latin-6)
-Latin-6 added the Inuit (Greenlandic) and Sami (Lappish) letters that were
-missing in Latin-4 to cover the entire Nordic area.
-.TP
-ISO/IEC\~8859-11
-Supports the Thai alphabet and is nearly identical to the TIS-620
-standard.
-.TP
-ISO/IEC\~8859-12
-This character set does not exist.
-.TP
-ISO/IEC\~8859-13 (Latin-7)
-Supports the Baltic Rim languages; in particular, it includes Latvian
-characters not found in Latin-4.
-.TP
-ISO/IEC\~8859-14 (Latin-8)
-This is the Celtic character set, covering Old Irish, Manx, Gaelic,
-Welsh, Cornish, and Breton.
-.TP
-ISO/IEC\~8859-15 (Latin-9)
-Latin-9 is similar to the widely used Latin-1 but replaces some less
-common symbols with the Euro sign and French and Finnish letters that
-were missing in Latin-1.
-.TP
-ISO/IEC\~8859-16 (Latin-10)
-This character set covers many Southeast European languages,
-and most importantly supports Romanian more completely than Latin-2.
-.SS KOI8-R / KOI8-U
-KOI8-R is a non-ISO character set popular in Russia before Unicode.
-The lower half is ASCII;
-the upper is a Cyrillic character set somewhat better designed than
-ISO/IEC\~8859-5.
-KOI8-U, based on KOI8-R, has better support for Ukrainian.
-Neither of these sets are ISO/IEC\~2022 compatible,
-unlike the ISO/IEC\~8859 series.
-.P
-Console support for KOI8-R is available under Linux through user-mode
-utilities that modify keyboard bindings and the EGA graphics table,
-and employ the "user mapping" font table in the console driver.
-.SS GB 2312
-GB 2312 is a mainland Chinese national standard character set used
-to express simplified Chinese.
-Just like JIS X 0208, characters are
-mapped into a 94x94 two-byte matrix used to construct EUC-CN.
-EUC-CN
-is the most important encoding for Linux and includes ASCII and
-GB 2312.
-Note that EUC-CN is often called as GB, GB 2312, or CN-GB.
-.SS Big5
-Big5 was a popular character set in Taiwan to express traditional
-Chinese.
-(Big5 is both a character set and an encoding.)
-It is a superset of ASCII.
-Non-ASCII characters are expressed in two bytes.
-Bytes 0xa1\[en]0xfe are used as leading bytes for two-byte characters.
-Big5 and its extension were widely used in Taiwan and Hong Kong.
-It is not ISO/IEC\~2022 compliant.
-.\" Thanks to Tomohiro KUBOTA for the following sections about
-.\" national standards.
-.SS JIS X 0208
-JIS X 0208 is a Japanese national standard character set.
-Though there are some more Japanese national standard character sets (like
-JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one.
-Characters are mapped into a 94x94 two-byte matrix,
-whose each byte is in the range 0x21\[en]0x7e.
-Note that JIS X 0208 is a character set, not an encoding.
-This means that JIS X 0208
-itself is not used for expressing text data.
-JIS X 0208 is used
-as a component to construct encodings such as EUC-JP, Shift_JIS,
-and ISO/IEC\~2022-JP.
-EUC-JP is the most important encoding for Linux
-and includes ASCII and JIS X 0208.
-In EUC-JP, JIS X 0208
-characters are expressed in two bytes, each of which is the
-JIS X 0208 code plus 0x80.
-.SS KS X 1001
-KS X 1001 is a Korean national standard character set.
-Just as
-JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
-KS X 1001 is used like JIS X 0208, as a component
-to construct encodings such as EUC-KR, Johab, and ISO/IEC\~2022-KR.
-EUC-KR is the most important encoding for Linux and includes
-ASCII and KS X 1001.
-KS C 5601 is an older name for KS X 1001.
-.SS ISO/IEC\~2022 and ISO/IEC\~4873
-The ISO/IEC\~2022 and ISO/IEC\~4873 standards describe a font-control model
-based on VT100 practice.
-This model is (partially) supported
-by the Linux kernel and by
-.BR xterm (1).
-Several ISO/IEC\~2022-based character encodings have been defined,
-especially for Japanese.
-.P
-There are 4 graphic character sets, called G0, G1, G2, and G3,
-and one of them is the current character set for codes with
-high bit zero (initially G0), and one of them is the current
-character set for codes with high bit one (initially G1).
-Each graphic character set has 94 or 96 characters, and is
-essentially a 7-bit character set.
-It uses codes either
-040\[en]0177 (041\[en]0176) or 0240\[en]0377 (0241\[en]0376).
-G0 always has size 94 and uses codes 041\[en]0176.
-.P
-Switching between character sets is done using the shift functions
-\fB\[ha]N\fP (SO or LS1), \fB\[ha]O\fP (SI or LS0), ESC n (LS2), ESC o (LS3),
-ESC N (SS2), ESC O (SS3), ESC \[ti] (LS1R), ESC } (LS2R), ESC | (LS3R).
-The function LS\fIn\fP makes character set G\fIn\fP the current one
-for codes with high bit zero.
-The function LS\fIn\fPR makes character set G\fIn\fP the current one
-for codes with high bit one.
-The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3)
-the current one for the next character only (regardless of the value
-of its high order bit).
-.P
-A 94-character set is designated as G\fIn\fP character set
-by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),
-ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol
-or a pair of symbols found in the ISO/IEC\~2375 International
-Register of Coded Character Sets.
-For example, ESC ( @ selects the ISO/IEC\~646 character set as G0,
-ESC ( A selects the UK standard character set (with pound
-instead of number sign), ESC ( B selects ASCII (with dollar
-instead of currency sign), ESC ( M selects a character set
-for African languages, ESC ( ! A selects the Cuban character
-set, and so on.
-.P
-A 96-character set is designated as G\fIn\fP character set
-by an escape sequence ESC \- xx (for G1), ESC . xx (for G2)
-or ESC / xx (for G3).
-For example, ESC \- G selects the Hebrew alphabet as G1.
-.P
-A multibyte character set is designated as G\fIn\fP character set
-by an escape sequence ESC $ xx or ESC $ ( xx (for G0),
-ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3).
-For example, ESC $ ( C selects the Korean character set for G0.
-The Japanese character set selected by ESC $ B has a more
-recent version selected by ESC & @ ESC $ B.
-.P
-ISO/IEC\~4873 stipulates a narrower use of character sets, where G0
-is fixed (always ASCII), so that G1, G2, and G3
-can be invoked only for codes with the high order bit set.
-In particular, \fB\[ha]N\fP and \fB\[ha]O\fP are not used anymore, ESC ( xx
-can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
-are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively.
-.SS TIS-620
-TIS-620 is a Thai national standard character set and a superset
-of ASCII.
-In the same fashion as the ISO/IEC\~8859 series, Thai characters are mapped into
-0xa1\[en]0xfe.
-.SS Unicode
-Unicode (ISO/IEC 10646) is a standard which aims to unambiguously represent
-every character in every human language.
-Unicode's structure permits 20.1 bits to encode every character.
-Since most computers don't include 20.1-bit integers, Unicode is
-usually encoded as 32-bit integers internally and either a series of
-16-bit integers (UTF-16) (needing two 16-bit integers only when
-encoding certain rare characters) or a series of 8-bit bytes (UTF-8).
-.P
-Linux represents Unicode using the 8-bit Unicode Transformation Format
-(UTF-8).
-UTF-8 is a variable length encoding of Unicode.
-It uses 1
-byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes
-for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.
-.P
-Let 0,1,x stand for a zero, one, or arbitrary bit.
-A byte 0xxxxxxx
-stands for the Unicode 00000000 0xxxxxxx which codes the same symbol
-as the ASCII 0xxxxxxx.
-Thus, ASCII goes unchanged into UTF-8, and
-people using only ASCII do not notice any change: not in code, and not
-in file size.
-.P
-A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy
-is assembled into 00000xxx xxyyyyyy.
-A byte 1110xxxx is the start
-of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled
-into xxxxyyyy yyzzzzzz.
-(When UTF-8 is used to code the 31-bit ISO/IEC 10646
-then this progression continues up to 6-byte codes.)
-.P
-For most texts in ISO/IEC\~8859 character sets, this means that the
-characters outside of ASCII are now coded with two bytes.
-This tends
-to expand ordinary text files by only one or two percent.
-For Russian
-or Greek texts, this expands ordinary text files by 100%, since text in
-those languages is mostly outside of ASCII.
-For Japanese users this means
-that the 16-bit codes now in common use will take three bytes.
-While there are algorithmic conversions from some character sets
-(especially ISO/IEC\~8859-1) to Unicode, general conversion requires
-carrying around conversion tables, which can be quite large for 16-bit
-codes.
-.P
-Note that UTF-8 is self-synchronizing:
-10xxxxxx is a tail,
-any other byte is the head of a code.
-Note that the only way ASCII bytes occur in a UTF-8 stream,
-is as themselves.
-In particular,
-there are no embedded NULs (\[aq]\e0\[aq]) or \[aq]/\[aq]s
-that form part of some larger code.
-.P
-Since ASCII, and, in particular, NUL and \[aq]/\[aq], are unchanged, the
-kernel does not notice that UTF-8 is being used.
-It does not care at
-all what the bytes it is handling stand for.
-.P
-Rendering of Unicode data streams is typically handled through
-"subfont" tables which map a subset of Unicode to glyphs.
-Internally
-the kernel uses Unicode to describe the subfont loaded in video RAM.
-This means that in the Linux console in UTF-8 mode, one can use a character
-set with 512 different symbols.
-This is not enough for Japanese, Chinese, and
-Korean, but it is enough for most other purposes.
-.SH SEE ALSO
-.BR iconv (1),
-.BR ascii (7),
-.BR iso_8859\-1 (7),
-.BR unicode (7),
-.BR utf\-8 (7)
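
For readers of the removed page, the UTF-8 bit layouts it describes (0xxxxxxx, 110xxxxx 10yyyyyy, 1110xxxx 10yyyyyy 10zzzzzz) can be sketched roughly as below. This is an illustrative sketch and not part of the commit: the names `utf8_encode` and `is_tail` are hypothetical, and the code stops at the 4-byte form used by current UTF-8 rather than the 5- and 6-byte forms the page mentions for the 31-bit ISO/IEC 10646 range.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand (hypothetical helper, 1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx: ASCII passes through unchanged.
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10yyyyyy: 11 payload bits.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10yyyyyy 10zzzzzz: 16 payload bits.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx followed by three tail bytes: 21 payload bits.
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def is_tail(byte: int) -> bool:
    """True for 10xxxxxx continuation bytes; any other byte starts a code."""
    return (byte & 0xC0) == 0x80


if __name__ == "__main__":
    for ch in "A", "é", "€":
        encoded = utf8_encode(ord(ch))
        # Cross-check the hand-rolled encoder against Python's built-in codec.
        assert encoded == ch.encode("utf-8")
        print(ch, encoded.hex(" "))
```

The `is_tail` check is what makes resynchronization possible: a reader dropped into the middle of a stream can skip bytes for which it returns true until it reaches the head of the next code.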