Diffstat (limited to 'man7/charsets.7')
-rw-r--r-- man7/charsets.7 335
1 files changed, 0 insertions, 335 deletions
diff --git a/man7/charsets.7 b/man7/charsets.7
deleted file mode 100644
index eb9f8f8..0000000
--- a/man7/charsets.7
+++ /dev/null
@@ -1,335 +0,0 @@
-.\" Copyright (c) 1996 Eric S. Raymond <esr@thyrsus.com>
-.\" and Copyright (c) Andries Brouwer <aeb@cwi.nl>
-.\"
-.\" SPDX-License-Identifier: GPL-2.0-or-later
-.\"
-.\" This is combined from many sources, including notes by aeb and
-.\" research by esr. Portions derive from a writeup by Roman Czyborra.
-.\"
-.\" Changes also by David Starner <dstarner98@aasaa.ofe.org>.
-.\"
-.TH charsets 7 2024-01-28 "Linux man-pages 6.7"
-.SH NAME
-charsets \- character set standards and internationalization
-.SH DESCRIPTION
-This manual page gives an overview of different character set standards
-and how they were used on Linux before Unicode became ubiquitous.
-Some of this information is still helpful for people working with legacy
-systems and documents.
-.P
-Standards discussed include
-ASCII, GB 2312, ISO/IEC\~8859, JIS, KOI8-R, KS, and Unicode.
-.P
-The primary emphasis is on character sets that were actually used by
-locale character sets, not the myriad others that could be found in data
-from other systems.
-.SS ASCII
-ASCII (American Standard Code for Information Interchange) is the original
-7-bit character set, designed for American English.
-Also known as US-ASCII.
-It is currently described by the ISO/IEC\~646:1991 IRV
-(International Reference Version) standard.
-.P
-Various ASCII variants replacing the dollar sign with other currency
-symbols and replacing punctuation with non-English alphabetic
-characters to cover German, French, Spanish, and others in 7 bits
-emerged.
-All are deprecated;
-glibc does not support locales whose character sets are not true
-supersets of ASCII.
-.P
-Because UTF-8 encoded Unicode is ASCII-compatible, plain ASCII text
-still renders properly on modern systems that use UTF-8.
-.SS ISO/IEC\~8859
-ISO/IEC\~8859 is a series of 15 8-bit character sets, all of which have ASCII
-in their low (7-bit) half, invisible control characters in positions
-128 to 159, and 96 fixed-width graphics in positions 160\[en]255.
-.P
-Of these, the most important is ISO/IEC\~8859-1
-("Latin Alphabet No. 1" / Latin-1).
-It was widely adopted and supported by different systems,
-and is gradually being replaced with Unicode.
-The ISO/IEC\~8859-1 characters are also the first 256 characters of Unicode.
-.P
-Console support for the other ISO/IEC\~8859 character sets is available under
-Linux through user-mode utilities (such as
-.BR setfont (8))
-that modify keyboard bindings and the EGA graphics
-table and employ the "user mapping" font table in the console
-driver.
-.P
-Here are brief descriptions of each character set:
-.TP
-ISO/IEC\~8859-1 (Latin-1)
-Latin-1 covers many European languages such as Albanian, Basque,
-Danish, English, Faroese, Galician, Icelandic, Irish, Italian,
-Norwegian, Portuguese, Spanish, and Swedish.
-The lack of
-the Dutch ligature IJ/ij,
-the French ligature œ,
-and „German“ quotation marks
-was considered tolerable.
-.TP
-ISO/IEC\~8859-2 (Latin-2)
-Latin-2 supports many Latin-written Central and East European
-languages such as Bosnian, Croatian, Czech, German, Hungarian, Polish,
-Slovak, and Slovene.
-Replacing Romanian ș/ț with ş/ţ
-was considered tolerable.
-.TP
-ISO/IEC\~8859-3 (Latin-3)
-Latin-3 was designed to cover Esperanto, Maltese, and Turkish, but
-ISO/IEC\~8859-9 later superseded it for Turkish.
-.TP
-ISO/IEC\~8859-4 (Latin-4)
-Latin-4 introduced letters for North European languages such as
-Estonian, Latvian, and Lithuanian, but was superseded by ISO/IEC\~8859-10 and
-ISO/IEC\~8859-13.
-.TP
-ISO/IEC\~8859-5
-Cyrillic letters supporting Bulgarian, Byelorussian, Macedonian,
-Russian, Serbian, and (almost completely) Ukrainian.
-It was never widely used; see the discussion of KOI8-R/KOI8-U below.
-.TP
-ISO/IEC\~8859-6
-Was created for Arabic.
-The ISO/IEC\~8859-6 glyph table is a fixed font of separate
-letter forms, but a proper display engine should combine these
-using the proper initial, medial, and final forms.
-.TP
-ISO/IEC\~8859-7
-Was created for Modern Greek in 1987, updated in 2003.
-.TP
-ISO/IEC\~8859-8
-Supports Modern Hebrew without niqud (vowel points).
-Niqud and full-fledged Biblical Hebrew were outside the scope of this
-character set.
-.TP
-ISO/IEC\~8859-9 (Latin-5)
-This is a variant of Latin-1 that replaces Icelandic letters with
-Turkish ones.
-.TP
-ISO/IEC\~8859-10 (Latin-6)
-Latin-6 added the Inuit (Greenlandic) and Sami (Lappish) letters that were
-missing in Latin-4 to cover the entire Nordic area.
-.TP
-ISO/IEC\~8859-11
-Supports the Thai alphabet and is nearly identical to the TIS-620
-standard.
-.TP
-ISO/IEC\~8859-12
-This character set does not exist.
-.TP
-ISO/IEC\~8859-13 (Latin-7)
-Supports the Baltic Rim languages; in particular, it includes Latvian
-characters not found in Latin-4.
-.TP
-ISO/IEC\~8859-14 (Latin-8)
-This is the Celtic character set, covering Old Irish, Manx, Gaelic,
-Welsh, Cornish, and Breton.
-.TP
-ISO/IEC\~8859-15 (Latin-9)
-Latin-9 is similar to the widely used Latin-1 but replaces some less
-common symbols with the Euro sign and French and Finnish letters that
-were missing in Latin-1.
-.TP
-ISO/IEC\~8859-16 (Latin-10)
-This character set covers many Southeast European languages,
-and most importantly supports Romanian more completely than Latin-2.
-.SS KOI8-R / KOI8-U
-KOI8-R is a non-ISO character set popular in Russia before Unicode.
-The lower half is ASCII;
-the upper is a Cyrillic character set somewhat better designed than
-ISO/IEC\~8859-5.
-KOI8-U, based on KOI8-R, has better support for Ukrainian.
-Neither of these sets is ISO/IEC\~2022 compatible,
-unlike the ISO/IEC\~8859 series.
-.P
-Console support for KOI8-R is available under Linux through user-mode
-utilities that modify keyboard bindings and the EGA graphics table,
-and employ the "user mapping" font table in the console driver.
-.SS GB 2312
-GB 2312 is a mainland Chinese national standard character set used
-to express simplified Chinese.
-Just as in JIS X 0208, characters are
-mapped into a 94x94 two-byte matrix used to construct EUC-CN.
-EUC-CN
-is the most important encoding for Linux and includes ASCII and
-GB 2312.
-Note that EUC-CN is often called GB, GB 2312, or CN-GB.
-.SS Big5
-Big5 was a popular character set in Taiwan to express traditional
-Chinese.
-(Big5 is both a character set and an encoding.)
-It is a superset of ASCII.
-Non-ASCII characters are expressed in two bytes.
-Bytes 0xa1\[en]0xfe are used as leading bytes for two-byte characters.
-Big5 and its extension were widely used in Taiwan and Hong Kong.
-It is not ISO/IEC\~2022 compliant.
-.\" Thanks to Tomohiro KUBOTA for the following sections about
-.\" national standards.
-.SS JIS X 0208
-JIS X 0208 is a Japanese national standard character set.
-Though there are some more Japanese national standard character sets (like
-JIS X 0201, JIS X 0212, and JIS X 0213), this is the most important one.
-Characters are mapped into a 94x94 two-byte matrix,
-in which each byte is in the range 0x21\[en]0x7e.
-Note that JIS X 0208 is a character set, not an encoding.
-This means that JIS X 0208
-itself is not used for expressing text data.
-JIS X 0208 is used
-as a component to construct encodings such as EUC-JP, Shift_JIS,
-and ISO/IEC\~2022-JP.
-EUC-JP is the most important encoding for Linux
-and includes ASCII and JIS X 0208.
-In EUC-JP, JIS X 0208
-characters are expressed in two bytes, each of which is the
-JIS X 0208 code plus 0x80.
-.SS KS X 1001
-KS X 1001 is a Korean national standard character set.
-Just as in
-JIS X 0208, characters are mapped into a 94x94 two-byte matrix.
-KS X 1001 is used like JIS X 0208, as a component
-to construct encodings such as EUC-KR, Johab, and ISO/IEC\~2022-KR.
-EUC-KR is the most important encoding for Linux and includes
-ASCII and KS X 1001.
-KS C 5601 is an older name for KS X 1001.
-.SS ISO/IEC\~2022 and ISO/IEC\~4873
-The ISO/IEC\~2022 and ISO/IEC\~4873 standards describe a font-control model
-based on VT100 practice.
-This model is (partially) supported
-by the Linux kernel and by
-.BR xterm (1).
-Several ISO/IEC\~2022-based character encodings have been defined,
-especially for Japanese.
-.P
-There are 4 graphic character sets, called G0, G1, G2, and G3.
-One of them is the current character set for codes with
-high bit zero (initially G0), and one is the current
-character set for codes with high bit one (initially G1).
-Each graphic character set has 94 or 96 characters, and is
-essentially a 7-bit character set.
-It uses codes either 040\[en]0177 or 0240\[en]0377
-(041\[en]0176 or 0241\[en]0376 for a 94-character set).
-G0 always has size 94 and uses codes 041\[en]0176.
-.P
-Switching between character sets is done using the shift functions
-\fB\[ha]N\fP (SO or LS1), \fB\[ha]O\fP (SI or LS0), ESC n (LS2), ESC o (LS3),
-ESC N (SS2), ESC O (SS3), ESC \[ti] (LS1R), ESC } (LS2R), ESC | (LS3R).
-The function LS\fIn\fP makes character set G\fIn\fP the current one
-for codes with high bit zero.
-The function LS\fIn\fPR makes character set G\fIn\fP the current one
-for codes with high bit one.
-The function SS\fIn\fP makes character set G\fIn\fP (\fIn\fP=2 or 3)
-the current one for the next character only (regardless of the value
-of its high order bit).
-.P
-A 94-character set is designated as the G\fIn\fP character set
-by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),
-ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol
-or a pair of symbols found in the ISO/IEC\~2375 International
-Register of Coded Character Sets.
-For example, ESC ( @ selects the ISO/IEC\~646 character set as G0,
-ESC ( A selects the UK standard character set (with pound
-instead of number sign), ESC ( B selects ASCII (with dollar
-instead of currency sign), ESC ( M selects a character set
-for African languages, ESC ( ! A selects the Cuban character
-set, and so on.
-.P
-A 96-character set is designated as the G\fIn\fP character set
-by an escape sequence ESC \- xx (for G1), ESC . xx (for G2)
-or ESC / xx (for G3).
-For example, ESC \- G selects the Hebrew alphabet as G1.
-.P
-A multibyte character set is designated as the G\fIn\fP character set
-by an escape sequence ESC $ xx or ESC $ ( xx (for G0),
-ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ + xx (for G3).
-For example, ESC $ ( C selects the Korean character set for G0.
-The Japanese character set selected by ESC $ B has a more
-recent version selected by ESC & @ ESC $ B.
-.P
-ISO/IEC\~4873 stipulates a narrower use of character sets, where G0
-is fixed (always ASCII), so that G1, G2, and G3
-can be invoked only for codes with the high order bit set.
-In particular, \fB\[ha]N\fP and \fB\[ha]O\fP are not used anymore, ESC ( xx
-can be used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx
-are equivalent to ESC \- xx, ESC . xx, ESC / xx, respectively.
-.SS TIS-620
-TIS-620 is a Thai national standard character set and a superset
-of ASCII.
-In the same fashion as the ISO/IEC\~8859 series, Thai characters are mapped into
-0xa1\[en]0xfe.
-.SS Unicode
-Unicode (ISO/IEC 10646) is a standard which aims to unambiguously represent
-every character in every human language.
-Unicode's code space (0 through 0x10FFFF) needs about 20.1 bits per character.
-Since most computers don't include 20.1-bit integers, Unicode is
-usually encoded as 32-bit integers internally and either a series of
-16-bit integers (UTF-16) (needing two 16-bit integers only when
-encoding certain rare characters) or a series of 8-bit bytes (UTF-8).
-.P
-Linux represents Unicode using the 8-bit Unicode Transformation Format
-(UTF-8).
-UTF-8 is a variable length encoding of Unicode.
-It uses 1
-byte to code 7 bits, 2 bytes for 11 bits, 3 bytes for 16 bits, 4 bytes
-for 21 bits, 5 bytes for 26 bits, 6 bytes for 31 bits.
-.P
-Let 0,1,x stand for a zero, one, or arbitrary bit.
-A byte 0xxxxxxx
-stands for the Unicode 00000000 0xxxxxxx which codes the same symbol
-as the ASCII 0xxxxxxx.
-Thus, ASCII goes unchanged into UTF-8, and
-people using only ASCII do not notice any change: not in code, and not
-in file size.
-.P
-A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy
-is assembled into 00000xxx xxyyyyyy.
-A byte 1110xxxx is the start
-of a 3-byte code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled
-into xxxxyyyy yyzzzzzz.
-(When UTF-8 is used to code the 31-bit ISO/IEC 10646
-then this progression continues up to 6-byte codes.)
-.P
-For most texts in ISO/IEC\~8859 character sets, this means that the
-characters outside of ASCII are now coded with two bytes.
-This tends
-to expand ordinary text files by only one or two percent.
-For Russian
-or Greek texts, this expands ordinary text files by 100%, since text in
-those languages is mostly outside of ASCII.
-For Japanese users this means
-that the 16-bit codes now in common use will take three bytes.
-While there are algorithmic conversions from some character sets
-(especially ISO/IEC\~8859-1) to Unicode, general conversion requires
-carrying around conversion tables, which can be quite large for 16-bit
-codes.
-.P
-Note that UTF-8 is self-synchronizing:
-10xxxxxx is a tail,
-any other byte is the head of a code.
-Note that ASCII bytes occur in a UTF-8 stream
-only as themselves.
-In particular,
-there are no embedded NULs (\[aq]\e0\[aq]) or \[aq]/\[aq]s
-that form part of some larger code.
-.P
-Since ASCII, and, in particular, NUL and \[aq]/\[aq], are unchanged, the
-kernel does not notice that UTF-8 is being used.
-It does not care at
-all what the bytes it is handling stand for.
-.P
-Rendering of Unicode data streams is typically handled through
-"subfont" tables which map a subset of Unicode to glyphs.
-Internally
-the kernel uses Unicode to describe the subfont loaded in video RAM.
-This means that in the Linux console in UTF-8 mode, one can use a character
-set with 512 different symbols.
-This is not enough for Japanese, Chinese, and
-Korean, but it is enough for most other purposes.
-.SH SEE ALSO
-.BR iconv (1),
-.BR ascii (7),
-.BR iso_8859\-1 (7),
-.BR unicode (7),
-.BR utf\-8 (7)
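
The UTF-8 byte layouts described in the Unicode section of the removed page can be sketched in Python as follows. This is only an illustration under stated assumptions: the names utf8_encode, is_tail, and count_code_points are hypothetical, the 4-byte form follows the same pattern as the 2- and 3-byte forms given in the text, the historical 5- and 6-byte forms are left out, and surrogate code points are not rejected as a conforming encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point with the bit layouts from the text (1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        # 0xxxxxxx: ASCII goes through unchanged.
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10yyyyyy carries 11 bits.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10yyyyyy 10zzzzzz carries 16 bits.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx followed by three 10xxxxxx tails carries 21 bits.
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def is_tail(byte: int) -> bool:
    """A byte of the form 10xxxxxx is a tail; any other byte starts a code."""
    return (byte & 0xC0) == 0x80


def count_code_points(data: bytes) -> int:
    """Count characters by counting head bytes (assumes well-formed UTF-8)."""
    return sum(1 for b in data if not is_tail(b))
```

Because every tail byte has the form 10xxxxxx, a reader that starts in the middle of a stream can skip bytes until is_tail returns False and resume decoding there, which is the self-synchronizing property the page mentions.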