diff options
Diffstat (limited to 'man7/unicode.7')
-rw-r--r-- | man7/unicode.7 | 246 |
1 files changed, 0 insertions, 246 deletions
diff --git a/man7/unicode.7 b/man7/unicode.7 deleted file mode 100644 index fe38909..0000000 --- a/man7/unicode.7 +++ /dev/null @@ -1,246 +0,0 @@ -.\" Copyright (C) Markus Kuhn, 1995, 2001 -.\" -.\" SPDX-License-Identifier: GPL-2.0-or-later -.\" -.\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de> -.\" First version written -.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk> -.\" Update -.\" -.TH unicode 7 2024-01-28 "Linux man-pages 6.7" -.SH NAME -unicode \- universal character set -.SH DESCRIPTION -The international standard ISO/IEC 10646 defines the -Universal Character Set (UCS). -UCS contains all characters of all other character set standards. -It also guarantees "round-trip compatibility"; -in other words, -conversion tables can be built such that no information is lost -when a string is converted from any other encoding to UCS and back. -.P -UCS contains the characters required to represent practically all -known languages. -This includes not only the Latin, Greek, Cyrillic, -Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, -Japanese and Korean Han ideographs as well as scripts such as -Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, -Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, -Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, -Ogham, Myanmar, Sinhala, Thaana, Yi, and others. -For scripts not yet -covered, research on how to best encode them for computer usage is -still going on and they will be added eventually. -This might -eventually include not only Hieroglyphs and various historic -Indo-European languages, but even some selected artistic scripts such -as Tengwar, Cirth, and Klingon. -UCS also covers a large number of -graphical, typographical, mathematical, and scientific symbols, -including those provided by TeX, Postscript, APL, MS-DOS, MS-Windows, -Macintosh, OCR fonts, as well as many word processing and publishing -systems, and more are being added. -.P -The UCS standard (ISO/IEC 10646) describes a -31-bit character set architecture -consisting of 128 24-bit -.IR groups , -each divided into 256 16-bit -.I planes -made up of 256 8-bit -.I rows -with 256 -.I column -positions, one for each character. -Part 1 of the standard (ISO/IEC 10646-1) -defines the first 65534 code positions (0x0000 to 0xfffd), which form -the -.I Basic Multilingual Plane -(BMP), that is plane 0 in group 0. -Part 2 of the standard (ISO/IEC 10646-2) -adds characters to group 0 outside the BMP in several -.I "supplementary planes" -in the range 0x10000 to 0x10ffff. -There are no plans to add characters -beyond 0x10ffff to the standard, therefore of the entire code space, -only a small fraction of group 0 will ever be actually used in the -foreseeable future. -The BMP contains all characters found in the -commonly used other character sets. -The supplemental planes added by -ISO/IEC 10646-2 cover only more exotic characters for special scientific, -dictionary printing, publishing industry, higher-level protocol and -enthusiast needs. -.P -The representation of each UCS character as a 2-byte word is referred -to as the UCS-2 form (only for BMP characters), -whereas UCS-4 is the representation of each character by a 4-byte word. -In addition, there exist two encoding forms UTF-8 -for backward compatibility with ASCII processing software and UTF-16 -for the backward-compatible handling of non-BMP characters up to -0x10ffff by UCS-2 software. -.P -The UCS characters 0x0000 to 0x007f are identical to those of the -classic US-ASCII -character set and the characters in the range 0x0000 to 0x00ff -are identical to those in -ISO/IEC\~8859-1 (Latin-1). -.SS Combining characters -Some code points in UCS -have been assigned to -.IR "combining characters" . -These are similar to the nonspacing accent keys on a typewriter. -A combining character just adds an accent to the previous character. -The most important accented characters have codes of their own in UCS, -however, the combining character mechanism allows us to add accents -and other diacritical marks to any character. -The combining characters -always follow the character which they modify. -For example, the German -character Umlaut-A ("Latin capital letter A with diaeresis") can -either be represented by the precomposed UCS code 0x00c4, or -alternatively as the combination of a normal "Latin capital letter A" -followed by a "combining diaeresis": 0x0041 0x0308. -.P -Combining characters are essential for instance for encoding the Thai -script or for mathematical typesetting and users of the International -Phonetic Alphabet. -.SS Implementation levels -As not all systems are expected to support advanced mechanisms like -combining characters, ISO/IEC 10646-1 specifies the following three -.I implementation levels -of UCS: -.TP 0.9i -Level 1 -Combining characters and Hangul Jamo -(a variant encoding of the Korean script, where a Hangul syllable -glyph is coded as a triplet or pair of vowel/consonant codes) are not -supported. -.TP -Level 2 -In addition to level 1, combining characters are now allowed for some -languages where they are essential (e.g., Thai, Lao, Hebrew, -Arabic, Devanagari, Malayalam). -.TP -Level 3 -All UCS characters are supported. -.P -The Unicode 3.0 Standard -published by the Unicode Consortium -contains exactly the UCS Basic Multilingual Plane -at implementation level 3, as described in ISO/IEC 10646-1:2000. -Unicode 3.1 added the supplemental planes of ISO/IEC 10646-2. -The Unicode standard and -technical reports published by the Unicode Consortium provide much -additional information on the semantics and recommended usages of -various characters. -They provide guidelines and algorithms for -editing, sorting, comparing, normalizing, converting, and displaying -Unicode strings. -.SS Unicode under Linux -Under GNU/Linux, the C type -.I wchar_t -is a signed 32-bit integer type. -Its values are always interpreted -by the C library as UCS -code values (in all locales), a convention that is signaled by the GNU -C library to applications by defining the constant -.B __STDC_ISO_10646__ -as specified in the ISO C99 standard. -.P -UCS/Unicode can be used just like ASCII in input/output streams, -terminal communication, plaintext files, filenames, and environment -variables in the ASCII compatible UTF-8 multibyte encoding. -To signal the use of UTF-8 as the character -encoding to all applications, a suitable -.I locale -has to be selected via environment variables (e.g., -"LANG=en_GB.UTF-8"). -.P -The -.B nl_langinfo(CODESET) -function returns the name of the selected encoding. -Library functions such as -.BR wctomb (3) -and -.BR mbsrtowcs (3) -can be used to transform the internal -.I wchar_t -characters and strings into the system character encoding and back -and -.BR wcwidth (3) -tells how many positions (0\[en]2) the cursor is advanced by the -output of a character. -.SS Private Use Areas (PUA) -In the Basic Multilingual Plane, -the range 0xe000 to 0xf8ff will never be assigned to any characters by -the standard and is reserved for private usage. -For the Linux -community, this private area has been subdivided further into the -range 0xe000 to 0xefff which can be used individually by any end-user -and the Linux zone in the range 0xf000 to 0xf8ff where extensions are -coordinated among all Linux users. -The registry of the characters -assigned to the Linux zone is maintained by LANANA and the registry -itself is -.I Documentation/admin\-guide/unicode.rst -in the Linux kernel sources -.\" commit 9d85025b0418163fae079c9ba8f8445212de8568 -(or -.I Documentation/unicode.txt -before Linux 4.10). -.P -Two other planes are reserved for private usage, plane 15 -(Supplementary Private Use Area-A, range 0xf0000 to 0xffffd) -and plane 16 (Supplementary Private Use Area-B, range -0x100000 to 0x10fffd). -.SS Literature -.IP \[bu] 3 -Information technology \[em] Universal Multiple-Octet Coded Character -Set (UCS) \[em] Part 1: Architecture and Basic Multilingual Plane. -International Standard ISO/IEC 10646-1, International Organization -for Standardization, Geneva, 2000. -.IP -This is the official specification of UCS. -Available from -.UR http://www.iso.ch/ -.UE . -.IP \[bu] -The Unicode Standard, Version 3.0. -The Unicode Consortium, Addison-Wesley, -Reading, MA, 2000, ISBN 0-201-61633-5. -.IP \[bu] -S.\& Harbison, G.\& Steele. C: A Reference Manual. Fourth edition, -Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3. -.IP -A good reference book about the C programming language. -The fourth -edition covers the 1994 Amendment 1 to the ISO C90 standard, which -adds a large number of new C library functions for handling wide and -multibyte character encodings, but it does not yet cover ISO C99, -which improved wide and multibyte character support even further. -.IP \[bu] -Unicode Technical Reports. -.RS -.UR http://www.unicode.org\:/reports/ -.UE -.RE -.IP \[bu] -Markus Kuhn: UTF-8 and Unicode FAQ for UNIX/Linux. -.RS -.UR http://www.cl.cam.ac.uk\:/\[ti]mgk25\:/unicode.html -.UE -.RE -.IP \[bu] -Bruno Haible: Unicode HOWTO. -.RS -.UR http://www.tldp.org\:/HOWTO\:/Unicode\-HOWTO.html -.UE -.RE -.\" .SH AUTHOR -.\" Markus Kuhn <mgk25@cl.cam.ac.uk> -.SH SEE ALSO -.BR locale (1), -.BR setlocale (3), -.BR charsets (7), -.BR utf\-8 (7) |