Diffstat (limited to 'man7/unicode.7')
-rw-r--r-- | man7/unicode.7 | 22 |
1 files changed, 11 insertions, 11 deletions
diff --git a/man7/unicode.7 b/man7/unicode.7
index f65a9b2..fe38909 100644
--- a/man7/unicode.7
+++ b/man7/unicode.7
@@ -7,7 +7,7 @@
 .\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk>
 .\" Update
 .\"
-.TH unicode 7 2023-03-12 "Linux man-pages 6.05.01"
+.TH unicode 7 2024-01-28 "Linux man-pages 6.7"
 .SH NAME
 unicode \- universal character set
 .SH DESCRIPTION
@@ -18,7 +18,7 @@ It also guarantees "round-trip compatibility";
 in other words,
 conversion tables can be built such that no information is lost
 when a string is converted from any other encoding to UCS and back.
-.PP
+.P
 UCS contains the characters required to represent practically all
 known languages.
 This includes not only the Latin, Greek, Cyrillic,
@@ -40,7 +40,7 @@ graphical, typographical, mathematical, and scientific symbols,
 including those provided by TeX, Postscript, APL, MS-DOS, MS-Windows,
 Macintosh, OCR fonts, as well as many word processing and publishing
 systems, and more are being added.
-.PP
+.P
 The UCS standard (ISO/IEC 10646) describes a
 31-bit character set architecture
 consisting of 128 24-bit
@@ -71,7 +71,7 @@ The supplemental planes added by ISO/IEC 10646-2 cover only more
 exotic characters for special scientific,
 dictionary printing,
 publishing industry, higher-level protocol and enthusiast needs.
-.PP
+.P
 The representation of each UCS character as a 2-byte word is referred
 to as the UCS-2 form (only for BMP characters),
 whereas UCS-4 is the representation of each character by a 4-byte word.
@@ -79,12 +79,12 @@ In addition, there exist two encoding forms UTF-8
 for backward compatibility with ASCII processing software and UTF-16
 for the backward-compatible handling of non-BMP characters up to
 0x10ffff by UCS-2 software.
-.PP
+.P
 The UCS characters 0x0000 to 0x007f are identical to those of the
 classic US-ASCII
 character set and the characters in the range 0x0000 to 0x00ff
 are identical to those in
-ISO 8859-1 (Latin-1).
+ISO/IEC\~8859-1 (Latin-1).
 .SS Combining characters
 Some code points in UCS
 have been assigned to
@@ -101,7 +101,7 @@ character Umlaut-A ("Latin capital letter A with diaeresis") can
 either be represented by the precomposed UCS code 0x00c4, or
 alternatively as the combination of a normal "Latin capital letter A"
 followed by a "combining diaeresis": 0x0041 0x0308.
-.PP
+.P
 Combining characters are essential for instance for encoding the Thai
 script or for mathematical typesetting and users of the International
 Phonetic Alphabet.
@@ -124,7 +124,7 @@ Arabic, Devanagari, Malayalam).
 .TP
 Level 3
 All UCS characters are supported.
-.PP
+.P
 The Unicode 3.0 Standard
 published by the Unicode Consortium
 contains exactly the UCS Basic Multilingual Plane
@@ -147,7 +147,7 @@ code values (in all locales), a convention that is signaled by the GNU
 C library to applications by defining the constant
 .B __STDC_ISO_10646__
 as specified in the ISO C99 standard.
-.PP
+.P
 UCS/Unicode can be used just like ASCII in input/output streams,
 terminal communication, plaintext files, filenames, and environment variables
 in the ASCII compatible UTF-8 multibyte encoding.
@@ -156,7 +156,7 @@ encoding to all applications, a suitable
 .I locale
 has to be selected via environment variables
 (e.g., "LANG=en_GB.UTF-8").
-.PP
+.P
 The
 .B nl_langinfo(CODESET)
 function returns the name of the selected encoding.
@@ -189,7 +189,7 @@ in the Linux kernel sources
 (or
 .I Documentation/unicode.txt
 before Linux 4.10).
-.PP
+.P
 Two other planes are reserved for private usage, plane 15
 (Supplementary Private Use Area-A, range 0xf0000 to 0xffffd)
 and plane 16 (Supplementary Private Use Area-B, range