diff options
Diffstat (limited to 'man7/utf-8.7')
-rw-r--r-- | man7/utf-8.7 | 205 |
1 files changed, 0 insertions, 205 deletions
diff --git a/man7/utf-8.7 b/man7/utf-8.7 deleted file mode 100644 index 077fa42..0000000 --- a/man7/utf-8.7 +++ /dev/null @@ -1,205 +0,0 @@ -.\" Copyright (C) Markus Kuhn, 1996, 2001 -.\" -.\" SPDX-License-Identifier: GPL-2.0-or-later -.\" -.\" 1995-11-26 Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de> -.\" First version written -.\" 2001-05-11 Markus Kuhn <mgk25@cl.cam.ac.uk> -.\" Update -.\" -.TH UTF-8 7 2024-03-14 "Linux man-pages 6.7" -.SH NAME -UTF-8 \- an ASCII compatible multibyte Unicode encoding -.SH DESCRIPTION -The Unicode 3.0 character set occupies a 16-bit code space. -The most obvious -Unicode encoding (known as UCS-2) -consists of a sequence of 16-bit words. -Such strings can contain\[em]as part of many 16-bit characters\[em]bytes -such as \[aq]\e0\[aq] or \[aq]/\[aq], which have a -special meaning in filenames and other C library function arguments. -In addition, the majority of UNIX tools expect ASCII files and can't -read 16-bit words as characters without major modifications. -For these reasons, -UCS-2 is not a suitable external encoding of Unicode -in filenames, text files, environment variables, and so on. -The ISO/IEC 10646 Universal Character Set (UCS), -a superset of Unicode, occupies an even larger code -space\[em]31\ bits\[em]and the obvious -UCS-4 encoding for it (a sequence of 32-bit words) has the same problems. -.P -The UTF-8 encoding of Unicode and UCS -does not have these problems and is the common way in which -Unicode is used on UNIX-style operating systems. -.SS Properties -The UTF-8 encoding has the following nice properties: -.IP \[bu] 3 -UCS -characters 0x00000000 to 0x0000007f (the classic US-ASCII -characters) are encoded simply as bytes 0x00 to 0x7f (ASCII -compatibility). -This means that files and strings which contain only -7-bit ASCII characters have the same encoding under both -ASCII -and -UTF-8. -.IP \[bu] -All UCS characters greater than 0x7f are encoded as a multibyte sequence -consisting only of bytes in the range 0x80 to 0xfd, so no ASCII -byte can appear as part of another character and there are no -problems with, for example, \[aq]\e0\[aq] or \[aq]/\[aq]. -.IP \[bu] -The lexicographic sorting order of UCS-4 strings is preserved. -.IP \[bu] -All possible 2\[ha]31 UCS codes can be encoded using UTF-8. -.IP \[bu] -The bytes 0xc0, 0xc1, 0xfe, and 0xff are never used in the UTF-8 encoding. -.IP \[bu] -The first byte of a multibyte sequence which represents a single non-ASCII -UCS character is always in the range 0xc2 to 0xfd and indicates how long -this multibyte sequence is. -All further bytes in a multibyte sequence -are in the range 0x80 to 0xbf. -This allows easy resynchronization and -makes the encoding stateless and robust against missing bytes. -.IP \[bu] -UTF-8 encoded UCS characters may be up to six bytes long, however the -Unicode standard specifies no characters above 0x10ffff, so Unicode characters -can be only up to four bytes long in -UTF-8. -.SS Encoding -The following byte sequences are used to represent a character. -The sequence to be used depends on the UCS code number of the character: -.TP -0x00000000 \- 0x0000007F: -.RI 0 xxxxxxx -.TP -0x00000080 \- 0x000007FF: -.RI 110 xxxxx -.RI 10 xxxxxx -.TP -0x00000800 \- 0x0000FFFF: -.RI 1110 xxxx -.RI 10 xxxxxx -.RI 10 xxxxxx -.TP -0x00010000 \- 0x001FFFFF: -.RI 11110 xxx -.RI 10 xxxxxx -.RI 10 xxxxxx -.RI 10 xxxxxx -.TP -0x00200000 \- 0x03FFFFFF: -.RI 111110 xx -.RI 10 xxxxxx -.RI 10 xxxxxx -.RI 10 xxxxxx -.RI 10 xxxxxx -.TP -0x04000000 \- 0x7FFFFFFF: -.RI 1111110 x -.RI 10 xxxxxx -.RI 10 xxxxxx -.RI 10 xxxxxx -.RI 10 xxxxxx -.RI 10 xxxxxx -.P -The -.I xxx -bit positions are filled with the bits of the character code number in -binary representation, most significant bit first (big-endian). -Only the shortest possible multibyte sequence -which can represent the code number of the character can be used. -.P -The UCS code values 0xd800\[en]0xdfff (UTF-16 surrogates) as well as 0xfffe and -0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams. -According to RFC 3629 no point above U+10FFFF should be used, -which limits characters to four bytes. -.SS Example -The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded -in UTF-8 as -.P -.RS -11000010 10101001 = 0xc2 0xa9 -.RE -.P -and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is -encoded as: -.P -.RS -11100010 10001001 10100000 = 0xe2 0x89 0xa0 -.RE -.SS Application notes -Users have to select a UTF-8 locale, for example with -.P -.RS -export LANG=en_GB.UTF-8 -.RE -.P -in order to activate the UTF-8 support in applications. -.P -Application software that has to be aware of the used character -encoding should always set the locale with for example -.P -.RS -setlocale(LC_CTYPE, "") -.RE -.P -and programmers can then test the expression -.P -.RS -strcmp(nl_langinfo(CODESET), "UTF-8") == 0 -.RE -.P -to determine whether a UTF-8 locale has been selected and whether -therefore all plaintext standard input and output, terminal -communication, plaintext file content, filenames, and environment -variables are encoded in UTF-8. -.P -Programmers accustomed to single-byte encodings -such as US-ASCII or ISO/IEC\~8859 -have to be aware that two assumptions made so far are no longer valid -in UTF-8 locales. -Firstly, a single byte does not necessarily correspond any -more to a single character. -Secondly, since modern terminal emulators in UTF-8 -mode also support Chinese, Japanese, and Korean -double-width characters as well as nonspacing combining characters, -outputting a single character does not necessarily advance the cursor -by one position as it did in ASCII. -Library functions such as -.BR mbsrtowcs (3) -and -.BR wcswidth (3) -should be used today to count characters and cursor positions. -.P -The official ESC sequence to switch from an ISO/IEC\~2022 -encoding scheme (as used for instance by VT100 terminals) to -UTF-8 is ESC % G -("\ex1b%G"). -The corresponding return sequence from -UTF-8 to ISO/IEC\~2022 is ESC % @ ("\ex1b%@"). -Other ISO/IEC\~2022 sequences (such as -for switching the G0 and G1 sets) are not applicable in UTF-8 mode. -.SS Security -The Unicode and UCS standards require that producers of UTF-8 -shall use the shortest form possible, for example, producing a two-byte -sequence with first byte 0xc0 is nonconforming. -Unicode 3.1 has added the requirement that conforming programs must not accept -non-shortest forms in their input. -This is for security reasons: if -user input is checked for possible security violations, a program -might check only for the ASCII -version of "/../" or ";" or NUL and overlook that there are many -non-ASCII ways to represent these things in a non-shortest UTF-8 -encoding. -.SS Standards -ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 3629, Plan 9. -.\" .SH AUTHOR -.\" Markus Kuhn <mgk25@cl.cam.ac.uk> -.SH SEE ALSO -.BR locale (1), -.BR nl_langinfo (3), -.BR setlocale (3), -.BR charsets (7), -.BR unicode (7) |