From 3d08cd331c1adcf0d917392f7e527b3f00511748 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Fri, 24 May 2024 06:52:22 +0200 Subject: Merging upstream version 6.8. Signed-off-by: Daniel Baumann --- man/man7/utf-8.7 | 205 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 205 insertions(+) create mode 100644 man/man7/utf-8.7 (limited to 'man/man7/utf-8.7') diff --git a/man/man7/utf-8.7 b/man/man7/utf-8.7 new file mode 100644 index 0000000..9762871 --- /dev/null +++ b/man/man7/utf-8.7 @@ -0,0 +1,205 @@ +.\" Copyright (C) Markus Kuhn, 1996, 2001 +.\" +.\" SPDX-License-Identifier: GPL-2.0-or-later +.\" +.\" 1995-11-26 Markus Kuhn +.\" First version written +.\" 2001-05-11 Markus Kuhn +.\" Update +.\" +.TH UTF-8 7 2024-05-02 "Linux man-pages (unreleased)" +.SH NAME +UTF-8 \- an ASCII compatible multibyte Unicode encoding +.SH DESCRIPTION +The Unicode 3.0 character set occupies a 16-bit code space. +The most obvious +Unicode encoding (known as UCS-2) +consists of a sequence of 16-bit words. +Such strings can contain\[em]as part of many 16-bit characters\[em]bytes +such as \[aq]\e0\[aq] or \[aq]/\[aq], which have a +special meaning in filenames and other C library function arguments. +In addition, the majority of UNIX tools expect ASCII files and can't +read 16-bit words as characters without major modifications. +For these reasons, +UCS-2 is not a suitable external encoding of Unicode +in filenames, text files, environment variables, and so on. +The ISO/IEC 10646 Universal Character Set (UCS), +a superset of Unicode, occupies an even larger code +space\[em]31\ bits\[em]and the obvious +UCS-4 encoding for it (a sequence of 32-bit words) has the same problems. +.P +The UTF-8 encoding of Unicode and UCS +does not have these problems and is the common way in which +Unicode is used on UNIX-style operating systems. +.SS Properties +The UTF-8 encoding has the following nice properties: +.IP \[bu] 3 +UCS +characters 0x00000000 to 0x0000007f (the classic US-ASCII +characters) are encoded simply as bytes 0x00 to 0x7f (ASCII +compatibility). +This means that files and strings which contain only +7-bit ASCII characters have the same encoding under both +ASCII +and +UTF-8. +.IP \[bu] +All UCS characters greater than 0x7f are encoded as a multibyte sequence +consisting only of bytes in the range 0x80 to 0xfd, so no ASCII +byte can appear as part of another character and there are no +problems with, for example, \[aq]\e0\[aq] or \[aq]/\[aq]. +.IP \[bu] +The lexicographic sorting order of UCS-4 strings is preserved. +.IP \[bu] +All possible 2\[ha]31 UCS codes can be encoded using UTF-8. +.IP \[bu] +The bytes 0xc0, 0xc1, 0xfe, and 0xff are never used in the UTF-8 encoding. +.IP \[bu] +The first byte of a multibyte sequence which represents a single non-ASCII +UCS character is always in the range 0xc2 to 0xfd and indicates how long +this multibyte sequence is. +All further bytes in a multibyte sequence +are in the range 0x80 to 0xbf. +This allows easy resynchronization and +makes the encoding stateless and robust against missing bytes. +.IP \[bu] +UTF-8 encoded UCS characters may be up to six bytes long, however the +Unicode standard specifies no characters above 0x10ffff, so Unicode characters +can be only up to four bytes long in +UTF-8. +.SS Encoding +The following byte sequences are used to represent a character. +The sequence to be used depends on the UCS code number of the character: +.TP +0x00000000 \- 0x0000007F: +.RI 0 xxxxxxx +.TP +0x00000080 \- 0x000007FF: +.RI 110 xxxxx +.RI 10 xxxxxx +.TP +0x00000800 \- 0x0000FFFF: +.RI 1110 xxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.TP +0x00010000 \- 0x001FFFFF: +.RI 11110 xxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.TP +0x00200000 \- 0x03FFFFFF: +.RI 111110 xx +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.TP +0x04000000 \- 0x7FFFFFFF: +.RI 1111110 x +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.RI 10 xxxxxx +.P +The +.I xxx +bit positions are filled with the bits of the character code number in +binary representation, most significant bit first (big-endian). +Only the shortest possible multibyte sequence +which can represent the code number of the character can be used. +.P +The UCS code values 0xd800\[en]0xdfff (UTF-16 surrogates) as well as 0xfffe and +0xffff (UCS noncharacters) should not appear in conforming UTF-8 streams. +According to RFC 3629 no point above U+10FFFF should be used, +which limits characters to four bytes. +.SS Example +The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded +in UTF-8 as +.P +.RS +11000010 10101001 = 0xc2 0xa9 +.RE +.P +and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is +encoded as: +.P +.RS +11100010 10001001 10100000 = 0xe2 0x89 0xa0 +.RE +.SS Application notes +Users have to select a UTF-8 locale, for example with +.P +.RS +export LANG=en_GB.UTF-8 +.RE +.P +in order to activate the UTF-8 support in applications. +.P +Application software that has to be aware of the used character +encoding should always set the locale with for example +.P +.RS +setlocale(LC_CTYPE, "") +.RE +.P +and programmers can then test the expression +.P +.RS +strcmp(nl_langinfo(CODESET), "UTF-8") == 0 +.RE +.P +to determine whether a UTF-8 locale has been selected and whether +therefore all plaintext standard input and output, terminal +communication, plaintext file content, filenames, and environment +variables are encoded in UTF-8. +.P +Programmers accustomed to single-byte encodings +such as US-ASCII or ISO/IEC\~8859 +have to be aware that two assumptions made so far are no longer valid +in UTF-8 locales. +Firstly, a single byte does not necessarily correspond any +more to a single character. +Secondly, since modern terminal emulators in UTF-8 +mode also support Chinese, Japanese, and Korean +double-width characters as well as nonspacing combining characters, +outputting a single character does not necessarily advance the cursor +by one position as it did in ASCII. +Library functions such as +.BR mbsrtowcs (3) +and +.BR wcswidth (3) +should be used today to count characters and cursor positions. +.P +The official ESC sequence to switch from an ISO/IEC\~2022 +encoding scheme (as used for instance by VT100 terminals) to +UTF-8 is ESC % G +("\ex1b%G"). +The corresponding return sequence from +UTF-8 to ISO/IEC\~2022 is ESC % @ ("\ex1b%@"). +Other ISO/IEC\~2022 sequences (such as +for switching the G0 and G1 sets) are not applicable in UTF-8 mode. +.SS Security +The Unicode and UCS standards require that producers of UTF-8 +shall use the shortest form possible, for example, producing a two-byte +sequence with first byte 0xc0 is nonconforming. +Unicode 3.1 has added the requirement that conforming programs must not accept +non-shortest forms in their input. +This is for security reasons: if +user input is checked for possible security violations, a program +might check only for the ASCII +version of "/../" or ";" or NUL and overlook that there are many +non-ASCII ways to represent these things in a non-shortest UTF-8 +encoding. +.SS Standards +ISO/IEC 10646-1:2000, Unicode 3.1, RFC\ 3629, Plan 9. +.\" .SH AUTHOR +.\" Markus Kuhn +.SH SEE ALSO +.BR locale (1), +.BR nl_langinfo (3), +.BR setlocale (3), +.BR charsets (7), +.BR unicode (7) -- cgit v1.2.3