Diffstat (limited to 'upstream/debian-unstable/man3/Encode::Unicode.3perl')
-rw-r--r-- | upstream/debian-unstable/man3/Encode::Unicode.3perl | 253 |
1 files changed, 253 insertions, 0 deletions
diff --git a/upstream/debian-unstable/man3/Encode::Unicode.3perl b/upstream/debian-unstable/man3/Encode::Unicode.3perl
new file mode 100644
index 00000000..234a9639
--- /dev/null
+++ b/upstream/debian-unstable/man3/Encode::Unicode.3perl
@@ -0,0 +1,253 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+. ds C` ""
+. ds C' ""
+'br\}
+.el\{\
+. ds C`
+. ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD. Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+. if \nF \{\
+. de IX
+. tm Index:\\$1\t\\n%\t"\\$2"
+..
+. if !\nF==2 \{\
+. nr % 0
+. nr F 2
+. \}
+. \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "Encode::Unicode 3perl"
+.TH Encode::Unicode 3perl 2024-01-12 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification. Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+Encode::Unicode \-\- Various Unicode Transformation Formats
+.SH SYNOPSIS
+.IX Header "SYNOPSIS"
+.Vb 3
+\& use Encode qw/encode decode/;
+\& $ucs2 = encode("UCS\-2BE", $utf8);
+\& $utf8 = decode("UCS\-2BE", $ucs2);
+.Ve
+.SH ABSTRACT
+.IX Header "ABSTRACT"
+This module implements all Character Encoding Schemes of Unicode that
+are officially documented by the Unicode Consortium (except, of course,
+for UTF\-8, which is a native format in perl).
+.IP "<http://www.unicode.org/glossary/> says:" 4
+.IX Item "<http://www.unicode.org/glossary/> says:"
+\&\fICharacter Encoding Scheme\fR A character encoding form plus byte
+serialization. There are seven character encoding schemes in Unicode:
+UTF\-8, UTF\-16, UTF\-16BE, UTF\-16LE, UTF\-32 (UCS\-4), UTF\-32BE (UCS\-4BE) and
+UTF\-32LE (UCS\-4LE), and UTF\-7.
+.Sp
+Since UTF\-7 is a 7\-bit (re)encoded version of UTF\-16BE, it is not part of
+Unicode's Character Encoding Scheme. It is separately implemented in
+Encode::Unicode::UTF7. For details see Encode::Unicode::UTF7.
+.IP "Quick Reference" 4
+.IX Item "Quick Reference"
+.Vb 10
+\&                Decodes from ord(N)           Encodes chr(N) to...
+\&       octet/char BOM S.P d800\-dfff  ord > 0xffff     \ex{1abcd} ==
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\&  UCS\-2BE       2   N   N  is bogus                  Not Available
+\&  UCS\-2LE       2   N   N     bogus                  Not Available
+\&  UTF\-16      2/4   Y   Y  is   S.P           S.P            BE/LE
+\&  UTF\-16BE    2/4   N   Y       S.P           S.P    0xd82a,0xdfcd
+\&  UTF\-16LE    2/4   N   Y       S.P           S.P    0x2ad8,0xcddf
+\&  UTF\-32        4   Y   \-  is bogus         As is            BE/LE
+\&  UTF\-32BE      4   N   \-     bogus         As is       0x0001abcd
+\&  UTF\-32LE      4   N   \-     bogus         As is       0xcdab0100
+\&  UTF\-8       1\-4   \-   \-     bogus   >= 4 octets \exf0\ex9a\exaf\ex8d
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.SH "Size, Endianness, and BOM"
+.IX Header "Size, Endianness, and BOM"
+You can categorize these CES by 3 criteria: size of each character,
+endianness, and Byte Order Mark.
+.SS "by size"
+.IX Subsection "by size"
+UCS\-2 is a fixed-length encoding with each character taking 16 bits.
+It \fBdoes not\fR support \fIsurrogate pairs\fR. When a surrogate pair
+is encountered during \fBdecode()\fR, its place is filled with \ex{FFFD}
+if \fICHECK\fR is 0, or the routine croaks if \fICHECK\fR is 1. When a
+character whose ord value is larger than 0xFFFF is encountered during
+\fBencode()\fR, its place is filled with \ex{FFFD} if \fICHECK\fR is 0, or the
+routine croaks if \fICHECK\fR is 1.
+.PP
+UTF\-16 is almost the same as UCS\-2 but it supports \fIsurrogate pairs\fR.
+When it encounters a high surrogate (0xD800\-0xDBFF), it fetches the
+following low surrogate (0xDC00\-0xDFFF) and \f(CW\*(C`desurrogate\*(C'\fRs them to
+form a character. Bogus surrogates result in death. When \ex{10000}
+or above is encountered during \fBencode()\fR, it \f(CW\*(C`ensurrogate\*(C'\fRs the
+character and pushes the surrogate pair to the output stream.
+.PP
+UTF\-32 (UCS\-4) is a fixed-length encoding with each character taking 32 bits.
+Since it is 32\-bit, there is no need for \fIsurrogate pairs\fR.
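The size rules above can be checked with a short script. This is only a sketch, assuming a perl that ships the core Encode module; the helper name hex_octets is made up for this illustration and is not part of Encode. It encodes \x{1abcd}, the sample character from the Quick Reference table, and recomputes the UTF-16 surrogate pair by hand.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+1ABCD is the sample code point used in the Quick Reference table.
my $char = "\x{1abcd}";

# hex_octets is a hypothetical helper for this sketch: it returns the
# octets produced by the named encoding as a lowercase hex string.
sub hex_octets {
    my ($encoding, $string) = @_;
    return unpack("H*", encode($encoding, $string));
}

my $utf16be = hex_octets("UTF-16BE", $char);   # 2 x 16 bits (surrogate pair)
my $utf16le = hex_octets("UTF-16LE", $char);   # same pair, bytes swapped
my $utf32be = hex_octets("UTF-32BE", $char);   # 1 x 32 bits, no surrogates

# The same surrogate pair computed by hand with integer arithmetic.
my $offset  = ord($char) - 0x10000;
my $hi      = 0xD800 + ($offset >> 10);        # high (leading) surrogate
my $lo      = 0xDC00 + ($offset & 0x3FF);      # low (trailing) surrogate
my $by_hand = sprintf("%04x%04x", $hi, $lo);

# Decoding the UTF-16BE octets gives the original character back.
my $round_trip = decode("UTF-16BE", pack("H*", $utf16be));

print "UTF-16BE: $utf16be\n";
print "UTF-16LE: $utf16le\n";
print "UTF-32BE: $utf32be\n";
print "by hand : $by_hand\n";
print "round trip ok\n" if $round_trip eq $char;
```

The three encoded values match the rightmost column of the Quick Reference table, and the hand-computed pair agrees with the UTF-16BE octets.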
+.SS "by endianness"
+.IX Subsection "by endianness"
+The first (and now failed) goal of Unicode was to map all character
+repertoires into a fixed-length integer so that programmers are happy.
+Since each character is either a \fIshort\fR or \fIlong\fR in C, you have to
+pay attention to the endianness of each platform when you pass data
+from one platform to another.
+.PP
+Anything marked as BE is Big Endian (or network byte order) and LE is
+Little Endian (aka VAX byte order). For anything not marked either
+BE or LE, a character called Byte Order Mark (BOM) indicating the
+endianness is prepended to the string.
+.PP
+CAVEAT: Though BOM in utf8 (\exEF\exBB\exBF) is valid, it is meaningless
+and as of this writing the Encode suite just leaves it as is (\ex{FeFF}).
+.IP "BOM as integer when fetched in network byte order" 4
+.IX Item "BOM as integer when fetched in network byte order"
+.Vb 5
+\&              16         32 bits/char
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\&  BE      0xFeFF 0x0000FeFF
+\&  LE      0xFFFe 0xFFFe0000
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.PP
+This module handles the BOM as follows.
+.IP \(bu 4
+When BE or LE is explicitly stated as the name of encoding, BOM is
+simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).
+.IP \(bu 4
+When BE or LE is omitted during \fBdecode()\fR, it checks if BOM is at the
+beginning of the string; if one is found, the endianness is set to
+what the BOM says.
+.IP \(bu 4
+Default Byte Order
+.Sp
+When no BOM is found, Encode 2.76 and below croaked. Since Encode
+2.77, it falls back to BE according to RFC2781 and the Unicode
+Standard version 8.0.
+.IP \(bu 4
+When BE or LE is omitted during \fBencode()\fR, it returns a BE-encoded
+string with BOM prepended. So when you want to encode a whole text
+file, make sure you \fBencode()\fR the whole text at once, not line by line;
+otherwise each line, rather than the file, will have a BOM prepended.
+.IP \(bu 4
+\&\f(CW\*(C`UCS\-2\*(C'\fR is an exception.
+Unlike others, this is an alias of UCS\-2BE.
+UCS\-2 is already registered by IANA and others that way.
+.SH "Surrogate Pairs"
+.IX Header "Surrogate Pairs"
+To say the least, surrogate pairs were the biggest mistake of the
+Unicode Consortium. But according to the late Douglas Adams in \fIThe
+Hitchhiker's Guide to the Galaxy\fR Trilogy, \f(CW\*(C`In the beginning the
+Universe was created. This has made a lot of people very angry and
+been widely regarded as a bad move\*(C'\fR. Their mistake was not of this
+magnitude so let's forgive them.
+.PP
+(I don't dare make any comparison between the Unicode Consortium and the
+Vogons here ;) Or, comparing Encode to Babel Fish is completely
+appropriate \-\- if you can only stick this into your ear :)
+.PP
+Surrogate pairs were born when the Unicode Consortium finally
+admitted that 16 bits were not big enough to hold all the world's
+character repertoires. But they already made UCS\-2 16\-bit. What
+do we do?
+.PP
+Back then, the range 0xD800\-0xDFFF was not allocated. Let's split
+that range in half and use the first half to represent the \f(CW\*(C`upper
+half of a character\*(C'\fR and the second half to represent the \f(CW\*(C`lower
+half of a character\*(C'\fR. That way, you can represent 1024 * 1024 =
+1048576 more characters. Now we can store character ranges up to
+\&\ex{10ffff} even with 16\-bit encodings. This pair of half-characters is
+now called a \fIsurrogate pair\fR and UTF\-16 is the name of the encoding
+that embraces them.
+.PP
+Here is a formula to ensurrogate a Unicode character \ex{10000} and
+above (the division is an integer division);
+.PP
+.Vb 2
+\& $hi = int(($uni \- 0x10000) / 0x400) + 0xD800;
+\& $lo = ($uni \- 0x10000) % 0x400 + 0xDC00;
+.Ve
+.PP
+And to desurrogate;
+.PP
+.Vb 1
+\& $uni = 0x10000 + ($hi \- 0xD800) * 0x400 + ($lo \- 0xDC00);
+.Ve
+.PP
+Note this move has made \ex{D800}\-\ex{DFFF} into a forbidden zone but
+perl does not prohibit the use of characters within this range.
+To perl, every one of \ex{0000_0000} up to \ex{ffff_ffff} (*) is \fIa character\fR.
+.PP
+.Vb 2
+\& (*) or \ex{ffff_ffff_ffff_ffff} if your perl is compiled with 64\-bit
+\& integer support!
+.Ve
+.SH "Error Checking"
+.IX Header "Error Checking"
+Unlike most encodings which accept various ways to handle errors,
+Unicode encodings simply croak.
+.PP
+.Vb 6
+\& % perl \-MEncode \-e\*(Aq$_ = "\exfe\exff\exd8\exd9\exda\exdb\e0\en"\*(Aq \e
+\& \-e\*(AqEncode::from_to($_, "utf16","shift_jis", 0); print\*(Aq
+\& UTF\-16:Malformed LO surrogate d8d9 at /path/to/Encode.pm line 184.
+\& % perl \-MEncode \-e\*(Aq$a = "BOM missing"\*(Aq \e
+\& \-e\*(Aq Encode::from_to($a, "utf16", "shift_jis", 0); print\*(Aq
+\& UTF\-16:Unrecognised BOM 424f at /path/to/Encode.pm line 184.
+.Ve
+.PP
+Unlike other encodings where mappings are not one-to-one against
+Unicode, UTFs are supposed to map 100% against one another. So Encode
+is more strict on UTFs.
+.PP
+Consider that the "division by zero" of Encode :)
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+Encode, Encode::Unicode::UTF7, <https://www.unicode.org/glossary/>,
+<https://www.unicode.org/faq/utf_bom.html>,
+.PP
+RFC 2781 <http://www.ietf.org/rfc/rfc2781.txt>,
+.PP
+The whole Unicode standard <https://www.unicode.org/standard/standard.html>
+.PP
+Ch. 6 pp. 275 of \f(CW\*(C`Programming Perl (3rd Edition)\*(C'\fR
+by Tom Christiansen, brian d foy & Larry Wall;
+O'Reilly & Associates; ISBN 978\-0\-596\-00492\-7
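As a closing illustration of the BOM rules listed under "by endianness", here is a minimal sketch, again assuming only the core Encode module. The sample lines and variable names are made up for this example. It shows that encoding with the unmarked name UTF-16 prepends a BOM to every call, that decoding with UTF-16 consumes the BOM, and that decoding with UTF-16BE keeps it as an ordinary character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two sample lines standing in for the contents of a text file.
my @lines = ("first\n", "second\n");
my $text  = join("", @lines);

# Encoding the whole text at once: a single BOM at the very start.
my $whole = encode("UTF-16", $text);

# Encoding line by line: every call prepends its own BOM.
my $per_line = join("", map(encode("UTF-16", $_), @lines));

# The first two octets of the output are the big-endian BOM.
my $bom_hex = unpack("H*", substr($whole, 0, 2));

# One extra 2-octet BOM for each line after the first.
my $extra_octets = length($per_line) - length($whole);

# With the unmarked name, decode() reads the BOM and drops it.
my $decoded = decode("UTF-16", $whole);

# With an explicit BE name, the BOM stays as a normal character (U+FEFF).
my $as_be    = decode("UTF-16BE", $whole);
my $first_cp = sprintf("%04x", ord($as_be));

print "BOM octets         : $bom_hex\n";
print "extra octets       : $extra_octets\n";
print "first char under BE: U+$first_cp\n";
print "round trip ok\n" if $decoded eq $text;
```

The line-by-line result carries one stray BOM per additional line, which is why the text should be encoded in one call.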