path: root/upstream/debian-unstable/man3/Encode::Unicode.3perl
author Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 19:43:11 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 19:43:11 +0000
commit fc22b3d6507c6745911b9dfcc68f1e665ae13dbc (patch)
tree ce1e3bce06471410239a6f41282e328770aa404a /upstream/debian-unstable/man3/Encode::Unicode.3perl
parent Initial commit. (diff)
Adding upstream version 4.22.0.
upstream/4.22.0
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'upstream/debian-unstable/man3/Encode::Unicode.3perl')
-rw-r--r-- upstream/debian-unstable/man3/Encode::Unicode.3perl | 253
1 files changed, 253 insertions, 0 deletions
diff --git a/upstream/debian-unstable/man3/Encode::Unicode.3perl b/upstream/debian-unstable/man3/Encode::Unicode.3perl
new file mode 100644
index 00000000..234a9639
--- /dev/null
+++ b/upstream/debian-unstable/man3/Encode::Unicode.3perl
@@ -0,0 +1,253 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+. ds C` ""
+. ds C' ""
+'br\}
+.el\{\
+. ds C`
+. ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD. Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+. if \nF \{\
+. de IX
+. tm Index:\\$1\t\\n%\t"\\$2"
+..
+. if !\nF==2 \{\
+. nr % 0
+. nr F 2
+. \}
+. \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "Encode::Unicode 3perl"
+.TH Encode::Unicode 3perl 2024-01-12 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification. Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+Encode::Unicode \-\- Various Unicode Transformation Formats
+.SH SYNOPSIS
+.IX Header "SYNOPSIS"
+.Vb 3
+\& use Encode qw/encode decode/;
+\& $ucs2 = encode("UCS\-2BE", $utf8);
+\& $utf8 = decode("UCS\-2BE", $ucs2);
+.Ve
+.SH ABSTRACT
+.IX Header "ABSTRACT"
+This module implements all Character Encoding Schemes of Unicode that
+are officially documented by the Unicode Consortium (except, of course,
+for UTF\-8, which is a native format in perl).
+.IP "<http://www.unicode.org/glossary/> says:" 4
+.IX Item "<http://www.unicode.org/glossary/> says:"
+\&\fICharacter Encoding Scheme\fR A character encoding form plus byte
+serialization. There are seven character encoding schemes in Unicode:
+UTF\-8, UTF\-16, UTF\-16BE, UTF\-16LE, UTF\-32 (UCS\-4), UTF\-32BE (UCS\-4BE) and
+UTF\-32LE (UCS\-4LE), and UTF\-7.
+.Sp
+Since UTF\-7 is a 7\-bit (re)encoded version of UTF\-16BE, it is not part of
+Unicode's Character Encoding Scheme. It is separately implemented in
+Encode::Unicode::UTF7. For details see Encode::Unicode::UTF7.
+.IP "Quick Reference" 4
+.IX Item "Quick Reference"
+.Vb 10
+\&                Decodes from ord(N)           Encodes chr(N) to...
+\&       octet/char BOM S.P d800\-dfff  ord > 0xffff     \ex{1abcd} ==
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\&  UCS\-2BE       2   N   N  is bogus                  Not Available
+\&  UCS\-2LE       2   N   N     bogus                  Not Available
+\&  UTF\-16      2/4   Y   Y  is   S.P           S.P            BE/LE
+\&  UTF\-16BE    2/4   N   Y       S.P           S.P    0xd82a,0xdfcd
+\&  UTF\-16LE    2/4   N   Y       S.P           S.P    0x2ad8,0xcddf
+\&  UTF\-32        4   Y   \-  is bogus         As is            BE/LE
+\&  UTF\-32BE      4   N   \-     bogus         As is       0x0001abcd
+\&  UTF\-32LE      4   N   \-     bogus         As is       0xcdab0100
+\&  UTF\-8       1\-4   \-   \-     bogus   >= 4 octets \exf0\ex9a\exaf\ex8d
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.SH "Size, Endianness, and BOM"
+.IX Header "Size, Endianness, and BOM"
+You can categorize these CES by 3 criteria: size of each character,
+endianness, and Byte Order Mark.
+.SS "by size"
+.IX Subsection "by size"
+UCS\-2 is a fixed-length encoding with each character taking 16 bits.
+It \fBdoes not\fR support \fIsurrogate pairs\fR. When a surrogate pair
+is encountered during \fBdecode()\fR, its place is filled with \ex{FFFD}
+if \fICHECK\fR is 0, or the routine croaks if \fICHECK\fR is 1. When a
+character whose ord value is larger than 0xFFFF is encountered during \fBencode()\fR,
+its place is filled with \ex{FFFD} if \fICHECK\fR is 0, or the routine
+croaks if \fICHECK\fR is 1.
+.PP
+UTF\-16 is almost the same as UCS\-2 but it supports \fIsurrogate pairs\fR.
+When it encounters a high surrogate (0xD800\-0xDBFF), it fetches the
+following low surrogate (0xDC00\-0xDFFF) and \f(CW\*(C`desurrogate\*(C'\fRs them to
+form a character. Bogus surrogates result in death. When \ex{10000}
+or above is encountered during \fBencode()\fR, it \f(CW\*(C`ensurrogate\*(C'\fRs them and
+pushes the surrogate pair to the output stream.
+.PP
+UTF\-32 (UCS\-4) is a fixed-length encoding with each character taking 32 bits.
+Since it is 32\-bit, there is no need for \fIsurrogate pairs\fR.
+.SS "by endianness"
+.IX Subsection "by endianness"
+The first (and now failed) goal of Unicode was to map all character
+repertoires into a fixed-length integer so that programmers are happy.
+Since each character is either a \fIshort\fR or \fIlong\fR in C, you have to
+pay attention to the endianness of each platform when you pass data
+to one another.
+.PP
+Anything marked as BE is Big Endian (or network byte order) and LE is
+Little Endian (aka VAX byte order). For anything not marked either
+BE or LE, a character called Byte Order Mark (BOM) indicating the
+endianness is prepended to the string.
+.PP
+CAVEAT: Though BOM in utf8 (\exEF\exBB\exBF) is valid, it is meaningless
+and as of this writing the Encode suite just leaves it as is (\ex{FeFF}).
+.IP "BOM as integer when fetched in network byte order" 4
+.IX Item "BOM as integer when fetched in network byte order"
+.Vb 5
+\&              16         32 bits/char
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\&  BE      0xFeFF 0x0000FeFF
+\&  LE      0xFFFe 0xFFFe0000
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.PP
+This module handles the BOM as follows.
+.IP \(bu 4
+When BE or LE is explicitly stated as the name of encoding, BOM is
+simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).
+.IP \(bu 4
+When BE or LE is omitted during \fBdecode()\fR, it checks if BOM is at the
+beginning of the string; if one is found, the endianness is set to
+what the BOM says.
+.IP \(bu 4
+Default Byte Order
+.Sp
+When no BOM is found, Encode 2.76 and below croaked. Since Encode
+2.77, it falls back to BE according to RFC2781 and the Unicode
+Standard version 8.0.
+.IP \(bu 4
+When BE or LE is omitted during \fBencode()\fR, it returns a BE-encoded
+string with BOM prepended. So when you want to encode a whole text
+file, make sure you \fBencode()\fR the whole text at once, not line by line;
+otherwise every line, not just the file, will have a BOM prepended.
+.IP \(bu 4
+\&\f(CW\*(C`UCS\-2\*(C'\fR is an exception. Unlike others, this is an alias of UCS\-2BE.
+UCS\-2 is already registered by IANA and others that way.
+.SH "Surrogate Pairs"
+.IX Header "Surrogate Pairs"
+To say the least, surrogate pairs were the biggest mistake of the
+Unicode Consortium. But according to the late Douglas Adams in \fIThe
+Hitchhiker's Guide to the Galaxy\fR Trilogy, \f(CW\*(C`In the beginning the
+Universe was created. This has made a lot of people very angry and
+been widely regarded as a bad move\*(C'\fR. Their mistake was not of this
+magnitude so let's forgive them.
+.PP
+(I don't dare make any comparison between the Unicode Consortium and the
+Vogons here ;) Or, comparing Encode to Babel Fish is completely
+appropriate \-\- if only you could stick this into your ear :)
+.PP
+Surrogate pairs were born when the Unicode Consortium finally
+admitted that 16 bits were not big enough to hold all the world's
+character repertoires. But they already made UCS\-2 16\-bit. What
+do we do?
+.PP
+Back then, the range 0xD800\-0xDFFF was not allocated. Let's split
+that range in half and use the first half to represent the \f(CW\*(C`upper
+half of a character\*(C'\fR and the second half to represent the \f(CW\*(C`lower
+half of a character\*(C'\fR. That way, you can represent 1024 * 1024 =
+1048576 more characters. Now we can store character ranges up to
+\&\ex{10ffff} even with 16\-bit encodings. This pair of half-characters is
+now called a \fIsurrogate pair\fR and UTF\-16 is the name of the encoding
+that embraces them.
+.PP
+Here is a formula to ensurrogate a Unicode character \ex{10000} and
+above:
+.PP
+.Vb 2
+\& $hi = int(($uni \- 0x10000) / 0x400) + 0xD800;
+\& $lo = ($uni \- 0x10000) % 0x400 + 0xDC00;
+.Ve
+.PP
+And to desurrogate:
+.PP
+.Vb 1
+\& $uni = 0x10000 + ($hi \- 0xD800) * 0x400 + ($lo \- 0xDC00);
+.Ve
+.PP
+Note this move has made \ex{D800}\-\ex{DFFF} into a forbidden zone but
+perl does not prohibit the use of characters within this range. To perl,
+every one of \ex{0000_0000} up to \ex{ffff_ffff} (*) is \fIa character\fR.
+.PP
+.Vb 2
+\& (*) or \ex{ffff_ffff_ffff_ffff} if your perl is compiled with 64\-bit
+\& integer support!
+.Ve
+.SH "Error Checking"
+.IX Header "Error Checking"
+Unlike most encodings which accept various ways to handle errors,
+Unicode encodings simply croak.
+.PP
+.Vb 6
+\& % perl \-MEncode \-e\*(Aq$_ = "\exfe\exff\exd8\exd9\exda\exdb\e0\en"\*(Aq \e
+\& \-e\*(AqEncode::from_to($_, "utf16","shift_jis", 0); print\*(Aq
+\& UTF\-16:Malformed LO surrogate d8d9 at /path/to/Encode.pm line 184.
+\& % perl \-MEncode \-e\*(Aq$a = "BOM missing"\*(Aq \e
+\& \-e\*(Aq Encode::from_to($a, "utf16", "shift_jis", 0); print\*(Aq
+\& UTF\-16:Unrecognised BOM 424f at /path/to/Encode.pm line 184.
+.Ve
+.PP
+Unlike other encodings where mappings are not one-to-one against
+Unicode, UTFs are supposed to map 100% against one another. So Encode
+is more strict on UTFs.
+.PP
+Consider that "division by zero" of Encode :)
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+Encode, Encode::Unicode::UTF7, <https://www.unicode.org/glossary/>,
+<https://www.unicode.org/faq/utf_bom.html>,
+.PP
+RFC 2781 <http://www.ietf.org/rfc/rfc2781.txt>,
+.PP
+The whole Unicode standard <https://www.unicode.org/standard/standard.html>
+.PP
+Ch. 6 pp. 275 of \f(CW\*(C`Programming Perl (3rd Edition)\*(C'\fR
+by Tom Christiansen, brian d foy & Larry Wall;
+O'Reilly & Associates; ISBN 978\-0\-596\-00492\-7
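
The surrogate formulas and the BOM rules in the page added above can be checked with a short Perl sketch. It is placed after the patch so the hunk itself stays intact. The helper names ensurrogate and desurrogate are hypothetical (the page uses those words only as verbs); encode and decode are the functions exported by Encode. Note the int(): Perl's "/" is floating-point division, so the quotient has to be truncated before the high-surrogate base is added.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: split a code point above 0xFFFF into a
# (high, low) surrogate pair, following the formula in the page.
sub ensurrogate {
    my ($uni) = @_;
    my $v  = $uni - 0x10000;
    my $hi = int($v / 0x400) + 0xD800;   # int() because "/" is not integer division
    my $lo = $v % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Hypothetical helper: recombine a surrogate pair into one code point.
sub desurrogate {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

# The Quick Reference row for UTF-16BE uses \x{1abcd} as its example.
my ($hi, $lo) = ensurrogate(0x1ABCD);
printf "U+1ABCD -> 0x%04x,0x%04x\n", $hi, $lo;      # 0xd82a,0xdfcd

# Compare the hand-built pair with what Encode emits for UTF-16BE.
my $by_hand = pack("n2", $hi, $lo);                 # two big-endian 16-bit units
my $octets  = encode("UTF-16BE", "\x{1abcd}");
print "hand-built pair matches Encode\n" if $octets eq $by_hand;

# With BE/LE omitted, encode() prepends a BOM and writes big-endian.
print unpack("H*", encode("UTF-16", "A")), "\n";    # feff0041

# With BE/LE omitted, decode() takes the byte order from the BOM.
my $char = decode("UTF-16", "\xff\xfe\x41\x00");    # little-endian BOM, then "A"
print "decoded: $char\n";
```

Encoding a file line by line with plain "UTF-16" would repeat the feff prefix on every line, which is why the page recommends encoding the whole text at once or naming the byte order explicitly.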