Adding upstream version 4.22.0. (upstream/4.22.0)

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 19:43:11 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 19:43:11 +0000
commit: fc22b3d6507c6745911b9dfcc68f1e665ae13dbc (patch)
tree: ce1e3bce06471410239a6f41282e328770aa404a /upstream/debian-unstable/man3/Encode::Unicode.3perl
parent: Initial commit. (diff)
1 files changed, 253 insertions, 0 deletions
diff --git a/upstream/debian-unstable/man3/Encode::Unicode.3perl b/upstream/debian-unstable/man3/Encode::Unicode.3perl
new file mode 100644
index 00000000..234a9639
--- /dev/null
+++ b/upstream/debian-unstable/man3/Encode::Unicode.3perl
@@ -0,0 +1,253 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+.    ds C` ""
+.    ds C' ""
+'br\}
+.el\{\
+.    ds C`
+.    ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el       .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD.  Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+.    if \nF \{\
+.        de IX
+.        tm Index:\\$1\t\\n%\t"\\$2"
+..
+.        if !\nF==2 \{\
+.            nr % 0
+.            nr F 2
+.        \}
+.    \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "Encode::Unicode 3perl"
+.TH Encode::Unicode 3perl 2024-01-12 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+Encode::Unicode \-\- Various Unicode Transformation Formats
+.SH SYNOPSIS
+.IX Header "SYNOPSIS"
+.Vb 3
+\&    use Encode qw/encode decode/;
+\&    $ucs2 = encode("UCS\-2BE", $utf8);
+\&    $utf8 = decode("UCS\-2BE", $ucs2);
+.Ve
+.SH ABSTRACT
+.IX Header "ABSTRACT"
+This module implements all Character Encoding Schemes of Unicode that
+are officially documented by the Unicode Consortium (except, of course,
+for UTF\-8, which is a native format in perl).
+.IP "<http://www.unicode.org/glossary/> says:" 4
+.IX Item "<http://www.unicode.org/glossary/> says:"
+\&\fICharacter Encoding Scheme\fR A character encoding form plus byte
+serialization. There are seven character encoding schemes in Unicode:
+UTF\-8, UTF\-16, UTF\-16BE, UTF\-16LE, UTF\-32 (UCS\-4), UTF\-32BE (UCS\-4BE), and
+UTF\-32LE (UCS\-4LE).
+.Sp
+Since UTF\-7 is a 7\-bit (re)encoded version of UTF\-16BE, it is not part of
+Unicode's Character Encoding Schemes.  It is implemented separately in
+Encode::Unicode::UTF7.  For details see Encode::Unicode::UTF7.
+.IP "Quick Reference" 4
+.IX Item "Quick Reference"
+.Vb 10
+\&                Decodes from ord(N)           Encodes chr(N) to...
+\&       octet/char BOM S.P d800\-dfff  ord > 0xffff     \ex{1abcd} ==
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\&  UCS\-2BE       2   N   N  is bogus                  Not Available
+\&  UCS\-2LE       2   N   N     bogus                  Not Available
+\&  UTF\-16      2/4   Y   Y  is   S.P           S.P            BE/LE
+\&  UTF\-16BE    2/4   N   Y       S.P           S.P    0xd82a,0xdfcd
+\&  UTF\-16LE    2/4   N   Y       S.P           S.P    0x2ad8,0xcddf
+\&  UTF\-32        4   Y   \-  is bogus         As is            BE/LE
+\&  UTF\-32BE      4   N   \-     bogus         As is       0x0001abcd
+\&  UTF\-32LE      4   N   \-     bogus         As is       0xcdab0100
+\&  UTF\-8       1\-4   \-   \-     bogus   >= 4 octets \exf0\ex9a\exaf\ex8d
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.SH "Size, Endianness, and BOM"
+.IX Header "Size, Endianness, and BOM"
+You can categorize these CES by 3 criteria:  size of each character,
+endianness, and Byte Order Mark.
+.SS "by size"
+.IX Subsection "by size"
+UCS\-2 is a fixed-length encoding with each character taking 16 bits.
+It \fBdoes not\fR support \fIsurrogate pairs\fR.  When a surrogate pair
+is encountered during \fBdecode()\fR, its place is filled with \ex{FFFD}
+if \fICHECK\fR is 0, or the routine croaks if \fICHECK\fR is 1.  When a
+character whose ord value is larger than 0xFFFF is encountered during \fBencode()\fR,
+its place is filled with \ex{FFFD} if \fICHECK\fR is 0, or the routine
+croaks if \fICHECK\fR is 1.
+.PP
+UTF\-16 is almost the same as UCS\-2 but it supports \fIsurrogate pairs\fR.
+When it encounters a high surrogate (0xD800\-0xDBFF), it fetches the
+following low surrogate (0xDC00\-0xDFFF) and \f(CW\*(C`desurrogate\*(C'\fRs them to
+form a character.  Bogus surrogates result in death.  When \ex{10000}
+or above is encountered during \fBencode()\fR, it \f(CW\*(C`ensurrogate\*(C'\fRs it and
+pushes the surrogate pair to the output stream.
+.PP
+UTF\-32 (UCS\-4) is a fixed-length encoding with each character taking 32 bits.
+Since it is 32\-bit, there is no need for \fIsurrogate pairs\fR.
+.SS "by endianness"
+.IX Subsection "by endianness"
+The first (and now failed) goal of Unicode was to map all character
+repertoires into a fixed-length integer so that programmers are happy.
+Since each character is either a \fIshort\fR or \fIlong\fR in C, you have to
+pay attention to the endianness of each platform when you pass data
+from one platform to another.
+.PP
+Anything marked as BE is Big Endian (or network byte order) and LE is
+Little Endian (aka VAX byte order).  For anything not marked either
+BE or LE, a character called Byte Order Mark (BOM) indicating the
+endianness is prepended to the string.
+.PP
+CAVEAT: Though a BOM in utf8 (\exEF\exBB\exBF) is valid, it is meaningless,
+and as of this writing the Encode suite just leaves it as is (\ex{FeFF}).
+.IP "BOM as integer when fetched in network byte order" 4
+.IX Item "BOM as integer when fetched in network byte order"
+.Vb 5
+\&              16         32 bits/char
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\&  BE      0xFeFF 0x0000FeFF
+\&  LE      0xFFFe 0xFFFe0000
+\&  \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.PP
+This module handles the BOM as follows.
+.IP \(bu 4
+When BE or LE is explicitly stated as the name of encoding, BOM is
+simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).
+.IP \(bu 4
+When BE or LE is omitted during \fBdecode()\fR, it checks if BOM is at the
+beginning of the string; if one is found, the endianness is set to
+what the BOM says.
+.IP \(bu 4
+Default Byte Order
+.Sp
+When no BOM is found, Encode 2.76 and below croaked.  Since Encode
+2.77, it falls back to BE according to RFC 2781 and the Unicode
+Standard version 8.0.
+.IP \(bu 4
+When BE or LE is omitted during \fBencode()\fR, it returns a BE-encoded
+string with BOM prepended.  So when you want to encode a whole text
+file, make sure you \fBencode()\fR the whole text at once, not line by line;
+otherwise each line, rather than the file, will have a BOM prepended.
+.IP \(bu 4
+\&\f(CW\*(C`UCS\-2\*(C'\fR is an exception.  Unlike others, this is an alias of UCS\-2BE.
+UCS\-2 is already registered by IANA and others that way.
+.SH "Surrogate Pairs"
+.IX Header "Surrogate Pairs"
+To say the least, surrogate pairs were the biggest mistake of the
+Unicode Consortium.  But according to the late Douglas Adams in \fIThe
+Hitchhiker's Guide to the Galaxy\fR Trilogy, \f(CW\*(C`In the beginning the
+Universe was created. This has made a lot of people very angry and
+been widely regarded as a bad move\*(C'\fR.  Their mistake was not of this
+magnitude so let's forgive them.
+.PP
+(I don't dare make any comparison with Unicode Consortium and the
+Vogons here ;)  Or, comparing Encode to Babel Fish is completely
+appropriate \-\- if you can only stick this into your ear :)
+.PP
+Surrogate pairs were born when the Unicode Consortium finally
+admitted that 16 bits were not big enough to hold all the world's
+character repertoires.  But they already made UCS\-2 16\-bit.  What
+do we do?
+.PP
+Back then, the range 0xD800\-0xDFFF was not allocated.  Let's split
+that range in half and use the first half to represent the \f(CW\*(C`upper
+half of a character\*(C'\fR and the second half to represent the \f(CW\*(C`lower
+half of a character\*(C'\fR.  That way, you can represent 1024 * 1024 =
+1048576 more characters.  Now we can store character ranges up to
+\&\ex{10ffff} even with 16\-bit encodings.  This pair of half-characters is
+now called a \fIsurrogate pair\fR and UTF\-16 is the name of the encoding
+that embraces them.
+.PP
+Here is a formula to ensurrogate a Unicode character \ex{10000} and
+above:
+.PP
+.Vb 2
+\&  $hi = int(($uni \- 0x10000) / 0x400) + 0xD800;
+\&  $lo = ($uni \- 0x10000) % 0x400 + 0xDC00;
+.Ve
+.PP
+And to desurrogate:
+.PP
+.Vb 1
+\& $uni = 0x10000 + ($hi \- 0xD800) * 0x400 + ($lo \- 0xDC00);
+.Ve
+.PP
+Note this move has made \ex{D800}\-\ex{DFFF} into a forbidden zone but
+perl does not prohibit the use of characters within this range.  To perl,
+every one of \ex{0000_0000} up to \ex{ffff_ffff} (*) is \fIa character\fR.
+.PP
+.Vb 2
+\&  (*) or \ex{ffff_ffff_ffff_ffff} if your perl is compiled with 64\-bit
+\&  integer support!
+.Ve
+.SH "Error Checking"
+.IX Header "Error Checking"
+Unlike most encodings which accept various ways to handle errors,
+Unicode encodings simply croak.
+.PP
+.Vb 6
+\&  % perl \-MEncode \-e\*(Aq$_ = "\exfe\exff\exd8\exd9\exda\exdb\e0\en"\*(Aq \e
+\&         \-e\*(AqEncode::from_to($_, "utf16","shift_jis", 0); print\*(Aq
+\&  UTF\-16:Malformed LO surrogate d8d9 at /path/to/Encode.pm line 184.
+\&  % perl \-MEncode \-e\*(Aq$a = "BOM missing"\*(Aq \e
+\&         \-e\*(Aq Encode::from_to($a, "utf16", "shift_jis", 0); print\*(Aq
+\&  UTF\-16:Unrecognised BOM 424f at /path/to/Encode.pm line 184.
+.Ve
+.PP
+Unlike other encodings where mappings are not one-to-one against
+Unicode, UTFs are supposed to map 100% against one another.  So Encode
+is more strict on UTFs.
+.PP
+Consider that "division by zero" of Encode :)
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+Encode, Encode::Unicode::UTF7, <https://www.unicode.org/glossary/>,
+<https://www.unicode.org/faq/utf_bom.html>,
+.PP
+RFC 2781 <http://www.ietf.org/rfc/rfc2781.txt>,
+.PP
+The whole Unicode standard <https://www.unicode.org/standard/standard.html>
+.PP
+Ch. 6 pp. 275 of \f(CW\*(C`Programming Perl (3rd Edition)\*(C'\fR
+by Tom Christiansen, brian d foy & Larry Wall;
+O'Reilly & Associates; ISBN 978\-0\-596\-00492\-7