author Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 19:43:11 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 19:43:11 +0000
commit fc22b3d6507c6745911b9dfcc68f1e665ae13dbc
tree ce1e3bce06471410239a6f41282e328770aa404a /upstream/mageia-cauldron/man3pm/Encode::Supported.3pm
parent Initial commit.
Adding upstream version 4.22.0. (upstream/4.22.0)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'upstream/mageia-cauldron/man3pm/Encode::Supported.3pm')
-rw-r--r-- upstream/mageia-cauldron/man3pm/Encode::Supported.3pm 857
1 files changed, 857 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man3pm/Encode::Supported.3pm b/upstream/mageia-cauldron/man3pm/Encode::Supported.3pm
new file mode 100644
index 00000000..06011c19
--- /dev/null
+++ b/upstream/mageia-cauldron/man3pm/Encode::Supported.3pm
@@ -0,0 +1,857 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+. ds C` ""
+. ds C' ""
+'br\}
+.el\{\
+. ds C`
+. ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD. Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+. if \nF \{\
+. de IX
+. tm Index:\\$1\t\\n%\t"\\$2"
+..
+. if !\nF==2 \{\
+. nr % 0
+. nr F 2
+. \}
+. \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "Encode::Supported 3pm"
+.TH Encode::Supported 3pm 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification. Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+Encode::Supported \-\- Encodings supported by Encode
+.SH DESCRIPTION
+.IX Header "DESCRIPTION"
+.SS "Encoding Names"
+.IX Subsection "Encoding Names"
+Encoding names are case insensitive. White space in names
+is ignored. In addition, an encoding may have aliases.
+Each encoding has one "canonical" name. The "canonical"
+name is chosen from the names of the encoding by picking
+the first in the following sequence (with a few exceptions).
+.IP \(bu 2
+The name used by the Perl community. That includes 'utf8' and 'ascii'.
+Unlike aliases, canonical names reach the method directly, so
+frequently used names such as 'utf8' need no alias lookup.
+.IP \(bu 2
+The MIME name as defined in IETF RFCs. This includes all "iso\-"s.
+.IP \(bu 2
+The name in the IANA registry.
+.IP \(bu 2
+The name used by the organization that defined it.
+.PP
+Where a \fIde jure\fR canonical name differs from the name used by the
+Encode module, it is always aliased, provided the encoding is implemented
+at all. So you can safely tell whether a given encoding is implemented
+just by passing the canonical name.
+.PP
+Because of all the alias issues, and because in the general case
+encodings have state, "Encode" uses an encoding object internally
+once an operation is in progress.
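The lookup rules above can be sketched in a few lines of Perl. This is an illustration added for clarity, not part of the upstream page: `find_encoding` and `resolve_alias` are documented functions of the Encode module, while the sample names (including the deliberately bogus `no-such-encoding`) are assumptions chosen for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# Look up a few names; find_encoding() returns an encoding object,
# or undef when the name is not recognized.
for my $name ('latin1', 'LATIN1', 'US-ascii', 'no-such-encoding') {
    my $enc = find_encoding($name);
    printf "%-18s => %s\n", $name,
        defined($enc) ? $enc->name : '(not implemented)';
}

# resolve_alias() returns the canonical name as a plain string,
# or a false value when the name cannot be resolved.
my $canonical = resolve_alias('latin1');
print "canonical name of latin1: $canonical\n";
```

Checking the return value of `find_encoding` is the practical way to tell whether an encoding is implemented before attempting a conversion.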
+.SH "Supported Encodings"
+.IX Header "Supported Encodings"
+As of Perl 5.8.0, at least the following encodings are recognized.
+Note that unless otherwise specified, they are all case insensitive
+(via alias) and all occurrences of spaces are replaced with '\-'.
+In other words, "ISO 8859 1" and "iso\-8859\-1" are identical.
+.PP
+Encodings are categorized and implemented in several different modules,
+but in most cases you don't have to \f(CW\*(C`use Encode::XX\*(C'\fR to make them
+available. Encode.pm will automatically load those modules on demand.
+.SS "Built-in Encodings"
+.IX Subsection "Built-in Encodings"
+The following encodings are always available.
+.PP
+.Vb 8
+\& Canonical   Aliases               Comments & References
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& ascii       US\-ascii ISO\-646\-US   [ECMA]
+\& ascii\-ctrl                        Special Encoding
+\& iso\-8859\-1  latin1                [ISO]
+\& null                              Special Encoding
+\& utf8        UTF\-8                 [RFC2279]
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.PP
+\&\fInull\fR and \fIascii-ctrl\fR are special. "null" fails for all characters,
+so when you set the fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
+CHARACTERS will fall back to character references. Ditto for
+"ascii-ctrl" except for control characters. For fallback modes, see
+Encode.
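As a hedged illustration of the fallback behaviour described above (an added sketch, not upstream text): the `FB_HTMLCREF`, `FB_PERLQQ` and `FB_XMLCREF` constants are documented in Encode, and the sample string is an assumption chosen for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";    # "cafe" with U+00E9 as the last letter

# With "ascii", only the non-ASCII character falls back.
my $ascii_html = encode('ascii', $text, Encode::FB_HTMLCREF);
my $ascii_perl = encode('ascii', $text, Encode::FB_PERLQQ);

# With "null", every character fails to encode, so every character
# should be replaced by a character reference.
my $null_xml = encode('null', $text, Encode::FB_XMLCREF);

print "$_\n" for $ascii_html, $ascii_perl, $null_xml;
```

The `FB_*` constants used here leave the source string untouched, so `$text` can be reused for all three calls.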
+.SS "Encode::Unicode \-\- other Unicode encodings"
+.IX Subsection "Encode::Unicode -- other Unicode encodings"
+Unicode coding schemes other than native utf8 are supported by
+Encode::Unicode, which will be autoloaded on demand.
+.PP
+.Vb 11
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& UCS\-2BE     UCS\-2, iso\-10646\-1      [IANA, UC]
+\& UCS\-2LE                             [UC]
+\& UTF\-16                              [UC]
+\& UTF\-16BE                            [UC]
+\& UTF\-16LE                            [UC]
+\& UTF\-32                              [UC]
+\& UTF\-32BE    UCS\-4                   [UC]
+\& UTF\-32LE                            [UC]
+\& UTF\-7                               [RFC2152]
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.PP
+To find out how (UCS\-2|UTF\-(16|32))(LE|BE)? differ from one another,
+see Encode::Unicode.
+.PP
+UTF\-7 is a special encoding which "re-encodes" UTF\-16BE into a 7\-bit
+encoding. It is implemented separately by Encode::Unicode::UTF7.
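One visible difference between the plain and the BE/LE variants is the byte order mark. The following is a minimal sketch added for illustration; `hex_bytes` is a hypothetical helper defined here, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "A";

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord $_ } split //, $_[0] }

# Without an explicit byte order, encode() emits big-endian data
# preceded by a byte order mark (BOM).
my $utf16   = encode('UTF-16',   $text);
# With an explicit byte order, no BOM is written.
my $utf16be = encode('UTF-16BE', $text);
my $utf16le = encode('UTF-16LE', $text);

print 'UTF-16   : ', hex_bytes($utf16),   "\n";
print 'UTF-16BE : ', hex_bytes($utf16be), "\n";
print 'UTF-16LE : ', hex_bytes($utf16le), "\n";

# When decoding plain "UTF-16", the BOM selects the byte order.
my $decoded = decode('UTF-16', "\xFF\xFE" . "A\x00");
print "decoded  : $decoded\n";
```

See Encode::Unicode for the authoritative description of BOM handling.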
+.SS "Encode::Byte \-\- Extended ASCII"
+.IX Subsection "Encode::Byte -- Extended ASCII"
+Encode::Byte implements most single-byte encodings except for
+Symbols and EBCDIC. The following encodings are based on single-byte
+encodings implemented as extended ASCII. Most of them map
+\&\ex80\-\exff (upper half) to non-ASCII characters.
+.IP "ISO\-8859 and corresponding vendor mappings" 2
+.IX Item "ISO-8859 and corresponding vendor mappings"
+Since there are so many, they are presented in table format with
+languages and corresponding encoding names by vendors. Note that
+the table is sorted in order of ISO\-8859 and the corresponding vendor
+mappings are slightly different from those of ISO. See
+<http://czyborra.com/charsets/iso8859.html> for details.
+.Sp
+.Vb 10
+\& Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& N. America    (ASCII)         cp437                      AdobeStandardEncoding
+\&                               cp863 (DOSCanadaF)
+\& W. Europe     iso\-8859\-1      cp850   cp1252  MacRoman   nextstep
+\&                                                          hp\-roman8
+\&                               cp860 (DOSPortuguese)
+\& Cntrl. Europe iso\-8859\-2      cp852   cp1250  MacCentralEurRoman
+\&                                               MacCroatian
+\&                                               MacRomanian
+\&                                               MacRumanian
+\& Latin3[1]     iso\-8859\-3
+\& Latin4[2]     iso\-8859\-4
+\& Cyrillics     iso\-8859\-5      cp855   cp1251  MacCyrillic
+\&   (See also next section)     cp866           MacUkrainian
+\& Arabic        iso\-8859\-6      cp864   cp1256  MacArabic
+\&                               cp1006          MacFarsi
+\& Greek         iso\-8859\-7      cp737   cp1253  MacGreek
+\&                               cp869 (DOSGreek2)
+\& Hebrew        iso\-8859\-8      cp862   cp1255  MacHebrew
+\& Turkish       iso\-8859\-9      cp857   cp1254  MacTurkish
+\& Nordics       iso\-8859\-10     cp865
+\&                               cp861           MacIcelandic
+\&                                               MacSami
+\& Thai          iso\-8859\-11[3]  cp874           MacThai
+\& (iso\-8859\-12 is nonexistent. Reserved for Indics?)
+\& Baltics       iso\-8859\-13     cp775   cp1257
+\& Celtics       iso\-8859\-14
+\& Latin9 [4]    iso\-8859\-15
+\& Latin10       iso\-8859\-16
+\& Vietnamese    viscii                  cp1258  MacVietnamese
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\&
+\& [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859\-9.
+\& [2] Baltics. Now on 8859\-10, except for Latvian.
+\& [3] TIS 620 + Non\-Breaking Space (0xA0 / U+00A0)
+\& [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
+\& letters that are missing from 8859\-1 were added.
+.Ve
+.Sp
+All cp* are also available as ibm\-*, ms\-*, and windows\-*. See also
+<http://czyborra.com/charsets/codepages.html>.
+.Sp
+Macintosh encodings don't seem to be registered in such entities as
+IANA. "Canonical" names in Encode are based upon Apple's Tech Note
+1150. See <http://developer.apple.com/technotes/tn/tn1150.html>
+for details.
+.IP "KOI8 \- De Facto Standard for the Cyrillic world" 2
+.IX Item "KOI8 - De Facto Standard for the Cyrillic world"
+Though ISO\-8859 does have ISO\-8859\-5, the KOI8 series is far more
+popular on the Net. Encode comes with the following KOI charsets.
+For gory details, see <http://czyborra.com/charsets/cyrillic.html>.
+.Sp
+.Vb 5
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& koi8\-f
+\& koi8\-r      cp878                   [RFC1489]
+\& koi8\-u                              [RFC2319]
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.SS "gsm0338 \- Hentai Latin 1"
+.IX Subsection "gsm0338 - Hentai Latin 1"
+GSM0338 is for GSM handsets. Though it shares alphanumerics with
+ASCII, control character ranges and other parts are mapped very
+differently, mainly to store Greek characters. There are also escape
+sequences (starting with 0x1B) to cover e.g. the Euro sign.
+.PP
+This was once handled by Encode::Byte but because of all those
+unusual specifications, Encode 2.20 has relocated the support to
+Encode::GSM0338. See Encode::GSM0338 for details.
+.IP "gsm0338 support before 2.19" 2
+.IX Item "gsm0338 support before 2.19"
+Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not
+well-defined and \fBdecode()\fR will return an empty string for them.
+One possible workaround is
+.Sp
+.Vb 3
+\& $gsm =~ s/\ex00\ez/\ex00\ex00/;
+\& $uni = decode("gsm0338", $gsm);
+\& $uni .= "\exA0" if $gsm =~ /\ex1B\ez/;
+.Ve
+.Sp
+Note that the Encode implementation of GSM0338 does not implement the
+reuse of Latin capital letters as Greek capital letters (for example,
+0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
+LETTER ZETA)).
+.Sp
+The GSM0338 is also covered in Encode::Byte even though it is not
+an "extended ASCII" encoding.
+.SS "CJK: Chinese, Japanese, Korean (Multibyte)"
+.IX Subsection "CJK: Chinese, Japanese, Korean (Multibyte)"
+Note that Vietnamese is listed above. Also read "Encoding vs Charset"
+below. Also note that these are implemented in separate modules by
+country, due to size concerns (simplified Chinese is mapped
+to 'CN', continental China, while traditional Chinese is mapped to
+\&'TW', Taiwan). Please refer to their respective documentation pages.
+.IP "Encode::CN \-\- Continental China" 2
+.IX Item "Encode::CN -- Continental China"
+.Vb 9
+\& Standard      DOS/Win   Macintosh       Comment/Reference
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& euc\-cn [1]              MacChineseSimp
+\& (gbk)         cp936 [2]
+\& gb12345\-raw                             { GB12345 without CES }
+\& gb2312\-raw                              { GB2312 without CES }
+\& hz
+\& iso\-ir\-165
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\&
+\& [1] GB2312 is aliased to this. See L<Microsoft\-related naming mess>
+\& [2] gbk is aliased to this. See L<Microsoft\-related naming mess>
+.Ve
+.IP "Encode::JP \-\- Japan" 2
+.IX Item "Encode::JP -- Japan"
+.Vb 11
+\& Standard      DOS/Win   Macintosh       Comment/Reference
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& euc\-jp
+\& shiftjis      cp932     macJapanese
+\& 7bit\-jis
+\& iso\-2022\-jp                             [RFC1468]
+\& iso\-2022\-jp\-1                           [RFC2237]
+\& jis0201\-raw   { JIS X 0201 (roman + halfwidth kana) without CES }
+\& jis0208\-raw   { JIS X 0208 (Kanji + fullwidth kana) without CES }
+\& jis0212\-raw   { JIS X 0212 (Extended Kanji) without CES }
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.IP "Encode::KR \-\- Korea" 2
+.IX Item "Encode::KR -- Korea"
+.Vb 8
+\& Standard      DOS/Win   Macintosh       Comment/Reference
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& euc\-kr                  MacKorean       [RFC1557]
+\&               cp949 [1]
+\& iso\-2022\-kr                             [RFC1557]
+\& johab                                   [KS X 1001:1998, Annex 3]
+\& ksc5601\-raw                             { KSC5601 without CES }
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\&
+\& [1] ks_c_5601\-1987, (x\-)?windows\-949, and uhc are aliased to this.
+\& See below.
+.Ve
+.IP "Encode::TW \-\- Taiwan" 2
+.IX Item "Encode::TW -- Taiwan"
+.Vb 5
+\& Standard      DOS/Win   Macintosh       Comment/Reference
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& big5\-eten     cp950     MacChineseTrad  {big5 aliased to big5\-eten}
+\& big5\-hkscs
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.IP "Encode::HanExtra \-\- More Chinese via CPAN" 2
+.IX Item "Encode::HanExtra -- More Chinese via CPAN"
+Due to size concerns, the additional Chinese encodings below are
+distributed separately on CPAN, under the name Encode::HanExtra.
+.Sp
+.Vb 8
+\& Standard      DOS/Win   Macintosh       Comment/Reference
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& big5ext                                 CMEX\*(Aqs Big5e Extension
+\& big5plus                                CMEX\*(Aqs Big5+ Extension
+\& cccii         Chinese Character Code for Information Interchange
+\& euc\-tw        EUC (Extended Unix Code)
+\& gb18030       GBK with Traditional Characters
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.IP "Encode::JIS2K \-\- JIS X 0213 encodings via CPAN" 2
+.IX Item "Encode::JIS2K -- JIS X 0213 encodings via CPAN"
+Due to size concerns, additional Japanese encodings below are
+distributed separately on CPAN, under the name Encode::JIS2K.
+.Sp
+.Vb 8
+\& Standard      DOS/Win   Macintosh       Comment/Reference
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& euc\-jisx0213
+\& shiftjisx0213
+\& iso\-2022\-jp\-3
+\& jis0213\-1\-raw
+\& jis0213\-2\-raw
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.SS "Miscellaneous encodings"
+.IX Subsection "Miscellaneous encodings"
+.IP Encode::EBCDIC 2
+.IX Item "Encode::EBCDIC"
+See perlebcdic for details.
+.Sp
+.Vb 8
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& cp37
+\& cp500
+\& cp875
+\& cp1026
+\& cp1047
+\& posix\-bc
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.IP Encode::Symbol 2
+.IX Item "Encode::Symbol"
+For symbols and dingbats.
+.Sp
+.Vb 7
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& symbol
+\& dingbats
+\& MacDingbats
+\& AdobeZdingbat
+\& AdobeSymbol
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.IP Encode::MIME::Header 2
+.IX Item "Encode::MIME::Header"
+Strictly speaking, the MIME header encoding documented in RFC 2047 is
+more of an encapsulation than an encoding. However, supporting it is
+imperative in the modern world, so it is supported.
+.Sp
+.Vb 5
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& MIME\-Header                         [RFC2047]
+\& MIME\-B                              [RFC2047]
+\& MIME\-Q                              [RFC2047]
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+.Ve
+.IP Encode::Guess 2
+.IX Item "Encode::Guess"
+This one is not the name of an encoding but a utility that lets you pick
+the most appropriate encoding for your data out of given \fIsuspects\fR. See
+Encode::Guess for details.
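A minimal sketch of guessing among suspects, added for illustration. `guess_encoding` is exported by Encode::Guess as documented there; the sample data (two hiragana letters stored as EUC-JP) and the single extra suspect are assumptions chosen to keep the guess unambiguous.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# Sample data (an assumption for illustration): two hiragana
# letters stored as EUC-JP bytes.
my $bytes = encode('euc-jp', "\x{3042}\x{3044}");

# Pass the data plus the additional suspects to consider.
my $enc = guess_encoding($bytes, qw(euc-jp));

if (ref $enc) {
    # Success: an encoding object that can decode the data.
    my $text = $enc->decode($bytes);
    printf "guessed %s, %d characters\n", $enc->name, length $text;
}
else {
    # Failure: guess_encoding() returns an error message string.
    print "could not guess: $enc\n";
}
```

Listing several multibyte suspects at once (for example both euc-jp and shiftjis) can make short inputs ambiguous, in which case the error branch is taken.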
+.SH "Unsupported encodings"
+.IX Header "Unsupported encodings"
+The following encodings are not supported as yet; some because they
+are rarely used, some because of technical difficulties. They may
+be supported by external modules via CPAN in the future, however.
+.IP "ISO\-2022\-JP\-2 [RFC1554]" 2
+.IX Item "ISO-2022-JP-2 [RFC1554]"
+Not very popular yet. Needs the Unicode Database or an equivalent to
+implement \fBencode()\fR, because it includes JIS X 0208/0212, KSC5601, and
+GB2312 simultaneously, whose code points in Unicode overlap. So you
+need to look up the database to determine to which character set a given
+Unicode character should belong.
+.IP "ISO\-2022\-CN [RFC1922]" 2
+.IX Item "ISO-2022-CN [RFC1922]"
+Not very popular. Needs CNS 11643\-1 and \-2, which are not available in
+this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
+Audrey Tang may add support for this encoding in her module in the future.
+.IP "Various HP-UX encodings" 2
+.IX Item "Various HP-UX encodings"
+The following are unsupported due to the lack of mapping data.
+.Sp
+.Vb 2
+\& \*(Aq8\*(Aq \- arabic8, greek8, hebrew8, kana8, thai8, and turkish8
+\& \*(Aq15\*(Aq \- japanese15, korean15, and roi15
+.Ve
+.IP "Cyrillic encoding ISO\-IR\-111" 2
+.IX Item "Cyrillic encoding ISO-IR-111"
+Anton Tagunov doubts its usefulness.
+.IP "ISO\-8859\-8\-1 [Hebrew]" 2
+.IX Item "ISO-8859-8-1 [Hebrew]"
+None of the Encode team knows Hebrew well enough (ISO\-8859\-8, cp1255 and
+MacHebrew are supported only because there were mappings
+available at <http://www.unicode.org/>). Contributions welcome.
+.IP "ISIRI 3342, Iran System, ISIRI 2900 [Farsi]" 2
+.IX Item "ISIRI 3342, Iran System, ISIRI 2900 [Farsi]"
+Ditto.
+.IP "Thai encoding TCVN" 2
+.IX Item "Thai encoding TCVN"
+Ditto.
+.IP "Vietnamese encodings VPS" 2
+.IX Item "Vietnamese encodings VPS"
+Though Jungshik Shin has reported that Mozilla supports this encoding,
+it was too late before 5.8.0 for us to add it. In the future, it
+may be available via a separate module. See
+<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
+and
+<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
+if you are interested in helping us.
+.IP "Various Mac encodings" 2
+.IX Item "Various Mac encodings"
+The following are unsupported due to the lack of mapping data.
+.Sp
+.Vb 5
+\& MacArmenian, MacBengali, MacBurmese, MacEthiopic
+\& MacExtArabic, MacGeorgian, MacKannada, MacKhmer
+\& MacLaotian, MacMalayalam, MacMongolian, MacOriya
+\& MacSinhalese, MacTamil, MacTelugu, MacTibetan
+\& MacVietnamese
+.Ve
+.Sp
+The rest, which are already available, are based upon the vendor mappings
+at <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/>.
+.IP "(Mac) Indic encodings" 2
+.IX Item "(Mac) Indic encodings"
+The maps for the following are available at <http://www.unicode.org/>
+but remain unsupported because those encodings need an algorithmic
+approach, currently unsupported by \fIenc2xs\fR:
+.Sp
+.Vb 3
+\& MacDevanagari
+\& MacGurmukhi
+\& MacGujarati
+.Ve
+.Sp
+For details, please see \f(CW\*(C`Unicode mapping issues and notes:\*(C'\fR at
+<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT>.
+.Sp
+I believe this issue is prevalent not only for Mac Indics but also in
+other Indic encodings, but the above were the only Indic encoding
+maps that I could find at <http://www.unicode.org/>.
+.SH "Encoding vs. Charset \-\- terminology"
+.IX Header "Encoding vs. Charset -- terminology"
+We are used to using the term (character) \fIencoding\fR and \fIcharacter
+set\fR interchangeably. But just as confusing the terms byte and
+character is dangerous and the terms should be differentiated when
+needed, we need to differentiate \fIencoding\fR and \fIcharacter set\fR.
+.PP
+To understand that, here is a description of how we make computers
+grok our characters.
+.IP \(bu 2
+First we start with which characters to include. We call this
+collection of characters a \fIcharacter repertoire\fR.
+.IP \(bu 2
+Then we have to give each character a unique ID so your computer can
+tell the difference between 'a' and 'A'. This itemized character
+repertoire is now a \fIcharacter set\fR.
+.IP \(bu 2
+If your computer can grok the character set without further
+processing, you can go ahead and use it. This is called a \fIcoded
+character set\fR (CCS) or \fIraw character encoding\fR. ASCII is used this
+way in most cases.
+.IP \(bu 2
+But in many cases, especially multi-byte CJK encodings, you have to
+tweak a little more. Your network connection may not accept any data
+with the Most Significant Bit set, and your computer may not be able to
+tell if a given byte is a whole character or just half of it. So you
+have to \fIencode\fR the character set to use it.
+.Sp
+A \fIcharacter encoding scheme\fR (CES) determines how to encode a given
+character set, or a set of multiple character sets. 7bit ISO\-2022 is
+an example of a CES. You switch between character sets via \fIescape
+sequences\fR.
+.PP
+Technically, or mathematically, speaking, a character set encoded in
+such a CES that maps character by character may form a CCS. EUC is such
+an example. The CES of EUC is as follows:
+.IP \(bu 2
+Map ASCII unchanged.
+.IP \(bu 2
+Map a character set that consists of 94 or 96 to the power of N
+members by adding 0x80 to each byte.
+.IP \(bu 2
+You can also use 0x8e and 0x8f to indicate that the following sequence of
+characters belongs to yet another character set. The value 0x80 is added
+to each following byte.
+.PP
+By carefully looking at the encoded byte sequence, you can find that the
+byte sequence forms a unique number. In that sense, EUC is a CCS
+generated by the CES above from up to four CCS (complicated?). UTF\-8
+falls into this category. See "UTF\-8" in perlunicode to find out how
+UTF\-8 maps Unicode to a byte sequence.
+.PP
+You may also have found out by now why 7bit ISO\-2022 cannot comprise
+a CCS. If you look at a byte sequence \ex21\ex21, you can't tell if
+it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \exA1\exA1
+so you have no trouble differentiating between "!!" and "\ \ ".
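The relationship between a raw coded character set and its EUC form can be checked with a short sketch. This example is an added illustration, not upstream text; it relies on the `jis0208-raw` and `euc-jp` encodings listed under Encode::JP and uses IDEOGRAPHIC SPACE (U+3000) as the sample character.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

# jis0208-raw is the coded character set without any CES applied.
my $raw = encode('jis0208-raw', $space);
# euc-jp applies the EUC scheme to the same character set.
my $euc = encode('euc-jp', $space);

# Adding 0x80 to each raw byte should reproduce the EUC bytes.
my $shifted = join '', map { chr(ord($_) + 0x80) } split //, $raw;

printf "raw     : %s\n", unpack('H*', $raw);
printf "euc-jp  : %s\n", unpack('H*', $euc);
printf "raw+0x80: %s\n", unpack('H*', $shifted);
print  "raw bytes are indistinguishable from two '!' characters\n"
    if $raw eq '!!';
```

In 7bit ISO\-2022 the same raw bytes travel behind an escape sequence, which is why a byte pair there cannot be interpreted without knowing the current shift state.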
+.SH "Encoding Classification (by Anton Tagunov and Dan Kogai)"
+.IX Header "Encoding Classification (by Anton Tagunov and Dan Kogai)"
+This section tries to classify the supported encodings by their
+applicability for information exchange over the Internet and to
+choose the most suitable aliases to name them in the context of
+such communication.
+.IP \(bu 2
+To (en|de)code encodings marked by \f(CW\*(C`(**)\*(C'\fR, you need
+\&\f(CW\*(C`Encode::HanExtra\*(C'\fR, available from CPAN.
+.PP
+Encoding names
+.PP
+.Vb 3
+\& US\-ASCII UTF\-8 ISO\-8859\-* KOI8\-R
+\& Shift_JIS EUC\-JP ISO\-2022\-JP ISO\-2022\-JP\-1
+\& EUC\-KR Big5 GB2312
+.Ve
+.PP
+are registered with IANA as preferred MIME names and may
+be used over the Internet.
+.PP
+\&\f(CW\*(C`Shift_JIS\*(C'\fR has been made official by JIS X 0208:1997.
+"Microsoft-related naming mess" gives details.
+.PP
+\&\f(CW\*(C`GB2312\*(C'\fR is the IANA name for \f(CW\*(C`EUC\-CN\*(C'\fR.
+See "Microsoft-related naming mess" for details.
+.PP
+\&\f(CW\*(C`GB_2312\-80\*(C'\fR \fIraw\fR encoding is available as \f(CW\*(C`gb2312\-raw\*(C'\fR
+with Encode. See Encode::CN for details.
+.PP
+.Vb 2
+\& EUC\-CN
+\& KOI8\-U [RFC2319]
+.Ve
+.PP
+have not been registered with IANA (as of March 2002) but
+seem to be supported by major web browsers.
+The IANA name for \f(CW\*(C`EUC\-CN\*(C'\fR is \f(CW\*(C`GB2312\*(C'\fR.
+.PP
+.Vb 1
+\& KS_C_5601\-1987
+.Ve
+.PP
+is heavily misused.
+See "Microsoft-related naming mess" for details.
+.PP
+\&\f(CW\*(C`KS_C_5601\-1987\*(C'\fR \fIraw\fR encoding is available as \f(CW\*(C`ksc5601\-raw\*(C'\fR
+with Encode. See Encode::KR for details.
+.PP
+.Vb 1
+\& UTF\-16 UTF\-16BE UTF\-16LE
+.Ve
+.PP
+are IANA-registered \f(CW\*(C`charset\*(C'\fRs. See [RFC 2781] for details.
+Jungshik Shin reports that UTF\-16 with a BOM is well accepted
+by MS IE 5/6 and NS 4/6. Beware however that
+.IP \(bu 2
+\&\f(CW\*(C`UTF\-16\*(C'\fR support in any software you're going to be
+using/interoperating with has probably been less tested
+than \f(CW\*(C`UTF\-8\*(C'\fR support
+.IP \(bu 2
+\&\f(CW\*(C`UTF\-8\*(C'\fR coded data seamlessly passes traditional
+command piping (\f(CW\*(C`cat\*(C'\fR, \f(CW\*(C`more\*(C'\fR, etc.) while \f(CW\*(C`UTF\-16\*(C'\fR coded
+data is likely to cause confusion (with its zero bytes,
+for example)
+.IP \(bu 2
+it is beyond the power of words to describe the way HTML browsers
+encode non\-\f(CW\*(C`ASCII\*(C'\fR form data. To get a general impression, visit
+<http://www.alanflavell.org.uk/charset/form\-i18n.html>.
+While encoding of form data has stabilized for \f(CW\*(C`UTF\-8\*(C'\fR encoded pages
+(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
+expect fun (and cross-browser discrepancies) with \f(CW\*(C`UTF\-16\*(C'\fR encoded
+pages!
+.PP
+The rule of thumb is to use \f(CW\*(C`UTF\-8\*(C'\fR unless you know what
+you're doing and unless you really benefit from using \f(CW\*(C`UTF\-16\*(C'\fR.
+.PP
+.Vb 5
+\& ISO\-IR\-165 [RFC1345]
+\& VISCII
+\& GB 12345
+\& GB 18030 (**) (see links below)
+\& EUC\-TW (**)
+.Ve
+.PP
+are totally valid encodings but not registered at IANA.
+The names under which they are listed here are probably the
+most widely-known names for these encodings and are recommended
+names.
+.PP
+.Vb 1
+\& BIG5PLUS (**)
+.Ve
+.PP
+is a proprietary name.
+.SS "Microsoft-related naming mess"
+.IX Subsection "Microsoft-related naming mess"
+Microsoft products misuse the following names:
+.IP KS_C_5601\-1987 2
+.IX Item "KS_C_5601-1987"
+Microsoft extension to \f(CW\*(C`EUC\-KR\*(C'\fR.
+.Sp
+Proper names: \f(CW\*(C`CP949\*(C'\fR, \f(CW\*(C`UHC\*(C'\fR, \f(CW\*(C`x\-windows\-949\*(C'\fR (as used by Mozilla).
+.Sp
+See <http://lists.w3.org/Archives/Public/ietf\-charsets/2001AprJun/0033.html>
+for details.
+.Sp
+Encode aliases \f(CW\*(C`KS_C_5601\-1987\*(C'\fR to \f(CW\*(C`cp949\*(C'\fR to reflect this common
+misuse. \fIRaw\fR \f(CW\*(C`KS_C_5601\-1987\*(C'\fR encoding is available as
+\&\f(CW\*(C`ksc5601\-raw\*(C'\fR.
+.Sp
+See Encode::KR for details.
+.IP GB2312 2
+.IX Item "GB2312"
+Microsoft extension to \f(CW\*(C`EUC\-CN\*(C'\fR.
+.Sp
+Proper names: \f(CW\*(C`CP936\*(C'\fR, \f(CW\*(C`GBK\*(C'\fR.
+.Sp
+\&\f(CW\*(C`GB2312\*(C'\fR has been registered in the \f(CW\*(C`EUC\-CN\*(C'\fR meaning at
+IANA. This has partially repaired the situation: Microsoft's
+\&\f(CW\*(C`GB2312\*(C'\fR has become a superset of the official \f(CW\*(C`GB2312\*(C'\fR.
+.Sp
+Encode aliases \f(CW\*(C`GB2312\*(C'\fR to \f(CW\*(C`euc\-cn\*(C'\fR in full agreement with
+IANA registration. \f(CW\*(C`cp936\*(C'\fR is supported separately.
+\&\fIRaw\fR \f(CW\*(C`GB_2312\-80\*(C'\fR encoding is available as \f(CW\*(C`gb2312\-raw\*(C'\fR.
+.Sp
+See Encode::CN for details.
+.IP Big5 2
+.IX Item "Big5"
+Microsoft extension to \f(CW\*(C`Big5\*(C'\fR.
+.Sp
+Proper name: \f(CW\*(C`CP950\*(C'\fR.
+.Sp
+Encode separately supports \f(CW\*(C`Big5\*(C'\fR and \f(CW\*(C`cp950\*(C'\fR.
+.IP Shift_JIS 2
+.IX Item "Shift_JIS"
+Microsoft's understanding of \f(CW\*(C`Shift_JIS\*(C'\fR.
+.Sp
+JIS has not endorsed the full Microsoft standard, however.
+The official \f(CW\*(C`Shift_JIS\*(C'\fR includes only JIS X 0201 and JIS X 0208
+character sets, while Microsoft has always used \f(CW\*(C`Shift_JIS\*(C'\fR
+to encode a wider character repertoire. See \f(CW\*(C`IANA\*(C'\fR registration for
+\&\f(CW\*(C`Windows\-31J\*(C'\fR.
+.Sp
+As a historical predecessor, Microsoft's variant
+probably has more rights for the name, though it may be objected
+that Microsoft shouldn't have used JIS as part of the name
+in the first place.
+.Sp
+Unambiguous name: \f(CW\*(C`CP932\*(C'\fR. \f(CW\*(C`IANA\*(C'\fR name (also used by Mozilla, and
+provided as an alias by Encode): \f(CW\*(C`Windows\-31J\*(C'\fR.
+.Sp
+Encode separately supports \f(CW\*(C`Shift_JIS\*(C'\fR and \f(CW\*(C`cp932\*(C'\fR.
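Because Encode keeps `shiftjis` and `cp932` as separate encodings, the same bytes can decode to different characters. The sketch below is an added illustration; the byte pair 0x81 0x60 is an assumption chosen because the two mapping tables are known to disagree on it (WAVE DASH versus FULLWIDTH TILDE).

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\x81\x60";

# Decode the same two bytes with each table.
my $sjis  = decode('shiftjis', $bytes);
my $cp932 = decode('cp932',    $bytes);

printf "shiftjis: U+%04X\n", ord $sjis;
printf "cp932   : U+%04X\n", ord $cp932;
```

When data originates from Microsoft products, decoding with `cp932` rather than `shiftjis` is therefore usually the safer assumption.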
+.SH Glossary
+.IX Header "Glossary"
+.IP "character repertoire" 2
+.IX Item "character repertoire"
+A collection of unique characters. A \fIcharacter set\fR in the strictest
+sense. At this stage, characters are not numbered.
+.IP "coded character set (CCS)" 2
+.IX Item "coded character set (CCS)"
+A character set that is mapped in a way computers can use directly.
+Many character encodings, including EUC, fall in this category.
+.IP "character encoding scheme (CES)" 2
+.IX Item "character encoding scheme (CES)"
+An algorithm to map a character set to a byte sequence. You don't
+have to be able to tell which character set a given byte sequence
+belongs to. 7\-bit ISO\-2022 is a CES but it cannot be a CCS. EUC is an
+example of being both a CCS and a CES.
+.IP "charset (in MIME context)" 2
+.IX Item "charset (in MIME context)"
+has long been used in the meaning of \f(CW\*(C`encoding\*(C'\fR, CES.
+.Sp
+While the word combination \f(CW\*(C`character set\*(C'\fR has lost this meaning
+in MIME context since [RFC 2130], the \f(CW\*(C`charset\*(C'\fR abbreviation has
+retained it. This is how [RFC 2277] and [RFC 2278] bless \f(CW\*(C`charset\*(C'\fR:
+.Sp
+.Vb 7
+\& This document uses the term "charset" to mean a set of rules for
+\& mapping from a sequence of octets to a sequence of characters, such
+\& as the combination of a coded character set and a character encoding
+\& scheme; this is also what is used as an identifier in MIME "charset="
+\& parameters, and registered in the IANA charset registry ... (Note
+\& that this is NOT a term used by other standards bodies, such as ISO).
+\& [RFC 2277]
+.Ve
+.IP EUC 2
+.IX Item "EUC"
+Extended Unix Code. See ISO\-2022.
+.IP ISO\-2022 2
+.IX Item "ISO-2022"
+A CES that was carefully designed to coexist with ASCII. There is a
+7 bit version and an 8 bit version.
+.Sp
+The 7 bit version switches character set via escape sequence so it
+cannot form a CCS. Since this is more difficult to handle in programs
+than the 8 bit version, the 7 bit version is not very popular except for
+iso\-2022\-jp, the \fIde facto\fR standard CES for e\-mails.
+.Sp
+The 8 bit version can form a CCS. EUC and ISO\-8859 are two examples
+thereof. Pre\-5.6 perl could use them as string literals.
+.IP UCS 2
+.IX Item "UCS"
+Short for \fIUniversal Character Set\fR. When you say just UCS, it means
+\&\fIUnicode\fR.
+.IP UCS\-2 2
+.IX Item "UCS-2"
+ISO/IEC 10646 encoding form: Universal Character Set coded in two
+octets.
+.IP Unicode 2
+.IX Item "Unicode"
+A character set that aims to include all character repertoires of the
+world. Many character sets in various national as well as industrial
+standards have become, in a way, just subsets of Unicode.
+.IP UTF 2
+.IX Item "UTF"
+Short for \fIUnicode Transformation Format\fR. Determines how to map a
+Unicode character into a byte sequence.
+.IP UTF\-16 2
+.IX Item "UTF-16"
+A UTF in 16\-bit encoding. Can either be in big endian or little
+endian. The big endian version is called UTF\-16BE (equal to UCS\-2 +
+surrogate support) and the little endian version is called UTF\-16LE.
+.SH "See Also"
+.IX Header "See Also"
+Encode,
+Encode::Byte,
+Encode::CN, Encode::JP, Encode::KR, Encode::TW,
+Encode::EBCDIC, Encode::Symbol,
+Encode::MIME::Header, Encode::Guess
+.SH References
+.IX Header "References"
+.IP ECMA 2
+.IX Item "ECMA"
+European Computer Manufacturers Association
+<http://www.ecma.ch>
+.RS 2
+.ie n .IP "ECMA\-035 (eq ""ISO\-2022"")" 2
+.el .IP "ECMA\-035 (eq \f(CWISO\-2022\fR)" 2
+.IX Item "ECMA-035 (eq ISO-2022)"
+<http://www.ecma.ch/ecma1/STAND/ECMA\-035.HTM>
+.Sp
+The specification of ISO\-2022 is available from the link above.
+.RE
+.RS 2
+.RE
+.IP IANA 2
+.IX Item "IANA"
+Internet Assigned Numbers Authority
+<http://www.iana.org/>
+.RS 2
+.IP "Assigned Charset Names by IANA" 2
+.IX Item "Assigned Charset Names by IANA"
+<http://www.iana.org/assignments/character\-sets>
+.Sp
+Most of the \f(CW\*(C`canonical names\*(C'\fR in Encode derive from this list
+so you can directly apply the string you have extracted from the MIME
+headers of mail messages and web pages.
+.RE
+.RS 2
+.RE
+.IP ISO 2
+.IX Item "ISO"
+International Organization for Standardization
+<http://www.iso.ch/>
+.IP RFC 2
+.IX Item "RFC"
+Request For Comments \-\- need I say more?
+<http://www.rfc\-editor.org/>, <http://www.ietf.org/rfc.html>,
+<http://www.faqs.org/rfcs/>
+.IP UC 2
+.IX Item "UC"
+Unicode Consortium
+<http://www.unicode.org/>
+.RS 2
+.IP "Unicode Glossary" 2
+.IX Item "Unicode Glossary"
+<http://www.unicode.org/glossary/>
+.Sp
+The glossary of this document is based upon this site.
+.RE
+.RS 2
+.RE
+.SS "Other Notable Sites"
+.IX Subsection "Other Notable Sites"
+.IP czyborra.com 2
+.IX Item "czyborra.com"
+<http://czyborra.com/>
+.Sp
+Contains a lot of useful information, especially gory details of ISO
+vs. vendor mappings.
+.IP CJK.inf 2
+.IX Item "CJK.inf"
+<http://examples.oreilly.com/cjkvinfo/doc/cjk.inf>
+.Sp
+Somewhat obsolete (last update in 1996), but still useful. Also try
+.Sp
+<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
+.Sp
+You will find brief info on \f(CW\*(C`EUC\-CN\*(C'\fR, \f(CW\*(C`GBK\*(C'\fR and mostly on \f(CW\*(C`GB 18030\*(C'\fR.
+.IP "Jungshik Shin's Hangul FAQ" 2
+.IX Item "Jungshik Shin's Hangul FAQ"
+<http://jshin.net/faq>
+.Sp
+And especially its subject 8.
+.Sp
+<http://jshin.net/faq/qa8.html>
+.Sp
+A comprehensive overview of the Korean (\f(CW\*(C`KS *\*(C'\fR) standards.
+.IP "debian.org: ""Introduction to i18n""" 2
+.IX Item "debian.org: ""Introduction to i18n"""
+A brief description for most of the mentioned CJK encodings is
+contained in
+<http://www.debian.org/doc/manuals/intro\-i18n/ch\-codes.en.html>
+.SS "Offline sources"
+.IX Subsection "Offline sources"
+.ie n .IP """CJKV Information Processing"" by Ken Lunde" 2
+.el .IP "\f(CWCJKV Information Processing\fR by Ken Lunde" 2
+.IX Item "CJKV Information Processing by Ken Lunde"
+CJKV Information Processing
+1999 O'Reilly & Associates, ISBN : 1\-56592\-224\-7
+.Sp
+The modern successor of \f(CW\*(C`CJK.inf\*(C'\fR.
+.Sp
+Features comprehensive coverage of CJKV character sets and
+encodings along with many other issues faced by anyone trying
+to better support CJKV languages/scripts in all the areas of
+information processing.
+.Sp
+To purchase this book, visit
+<http://oreilly.com/catalog/9780596514471/>
+or your favourite bookstore.