Diffstat (limited to 'upstream/mageia-cauldron/man3pm/Encode::Supported.3pm')
-rw-r--r-- | upstream/mageia-cauldron/man3pm/Encode::Supported.3pm | 857 |
1 files changed, 857 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man3pm/Encode::Supported.3pm b/upstream/mageia-cauldron/man3pm/Encode::Supported.3pm new file mode 100644 index 00000000..06011c19 --- /dev/null +++ b/upstream/mageia-cauldron/man3pm/Encode::Supported.3pm @@ -0,0 +1,857 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "Encode::Supported 3pm" +.TH Encode::Supported 3pm 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. 
+.if n .ad l +.nh +.SH NAME +Encode::Supported \-\- Encodings supported by Encode +.SH DESCRIPTION +.IX Header "DESCRIPTION" +.SS "Encoding Names" +.IX Subsection "Encoding Names" +Encoding names are case insensitive. White space in names +is ignored. In addition, an encoding may have aliases. +Each encoding has one "canonical" name. The "canonical" +name is chosen from the names of the encoding by picking +the first in the following sequence (with a few exceptions). +.IP \(bu 2 +The name used by the Perl community. That includes 'utf8' and 'ascii'. +Unlike aliases, canonical names directly reach the method so such +frequently used words like 'utf8' don't need to do alias lookups. +.IP \(bu 2 +The MIME name as defined in IETF RFCs. This includes all "iso\-"s. +.IP \(bu 2 +The name in the IANA registry. +.IP \(bu 2 +The name used by the organization that defined it. +.PP +In case \fIde jure\fR canonical names differ from that of the Encode +module, they are always aliased if it ever be implemented. So you can +safely tell if a given encoding is implemented or not just by passing +the canonical name. +.PP +Because of all the alias issues, and because in the general case +encodings have state, "Encode" uses an encoding object internally +once an operation is in progress. +.SH "Supported Encodings" +.IX Header "Supported Encodings" +As of Perl 5.8.0, at least the following encodings are recognized. +Note that unless otherwise specified, they are all case insensitive +(via alias) and all occurrence of spaces are replaced with '\-'. +In other words, "ISO 8859 1" and "iso\-8859\-1" are identical. +.PP +Encodings are categorized and implemented in several different modules +but you don't have to \f(CW\*(C`use Encode::XX\*(C'\fR to make them available for +most cases. Encode.pm will automatically load those modules on demand. +.SS "Built-in Encodings" +.IX Subsection "Built-in Encodings" +The following encodings are always available. 
+.PP +.Vb 8 +\& Canonical Aliases Comments & References +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& ascii US\-ascii ISO\-646\-US [ECMA] +\& ascii\-ctrl Special Encoding +\& iso\-8859\-1 latin1 [ISO] +\& null Special Encoding +\& utf8 UTF\-8 [RFC2279] +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +.Ve +.PP +\&\fInull\fR and \fIascii-ctrl\fR are special. "null" fails for all character +so when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL +CHARACTERS will fall back to character references. Ditto for +"ascii-ctrl" except for control characters. For fallback modes, see +Encode. +.SS "Encode::Unicode \-\- other Unicode encodings" +.IX Subsection "Encode::Unicode -- other Unicode encodings" +Unicode coding schemes other than native utf8 are supported by +Encode::Unicode, which will be autoloaded on demand. +.PP +.Vb 11 +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& UCS\-2BE UCS\-2, iso\-10646\-1 [IANA, UC] +\& UCS\-2LE [UC] +\& UTF\-16 [UC] +\& UTF\-16BE [UC] +\& UTF\-16LE [UC] +\& UTF\-32 [UC] +\& UTF\-32BE UCS\-4 [UC] +\& UTF\-32LE [UC] +\& UTF\-7 [RFC2152] +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +.Ve +.PP +To find how (UCS\-2|UTF\-(16|32))(LE|BE)? differ from one another, +see Encode::Unicode. +.PP +UTF\-7 is a special encoding which "re-encodes" UTF\-16BE into a 7\-bit +encoding. It is implemented separately by Encode::Unicode::UTF7. +.SS "Encode::Byte \-\- Extended ASCII" +.IX Subsection "Encode::Byte -- Extended ASCII" +Encode::Byte implements most single-byte encodings except for +Symbols and EBCDIC. The following encodings are based on single-byte +encodings implemented as extended ASCII. 
Most of them map +\&\ex80\-\exff (upper half) to non-ASCII characters. +.IP "ISO\-8859 and corresponding vendor mappings" 2 +.IX Item "ISO-8859 and corresponding vendor mappings" +Since there are so many, they are presented in table format with +languages and corresponding encoding names by vendors. Note that +the table is sorted in order of ISO\-8859 and the corresponding vendor +mappings are slightly different from that of ISO. See +<http://czyborra.com/charsets/iso8859.html> for details. +.Sp +.Vb 10 +\& Lang/Regions ISO/Other Std. DOS Windows Macintosh Others +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& N. America (ASCII) cp437 AdobeStandardEncoding +\& cp863 (DOSCanadaF) +\& W. Europe iso\-8859\-1 cp850 cp1252 MacRoman nextstep +\& hp\-roman8 +\& cp860 (DOSPortuguese) +\& Cntrl. Europe iso\-8859\-2 cp852 cp1250 MacCentralEurRoman +\& MacCroatian +\& MacRomanian +\& MacRumanian +\& Latin3[1] iso\-8859\-3 +\& Latin4[2] iso\-8859\-4 +\& Cyrillics iso\-8859\-5 cp855 cp1251 MacCyrillic +\& (See also next section) cp866 MacUkrainian +\& Arabic iso\-8859\-6 cp864 cp1256 MacArabic +\& cp1006 MacFarsi +\& Greek iso\-8859\-7 cp737 cp1253 MacGreek +\& cp869 (DOSGreek2) +\& Hebrew iso\-8859\-8 cp862 cp1255 MacHebrew +\& Turkish iso\-8859\-9 cp857 cp1254 MacTurkish +\& Nordics iso\-8859\-10 cp865 +\& cp861 MacIcelandic +\& MacSami +\& Thai iso\-8859\-11[3] cp874 MacThai +\& (iso\-8859\-12 is nonexistent. Reserved for Indics?) +\& Baltics iso\-8859\-13 cp775 cp1257 +\& Celtics iso\-8859\-14 +\& Latin9 [4] iso\-8859\-15 +\& Latin10 iso\-8859\-16 +\& Vietnamese viscii cp1258 MacVietnamese +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& +\& [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859\-9. +\& [2] Baltics. Now on 8859\-10, except for Latvian. 
+\& [3] TIS 620 + Non\-Breaking Space (0xA0 / U+00A0) +\& [4] Nicknamed Latin0; the Euro sign as well as French and Finnish +\& letters that are missing from 8859\-1 were added. +.Ve +.Sp +All cp* are also available as ibm\-*, ms\-*, and windows\-* . See also +<http://czyborra.com/charsets/codepages.html>. +.Sp +Macintosh encodings don't seem to be registered in such entities as +IANA. "Canonical" names in Encode are based upon Apple's Tech Note +1150. See <http://developer.apple.com/technotes/tn/tn1150.html> +for details. +.IP "KOI8 \- De Facto Standard for the Cyrillic world" 2 +.IX Item "KOI8 - De Facto Standard for the Cyrillic world" +Though ISO\-8859 does have ISO\-8859\-5, the KOI8 series is far more +popular in the Net. Encode comes with the following KOI charsets. +For gory details, see <http://czyborra.com/charsets/cyrillic.html> +.Sp +.Vb 5 +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& koi8\-f +\& koi8\-r cp878 [RFC1489] +\& koi8\-u [RFC2319] +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +.Ve +.SS "gsm0338 \- Hentai Latin 1" +.IX Subsection "gsm0338 - Hentai Latin 1" +GSM0338 is for GSM handsets. Though it shares alphanumerals with +ASCII, control character ranges and other parts are mapped very +differently, mainly to store Greek characters. There are also escape +sequences (starting with 0x1B) to cover e.g. the Euro sign. +.PP +This was once handled by Encode::Bytes but because of all those +unusual specifications, Encode 2.20 has relocated the support to +Encode::GSM0338. See Encode::GSM0338 for details. +.IP "gsm0338 support before 2.19" 2 +.IX Item "gsm0338 support before 2.19" +Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not +well-defined and \fBdecode()\fR will return an empty string for them. 
+One possible workaround is +.Sp +.Vb 3 +\& $gsm =~ s/\ex00\ez/\ex00\ex00/; +\& $uni = decode("gsm0338", $gsm); +\& $uni .= "\exA0" if $gsm =~ /\ex1B\ez/; +.Ve +.Sp +Note that the Encode implementation of GSM0338 does not implement the +reuse of Latin capital letters as Greek capital letters (for example, +the 0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL +LETTER ZETA). +.Sp +The GSM0338 is also covered in Encode::Byte even though it is not +an "extended ASCII" encoding. +.SS "CJK: Chinese, Japanese, Korean (Multibyte)" +.IX Subsection "CJK: Chinese, Japanese, Korean (Multibyte)" +Note that Vietnamese is listed above. Also read "Encoding vs Charset" +below. Also note that these are implemented in distinct modules by +countries, due to the size concerns (simplified Chinese is mapped +to 'CN', continental China, while traditional Chinese is mapped to +\&'TW', Taiwan). Please refer to their respective documentation pages. +.IP "Encode::CN \-\- Continental China" 2 +.IX Item "Encode::CN -- Continental China" +.Vb 9 +\& Standard DOS/Win Macintosh Comment/Reference +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& euc\-cn [1] MacChineseSimp +\& (gbk) cp936 [2] +\& gb12345\-raw { GB12345 without CES } +\& gb2312\-raw { GB2312 without CES } +\& hz +\& iso\-ir\-165 +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& +\& [1] GB2312 is aliased to this. See L<Microsoft\-related naming mess> +\& [2] gbk is aliased to this. 
See L<Microsoft\-related naming mess> +.Ve +.IP "Encode::JP \-\- Japan" 2 +.IX Item "Encode::JP -- Japan" +.Vb 11 +\& Standard DOS/Win Macintosh Comment/Reference +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& euc\-jp +\& shiftjis cp932 macJapanese +\& 7bit\-jis +\& iso\-2022\-jp [RFC1468] +\& iso\-2022\-jp\-1 [RFC2237] +\& jis0201\-raw { JIS X 0201 (roman + halfwidth kana) without CES } +\& jis0208\-raw { JIS X 0208 (Kanji + fullwidth kana) without CES } +\& jis0212\-raw { JIS X 0212 (Extended Kanji) without CES } +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +.Ve +.IP "Encode::KR \-\- Korea" 2 +.IX Item "Encode::KR -- Korea" +.Vb 8 +\& Standard DOS/Win Macintosh Comment/Reference +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& euc\-kr MacKorean [RFC1557] +\& cp949 [1] +\& iso\-2022\-kr [RFC1557] +\& johab [KS X 1001:1998, Annex 3] +\& ksc5601\-raw { KSC5601 without CES } +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& +\& [1] ks_c_5601\-1987, (x\-)?windows\-949, and uhc are aliased to this. +\& See below. 
+.Ve +.IP "Encode::TW \-\- Taiwan" 2 +.IX Item "Encode::TW -- Taiwan" +.Vb 5 +\& Standard DOS/Win Macintosh Comment/Reference +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& big5\-eten cp950 MacChineseTrad {big5 aliased to big5\-eten} +\& big5\-hkscs +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +.Ve +.IP "Encode::HanExtra \-\- More Chinese via CPAN" 2 +.IX Item "Encode::HanExtra -- More Chinese via CPAN" +Due to the size concerns, additional Chinese encodings below are +distributed separately on CPAN, under the name Encode::HanExtra. +.Sp +.Vb 8 +\& Standard DOS/Win Macintosh Comment/Reference +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& big5ext CMEX\*(Aqs Big5e Extension +\& big5plus CMEX\*(Aqs Big5+ Extension +\& cccii Chinese Character Code for Information Interchange +\& euc\-tw EUC (Extended Unix Character) +\& gb18030 GBK with Traditional Characters +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +.Ve +.IP "Encode::JIS2K \-\- JIS X 0213 encodings via CPAN" 2 +.IX Item "Encode::JIS2K -- JIS X 0213 encodings via CPAN" +Due to size concerns, additional Japanese encodings below are +distributed separately on CPAN, under the name Encode::JIS2K. 
+.Sp +.Vb 8 +\& Standard DOS/Win Macintosh Comment/Reference +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& euc\-jisx0213 +\& shiftjisx0123 +\& iso\-2022\-jp\-3 +\& jis0213\-1\-raw +\& jis0213\-2\-raw +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +.Ve +.SS "Miscellaneous encodings" +.IX Subsection "Miscellaneous encodings" +.IP Encode::EBCDIC 2 +.IX Item "Encode::EBCDIC" +See perlebcdic for details. +.Sp +.Vb 8 +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& cp37 +\& cp500 +\& cp875 +\& cp1026 +\& cp1047 +\& posix\-bc +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +.Ve +.IP Encode::Symbols 2 +.IX Item "Encode::Symbols" +For symbols and dingbats. +.Sp +.Vb 7 +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& symbol +\& dingbats +\& MacDingbats +\& AdobeZdingbat +\& AdobeSymbol +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +.Ve +.IP Encode::MIME::Header 2 +.IX Item "Encode::MIME::Header" +Strictly speaking, MIME header encoding documented in RFC 2047 is more +of encapsulation than encoding. However, their support in modern +world is imperative so they are supported. 
+.Sp +.Vb 5 +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& MIME\-Header [RFC2047] +\& MIME\-B [RFC2047] +\& MIME\-Q [RFC2047] +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +.Ve +.IP Encode::Guess 2 +.IX Item "Encode::Guess" +This one is not a name of encoding but a utility that lets you pick up +the most appropriate encoding for a data out of given \fIsuspects\fR. See +Encode::Guess for details. +.SH "Unsupported encodings" +.IX Header "Unsupported encodings" +The following encodings are not supported as yet; some because they +are rarely used, some because of technical difficulties. They may +be supported by external modules via CPAN in the future, however. +.IP "ISO\-2022\-JP\-2 [RFC1554]" 2 +.IX Item "ISO-2022-JP-2 [RFC1554]" +Not very popular yet. Needs Unicode Database or equivalent to +implement \fBencode()\fR (because it includes JIS X 0208/0212, KSC5601, and +GB2312 simultaneously, whose code points in Unicode overlap. So you +need to lookup the database to determine to what character set a given +Unicode character should belong). +.IP "ISO\-2022\-CN [RFC1922]" 2 +.IX Item "ISO-2022-CN [RFC1922]" +Not very popular. Needs CNS 11643\-1 and \-2 which are not available in +this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra. +Audrey Tang may add support for this encoding in her module in future. +.IP "Various HP-UX encodings" 2 +.IX Item "Various HP-UX encodings" +The following are unsupported due to the lack of mapping data. +.Sp +.Vb 2 +\& \*(Aq8\*(Aq \- arabic8, greek8, hebrew8, kana8, thai8, and turkish8 +\& \*(Aq15\*(Aq \- japanese15, korean15, and roi15 +.Ve +.IP "Cyrillic encoding ISO\-IR\-111" 2 +.IX Item "Cyrillic encoding ISO-IR-111" +Anton Tagunov doubts its usefulness. 
+.IP "ISO\-8859\-8\-1 [Hebrew]" 2 +.IX Item "ISO-8859-8-1 [Hebrew]" +None of the Encode team knows Hebrew enough (ISO\-8859\-8, cp1255 and +MacHebrew are supported because and just because there were mappings +available at <http://www.unicode.org/>). Contributions welcome. +.IP "ISIRI 3342, Iran System, ISIRI 2900 [Farsi]" 2 +.IX Item "ISIRI 3342, Iran System, ISIRI 2900 [Farsi]" +Ditto. +.IP "Thai encoding TCVN" 2 +.IX Item "Thai encoding TCVN" +Ditto. +.IP "Vietnamese encodings VPS" 2 +.IX Item "Vietnamese encodings VPS" +Though Jungshik Shin has reported that Mozilla supports this encoding, +it was too late before 5.8.0 for us to add it. In the future, it +may be available via a separate module. See +<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> +and +<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut> +if you are interested in helping us. +.IP "Various Mac encodings" 2 +.IX Item "Various Mac encodings" +The following are unsupported due to the lack of mapping data. +.Sp +.Vb 5 +\& MacArmenian, MacBengali, MacBurmese, MacEthiopic +\& MacExtArabic, MacGeorgian, MacKannada, MacKhmer +\& MacLaotian, MacMalayalam, MacMongolian, MacOriya +\& MacSinhalese, MacTamil, MacTelugu, MacTibetan +\& MacVietnamese +.Ve +.Sp +The rest which are already available are based upon the vendor mappings +at <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> . +.IP "(Mac) Indic encodings" 2 +.IX Item "(Mac) Indic encodings" +The maps for the following are available at <http://www.unicode.org/> +but remain unsupported because those encodings need an algorithmical +approach, currently unsupported by \fIenc2xs\fR: +.Sp +.Vb 3 +\& MacDevanagari +\& MacGurmukhi +\& MacGujarati +.Ve +.Sp +For details, please see \f(CW\*(C`Unicode mapping issues and notes:\*(C'\fR at +<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> . 
+.Sp +I believe this issue is prevalent not only for Mac Indics but also in +other Indic encodings, but the above were the only Indic encodings +maps that I could find at <http://www.unicode.org/> . +.SH "Encoding vs. Charset \-\- terminology" +.IX Header "Encoding vs. Charset -- terminology" +We are used to using the term (character) \fIencoding\fR and \fIcharacter +set\fR interchangeably. But just as confusing the terms byte and +character is dangerous and the terms should be differentiated when +needed, we need to differentiate \fIencoding\fR and \fIcharacter set\fR. +.PP +To understand that, here is a description of how we make computers +grok our characters. +.IP \(bu 2 +First we start with which characters to include. We call this +collection of characters \fIcharacter repertoire\fR. +.IP \(bu 2 +Then we have to give each character a unique ID so your computer can +tell the difference between 'a' and 'A'. This itemized character +repertoire is now a \fIcharacter set\fR. +.IP \(bu 2 +If your computer can grow the character set without further +processing, you can go ahead and use it. This is called a \fIcoded +character set\fR (CCS) or \fIraw character encoding\fR. ASCII is used this +way for most cases. +.IP \(bu 2 +But in many cases, especially multi-byte CJK encodings, you have to +tweak a little more. Your network connection may not accept any data +with the Most Significant Bit set, and your computer may not be able to +tell if a given byte is a whole character or just half of it. So you +have to \fIencode\fR the character set to use it. +.Sp +A \fIcharacter encoding scheme\fR (CES) determines how to encode a given +character set, or a set of multiple character sets. 7bit ISO\-2022 is +an example of a CES. You switch between character sets via \fIescape +sequences\fR. +.PP +Technically, or mathematically, speaking, a character set encoded in +such a CES that maps character by character may form a CCS. EUC is such +an example. 
The CES of EUC is as follows: +.IP \(bu 2 +Map ASCII unchanged. +.IP \(bu 2 +Map such a character set that consists of 94 or 96 powered by N +members by adding 0x80 to each byte. +.IP \(bu 2 +You can also use 0x8e and 0x8f to indicate that the following sequence of +characters belongs to yet another character set. To each following byte +is added the value 0x80. +.PP +By carefully looking at the encoded byte sequence, you can find that the +byte sequence conforms a unique number. In that sense, EUC is a CCS +generated by a CES above from up to four CCS (complicated?). UTF\-8 +falls into this category. See "UTF\-8" in perlUnicode to find out how +UTF\-8 maps Unicode to a byte sequence. +.PP +You may also have found out by now why 7bit ISO\-2022 cannot comprise +a CCS. If you look at a byte sequence \ex21\ex21, you can't tell if +it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \exA1\exA1 +so you have no trouble differentiating between "!!". and "\ \ ". +.SH "Encoding Classification (by Anton Tagunov and Dan Kogai)" +.IX Header "Encoding Classification (by Anton Tagunov and Dan Kogai)" +This section tries to classify the supported encodings by their +applicability for information exchange over the Internet and to +choose the most suitable aliases to name them in the context of +such communication. +.IP \(bu 2 +To (en|de)code encodings marked by \f(CW\*(C`(**)\*(C'\fR, you need +\&\f(CW\*(C`Encode::HanExtra\*(C'\fR, available from CPAN. +.PP +Encoding names +.PP +.Vb 3 +\& US\-ASCII UTF\-8 ISO\-8859\-* KOI8\-R +\& Shift_JIS EUC\-JP ISO\-2022\-JP ISO\-2022\-JP\-1 +\& EUC\-KR Big5 GB2312 +.Ve +.PP +are registered with IANA as preferred MIME names and may +be used over the Internet. +.PP +\&\f(CW\*(C`Shift_JIS\*(C'\fR has been officialized by JIS X 0208:1997. +"Microsoft-related naming mess" gives details. +.PP +\&\f(CW\*(C`GB2312\*(C'\fR is the IANA name for \f(CW\*(C`EUC\-CN\*(C'\fR. +See "Microsoft-related naming mess" for details. 
+.PP +\&\f(CW\*(C`GB_2312\-80\*(C'\fR \fIraw\fR encoding is available as \f(CW\*(C`gb2312\-raw\*(C'\fR +with Encode. See Encode::CN for details. +.PP +.Vb 2 +\& EUC\-CN +\& KOI8\-U [RFC2319] +.Ve +.PP +have not been registered with IANA (as of March 2002) but +seem to be supported by major web browsers. +The IANA name for \f(CW\*(C`EUC\-CN\*(C'\fR is \f(CW\*(C`GB2312\*(C'\fR. +.PP +.Vb 1 +\& KS_C_5601\-1987 +.Ve +.PP +is heavily misused. +See "Microsoft-related naming mess" for details. +.PP +\&\f(CW\*(C`KS_C_5601\-1987\*(C'\fR \fIraw\fR encoding is available as \f(CW\*(C`kcs5601\-raw\*(C'\fR +with Encode. See Encode::KR for details. +.PP +.Vb 1 +\& UTF\-16 UTF\-16BE UTF\-16LE +.Ve +.PP +are IANA-registered \f(CW\*(C`charset\*(C'\fRs. See [RFC 2781] for details. +Jungshik Shin reports that UTF\-16 with a BOM is well accepted +by MS IE 5/6 and NS 4/6. Beware however that +.IP \(bu 2 +\&\f(CW\*(C`UTF\-16\*(C'\fR support in any software you're going to be +using/interoperating with has probably been less tested +then \f(CW\*(C`UTF\-8\*(C'\fR support +.IP \(bu 2 +\&\f(CW\*(C`UTF\-8\*(C'\fR coded data seamlessly passes traditional +command piping (\f(CW\*(C`cat\*(C'\fR, \f(CW\*(C`more\*(C'\fR, etc.) while \f(CW\*(C`UTF\-16\*(C'\fR coded +data is likely to cause confusion (with its zero bytes, +for example) +.IP \(bu 2 +it is beyond the power of words to describe the way HTML browsers +encode non\-\f(CW\*(C`ASCII\*(C'\fR form data. To get a general impression, visit +<http://www.alanflavell.org.uk/charset/form\-i18n.html>. +While encoding of form data has stabilized for \f(CW\*(C`UTF\-8\*(C'\fR encoded pages +(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to +expect fun (and cross-browser discrepancies) with \f(CW\*(C`UTF\-16\*(C'\fR encoded +pages! +.PP +The rule of thumb is to use \f(CW\*(C`UTF\-8\*(C'\fR unless you know what +you're doing and unless you really benefit from using \f(CW\*(C`UTF\-16\*(C'\fR. 
+.PP +.Vb 5 +\& ISO\-IR\-165 [RFC1345] +\& VISCII +\& GB 12345 +\& GB 18030 (**) (see links below) +\& EUC\-TW (**) +.Ve +.PP +are totally valid encodings but not registered at IANA. +The names under which they are listed here are probably the +most widely-known names for these encodings and are recommended +names. +.PP +.Vb 1 +\& BIG5PLUS (**) +.Ve +.PP +is a proprietary name. +.SS "Microsoft-related naming mess" +.IX Subsection "Microsoft-related naming mess" +Microsoft products misuse the following names: +.IP KS_C_5601\-1987 2 +.IX Item "KS_C_5601-1987" +Microsoft extension to \f(CW\*(C`EUC\-KR\*(C'\fR. +.Sp +Proper names: \f(CW\*(C`CP949\*(C'\fR, \f(CW\*(C`UHC\*(C'\fR, \f(CW\*(C`x\-windows\-949\*(C'\fR (as used by Mozilla). +.Sp +See <http://lists.w3.org/Archives/Public/ietf\-charsets/2001AprJun/0033.html> +for details. +.Sp +Encode aliases \f(CW\*(C`KS_C_5601\-1987\*(C'\fR to \f(CW\*(C`cp949\*(C'\fR to reflect this common +misusage. \fIRaw\fR \f(CW\*(C`KS_C_5601\-1987\*(C'\fR encoding is available as +\&\f(CW\*(C`kcs5601\-raw\*(C'\fR. +.Sp +See Encode::KR for details. +.IP GB2312 2 +.IX Item "GB2312" +Microsoft extension to \f(CW\*(C`EUC\-CN\*(C'\fR. +.Sp +Proper names: \f(CW\*(C`CP936\*(C'\fR, \f(CW\*(C`GBK\*(C'\fR. +.Sp +\&\f(CW\*(C`GB2312\*(C'\fR has been registered in the \f(CW\*(C`EUC\-CN\*(C'\fR meaning at +IANA. This has partially repaired the situation: Microsoft's +\&\f(CW\*(C`GB2312\*(C'\fR has become a superset of the official \f(CW\*(C`GB2312\*(C'\fR. +.Sp +Encode aliases \f(CW\*(C`GB2312\*(C'\fR to \f(CW\*(C`euc\-cn\*(C'\fR in full agreement with +IANA registration. \f(CW\*(C`cp936\*(C'\fR is supported separately. +\&\fIRaw\fR \f(CW\*(C`GB_2312\-80\*(C'\fR encoding is available as \f(CW\*(C`gb2312\-raw\*(C'\fR. +.Sp +See Encode::CN for details. +.IP Big5 2 +.IX Item "Big5" +Microsoft extension to \f(CW\*(C`Big5\*(C'\fR. +.Sp +Proper name: \f(CW\*(C`CP950\*(C'\fR. +.Sp +Encode separately supports \f(CW\*(C`Big5\*(C'\fR and \f(CW\*(C`cp950\*(C'\fR. 
+.IP Shift_JIS 2 +.IX Item "Shift_JIS" +Microsoft's understanding of \f(CW\*(C`Shift_JIS\*(C'\fR. +.Sp +JIS has not endorsed the full Microsoft standard however. +The official \f(CW\*(C`Shift_JIS\*(C'\fR includes only JIS X 0201 and JIS X 0208 +character sets, while Microsoft has always used \f(CW\*(C`Shift_JIS\*(C'\fR +to encode a wider character repertoire. See \f(CW\*(C`IANA\*(C'\fR registration for +\&\f(CW\*(C`Windows\-31J\*(C'\fR. +.Sp +As a historical predecessor, Microsoft's variant +probably has more rights for the name, though it may be objected +that Microsoft shouldn't have used JIS as part of the name +in the first place. +.Sp +Unambiguous name: \f(CW\*(C`CP932\*(C'\fR. \f(CW\*(C`IANA\*(C'\fR name (also used by Mozilla, and +provided as an alias by Encode): \f(CW\*(C`Windows\-31J\*(C'\fR. +.Sp +Encode separately supports \f(CW\*(C`Shift_JIS\*(C'\fR and \f(CW\*(C`cp932\*(C'\fR. +.SH Glossary +.IX Header "Glossary" +.IP "character repertoire" 2 +.IX Item "character repertoire" +A collection of unique characters. A \fIcharacter\fR set in the strictest +sense. At this stage, characters are not numbered. +.IP "coded character set (CCS)" 2 +.IX Item "coded character set (CCS)" +A character set that is mapped in a way computers can use directly. +Many character encodings, including EUC, fall in this category. +.IP "character encoding scheme (CES)" 2 +.IX Item "character encoding scheme (CES)" +An algorithm to map a character set to a byte sequence. You don't +have to be able to tell which character set a given byte sequence +belongs. 7\-bit ISO\-2022 is a CES but it cannot be a CCS. EUC is an +example of being both a CCS and CES. +.IP "charset (in MIME context)" 2 +.IX Item "charset (in MIME context)" +has long been used in the meaning of \f(CW\*(C`encoding\*(C'\fR, CES. +.Sp +While the word combination \f(CW\*(C`character set\*(C'\fR has lost this meaning +in MIME context since [RFC 2130], the \f(CW\*(C`charset\*(C'\fR abbreviation has +retained it. 
This is how [RFC 2277] and [RFC 2278] bless \f(CW\*(C`charset\*(C'\fR: +.Sp +.Vb 7 +\& This document uses the term "charset" to mean a set of rules for +\& mapping from a sequence of octets to a sequence of characters, such +\& as the combination of a coded character set and a character encoding +\& scheme; this is also what is used as an identifier in MIME "charset=" +\& parameters, and registered in the IANA charset registry ... (Note +\& that this is NOT a term used by other standards bodies, such as ISO). +\& [RFC 2277] +.Ve +.IP EUC 2 +.IX Item "EUC" +Extended Unix Character. See ISO\-2022. +.IP ISO\-2022 2 +.IX Item "ISO-2022" +A CES that was carefully designed to coexist with ASCII. There are a 7 +bit version and an 8 bit version. +.Sp +The 7 bit version switches character set via escape sequence so it +cannot form a CCS. Since this is more difficult to handle in programs +than the 8 bit version, the 7 bit version is not very popular except for +iso\-2022\-jp, the \fIde facto\fR standard CES for e\-mails. +.Sp +The 8 bit version can form a CCS. EUC and ISO\-8859 are two examples +thereof. Pre\-5.6 perl could use them as string literals. +.IP UCS 2 +.IX Item "UCS" +Short for \fIUniversal Character Set\fR. When you say just UCS, it means +\&\fIUnicode\fR. +.IP UCS\-2 2 +.IX Item "UCS-2" +ISO/IEC 10646 encoding form: Universal Character Set coded in two +octets. +.IP Unicode 2 +.IX Item "Unicode" +A character set that aims to include all character repertoires of the +world. Many character sets in various national as well as industrial +standards have become, in a way, just subsets of Unicode. +.IP UTF 2 +.IX Item "UTF" +Short for \fIUnicode Transformation Format\fR. Determines how to map a +Unicode character into a byte sequence. +.IP UTF\-16 2 +.IX Item "UTF-16" +A UTF in 16\-bit encoding. Can either be in big endian or little +endian. 
The big endian version is called UTF\-16BE (equal to UCS\-2 + +surrogate support) and the little endian version is called UTF\-16LE. +.SH "See Also" +.IX Header "See Also" +Encode, +Encode::Byte, +Encode::CN, Encode::JP, Encode::KR, Encode::TW, +Encode::EBCDIC, Encode::Symbol +Encode::MIME::Header, Encode::Guess +.SH References +.IX Header "References" +.IP ECMA 2 +.IX Item "ECMA" +European Computer Manufacturers Association +<http://www.ecma.ch> +.RS 2 +.ie n .IP "ECMA\-035 (eq ""ISO\-2022"")" 2 +.el .IP "ECMA\-035 (eq \f(CWISO\-2022\fR)" 2 +.IX Item "ECMA-035 (eq ISO-2022)" +<http://www.ecma.ch/ecma1/STAND/ECMA\-035.HTM> +.Sp +The specification of ISO\-2022 is available from the link above. +.RE +.RS 2 +.RE +.IP IANA 2 +.IX Item "IANA" +Internet Assigned Numbers Authority +<http://www.iana.org/> +.RS 2 +.IP "Assigned Charset Names by IANA" 2 +.IX Item "Assigned Charset Names by IANA" +<http://www.iana.org/assignments/character\-sets> +.Sp +Most of the \f(CW\*(C`canonical names\*(C'\fR in Encode derive from this list +so you can directly apply the string you have extracted from MIME +header of mails and web pages. +.RE +.RS 2 +.RE +.IP ISO 2 +.IX Item "ISO" +International Organization for Standardization +<http://www.iso.ch/> +.IP RFC 2 +.IX Item "RFC" +Request For Comments \-\- need I say more? +<http://www.rfc\-editor.org/>, <http://www.ietf.org/rfc.html>, +<http://www.faqs.org/rfcs/> +.IP UC 2 +.IX Item "UC" +Unicode Consortium +<http://www.unicode.org/> +.RS 2 +.IP "Unicode Glossary" 2 +.IX Item "Unicode Glossary" +<http://www.unicode.org/glossary/> +.Sp +The glossary of this document is based upon this site. +.RE +.RS 2 +.RE +.SS "Other Notable Sites" +.IX Subsection "Other Notable Sites" +.IP czyborra.com 2 +.IX Item "czyborra.com" +<http://czyborra.com/> +.Sp +Contains a lot of useful information, especially gory details of ISO +vs. vendor mappings. 
+.IP CJK.inf 2 +.IX Item "CJK.inf" +<http://examples.oreilly.com/cjkvinfo/doc/cjk.inf> +.Sp +Somewhat obsolete (last update in 1996), but still useful. Also try +.Sp +<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf> +.Sp +You will find brief info on \f(CW\*(C`EUC\-CN\*(C'\fR, \f(CW\*(C`GBK\*(C'\fR and mostly on \f(CW\*(C`GB 18030\*(C'\fR. +.IP "Jungshik Shin's Hangul FAQ" 2 +.IX Item "Jungshik Shin's Hangul FAQ" +<http://jshin.net/faq> +.Sp +And especially its subject 8. +.Sp +<http://jshin.net/faq/qa8.html> +.Sp +A comprehensive overview of the Korean (\f(CW\*(C`KS *\*(C'\fR) standards. +.IP "debian.org: ""Introduction to i18n""" 2 +.IX Item "debian.org: ""Introduction to i18n""" +A brief description for most of the mentioned CJK encodings is +contained in +<http://www.debian.org/doc/manuals/intro\-i18n/ch\-codes.en.html> +.SS "Offline sources" +.IX Subsection "Offline sources" +.ie n .IP """CJKV Information Processing"" by Ken Lunde" 2 +.el .IP "\f(CWCJKV Information Processing\fR by Ken Lunde" 2 +.IX Item "CJKV Information Processing by Ken Lunde" +CJKV Information Processing +1999 O'Reilly & Associates, ISBN : 1\-56592\-224\-7 +.Sp +The modern successor of \f(CW\*(C`CJK.inf\*(C'\fR. +.Sp +Features a comprehensive coverage of CJKV character sets and +encodings along with many other issues faced by anyone trying +to better support CJKV languages/scripts in all the areas of +information processing. +.Sp +To purchase this book, visit +<http://oreilly.com/catalog/9780596514471/> +or your favourite bookstore. |