author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-15 19:43:11 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-15 19:43:11 +0000 |
commit | fc22b3d6507c6745911b9dfcc68f1e665ae13dbc (patch) | |
tree | ce1e3bce06471410239a6f41282e328770aa404a /upstream/mageia-cauldron/man3pm/encoding.3pm | |
parent | Initial commit. (diff) | |
download | manpages-l10n-fc22b3d6507c6745911b9dfcc68f1e665ae13dbc.tar.xz manpages-l10n-fc22b3d6507c6745911b9dfcc68f1e665ae13dbc.zip |
Adding upstream version 4.22.0. (upstream/4.22.0)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'upstream/mageia-cauldron/man3pm/encoding.3pm')
-rw-r--r-- | upstream/mageia-cauldron/man3pm/encoding.3pm | 544 |
1 files changed, 544 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man3pm/encoding.3pm b/upstream/mageia-cauldron/man3pm/encoding.3pm new file mode 100644 index 00000000..3f7216a9 --- /dev/null +++ b/upstream/mageia-cauldron/man3pm/encoding.3pm @@ -0,0 +1,544 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "encoding 3pm" +.TH encoding 3pm 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. +.if n .ad l +.nh +.SH NAME +encoding \- allows you to write your script in non\-ASCII and non\-UTF\-8 +.SH WARNING +.IX Header "WARNING" +This module has been deprecated since perl v5.18. See "DESCRIPTION" and +"BUGS". 
+.SH SYNOPSIS +.IX Header "SYNOPSIS" +.Vb 2 +\& use encoding "greek"; # Perl like Greek to you? +\& use encoding "euc\-jp"; # Jperl! +\& +\& # or you can even do this if your shell supports your native encoding +\& +\& perl \-Mencoding=latin2 \-e\*(Aq...\*(Aq # Feeling centrally European? +\& perl \-Mencoding=euc\-kr \-e\*(Aq...\*(Aq # Or Korean? +\& +\& # more control +\& +\& # A simple euc\-cn => utf\-8 converter +\& use encoding "euc\-cn", STDOUT => "utf8"; while(<>){print}; +\& +\& # "no encoding;" supported +\& no encoding; +\& +\& # an alternate way, Filter +\& use encoding "euc\-jp", Filter=>1; +\& # now you can use kanji identifiers \-\- in euc\-jp! +\& +\& # encode based on the current locale \- specialized purposes only; +\& # fraught with danger!! +\& use encoding \*(Aq:locale\*(Aq; +.Ve +.SH DESCRIPTION +.IX Header "DESCRIPTION" +This pragma is used to enable a Perl script to be written in encodings that +aren't strictly ASCII nor UTF\-8. It translates all or portions of the Perl +program script from a given encoding into UTF\-8, and changes the PerlIO layers +of \f(CW\*(C`STDIN\*(C'\fR and \f(CW\*(C`STDOUT\*(C'\fR to the encoding specified. +.PP +This pragma dates from the days when UTF\-8\-enabled editors were uncommon. But +that was long ago, and the need for it is greatly diminished. That, coupled +with the fact that it doesn't work with threads, along with other problems, +(see "BUGS") have led to its being deprecated. It is planned to remove this +pragma in a future Perl version. New code should be written in UTF\-8, and the +\&\f(CW\*(C`use utf8\*(C'\fR pragma used instead (see perluniintro and utf8 for details). +Old code should be converted to UTF\-8, via something like the recipe in the +"SYNOPSIS" (though this simple approach may require manual adjustments +afterwards). 
+.PP +If UTF\-8 is not an option, it is recommended that one use a simple source +filter, such as that provided by Filter::Encoding on CPAN or this +pragma's own \f(CW\*(C`Filter\*(C'\fR option (see below). +.PP +The only legitimate use of this pragma is almost certainly just one per file, +near the top, with file scope, as the file is likely going to only be written +in one encoding. Further restrictions apply in Perls before v5.22 (see +"Prior to Perl v5.22"). +.PP +There are two basic modes of operation (plus turning if off): +.ie n .IP """use encoding [\*(Aq\fIENCNAME\fR\*(Aq] ;""" 4 +.el .IP "\f(CWuse encoding [\*(Aq\fR\f(CIENCNAME\fR\f(CW\*(Aq] ;\fR" 4 +.IX Item "use encoding [ENCNAME] ;" +Please note: This mode of operation is no longer supported as of Perl +v5.26. +.Sp +This is the normal operation. It translates various literals encountered in +the Perl source file from the encoding \fIENCNAME\fR into UTF\-8, and similarly +converts character code points. This is used when the script is a combination +of ASCII (for the variable names and punctuation, \fIetc\fR), but the literal +data is in the specified encoding. +.Sp +\&\fIENCNAME\fR is optional. If omitted, the encoding specified in the environment +variable \f(CW\*(C`PERL_ENCODING\*(C'\fR is used. If this isn't +set, or the resolved-to encoding is not known to \f(CW\*(C`Encode\*(C'\fR, the error +\&\f(CW\*(C`Unknown encoding \*(Aq\fR\f(CIENCNAME\fR\f(CW\*(Aq\*(C'\fR will be thrown. +.Sp +Starting in Perl v5.8.6 (\f(CW\*(C`Encode\*(C'\fR version 2.0.1), \fIENCNAME\fR may be the +name \f(CW\*(C`:locale\*(C'\fR. This is for very specialized applications, and is documented +in "The \f(CW\*(C`:locale\*(C'\fR sub-pragma" below. +.Sp +The literals that are converted are \f(CW\*(C`q//, qq//, qr//, qw///, qx//\*(C'\fR, and +starting in v5.8.1, \f(CW\*(C`tr///\*(C'\fR. 
Operations that do conversions include \f(CW\*(C`chr\*(C'\fR, +\&\f(CW\*(C`ord\*(C'\fR, \f(CW\*(C`utf8::upgrade\*(C'\fR (but not \f(CW\*(C`utf8::downgrade\*(C'\fR), and \f(CW\*(C`chomp\*(C'\fR. +.Sp +Also starting in v5.8.1, the \f(CW\*(C`DATA\*(C'\fR pseudo-filehandle is translated from the +encoding into UTF\-8. +.Sp +For example, you can write code in EUC-JP as follows: +.Sp +.Vb 3 +\& my $Rakuda = "\exF1\exD1\exF1\exCC"; # Camel in Kanji +\& #<\-char\-><\-char\-> # 4 octets +\& s/\ebCamel\eb/$Rakuda/; +.Ve +.Sp +And with \f(CW\*(C`use encoding "euc\-jp"\*(C'\fR in effect, it is the same thing as +that code in UTF\-8: +.Sp +.Vb 2 +\& my $Rakuda = "\ex{99F1}\ex{99DD}"; # two Unicode Characters +\& s/\ebCamel\eb/$Rakuda/; +.Ve +.Sp +See "EXAMPLE" below for a more complete example. +.Sp +Unless \f(CW\*(C`${^UNICODE}\*(C'\fR (available starting in v5.8.2) exists and is non-zero, the +PerlIO layers of \f(CW\*(C`STDIN\*(C'\fR and \f(CW\*(C`STDOUT\*(C'\fR are set to "\f(CW:encoding(\fR\f(CIENCNAME\fR\f(CW)\fR". +Therefore, +.Sp +.Vb 5 +\& use encoding "euc\-jp"; +\& my $message = "Camel is the symbol of perl.\en"; +\& my $Rakuda = "\exF1\exD1\exF1\exCC"; # Camel in Kanji +\& $message =~ s/\ebCamel\eb/$Rakuda/; +\& print $message; +.Ve +.Sp +will print +.Sp +.Vb 1 +\& "\exF1\exD1\exF1\exCC is the symbol of perl.\en" +.Ve +.Sp +not +.Sp +.Vb 1 +\& "\ex{99F1}\ex{99DD} is the symbol of perl.\en" +.Ve +.Sp +You can override this by giving extra arguments; see below. +.Sp +Note that \f(CW\*(C`STDERR\*(C'\fR WILL NOT be changed, regardless. +.Sp +Also note that non-STD file handles remain unaffected. Use \f(CW\*(C`use +open\*(C'\fR or \f(CW\*(C`binmode\*(C'\fR to change the layers of those. 
+.ie n .IP """use encoding \fIENCNAME\fR, Filter=>1;""" 4 +.el .IP "\f(CWuse encoding \fR\f(CIENCNAME\fR\f(CW, Filter=>1;\fR" 4 +.IX Item "use encoding ENCNAME, Filter=>1;" +This operates as above, but the \f(CW\*(C`Filter\*(C'\fR argument with a non-zero +value causes the entire script, and not just literals, to be translated from +the encoding into UTF\-8. This allows identifiers in the source to be in that +encoding as well. (Problems may occur if the encoding is not a superset of +ASCII; imagine all your semi-colons being translated into something +different.) One can use this form to make +.Sp +.Vb 1 +\& ${"\ex{4eba}"}++ +.Ve +.Sp +work. (This is equivalent to \f(CW\*(C`$\fR\f(CIhuman\fR\f(CW++\*(C'\fR, where \fIhuman\fR is a single Han +ideograph). +.Sp +This effectively means that your source code behaves as if it were written in +UTF\-8 with \f(CW\*(C`\*(Aquse utf8\*(C'\fR' in effect. So even if your editor only supports +Shift_JIS, for example, you can still try examples in Chapter 15 of +\&\f(CW\*(C`Programming Perl, 3rd Ed.\*(C'\fR. +.Sp +This option is significantly slower than the other one. +.ie n .IP """no encoding;""" 4 +.el .IP "\f(CWno encoding;\fR" 4 +.IX Item "no encoding;" +Unsets the script encoding. The layers of \f(CW\*(C`STDIN\*(C'\fR, \f(CW\*(C`STDOUT\*(C'\fR are +reset to "\f(CW\*(C`:raw\*(C'\fR" (the default unprocessed raw stream of bytes). +.SH OPTIONS +.IX Header "OPTIONS" +.ie n .SS "Setting ""STDIN"" and/or ""STDOUT"" individually" +.el .SS "Setting \f(CWSTDIN\fP and/or \f(CWSTDOUT\fP individually" +.IX Subsection "Setting STDIN and/or STDOUT individually" +The encodings of \f(CW\*(C`STDIN\*(C'\fR and \f(CW\*(C`STDOUT\*(C'\fR are individually settable by parameters to +the pragma: +.PP +.Vb 1 +\& use encoding \*(Aqeuc\-tw\*(Aq, STDIN => \*(Aqgreek\*(Aq ...; +.Ve +.PP +In this case, you cannot omit the first \fIENCNAME\fR. \f(CW\*(C`STDIN => undef\*(C'\fR +turns the I/O transcoding completely off for that filehandle. 
+.PP +When \f(CW\*(C`${^UNICODE}\*(C'\fR (available starting in v5.8.2) exists and is non-zero, +these options will be completely ignored. See "\f(CW\*(C`${^UNICODE}\*(C'\fR" in perlvar and +"\f(CW\*(C`\-C\*(C'\fR" in perlrun for details. +.ie n .SS "The "":locale"" sub-pragma" +.el .SS "The \f(CW:locale\fP sub-pragma" +.IX Subsection "The :locale sub-pragma" +Starting in v5.8.6, the encoding name may be \f(CW\*(C`:locale\*(C'\fR. This means that the +encoding is taken from the current locale, and not hard-coded by the pragma. +Since a script really can only be encoded in exactly one encoding, this option +is dangerous. It makes sense only if the script itself is written in ASCII, +and all the possible locales that will be in use when the script is executed +are supersets of ASCII. That means that the script itself doesn't get +changed, but the I/O handles have the specified encoding added, and the +operations like \f(CW\*(C`chr\*(C'\fR and \f(CW\*(C`ord\*(C'\fR use that encoding. +.PP +The logic of finding which locale \f(CW\*(C`:locale\*(C'\fR uses is as follows: +.IP 1. 4 +If the platform supports the \f(CWlanginfo(CODESET)\fR interface, the codeset +returned is used as the default encoding for the open pragma. +.IP 2. 4 +If 1. didn't work but we are under the locale pragma, the environment +variables \f(CW\*(C`LC_ALL\*(C'\fR and \f(CW\*(C`LANG\*(C'\fR (in that order) are matched for encodings +(the part after "\f(CW\*(C`.\*(C'\fR", if any), and if any found, that is used +as the default encoding for the open pragma. +.IP 3. 4 +If 1. and 2. didn't work, the environment variables \f(CW\*(C`LC_ALL\*(C'\fR and \f(CW\*(C`LANG\*(C'\fR +(in that order) are matched for anything looking like UTF\-8, and if +any found, \f(CW\*(C`:utf8\*(C'\fR is used as the default encoding for the open +pragma. 
+.PP +If your locale environment variables (\f(CW\*(C`LC_ALL\*(C'\fR, \f(CW\*(C`LC_CTYPE\*(C'\fR, \f(CW\*(C`LANG\*(C'\fR) +contain the strings 'UTF\-8' or 'UTF8' (case-insensitive matching), +the default encoding of your \f(CW\*(C`STDIN\*(C'\fR, \f(CW\*(C`STDOUT\*(C'\fR, and \f(CW\*(C`STDERR\*(C'\fR, and of +\&\fBany subsequent file open\fR, is UTF\-8. +.SH CAVEATS +.IX Header "CAVEATS" +.SS "SIDE EFFECTS" +.IX Subsection "SIDE EFFECTS" +.IP \(bu 4 +If the \f(CW\*(C`encoding\*(C'\fR pragma is in scope then the lengths returned are +calculated from the length of \f(CW$/\fR in Unicode characters, which is not +always the same as the length of \f(CW$/\fR in the native encoding. +.IP \(bu 4 +Without this pragma, if strings operating under byte semantics and strings +with Unicode character data are concatenated, the new string will +be created by decoding the byte strings as \fIISO 8859\-1 (Latin\-1)\fR. +.Sp +The \fBencoding\fR pragma changes this to use the specified encoding +instead. For example: +.Sp +.Vb 5 +\& use encoding \*(Aqutf8\*(Aq; +\& my $string = chr(20000); # a Unicode string +\& utf8::encode($string); # now it\*(Aqs a UTF\-8 encoded byte string +\& # concatenate with another Unicode string +\& print length($string . chr(20000)); +.Ve +.Sp +Will print \f(CW2\fR, because \f(CW$string\fR is upgraded as UTF\-8. Without +\&\f(CW\*(C`use encoding \*(Aqutf8\*(Aq;\*(C'\fR, it will print \f(CW4\fR instead, since \f(CW$string\fR +is three octets when interpreted as Latin\-1. +.SS "DO NOT MIX MULTIPLE ENCODINGS" +.IX Subsection "DO NOT MIX MULTIPLE ENCODINGS" +Notice that only literals (string or regular expression) having only +legacy code points are affected: if you mix data like this +.PP +.Vb 2 +\& \ex{100}\exDF +\& \exDF\ex{100} +.Ve +.PP +the data is assumed to be in (Latin 1 and) Unicode, not in your native +encoding. 
In other words, this will match in "greek": +.PP +.Vb 1 +\& "\exDF" =~ /\ex{3af}/ +.Ve +.PP +but this will not +.PP +.Vb 1 +\& "\exDF\ex{100}" =~ /\ex{3af}\ex{100}/ +.Ve +.PP +since the \f(CW\*(C`\exDF\*(C'\fR (ISO 8859\-7 GREEK SMALL LETTER IOTA WITH TONOS) on +the left will \fBnot\fR be upgraded to \f(CW\*(C`\ex{3af}\*(C'\fR (Unicode GREEK SMALL +LETTER IOTA WITH TONOS) because of the \f(CW\*(C`\ex{100}\*(C'\fR on the left. You +should not be mixing your legacy data and Unicode in the same string. +.PP +This pragma also affects encoding of the 0x80..0xFF code point range: +normally characters in that range are left as eight-bit bytes (unless +they are combined with characters with code points 0x100 or larger, +in which case all characters need to become UTF\-8 encoded), but if +the \f(CW\*(C`encoding\*(C'\fR pragma is present, even the 0x80..0xFF range always +gets UTF\-8 encoded. +.PP +After all, the best thing about this pragma is that you don't have to +resort to \ex{....} just to spell your name in a native encoding. +So feel free to put your strings in your encoding in quotes and +regexes. +.SS "Prior to Perl v5.22" +.IX Subsection "Prior to Perl v5.22" +The pragma was a per script, not a per block lexical. Only the last +\&\f(CW\*(C`use encoding\*(C'\fR or \f(CW\*(C`no encoding\*(C'\fR mattered, and it affected +\&\fBthe whole script\fR. However, the \f(CW\*(C`no encoding\*(C'\fR pragma was supported and +\&\f(CW\*(C`use encoding\*(C'\fR could appear as many times as you want in a given script +(though only the last was effective). +.PP +Since the scope wasn't lexical, other modules' use of \f(CW\*(C`chr\*(C'\fR, \f(CW\*(C`ord\*(C'\fR, \fIetc.\fR +were affected. This leads to spooky, incorrect action at a distance that is +hard to debug. 
+.PP +This means you would have to be very careful of the load order: +.PP +.Vb 5 +\& # called module +\& package Module_IN_BAR; +\& use encoding "bar"; +\& # stuff in "bar" encoding here +\& 1; +\& +\& # caller script +\& use encoding "foo" +\& use Module_IN_BAR; +\& # surprise! use encoding "bar" is in effect. +.Ve +.PP +The best way to avoid this oddity is to use this pragma RIGHT AFTER +other modules are loaded. i.e. +.PP +.Vb 2 +\& use Module_IN_BAR; +\& use encoding "foo"; +.Ve +.SS "Prior to Encode version 1.87" +.IX Subsection "Prior to Encode version 1.87" +.IP \(bu 4 +\&\f(CW\*(C`STDIN\*(C'\fR and \f(CW\*(C`STDOUT\*(C'\fR were not set under the filter option. +And \f(CW\*(C`STDIN=>\fR\f(CIENCODING\fR\f(CW\*(C'\fR and \f(CW\*(C`STDOUT=>\fR\f(CIENCODING\fR\f(CW\*(C'\fR didn't work like +non-filter version. +.IP \(bu 4 +\&\f(CW\*(C`use utf8\*(C'\fR wasn't implicitly declared so you have to \f(CW\*(C`use utf8\*(C'\fR to do +.Sp +.Vb 1 +\& ${"\ex{4eba}"}++ +.Ve +.SS "Prior to Perl v5.8.1" +.IX Subsection "Prior to Perl v5.8.1" +.IP """NON-EUC"" doublebyte encodings" 4 +.IX Item """NON-EUC"" doublebyte encodings" +Because perl needs to parse the script before applying this pragma, such +encodings as Shift_JIS and Big\-5 that may contain \f(CW\*(Aq\e\*(Aq\fR (BACKSLASH; +\&\f(CW\*(C`\ex5c\*(C'\fR) in the second byte fail because the second byte may +accidentally escape the quoting character that follows. +.ie n .IP """tr///""" 4 +.el .IP \f(CWtr///\fR 4 +.IX Item "tr///" +The \fBencoding\fR pragma works by decoding string literals in +\&\f(CW\*(C`q//,qq//,qr//,qw///, qx//\*(C'\fR and so forth. In perl v5.8.0, this +does not apply to \f(CW\*(C`tr///\*(C'\fR. Therefore, +.Sp +.Vb 4 +\& use encoding \*(Aqeuc\-jp\*(Aq; +\& #.... 
+\& $kana =~ tr/\exA4\exA1\-\exA4\exF3/\exA5\exA1\-\exA5\exF3/; +\& # \-\-\-\-\-\-\-\- \-\-\-\-\-\-\-\- \-\-\-\-\-\-\-\- \-\-\-\-\-\-\-\- +.Ve +.Sp +Does not work as +.Sp +.Vb 1 +\& $kana =~ tr/\ex{3041}\-\ex{3093}/\ex{30a1}\-\ex{30f3}/; +.Ve +.RS 4 +.IP "Legend of characters above" 4 +.IX Item "Legend of characters above" +.Vb 6 +\& utf8 euc\-jp charnames::viacode() +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& \ex{3041} \exA4\exA1 HIRAGANA LETTER SMALL A +\& \ex{3093} \exA4\exF3 HIRAGANA LETTER N +\& \ex{30a1} \exA5\exA1 KATAKANA LETTER SMALL A +\& \ex{30f3} \exA5\exF3 KATAKANA LETTER N +.Ve +.RE +.RS 4 +.Sp +This counterintuitive behavior has been fixed in perl v5.8.1. +.Sp +In perl v5.8.0, you can work around this as follows; +.Sp +.Vb 3 +\& use encoding \*(Aqeuc\-jp\*(Aq; +\& # .... +\& eval qq{ \e$kana =~ tr/\exA4\exA1\-\exA4\exF3/\exA5\exA1\-\exA5\exF3/ }; +.Ve +.Sp +Note the \f(CW\*(C`tr//\*(C'\fR expression is surrounded by \f(CW\*(C`qq{}\*(C'\fR. The idea behind +this is the same as the classic idiom that makes \f(CW\*(C`tr///\*(C'\fR 'interpolate': +.Sp +.Vb 2 +\& tr/$from/$to/; # wrong! +\& eval qq{ tr/$from/$to/ }; # workaround. +.Ve +.RE +.SH "EXAMPLE \- Greekperl" +.IX Header "EXAMPLE - Greekperl" +.Vb 1 +\& use encoding "iso 8859\-7"; +\& +\& # \exDF in ISO 8859\-7 (Greek) is \ex{3af} in Unicode. +\& +\& $a = "\exDF"; +\& $b = "\ex{100}"; +\& +\& printf "%#x\en", ord($a); # will print 0x3af, not 0xdf +\& +\& $c = $a . $b; +\& +\& # $c will be "\ex{3af}\ex{100}", not "\ex{df}\ex{100}". +\& +\& # chr() is affected, and ... +\& +\& print "mega\en" if ord(chr(0xdf)) == 0x3af; +\& +\& # ... ord() is affected by the encoding pragma ... +\& +\& print "tera\en" if ord(pack("C", 0xdf)) == 0x3af; +\& +\& # ... as are eq and cmp ... +\& +\& print "peta\en" if "\ex{3af}" eq pack("C", 0xdf); +\& print "exa\en" if "\ex{3af}" cmp pack("C", 0xdf) == 0; +\& +\& # ... 
but pack/unpack C are not affected, in case you still +\& # want to go back to your native encoding +\& +\& print "zetta\en" if unpack("C", (pack("C", 0xdf))) == 0xdf; +.Ve +.SH BUGS +.IX Header "BUGS" +.IP "Thread safety" 4 +.IX Item "Thread safety" +\&\f(CW\*(C`use encoding ...\*(C'\fR is not thread-safe (i.e., do not use in threaded +applications). +.IP "Can't be used by more than one module in a single program." 4 +.IX Item "Can't be used by more than one module in a single program." +Only one encoding is allowed. If you combine modules in a program that have +different encodings, only one will be actually used. +.ie n .IP "Other modules using ""STDIN"" and ""STDOUT"" get the encoded stream" 4 +.el .IP "Other modules using \f(CWSTDIN\fR and \f(CWSTDOUT\fR get the encoded stream" 4 +.IX Item "Other modules using STDIN and STDOUT get the encoded stream" +They may be expecting something completely different. +.IP "literals in regex that are longer than 127 bytes" 4 +.IX Item "literals in regex that are longer than 127 bytes" +For native multibyte encodings (either fixed or variable length), +the current implementation of the regular expressions may introduce +recoding errors for regular expression literals longer than 127 bytes. +.IP EBCDIC 4 +.IX Item "EBCDIC" +The encoding pragma is not supported on EBCDIC platforms. +.ie n .IP """format""" 4 +.el .IP \f(CWformat\fR 4 +.IX Item "format" +This pragma doesn't work well with \f(CW\*(C`format\*(C'\fR because PerlIO does not +get along very well with it. When \f(CW\*(C`format\*(C'\fR contains non-ASCII +characters it prints funny or gets "wide character warnings". +To understand it, try the code below. +.Sp +.Vb 11 +\& # Save this one in utf8 +\& # replace *non\-ascii* with a non\-ascii string +\& my $camel; +\& format STDOUT = +\& *non\-ascii*@>>>>>>> +\& $camel +\& . +\& $camel = "*non\-ascii*"; +\& binmode(STDOUT=>\*(Aq:encoding(utf8)\*(Aq); # bang! 
+\& write; # funny +\& print $camel, "\en"; # fine +.Ve +.Sp +Without binmode this happens to work but without binmode, \fBprint()\fR +fails instead of \fBwrite()\fR. +.Sp +At any rate, the very use of \f(CW\*(C`format\*(C'\fR is questionable when it comes to +unicode characters since you have to consider such things as character +width (i.e. double-width for ideographs) and directions (i.e. BIDI for +Arabic and Hebrew). +.IP "See also ""CAVEATS""" 4 +.IX Item "See also ""CAVEATS""" +.SH HISTORY +.IX Header "HISTORY" +This pragma first appeared in Perl v5.8.0. It has been enhanced in later +releases as specified above. +.SH "SEE ALSO" +.IX Header "SEE ALSO" +perlunicode, Encode, open, Filter::Util::Call, +.PP +Ch. 15 of \f(CW\*(C`Programming Perl (3rd Edition)\*(C'\fR +by Larry Wall, Tom Christiansen, Jon Orwant; +O'Reilly & Associates; ISBN 0\-596\-00027\-8 |