author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-15 19:43:11 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-15 19:43:11 +0000 |
commit | fc22b3d6507c6745911b9dfcc68f1e665ae13dbc (patch) | |
tree | ce1e3bce06471410239a6f41282e328770aa404a /upstream/mageia-cauldron/man3pm/Encode.3pm | |
parent | Initial commit. (diff) | |
download | manpages-l10n-fc22b3d6507c6745911b9dfcc68f1e665ae13dbc.tar.xz manpages-l10n-fc22b3d6507c6745911b9dfcc68f1e665ae13dbc.zip |
Adding upstream version 4.22.0. upstream/4.22.0
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'upstream/mageia-cauldron/man3pm/Encode.3pm')
-rw-r--r-- | upstream/mageia-cauldron/man3pm/Encode.3pm | 881 |
1 files changed, 881 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man3pm/Encode.3pm b/upstream/mageia-cauldron/man3pm/Encode.3pm new file mode 100644 index 00000000..e5ec2aec --- /dev/null +++ b/upstream/mageia-cauldron/man3pm/Encode.3pm @@ -0,0 +1,881 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "Encode 3pm" +.TH Encode 3pm 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. 
+.if n .ad l +.nh +.SH NAME +Encode \- character encodings in Perl +.SH SYNOPSIS +.IX Header "SYNOPSIS" +.Vb 3 +\& use Encode qw(decode encode); +\& $characters = decode(\*(AqUTF\-8\*(Aq, $octets, Encode::FB_CROAK); +\& $octets = encode(\*(AqUTF\-8\*(Aq, $characters, Encode::FB_CROAK); +.Ve +.SS "Table of Contents" +.IX Subsection "Table of Contents" +Encode consists of a collection of modules whose details are too extensive +to fit in one document. This one itself explains the top-level APIs +and general topics at a glance. For other topics and more details, +see the documentation for these modules: +.IP "Encode::Alias \- Alias definitions to encodings" 2 +.IX Item "Encode::Alias - Alias definitions to encodings" +.PD 0 +.IP "Encode::Encoding \- Encode Implementation Base Class" 2 +.IX Item "Encode::Encoding - Encode Implementation Base Class" +.IP "Encode::Supported \- List of Supported Encodings" 2 +.IX Item "Encode::Supported - List of Supported Encodings" +.IP "Encode::CN \- Simplified Chinese Encodings" 2 +.IX Item "Encode::CN - Simplified Chinese Encodings" +.IP "Encode::JP \- Japanese Encodings" 2 +.IX Item "Encode::JP - Japanese Encodings" +.IP "Encode::KR \- Korean Encodings" 2 +.IX Item "Encode::KR - Korean Encodings" +.IP "Encode::TW \- Traditional Chinese Encodings" 2 +.IX Item "Encode::TW - Traditional Chinese Encodings" +.PD +.SH DESCRIPTION +.IX Header "DESCRIPTION" +The \f(CW\*(C`Encode\*(C'\fR module provides the interface between Perl strings +and the rest of the system. Perl strings are sequences of +\&\fIcharacters\fR. +.PP +The repertoire of characters that Perl can represent is a superset of those +defined by the Unicode Consortium. On most platforms the ordinal +values of a character as returned by \f(CWord(\fR\f(CIS\fR\f(CW)\fR is the \fIUnicode +codepoint\fR for that character. The exceptions are platforms where +the legacy encoding is some variant of EBCDIC rather than a superset +of ASCII; see perlebcdic. 
+.PP +During recent history, data is moved around a computer in 8\-bit chunks, +often called "bytes" but also known as "octets" in standards documents. +Perl is widely used to manipulate data of many types: not only strings of +characters representing human or computer languages, but also "binary" +data, being the machine's representation of numbers, pixels in an image, or +just about anything. +.PP +When Perl is processing "binary data", the programmer wants Perl to +process "sequences of bytes". This is not a problem for Perl: because a +byte has 256 possible values, it easily fits in Perl's much larger +"logical character". +.PP +This document mostly explains the \fIhow\fR. perlunitut and perlunifaq +explain the \fIwhy\fR. +.SS TERMINOLOGY +.IX Subsection "TERMINOLOGY" +\fIcharacter\fR +.IX Subsection "character" +.PP +A character in the range 0 .. 2**32\-1 (or more); +what Perl's strings are made of. +.PP +\fIbyte\fR +.IX Subsection "byte" +.PP +A character in the range 0..255; +a special case of a Perl character. +.PP +\fIoctet\fR +.IX Subsection "octet" +.PP +8 bits of data, with ordinal values 0..255; +term for bytes passed to or from a non-Perl context, such as a disk file, +standard I/O stream, database, command-line argument, environment variable, +socket etc. +.SH "THE PERL ENCODING API" +.IX Header "THE PERL ENCODING API" +.SS "Basic methods" +.IX Subsection "Basic methods" +\fIencode\fR +.IX Subsection "encode" +.PP +.Vb 1 +\& $octets = encode(ENCODING, STRING[, CHECK]) +.Ve +.PP +Encodes the scalar value \fISTRING\fR from Perl's internal form into +\&\fIENCODING\fR and returns a sequence of octets. \fIENCODING\fR can be either a +canonical name or an alias. For encoding names and aliases, see +"Defining Aliases". For CHECK, see "Handling Malformed Data". +.PP +\&\fBCAVEAT\fR: the input scalar \fISTRING\fR might be modified in-place depending +on what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be +left unchanged. 
+.PP +For example, to convert a string from Perl's internal format into +ISO\-8859\-1, also known as Latin1: +.PP +.Vb 1 +\& $octets = encode("iso\-8859\-1", $string); +.Ve +.PP +\&\fBCAVEAT\fR: When you run \f(CW\*(C`$octets = encode("UTF\-8", $string)\*(C'\fR, then +\&\f(CW$octets\fR \fImight not be equal to\fR \f(CW$string\fR. Though both contain the +same data, the UTF8 flag for \f(CW$octets\fR is \fIalways\fR off. When you +encode anything, the UTF8 flag on the result is always off, even when it +contains a completely valid UTF\-8 string. See "The UTF8 flag" below. +.PP +If the \f(CW$string\fR is \f(CW\*(C`undef\*(C'\fR, then \f(CW\*(C`undef\*(C'\fR is returned. +.PP +\&\f(CW\*(C`str2bytes\*(C'\fR may be used as an alias for \f(CW\*(C`encode\*(C'\fR. +.PP +\fIdecode\fR +.IX Subsection "decode" +.PP +.Vb 1 +\& $string = decode(ENCODING, OCTETS[, CHECK]) +.Ve +.PP +This function returns the string that results from decoding the scalar +value \fIOCTETS\fR, assumed to be a sequence of octets in \fIENCODING\fR, into +Perl's internal form. As with \fBencode()\fR, +\&\fIENCODING\fR can be either a canonical name or an alias. For encoding names +and aliases, see "Defining Aliases"; for \fICHECK\fR, see "Handling +Malformed Data". +.PP +\&\fBCAVEAT\fR: the input scalar \fIOCTETS\fR might be modified in-place depending +on what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be +left unchanged. +.PP +For example, to convert ISO\-8859\-1 data into a string in Perl's +internal format: +.PP +.Vb 1 +\& $string = decode("iso\-8859\-1", $octets); +.Ve +.PP +\&\fBCAVEAT\fR: When you run \f(CW\*(C`$string = decode("UTF\-8", $octets)\*(C'\fR, then \f(CW$string\fR +\&\fImight not be equal to\fR \f(CW$octets\fR. Though both contain the same data, the +UTF8 flag for \f(CW$string\fR is on. See "The UTF8 flag" +below. +.PP +If the \f(CW$octets\fR is \f(CW\*(C`undef\*(C'\fR, then \f(CW\*(C`undef\*(C'\fR is returned. 
+.PP +\&\f(CW\*(C`bytes2str\*(C'\fR may be used as an alias for \f(CW\*(C`decode\*(C'\fR. +.PP +\fIfind_encoding\fR +.IX Subsection "find_encoding" +.PP +.Vb 1 +\& [$obj =] find_encoding(ENCODING) +.Ve +.PP +Returns the \fIencoding object\fR corresponding to \fIENCODING\fR. Returns +\&\f(CW\*(C`undef\*(C'\fR if no matching \fIENCODING\fR is found. The returned object is +what does the actual encoding or decoding. +.PP +.Vb 1 +\& $string = decode($name, $bytes); +.Ve +.PP +is in fact +.PP +.Vb 5 +\& $string = do { +\& $obj = find_encoding($name); +\& croak qq(encoding "$name" not found) unless ref $obj; +\& $obj\->decode($bytes); +\& }; +.Ve +.PP +with more error checking. +.PP +You can therefore save time by reusing this object as follows: +.PP +.Vb 5 +\& my $enc = find_encoding("iso\-8859\-1"); +\& while(<>) { +\& my $string = $enc\->decode($_); +\& ... # now do something with $string; +\& } +.Ve +.PP +Besides "decode" and "encode", other methods are +available as well. For instance, \f(CWname()\fR returns the canonical +name of the encoding object. +.PP +.Vb 1 +\& find_encoding("latin1")\->name; # iso\-8859\-1 +.Ve +.PP +See Encode::Encoding for details. +.PP +\fIfind_mime_encoding\fR +.IX Subsection "find_mime_encoding" +.PP +.Vb 1 +\& [$obj =] find_mime_encoding(MIME_ENCODING) +.Ve +.PP +Returns the \fIencoding object\fR corresponding to \fIMIME_ENCODING\fR. Acts +like \f(CWfind_encoding()\fR, except that the \f(CWmime_name()\fR of the returned object must +match \fIMIME_ENCODING\fR. Unlike \f(CWfind_encoding()\fR, +canonical names and aliases are not used when searching for the object. 
+.PP +.Vb 4 +\& find_mime_encoding("utf8"); # returns undef because "utf8" is not a valid MIME_ENCODING +\& find_mime_encoding("utf\-8"); # returns encode object "utf\-8\-strict" +\& find_mime_encoding("UTF\-8"); # same as "utf\-8" because MIME_ENCODING is case insensitive +\& find_mime_encoding("utf\-8\-strict"); # returns undef because "utf\-8\-strict" is not a valid MIME_ENCODING +.Ve +.PP +\fIfrom_to\fR +.IX Subsection "from_to" +.PP +.Vb 1 +\& [$length =] from_to($octets, FROM_ENC, TO_ENC [, CHECK]) +.Ve +.PP +Converts data \fIin place\fR between two encodings. The data in \f(CW$octets\fR +must be encoded as octets and \fInot\fR as characters in Perl's internal +format. For example, to convert ISO\-8859\-1 data into Microsoft's CP1250 +encoding: +.PP +.Vb 1 +\& from_to($octets, "iso\-8859\-1", "cp1250"); +.Ve +.PP +and to convert it back: +.PP +.Vb 1 +\& from_to($octets, "cp1250", "iso\-8859\-1"); +.Ve +.PP +Because the conversion happens in place, the data to be +converted cannot be a string constant: it must be a scalar variable. +.PP +\&\f(CWfrom_to()\fR returns the length of the converted string in octets on success, +and \f(CW\*(C`undef\*(C'\fR on error. +.PP +\&\fBCAVEAT\fR: The following operations may look the same, but are not: +.PP +.Vb 2 +\& from_to($data, "iso\-8859\-1", "UTF\-8"); #1 +\& $data = decode("iso\-8859\-1", $data); #2 +.Ve +.PP +Both #1 and #2 make \f(CW$data\fR consist of a completely valid UTF\-8 string, +but only #2 turns the UTF8 flag on. #1 is equivalent to: +.PP +.Vb 1 +\& $data = encode("UTF\-8", decode("iso\-8859\-1", $data)); +.Ve +.PP +See "The UTF8 flag" below. +.PP +Also note that: +.PP +.Vb 1 +\& from_to($octets, $from, $to, $check); +.Ve +.PP +is equivalent to: +.PP +.Vb 1 +\& $octets = encode($to, decode($from, $octets), $check); +.Ve +.PP +Yes, it does \fInot\fR respect the \f(CW$check\fR during decoding. It is +deliberately done that way. 
If you need minute control, use \f(CW\*(C`decode\*(C'\fR +followed by \f(CW\*(C`encode\*(C'\fR as follows: +.PP +.Vb 1 +\& $octets = encode($to, decode($from, $octets, $check_from), $check_to); +.Ve +.PP +\fIencode_utf8\fR +.IX Subsection "encode_utf8" +.PP +.Vb 1 +\& $octets = encode_utf8($string); +.Ve +.PP +\&\fBWARNING\fR: This function can produce invalid UTF\-8! +Do not use it for data exchange. +Unless you want Perl's older "lax" mode, prefer +\&\f(CW\*(C`$octets = encode("UTF\-8", $string)\*(C'\fR. +.PP +Equivalent to \f(CW\*(C`$octets = encode("utf8", $string)\*(C'\fR. The characters in +\&\f(CW$string\fR are encoded in Perl's internal format, and the result is returned +as a sequence of octets. Because all possible characters in Perl have a +(loose, not strict) utf8 representation, this function cannot fail. +.PP +\fIdecode_utf8\fR +.IX Subsection "decode_utf8" +.PP +.Vb 1 +\& $string = decode_utf8($octets [, CHECK]); +.Ve +.PP +\&\fBWARNING\fR: This function accepts invalid UTF\-8! +Do not use it for data exchange. +Unless you want Perl's older "lax" mode, prefer +\&\f(CW\*(C`$string = decode("UTF\-8", $octets [, CHECK])\*(C'\fR. +.PP +Equivalent to \f(CW\*(C`$string = decode("utf8", $octets [, CHECK])\*(C'\fR. +The sequence of octets represented by \f(CW$octets\fR is decoded +from (loose, not strict) utf8 into a sequence of logical characters. +Because not all sequences of octets are valid not strict utf8, +it is quite possible for this function to fail. +For CHECK, see "Handling Malformed Data". +.PP +\&\fBCAVEAT\fR: the input \fR\f(CI$octets\fR\fI\fR might be modified in-place depending on +what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be +left unchanged. +.SS "Listing available encodings" +.IX Subsection "Listing available encodings" +.Vb 2 +\& use Encode; +\& @list = Encode\->encodings(); +.Ve +.PP +Returns a list of canonical names of available encodings that have already +been loaded. 
To get a list of all available encodings including those that +have not yet been loaded, say: +.PP +.Vb 1 +\& @all_encodings = Encode\->encodings(":all"); +.Ve +.PP +Or you can give the name of a specific module: +.PP +.Vb 1 +\& @with_jp = Encode\->encodings("Encode::JP"); +.Ve +.PP +When "\f(CW\*(C`::\*(C'\fR" is not in the name, "\f(CW\*(C`Encode::\*(C'\fR" is assumed. +.PP +.Vb 1 +\& @ebcdic = Encode\->encodings("EBCDIC"); +.Ve +.PP +To find out in detail which encodings are supported by this package, +see Encode::Supported. +.SS "Defining Aliases" +.IX Subsection "Defining Aliases" +To add a new alias to a given encoding, use: +.PP +.Vb 3 +\& use Encode; +\& use Encode::Alias; +\& define_alias(NEWNAME => ENCODING); +.Ve +.PP +After that, \fINEWNAME\fR can be used as an alias for \fIENCODING\fR. +\&\fIENCODING\fR may be either the name of an encoding or an +\&\fIencoding object\fR. +.PP +Before you do that, first make sure the alias is nonexistent using +\&\f(CWresolve_alias()\fR, which returns the canonical name thereof. +For example: +.PP +.Vb 3 +\& Encode::resolve_alias("latin1") eq "iso\-8859\-1" # true +\& Encode::resolve_alias("iso\-8859\-12") # false; nonexistent +\& Encode::resolve_alias($name) eq $name # true if $name is canonical +.Ve +.PP +\&\f(CWresolve_alias()\fR does not need \f(CW\*(C`use Encode::Alias\*(C'\fR; it can be +imported via \f(CW\*(C`use Encode qw(resolve_alias)\*(C'\fR. +.PP +See Encode::Alias for details. +.SS "Finding IANA Character Set Registry names" +.IX Subsection "Finding IANA Character Set Registry names" +The canonical name of a given encoding does not necessarily agree with +IANA Character Set Registry, commonly seen as \f(CW\*(C`Content\-Type: +text/plain; charset=\fR\f(CIWHATEVER\fR\f(CW\*(C'\fR. For most cases, the canonical name +works, but sometimes it does not, most notably with "utf\-8\-strict". +.PP +As of \f(CW\*(C`Encode\*(C'\fR version 2.21, a new method \f(CWmime_name()\fR is therefore added. 
+.PP +.Vb 4 +\& use Encode; +\& my $enc = find_encoding("UTF\-8"); +\& warn $enc\->name; # utf\-8\-strict +\& warn $enc\->mime_name; # UTF\-8 +.Ve +.PP +See also: Encode::Encoding +.SH "Encoding via PerlIO" +.IX Header "Encoding via PerlIO" +If your perl supports \f(CW\*(C`PerlIO\*(C'\fR (which is the default), you can use a +\&\f(CW\*(C`PerlIO\*(C'\fR layer to decode and encode directly via a filehandle. The +following two examples are functionally identical: +.PP +.Vb 10 +\& ### Version 1 via PerlIO +\& open(INPUT, "< :encoding(shiftjis)", $infile) +\& || die "Can\*(Aqt open < $infile for reading: $!"; +\& open(OUTPUT, "> :encoding(euc\-jp)", $outfile) +\& || die "Can\*(Aqt open > $outfile for writing: $!"; +\& while (<INPUT>) { # auto decodes $_ +\& print OUTPUT; # auto encodes $_ +\& } +\& close(INPUT) || die "can\*(Aqt close $infile: $!"; +\& close(OUTPUT) || die "can\*(Aqt close $outfile: $!"; +\& +\& ### Version 2 via from_to() +\& open(INPUT, "< :raw", $infile) +\& || die "Can\*(Aqt open < $infile for reading: $!"; +\& open(OUTPUT, "> :raw", $outfile) +\& || die "Can\*(Aqt open > $outfile for writing: $!"; +\& +\& while (<INPUT>) { +\& from_to($_, "shiftjis", "euc\-jp", 1); # switch encoding +\& print OUTPUT; # emit raw (but properly encoded) data +\& } +\& close(INPUT) || die "can\*(Aqt close $infile: $!"; +\& close(OUTPUT) || die "can\*(Aqt close $outfile: $!"; +.Ve +.PP +In the first version above, you let the appropriate encoding layer +handle the conversion. In the second, you explicitly translate +from one encoding to the other. +.PP +Unfortunately, some encodings may not be \f(CW\*(C`PerlIO\*(C'\fR\-savvy. 
You can check +to see whether your encoding is supported by \f(CW\*(C`PerlIO\*(C'\fR by invoking the +\&\f(CW\*(C`perlio_ok\*(C'\fR method on it: +.PP +.Vb 2 +\& Encode::perlio_ok("hz"); # false +\& find_encoding("euc\-cn")\->perlio_ok; # true wherever PerlIO is available +\& +\& use Encode qw(perlio_ok); # imported upon request +\& perlio_ok("euc\-jp") +.Ve +.PP +Fortunately, all encodings that come with \f(CW\*(C`Encode\*(C'\fR core are \f(CW\*(C`PerlIO\*(C'\fR\-savvy +except for \f(CW\*(C`hz\*(C'\fR and \f(CW\*(C`ISO\-2022\-kr\*(C'\fR. For the gory details, see +Encode::Encoding and Encode::PerlIO. +.SH "Handling Malformed Data" +.IX Header "Handling Malformed Data" +The optional \fICHECK\fR argument tells \f(CW\*(C`Encode\*(C'\fR what to do when +encountering malformed data. Without \fICHECK\fR, \f(CW\*(C`Encode::FB_DEFAULT\*(C'\fR +(== 0) is assumed. +.PP +As of version 2.12, \f(CW\*(C`Encode\*(C'\fR supports coderef values for \f(CW\*(C`CHECK\*(C'\fR; +see below. +.PP +\&\fBNOTE:\fR Not all encodings support this feature. +Some encodings ignore the \fICHECK\fR argument. For example, +Encode::Unicode ignores \fICHECK\fR and it always croaks on error. +.SS "List of \fICHECK\fP values" +.IX Subsection "List of CHECK values" +\fIFB_DEFAULT\fR +.IX Subsection "FB_DEFAULT" +.PP +.Vb 1 +\& I<CHECK> = Encode::FB_DEFAULT ( == 0) +.Ve +.PP +If \fICHECK\fR is 0, encoding and decoding replace any malformed character +with a \fIsubstitution character\fR. When you encode, \fISUBCHAR\fR is used. +When you decode, the Unicode REPLACEMENT CHARACTER, code point U+FFFD, is +used. If the data is supposed to be UTF\-8, an optional lexical warning of +warning category \f(CW"utf8"\fR is given. +.PP +\fIFB_CROAK\fR +.IX Subsection "FB_CROAK" +.PP +.Vb 1 +\& I<CHECK> = Encode::FB_CROAK ( == 1) +.Ve +.PP +If \fICHECK\fR is 1, methods immediately die with an error +message. 
Therefore, when \fICHECK\fR is 1, you should trap +exceptions with \f(CW\*(C`eval{}\*(C'\fR, unless you really want to let it \f(CW\*(C`die\*(C'\fR. +.PP +\fIFB_QUIET\fR +.IX Subsection "FB_QUIET" +.PP +.Vb 1 +\& I<CHECK> = Encode::FB_QUIET +.Ve +.PP +If \fICHECK\fR is set to \f(CW\*(C`Encode::FB_QUIET\*(C'\fR, encoding and decoding immediately +return the portion of the data that has been processed so far when an +error occurs. The data argument is overwritten with everything +after that point; that is, the unprocessed portion of the data. This is +handy when you have to call \f(CW\*(C`decode\*(C'\fR repeatedly in the case where your +source data may contain partial multi-byte character sequences +(that is, you are reading with a fixed-width buffer). Here's some sample +code to do exactly that: +.PP +.Vb 5 +\& my($buffer, $string) = ("", ""); +\& while (read($fh, $buffer, 256, length($buffer))) { +\& $string .= decode($encoding, $buffer, Encode::FB_QUIET); +\& # $buffer now contains the unprocessed partial character +\& } +.Ve +.PP +\fIFB_WARN\fR +.IX Subsection "FB_WARN" +.PP +.Vb 1 +\& I<CHECK> = Encode::FB_WARN +.Ve +.PP +This is the same as \f(CW\*(C`FB_QUIET\*(C'\fR above, except that instead of being silent +on errors, it issues a warning. This is handy for when you are debugging. +.PP +\&\fBCAVEAT\fR: All warnings from the Encode module are reported, independently of +the warnings pragma settings. If you want Encode to follow the +lexical warnings settings configured by the warnings pragma, also +include the check value \f(CW\*(C`Encode::ONLY_PRAGMA_WARNINGS\*(C'\fR in \fICHECK\fR. This value is available +since Encode version 2.99. 
+.PP +\fIFB_PERLQQ FB_HTMLCREF FB_XMLCREF\fR +.IX Subsection "FB_PERLQQ FB_HTMLCREF FB_XMLCREF" +.IP "perlqq mode (\fICHECK\fR = Encode::FB_PERLQQ)" 2 +.IX Item "perlqq mode (CHECK = Encode::FB_PERLQQ)" +.PD 0 +.IP "HTML charref mode (\fICHECK\fR = Encode::FB_HTMLCREF)" 2 +.IX Item "HTML charref mode (CHECK = Encode::FB_HTMLCREF)" +.IP "XML charref mode (\fICHECK\fR = Encode::FB_XMLCREF)" 2 +.IX Item "XML charref mode (CHECK = Encode::FB_XMLCREF)" +.PD +.PP +For encodings that are implemented by the \f(CW\*(C`Encode::XS\*(C'\fR module, \f(CW\*(C`CHECK\*(C'\fR \f(CW\*(C`==\*(C'\fR +\&\f(CW\*(C`Encode::FB_PERLQQ\*(C'\fR puts \f(CW\*(C`encode\*(C'\fR and \f(CW\*(C`decode\*(C'\fR into \f(CW\*(C`perlqq\*(C'\fR fallback mode. +.PP +When you decode, \f(CW\*(C`\ex\fR\f(CIHH\fR\f(CW\*(C'\fR is inserted for a malformed character, where +\&\fIHH\fR is the hex representation of the octet that could not be decoded to +utf8. When you encode, \f(CW\*(C`\ex{\fR\f(CIHHHH\fR\f(CW}\*(C'\fR will be inserted, where \fIHHHH\fR is +the Unicode code point (in any number of hex digits) of the character that +cannot be found in the character repertoire of the encoding. +.PP +The HTML/XML character reference modes are about the same. In place of +\&\f(CW\*(C`\ex{\fR\f(CIHHHH\fR\f(CW}\*(C'\fR, HTML uses \f(CW\*(C`&#\fR\f(CINNN\fR\f(CW;\*(C'\fR where \fINNN\fR is a decimal number, and +XML uses \f(CW\*(C`&#x\fR\f(CIHHHH\fR\f(CW;\*(C'\fR where \fIHHHH\fR is the hexadecimal number. +.PP +In \f(CW\*(C`Encode\*(C'\fR 2.10 or later, \f(CW\*(C`LEAVE_SRC\*(C'\fR is also implied. +.PP +\fIThe bitmask\fR +.IX Subsection "The bitmask" +.PP +These modes are all actually set via a bitmask. Here is how the \f(CW\*(C`FB_\fR\f(CIXXX\fR\f(CW\*(C'\fR +constants are laid out. You can import the \f(CW\*(C`FB_\fR\f(CIXXX\fR\f(CW\*(C'\fR constants via +\&\f(CW\*(C`use Encode qw(:fallbacks)\*(C'\fR, and you can import the generic bitmask +constants via \f(CW\*(C`use Encode qw(:fallback_all)\*(C'\fR. 
+.PP +.Vb 8 +\& FB_DEFAULT FB_CROAK FB_QUIET FB_WARN FB_PERLQQ +\& DIE_ON_ERR 0x0001 X +\& WARN_ON_ERR 0x0002 X +\& RETURN_ON_ERR 0x0004 X X +\& LEAVE_SRC 0x0008 X +\& PERLQQ 0x0100 X +\& HTMLCREF 0x0200 +\& XMLCREF 0x0400 +.Ve +.PP +\fILEAVE_SRC\fR +.IX Subsection "LEAVE_SRC" +.PP +.Vb 1 +\& Encode::LEAVE_SRC +.Ve +.PP +If the \f(CW\*(C`Encode::LEAVE_SRC\*(C'\fR bit is \fInot\fR set but \fICHECK\fR is set, then the +source string to \fBencode()\fR or \fBdecode()\fR will be overwritten in place. +If you're not interested in this, then bitwise-OR it with the bitmask. +.SS "coderef for CHECK" +.IX Subsection "coderef for CHECK" +As of \f(CW\*(C`Encode\*(C'\fR 2.12, \f(CW\*(C`CHECK\*(C'\fR can also be a code reference which takes the +ordinal value of the unmapped character as an argument and returns +octets that represent the fallback character. For instance: +.PP +.Vb 1 +\& $ascii = encode("ascii", $utf8, sub{ sprintf "<U+%04X>", shift }); +.Ve +.PP +Acts like \f(CW\*(C`FB_PERLQQ\*(C'\fR but U+\fIXXXX\fR is used instead of \f(CW\*(C`\ex{\fR\f(CIXXXX\fR\f(CW}\*(C'\fR. +.PP +Fallback for \f(CW\*(C`decode\*(C'\fR must return decoded string (sequence of characters) +and takes a list of ordinal values as its arguments. So for +example if you wish to decode octets as UTF\-8, and use ISO\-8859\-15 as +a fallback for bytes that are not valid UTF\-8, you could write +.PP +.Vb 4 +\& $str = decode \*(AqUTF\-8\*(Aq, $octets, sub { +\& my $tmp = join \*(Aq\*(Aq, map chr, @_; +\& return decode \*(AqISO\-8859\-15\*(Aq, $tmp; +\& }; +.Ve +.SH "Defining Encodings" +.IX Header "Defining Encodings" +To define a new encoding, use: +.PP +.Vb 2 +\& use Encode qw(define_encoding); +\& define_encoding($object, CANONICAL_NAME [, alias...]); +.Ve +.PP +\&\fICANONICAL_NAME\fR will be associated with \fR\f(CI$object\fR\fI\fR. The object +should provide the interface described in Encode::Encoding. 
+If more than two arguments are provided, additional +arguments are considered aliases for \fI\fR\f(CI$object\fR\fI\fR. +.PP +See Encode::Encoding for details. +.SH "The UTF8 flag" +.IX Header "The UTF8 flag" +Before the introduction of Unicode support in Perl, The \f(CW\*(C`eq\*(C'\fR operator +just compared the strings represented by two scalars. Beginning with +Perl 5.8, \f(CW\*(C`eq\*(C'\fR compares two strings with simultaneous consideration of +\&\fIthe UTF8 flag\fR. To explain why we made it so, I quote from page 402 of +\&\fIProgramming Perl, 3rd ed.\fR +.IP "Goal #1:" 2 +.IX Item "Goal #1:" +Old byte-oriented programs should not spontaneously break on the old +byte-oriented data they used to work on. +.IP "Goal #2:" 2 +.IX Item "Goal #2:" +Old byte-oriented programs should magically start working on the new +character-oriented data when appropriate. +.IP "Goal #3:" 2 +.IX Item "Goal #3:" +Programs should run just as fast in the new character-oriented mode +as in the old byte-oriented mode. +.IP "Goal #4:" 2 +.IX Item "Goal #4:" +Perl should remain one language, rather than forking into a +byte-oriented Perl and a character-oriented Perl. +.PP +When \fIProgramming Perl, 3rd ed.\fR was written, not even Perl 5.6.0 had been +born yet, many features documented in the book remained unimplemented for a +long time. Perl 5.8 corrected much of this, and the introduction of the +UTF8 flag is one of them. You can think of there being two fundamentally +different kinds of strings and string-operations in Perl: one a +byte-oriented mode for when the internal UTF8 flag is off, and the other a +character-oriented mode for when the internal UTF8 flag is on. +.PP +This UTF8 flag is not visible in Perl scripts, exactly for the same reason +you cannot (or rather, you \fIdon't have to\fR) see whether a scalar contains +a string, an integer, or a floating-point number. But you can still peek +and poke these if you will. See the next section. 
+.SS "Messing with Perl's Internals" +.IX Subsection "Messing with Perl's Internals" +The following API uses parts of Perl's internals in the current +implementation. As such, they are efficient but may change in a future +release. +.PP +\fIis_utf8\fR +.IX Subsection "is_utf8" +.PP +.Vb 1 +\& is_utf8(STRING [, CHECK]) +.Ve +.PP +[INTERNAL] Tests whether the UTF8 flag is turned on in the \fISTRING\fR. +If \fICHECK\fR is true, also checks whether \fISTRING\fR contains well-formed +UTF\-8. Returns true if successful, false otherwise. +.PP +Typically only necessary for debugging and testing. Don't use this flag as +a marker to distinguish character and binary data, that should be decided +for each variable when you write your code. +.PP +\&\fBCAVEAT\fR: If \fISTRING\fR has UTF8 flag set, it does \fBNOT\fR mean that +\&\fISTRING\fR is UTF\-8 encoded and vice-versa. +.PP +As of Perl 5.8.1, utf8 also has the \f(CW\*(C`utf8::is_utf8\*(C'\fR function. +.PP +\fI_utf8_on\fR +.IX Subsection "_utf8_on" +.PP +.Vb 1 +\& _utf8_on(STRING) +.Ve +.PP +[INTERNAL] Turns the \fISTRING\fR's internal UTF8 flag \fBon\fR. The \fISTRING\fR +is \fInot\fR checked for containing only well-formed UTF\-8. Do not use this +unless you \fIknow with absolute certainty\fR that the STRING holds only +well-formed UTF\-8. Returns the previous state of the UTF8 flag (so please +don't treat the return value as indicating success or failure), or \f(CW\*(C`undef\*(C'\fR +if \fISTRING\fR is not a string. +.PP +\&\fBNOTE\fR: For security reasons, this function does not work on tainted values. +.PP +\fI_utf8_off\fR +.IX Subsection "_utf8_off" +.PP +.Vb 1 +\& _utf8_off(STRING) +.Ve +.PP +[INTERNAL] Turns the \fISTRING\fR's internal UTF8 flag \fBoff\fR. Do not use +frivolously. Returns the previous state of the UTF8 flag, or \f(CW\*(C`undef\*(C'\fR if +\&\fISTRING\fR is not a string. 
Do not treat the return value as indicative of +success or failure, because that isn't what it means: it is only the +previous setting. +.PP +\&\fBNOTE\fR: For security reasons, this function does not work on tainted values. +.SH "UTF\-8 vs. utf8 vs. UTF8" +.IX Header "UTF-8 vs. utf8 vs. UTF8" +.Vb 3 +\& ....We now view strings not as sequences of bytes, but as sequences +\& of numbers in the range 0 .. 2**32\-1 (or in the case of 64\-bit +\& computers, 0 .. 2**64\-1) \-\- Programming Perl, 3rd ed. +.Ve +.PP +That has historically been Perl's notion of UTF\-8, as that is how UTF\-8 was +first conceived by Ken Thompson when he invented it. However, thanks to +later revisions to the applicable standards, official UTF\-8 is now rather +stricter than that. For example, its range is much narrower (0 .. 0x10_FFFF +to cover only 21 bits instead of 32 or 64 bits) and some sequences +are not allowed, like those used in surrogate pairs, the 31 non-character +code points 0xFDD0 .. 0xFDEF, the last two code points in \fIany\fR plane +(0x\fIXX\fR_FFFE and 0x\fIXX\fR_FFFF), all non-shortest encodings, etc. +.PP +The former default in which Perl would always use a loose interpretation of +UTF\-8 has now been overruled: +.PP +.Vb 5 +\& From: Larry Wall <larry@wall.org> +\& Date: December 04, 2004 11:51:58 JST +\& To: perl\-unicode@perl.org +\& Subject: Re: Make Encode.pm support the real UTF\-8 +\& Message\-Id: <20041204025158.GA28754@wall.org> +\& +\& On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote: +\& : I\*(Aqve no problem with \*(Aqutf8\*(Aq being perl\*(Aqs unrestricted uft8 encoding, +\& : but "UTF\-8" is the name of the standard and should give the +\& : corresponding behaviour. +\& +\& For what it\*(Aqs worth, that\*(Aqs how I\*(Aqve always kept them straight in my +\& head. +\& +\& Also for what it\*(Aqs worth, Perl 6 will mostly default to strict but +\& make it easy to switch back to lax. +\& +\& Larry +.Ve +.PP +Got that? 
As of Perl 5.8.7, \fB"UTF\-8"\fR means UTF\-8 in its current +sense, which is conservative and strict and security-conscious, whereas +\&\fB"utf8"\fR means UTF\-8 in its former sense, which was liberal and loose and +lax. \f(CW\*(C`Encode\*(C'\fR version 2.10 or later thus groks this subtle but critically +important distinction between \f(CW"UTF\-8"\fR and \f(CW"utf8"\fR. +.PP +.Vb 2 +\& encode("utf8", "\ex{FFFF_FFFF}", 1); # okay +\& encode("UTF\-8", "\ex{FFFF_FFFF}", 1); # croaks +.Ve +.PP +This distinction is also important for decoding. In the following, +\&\f(CW$s\fR stores character U+200000, which exceeds UTF\-8's allowed range. +\&\f(CW$s\fR thus stores an invalid Unicode code point: +.PP +.Vb 1 +\& $s = decode("utf8", "\exf8\ex88\ex80\ex80\ex80"); +.Ve +.PP +\&\f(CW"UTF\-8"\fR, by contrast, will either coerce the input to something valid: +.PP +.Vb 1 +\& $s = decode("UTF\-8", "\exf8\ex88\ex80\ex80\ex80"); # U+FFFD +.Ve +.PP +\&.. or croak: +.PP +.Vb 1 +\& decode("UTF\-8", "\exf8\ex88\ex80\ex80\ex80", FB_CROAK|LEAVE_SRC); +.Ve +.PP +In the \f(CW\*(C`Encode\*(C'\fR module, \f(CW"UTF\-8"\fR is actually a canonical name for +\&\f(CW"utf\-8\-strict"\fR. That hyphen between the \f(CW"UTF"\fR and the \f(CW"8"\fR is +critical; without it, \f(CW\*(C`Encode\*(C'\fR goes "liberal" and (perhaps overly\-)permissive: +.PP +.Vb 4 +\& find_encoding("UTF\-8")\->name # is \*(Aqutf\-8\-strict\*(Aq +\& find_encoding("utf\-8")\->name # ditto. names are case insensitive +\& find_encoding("utf_8")\->name # ditto. "_" are treated as "\-" +\& find_encoding("UTF8")\->name # is \*(Aqutf8\*(Aq. +.Ve +.PP +Perl's internal UTF8 flag is called "UTF8", without a hyphen. It indicates +whether a string is internally encoded as "utf8", also without a hyphen. 
+.SH "SEE ALSO" +.IX Header "SEE ALSO" +Encode::Encoding, +Encode::Supported, +Encode::PerlIO, +encoding, +perlebcdic, +"open" in perlfunc, +perlunicode, perluniintro, perlunifaq, perlunitut +utf8, +the Perl Unicode Mailing List <http://lists.perl.org/list/perl\-unicode.html> +.SH MAINTAINER +.IX Header "MAINTAINER" +This project was originated by the late Nick Ing-Simmons and later +maintained by Dan Kogai \fI<dankogai@cpan.org>\fR. See AUTHORS +for a full list of people involved. For any questions, send mail to +\&\fI<perl\-unicode@perl.org>\fR so that we can all share. +.PP +While Dan Kogai retains the copyright as a maintainer, credit +should go to all those involved. See AUTHORS for a list of those +who submitted code to the project. +.SH COPYRIGHT +.IX Header "COPYRIGHT" +Copyright 2002\-2014 Dan Kogai \fI<dankogai@cpan.org>\fR. +.PP +This library is free software; you can redistribute it and/or modify +it under the same terms as Perl itself. |
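A minimal sketch, outside the patch itself, of the strict round trip that the SYNOPSIS and the "Handling Malformed Data" and "UTF-8 vs. utf8 vs. UTF8" sections describe. The names $characters and $octets follow the SYNOPSIS; $round_trip, $bad, $result and $croaked are illustrative names that do not appear in the man page. It assumes a stock Encode installation and is not meant as the module's canonical usage.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);

# A character string with one non-ASCII character (U+00E9).
my $characters = "caf\x{e9}";

# FB_CROAK makes encode() die on unmappable input; LEAVE_SRC keeps the
# source scalar from being overwritten in place when CHECK is set.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Encode to strict UTF-8 octets, then decode them back to characters.
my $octets     = encode('UTF-8', $characters, $check);
my $round_trip = eval { decode('UTF-8', $octets, $check) };
die "decode failed: $@" unless defined $round_trip;

# 0xFF can never occur in UTF-8, so strict decoding with FB_CROAK dies;
# eval{} traps the exception, as the FB_CROAK section recommends.
my $bad     = "\xff";
my $result  = eval { decode('UTF-8', $bad, $check) };
my $croaked = !defined $result;

printf "%d characters, %d octets\n", length($characters), length($octets);
print "UTF-8 resolves to: ", find_encoding('UTF-8')->name, "\n";
print "UTF8 resolves to:  ", find_encoding('UTF8')->name, "\n";
```

Passing the hyphenated name "UTF-8" selects the strict encoding; the unhyphenated "utf8" would select Perl's lax variant, which the man page advises against for data exchange.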