Diffstat (limited to 'upstream/mageia-cauldron/man1/perlunifaq.1')
-rw-r--r-- | upstream/mageia-cauldron/man1/perlunifaq.1 | 389 |
1 files changed, 389 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man1/perlunifaq.1 b/upstream/mageia-cauldron/man1/perlunifaq.1 new file mode 100644 index 00000000..35be7580 --- /dev/null +++ b/upstream/mageia-cauldron/man1/perlunifaq.1 @@ -0,0 +1,389 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLUNIFAQ 1" +.TH PERLUNIFAQ 1 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. +.if n .ad l +.nh +.SH NAME +perlunifaq \- Perl Unicode FAQ +.SH "Q and A" +.IX Header "Q and A" +This is a list of questions and answers about Unicode in Perl, intended to be +read after perlunitut. 
+.SS "perlunitut isn't really a Unicode tutorial, is it?" +.IX Subsection "perlunitut isn't really a Unicode tutorial, is it?" +No, and this isn't really a Unicode FAQ. +.PP +Perl has an abstracted interface for all supported character encodings, so this +is actually a generic \f(CW\*(C`Encode\*(C'\fR tutorial and \f(CW\*(C`Encode\*(C'\fR FAQ. But many people +think that Unicode is special and magical, and I didn't want to disappoint +them, so I decided to call the document a Unicode tutorial. +.SS "What character encodings does Perl support?" +.IX Subsection "What character encodings does Perl support?" +To find out which character encodings your Perl supports, run: +.PP +.Vb 1 +\& perl \-MEncode \-le "print for Encode\->encodings(\*(Aq:all\*(Aq)" +.Ve +.SS "Which version of perl should I use?" +.IX Subsection "Which version of perl should I use?" +Well, if you can, upgrade to the most recent, but certainly \f(CW5.8.1\fR or newer. +The tutorial and FAQ assume the latest release. +.PP +You should also check your modules, and upgrade them if necessary. For example, +HTML::Entities requires version >= 1.32 to function correctly, even though the +changelog is silent about this. +.SS "What about binary data, like images?" +.IX Subsection "What about binary data, like images?" +Well, apart from a bare \f(CW\*(C`binmode $fh\*(C'\fR, you shouldn't treat them specially. +(The binmode is needed because otherwise Perl may convert line endings on Win32 +systems.) +.PP +Be careful, though, to never combine text strings with binary strings. If you +need text in a binary stream, encode your text strings first using the +appropriate encoding, then join them with binary strings. See also: "What if I +don't encode?". +.SS "When should I decode or encode?" +.IX Subsection "When should I decode or encode?" +Whenever you're communicating text with anything that is external to your perl +process, like a database, a text file, a socket, or another program. 
Even if +the thing you're communicating with is also written in Perl. +.SS "What if I don't decode?" +.IX Subsection "What if I don't decode?" +Whenever your encoded, binary string is used together with a text string, Perl +will assume that your binary string was encoded with ISO\-8859\-1, also known as +latin\-1. If it wasn't latin\-1, then your data is unpleasantly converted. For +example, if it was UTF\-8, the individual bytes of multibyte characters are seen +as separate characters, and then again converted to UTF\-8. Such double encoding +can be compared to double HTML encoding (\f(CW\*(C`&amp;gt;\*(C'\fR), or double URI encoding +(\f(CW%253E\fR). +.PP +This silent implicit decoding is known as "upgrading". That may sound +positive, but it's best to avoid it. +.SS "What if I don't encode?" +.IX Subsection "What if I don't encode?" +It depends on what you output and how you output it. +.PP +\fIOutput via a filehandle\fR +.IX Subsection "Output via a filehandle" +.IP \(bu 4 +If the string's characters are all code point 255 or lower, Perl +outputs bytes that match those code points. This is what happens with encoded +strings. It can also, though, happen with unencoded strings that happen to be +all code point 255 or lower. +.IP \(bu 4 +Otherwise, Perl outputs the string encoded as UTF\-8. This only happens +with strings you neglected to encode. Since that should not happen, Perl also +throws a "wide character" warning in this case. +.PP +\fIOther output mechanisms (e.g., \fR\f(CI\*(C`exec\*(C'\fR\fI, \fR\f(CI\*(C`chdir\*(C'\fR\fI, ..)\fR +.IX Subsection "Other output mechanisms (e.g., exec, chdir, ..)" +.PP +Your text string will be sent using the bytes in Perl's internal format. +.PP +Because the internal format is often UTF\-8, these bugs are hard to spot, +because UTF\-8 is usually the encoding you wanted! But don't be lazy, and don't +use the fact that Perl's internal format is UTF\-8 to your advantage. 
Encode +explicitly to avoid weird bugs, and to show to maintenance programmers that you +thought this through. +.SS "Is there a way to automatically decode or encode?" +.IX Subsection "Is there a way to automatically decode or encode?" +If all data that comes from a certain handle is encoded in exactly the same +way, you can tell the PerlIO system to automatically decode everything, with +the \f(CW\*(C`encoding\*(C'\fR layer. If you do this, you can't accidentally forget to decode +or encode anymore, on things that use the layered handle. +.PP +You can provide this layer when \f(CW\*(C`open\*(C'\fRing the file: +.PP +.Vb 2 +\& open my $fh, \*(Aq>:encoding(UTF\-8)\*(Aq, $filename; # auto encoding on write +\& open my $fh, \*(Aq<:encoding(UTF\-8)\*(Aq, $filename; # auto decoding on read +.Ve +.PP +Or if you already have an open filehandle: +.PP +.Vb 1 +\& binmode $fh, \*(Aq:encoding(UTF\-8)\*(Aq; +.Ve +.PP +Some database drivers for DBI can also automatically encode and decode, but +that is sometimes limited to the UTF\-8 encoding. +.SS "What if I don't know which encoding was used?" +.IX Subsection "What if I don't know which encoding was used?" +Do whatever you can to find out, and if you have to: guess. (Don't forget to +document your guess with a comment.) +.PP +You could open the document in a web browser, and change the character set or +character encoding until you can visually confirm that all characters look the +way they should. +.PP +There is no way to reliably detect the encoding automatically, so if people +keep sending you data without charset indication, you may have to educate them. +.SS "Can I use Unicode in my Perl sources?" +.IX Subsection "Can I use Unicode in my Perl sources?" +Yes, you can! If your sources are UTF\-8 encoded, you can indicate that with the +\&\f(CW\*(C`use utf8\*(C'\fR pragma. +.PP +.Vb 1 +\& use utf8; +.Ve +.PP +This doesn't do anything to your input, or to your output. It only influences +the way your sources are read. 
You can use Unicode in string literals, in +identifiers (but they still have to be "word characters" according to \f(CW\*(C`\ew\*(C'\fR), +and even in custom delimiters. +.SS "Data::Dumper doesn't restore the UTF8 flag; is it broken?" +.IX Subsection "Data::Dumper doesn't restore the UTF8 flag; is it broken?" +No, Data::Dumper's Unicode abilities are as they should be. There have been +some complaints that it should restore the UTF8 flag when the data is read +again with \f(CW\*(C`eval\*(C'\fR. However, you should really not look at the flag, and +nothing indicates that Data::Dumper should break this rule. +.PP +Here's what happens: when Perl reads in a string literal, it sticks to 8 bit +encoding as long as it can. (But perhaps originally it was internally encoded +as UTF\-8, when you dumped it.) When it has to give that up because other +characters are added to the text string, it silently upgrades the string to +UTF\-8. +.PP +If you properly encode your strings for output, none of this is of your +concern, and you can just \f(CW\*(C`eval\*(C'\fR dumped data as always. +.SS "Why do regex character classes sometimes match only in the ASCII range?" +.IX Subsection "Why do regex character classes sometimes match only in the ASCII range?" +Starting in Perl 5.14 (and partially in Perl 5.12), just put a +\&\f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR near the beginning of your program. +Within its lexical scope you shouldn't have this problem. It also is +automatically enabled under \f(CW\*(C`use feature \*(Aq:5.12\*(Aq\*(C'\fR or \f(CW\*(C`use v5.12\*(C'\fR or +using \f(CW\*(C`\-E\*(C'\fR on the command line for Perl 5.12 or higher. +.PP +The rationale for requiring this is to not break older programs that +rely on the way things worked before Unicode came along. Those older +programs knew only about the ASCII character set, and so may not work +properly for additional characters. 
When a string is encoded in UTF\-8, +Perl assumes that the program is prepared to deal with Unicode, but when +the string isn't, Perl assumes that only ASCII +is wanted, and so those characters that are not ASCII +characters aren't recognized as to what they would be in Unicode. +\&\f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR tells Perl to treat all characters as +Unicode, whether the string is encoded in UTF\-8 or not, thus avoiding +the problem. +.PP +However, on earlier Perls, or if you pass strings to subroutines outside +the feature's scope, you can force Unicode rules by changing the +encoding to UTF\-8 by doing \f(CWutf8::upgrade($string)\fR. This can be used +safely on any string, as it checks and does not change strings that have +already been upgraded. +.PP +For a more detailed discussion, see Unicode::Semantics on CPAN. +.SS "Why do some characters not uppercase or lowercase correctly?" +.IX Subsection "Why do some characters not uppercase or lowercase correctly?" +See the answer to the previous question. +.SS "How can I determine if a string is a text string or a binary string?" +.IX Subsection "How can I determine if a string is a text string or a binary string?" +You can't. Some use the UTF8 flag for this, but that's misuse, and makes well +behaved modules like Data::Dumper look bad. The flag is useless for this +purpose, because it's off when an 8 bit encoding (by default ISO\-8859\-1) is +used to store the string. +.PP +This is something you, the programmer, has to keep track of; sorry. You could +consider adopting a kind of "Hungarian notation" to help with this. +.SS "How do I convert from encoding FOO to encoding BAR?" +.IX Subsection "How do I convert from encoding FOO to encoding BAR?" 
+By first converting the FOO-encoded byte string to a text string, and then the +text string to a BAR-encoded byte string: +.PP +.Vb 2 +\& my $text_string = decode(\*(AqFOO\*(Aq, $foo_string); +\& my $bar_string = encode(\*(AqBAR\*(Aq, $text_string); +.Ve +.PP +or by skipping the text string part, and going directly from one binary +encoding to the other: +.PP +.Vb 2 +\& use Encode qw(from_to); +\& from_to($string, \*(AqFOO\*(Aq, \*(AqBAR\*(Aq); # changes contents of $string +.Ve +.PP +or by letting automatic decoding and encoding do all the work: +.PP +.Vb 3 +\& open my $foofh, \*(Aq<:encoding(FOO)\*(Aq, \*(Aqexample.foo.txt\*(Aq; +\& open my $barfh, \*(Aq>:encoding(BAR)\*(Aq, \*(Aqexample.bar.txt\*(Aq; +\& print { $barfh } $_ while <$foofh>; +.Ve +.ie n .SS "What are ""decode_utf8"" and ""encode_utf8""?" +.el .SS "What are \f(CWdecode_utf8\fP and \f(CWencode_utf8\fP?" +.IX Subsection "What are decode_utf8 and encode_utf8?" +These are alternate syntaxes for \f(CW\*(C`decode(\*(Aqutf8\*(Aq, ...)\*(C'\fR and \f(CW\*(C`encode(\*(Aqutf8\*(Aq, +\&...)\*(C'\fR. Do not use these functions for data exchange. Instead use +\&\f(CW\*(C`decode(\*(AqUTF\-8\*(Aq, ...)\*(C'\fR and \f(CW\*(C`encode(\*(AqUTF\-8\*(Aq, ...)\*(C'\fR; see +"What's the difference between UTF\-8 and utf8?" below. +.SS "What is a ""wide character""?" +.IX Subsection "What is a ""wide character""?" +This is a term used for characters occupying more than one byte. +.PP +The Perl warning "Wide character in ..." is caused by such a character. +With no specified encoding layer, Perl tries to +fit things into a single byte. When it can't, it +emits this warning (if warnings are enabled), and uses UTF\-8 encoded data +instead. 
+.PP +To avoid this warning and to avoid having different output encodings in a single +stream, always specify an encoding explicitly, for example with a PerlIO layer: +.PP +.Vb 1 +\& binmode STDOUT, ":encoding(UTF\-8)"; +.Ve +.SH INTERNALS +.IX Header "INTERNALS" +.SS "What is ""the UTF8 flag""?" +.IX Subsection "What is ""the UTF8 flag""?" +Please, unless you're hacking the internals, or debugging weirdness, don't +think about the UTF8 flag at all. That means that you very probably shouldn't +use \f(CW\*(C`is_utf8\*(C'\fR, \f(CW\*(C`_utf8_on\*(C'\fR or \f(CW\*(C`_utf8_off\*(C'\fR at all. +.PP +The UTF8 flag, also called SvUTF8, is an internal flag that indicates that the +current internal representation is UTF\-8. Without the flag, it is assumed to be +ISO\-8859\-1. Perl converts between these automatically. (Actually Perl usually +assumes the representation is ASCII; see "Why do regex character classes +sometimes match only in the ASCII range?" above.) +.PP +One of Perl's internal formats happens to be UTF\-8. Unfortunately, Perl can't +keep a secret, so everyone knows about this. That is the source of much +confusion. It's better to pretend that the internal format is some unknown +encoding, and that you always have to encode and decode explicitly. +.ie n .SS "What about the ""use bytes"" pragma?" +.el .SS "What about the \f(CWuse bytes\fP pragma?" +.IX Subsection "What about the use bytes pragma?" +Don't use it. It makes no sense to deal with bytes in a text string, and it +makes no sense to deal with characters in a byte string. Do the proper +conversions (by decoding/encoding), and things will work out well: you get +character counts for decoded data, and byte counts for encoded data. +.PP +\&\f(CW\*(C`use bytes\*(C'\fR is usually a failed attempt to do something useful. Just forget +about it. +.ie n .SS "What about the ""use encoding"" pragma?" +.el .SS "What about the \f(CWuse encoding\fP pragma?" +.IX Subsection "What about the use encoding pragma?" 
+Don't use it. Unfortunately, it assumes that the programmer's environment and +that of the user will use the same encoding. It will use the same encoding for +the source code and for STDIN and STDOUT. When a program is copied to another +machine, the source code does not change, but the STDIO environment might. +.PP +If you need non-ASCII characters in your source code, make it a UTF\-8 encoded +file and \f(CW\*(C`use utf8\*(C'\fR. +.PP +If you need to set the encoding for STDIN, STDOUT, and STDERR, for example +based on the user's locale, \f(CW\*(C`use open\*(C'\fR. +.ie n .SS "What is the difference between "":encoding"" and "":utf8""?" +.el .SS "What is the difference between \f(CW:encoding\fP and \f(CW:utf8\fP?" +.IX Subsection "What is the difference between :encoding and :utf8?" +Because UTF\-8 is one of Perl's internal formats, you can often just skip the +encoding or decoding step, and manipulate the UTF8 flag directly. +.PP +Instead of \f(CW:encoding(UTF\-8)\fR, you can simply use \f(CW\*(C`:utf8\*(C'\fR, which skips the +encoding step if the data was already represented as UTF8 internally. This is +widely accepted as good behavior when you're writing, but it can be dangerous +when reading, because it causes internal inconsistency when you have invalid +byte sequences. Using \f(CW\*(C`:utf8\*(C'\fR for input can sometimes result in security +breaches, so please use \f(CW:encoding(UTF\-8)\fR instead. +.PP +Instead of \f(CW\*(C`decode\*(C'\fR and \f(CW\*(C`encode\*(C'\fR, you could use \f(CW\*(C`_utf8_on\*(C'\fR and \f(CW\*(C`_utf8_off\*(C'\fR, +but this is considered bad style. Especially \f(CW\*(C`_utf8_on\*(C'\fR can be dangerous, for +the same reason that \f(CW\*(C`:utf8\*(C'\fR can. +.PP +There are some shortcuts for oneliners; +see \-C in perlrun. +.ie n .SS "What's the difference between ""UTF\-8"" and ""utf8""?" +.el .SS "What's the difference between \f(CWUTF\-8\fP and \f(CWutf8\fP?" +.IX Subsection "What's the difference between UTF-8 and utf8?" 
+\&\f(CW\*(C`UTF\-8\*(C'\fR is the official standard. \f(CW\*(C`utf8\*(C'\fR is Perl's way of being liberal in +what it accepts. If you have to communicate with things that aren't so liberal, +you may want to consider using \f(CW\*(C`UTF\-8\*(C'\fR. If you have to communicate with things +that are too liberal, you may have to use \f(CW\*(C`utf8\*(C'\fR. The full explanation is in +"UTF\-8 vs. utf8 vs. UTF8" in Encode. +.PP +\&\f(CW\*(C`UTF\-8\*(C'\fR is internally known as \f(CW\*(C`utf\-8\-strict\*(C'\fR. The tutorial uses UTF\-8 +consistently, even where utf8 is actually used internally, because the +distinction can be hard to make, and is mostly irrelevant. +.PP +For example, utf8 can be used for code points that don't exist in Unicode, like +9999999, but if you encode that to UTF\-8, you get a substitution character (by +default; see "Handling Malformed Data" in Encode for more ways of dealing with +this.) +.PP +Okay, if you insist: the "internal format" is utf8, not UTF\-8. (When it's not +some other encoding.) +.SS "I lost track; what encoding is the internal format really?" +.IX Subsection "I lost track; what encoding is the internal format really?" +It's good that you lost track, because you shouldn't depend on the internal +format being any specific encoding. But since you asked: by default, the +internal format is either ISO\-8859\-1 (latin\-1), or utf8, depending on the +history of the string. On EBCDIC platforms, this may be different even. +.PP +Perl knows how it stored the string internally, and will use that knowledge +when you \f(CW\*(C`encode\*(C'\fR. In other words: don't try to find out what the internal +encoding for a certain string is, but instead just encode it into the encoding +that you want. +.SH AUTHOR +.IX Header "AUTHOR" +Juerd Waalboer <#####@juerd.nl> +.SH "SEE ALSO" +.IX Header "SEE ALSO" +perlunicode, perluniintro, Encode