Diffstat (limited to 'upstream/mageia-cauldron/man1/perlunifaq.1')
-rw-r--r-- upstream/mageia-cauldron/man1/perlunifaq.1 389
1 files changed, 389 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man1/perlunifaq.1 b/upstream/mageia-cauldron/man1/perlunifaq.1
new file mode 100644
index 00000000..35be7580
--- /dev/null
+++ b/upstream/mageia-cauldron/man1/perlunifaq.1
@@ -0,0 +1,389 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+. ds C` ""
+. ds C' ""
+'br\}
+.el\{\
+. ds C`
+. ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD. Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+. if \nF \{\
+. de IX
+. tm Index:\\$1\t\\n%\t"\\$2"
+..
+. if !\nF==2 \{\
+. nr % 0
+. nr F 2
+. \}
+. \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "PERLUNIFAQ 1"
+.TH PERLUNIFAQ 1 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification. Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+perlunifaq \- Perl Unicode FAQ
+.SH "Q and A"
+.IX Header "Q and A"
+This is a list of questions and answers about Unicode in Perl, intended to be
+read after perlunitut.
+.SS "perlunitut isn't really a Unicode tutorial, is it?"
+.IX Subsection "perlunitut isn't really a Unicode tutorial, is it?"
+No, and this isn't really a Unicode FAQ.
+.PP
+Perl has an abstracted interface for all supported character encodings, so this
+is actually a generic \f(CW\*(C`Encode\*(C'\fR tutorial and \f(CW\*(C`Encode\*(C'\fR FAQ. But many people
+think that Unicode is special and magical, and I didn't want to disappoint
+them, so I decided to call the document a Unicode tutorial.
+.SS "What character encodings does Perl support?"
+.IX Subsection "What character encodings does Perl support?"
+To find out which character encodings your Perl supports, run:
+.PP
+.Vb 1
+\& perl \-MEncode \-le "print for Encode\->encodings(\*(Aq:all\*(Aq)"
+.Ve
+.SS "Which version of perl should I use?"
+.IX Subsection "Which version of perl should I use?"
+Well, if you can, upgrade to the most recent, but certainly \f(CW5.8.1\fR or newer.
+The tutorial and FAQ assume the latest release.
+.PP
+You should also check your modules, and upgrade them if necessary. For example,
+HTML::Entities requires version >= 1.32 to function correctly, even though the
+changelog is silent about this.
+.SS "What about binary data, like images?"
+.IX Subsection "What about binary data, like images?"
+Well, apart from a bare \f(CW\*(C`binmode $fh\*(C'\fR, you shouldn't treat them specially.
+(The binmode is needed because otherwise Perl may convert line endings on Win32
+systems.)
+.PP
+Be careful, though, to never combine text strings with binary strings. If you
+need text in a binary stream, encode your text strings first using the
+appropriate encoding, then join them with binary strings. See also: "What if I
+don't encode?".
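+.PP
+As a minimal sketch of that rule (the variable names and the two-byte header
+are made up for illustration), encode the text first and only then join it
+with the binary part:
+.PP
+.Vb 6
+\& use Encode qw(encode);
+\&
+\& my $header = pack("n", 0x0102);           # binary string: two raw bytes
+\& my $label  = "caf\ex{e9}";                 # text string: four characters
+\& my $packet = $header . encode(\*(AqUTF\-8\*(Aq, $label);
+\& print length($packet), "\en";             # 7: 2 header bytes + 5 UTF\-8 bytes
+.Ve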
+.SS "When should I decode or encode?"
+.IX Subsection "When should I decode or encode?"
+Whenever you're communicating text with anything that is external to your perl
+process, like a database, a text file, a socket, or another program. Even if
+the thing you're communicating with is also written in Perl.
+.SS "What if I don't decode?"
+.IX Subsection "What if I don't decode?"
+Whenever your encoded, binary string is used together with a text string, Perl
+will assume that your binary string was encoded with ISO\-8859\-1, also known as
+latin\-1. If it wasn't latin\-1, then your data is unpleasantly converted. For
+example, if it was UTF\-8, the individual bytes of multibyte characters are seen
+as separate characters, and then again converted to UTF\-8. Such double encoding
+can be compared to double HTML encoding (\f(CW\*(C`&amp;gt;\*(C'\fR), or double URI encoding
+(\f(CW%253E\fR).
+.PP
+This silent implicit decoding is known as "upgrading". That may sound
+positive, but it's best to avoid it.
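+.PP
+A small sketch of the effect, using hypothetical variable names; the comments
+list the bytes that result under these assumptions:
+.PP
+.Vb 9
+\& use Encode qw(decode encode);
+\&
+\& my $bytes = "\exC3\exA9";              # UTF\-8 bytes of U+00E9, never decoded
+\& my $smile = "\ex{263A}";              # a text string
+\&
+\& # The undecoded bytes are taken as two latin\-1 characters: double encoding.
+\& my $wrong = encode(\*(AqUTF\-8\*(Aq, $bytes . $smile);   # C3 83 C2 A9 E2 98 BA
+\& my $text  = decode(\*(AqUTF\-8\*(Aq, $bytes);            # one character, U+00E9
+\& my $right = encode(\*(AqUTF\-8\*(Aq, $text . $smile);    # C3 A9 E2 98 BA
+.Ve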
+.SS "What if I don't encode?"
+.IX Subsection "What if I don't encode?"
+It depends on what you output and how you output it.
+.PP
+\fIOutput via a filehandle\fR
+.IX Subsection "Output via a filehandle"
+.IP \(bu 4
+If the string's characters are all code point 255 or lower, Perl
+outputs bytes that match those code points. This is what happens with encoded
+strings. It can also happen, though, with unencoded strings whose code points
+are all 255 or lower.
+.IP \(bu 4
+Otherwise, Perl outputs the string encoded as UTF\-8. This only happens
+with strings you neglected to encode. Since that should not happen, Perl also
+throws a "wide character" warning in this case.
+.PP
+\fIOther output mechanisms (e.g., \fR\f(CI\*(C`exec\*(C'\fR\fI, \fR\f(CI\*(C`chdir\*(C'\fR\fI, ...)\fR
+.IX Subsection "Other output mechanisms (e.g., exec, chdir, ...)"
+.PP
+Your text string will be sent using the bytes in Perl's internal format.
+.PP
+The internal format is often UTF\-8, and UTF\-8 is usually the encoding you
+wanted, so these bugs are hard to spot! But don't be lazy, and don't rely on
+the fact that Perl's internal format is UTF\-8. Encode
+explicitly to avoid weird bugs, and to show to maintenance programmers that you
+thought this through.
+.SS "Is there a way to automatically decode or encode?"
+.IX Subsection "Is there a way to automatically decode or encode?"
+If all data that comes from a certain handle is encoded in exactly the same
+way, you can tell the PerlIO system to automatically decode everything, with
+the \f(CW\*(C`encoding\*(C'\fR layer. If you do this, you can't accidentally forget to decode
+or encode anymore, on things that use the layered handle.
+.PP
+You can provide this layer when \f(CW\*(C`open\*(C'\fRing the file:
+.PP
+.Vb 2
+\& open my $fh, \*(Aq>:encoding(UTF\-8)\*(Aq, $filename; # auto encoding on write
+\& open my $fh, \*(Aq<:encoding(UTF\-8)\*(Aq, $filename; # auto decoding on read
+.Ve
+.PP
+Or if you already have an open filehandle:
+.PP
+.Vb 1
+\& binmode $fh, \*(Aq:encoding(UTF\-8)\*(Aq;
+.Ve
+.PP
+Some database drivers for DBI can also automatically encode and decode, but
+that is sometimes limited to the UTF\-8 encoding.
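+.PP
+The following sketch writes and re-reads one character through the layer. It
+assumes a scratch file created with the core File::Temp module; the file and
+variable names are only illustrative:
+.PP
+.Vb 13
+\& use File::Temp qw(tempfile);
+\&
+\& my ($tmp, $filename) = tempfile(UNLINK => 1);
+\& close $tmp;
+\&
+\& open my $out, \*(Aq>:encoding(UTF\-8)\*(Aq, $filename or die "open: $!";
+\& print {$out} "\ex{263A}\en";           # encoded on the way out
+\& close $out or die "close: $!";
+\&
+\& open my $in, \*(Aq<:encoding(UTF\-8)\*(Aq, $filename or die "open: $!";
+\& chomp(my $line = <$in>);             # decoded on the way in
+\& close $in;
+\& print length($line), "\en";           # 1: one character, not three bytes
+.Ve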
+.SS "What if I don't know which encoding was used?"
+.IX Subsection "What if I don't know which encoding was used?"
+Do whatever you can to find out, and if you have to: guess. (Don't forget to
+document your guess with a comment.)
+.PP
+You could open the document in a web browser, and change the character set or
+character encoding until you can visually confirm that all characters look the
+way they should.
+.PP
+There is no way to reliably detect the encoding automatically, so if people
+keep sending you data without charset indication, you may have to educate them.
+.SS "Can I use Unicode in my Perl sources?"
+.IX Subsection "Can I use Unicode in my Perl sources?"
+Yes, you can! If your sources are UTF\-8 encoded, you can indicate that with the
+\&\f(CW\*(C`use utf8\*(C'\fR pragma.
+.PP
+.Vb 1
+\& use utf8;
+.Ve
+.PP
+This doesn't do anything to your input, or to your output. It only influences
+the way your sources are read. You can use Unicode in string literals, in
+identifiers (but they still have to be "word characters" according to \f(CW\*(C`\ew\*(C'\fR),
+and even in custom delimiters.
+.SS "Data::Dumper doesn't restore the UTF8 flag; is it broken?"
+.IX Subsection "Data::Dumper doesn't restore the UTF8 flag; is it broken?"
+No, Data::Dumper's Unicode abilities are as they should be. There have been
+some complaints that it should restore the UTF8 flag when the data is read
+again with \f(CW\*(C`eval\*(C'\fR. However, you should really not look at the flag, and
+nothing indicates that Data::Dumper should break this rule.
+.PP
+Here's what happens: when Perl reads in a string literal, it sticks to 8 bit
+encoding as long as it can. (But perhaps originally it was internally encoded
+as UTF\-8, when you dumped it.) When it has to give that up because other
+characters are added to the text string, it silently upgrades the string to
+UTF\-8.
+.PP
+If you properly encode your strings for output, none of this is your
+concern, and you can just \f(CW\*(C`eval\*(C'\fR dumped data as always.
+.SS "Why do regex character classes sometimes match only in the ASCII range?"
+.IX Subsection "Why do regex character classes sometimes match only in the ASCII range?"
+Starting in Perl 5.14 (and partially in Perl 5.12), just put a
+\&\f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR near the beginning of your program.
+Within its lexical scope you shouldn't have this problem. It is also
+automatically enabled under \f(CW\*(C`use feature \*(Aq:5.12\*(Aq\*(C'\fR or \f(CW\*(C`use v5.12\*(C'\fR or
+using \f(CW\*(C`\-E\*(C'\fR on the command line for Perl 5.12 or higher.
+.PP
+The rationale for requiring this is to not break older programs that
+rely on the way things worked before Unicode came along. Those older
+programs knew only about the ASCII character set, and so may not work
+properly for additional characters. When a string is encoded in UTF\-8,
+Perl assumes that the program is prepared to deal with Unicode, but when
+the string isn't, Perl assumes that only ASCII
+is wanted, and so characters outside the ASCII range
+aren't given the properties they would have in Unicode.
+\&\f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR tells Perl to treat all characters as
+Unicode, whether the string is encoded in UTF\-8 or not, thus avoiding
+the problem.
+.PP
+However, on earlier Perls, or if you pass strings to subroutines outside
+the feature's scope, you can force Unicode rules by changing the
+encoding to UTF\-8 by doing \f(CWutf8::upgrade($string)\fR. This can be used
+safely on any string, as it checks and does not change strings that have
+already been upgraded.
+.PP
+For a more detailed discussion, see Unicode::Semantics on CPAN.
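+.PP
+A sketch of the difference, assuming the code runs outside the scope of
+\&\f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR and of any locale pragma (the variable names
+are illustrative):
+.PP
+.Vb 6
+\& my $string = "\exE9";                 # U+00E9, held in a single byte
+\& my $before = ($string =~ /\ew/) ? "word" : "not word";
+\&
+\& utf8::upgrade($string);              # same character, UTF\-8 internally
+\& my $after  = ($string =~ /\ew/) ? "word" : "not word";
+\& print "$before, $after\en";           # not word, word
+.Ve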
+.SS "Why do some characters not uppercase or lowercase correctly?"
+.IX Subsection "Why do some characters not uppercase or lowercase correctly?"
+See the answer to the previous question.
+.SS "How can I determine if a string is a text string or a binary string?"
+.IX Subsection "How can I determine if a string is a text string or a binary string?"
+You can't. Some use the UTF8 flag for this, but that's misuse, and makes well
+behaved modules like Data::Dumper look bad. The flag is useless for this
+purpose, because it's off when an 8 bit encoding (by default ISO\-8859\-1) is
+used to store the string.
+.PP
+This is something you, the programmer, have to keep track of; sorry. You could
+consider adopting a kind of "Hungarian notation" to help with this.
+.SS "How do I convert from encoding FOO to encoding BAR?"
+.IX Subsection "How do I convert from encoding FOO to encoding BAR?"
+By first converting the FOO-encoded byte string to a text string, and then the
+text string to a BAR-encoded byte string:
+.PP
+.Vb 3
+\& use Encode qw(decode encode);
+\& my $text_string = decode(\*(AqFOO\*(Aq, $foo_string);
+\& my $bar_string = encode(\*(AqBAR\*(Aq, $text_string);
+.Ve
+.PP
+or by skipping the text string part, and going directly from one binary
+encoding to the other:
+.PP
+.Vb 2
+\& use Encode qw(from_to);
+\& from_to($string, \*(AqFOO\*(Aq, \*(AqBAR\*(Aq); # changes contents of $string
+.Ve
+.PP
+or by letting automatic decoding and encoding do all the work:
+.PP
+.Vb 3
+\& open my $foofh, \*(Aq<:encoding(FOO)\*(Aq, \*(Aqexample.foo.txt\*(Aq;
+\& open my $barfh, \*(Aq>:encoding(BAR)\*(Aq, \*(Aqexample.bar.txt\*(Aq;
+\& print { $barfh } $_ while <$foofh>;
+.Ve
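+.PP
+For instance, taking ISO\-8859\-1 as FOO and UTF\-8 as BAR (an arbitrary choice
+made here for illustration), the first two approaches should give the same
+bytes:
+.PP
+.Vb 9
+\& use Encode qw(decode encode from_to);
+\&
+\& my $latin1 = "caf\exE9";                        # 4 bytes of ISO\-8859\-1
+\& my $text   = decode(\*(AqISO\-8859\-1\*(Aq, $latin1);   # 4 characters
+\& my $utf8   = encode(\*(AqUTF\-8\*(Aq, $text);          # 5 bytes: "caf\exC3\exA9"
+\&
+\& my $copy = $latin1;
+\& from_to($copy, \*(AqISO\-8859\-1\*(Aq, \*(AqUTF\-8\*(Aq);      # converts $copy in place
+\& my $same = ($copy eq $utf8);                  # true
+.Ve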
+.ie n .SS "What are ""decode_utf8"" and ""encode_utf8""?"
+.el .SS "What are \f(CWdecode_utf8\fP and \f(CWencode_utf8\fP?"
+.IX Subsection "What are decode_utf8 and encode_utf8?"
+These are alternate syntaxes for \f(CW\*(C`decode(\*(Aqutf8\*(Aq, ...)\*(C'\fR and \f(CW\*(C`encode(\*(Aqutf8\*(Aq,
+\&...)\*(C'\fR. Do not use these functions for data exchange. Instead use
+\&\f(CW\*(C`decode(\*(AqUTF\-8\*(Aq, ...)\*(C'\fR and \f(CW\*(C`encode(\*(AqUTF\-8\*(Aq, ...)\*(C'\fR; see
+"What's the difference between UTF\-8 and utf8?" below.
+.SS "What is a ""wide character""?"
+.IX Subsection "What is a ""wide character""?"
+This is a term used for characters occupying more than one byte.
+.PP
+The Perl warning "Wide character in ..." is caused by such a character.
+With no specified encoding layer, Perl tries to
+fit things into a single byte. When it can't, it
+emits this warning (if warnings are enabled), and uses UTF\-8 encoded data
+instead.
+.PP
+To avoid this warning and to avoid having different output encodings in a single
+stream, always specify an encoding explicitly, for example with a PerlIO layer:
+.PP
+.Vb 1
+\& binmode STDOUT, ":encoding(UTF\-8)";
+.Ve
+.SH INTERNALS
+.IX Header "INTERNALS"
+.SS "What is ""the UTF8 flag""?"
+.IX Subsection "What is ""the UTF8 flag""?"
+Please, unless you're hacking the internals, or debugging weirdness, don't
+think about the UTF8 flag at all. That means that you very probably shouldn't
+use \f(CW\*(C`is_utf8\*(C'\fR, \f(CW\*(C`_utf8_on\*(C'\fR or \f(CW\*(C`_utf8_off\*(C'\fR at all.
+.PP
+The UTF8 flag, also called SvUTF8, is an internal flag that indicates that the
+current internal representation is UTF\-8. Without the flag, it is assumed to be
+ISO\-8859\-1. Perl converts between these automatically. (Actually Perl usually
+assumes the representation is ASCII; see "Why do regex character classes
+sometimes match only in the ASCII range?" above.)
+.PP
+One of Perl's internal formats happens to be UTF\-8. Unfortunately, Perl can't
+keep a secret, so everyone knows about this. That is the source of much
+confusion. It's better to pretend that the internal format is some unknown
+encoding, and that you always have to encode and decode explicitly.
+.ie n .SS "What about the ""use bytes"" pragma?"
+.el .SS "What about the \f(CWuse bytes\fP pragma?"
+.IX Subsection "What about the use bytes pragma?"
+Don't use it. It makes no sense to deal with bytes in a text string, and it
+makes no sense to deal with characters in a byte string. Do the proper
+conversions (by decoding/encoding), and things will work out well: you get
+character counts for decoded data, and byte counts for encoded data.
+.PP
+\&\f(CW\*(C`use bytes\*(C'\fR is usually a failed attempt to do something useful. Just forget
+about it.
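+.PP
+A short sketch of those counts without \f(CW\*(C`use bytes\*(C'\fR (the variable names are
+illustrative):
+.PP
+.Vb 6
+\& use Encode qw(decode encode);
+\&
+\& my $bytes = "\exE2\ex98\exBA";                      # UTF\-8 encoded U+263A
+\& my $text  = decode(\*(AqUTF\-8\*(Aq, $bytes);
+\& print length($text), "\en";                       # 1 character
+\& print length(encode(\*(AqUTF\-16BE\*(Aq, $text)), "\en";   # 2 bytes
+.Ve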
+.ie n .SS "What about the ""use encoding"" pragma?"
+.el .SS "What about the \f(CWuse encoding\fP pragma?"
+.IX Subsection "What about the use encoding pragma?"
+Don't use it. Unfortunately, it assumes that the programmer's environment and
+that of the user will use the same encoding. It will use the same encoding for
+the source code and for STDIN and STDOUT. When a program is copied to another
+machine, the source code does not change, but the STDIO environment might.
+.PP
+If you need non-ASCII characters in your source code, make it a UTF\-8 encoded
+file and \f(CW\*(C`use utf8\*(C'\fR.
+.PP
+If you need to set the encoding for STDIN, STDOUT, and STDERR, for example
+based on the user's locale, \f(CW\*(C`use open\*(C'\fR.
+.ie n .SS "What is the difference between "":encoding"" and "":utf8""?"
+.el .SS "What is the difference between \f(CW:encoding\fP and \f(CW:utf8\fP?"
+.IX Subsection "What is the difference between :encoding and :utf8?"
+Because UTF\-8 is one of Perl's internal formats, you can often just skip the
+encoding or decoding step, and manipulate the UTF8 flag directly.
+.PP
+Instead of \f(CW:encoding(UTF\-8)\fR, you can simply use \f(CW\*(C`:utf8\*(C'\fR, which skips the
+encoding step if the data was already represented as UTF8 internally. This is
+widely accepted as good behavior when you're writing, but it can be dangerous
+when reading, because it causes internal inconsistency when you have invalid
+byte sequences. Using \f(CW\*(C`:utf8\*(C'\fR for input can sometimes result in security
+breaches, so please use \f(CW:encoding(UTF\-8)\fR instead.
+.PP
+Instead of \f(CW\*(C`decode\*(C'\fR and \f(CW\*(C`encode\*(C'\fR, you could use \f(CW\*(C`_utf8_on\*(C'\fR and \f(CW\*(C`_utf8_off\*(C'\fR,
+but this is considered bad style. Especially \f(CW\*(C`_utf8_on\*(C'\fR can be dangerous, for
+the same reason that \f(CW\*(C`:utf8\*(C'\fR can.
+.PP
+There are some shortcuts for one-liners;
+see \-C in perlrun.
+.ie n .SS "What's the difference between ""UTF\-8"" and ""utf8""?"
+.el .SS "What's the difference between \f(CWUTF\-8\fP and \f(CWutf8\fP?"
+.IX Subsection "What's the difference between UTF-8 and utf8?"
+\&\f(CW\*(C`UTF\-8\*(C'\fR is the official standard. \f(CW\*(C`utf8\*(C'\fR is Perl's way of being liberal in
+what it accepts. If you have to communicate with things that aren't so liberal,
+you may want to consider using \f(CW\*(C`UTF\-8\*(C'\fR. If you have to communicate with things
+that are too liberal, you may have to use \f(CW\*(C`utf8\*(C'\fR. The full explanation is in
+"UTF\-8 vs. utf8 vs. UTF8" in Encode.
+.PP
+\&\f(CW\*(C`UTF\-8\*(C'\fR is internally known as \f(CW\*(C`utf\-8\-strict\*(C'\fR. The tutorial uses UTF\-8
+consistently, even where utf8 is actually used internally, because the
+distinction can be hard to make, and is mostly irrelevant.
+.PP
+For example, utf8 can be used for code points that don't exist in Unicode, like
+9999999, but if you encode that to UTF\-8, you get a substitution character (by
+default; see "Handling Malformed Data" in Encode for more ways of dealing with
+this.)
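+.PP
+A sketch of the distinction. The names printed by \f(CW\*(C`find_encoding\*(C'\fR are the ones
+documented in Encode, and 0x110000 is chosen here only as an example of a code
+point beyond Unicode:
+.PP
+.Vb 8
+\& use Encode qw(encode find_encoding);
+\&
+\& print find_encoding(\*(AqUTF\-8\*(Aq)\->name, "\en";   # utf\-8\-strict
+\& print find_encoding(\*(Aqutf8\*(Aq)\->name, "\en";    # utf8
+\&
+\& my $beyond = chr(0x110000);               # not a Unicode code point
+\& my $lax    = encode(\*(Aqutf8\*(Aq, $beyond);     # kept, in the extended encoding
+\& my $strict = encode(\*(AqUTF\-8\*(Aq, $beyond);    # substitution character by default
+.Ve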
+.PP
+Okay, if you insist: the "internal format" is utf8, not UTF\-8. (When it's not
+some other encoding.)
+.SS "I lost track; what encoding is the internal format really?"
+.IX Subsection "I lost track; what encoding is the internal format really?"
+It's good that you lost track, because you shouldn't depend on the internal
+format being any specific encoding. But since you asked: by default, the
+internal format is either ISO\-8859\-1 (latin\-1), or utf8, depending on the
+history of the string. On EBCDIC platforms, this may even be different.
+.PP
+Perl knows how it stored the string internally, and will use that knowledge
+when you \f(CW\*(C`encode\*(C'\fR. In other words: don't try to find out what the internal
+encoding for a certain string is, but instead just encode it into the encoding
+that you want.
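+.PP
+To illustrate with a sketch (hypothetical variable names), two strings holding
+the same character in different internal formats compare equal and encode to
+the same bytes:
+.PP
+.Vb 9
+\& use Encode qw(encode);
+\&
+\& my $native   = "\exE9";           # typically stored as one latin\-1 byte
+\& my $upgraded = "\exE9";
+\& utf8::upgrade($upgraded);        # same character, now stored as utf8
+\&
+\& print "equal strings\en" if $native eq $upgraded;
+\& print "equal bytes\en"
+\&     if encode(\*(AqUTF\-8\*(Aq, $native) eq encode(\*(AqUTF\-8\*(Aq, $upgraded);
+.Ve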
+.SH AUTHOR
+.IX Header "AUTHOR"
+Juerd Waalboer <#####@juerd.nl>
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+perlunicode, perluniintro, Encode