Diffstat (limited to 'upstream/mageia-cauldron/man1/perlunitut.1')
-rw-r--r-- | upstream/mageia-cauldron/man1/perlunitut.1 | 285 |
1 files changed, 285 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man1/perlunitut.1 b/upstream/mageia-cauldron/man1/perlunitut.1 new file mode 100644 index 00000000..bb943ee2 --- /dev/null +++ b/upstream/mageia-cauldron/man1/perlunitut.1 @@ -0,0 +1,285 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLUNITUT 1" +.TH PERLUNITUT 1 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. +.if n .ad l +.nh +.SH NAME +perlunitut \- Perl Unicode Tutorial +.SH DESCRIPTION +.IX Header "DESCRIPTION" +The days of just flinging strings around are over. 
It's well established that +modern programs need to be capable of communicating funny accented letters, and +things like euro symbols. This means that programmers need new habits. It's +easy to program Unicode capable software, but it does require discipline to do +it right. +.PP +There's a lot to know about character sets, and text encodings. It's probably +best to spend a full day learning all this, but the basics can be learned in +minutes. +.PP +These are not the very basics, though. It is assumed that you already +know the difference between bytes and characters, and realise (and accept!) +that there are many different character sets and encodings, and that your +program has to be explicit about them. Recommended reading is "The Absolute +Minimum Every Software Developer Absolutely, Positively Must Know About Unicode +and Character Sets (No Excuses!)" by Joel Spolsky, at +<http://joelonsoftware.com/articles/Unicode.html>. +.PP +This tutorial speaks in rather absolute terms, and provides only a limited view +of the wealth of character string related features that Perl has to offer. For +most projects, this information will probably suffice. +.SS Definitions +.IX Subsection "Definitions" +It's important to set a few things straight first. This is the most important +part of this tutorial. This view may conflict with other information that you +may have found on the web, but that's mostly because many sources are wrong. +.PP +You may have to re-read this entire section a few times... +.PP +\fIUnicode\fR +.IX Subsection "Unicode" +.PP +\&\fBUnicode\fR is a character set with room for lots of characters. The ordinal +value of a character is called a \fBcode point\fR. (But in practice, the +distinction between code point and character is blurred, so the terms often +are used interchangeably.) +.PP +There are many, many code points, but computers work with bytes, and a byte has +room for only 256 values. 
Unicode has many more characters than that, +so you need a method to make these accessible. +.PP +Unicode is encoded using several competing encodings, of which UTF\-8 is the +most used. In a Unicode encoding, multiple subsequent bytes can be used to +store a single code point, or simply: character. +.PP +\fIUTF\-8\fR +.IX Subsection "UTF-8" +.PP +\&\fBUTF\-8\fR is a Unicode encoding. Many people think that Unicode and UTF\-8 are +the same thing, but they're not. There are more Unicode encodings, but much of +the world has standardized on UTF\-8. +.PP +UTF\-8 treats the first 128 codepoints, 0..127, the same as ASCII. They take +only one byte per character. All other characters are encoded as two to +four bytes using a complex scheme. Fortunately, Perl handles this for +us, so we don't have to worry about this. +.PP +\fIText strings (character strings)\fR +.IX Subsection "Text strings (character strings)" +.PP +\&\fBText strings\fR, or \fBcharacter strings\fR are made of characters. Bytes are +irrelevant here, and so are encodings. Each character is just that: the +character. +.PP +On a text string, you would do things like: +.PP +.Vb 4 +\& $text =~ s/foo/bar/; +\& if ($string =~ /^\ed+$/) { ... } +\& $text = ucfirst $text; +\& my $character_count = length $text; +.Ve +.PP +The value of a character (\f(CW\*(C`ord\*(C'\fR, \f(CW\*(C`chr\*(C'\fR) is the corresponding Unicode code +point. +.PP +\fIBinary strings (byte strings)\fR +.IX Subsection "Binary strings (byte strings)" +.PP +\&\fBBinary strings\fR, or \fBbyte strings\fR are made of bytes. Here, you don't have +characters, just bytes. All communication with the outside world (anything +outside of your current Perl process) is done in binary. 
+.PP +On a binary string, you would do things like: +.PP +.Vb 4 +\& my (@length_content) = unpack "(V/a)*", $binary; +\& $binary =~ s/\ex00\ex0F/\exFF\exF0/; # for the brave :) +\& print {$fh} $binary; +\& my $byte_count = length $binary; +.Ve +.PP +\fIEncoding\fR +.IX Subsection "Encoding" +.PP +\&\fBEncoding\fR (as a verb) is the conversion from \fItext\fR to \fIbinary\fR. To encode, +you have to supply the target encoding, for example \f(CW\*(C`iso\-8859\-1\*(C'\fR or \f(CW\*(C`UTF\-8\*(C'\fR. +Some encodings, like the \f(CW\*(C`iso\-8859\*(C'\fR ("latin") range, do not support the full +Unicode standard; characters that can't be represented are lost in the +conversion. +.PP +\fIDecoding\fR +.IX Subsection "Decoding" +.PP +\&\fBDecoding\fR is the conversion from \fIbinary\fR to \fItext\fR. To decode, you have to +know what encoding was used during the encoding phase. And most of all, it must +be something decodable. It doesn't make much sense to decode a PNG image into a +text string. +.PP +\fIInternal format\fR +.IX Subsection "Internal format" +.PP +Perl has an \fBinternal format\fR, an encoding that it uses to encode text strings +so it can store them in memory. All text strings are in this internal format. +In fact, text strings are never in any other format! +.PP +You shouldn't worry about what this format is, because conversion is +automatically done when you decode or encode. +.SS "Your new toolkit" +.IX Subsection "Your new toolkit" +Add to your standard heading the following line: +.PP +.Vb 1 +\& use Encode qw(encode decode); +.Ve +.PP +Or, if you're lazy, just: +.PP +.Vb 1 +\& use Encode; +.Ve +.SS "I/O flow (the actual 5 minute tutorial)" +.IX Subsection "I/O flow (the actual 5 minute tutorial)" +The typical input/output flow of a program is: +.PP +.Vb 3 +\& 1. Receive and decode +\& 2. Process +\& 3. Encode and output +.Ve +.PP +If your input is binary, and is supposed to remain binary, you shouldn't decode +it to a text string, of course. 
But in all other cases, you should decode it. +.PP +Decoding can't happen reliably if you don't know how the data was encoded. If +you get to choose, it's a good idea to standardize on UTF\-8. +.PP +.Vb 3 +\& my $foo = decode(\*(AqUTF\-8\*(Aq, get \*(Aqhttp://example.com/\*(Aq); +\& my $bar = decode(\*(AqISO\-8859\-1\*(Aq, readline STDIN); +\& my $xyzzy = decode(\*(AqWindows\-1251\*(Aq, $cgi\->param(\*(Aqfoo\*(Aq)); +.Ve +.PP +Processing happens as you knew before. The only difference is that you're now +using characters instead of bytes. That's very useful if you use things like +\&\f(CW\*(C`substr\*(C'\fR, or \f(CW\*(C`length\*(C'\fR. +.PP +It's important to realize that there are no bytes in a text string. Of course, +Perl has its internal encoding to store the string in memory, but ignore that. +If you have to do anything with the number of bytes, it's probably best to move +that part to step 3, just after you've encoded the string. Then you know +exactly how many bytes it will be in the destination string. +.PP +The syntax for encoding text strings to binary strings is as simple as decoding: +.PP +.Vb 1 +\& $body = encode(\*(AqUTF\-8\*(Aq, $body); +.Ve +.PP +If you needed to know the length of the string in bytes, now's the perfect time +for that. Because \f(CW$body\fR is now a byte string, \f(CW\*(C`length\*(C'\fR will report the +number of bytes, instead of the number of characters. The number of +characters is no longer known, because characters only exist in text strings. +.PP +.Vb 1 +\& my $byte_count = length $body; +.Ve +.PP +And if the protocol you're using supports a way of letting the recipient know +which character encoding you used, please help the receiving end by using that +feature! For example, E\-mail and HTTP support MIME headers, so you can use the +\&\f(CW\*(C`Content\-Type\*(C'\fR header. 
They can also have \f(CW\*(C`Content\-Length\*(C'\fR to indicate the +number of \fIbytes\fR, which is always a good idea to supply if the number is +known. +.PP +.Vb 2 +\& "Content\-Type: text/plain; charset=UTF\-8", +\& "Content\-Length: $byte_count" +.Ve +.SH SUMMARY +.IX Header "SUMMARY" +Decode everything you receive, encode everything you send out. (If it's text +data.) +.SH "Q and A (or FAQ)" +.IX Header "Q and A (or FAQ)" +After reading this document, you ought to read perlunifaq too, then +perluniintro. +.SH ACKNOWLEDGEMENTS +.IX Header "ACKNOWLEDGEMENTS" +Thanks to Johan Vromans from Squirrel Consultancy. His UTF\-8 rants during the +Amsterdam Perl Mongers meetings got me interested and determined to find out +how to use character encodings in Perl in ways that don't break easily. +.PP +Thanks to Gerard Goossen from TTY. His presentation "UTF\-8 in the wild" (Dutch +Perl Workshop 2006) inspired me to publish my thoughts and write this tutorial. +.PP +Thanks to the people who asked about this kind of stuff in several Perl IRC +channels, and have constantly reminded me that a simpler explanation was +needed. +.PP +Thanks to the people who reviewed this document for me, before it went public. +They are: Benjamin Smith, Jan-Pieter Cornet, Johan Vromans, Lukas Mai, Nathan +Gray. +.SH AUTHOR +.IX Header "AUTHOR" +Juerd Waalboer <#####@juerd.nl> +.SH "SEE ALSO" +.IX Header "SEE ALSO" +perlunifaq, perlunicode, perluniintro, Encode |