Diffstat (limited to 'upstream/mageia-cauldron/man1/perlunitut.1')
-rw-r--r-- upstream/mageia-cauldron/man1/perlunitut.1 285
1 files changed, 285 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man1/perlunitut.1 b/upstream/mageia-cauldron/man1/perlunitut.1
new file mode 100644
index 00000000..bb943ee2
--- /dev/null
+++ b/upstream/mageia-cauldron/man1/perlunitut.1
@@ -0,0 +1,285 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+. ds C` ""
+. ds C' ""
+'br\}
+.el\{\
+. ds C`
+. ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD. Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+. if \nF \{\
+. de IX
+. tm Index:\\$1\t\\n%\t"\\$2"
+..
+. if !\nF==2 \{\
+. nr % 0
+. nr F 2
+. \}
+. \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "PERLUNITUT 1"
+.TH PERLUNITUT 1 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification. Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+perlunitut \- Perl Unicode Tutorial
+.SH DESCRIPTION
+.IX Header "DESCRIPTION"
+The days of just flinging strings around are over. It's well established that
+modern programs need to be capable of communicating funny accented letters, and
+things like euro symbols. This means that programmers need new habits. It's
+easy to program Unicode capable software, but it does require discipline to do
+it right.
+.PP
+There's a lot to know about character sets, and text encodings. It's probably
+best to spend a full day learning all this, but the basics can be learned in
+minutes.
+.PP
+These are not the very basics, though. It is assumed that you already
+know the difference between bytes and characters, and realise (and accept!)
+that there are many different character sets and encodings, and that your
+program has to be explicit about them. Recommended reading is "The Absolute
+Minimum Every Software Developer Absolutely, Positively Must Know About Unicode
+and Character Sets (No Excuses!)" by Joel Spolsky, at
+<http://joelonsoftware.com/articles/Unicode.html>.
+.PP
+This tutorial speaks in rather absolute terms, and provides only a limited view
+of the wealth of character string related features that Perl has to offer. For
+most projects, this information will probably suffice.
+.SS Definitions
+.IX Subsection "Definitions"
+It's important to set a few things straight first. This is the most important
+part of this tutorial. This view may conflict with other information that you
+may have found on the web, but that's mostly because many sources are wrong.
+.PP
+You may have to re-read this entire section a few times...
+.PP
+\fIUnicode\fR
+.IX Subsection "Unicode"
+.PP
+\&\fBUnicode\fR is a character set with room for lots of characters. The ordinal
+value of a character is called a \fBcode point\fR. (But in practice, the
+distinction between code point and character is blurred, so the terms often
+are used interchangeably.)
+.PP
+There are many, many code points, but computers work with bytes, and a byte has
+room for only 256 values. Unicode has many more characters than that,
+so you need a method to make these accessible.
+.PP
+Unicode is encoded using several competing encodings, of which UTF\-8 is the
+most used. In a Unicode encoding, multiple consecutive bytes can be used to
+store a single code point, or simply: character.
+.PP
+\fIUTF\-8\fR
+.IX Subsection "UTF-8"
+.PP
+\&\fBUTF\-8\fR is a Unicode encoding. Many people think that Unicode and UTF\-8 are
+the same thing, but they're not. There are more Unicode encodings, but much of
+the world has standardized on UTF\-8.
+.PP
+UTF\-8 treats the first 128 code points, 0..127, the same as ASCII. They take
+only one byte per character. All other characters are encoded as two to
+four bytes using a complex scheme. Fortunately, Perl handles this for
+us, so we don't have to worry about it.
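The one-to-four byte rule can be checked with a small sketch. The helper name `utf8_byte_count` and the four sample code points are chosen here for illustration only; the sketch assumes nothing beyond the core `Encode` module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: how many bytes does the UTF-8 encoding of a
# single code point occupy?
sub utf8_byte_count {
    my ($code_point) = @_;
    my $character = chr $code_point;          # a one-character text string
    return length encode('UTF-8', $character); # length of the byte string
}

# An ASCII letter, an accented letter, the euro sign, and an emoji.
for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X takes %d byte(s) in UTF-8\n", $cp, utf8_byte_count($cp);
}
```

The first code point is in the ASCII range and needs a single byte; the others need two, three and four bytes respectively.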
+.PP
+\fIText strings (character strings)\fR
+.IX Subsection "Text strings (character strings)"
+.PP
+\&\fBText strings\fR, or \fBcharacter strings\fR are made of characters. Bytes are
+irrelevant here, and so are encodings. Each character is just that: the
+character.
+.PP
+On a text string, you would do things like:
+.PP
+.Vb 4
+\& $text =~ s/foo/bar/;
+\& if ($text =~ /^\ed+$/) { ... }
+\& $text = ucfirst $text;
+\& my $character_count = length $text;
+.Ve
+.PP
+The value of a character (\f(CW\*(C`ord\*(C'\fR, \f(CW\*(C`chr\*(C'\fR) is the corresponding Unicode code
+point.
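As a minimal sketch of that relationship, the following builds a text string from a code point and reads the code point back. The variable names and the sample text are illustrative assumptions.

```perl
use strict;
use warnings;

# chr turns a code point into a one-character text string.
my $euro = chr 0x20AC;              # EURO SIGN
my $text = 'price: 5' . $euro;      # eight ASCII characters plus the euro sign

# ord goes the other way: from character back to code point.
printf "code point: U+%04X\n", ord $euro;

# length counts characters, not bytes, because this is a text string.
printf "characters: %d\n", length $text;
```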
+.PP
+\fIBinary strings (byte strings)\fR
+.IX Subsection "Binary strings (byte strings)"
+.PP
+\&\fBBinary strings\fR, or \fBbyte strings\fR are made of bytes. Here, you don't have
+characters, just bytes. All communication with the outside world (anything
+outside of your current Perl process) is done in binary.
+.PP
+On a binary string, you would do things like:
+.PP
+.Vb 4
+\& my (@length_content) = unpack "(V/a)*", $binary;
+\& $binary =~ s/\ex00\ex0F/\exFF\exF0/; # for the brave :)
+\& print {$fh} $binary;
+\& my $byte_count = length $binary;
+.Ve
+.PP
+\fIEncoding\fR
+.IX Subsection "Encoding"
+.PP
+\&\fBEncoding\fR (as a verb) is the conversion from \fItext\fR to \fIbinary\fR. To encode,
+you have to supply the target encoding, for example \f(CW\*(C`iso\-8859\-1\*(C'\fR or \f(CW\*(C`UTF\-8\*(C'\fR.
+Some encodings, like the \f(CW\*(C`iso\-8859\*(C'\fR ("latin") range, do not support the full
+Unicode standard; characters that can't be represented are lost in the
+conversion.
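A short sketch of what "lost" means in practice, assuming the default behaviour of `encode` (substitution) and the `FB_CROAK` and `LEAVE_SRC` check flags from the `Encode` module. The sample text and variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "5 " followed by EURO SIGN (U+20AC), which ISO-8859-1 cannot represent.
my $text = '5 ' . chr 0x20AC;

# Default behaviour: the unrepresentable character is replaced by a
# substitution character ("?" for this encoding).
my $lossy = encode('iso-8859-1', $text);

# Strict behaviour: FB_CROAK makes encode() die instead of substituting,
# and LEAVE_SRC keeps $text unmodified. The eval catches the exception.
my $strict = eval {
    encode('iso-8859-1', $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $error = $@;

printf "lossy result has %d bytes\n", length $lossy;
print  "strict encoding failed\n" if !defined $strict;
```

Whether silent substitution or a hard failure is preferable depends on the program; the strict form at least makes the data loss visible.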
+.PP
+\fIDecoding\fR
+.IX Subsection "Decoding"
+.PP
+\&\fBDecoding\fR is the conversion from \fIbinary\fR to \fItext\fR. To decode, you have to
+know what encoding was used during the encoding phase. And most of all, it must
+be something decodable. It doesn't make much sense to decode a PNG image into a
+text string.
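Why the encoding has to be known can be seen by decoding the same two bytes in two different ways. This is only a sketch; the byte values are the UTF\-8 encoding of U+00E9, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes that were produced by encoding U+00E9 as UTF-8.
my $bytes = "\xC3\xA9";

# Decoding with the encoding that was actually used gives one character.
my $right = decode('UTF-8', $bytes);

# Decoding with the wrong encoding "works", but yields two unrelated
# characters (U+00C3 and U+00A9) instead of the intended one.
my $wrong = decode('ISO-8859-1', $bytes);

printf "as UTF-8:      %d character(s): %s\n", length $right,
    join ' ', map { sprintf 'U+%04X', ord $_ } split //, $right;
printf "as ISO-8859-1: %d character(s): %s\n", length $wrong,
    join ' ', map { sprintf 'U+%04X', ord $_ } split //, $wrong;
```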
+.PP
+\fIInternal format\fR
+.IX Subsection "Internal format"
+.PP
+Perl has an \fBinternal format\fR, an encoding that it uses to encode text strings
+so it can store them in memory. All text strings are in this internal format.
+In fact, text strings are never in any other format!
+.PP
+You shouldn't worry about what this format is, because conversion is
+automatically done when you decode or encode.
+.SS "Your new toolkit"
+.IX Subsection "Your new toolkit"
+Add to your standard heading the following line:
+.PP
+.Vb 1
+\& use Encode qw(encode decode);
+.Ve
+.PP
+Or, if you're lazy, just:
+.PP
+.Vb 1
+\& use Encode;
+.Ve
+.SS "I/O flow (the actual 5 minute tutorial)"
+.IX Subsection "I/O flow (the actual 5 minute tutorial)"
+The typical input/output flow of a program is:
+.PP
+.Vb 3
+\& 1. Receive and decode
+\& 2. Process
+\& 3. Encode and output
+.Ve
+.PP
+If your input is binary, and is supposed to remain binary, you shouldn't decode
+it to a text string, of course. But in all other cases, you should decode it.
+.PP
+Decoding can't happen reliably if you don't know how the data was encoded. If
+you get to choose, it's a good idea to standardize on UTF\-8.
+.PP
+.Vb 3
+\& my $foo = decode(\*(AqUTF\-8\*(Aq, get \*(Aqhttp://example.com/\*(Aq);
+\& my $bar = decode(\*(AqISO\-8859\-1\*(Aq, readline STDIN);
+\& my $xyzzy = decode(\*(AqWindows\-1251\*(Aq, $cgi\->param(\*(Aqfoo\*(Aq));
+.Ve
+.PP
+Processing works just as it did before. The only difference is that you're now
+using characters instead of bytes. That's very useful if you use things like
+\&\f(CW\*(C`substr\*(C'\fR, or \f(CW\*(C`length\*(C'\fR.
+.PP
+It's important to realize that there are no bytes in a text string. Of course,
+Perl has its internal encoding to store the string in memory, but ignore that.
+If you have to do anything with the number of bytes, it's probably best to move
+that part to step 3, just after you've encoded the string. Then you know
+exactly how many bytes the destination string will contain.
+.PP
+The syntax for encoding text strings to binary strings is as simple as decoding:
+.PP
+.Vb 1
+\& $body = encode(\*(AqUTF\-8\*(Aq, $body);
+.Ve
+.PP
+If you needed to know the length of the string in bytes, now's the perfect time
+for that. Because \f(CW$body\fR is now a byte string, \f(CW\*(C`length\*(C'\fR will report the
+number of bytes, instead of the number of characters. The number of
+characters is no longer known, because characters only exist in text strings.
+.PP
+.Vb 1
+\& my $byte_count = length $body;
+.Ve
+.PP
+And if the protocol you're using supports a way of letting the recipient know
+which character encoding you used, please help the receiving end by using that
+feature! For example, E\-mail and HTTP support MIME headers, so you can use the
+\&\f(CW\*(C`Content\-Type\*(C'\fR header. They can also have \f(CW\*(C`Content\-Length\*(C'\fR to indicate the
+number of \fIbytes\fR, which is always a good idea to supply if the number is
+known.
+.PP
+.Vb 2
+\& "Content\-Type: text/plain; charset=UTF\-8",
+\& "Content\-Length: $byte_count"
+.Ve
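The three steps can be put together in one small sketch. The function `build_response`, its parameters, and the sample input are hypothetical; the sketch assumes the input arrives as UTF\-8 bytes and that the output should be UTF\-8 as well.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: takes UTF-8 input bytes, keeps the first
# $max_chars characters, and returns headers plus an encoded body.
sub build_response {
    my ($input_bytes, $max_chars) = @_;

    # 1. Receive and decode: bytes become a text string.
    my $text = decode('UTF-8', $input_bytes);

    # 2. Process: substr counts characters here, not bytes.
    my $excerpt = substr $text, 0, $max_chars;

    # 3. Encode and output: the text string becomes bytes again, and
    #    only now is the byte count meaningful.
    my $body       = encode('UTF-8', $excerpt);
    my $byte_count = length $body;

    my @headers = (
        'Content-Type: text/plain; charset=UTF-8',
        "Content-Length: $byte_count",
    );
    return (\@headers, $body);
}

# "café au lait" as UTF-8 bytes; the accented letter occupies two bytes.
my $input = "caf\xC3\xA9 au lait";
my ($headers, $body) = build_response($input, 4);
print "$_\n" for @$headers;
```

Cutting the excerpt in step 2 rather than on the raw input means a multi-byte character can never be split in half, and the `Content-Length` value is taken from the encoded body rather than from the character count.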
+.SH SUMMARY
+.IX Header "SUMMARY"
+Decode everything you receive, encode everything you send out. (If it's text
+data.)
+.SH "Q and A (or FAQ)"
+.IX Header "Q and A (or FAQ)"
+After reading this document, you ought to read perlunifaq too, then
+perluniintro.
+.SH ACKNOWLEDGEMENTS
+.IX Header "ACKNOWLEDGEMENTS"
+Thanks to Johan Vromans from Squirrel Consultancy. His UTF\-8 rants during the
+Amsterdam Perl Mongers meetings got me interested and determined to find out
+how to use character encodings in Perl in ways that don't break easily.
+.PP
+Thanks to Gerard Goossen from TTY. His presentation "UTF\-8 in the wild" (Dutch
+Perl Workshop 2006) inspired me to publish my thoughts and write this tutorial.
+.PP
+Thanks to the people who asked about this kind of stuff in several Perl IRC
+channels, and have constantly reminded me that a simpler explanation was
+needed.
+.PP
+Thanks to the people who reviewed this document for me, before it went public.
+They are: Benjamin Smith, Jan-Pieter Cornet, Johan Vromans, Lukas Mai, Nathan
+Gray.
+.SH AUTHOR
+.IX Header "AUTHOR"
+Juerd Waalboer <#####@juerd.nl>
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+perlunifaq, perlunicode, perluniintro, Encode