author    Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-15 19:43:11 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-15 19:43:11 +0000
commit    fc22b3d6507c6745911b9dfcc68f1e665ae13dbc (patch)
tree      ce1e3bce06471410239a6f41282e328770aa404a /upstream/archlinux/man1/perluniintro.1perl
parent    Initial commit. (diff)
Adding upstream version 4.22.0. (tag: upstream/4.22.0)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'upstream/archlinux/man1/perluniintro.1perl')
-rw-r--r-- upstream/archlinux/man1/perluniintro.1perl | 1073
1 file changed, 1073 insertions(+), 0 deletions(-)
diff --git a/upstream/archlinux/man1/perluniintro.1perl b/upstream/archlinux/man1/perluniintro.1perl
new file mode 100644
index 00000000..c9727da7
--- /dev/null
+++ b/upstream/archlinux/man1/perluniintro.1perl
@@ -0,0 +1,1073 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+. ds C` ""
+. ds C' ""
+'br\}
+.el\{\
+. ds C`
+. ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD. Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+. if \nF \{\
+. de IX
+. tm Index:\\$1\t\\n%\t"\\$2"
+..
+. if !\nF==2 \{\
+. nr % 0
+. nr F 2
+. \}
+. \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "PERLUNIINTRO 1perl"
+.TH PERLUNIINTRO 1perl 2024-02-11 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification. Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+perluniintro \- Perl Unicode introduction
+.SH DESCRIPTION
+.IX Header "DESCRIPTION"
+This document gives a general idea of Unicode and how to use Unicode
+in Perl. See "Further Resources" for references to more in-depth
+treatments of Unicode.
+.SS Unicode
+.IX Subsection "Unicode"
+Unicode is a character set standard which plans to codify all of the
+writing systems of the world, plus many other symbols.
+.PP
+Unicode and ISO/IEC 10646 are coordinated standards that unify
+almost all other modern character set standards,
+covering more than 80 writing systems and hundreds of languages,
+including all commercially-important modern languages. All characters
+in the largest Chinese, Japanese, and Korean dictionaries are also
+encoded. The standards will eventually cover almost all characters in
+more than 250 writing systems and thousands of languages.
+Unicode 1.0 was released in October 1991, and 6.0 in October 2010.
+.PP
+A Unicode \fIcharacter\fR is an abstract entity. It is not bound to any
+particular integer width, especially not to the C language \f(CW\*(C`char\*(C'\fR.
+Unicode is language-neutral and display-neutral: it does not encode the
+language of the text, and it does not generally define fonts or other graphical
+layout details. Unicode operates on characters and on text built from
+those characters.
+.PP
+Unicode defines characters like \f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR or \f(CW\*(C`GREEK
+SMALL LETTER ALPHA\*(C'\fR and unique numbers for the characters, in this
+case 0x0041 and 0x03B1, respectively. These unique numbers are called
+\&\fIcode points\fR. A code point is essentially the position of the
+character within the set of all possible Unicode characters, and thus in
+Perl, the term \fIordinal\fR is often used interchangeably with it.
+.PP
+The Unicode standard prefers using hexadecimal notation for the code
+points. If numbers like \f(CW0x0041\fR are unfamiliar to you, take a peek
+at a later section, "Hexadecimal Notation". The Unicode standard
+uses the notation \f(CW\*(C`U+0041 LATIN CAPITAL LETTER A\*(C'\fR, to give the
+hexadecimal code point and the normative name of the character.
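+.PP
+For example, \f(CWord()\fR and \f(CWchr()\fR convert between characters and
+code points, and \f(CWsprintf()\fR or \f(CWprintf()\fR with a \f(CW\*(C`%04X\*(C'\fR format
+produces the conventional hexadecimal form (output shown for an ASCII
+platform):
+.PP
+.Vb 2
+\& printf "U+%04X\en", ord("A"); # prints U+0041
+\& my $alpha = chr(0x03B1);     # GREEK SMALL LETTER ALPHA
+.Ve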
+.PP
+Unicode also defines various \fIproperties\fR for the characters, like
+"uppercase" or "lowercase", "decimal digit", or "punctuation";
+these properties are independent of the names of the characters.
+Furthermore, various operations on the characters like uppercasing,
+lowercasing, and collating (sorting) are defined.
+.PP
+A Unicode \fIlogical\fR "character" can actually consist of more than one internal
+\&\fIactual\fR "character" or code point. For Western languages, this is adequately
+modelled by a \fIbase character\fR (like \f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR) followed
+by one or more \fImodifiers\fR (like \f(CW\*(C`COMBINING ACUTE ACCENT\*(C'\fR). This sequence of
+base character and modifiers is called a \fIcombining character
+sequence\fR. Some non-western languages require more complicated
+models, so Unicode created the \fIgrapheme cluster\fR concept, which was
+later further refined into the \fIextended grapheme cluster\fR. For
+example, a Korean Hangul syllable is considered a single logical
+character, but most often consists of three actual
+Unicode characters: a leading consonant followed by an interior vowel followed
+by a trailing consonant.
+.PP
+Whether to call these extended grapheme clusters "characters" depends on your
+point of view. If you are a programmer, you probably would tend towards seeing
+each element in the sequences as one unit, or "character". However from
+the user's point of view, the whole sequence could be seen as one
+"character" since that's probably what it looks like in the context of the
+user's language. In this document, we take the programmer's point of
+view: one "character" is one Unicode code point.
+.PP
+For some combinations of base character and modifiers, there are
+\&\fIprecomposed\fR characters. There is a single character equivalent, for
+example, for the sequence \f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR followed by
+\&\f(CW\*(C`COMBINING ACUTE ACCENT\*(C'\fR. It is called \f(CW\*(C`LATIN CAPITAL LETTER A WITH
+ACUTE\*(C'\fR. These precomposed characters are, however, only available for
+some combinations, and are mainly meant to support round-trip
+conversions between Unicode and legacy standards (like ISO 8859). Using
+sequences, as Unicode does, allows fewer basic building blocks
+(code points) to express many more potential grapheme clusters. To
+support conversion between equivalent forms, various \fInormalization
+forms\fR are also defined. Thus, \f(CW\*(C`LATIN CAPITAL LETTER A WITH ACUTE\*(C'\fR is
+in \fINormalization Form Composed\fR, (abbreviated NFC), and the sequence
+\&\f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR followed by \f(CW\*(C`COMBINING ACUTE ACCENT\*(C'\fR
+represents the same character in \fINormalization Form Decomposed\fR (NFD).
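+.PP
+The core Unicode::Normalize module converts between these forms. For
+example (a sketch; both strings below are canonically equivalent):
+.PP
+.Vb 4
+\& use Unicode::Normalize qw(NFC NFD);
+\& my $nfd = NFD("\eN{LATIN CAPITAL LETTER A WITH ACUTE}");
+\& # $nfd is "A" followed by COMBINING ACUTE ACCENT
+\& my $nfc = NFC($nfd); # the single precomposed character again
+.Ve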
+.PP
+Because of backward compatibility with legacy encodings, the "a unique
+number for every character" idea breaks down a bit: instead, there is
+"at least one number for every character". The same character could
+be represented differently in several legacy encodings. The
+converse is not true: some code points do not have an assigned
+character. Firstly, there are unallocated code points within
+otherwise used blocks. Secondly, there are special Unicode control
+characters that do not represent true characters.
+.PP
+When Unicode was first conceived, it was thought that all the world's
+characters could be represented using a 16\-bit word; that is, a maximum of
+\&\f(CW0x10000\fR (or 65,536) characters would be needed, from \f(CW0x0000\fR to
+\&\f(CW0xFFFF\fR. This soon proved to be wrong, and since Unicode 2.0 (July
+1996), Unicode has been defined all the way up to 21 bits (\f(CW0x10FFFF\fR),
+and Unicode 3.1 (March 2001) defined the first characters above \f(CW0xFFFF\fR.
+The first \f(CW0x10000\fR characters are called the \fIPlane 0\fR, or the
+\&\fIBasic Multilingual Plane\fR (BMP). With Unicode 3.1, 17 (yes,
+seventeen) planes in all were defined\-\-but they are nowhere near full of
+defined characters, yet.
+.PP
+When a new language is being encoded, Unicode generally will choose a
+\&\f(CW\*(C`block\*(C'\fR of consecutive unallocated code points for its characters. So
+far, the number of code points in these blocks has always been evenly
+divisible by 16. Extras in a block, not currently needed, are left
+unallocated, for future growth. But there have been occasions when
+a later release needed more code points than the available extras, and a
+new block had to be allocated somewhere else, not contiguous to the initial
+one, to handle the overflow. Thus, it became apparent early on that
+"block" wasn't an adequate organizing principle, and so the \f(CW\*(C`Script\*(C'\fR
+property was created. (Later an improved script property was added as
+well, the \f(CW\*(C`Script_Extensions\*(C'\fR property.) Those code points that are in
+overflow blocks can still
+have the same script as the original ones. The script concept fits more
+closely with natural language: there is \f(CW\*(C`Latin\*(C'\fR script, \f(CW\*(C`Greek\*(C'\fR
+script, and so on; and there are several artificial scripts, like
+\&\f(CW\*(C`Common\*(C'\fR for characters that are used in multiple scripts, such as
+mathematical symbols. Scripts usually span varied parts of several
+blocks. For more information about scripts, see "Scripts" in perlunicode.
+The division into blocks exists, but it is almost completely
+accidental\-\-an artifact of how the characters have been and still are
+allocated. (Note that this paragraph has oversimplified things for the
+sake of this being an introduction. Unicode doesn't really encode
+languages, but the writing systems for them\-\-their scripts; and one
+script can be used by many languages. Unicode also encodes things that
+aren't really about languages, such as symbols like \f(CW\*(C`BAGGAGE CLAIM\*(C'\fR.)
+.PP
+The Unicode code points are just abstract numbers. To input and
+output these abstract numbers, the numbers must be \fIencoded\fR or
+\&\fIserialised\fR somehow. Unicode defines several \fIcharacter encoding
+forms\fR, of which \fIUTF\-8\fR is the most popular. UTF\-8 is a
+variable length encoding that encodes Unicode characters as 1 to 4
+bytes. Other encodings
+include UTF\-16 and UTF\-32 and their big\- and little-endian variants
+(UTF\-8 is byte-order independent). The ISO/IEC 10646 defines the UCS\-2
+and UCS\-4 encoding forms.
+.PP
+For more information about encodings\-\-for instance, to learn what
+\&\fIsurrogates\fR and \fIbyte order marks\fR (BOMs) are\-\-see perlunicode.
+.SS "Perl's Unicode Support"
+.IX Subsection "Perl's Unicode Support"
+Starting from Perl v5.6.0, Perl has had the capacity to handle Unicode
+natively. Perl v5.8.0, however, is the first recommended release for
+serious Unicode work. The maintenance release 5.6.1 fixed many of the
+problems of the initial Unicode implementation, but for example
+regular expressions still do not work with Unicode in 5.6.1.
+Perl v5.14.0 is the first release where Unicode support is
+(almost) seamlessly integrable without gotchas. (There are a few
+exceptions. Firstly, some differences in quotemeta
+were fixed starting in Perl 5.16.0. Secondly, some differences in
+the range operator were fixed starting in
+Perl 5.26.0. Thirdly, some differences in split were fixed
+starting in Perl 5.28.0.)
+.PP
+To enable this
+seamless support, you should \f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR (which is
+automatically selected if you \f(CW\*(C`use v5.12\*(C'\fR or higher). See feature.
+(5.14 also fixes a number of bugs and departures from the Unicode
+standard.)
+.PP
+Before Perl v5.8.0, \f(CW\*(C`use utf8\*(C'\fR was used to declare
+that operations in the current block or file would be Unicode-aware.
+This model was found to be wrong, or at least clumsy: the "Unicodeness"
+is now carried with the data, instead of being attached to the
+operations.
+Starting with Perl v5.8.0, only one case remains where an explicit \f(CW\*(C`use
+utf8\*(C'\fR is needed: if your Perl script itself is encoded in UTF\-8, you can
+use UTF\-8 in your identifier names, and in string and regular expression
+literals, by saying \f(CW\*(C`use utf8\*(C'\fR. This is not the default because
+scripts with legacy 8\-bit data in them would break. See utf8.
+.SS "Perl's Unicode Model"
+.IX Subsection "Perl's Unicode Model"
+Perl supports both pre\-5.6 strings of eight-bit native bytes, and
+strings of Unicode characters. The general principle is that Perl tries
+to keep its data as eight-bit bytes for as long as possible, but as soon
+as Unicodeness cannot be avoided, the data is transparently upgraded
+to Unicode. Prior to Perl v5.14.0, the upgrade was not completely
+transparent (see "The "Unicode Bug"" in perlunicode), and for backwards
+compatibility, full transparency is not gained unless \f(CWuse feature
+\&\*(Aqunicode_strings\*(Aq\fR (see feature) or \f(CW\*(C`use v5.12\*(C'\fR (or higher) is
+selected.
+.PP
+Internally, Perl currently uses either the native eight-bit
+character set of the platform (for example Latin\-1) or
+UTF\-8 to encode Unicode strings. Specifically, if all code points in
+the string are \f(CW0xFF\fR or less, Perl uses the native eight-bit
+character set. Otherwise, it uses UTF\-8.
+.PP
+A user of Perl does not normally need to know nor care how Perl
+happens to encode its internal strings, but it becomes relevant when
+outputting Unicode strings to a stream without a PerlIO layer (one with
+the "default" encoding). In such a case, the raw bytes used internally
+(the native character set or UTF\-8, as appropriate for each string)
+will be used, and a "Wide character" warning will be issued if those
+strings contain a character beyond 0x00FF.
+.PP
+For example,
+.PP
+.Vb 1
+\& perl \-e \*(Aqprint "\ex{DF}\en", "\ex{0100}\ex{DF}\en"\*(Aq
+.Ve
+.PP
+produces a fairly useless mixture of native bytes and UTF\-8, as well
+as a warning:
+.PP
+.Vb 1
+\& Wide character in print at ...
+.Ve
+.PP
+To output UTF\-8, use the \f(CW\*(C`:encoding\*(C'\fR or \f(CW\*(C`:utf8\*(C'\fR output layer. Prepending
+.PP
+.Vb 1
+\& binmode(STDOUT, ":utf8");
+.Ve
+.PP
+to this sample program ensures that the output is completely UTF\-8,
+and removes the program's warning.
+.PP
+You can enable automatic UTF\-8\-ification of your standard file
+handles, default \f(CWopen()\fR layer, and \f(CW@ARGV\fR by using either
+the \f(CW\*(C`\-C\*(C'\fR command line switch or the \f(CW\*(C`PERL_UNICODE\*(C'\fR environment
+variable, see perlrun for the
+documentation of the \f(CW\*(C`\-C\*(C'\fR switch.
+.PP
+Note that this means that Perl expects other software to work the same
+way:
+if Perl has been led to believe that STDIN should be UTF\-8, but then
+STDIN coming in from another command is not UTF\-8, Perl will likely
+complain about the malformed UTF\-8.
+.PP
+All features that combine Unicode and I/O also require using the new
+PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:
+you can see whether yours does by running "perl \-V" and looking for
+\&\f(CW\*(C`useperlio=define\*(C'\fR.
+.SS "Unicode and EBCDIC"
+.IX Subsection "Unicode and EBCDIC"
+Perl 5.8.0 added support for Unicode on EBCDIC platforms. This support
+was allowed to lapse in later releases, but was revived in 5.22.
+Unicode support is somewhat more complex to implement since additional
+conversions are needed. See perlebcdic for more information.
+.PP
+On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
+instead of UTF\-8. The difference is that UTF\-8 is "ASCII-safe", in
+that ASCII characters encode to UTF\-8 as-is, while UTF-EBCDIC is
+"EBCDIC-safe", in that all the basic characters (which includes all
+those that have ASCII equivalents, like \f(CW"A"\fR, \f(CW"0"\fR, \f(CW"%"\fR, \fIetc.\fR)
+are the same in both EBCDIC and UTF-EBCDIC. Often, documentation
+will use the term "UTF\-8" to mean UTF-EBCDIC as well. This is the case
+in this document.
+.SS "Creating Unicode"
+.IX Subsection "Creating Unicode"
+This section applies fully to Perls starting with v5.22. Various
+caveats for earlier releases are in the "Earlier releases caveats"
+subsection below.
+.PP
+To create Unicode characters in literals,
+use the \f(CW\*(C`\eN{...}\*(C'\fR notation in double-quoted strings:
+.PP
+.Vb 2
+\& my $smiley_from_name = "\eN{WHITE SMILING FACE}";
+\& my $smiley_from_code_point = "\eN{U+263a}";
+.Ve
+.PP
+Similarly, they can be used in regular expression literals
+.PP
+.Vb 2
+\& $smiley =~ /\eN{WHITE SMILING FACE}/;
+\& $smiley =~ /\eN{U+263a}/;
+.Ve
+.PP
+or, starting in v5.32:
+.PP
+.Vb 2
+\& $smiley =~ /\ep{Name=WHITE SMILING FACE}/;
+\& $smiley =~ /\ep{Name=whitesmilingface}/;
+.Ve
+.PP
+At run-time you can use:
+.PP
+.Vb 4
+\& use charnames ();
+\& my $hebrew_alef_from_name
+\& = charnames::string_vianame("HEBREW LETTER ALEF");
+\& my $hebrew_alef_from_code_point = charnames::string_vianame("U+05D0");
+.Ve
+.PP
+Naturally, \f(CWord()\fR will do the reverse: it turns a character into
+a code point.
+.PP
+There are other runtime options as well. You can use \f(CWpack()\fR:
+.PP
+.Vb 1
+\& my $hebrew_alef_from_code_point = pack("U", 0x05d0);
+.Ve
+.PP
+Or you can use \f(CWchr()\fR, though it is less convenient in the general
+case:
+.PP
+.Vb 2
+\& $hebrew_alef_from_code_point = chr(utf8::unicode_to_native(0x05d0));
+\& utf8::upgrade($hebrew_alef_from_code_point);
+.Ve
+.PP
+The \f(CWutf8::unicode_to_native()\fR and \f(CWutf8::upgrade()\fR aren't needed if
+the argument is above 0xFF, so the above could have been written as
+.PP
+.Vb 1
+\& $hebrew_alef_from_code_point = chr(0x05d0);
+.Ve
+.PP
+since 0x5d0 is above 255.
+.PP
+\&\f(CW\*(C`\ex{}\*(C'\fR and \f(CW\*(C`\eo{}\*(C'\fR can also be used to specify code points at compile
+time in double-quotish strings, but, for backward compatibility with
+older Perls, the same rules apply as with \f(CWchr()\fR for code points less
+than 256.
+.PP
+\&\f(CWutf8::unicode_to_native()\fR is used so that the Perl code is portable
+to EBCDIC platforms. You can omit it if you're \fIreally\fR sure no one
+will ever want to use your code on a non-ASCII platform. Starting in
+Perl v5.22, calls to it on ASCII platforms are optimized out, so there's
+no performance penalty at all in adding it. Or you can simply use the
+other constructs that don't require it.
+.PP
+See "Further Resources" for how to find all these names and numeric
+codes.
+.PP
+\fIEarlier releases caveats\fR
+.IX Subsection "Earlier releases caveats"
+.PP
+On EBCDIC platforms, prior to v5.22, using \f(CW\*(C`\eN{U+...}\*(C'\fR doesn't work
+properly.
+.PP
+Prior to v5.16, using \f(CW\*(C`\eN{...}\*(C'\fR with a character name (as opposed to a
+\&\f(CW\*(C`U+...\*(C'\fR code point) required a \f(CW\*(C`use\ charnames\ :full\*(C'\fR.
+.PP
+Prior to v5.14, there were some bugs in \f(CW\*(C`\eN{...}\*(C'\fR with a character name
+(as opposed to a \f(CW\*(C`U+...\*(C'\fR code point).
+.PP
+\&\f(CWcharnames::string_vianame()\fR was introduced in v5.14. Prior to that,
+\&\f(CWcharnames::vianame()\fR should work, but only if the argument is of the
+form \f(CW"U+..."\fR. Your best bet there for runtime Unicode by character
+name is probably:
+.PP
+.Vb 3
+\& use charnames ();
+\& my $hebrew_alef_from_name
+\& = pack("U", charnames::vianame("HEBREW LETTER ALEF"));
+.Ve
+.SS "Handling Unicode"
+.IX Subsection "Handling Unicode"
+Handling Unicode is for the most part transparent: just use the
+strings as usual. Functions like \f(CWindex()\fR, \f(CWlength()\fR, and
+\&\f(CWsubstr()\fR will work on the Unicode characters; regular expressions
+will work on the Unicode characters (see perlunicode and perlretut).
+.PP
+Note that Perl considers grapheme clusters to be separate characters, so for
+example
+.PP
+.Vb 2
+\& print length("\eN{LATIN CAPITAL LETTER A}\eN{COMBINING ACUTE ACCENT}"),
+\& "\en";
+.Ve
+.PP
+will print 2, not 1. The only exception is that regular expressions
+have \f(CW\*(C`\eX\*(C'\fR for matching an extended grapheme cluster. (Thus \f(CW\*(C`\eX\*(C'\fR in a
+regular expression would match the entire sequence of both the example
+characters.)
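+.PP
+For example,
+.PP
+.Vb 3
+\& my $s = "A\eN{COMBINING ACUTE ACCENT}";
+\& print length($s), "\en"; # prints 2: two code points
+\& print $s =~ /^\eX$/ ? 1 : 0, "\en"; # prints 1: one grapheme cluster
+.Ve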
+.PP
+Life is not quite so transparent, however, when working with legacy
+encodings, I/O, and certain special cases:
+.SS "Legacy Encodings"
+.IX Subsection "Legacy Encodings"
+When you combine legacy data and Unicode, the legacy data needs
+to be upgraded to Unicode. Normally the legacy data is assumed to be
+ISO 8859\-1 (or EBCDIC, if applicable).
+.PP
+The \f(CW\*(C`Encode\*(C'\fR module knows about many encodings and has interfaces
+for doing conversions between those encodings:
+.PP
+.Vb 2
+\& use Encode \*(Aqdecode\*(Aq;
+\& $data = decode("iso\-8859\-3", $data); # convert from legacy
+.Ve
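+.PP
+Going the other way, \f(CW\*(C`encode\*(C'\fR converts a string of Perl
+characters back into bytes in the encoding you name:
+.PP
+.Vb 2
+\& use Encode \*(Aqencode\*(Aq;
+\& $bytes = encode("iso\-8859\-3", $data); # convert to legacy bytes
+.Ve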
+.SS "Unicode I/O"
+.IX Subsection "Unicode I/O"
+Normally, writing out Unicode data
+.PP
+.Vb 1
+\& print FH $some_string_with_unicode, "\en";
+.Ve
+.PP
+produces raw bytes that Perl happens to use to internally encode the
+Unicode string. Perl's internal encoding depends on the system as
+well as what characters happen to be in the string at the time. If
+any of the characters are at code points \f(CW0x100\fR or above, you will get
+a warning. To ensure that the output is explicitly rendered in the
+encoding you desire\-\-and to avoid the warning\-\-open the stream with
+the desired encoding. Some examples:
+.PP
+.Vb 1
+\& open FH, ">:utf8", "file";
+\&
+\& open FH, ">:encoding(ucs2)", "file";
+\& open FH, ">:encoding(UTF\-8)", "file";
+\& open FH, ">:encoding(shift_jis)", "file";
+.Ve
+.PP
+and on already open streams, use \f(CWbinmode()\fR:
+.PP
+.Vb 1
+\& binmode(STDOUT, ":utf8");
+\&
+\& binmode(STDOUT, ":encoding(ucs2)");
+\& binmode(STDOUT, ":encoding(UTF\-8)");
+\& binmode(STDOUT, ":encoding(shift_jis)");
+.Ve
+.PP
+The matching of encoding names is loose: case does not matter, and
+many encodings have several aliases. Note that the \f(CW\*(C`:utf8\*(C'\fR layer
+must always be specified exactly like that; it is \fInot\fR subject to
+the loose matching of encoding names. Also note that currently \f(CW\*(C`:utf8\*(C'\fR is unsafe for
+input, because it accepts the data without validating that it is indeed valid
+UTF\-8; you should instead use \f(CW:encoding(UTF\-8)\fR (with or without a
+hyphen).
+.PP
+See PerlIO for the \f(CW\*(C`:utf8\*(C'\fR layer, PerlIO::encoding and
+Encode::PerlIO for the \f(CW:encoding()\fR layer, and
+Encode::Supported for many encodings supported by the \f(CW\*(C`Encode\*(C'\fR
+module.
+.PP
+Reading in a file that you know happens to be encoded in one of the
+Unicode or legacy encodings does not magically turn the data into
+Unicode in Perl's eyes. To do that, specify the appropriate
+layer when opening files
+.PP
+.Vb 2
+\& open(my $fh,\*(Aq<:encoding(UTF\-8)\*(Aq, \*(Aqanything\*(Aq);
+\& my $line_of_unicode = <$fh>;
+\&
+\& open(my $fh,\*(Aq<:encoding(Big5)\*(Aq, \*(Aqanything\*(Aq);
+\& my $line_of_unicode = <$fh>;
+.Ve
+.PP
+The I/O layers can also be specified more flexibly with
+the \f(CW\*(C`open\*(C'\fR pragma. See open, or look at the following example.
+.PP
+.Vb 8
+\& use open \*(Aq:encoding(UTF\-8)\*(Aq; # input/output default encoding will be
+\& # UTF\-8
+\& open X, ">file";
+\& print X chr(0x100), "\en";
+\& close X;
+\& open Y, "<file";
+\& printf "%#x\en", ord(<Y>); # this should print 0x100
+\& close Y;
+.Ve
+.PP
+With the \f(CW\*(C`open\*(C'\fR pragma you can use the \f(CW\*(C`:locale\*(C'\fR layer
+.PP
+.Vb 10
+\& BEGIN { $ENV{LC_ALL} = $ENV{LANG} = \*(Aqru_RU.KOI8\-R\*(Aq }
+\& # the :locale will probe the locale environment variables like
+\& # LC_ALL
+\& use open OUT => \*(Aq:locale\*(Aq; # russki parusski
+\& open(O, ">koi8");
+\& print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8\-R 0xc1
+\& close O;
+\& open(I, "<koi8");
+\& printf "%#x\en", ord(<I>); # this should print 0xc1
+\& close I;
+.Ve
+.PP
+These methods install a transparent filter on the I/O stream that
+converts data from the specified encoding when it is read in from the
+stream. The result is always Unicode.
+.PP
+The open pragma affects all the \f(CWopen()\fR calls after the pragma by
+setting default layers. If you want to affect only certain
+streams, use explicit layers directly in the \f(CWopen()\fR call.
+.PP
+You can switch encodings on an already opened stream by using
+\&\f(CWbinmode()\fR; see "binmode" in perlfunc.
+.PP
+The \f(CW\*(C`:locale\*(C'\fR does not currently work with
+\&\f(CWopen()\fR and \f(CWbinmode()\fR, only with the \f(CW\*(C`open\*(C'\fR pragma. The
+\&\f(CW\*(C`:utf8\*(C'\fR and \f(CW:encoding(...)\fR methods do work with all of \f(CWopen()\fR,
+\&\f(CWbinmode()\fR, and the \f(CW\*(C`open\*(C'\fR pragma.
+.PP
+Similarly, you may use these I/O layers on output streams to
+automatically convert Unicode to the specified encoding when it is
+written to the stream. For example, the following snippet copies the
+contents of the file "text.jis" (encoded as ISO\-2022\-JP, aka JIS) to
+the file "text.utf8", encoded as UTF\-8:
+.PP
+.Vb 3
+\& open(my $nihongo, \*(Aq<:encoding(iso\-2022\-jp)\*(Aq, \*(Aqtext.jis\*(Aq);
+\& open(my $unicode, \*(Aq>:utf8\*(Aq, \*(Aqtext.utf8\*(Aq);
+\& while (<$nihongo>) { print $unicode $_ }
+.Ve
+.PP
+The naming of encodings, both by \f(CWopen()\fR and by the \f(CW\*(C`open\*(C'\fR
+pragma, allows for flexible names: \f(CW\*(C`koi8\-r\*(C'\fR and \f(CW\*(C`KOI8R\*(C'\fR will both be
+understood.
+.PP
+Common encodings recognized by ISO, MIME, IANA, and various other
+standardisation organisations are recognised; for a more detailed
+list see Encode::Supported.
+.PP
+\&\f(CWread()\fR reads characters and returns the number of characters.
+\&\f(CWseek()\fR and \f(CWtell()\fR operate on byte counts, as does \f(CWsysseek()\fR.
+.PP
+\&\f(CWsysread()\fR and \f(CWsyswrite()\fR should not be used on file handles with
+character encoding layers: they behave badly, and that behaviour has
+been deprecated since perl 5.24.
+.PP
+Notice that because of the default behaviour of not doing any
+conversion upon input if there is no default layer,
+it is easy to mistakenly write code that keeps on expanding a file
+by repeatedly encoding the data:
+.PP
+.Vb 8
+\& # BAD CODE WARNING
+\& open F, "file";
+\& local $/; ## read in the whole file of 8\-bit characters
+\& $t = <F>;
+\& close F;
+\& open F, ">:encoding(UTF\-8)", "file";
+\& print F $t; ## convert to UTF\-8 on output
+\& close F;
+.Ve
+.PP
+If you run this code twice, the contents of the \fIfile\fR will be UTF\-8
+encoded twice. A \f(CW\*(C`use open \*(Aq:encoding(UTF\-8)\*(Aq\*(C'\fR would have avoided
+the bug, as would explicitly opening the \fIfile\fR for input as UTF\-8.
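+.PP
+One possible fix (a sketch) is to read the file back in as UTF\-8
+explicitly, so the data is decoded on input and encoded exactly once
+on output:
+.PP
+.Vb 7
+\& open F, "<:encoding(UTF\-8)", "file";
+\& local $/; ## read in the whole file as characters
+\& $t = <F>;
+\& close F;
+\& open F, ">:encoding(UTF\-8)", "file";
+\& print F $t; ## re-encoded exactly once
+\& close F;
+.Ve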
+.PP
+\&\fBNOTE\fR: the \f(CW\*(C`:utf8\*(C'\fR and \f(CW\*(C`:encoding\*(C'\fR features work only if your
+Perl has been built with PerlIO, which is the default
+on most systems.
+.SS "Displaying Unicode As Text"
+.IX Subsection "Displaying Unicode As Text"
+Sometimes you might want to display Perl scalars containing Unicode as
+simple ASCII (or EBCDIC) text. The following subroutine converts
+its argument so that Unicode characters with code points greater than
+255 are displayed as \f(CW\*(C`\ex{...}\*(C'\fR, control characters (like \f(CW\*(C`\en\*(C'\fR) are
+displayed as \f(CW\*(C`\ex..\*(C'\fR, and the rest of the characters as themselves:
+.PP
+.Vb 9
+\& sub nice_string {
+\& join("",
+\& map { $_ > 255 # if wide character...
+\& ? sprintf("\e\ex{%04X}", $_) # \ex{...}
+\& : chr($_) =~ /[[:cntrl:]]/ # else if control character...
+\& ? sprintf("\e\ex%02X", $_) # \ex..
+\& : quotemeta(chr($_)) # else quoted or as themselves
+\& } unpack("W*", $_[0])); # unpack Unicode characters
+\& }
+.Ve
+.PP
+For example,
+.PP
+.Vb 1
+\& nice_string("foo\ex{100}bar\en")
+.Ve
+.PP
+returns the string
+.PP
+.Vb 1
+\& \*(Aqfoo\ex{0100}bar\ex0A\*(Aq
+.Ve
+.PP
+which is ready to be printed.
+.PP
+(\f(CW\*(C`\e\ex{}\*(C'\fR is used here instead of \f(CW\*(C`\e\eN{}\*(C'\fR, since it's most likely that
+you want to see what the native values are.)
+.SS "Special Cases"
+.IX Subsection "Special Cases"
+.IP \(bu 4
+Starting in Perl 5.28, it is illegal for bit operators, like \f(CW\*(C`~\*(C'\fR, to
+operate on strings containing code points above 255.
+.IP \(bu 4
+The \fBvec()\fR function may produce surprising results if
+used on strings containing characters with ordinal values above
+255. In such a case, the results are consistent with the internal
+encoding of the characters, but not with much else. So don't do
+that, and starting in Perl 5.28, a deprecation message is issued if you
+do so, becoming illegal in Perl 5.32.
+.IP \(bu 4
+Peeking At Perl's Internal Encoding
+.Sp
+Normal users of Perl should never care how Perl encodes any particular
+Unicode string (because the normal ways to get at the contents of a
+string with Unicode\-\-via input and output\-\-should always be via
+explicitly-defined I/O layers). But if you must, there are two
+ways of looking behind the scenes.
+.Sp
+One way of peeking inside the internal encoding of Unicode characters
+is to use \f(CW\*(C`unpack("C*", ...)\*(C'\fR to get the bytes of whatever the string
+encoding happens to be, or \f(CW\*(C`unpack("U0..", ...)\*(C'\fR to get the bytes of the
+UTF\-8 encoding:
+.Sp
+.Vb 2
+\& # this prints c4 80 for the UTF\-8 bytes 0xc4 0x80
+\& print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\en";
+.Ve
+.Sp
+Yet another way would be to use the Devel::Peek module:
+.Sp
+.Vb 1
+\& perl \-MDevel::Peek \-e \*(AqDump(chr(0x100))\*(Aq
+.Ve
+.Sp
+That shows the \f(CW\*(C`UTF8\*(C'\fR flag in FLAGS and both the UTF\-8 bytes
+and Unicode characters in \f(CW\*(C`PV\*(C'\fR. See also later in this document
+the discussion about the \f(CWutf8::is_utf8()\fR function.
+.SS "Advanced Topics"
+.IX Subsection "Advanced Topics"
+.IP \(bu 4
+String Equivalence
+.Sp
+The question of string equivalence turns somewhat complicated
+in Unicode: what do you mean by "equal"?
+.Sp
+(Is \f(CW\*(C`LATIN CAPITAL LETTER A WITH ACUTE\*(C'\fR equal to
+\&\f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR?)
+.Sp
+The short answer is that by default Perl compares equivalence (\f(CW\*(C`eq\*(C'\fR,
+\&\f(CW\*(C`ne\*(C'\fR) based only on code points of the characters. In the above
+case, the answer is no (because 0x00C1 != 0x0041). But sometimes, any
+CAPITAL LETTER A's should be considered equal, or even A's of any case.
+.Sp
+The long answer is that you need to consider character normalization
+and casing issues: see Unicode::Normalize, Unicode Technical Report #15,
+Unicode Normalization Forms <https://www.unicode.org/reports/tr15> and
+sections on case mapping in the Unicode Standard <https://www.unicode.org>.
+.Sp
+As of Perl 5.8.0, the "Full" case-folding of \fICase
+Mappings/SpecialCasing\fR is implemented, but bugs remain in \f(CW\*(C`qr//i\*(C'\fR with them,
+mostly fixed by 5.14, and essentially entirely by 5.18.
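+.Sp
+Starting in v5.16, the \f(CW\*(C`fc\*(C'\fR feature exposes Unicode case folding
+directly. For example, full case folding considers LATIN SMALL LETTER
+SHARP S equivalent to \f(CW"ss"\fR:
+.Sp
+.Vb 2
+\& use feature \*(Aqfc\*(Aq;
+\& print fc("\ex{DF}") eq fc("ss") ? "equal\en" : "different\en"; # equal
+.Ve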
+.IP \(bu 4
+String Collation
+.Sp
+People like to see their strings nicely sorted\-\-or as Unicode
+parlance goes, collated. But again, what do you mean by collate?
+.Sp
+(Does \f(CW\*(C`LATIN CAPITAL LETTER A WITH ACUTE\*(C'\fR come before or after
+\&\f(CW\*(C`LATIN CAPITAL LETTER A WITH GRAVE\*(C'\fR?)
+.Sp
+The short answer is that by default, Perl compares strings (\f(CW\*(C`lt\*(C'\fR,
+\&\f(CW\*(C`le\*(C'\fR, \f(CW\*(C`cmp\*(C'\fR, \f(CW\*(C`ge\*(C'\fR, \f(CW\*(C`gt\*(C'\fR) based only on the code points of the
+characters. In the above case, the answer is "after", since
+\&\f(CW0x00C1\fR > \f(CW0x00C0\fR.
+.Sp
+The long answer is that "it depends", and a good answer cannot be
+given without knowing (at the very least) the language context.
+See Unicode::Collate, and \fIUnicode Collation Algorithm\fR
+<https://www.unicode.org/reports/tr10/>.
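+.Sp
+As a sketch, even the default (locale-independent) Unicode Collation
+Algorithm of Unicode::Collate sorts accented words more sensibly than
+raw code point comparison does:
+.Sp
+.Vb 5
+\& use Unicode::Collate;
+\& my $col = Unicode::Collate\->new;
+\& my @words = ("zebra", "\ex{E9}tude");        # "étude"
+\& print join(" ", sort @words), "\en";         # zebra first: 0x7A < 0xE9
+\& print join(" ", $col\->sort(@words)), "\en";  # étude sorts before zebra
+.Ve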
+.SS Miscellaneous
+.IX Subsection "Miscellaneous"
+.IP \(bu 4
+Character Ranges and Classes
+.Sp
+Character ranges in regular expression bracketed character classes (e.g.,
+\&\f(CW\*(C`/[a\-z]/\*(C'\fR) and in the \f(CW\*(C`tr///\*(C'\fR (also known as \f(CW\*(C`y///\*(C'\fR) operator are not
+magically Unicode-aware. What this means is that \f(CW\*(C`[A\-Za\-z]\*(C'\fR will not
+magically start to mean "all alphabetic letters" (not that it does mean that
+even for 8\-bit characters; for those, if you are using locales (perllocale),
+use \f(CW\*(C`/[[:alpha:]]/\*(C'\fR; and if not, use the 8\-bit\-aware property \f(CW\*(C`\ep{alpha}\*(C'\fR).
+.Sp
+All the properties that begin with \f(CW\*(C`\ep\*(C'\fR (and its inverse \f(CW\*(C`\eP\*(C'\fR) are actually
+character classes that are Unicode-aware. There are dozens of them, see
+perluniprops.
+.Sp
+Starting in v5.22, you can use Unicode code points as the end points of
+regular expression pattern character ranges, and the range will include
+all Unicode code points that lie between those end points, inclusive.
+.Sp
+.Vb 1
+\& qr/ [ \eN{U+03} \- \eN{U+20} ] /xx
+.Ve
+.Sp
+includes the code points
+\&\f(CW\*(C`\eN{U+03}\*(C'\fR, \f(CW\*(C`\eN{U+04}\*(C'\fR, ..., \f(CW\*(C`\eN{U+20}\*(C'\fR.
+.Sp
+This also works for ranges in \f(CW\*(C`tr///\*(C'\fR starting in Perl v5.24.
+.IP \(bu 4
+String-To-Number Conversions
+.Sp
+Unicode does define several other decimal\-\-and numeric\-\-characters
+besides the familiar 0 to 9, such as the Arabic and Indic digits.
+Perl does not support string-to-number conversion for digits other
+than ASCII \f(CW0\fR to \f(CW9\fR (and ASCII \f(CW\*(C`a\*(C'\fR to \f(CW\*(C`f\*(C'\fR for hexadecimal).
+To get safe conversions from any Unicode string, use
+"\fBnum()\fR" in Unicode::UCD.
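+.Sp
+A sketch of \f(CWnum()\fR, which converts decimal digits from any single
+script and returns \f(CW\*(C`undef\*(C'\fR for strings it cannot safely convert:
+.Sp
+.Vb 4
+\& use Unicode::UCD \*(Aqnum\*(Aq;
+\& print num("123"), "\en";                      # 123
+\& print num("\ex{661}\ex{662}\ex{663}"), "\en"; # 123 (ARABIC\-INDIC digits)
+\& print defined num("1\ex{661}") ? "ok" : "undef", "\en"; # undef: mixed scripts
+.Ve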
+.SS "Questions With Answers"
+.IX Subsection "Questions With Answers"
+.IP \(bu 4
+Will My Old Scripts Break?
+.Sp
+Very probably not. Unless you are generating Unicode characters
+somehow, old behaviour should be preserved. About the only behaviour
+that has changed and which could start generating Unicode is the old
+behaviour of \f(CWchr()\fR where supplying an argument more than 255
+produced a character modulo 255. \f(CWchr(300)\fR, for example, was equal
+to \f(CWchr(45)\fR or "\-" (in ASCII), now it is LATIN CAPITAL LETTER I WITH
+BREVE.
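+.Sp
+You can check what \f(CWchr()\fR returns for such arguments nowadays:
+.Sp
+.Vb 2
+\& printf "%04X\en", ord(chr(300));  # 012C
+\& # i.e. LATIN CAPITAL LETTER I WITH BREVE, not chr(45)
+.Ve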
+.IP \(bu 4
+How Do I Make My Scripts Work With Unicode?
+.Sp
+Very little work should be needed since nothing changes until you
+generate Unicode data. The most important thing is getting input as
+Unicode; for that, see the earlier I/O discussion.
+To get full seamless Unicode support, add
+\&\f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR (or \f(CW\*(C`use v5.12\*(C'\fR or higher) to your
+script.
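+.Sp
+For example, \f(CW\*(C`unicode_strings\*(C'\fR makes case mapping of the code
+points 128 to 255 follow Unicode rules regardless of how the string is
+stored internally:
+.Sp
+.Vb 3
+\& use feature \*(Aqunicode_strings\*(Aq;
+\& my $s = "\exE9";               # LATIN SMALL LETTER E WITH ACUTE
+\& printf "%04X\en", ord(uc $s); # 00C9: uppercased to \exC9
+.Ve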
+.IP \(bu 4
+How Do I Know Whether My String Is In Unicode?
+.Sp
+You shouldn't have to care. But you may if your Perl is before 5.14.0
+or you haven't specified \f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR or \f(CWuse
+5.012\fR (or higher) because otherwise the rules for the code points
+in the range 128 to 255 are different depending on
+whether the string they are contained within is in Unicode or not.
+(See "When Unicode Does Not Happen" in perlunicode.)
+.Sp
+To determine if a string is in Unicode, use:
+.Sp
+.Vb 1
+\& print utf8::is_utf8($string) ? 1 : 0, "\en";
+.Ve
+.Sp
+But note that this doesn't mean that any of the characters in the
+string are necessarily UTF\-8 encoded, or that any of the characters have
+code points greater than 0xFF (255) or even 0x80 (128), or that the
+string has any characters at all. All the \f(CWis_utf8()\fR does is to
+return the value of the internal "utf8ness" flag attached to the
+\&\f(CW$string\fR. If the flag is off, the bytes in the scalar are interpreted
+as a single byte encoding. If the flag is on, the bytes in the scalar
+are interpreted as the (variable-length, potentially multi-byte) UTF\-8 encoded
+code points of the characters. Bytes added to a UTF\-8 encoded string are
+automatically upgraded to UTF\-8. If mixed non\-UTF\-8 and UTF\-8 scalars
+are merged (double-quoted interpolation, explicit concatenation, or
+printf/sprintf parameter substitution), the result will be UTF\-8 encoded
+as if copies of the byte strings were upgraded to UTF\-8: for example,
+.Sp
+.Vb 3
+\& $a = "ab\ex80c";
+\& $b = "\ex{100}";
+\& print "$a = $b\en";
+.Ve
+.Sp
+the output string will be UTF\-8\-encoded \f(CW\*(C`ab\ex80c = \ex{100}\en\*(C'\fR, but
+\&\f(CW$a\fR will stay byte-encoded.
+.Sp
+Sometimes you might really need to know the byte length of a string
+instead of the character length. For that use the \f(CW\*(C`bytes\*(C'\fR pragma
+and the \f(CWlength()\fR function:
+.Sp
+.Vb 6
+\& my $unicode = chr(0x100);
+\& print length($unicode), "\en"; # will print 1
+\& use bytes;
+\& print length($unicode), "\en"; # will print 2
+\& # (the 0xC4 0x80 of the UTF\-8)
+\& no bytes;
+.Ve
+.IP \(bu 4
+How Do I Find Out What Encoding a File Has?
+.Sp
+You might try Encode::Guess, but it has a number of limitations.
+.IP \(bu 4
+How Do I Detect Data That's Not Valid In a Particular Encoding?
+.Sp
+Use the \f(CW\*(C`Encode\*(C'\fR package to try converting it.
+For example,
+.Sp
+.Vb 1
+\& use Encode \*(Aqdecode\*(Aq;
+\&
+\& if (eval { decode(\*(AqUTF\-8\*(Aq, $string, Encode::FB_CROAK); 1 }) {
+\& # $string is valid UTF\-8
+\& } else {
+\& # $string is not valid UTF\-8
+\& }
+.Ve
+.Sp
+Or use \f(CW\*(C`unpack\*(C'\fR to try decoding it:
+.Sp
+.Vb 2
+\& use warnings;
+\& @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
+.Ve
+.Sp
+If invalid, a \f(CW\*(C`Malformed UTF\-8 character\*(C'\fR warning is produced. The "C0" means
+"process the string character per character". Without that, the
+\&\f(CW\*(C`unpack("U*", ...)\*(C'\fR would work in \f(CW\*(C`U0\*(C'\fR mode (the default if the format
+string starts with \f(CW\*(C`U\*(C'\fR) and it would return the bytes making up the UTF\-8
+encoding of the target string, something that will always work.
+.IP \(bu 4
+How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?
+.Sp
+This probably isn't as useful as you might think.
+Normally, you shouldn't need to.
+.Sp
+In one sense, what you are asking doesn't make much sense: encodings
+are for characters, and binary data are not "characters", so converting
+"data" into some encoding isn't meaningful unless you know in what
+character set and encoding the binary data is in, in which case it's
+not just binary data, now is it?
+.Sp
+If you have a raw sequence of bytes that you know should be
+interpreted via a particular encoding, you can use \f(CW\*(C`Encode\*(C'\fR:
+.Sp
+.Vb 2
+\& use Encode \*(Aqfrom_to\*(Aq;
+\& from_to($data, "iso\-8859\-1", "UTF\-8"); # from latin\-1 to UTF\-8
+.Ve
+.Sp
+The call to \f(CWfrom_to()\fR changes the bytes in \f(CW$data\fR, but nothing
+material about the nature of the string has changed as far as Perl is
+concerned. Both before and after the call, the string \f(CW$data\fR
+contains just a bunch of 8\-bit bytes. As far as Perl is concerned,
+the encoding of the string remains as "system-native 8\-bit bytes".
+.Sp
+You might relate this to a fictional 'Translate' module:
+.Sp
+.Vb 4
+\& use Translate;
+\& my $phrase = "Yes";
+\& Translate::from_to($phrase, \*(Aqenglish\*(Aq, \*(Aqdeutsch\*(Aq);
+\& ## phrase now contains "Ja"
+.Ve
+.Sp
+The contents of the string change, but not the nature of the string.
+Perl doesn't know any more after the call than before that the
+contents of the string indicate the affirmative.
+.Sp
+Back to converting data. If you have (or want) data in your system's
+native 8\-bit encoding (e.g. Latin\-1, EBCDIC, etc.), you can use
+pack/unpack to convert to/from Unicode.
+.Sp
+.Vb 2
+\& $native_string = pack("W*", unpack("U*", $Unicode_string));
+\& $Unicode_string = pack("U*", unpack("W*", $native_string));
+.Ve
+.Sp
+If you have a sequence of bytes you \fBknow\fR is valid UTF\-8,
+but Perl doesn't know it yet, you can make Perl a believer, too:
+.Sp
+.Vb 2
+\& $Unicode = $bytes;
+\& utf8::decode($Unicode);
+.Ve
+.Sp
+or:
+.Sp
+.Vb 1
+\& $Unicode = pack("U0a*", $bytes);
+.Ve
+.Sp
+You can find the bytes that make up a UTF\-8 sequence with
+.Sp
+.Vb 1
+\& @bytes = unpack("C*", $Unicode_string)
+.Ve
+.Sp
+and you can create well-formed Unicode with
+.Sp
+.Vb 1
+\& $Unicode_string = pack("U*", 0xff, ...)
+.Ve
+.IP \(bu 4
+How Do I Display Unicode? How Do I Input Unicode?
+.Sp
+See <http://www.alanwood.net/unicode/> and
+<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
+.IP \(bu 4
+How Does Unicode Work With Traditional Locales?
+.Sp
+If your locale is a UTF\-8 locale, starting in Perl v5.26, Perl works
+well for all categories; before this, starting with Perl v5.20, it works
+for all categories but \f(CW\*(C`LC_COLLATE\*(C'\fR, which deals with
+sorting and the \f(CW\*(C`cmp\*(C'\fR operator. But note that the standard
+\&\f(CW\*(C`Unicode::Collate\*(C'\fR and \f(CW\*(C`Unicode::Collate::Locale\*(C'\fR modules offer
+much more powerful solutions to collation issues, and work on earlier
+releases.
+.Sp
+For other locales, starting in Perl 5.16, you can specify
+.Sp
+.Vb 1
+\& use locale \*(Aq:not_characters\*(Aq;
+.Ve
+.Sp
+to get Perl to work well with them. The catch is that you
+have to translate from the locale character set to/from Unicode
+yourself. See "Unicode I/O" above for how to
+.Sp
+.Vb 1
+\& use open \*(Aq:locale\*(Aq;
+.Ve
+.Sp
+to accomplish this, but full details are in "Unicode and
+UTF\-8" in perllocale, including gotchas that happen if you don't specify
+\&\f(CW\*(C`:not_characters\*(C'\fR.
+.SS "Hexadecimal Notation"
+.IX Subsection "Hexadecimal Notation"
+The Unicode standard prefers using hexadecimal notation because
+that more clearly shows the division of Unicode into blocks of 256 characters.
+Hexadecimal is also simply shorter than decimal. You can use decimal
+notation, too, but learning to use hexadecimal just makes life easier
+with the Unicode standard. The \f(CW\*(C`U+HHHH\*(C'\fR notation uses hexadecimal,
+for example.
+.PP
+The \f(CW\*(C`0x\*(C'\fR prefix means a hexadecimal number, the digits are 0\-9 \fIand\fR
+a\-f (or A\-F, case doesn't matter). Each hexadecimal digit represents
+four bits, or half a byte. \f(CW\*(C`print 0x..., "\en"\*(C'\fR will show a
+hexadecimal number in decimal, and \f(CW\*(C`printf "%x\en", $decimal\*(C'\fR will
+show a decimal number in hexadecimal. If you have just the
+"hex digits" of a hexadecimal number, you can use the \f(CWhex()\fR function.
+.PP
+.Vb 6
+\& print 0x0009, "\en"; # 9
+\& print 0x000a, "\en"; # 10
+\& print 0x000f, "\en"; # 15
+\& print 0x0010, "\en"; # 16
+\& print 0x0011, "\en"; # 17
+\& print 0x0100, "\en"; # 256
+\&
+\& print 0x0041, "\en"; # 65
+\&
+\& printf "%x\en", 65; # 41
+\& printf "%#x\en", 65; # 0x41
+\&
+\& print hex("41"), "\en"; # 65
+.Ve
+.SS "Further Resources"
+.IX Subsection "Further Resources"
+.IP \(bu 4
+Unicode Consortium
+.Sp
+<https://www.unicode.org/>
+.IP \(bu 4
+Unicode FAQ
+.Sp
+<https://www.unicode.org/faq/>
+.IP \(bu 4
+Unicode Glossary
+.Sp
+<https://www.unicode.org/glossary/>
+.IP \(bu 4
+Unicode Recommended Reading List
+.Sp
+The Unicode Consortium has a list of articles and books, some of which
+give a much more in\-depth treatment of Unicode:
+<http://unicode.org/resources/readinglist.html>
+.IP \(bu 4
+Unicode Useful Resources
+.Sp
+<https://www.unicode.org/unicode/onlinedat/resources.html>
+.IP \(bu 4
+Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
+.Sp
+<http://www.alanwood.net/unicode/>
+.IP \(bu 4
+UTF\-8 and Unicode FAQ for Unix/Linux
+.Sp
+<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
+.IP \(bu 4
+Legacy Character Sets
+.Sp
+<http://www.czyborra.com/>
+<http://www.eki.ee/letter/>
+.IP \(bu 4
+You can explore various information from the Unicode data files using
+the \f(CW\*(C`Unicode::UCD\*(C'\fR module.
+.SH "UNICODE IN OLDER PERLS"
+.IX Header "UNICODE IN OLDER PERLS"
+If you cannot upgrade your Perl to 5.8.0 or later, you can still
+do some Unicode processing by using the modules \f(CW\*(C`Unicode::String\*(C'\fR,
+\&\f(CW\*(C`Unicode::Map8\*(C'\fR, and \f(CW\*(C`Unicode::Map\*(C'\fR, available from CPAN.
+If you have the GNU recode installed, you can also use the
+Perl front-end \f(CW\*(C`Convert::Recode\*(C'\fR for character conversions.
+.PP
+The following are fast conversions from ISO 8859\-1 (Latin\-1) bytes
+to UTF\-8 bytes and back; the code works even with older Perl 5 versions.
+.PP
+.Vb 2
+\& # ISO 8859\-1 to UTF\-8
+\& s/([\ex80\-\exFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
+\&
+\& # UTF\-8 to ISO 8859\-1
+\& s/([\exC2\exC3])([\ex80\-\exBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
+.Ve
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+perlunitut, perlunicode, Encode, open, utf8, bytes,
+perlretut, perlrun, Unicode::Collate, Unicode::Normalize,
+Unicode::UCD
+.SH ACKNOWLEDGMENTS
+.IX Header "ACKNOWLEDGMENTS"
+Thanks to the kind readers of the perl5\-porters@perl.org,
+perl\-unicode@perl.org, linux\-utf8@nl.linux.org, and unicore@unicode.org
+mailing lists for their valuable feedback.
+.SH "AUTHOR, COPYRIGHT, AND LICENSE"
+.IX Header "AUTHOR, COPYRIGHT, AND LICENSE"
+Copyright 2001\-2011 Jarkko Hietaniemi <jhi@iki.fi>.
+Now maintained by Perl 5 Porters.
+.PP
+This document may be distributed under the same terms as Perl itself.