author    Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-15 19:43:11 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-15 19:43:11 +0000
commit    fc22b3d6507c6745911b9dfcc68f1e665ae13dbc (patch)
tree      ce1e3bce06471410239a6f41282e328770aa404a /upstream/archlinux/man1/perluniintro.1perl
parent    Initial commit. (diff)
download  manpages-l10n-fc22b3d6507c6745911b9dfcc68f1e665ae13dbc.tar.xz, manpages-l10n-fc22b3d6507c6745911b9dfcc68f1e665ae13dbc.zip
Adding upstream version 4.22.0.upstream/4.22.0
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'upstream/archlinux/man1/perluniintro.1perl')
-rw-r--r-- | upstream/archlinux/man1/perluniintro.1perl | 1073 |
1 files changed, 1073 insertions, 0 deletions
diff --git a/upstream/archlinux/man1/perluniintro.1perl b/upstream/archlinux/man1/perluniintro.1perl new file mode 100644 index 00000000..c9727da7 --- /dev/null +++ b/upstream/archlinux/man1/perluniintro.1perl @@ -0,0 +1,1073 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLUNIINTRO 1perl" +.TH PERLUNIINTRO 1perl 2024-02-11 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. +.if n .ad l +.nh +.SH NAME +perluniintro \- Perl Unicode introduction +.SH DESCRIPTION +.IX Header "DESCRIPTION" +This document gives a general idea of Unicode and how to use Unicode +in Perl. 
See "Further Resources" for references to more in-depth +treatments of Unicode. +.SS Unicode +.IX Subsection "Unicode" +Unicode is a character set standard which plans to codify all of the +writing systems of the world, plus many other symbols. +.PP +Unicode and ISO/IEC 10646 are coordinated standards that unify +almost all other modern character set standards, +covering more than 80 writing systems and hundreds of languages, +including all commercially-important modern languages. All characters +in the largest Chinese, Japanese, and Korean dictionaries are also +encoded. The standards will eventually cover almost all characters in +more than 250 writing systems and thousands of languages. +Unicode 1.0 was released in October 1991, and 6.0 in October 2010. +.PP +A Unicode \fIcharacter\fR is an abstract entity. It is not bound to any +particular integer width, especially not to the C language \f(CW\*(C`char\*(C'\fR. +Unicode is language-neutral and display-neutral: it does not encode the +language of the text, and it does not generally define fonts or other graphical +layout details. Unicode operates on characters and on text built from +those characters. +.PP +Unicode defines characters like \f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR or \f(CW\*(C`GREEK +SMALL LETTER ALPHA\*(C'\fR and unique numbers for the characters, in this +case 0x0041 and 0x03B1, respectively. These unique numbers are called +\&\fIcode points\fR. A code point is essentially the position of the +character within the set of all possible Unicode characters, and thus in +Perl, the term \fIordinal\fR is often used interchangeably with it. +.PP +The Unicode standard prefers using hexadecimal notation for the code +points. If numbers like \f(CW0x0041\fR are unfamiliar to you, take a peek +at a later section, "Hexadecimal Notation". The Unicode standard +uses the notation \f(CW\*(C`U+0041 LATIN CAPITAL LETTER A\*(C'\fR, to give the +hexadecimal code point and the normative name of the character. 
+.PP +Unicode also defines various \fIproperties\fR for the characters, like +"uppercase" or "lowercase", "decimal digit", or "punctuation"; +these properties are independent of the names of the characters. +Furthermore, various operations on the characters like uppercasing, +lowercasing, and collating (sorting) are defined. +.PP +A Unicode \fIlogical\fR "character" can actually consist of more than one internal +\&\fIactual\fR "character" or code point. For Western languages, this is adequately +modelled by a \fIbase character\fR (like \f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR) followed +by one or more \fImodifiers\fR (like \f(CW\*(C`COMBINING ACUTE ACCENT\*(C'\fR). This sequence of +base character and modifiers is called a \fIcombining character +sequence\fR. Some non-western languages require more complicated +models, so Unicode created the \fIgrapheme cluster\fR concept, which was +later further refined into the \fIextended grapheme cluster\fR. For +example, a Korean Hangul syllable is considered a single logical +character, but most often consists of three actual +Unicode characters: a leading consonant followed by an interior vowel followed +by a trailing consonant. +.PP +Whether to call these extended grapheme clusters "characters" depends on your +point of view. If you are a programmer, you probably would tend towards seeing +each element in the sequences as one unit, or "character". However from +the user's point of view, the whole sequence could be seen as one +"character" since that's probably what it looks like in the context of the +user's language. In this document, we take the programmer's point of +view: one "character" is one Unicode code point. +.PP +For some combinations of base character and modifiers, there are +\&\fIprecomposed\fR characters. There is a single character equivalent, for +example, for the sequence \f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR followed by +\&\f(CW\*(C`COMBINING ACUTE ACCENT\*(C'\fR. 
It is called \f(CW\*(C`LATIN CAPITAL LETTER A WITH +ACUTE\*(C'\fR. These precomposed characters are, however, only available for +some combinations, and are mainly meant to support round-trip +conversions between Unicode and legacy standards (like ISO 8859). Using +sequences, as Unicode does, allows for needing fewer basic building blocks +(code points) to express many more potential grapheme clusters. To +support conversion between equivalent forms, various \fInormalization +forms\fR are also defined. Thus, \f(CW\*(C`LATIN CAPITAL LETTER A WITH ACUTE\*(C'\fR is +in \fINormalization Form Composed\fR, (abbreviated NFC), and the sequence +\&\f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR followed by \f(CW\*(C`COMBINING ACUTE ACCENT\*(C'\fR +represents the same character in \fINormalization Form Decomposed\fR (NFD). +.PP +Because of backward compatibility with legacy encodings, the "a unique +number for every character" idea breaks down a bit: instead, there is +"at least one number for every character". The same character could +be represented differently in several legacy encodings. The +converse is not true: some code points do not have an assigned +character. Firstly, there are unallocated code points within +otherwise used blocks. Secondly, there are special Unicode control +characters that do not represent true characters. +.PP +When Unicode was first conceived, it was thought that all the world's +characters could be represented using a 16\-bit word; that is a maximum of +\&\f(CW0x10000\fR (or 65,536) characters would be needed, from \f(CW0x0000\fR to +\&\f(CW0xFFFF\fR. This soon proved to be wrong, and since Unicode 2.0 (July +1996), Unicode has been defined all the way up to 21 bits (\f(CW0x10FFFF\fR), +and Unicode 3.1 (March 2001) defined the first characters above \f(CW0xFFFF\fR. +The first \f(CW0x10000\fR characters are called the \fIPlane 0\fR, or the +\&\fIBasic Multilingual Plane\fR (BMP). 
With Unicode 3.1, 17 (yes, +seventeen) planes in all were defined\-\-but they are nowhere near full of +defined characters, yet. +.PP +When a new language is being encoded, Unicode generally will choose a +\&\f(CW\*(C`block\*(C'\fR of consecutive unallocated code points for its characters. So +far, the number of code points in these blocks has always been evenly +divisible by 16. Extras in a block, not currently needed, are left +unallocated, for future growth. But there have been occasions when +a later release needed more code points than the available extras, and a +new block had to be allocated somewhere else, not contiguous to the initial +one, to handle the overflow. Thus, it became apparent early on that +"block" wasn't an adequate organizing principle, and so the \f(CW\*(C`Script\*(C'\fR +property was created. (Later an improved script property was added as +well, the \f(CW\*(C`Script_Extensions\*(C'\fR property.) Those code points that are in +overflow blocks can still +have the same script as the original ones. The script concept fits more +closely with natural language: there is \f(CW\*(C`Latin\*(C'\fR script, \f(CW\*(C`Greek\*(C'\fR +script, and so on; and there are several artificial scripts, like +\&\f(CW\*(C`Common\*(C'\fR for characters that are used in multiple scripts, such as +mathematical symbols. Scripts usually span varied parts of several +blocks. For more information about scripts, see "Scripts" in perlunicode. +The division into blocks exists, but it is almost completely +accidental\-\-an artifact of how the characters have been and still are +allocated. (Note that this paragraph has oversimplified things for the +sake of this being an introduction. Unicode doesn't really encode +languages, but the writing systems for them\-\-their scripts; and one +script can be used by many languages. Unicode also encodes things that +aren't really about languages, such as symbols like \f(CW\*(C`BAGGAGE CLAIM\*(C'\fR.)
+.PP +The Unicode code points are just abstract numbers. To input and +output these abstract numbers, the numbers must be \fIencoded\fR or +\&\fIserialised\fR somehow. Unicode defines several \fIcharacter encoding +forms\fR, of which \fIUTF\-8\fR is the most popular. UTF\-8 is a +variable length encoding that encodes Unicode characters as 1 to 4 +bytes. Other encodings +include UTF\-16 and UTF\-32 and their big\- and little-endian variants +(UTF\-8 is byte-order independent). The ISO/IEC 10646 defines the UCS\-2 +and UCS\-4 encoding forms. +.PP +For more information about encodings\-\-for instance, to learn what +\&\fIsurrogates\fR and \fIbyte order marks\fR (BOMs) are\-\-see perlunicode. +.SS "Perl's Unicode Support" +.IX Subsection "Perl's Unicode Support" +Starting from Perl v5.6.0, Perl has had the capacity to handle Unicode +natively. Perl v5.8.0, however, is the first recommended release for +serious Unicode work. The maintenance release 5.6.1 fixed many of the +problems of the initial Unicode implementation, but for example +regular expressions still do not work with Unicode in 5.6.1. +Perl v5.14.0 is the first release where Unicode support is +(almost) seamlessly integrable without some gotchas. (There are a few +exceptions. Firstly, some differences in quotemeta +were fixed starting in Perl 5.16.0. Secondly, some differences in +the range operator were fixed starting in +Perl 5.26.0. Thirdly, some differences in split were fixed +starting in Perl 5.28.0.) +.PP +To enable this +seamless support, you should \f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR (which is +automatically selected if you \f(CW\*(C`use v5.12\*(C'\fR or higher). See feature. +(5.14 also fixes a number of bugs and departures from the Unicode +standard.) +.PP +Before Perl v5.8.0, \f(CW\*(C`use utf8\*(C'\fR was used to declare +that operations in the current block or file would be Unicode-aware.
+This model was found to be wrong, or at least clumsy: the "Unicodeness" +is now carried with the data, instead of being attached to the +operations. +Starting with Perl v5.8.0, only one case remains where an explicit \f(CW\*(C`use +utf8\*(C'\fR is needed: if your Perl script itself is encoded in UTF\-8, you can +use UTF\-8 in your identifier names, and in string and regular expression +literals, by saying \f(CW\*(C`use utf8\*(C'\fR. This is not the default because +scripts with legacy 8\-bit data in them would break. See utf8. +.SS "Perl's Unicode Model" +.IX Subsection "Perl's Unicode Model" +Perl supports both pre\-5.6 strings of eight-bit native bytes, and +strings of Unicode characters. The general principle is that Perl tries +to keep its data as eight-bit bytes for as long as possible, but as soon +as Unicodeness cannot be avoided, the data is transparently upgraded +to Unicode. Prior to Perl v5.14.0, the upgrade was not completely +transparent (see "The "Unicode Bug"" in perlunicode), and for backwards +compatibility, full transparency is not gained unless \f(CWuse feature +\&\*(Aqunicode_strings\*(Aq\fR (see feature) or \f(CW\*(C`use v5.12\*(C'\fR (or higher) is +selected. +.PP +Internally, Perl currently uses either whatever the native eight-bit +character set of the platform (for example Latin\-1) is, defaulting to +UTF\-8, to encode Unicode strings. Specifically, if all code points in +the string are \f(CW0xFF\fR or less, Perl uses the native eight-bit +character set. Otherwise, it uses UTF\-8. +.PP +A user of Perl does not normally need to know nor care how Perl +happens to encode its internal strings, but it becomes relevant when +outputting Unicode strings to a stream without a PerlIO layer (one with +the "default" encoding). 
In such a case, the raw bytes used internally +(the native character set or UTF\-8, as appropriate for each string) +will be used, and a "Wide character" warning will be issued if those +strings contain a character beyond 0x00FF. +.PP +For example, +.PP +.Vb 1 +\& perl \-e \*(Aqprint "\ex{DF}\en", "\ex{0100}\ex{DF}\en"\*(Aq +.Ve +.PP +produces a fairly useless mixture of native bytes and UTF\-8, as well +as a warning: +.PP +.Vb 1 +\& Wide character in print at ... +.Ve +.PP +To output UTF\-8, use the \f(CW\*(C`:encoding\*(C'\fR or \f(CW\*(C`:utf8\*(C'\fR output layer. Prepending +.PP +.Vb 1 +\& binmode(STDOUT, ":utf8"); +.Ve +.PP +to this sample program ensures that the output is completely UTF\-8, +and removes the program's warning. +.PP +You can enable automatic UTF\-8\-ification of your standard file +handles, default \f(CWopen()\fR layer, and \f(CW@ARGV\fR by using either +the \f(CW\*(C`\-C\*(C'\fR command line switch or the \f(CW\*(C`PERL_UNICODE\*(C'\fR environment +variable, see perlrun for the +documentation of the \f(CW\*(C`\-C\*(C'\fR switch. +.PP +Note that this means that Perl expects other software to work the same +way: +if Perl has been led to believe that STDIN should be UTF\-8, but then +STDIN coming in from another command is not UTF\-8, Perl will likely +complain about the malformed UTF\-8. +.PP +All features that combine Unicode and I/O also require using the new +PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though: +you can see whether yours is by running "perl \-V" and looking for +\&\f(CW\*(C`useperlio=define\*(C'\fR. +.SS "Unicode and EBCDIC" +.IX Subsection "Unicode and EBCDIC" +Perl 5.8.0 added support for Unicode on EBCDIC platforms. This support +was allowed to lapse in later releases, but was revived in 5.22. +Unicode support is somewhat more complex to implement since additional +conversions are needed. See perlebcdic for more information. 
+.PP +On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC +instead of UTF\-8. The difference is that as UTF\-8 is "ASCII-safe" in +that ASCII characters encode to UTF\-8 as-is, while UTF-EBCDIC is +"EBCDIC-safe", in that all the basic characters (which includes all +those that have ASCII equivalents (like \f(CW"A"\fR, \f(CW"0"\fR, \f(CW"%"\fR, \fIetc.\fR) +are the same in both EBCDIC and UTF-EBCDIC. Often, documentation +will use the term "UTF\-8" to mean UTF-EBCDIC as well. This is the case +in this document. +.SS "Creating Unicode" +.IX Subsection "Creating Unicode" +This section applies fully to Perls starting with v5.22. Various +caveats for earlier releases are in the "Earlier releases caveats" +subsection below. +.PP +To create Unicode characters in literals, +use the \f(CW\*(C`\eN{...}\*(C'\fR notation in double-quoted strings: +.PP +.Vb 2 +\& my $smiley_from_name = "\eN{WHITE SMILING FACE}"; +\& my $smiley_from_code_point = "\eN{U+263a}"; +.Ve +.PP +Similarly, they can be used in regular expression literals +.PP +.Vb 2 +\& $smiley =~ /\eN{WHITE SMILING FACE}/; +\& $smiley =~ /\eN{U+263a}/; +.Ve +.PP +or, starting in v5.32: +.PP +.Vb 2 +\& $smiley =~ /\ep{Name=WHITE SMILING FACE}/; +\& $smiley =~ /\ep{Name=whitesmilingface}/; +.Ve +.PP +At run-time you can use: +.PP +.Vb 4 +\& use charnames (); +\& my $hebrew_alef_from_name +\& = charnames::string_vianame("HEBREW LETTER ALEF"); +\& my $hebrew_alef_from_code_point = charnames::string_vianame("U+05D0"); +.Ve +.PP +Naturally, \f(CWord()\fR will do the reverse: it turns a character into +a code point. +.PP +There are other runtime options as well. 
You can use \f(CWpack()\fR: +.PP +.Vb 1 +\& my $hebrew_alef_from_code_point = pack("U", 0x05d0); +.Ve +.PP +Or you can use \f(CWchr()\fR, though it is less convenient in the general +case: +.PP +.Vb 2 +\& $hebrew_alef_from_code_point = chr(utf8::unicode_to_native(0x05d0)); +\& utf8::upgrade($hebrew_alef_from_code_point); +.Ve +.PP +The \f(CWutf8::unicode_to_native()\fR and \f(CWutf8::upgrade()\fR aren't needed if +the argument is above 0xFF, so the above could have been written as +.PP +.Vb 1 +\& $hebrew_alef_from_code_point = chr(0x05d0); +.Ve +.PP +since 0x5d0 is above 255. +.PP +\&\f(CW\*(C`\ex{}\*(C'\fR and \f(CW\*(C`\eo{}\*(C'\fR can also be used to specify code points at compile +time in double-quotish strings, but, for backward compatibility with +older Perls, the same rules apply as with \f(CWchr()\fR for code points less +than 256. +.PP +\&\f(CWutf8::unicode_to_native()\fR is used so that the Perl code is portable +to EBCDIC platforms. You can omit it if you're \fIreally\fR sure no one +will ever want to use your code on a non-ASCII platform. Starting in +Perl v5.22, calls to it on ASCII platforms are optimized out, so there's +no performance penalty at all in adding it. Or you can simply use the +other constructs that don't require it. +.PP +See "Further Resources" for how to find all these names and numeric +codes. +.PP +\fIEarlier releases caveats\fR +.IX Subsection "Earlier releases caveats" +.PP +On EBCDIC platforms, prior to v5.22, using \f(CW\*(C`\eN{U+...}\*(C'\fR doesn't work +properly. +.PP +Prior to v5.16, using \f(CW\*(C`\eN{...}\*(C'\fR with a character name (as opposed to a +\&\f(CW\*(C`U+...\*(C'\fR code point) required a \f(CW\*(C`use\ charnames\ :full\*(C'\fR. +.PP +Prior to v5.14, there were some bugs in \f(CW\*(C`\eN{...}\*(C'\fR with a character name +(as opposed to a \f(CW\*(C`U+...\*(C'\fR code point). +.PP +\&\f(CWcharnames::string_vianame()\fR was introduced in v5.14. 
Prior to that, +\&\f(CWcharnames::vianame()\fR should work, but only if the argument is of the +form \f(CW"U+..."\fR. Your best bet there for runtime Unicode by character +name is probably: +.PP +.Vb 3 +\& use charnames (); +\& my $hebrew_alef_from_name +\& = pack("U", charnames::vianame("HEBREW LETTER ALEF")); +.Ve +.SS "Handling Unicode" +.IX Subsection "Handling Unicode" +Handling Unicode is for the most part transparent: just use the +strings as usual. Functions like \f(CWindex()\fR, \f(CWlength()\fR, and +\&\f(CWsubstr()\fR will work on the Unicode characters; regular expressions +will work on the Unicode characters (see perlunicode and perlretut). +.PP +Note that Perl considers grapheme clusters to be separate characters, so for +example +.PP +.Vb 2 +\& print length("\eN{LATIN CAPITAL LETTER A}\eN{COMBINING ACUTE ACCENT}"), +\& "\en"; +.Ve +.PP +will print 2, not 1. The only exception is that regular expressions +have \f(CW\*(C`\eX\*(C'\fR for matching an extended grapheme cluster. (Thus \f(CW\*(C`\eX\*(C'\fR in a +regular expression would match the entire sequence of both the example +characters.) +.PP +Life is not quite so transparent, however, when working with legacy +encodings, I/O, and certain special cases: +.SS "Legacy Encodings" +.IX Subsection "Legacy Encodings" +When you combine legacy data and Unicode, the legacy data needs +to be upgraded to Unicode. Normally the legacy data is assumed to be +ISO 8859\-1 (or EBCDIC, if applicable). +.PP +The \f(CW\*(C`Encode\*(C'\fR module knows about many encodings and has interfaces +for doing conversions between those encodings: +.PP +.Vb 2 +\& use Encode \*(Aqdecode\*(Aq; +\& $data = decode("iso\-8859\-3", $data); # convert from legacy +.Ve +.SS "Unicode I/O" +.IX Subsection "Unicode I/O" +Normally, writing out Unicode data +.PP +.Vb 1 +\& print FH $some_string_with_unicode, "\en"; +.Ve +.PP +produces raw bytes that Perl happens to use to internally encode the +Unicode string. 
Perl's internal encoding depends on the system as +well as what characters happen to be in the string at the time. If +any of the characters are at code points \f(CW0x100\fR or above, you will get +a warning. To ensure that the output is explicitly rendered in the +encoding you desire\-\-and to avoid the warning\-\-open the stream with +the desired encoding. Some examples: +.PP +.Vb 1 +\& open FH, ">:utf8", "file"; +\& +\& open FH, ">:encoding(ucs2)", "file"; +\& open FH, ">:encoding(UTF\-8)", "file"; +\& open FH, ">:encoding(shift_jis)", "file"; +.Ve +.PP +and on already open streams, use \f(CWbinmode()\fR: +.PP +.Vb 1 +\& binmode(STDOUT, ":utf8"); +\& +\& binmode(STDOUT, ":encoding(ucs2)"); +\& binmode(STDOUT, ":encoding(UTF\-8)"); +\& binmode(STDOUT, ":encoding(shift_jis)"); +.Ve +.PP +The matching of encoding names is loose: case does not matter, and +many encodings have several aliases. Note that the \f(CW\*(C`:utf8\*(C'\fR layer +must always be specified exactly like that; it is \fInot\fR subject to +the loose matching of encoding names. Also note that currently \f(CW\*(C`:utf8\*(C'\fR is unsafe for +input, because it accepts the data without validating that it is indeed valid +UTF\-8; you should instead use \f(CW:encoding(UTF\-8)\fR (with or without a +hyphen). +.PP +See PerlIO for the \f(CW\*(C`:utf8\*(C'\fR layer, PerlIO::encoding and +Encode::PerlIO for the \f(CW:encoding()\fR layer, and +Encode::Supported for many encodings supported by the \f(CW\*(C`Encode\*(C'\fR +module. +.PP +Reading in a file that you know happens to be encoded in one of the +Unicode or legacy encodings does not magically turn the data into +Unicode in Perl's eyes. 
To do that, specify the appropriate +layer when opening files +.PP +.Vb 2 +\& open(my $fh,\*(Aq<:encoding(UTF\-8)\*(Aq, \*(Aqanything\*(Aq); +\& my $line_of_unicode = <$fh>; +\& +\& open(my $fh,\*(Aq<:encoding(Big5)\*(Aq, \*(Aqanything\*(Aq); +\& my $line_of_unicode = <$fh>; +.Ve +.PP +The I/O layers can also be specified more flexibly with +the \f(CW\*(C`open\*(C'\fR pragma. See open, or look at the following example. +.PP +.Vb 8 +\& use open \*(Aq:encoding(UTF\-8)\*(Aq; # input/output default encoding will be +\& # UTF\-8 +\& open X, ">file"; +\& print X chr(0x100), "\en"; +\& close X; +\& open Y, "<file"; +\& printf "%#x\en", ord(<Y>); # this should print 0x100 +\& close Y; +.Ve +.PP +With the \f(CW\*(C`open\*(C'\fR pragma you can use the \f(CW\*(C`:locale\*(C'\fR layer +.PP +.Vb 10 +\& BEGIN { $ENV{LC_ALL} = $ENV{LANG} = \*(Aqru_RU.KOI8\-R\*(Aq } +\& # the :locale will probe the locale environment variables like +\& # LC_ALL +\& use open OUT => \*(Aq:locale\*(Aq; # russki parusski +\& open(O, ">koi8"); +\& print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8\-R 0xc1 +\& close O; +\& open(I, "<koi8"); +\& printf "%#x\en", ord(<I>), "\en"; # this should print 0xc1 +\& close I; +.Ve +.PP +These methods install a transparent filter on the I/O stream that +converts data from the specified encoding when it is read in from the +stream. The result is always Unicode. +.PP +The open pragma affects all the \f(CWopen()\fR calls after the pragma by +setting default layers. If you want to affect only certain +streams, use explicit layers directly in the \f(CWopen()\fR call. +.PP +You can switch encodings on an already opened stream by using +\&\f(CWbinmode()\fR; see "binmode" in perlfunc. +.PP +The \f(CW\*(C`:locale\*(C'\fR does not currently work with +\&\f(CWopen()\fR and \f(CWbinmode()\fR, only with the \f(CW\*(C`open\*(C'\fR pragma. 
The +\&\f(CW\*(C`:utf8\*(C'\fR and \f(CW:encoding(...)\fR methods do work with all of \f(CWopen()\fR, +\&\f(CWbinmode()\fR, and the \f(CW\*(C`open\*(C'\fR pragma. +.PP +Similarly, you may use these I/O layers on output streams to +automatically convert Unicode to the specified encoding when it is +written to the stream. For example, the following snippet copies the +contents of the file "text.jis" (encoded as ISO\-2022\-JP, aka JIS) to +the file "text.utf8", encoded as UTF\-8: +.PP +.Vb 3 +\& open(my $nihongo, \*(Aq<:encoding(iso\-2022\-jp)\*(Aq, \*(Aqtext.jis\*(Aq); +\& open(my $unicode, \*(Aq>:utf8\*(Aq, \*(Aqtext.utf8\*(Aq); +\& while (<$nihongo>) { print $unicode $_ } +.Ve +.PP +The naming of encodings, both by the \f(CWopen()\fR and by the \f(CW\*(C`open\*(C'\fR +pragma allows for flexible names: \f(CW\*(C`koi8\-r\*(C'\fR and \f(CW\*(C`KOI8R\*(C'\fR will both be +understood. +.PP +Common encodings recognized by ISO, MIME, IANA, and various other +standardisation organisations are recognised; for a more detailed +list see Encode::Supported. +.PP +\&\f(CWread()\fR reads characters and returns the number of characters. +\&\f(CWseek()\fR and \f(CWtell()\fR operate on byte counts, as does \f(CWsysseek()\fR. +.PP +\&\f(CWsysread()\fR and \f(CWsyswrite()\fR should not be used on file handles with +character encoding layers, they behave badly, and that behaviour has +been deprecated since perl 5.24. +.PP +Notice that because of the default behaviour of not doing any +conversion upon input if there is no default layer, +it is easy to mistakenly write code that keeps on expanding a file +by repeatedly encoding the data: +.PP +.Vb 8 +\& # BAD CODE WARNING +\& open F, "file"; +\& local $/; ## read in the whole file of 8\-bit characters +\& $t = <F>; +\& close F; +\& open F, ">:encoding(UTF\-8)", "file"; +\& print F $t; ## convert to UTF\-8 on output +\& close F; +.Ve +.PP +If you run this code twice, the contents of the \fIfile\fR will be twice +UTF\-8 encoded. 
A \f(CW\*(C`use open \*(Aq:encoding(UTF\-8)\*(Aq\*(C'\fR would have avoided the +bug, or explicitly opening also the \fIfile\fR for input as UTF\-8. +.PP +\&\fBNOTE\fR: the \f(CW\*(C`:utf8\*(C'\fR and \f(CW\*(C`:encoding\*(C'\fR features work only if your +Perl has been built with PerlIO, which is the default +on most systems. +.SS "Displaying Unicode As Text" +.IX Subsection "Displaying Unicode As Text" +Sometimes you might want to display Perl scalars containing Unicode as +simple ASCII (or EBCDIC) text. The following subroutine converts +its argument so that Unicode characters with code points greater than +255 are displayed as \f(CW\*(C`\ex{...}\*(C'\fR, control characters (like \f(CW\*(C`\en\*(C'\fR) are +displayed as \f(CW\*(C`\ex..\*(C'\fR, and the rest of the characters as themselves: +.PP +.Vb 9 +\& sub nice_string { +\& join("", +\& map { $_ > 255 # if wide character... +\& ? sprintf("\e\ex{%04X}", $_) # \ex{...} +\& : chr($_) =~ /[[:cntrl:]]/ # else if control character... +\& ? sprintf("\e\ex%02X", $_) # \ex.. +\& : quotemeta(chr($_)) # else quoted or as themselves +\& } unpack("W*", $_[0])); # unpack Unicode characters +\& } +.Ve +.PP +For example, +.PP +.Vb 1 +\& nice_string("foo\ex{100}bar\en") +.Ve +.PP +returns the string +.PP +.Vb 1 +\& \*(Aqfoo\ex{0100}bar\ex0A\*(Aq +.Ve +.PP +which is ready to be printed. +.PP +(\f(CW\*(C`\e\ex{}\*(C'\fR is used here instead of \f(CW\*(C`\e\eN{}\*(C'\fR, since it's most likely that +you want to see what the native values are.) +.SS "Special Cases" +.IX Subsection "Special Cases" +.IP \(bu 4 +Starting in Perl 5.28, it is illegal for bit operators, like \f(CW\*(C`~\*(C'\fR, to +operate on strings containing code points above 255. +.IP \(bu 4 +The \fBvec()\fR function may produce surprising results if +used on strings containing characters with ordinal values above +255. In such a case, the results are consistent with the internal +encoding of the characters, but not with much else. 
So don't do +that, and starting in Perl 5.28, a deprecation message is issued if you +do so, becoming illegal in Perl 5.32. +.IP \(bu 4 +Peeking At Perl's Internal Encoding +.Sp +Normal users of Perl should never care how Perl encodes any particular +Unicode string (because the normal ways to get at the contents of a +string with Unicode\-\-via input and output\-\-should always be via +explicitly-defined I/O layers). But if you must, there are two +ways of looking behind the scenes. +.Sp +One way of peeking inside the internal encoding of Unicode characters +is to use \f(CW\*(C`unpack("C*", ...\*(C'\fR to get the bytes of whatever the string +encoding happens to be, or \f(CW\*(C`unpack("U0..", ...)\*(C'\fR to get the bytes of the +UTF\-8 encoding: +.Sp +.Vb 2 +\& # this prints c4 80 for the UTF\-8 bytes 0xc4 0x80 +\& print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\en"; +.Ve +.Sp +Yet another way would be to use the Devel::Peek module: +.Sp +.Vb 1 +\& perl \-MDevel::Peek \-e \*(AqDump(chr(0x100))\*(Aq +.Ve +.Sp +That shows the \f(CW\*(C`UTF8\*(C'\fR flag in FLAGS and both the UTF\-8 bytes +and Unicode characters in \f(CW\*(C`PV\*(C'\fR. See also later in this document +the discussion about the \f(CWutf8::is_utf8()\fR function. +.SS "Advanced Topics" +.IX Subsection "Advanced Topics" +.IP \(bu 4 +String Equivalence +.Sp +The question of string equivalence turns somewhat complicated +in Unicode: what do you mean by "equal"? +.Sp +(Is \f(CW\*(C`LATIN CAPITAL LETTER A WITH ACUTE\*(C'\fR equal to +\&\f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR?) +.Sp +The short answer is that by default Perl compares equivalence (\f(CW\*(C`eq\*(C'\fR, +\&\f(CW\*(C`ne\*(C'\fR) based only on code points of the characters. In the above +case, the answer is no (because 0x00C1 != 0x0041). But sometimes, any +CAPITAL LETTER A's should be considered equal, or even A's of any case. 
+.Sp +The long answer is that you need to consider character normalization +and casing issues: see Unicode::Normalize, Unicode Technical Report #15, +Unicode Normalization Forms <https://www.unicode.org/reports/tr15> and +sections on case mapping in the Unicode Standard <https://www.unicode.org>. +.Sp +As of Perl 5.8.0, the "Full" case-folding of \fICase +Mappings/SpecialCasing\fR is implemented, but bugs remain in \f(CW\*(C`qr//i\*(C'\fR with them, +mostly fixed by 5.14, and essentially entirely by 5.18. +.IP \(bu 4 +String Collation +.Sp +People like to see their strings nicely sorted\-\-or as Unicode +parlance goes, collated. But again, what do you mean by collate? +.Sp +(Does \f(CW\*(C`LATIN CAPITAL LETTER A WITH ACUTE\*(C'\fR come before or after +\&\f(CW\*(C`LATIN CAPITAL LETTER A WITH GRAVE\*(C'\fR?) +.Sp +The short answer is that by default, Perl compares strings (\f(CW\*(C`lt\*(C'\fR, +\&\f(CW\*(C`le\*(C'\fR, \f(CW\*(C`cmp\*(C'\fR, \f(CW\*(C`ge\*(C'\fR, \f(CW\*(C`gt\*(C'\fR) based only on the code points of the +characters. In the above case, the answer is "after", since +\&\f(CW0x00C1\fR > \f(CW0x00C0\fR. +.Sp +The long answer is that "it depends", and a good answer cannot be +given without knowing (at the very least) the language context. +See Unicode::Collate, and \fIUnicode Collation Algorithm\fR +<https://www.unicode.org/reports/tr10/> +.SS Miscellaneous +.IX Subsection "Miscellaneous" +.IP \(bu 4 +Character Ranges and Classes +.Sp +Character ranges in regular expression bracketed character classes ( e.g., +\&\f(CW\*(C`/[a\-z]/\*(C'\fR) and in the \f(CW\*(C`tr///\*(C'\fR (also known as \f(CW\*(C`y///\*(C'\fR) operator are not +magically Unicode-aware. 
What this means is that \f(CW\*(C`[A\-Za\-z]\*(C'\fR will not +magically start to mean "all alphabetic letters" (not that it does mean that +even for 8\-bit characters; for those, if you are using locales (perllocale), +use \f(CW\*(C`/[[:alpha:]]/\*(C'\fR; and if not, use the 8\-bit\-aware property \f(CW\*(C`\ep{alpha}\*(C'\fR). +.Sp +All the properties that begin with \f(CW\*(C`\ep\*(C'\fR (and its inverse \f(CW\*(C`\eP\*(C'\fR) are actually +character classes that are Unicode-aware. There are dozens of them, see +perluniprops. +.Sp +Starting in v5.22, you can use Unicode code points as the end points of +regular expression pattern character ranges, and the range will include +all Unicode code points that lie between those end points, inclusive. +.Sp +.Vb 1 +\& qr/ [ \eN{U+03} \- \eN{U+20} ] /xx +.Ve +.Sp +includes the code points +\&\f(CW\*(C`\eN{U+03}\*(C'\fR, \f(CW\*(C`\eN{U+04}\*(C'\fR, ..., \f(CW\*(C`\eN{U+20}\*(C'\fR. +.Sp +This also works for ranges in \f(CW\*(C`tr///\*(C'\fR starting in Perl v5.24. +.IP \(bu 4 +String-To-Number Conversions +.Sp +Unicode does define several other decimal\-\-and numeric\-\-characters +besides the familiar 0 to 9, such as the Arabic and Indic digits. +Perl does not support string-to-number conversion for digits other +than ASCII \f(CW0\fR to \f(CW9\fR (and ASCII \f(CW\*(C`a\*(C'\fR to \f(CW\*(C`f\*(C'\fR for hexadecimal). +To get safe conversions from any Unicode string, use +"\fBnum()\fR" in Unicode::UCD. +.SS "Questions With Answers" +.IX Subsection "Questions With Answers" +.IP \(bu 4 +Will My Old Scripts Break? +.Sp +Very probably not. Unless you are generating Unicode characters +somehow, old behaviour should be preserved. About the only behaviour +that has changed and which could start generating Unicode is the old +behaviour of \f(CWchr()\fR where supplying an argument more than 255 +produced a character modulo 255. 
\f(CWchr(300)\fR, for example, was equal
+to \f(CWchr(45)\fR or "\-" (in ASCII); now it is LATIN CAPITAL LETTER I WITH
+BREVE.
+.IP \(bu 4
+How Do I Make My Scripts Work With Unicode?
+.Sp
+Very little work should be needed since nothing changes until you
+generate Unicode data. The most important thing is getting input as
+Unicode; for that, see the earlier I/O discussion.
+To get full seamless Unicode support, add
+\&\f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR (or \f(CW\*(C`use v5.12\*(C'\fR or higher) to your
+script.
+.IP \(bu 4
+How Do I Know Whether My String Is In Unicode?
+.Sp
+You shouldn't have to care. But you may if your Perl is before 5.14.0
+or you haven't specified \f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR or \f(CWuse
+5.012\fR (or higher) because otherwise the rules for the code points
+in the range 128 to 255 are different depending on
+whether the string they are contained within is in Unicode or not.
+(See "When Unicode Does Not Happen" in perlunicode.)
+.Sp
+To determine if a string is in Unicode, use:
+.Sp
+.Vb 1
+\& print utf8::is_utf8($string) ? 1 : 0, "\en";
+.Ve
+.Sp
+But note that this doesn't mean that any of the characters in the
+string are necessarily UTF\-8 encoded, or that any of the characters have
+code points greater than 0xFF (255) or even 0x80 (128), or that the
+string has any characters at all. All that \f(CWis_utf8()\fR does is
+return the value of the internal "utf8ness" flag attached to the
+\&\f(CW$string\fR. If the flag is off, the bytes in the scalar are interpreted
+as a single byte encoding. If the flag is on, the bytes in the scalar
+are interpreted as the (variable-length, potentially multi-byte) UTF\-8 encoded
+code points of the characters. Bytes added to a UTF\-8 encoded string are
+automatically upgraded to UTF\-8.
If mixed non\-UTF\-8 and UTF\-8 scalars +are merged (double-quoted interpolation, explicit concatenation, or +printf/sprintf parameter substitution), the result will be UTF\-8 encoded +as if copies of the byte strings were upgraded to UTF\-8: for example, +.Sp +.Vb 3 +\& $a = "ab\ex80c"; +\& $b = "\ex{100}"; +\& print "$a = $b\en"; +.Ve +.Sp +the output string will be UTF\-8\-encoded \f(CW\*(C`ab\ex80c = \ex{100}\en\*(C'\fR, but +\&\f(CW$a\fR will stay byte-encoded. +.Sp +Sometimes you might really need to know the byte length of a string +instead of the character length. For that use the \f(CW\*(C`bytes\*(C'\fR pragma +and the \f(CWlength()\fR function: +.Sp +.Vb 6 +\& my $unicode = chr(0x100); +\& print length($unicode), "\en"; # will print 1 +\& use bytes; +\& print length($unicode), "\en"; # will print 2 +\& # (the 0xC4 0x80 of the UTF\-8) +\& no bytes; +.Ve +.IP \(bu 4 +How Do I Find Out What Encoding a File Has? +.Sp +You might try Encode::Guess, but it has a number of limitations. +.IP \(bu 4 +How Do I Detect Data That's Not Valid In a Particular Encoding? +.Sp +Use the \f(CW\*(C`Encode\*(C'\fR package to try converting it. +For example, +.Sp +.Vb 1 +\& use Encode \*(Aqdecode\*(Aq; +\& +\& if (eval { decode(\*(AqUTF\-8\*(Aq, $string, Encode::FB_CROAK); 1 }) { +\& # $string is valid UTF\-8 +\& } else { +\& # $string is not valid UTF\-8 +\& } +.Ve +.Sp +Or use \f(CW\*(C`unpack\*(C'\fR to try decoding it: +.Sp +.Vb 2 +\& use warnings; +\& @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8); +.Ve +.Sp +If invalid, a \f(CW\*(C`Malformed UTF\-8 character\*(C'\fR warning is produced. The "C0" means +"process the string character per character". Without that, the +\&\f(CW\*(C`unpack("U*", ...)\*(C'\fR would work in \f(CW\*(C`U0\*(C'\fR mode (the default if the format +string starts with \f(CW\*(C`U\*(C'\fR) and it would return the bytes making up the UTF\-8 +encoding of the target string, something that will always work. 
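The \f(CW\*(C`eval\*(C'\fR/\f(CWdecode()\fR idiom above can be wrapped in a small helper; this is a sketch (the helper name is our own, not a standard API), exercised on both a complete and a truncated UTF\-8 byte sequence.

```perl
use Encode 'decode';

# Hypothetical helper wrapping the eval/decode idiom shown above;
# returns true if the bytes form well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

print is_valid_utf8("\xc4\x80"), "\n";   # 1: complete two-byte sequence (U+0100)
print is_valid_utf8("\xc4"),     "\n";   # 0: truncated sequence
```

Because the helper works on a copy of its argument, the caller's string is left untouched even though \f(CWdecode()\fR may modify its source when a fallback other than \f(CW\*(C`Encode::LEAVE_SRC\*(C'\fR is in effect.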
+.IP \(bu 4
+How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?
+.Sp
+This probably isn't as useful as you might think.
+Normally, you shouldn't need to.
+.Sp
+In one sense, what you are asking doesn't make much sense: encodings
+are for characters, and binary data are not "characters", so converting
+"data" into some encoding isn't meaningful unless you know what
+character set and encoding the binary data is in, in which case it's
+not just binary data, now is it?
+.Sp
+If you have a raw sequence of bytes that you know should be
+interpreted via a particular encoding, you can use \f(CW\*(C`Encode\*(C'\fR:
+.Sp
+.Vb 2
+\& use Encode \*(Aqfrom_to\*(Aq;
+\& from_to($data, "iso\-8859\-1", "UTF\-8"); # from latin\-1 to UTF\-8
+.Ve
+.Sp
+The call to \f(CWfrom_to()\fR changes the bytes in \f(CW$data\fR, but nothing
+material about the nature of the string has changed as far as Perl is
+concerned. Both before and after the call, the string \f(CW$data\fR
+contains just a bunch of 8\-bit bytes. As far as Perl is concerned,
+the encoding of the string remains as "system-native 8\-bit bytes".
+.Sp
+You might relate this to a fictional 'Translate' module:
+.Sp
+.Vb 4
+\& use Translate;
+\& my $phrase = "Yes";
+\& Translate::from_to($phrase, \*(Aqenglish\*(Aq, \*(Aqdeutsch\*(Aq);
+\& ## phrase now contains "Ja"
+.Ve
+.Sp
+The contents of the string change, but not the nature of the string.
+Perl doesn't know any more after the call than before that the
+contents of the string indicate the affirmative.
+.Sp
+Back to converting data. If you have (or want) data in your system's
+native 8\-bit encoding (e.g. Latin\-1, EBCDIC, etc.), you can use
+pack/unpack to convert to/from Unicode.
+.Sp +.Vb 2 +\& $native_string = pack("W*", unpack("U*", $Unicode_string)); +\& $Unicode_string = pack("U*", unpack("W*", $native_string)); +.Ve +.Sp +If you have a sequence of bytes you \fBknow\fR is valid UTF\-8, +but Perl doesn't know it yet, you can make Perl a believer, too: +.Sp +.Vb 2 +\& $Unicode = $bytes; +\& utf8::decode($Unicode); +.Ve +.Sp +or: +.Sp +.Vb 1 +\& $Unicode = pack("U0a*", $bytes); +.Ve +.Sp +You can find the bytes that make up a UTF\-8 sequence with +.Sp +.Vb 1 +\& @bytes = unpack("C*", $Unicode_string) +.Ve +.Sp +and you can create well-formed Unicode with +.Sp +.Vb 1 +\& $Unicode_string = pack("U*", 0xff, ...) +.Ve +.IP \(bu 4 +How Do I Display Unicode? How Do I Input Unicode? +.Sp +See <http://www.alanwood.net/unicode/> and +<http://www.cl.cam.ac.uk/~mgk25/unicode.html> +.IP \(bu 4 +How Does Unicode Work With Traditional Locales? +.Sp +If your locale is a UTF\-8 locale, starting in Perl v5.26, Perl works +well for all categories; before this, starting with Perl v5.20, it works +for all categories but \f(CW\*(C`LC_COLLATE\*(C'\fR, which deals with +sorting and the \f(CW\*(C`cmp\*(C'\fR operator. But note that the standard +\&\f(CW\*(C`Unicode::Collate\*(C'\fR and \f(CW\*(C`Unicode::Collate::Locale\*(C'\fR modules offer +much more powerful solutions to collation issues, and work on earlier +releases. +.Sp +For other locales, starting in Perl 5.16, you can specify +.Sp +.Vb 1 +\& use locale \*(Aq:not_characters\*(Aq; +.Ve +.Sp +to get Perl to work well with them. The catch is that you +have to translate from the locale character set to/from Unicode +yourself. See "Unicode I/O" above for how to +.Sp +.Vb 1 +\& use open \*(Aq:locale\*(Aq; +.Ve +.Sp +to accomplish this, but full details are in "Unicode and +UTF\-8" in perllocale, including gotchas that happen if you don't specify +\&\f(CW\*(C`:not_characters\*(C'\fR. 
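The pack/unpack and \f(CW\*(C`utf8::\*(C'\fR conversions shown earlier in this section can be cross-checked against each other; this is a minimal sketch using only core builtins.

```perl
# Minimal round-trip check of the conversions shown above.
my $chars = "\x{100}";              # one character, U+0100

# Characters -> UTF-8 bytes, via the in-place utf8:: functions:
my $bytes = $chars;
utf8::encode($bytes);               # now the two bytes 0xC4 0x80
printf "%d byte(s): %vd\n", length($bytes), $bytes;   # 2 byte(s): 196.128

# ... and back again:
utf8::decode($bytes);
print $bytes eq $chars ? "round trip ok\n" : "round trip failed\n";

# The same bytes obtained with unpack("C*", ...), as in the examples above:
my @raw = unpack("C*", do { my $t = $chars; utf8::encode($t); $t });
printf "%02x %02x\n", @raw;         # c4 80
```

Note that \f(CWutf8::encode()\fR and \f(CWutf8::decode()\fR modify their argument in place, which is why the sketch works on copies of \f(CW$chars\fR.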
+.SS "Hexadecimal Notation" +.IX Subsection "Hexadecimal Notation" +The Unicode standard prefers using hexadecimal notation because +that more clearly shows the division of Unicode into blocks of 256 characters. +Hexadecimal is also simply shorter than decimal. You can use decimal +notation, too, but learning to use hexadecimal just makes life easier +with the Unicode standard. The \f(CW\*(C`U+HHHH\*(C'\fR notation uses hexadecimal, +for example. +.PP +The \f(CW\*(C`0x\*(C'\fR prefix means a hexadecimal number, the digits are 0\-9 \fIand\fR +a\-f (or A\-F, case doesn't matter). Each hexadecimal digit represents +four bits, or half a byte. \f(CW\*(C`print 0x..., "\en"\*(C'\fR will show a +hexadecimal number in decimal, and \f(CW\*(C`printf "%x\en", $decimal\*(C'\fR will +show a decimal number in hexadecimal. If you have just the +"hex digits" of a hexadecimal number, you can use the \f(CWhex()\fR function. +.PP +.Vb 6 +\& print 0x0009, "\en"; # 9 +\& print 0x000a, "\en"; # 10 +\& print 0x000f, "\en"; # 15 +\& print 0x0010, "\en"; # 16 +\& print 0x0011, "\en"; # 17 +\& print 0x0100, "\en"; # 256 +\& +\& print 0x0041, "\en"; # 65 +\& +\& printf "%x\en", 65; # 41 +\& printf "%#x\en", 65; # 0x41 +\& +\& print hex("41"), "\en"; # 65 +.Ve +.SS "Further Resources" +.IX Subsection "Further Resources" +.IP \(bu 4 +Unicode Consortium +.Sp +<https://www.unicode.org/> +.IP \(bu 4 +Unicode FAQ +.Sp +<https://www.unicode.org/faq/> +.IP \(bu 4 +Unicode Glossary +.Sp +<https://www.unicode.org/glossary/> +.IP \(bu 4 +Unicode Recommended Reading List +.Sp +The Unicode Consortium has a list of articles and books, some of which +give a much more in depth treatment of Unicode: +<http://unicode.org/resources/readinglist.html> +.IP \(bu 4 +Unicode Useful Resources +.Sp +<https://www.unicode.org/unicode/onlinedat/resources.html> +.IP \(bu 4 +Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications +.Sp +<http://www.alanwood.net/unicode/> +.IP \(bu 4 +UTF\-8 and 
Unicode FAQ for Unix/Linux
+.Sp
+<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
+.IP \(bu 4
+Legacy Character Sets
+.Sp
+<http://www.czyborra.com/>
+<http://www.eki.ee/letter/>
+.IP \(bu 4
+You can explore various information from the Unicode data files using
+the \f(CW\*(C`Unicode::UCD\*(C'\fR module.
+.SH "UNICODE IN OLDER PERLS"
+.IX Header "UNICODE IN OLDER PERLS"
+If you cannot upgrade your Perl to 5.8.0 or later, you can still
+do some Unicode processing by using the modules \f(CW\*(C`Unicode::String\*(C'\fR,
+\&\f(CW\*(C`Unicode::Map8\*(C'\fR, and \f(CW\*(C`Unicode::Map\*(C'\fR, available from CPAN.
+If you have GNU recode installed, you can also use the
+Perl front-end \f(CW\*(C`Convert::Recode\*(C'\fR for character conversions.
+.PP
+The following are fast conversions from ISO 8859\-1 (Latin\-1) bytes
+to UTF\-8 bytes and back; the code works even with older Perl 5 versions.
+.PP
+.Vb 2
+\& # ISO 8859\-1 to UTF\-8
+\& s/([\ex80\-\exFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
+\&
+\& # UTF\-8 to ISO 8859\-1
+\& s/([\exC2\exC3])([\ex80\-\exBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
+.Ve
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+perlunitut, perlunicode, Encode, open, utf8, bytes,
+perlretut, perlrun, Unicode::Collate, Unicode::Normalize,
+Unicode::UCD
+.SH ACKNOWLEDGMENTS
+.IX Header "ACKNOWLEDGMENTS"
+Thanks to the kind readers of the perl5\-porters@perl.org,
+perl\-unicode@perl.org, linux\-utf8@nl.linux.org, and unicore@unicode.org
+mailing lists for their valuable feedback.
+.SH "AUTHOR, COPYRIGHT, AND LICENSE"
+.IX Header "AUTHOR, COPYRIGHT, AND LICENSE"
+Copyright 2001\-2011 Jarkko Hietaniemi <jhi@iki.fi>.
+Now maintained by Perl 5 Porters.
+.PP
+This document may be distributed under the same terms as Perl itself.