Diffstat (limited to 'upstream/mageia-cauldron/man1/perlunicode.1')
-rw-r--r-- | upstream/mageia-cauldron/man1/perlunicode.1 | 2232 |
1 files changed, 2232 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man1/perlunicode.1 b/upstream/mageia-cauldron/man1/perlunicode.1 new file mode 100644 index 00000000..ead666c0 --- /dev/null +++ b/upstream/mageia-cauldron/man1/perlunicode.1 @@ -0,0 +1,2232 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLUNICODE 1" +.TH PERLUNICODE 1 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. +.if n .ad l +.nh +.SH NAME +perlunicode \- Unicode support in Perl +.SH DESCRIPTION +.IX Header "DESCRIPTION" +If you haven't already, before reading this document, you should become +familiar with both perlunitut and perluniintro. 
+.PP +Unicode aims to \fBUNI\fR\-fy the en\-\fBCODE\fR\-ings of all the world's +character sets into a single Standard. For quite a few of the various +coding standards that existed when Unicode was first created, converting +from each to Unicode essentially meant adding a constant to each code +point in the original standard, and converting back meant just +subtracting that same constant. For ASCII and ISO\-8859\-1, the constant +is 0. For ISO\-8859\-5, (Cyrillic) the constant is 864; for Hebrew +(ISO\-8859\-8), it's 1488; Thai (ISO\-8859\-11), 3424; and so forth. This +made it easy to do the conversions, and facilitated the adoption of +Unicode. +.PP +And it worked; nowadays, those legacy standards are rarely used. Most +everyone uses Unicode. +.PP +Unicode is a comprehensive standard. It specifies many things outside +the scope of Perl, such as how to display sequences of characters. For +a full discussion of all aspects of Unicode, see +<https://www.unicode.org>. +.SS "Important Caveats" +.IX Subsection "Important Caveats" +Even though some of this section may not be understandable to you on +first reading, we think it's important enough to highlight some of the +gotchas before delving further, so here goes: +.PP +Unicode support is an extensive requirement. While Perl does not +implement the Unicode standard or the accompanying technical reports +from cover to cover, Perl does support many Unicode features. +.PP +Also, the use of Unicode may present security issues that aren't +obvious, see "Security Implications of Unicode" below. +.ie n .IP "Safest if you ""use feature \*(Aqunicode_strings\*(Aq""" 4 +.el .IP "Safest if you \f(CWuse feature \*(Aqunicode_strings\*(Aq\fR" 4 +.IX Item "Safest if you use feature unicode_strings" +In order to preserve backward compatibility, Perl does not turn +on full internal Unicode support unless the pragma +\&\f(CW\*(C`use\ feature\ \*(Aqunicode_strings\*(Aq\*(C'\fR +is specified. 
(This is automatically +selected if you \f(CW\*(C`use\ v5.12\*(C'\fR or higher.) Failure to do this can +cause surprising behavior. See "The "Unicode Bug"" below. +.Sp +This pragma doesn't affect I/O. Nor does it change the internal +representation of strings, only their interpretation. There are still +several places where Unicode isn't fully supported, such as in +filenames. +.IP "Input and Output Layers" 4 +.IX Item "Input and Output Layers" +Use the \f(CW:encoding(...)\fR layer to read from and write to +filehandles using the specified encoding. (See open.) +.IP "You must convert your non-ASCII, non\-UTF\-8 Perl scripts to be UTF\-8." 4 +.IX Item "You must convert your non-ASCII, non-UTF-8 Perl scripts to be UTF-8." +The encoding module has been deprecated since perl 5.18 and the +perl internals it requires were removed in perl 5.26. +.ie n .IP """use utf8"" still needed to enable UTF\-8 in scripts" 4 +.el .IP "\f(CWuse utf8\fR still needed to enable UTF\-8 in scripts" 4 +.IX Item "use utf8 still needed to enable UTF-8 in scripts" +If your Perl script is itself encoded in UTF\-8, +the \f(CW\*(C`use\ utf8\*(C'\fR pragma must be explicitly included to enable +recognition of that (in string or regular expression literals, or in +identifier names). \fBThis is the only time when an explicit \fR\f(CB\*(C`use\ utf8\*(C'\fR\fB is needed.\fR (See utf8). +.Sp +If a Perl script begins with the bytes that form the UTF\-8 encoding of +the Unicode BYTE ORDER MARK (\f(CW\*(C`BOM\*(C'\fR, see "Unicode Encodings"), those +bytes are completely ignored. +.IP "UTF\-16 scripts autodetected" 4 +.IX Item "UTF-16 scripts autodetected" +If a Perl script begins with the Unicode \f(CW\*(C`BOM\*(C'\fR (UTF\-16LE, +UTF\-16BE), or if the script looks like non\-\f(CW\*(C`BOM\*(C'\fR\-marked +UTF\-16 of either endianness, Perl will correctly read in the script as +the appropriate Unicode encoding. 
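The caveats above about \f(CWuse feature 'unicode_strings'\fR and the \f(CW:encoding(...)\fR layer can be illustrated with a minimal sketch. The variable names are made up for this example, and an in-memory filehandle stands in for a real file; that substitution is an assumption of the sketch, not something the text above prescribes.

```perl
use strict;
use warnings;

# U+00E9 (LATIN SMALL LETTER E WITH ACUTE) held in a one-byte, non-UTF-8 string.
my $e_acute = "\xE9";

my ($without, $with);
{
    no feature 'unicode_strings';
    # Feature off (and no "use locale"): code points 128..255 get ASCII
    # rules, so uc() leaves the character unchanged.
    $without = uc $e_acute;
}
{
    use feature 'unicode_strings';
    # Feature on: Unicode rules apply and uc() yields U+00C9.
    $with = uc $e_acute;
}
printf "feature off: U+%04X\n", ord $without;
printf "feature on:  U+%04X\n", ord $with;

# The :encoding layer converts between characters and bytes at the I/O
# boundary. An in-memory filehandle keeps the sketch self-contained.
my $bytes = '';
open my $out, '>:encoding(UTF-8)', \$bytes or die "open for write: $!";
print {$out} "\x{263A}";            # one character, U+263A
close $out or die "close: $!";

open my $in, '<:encoding(UTF-8)', \$bytes or die "open for read: $!";
my $text = <$in>;
close $in;

printf "encoded length: %d bytes\n", length $bytes;
printf "decoded length: %d character(s)\n", length $text;
```

The feature is lexically scoped, which is why the two calls to \f(CWuc()\fR sit in separate blocks.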
+.SS "Byte and Character Semantics" +.IX Subsection "Byte and Character Semantics" +Before Unicode, most encodings used 8 bits (a single byte) to encode +each character. Thus a character was a byte, and a byte was a +character, and there could be only 256 or fewer possible characters. +"Byte Semantics" in the title of this section refers to +this behavior. There was no need to distinguish between "Byte" and +"Character". +.PP +Then along comes Unicode which has room for over a million characters +(and Perl allows for even more). This means that a character may +require more than a single byte to represent it, and so the two terms +are no longer equivalent. What matter are the characters as whole +entities, and not usually the bytes that comprise them. That's what the +term "Character Semantics" in the title of this section refers to. +.PP +Perl had to change internally to decouple "bytes" from "characters". +It is important that you too change your ideas, if you haven't already, +so that "byte" and "character" no longer mean the same thing in your +mind. +.PP +The basic building block of Perl strings has always been a "character". +The changes basically come down to that the implementation no longer +thinks that a character is always just a single byte. +.PP +There are various things to note: +.IP \(bu 4 +String handling functions, for the most part, continue to operate in +terms of characters. \f(CWlength()\fR, for example, returns the number of +characters in a string, just as before. But that number no longer is +necessarily the same as the number of bytes in the string (there may be +more bytes than characters). The other such functions include +\&\f(CWchop()\fR, \f(CWchomp()\fR, \f(CWsubstr()\fR, \f(CWpos()\fR, \f(CWindex()\fR, \f(CWrindex()\fR, +\&\f(CWsort()\fR, \f(CWsprintf()\fR, and \f(CWwrite()\fR. 
+.Sp +The exceptions are: +.RS 4 +.IP \(bu 4 +the bit-oriented \f(CW\*(C`vec\*(C'\fR +.Sp +\ +.IP \(bu 4 +the byte-oriented \f(CW\*(C`pack\*(C'\fR/\f(CW\*(C`unpack\*(C'\fR \f(CW"C"\fR format +.Sp +However, the \f(CW\*(C`W\*(C'\fR specifier does operate on whole characters, as does the +\&\f(CW\*(C`U\*(C'\fR specifier. +.IP \(bu 4 +some operators that interact with the platform's operating system +.Sp +Operators dealing with filenames are examples. +.IP \(bu 4 +when the functions are called from within the scope of the +\&\f(CW\*(C`use\ bytes\*(C'\fR pragma +.Sp +Likely, you should use this only for debugging anyway. +.RE +.RS 4 +.RE +.IP \(bu 4 +Strings\-\-including hash keys\-\-and regular expression patterns may +contain characters that have ordinal values larger than 255. +.Sp +If you use a Unicode editor to edit your program, Unicode characters may +occur directly within the literal strings in UTF\-8 encoding, or UTF\-16. +(The former requires a \f(CW\*(C`use utf8\*(C'\fR, the latter may require a \f(CW\*(C`BOM\*(C'\fR.) +.Sp +"Creating Unicode" in perluniintro gives other ways to place non-ASCII +characters in your strings. +.IP \(bu 4 +The \f(CWchr()\fR and \f(CWord()\fR functions work on whole characters. +.IP \(bu 4 +Regular expressions match whole characters. For example, \f(CW"."\fR matches +a whole character instead of only a single byte. +.IP \(bu 4 +The \f(CW\*(C`tr///\*(C'\fR operator translates whole characters. (Note that the +\&\f(CW\*(C`tr///CU\*(C'\fR functionality has been removed. For similar functionality to +that, see \f(CW\*(C`pack(\*(AqU0\*(Aq,\ ...)\*(C'\fR and \f(CW\*(C`pack(\*(AqC0\*(Aq,\ ...)\*(C'\fR). +.IP \(bu 4 +\&\f(CW\*(C`scalar reverse()\*(C'\fR reverses by character rather than by byte. +.IP \(bu 4 +The bit string operators, \f(CW\*(C`& | ^ ~\*(C'\fR and (starting in v5.22) +\&\f(CW\*(C`&. |. ^. 
~.\*(C'\fR can operate on bit strings encoded in UTF\-8, but this +can give unexpected results if any of the strings contain code points +above 0xFF. Starting in v5.28, it is a fatal error to have such an +operand. Otherwise, the operation is performed on a non\-UTF\-8 copy of +the operand. If you're not sure about the encoding of a string, +downgrade it before using any of these operators; you can use +\&\f(CWutf8::utf8_downgrade()\fR. +.PP +The bottom line is that Perl has always practiced "Character Semantics", +but with the advent of Unicode, that is now different than "Byte +Semantics". +.SS "ASCII Rules versus Unicode Rules" +.IX Subsection "ASCII Rules versus Unicode Rules" +Before Unicode, when a character was a byte was a character, +Perl knew only about the 128 characters defined by ASCII, code points 0 +through 127 (except for under \f(CW\*(C`use\ locale\*(C'\fR). That +left the code +points 128 to 255 as unassigned, and available for whatever use a +program might want. The only semantics they have is their ordinal +numbers, and that they are members of none of the non-negative character +classes. None are considered to match \f(CW\*(C`\ew\*(C'\fR for example, but all match +\&\f(CW\*(C`\eW\*(C'\fR. +.PP +Unicode, of course, assigns each of those code points a particular +meaning (along with ones above 255). To preserve backward +compatibility, Perl only uses the Unicode meanings when there is some +indication that Unicode is what is intended; otherwise the non-ASCII +code points remain treated as if they are unassigned. +.PP +Here are the ways that Perl knows that a string should be treated as +Unicode: +.IP \(bu 4 +Within the scope of \f(CW\*(C`use\ utf8\*(C'\fR +.Sp +If the whole program is Unicode (signified by using 8\-bit \fBU\fRnicode +\&\fBT\fRransformation \fBF\fRormat), then all literal strings within it must be +Unicode. 
+.IP \(bu 4 +Within the scope of +\&\f(CW\*(C`use\ feature\ \*(Aqunicode_strings\*(Aq\*(C'\fR +.Sp +This pragma was created so you can explicitly tell Perl that operations +executed within its scope are to use Unicode rules. More operations are +affected with newer perls. See "The "Unicode Bug"". +.IP \(bu 4 +Within the scope of \f(CW\*(C`use\ v5.12\*(C'\fR or higher +.Sp +This implicitly turns on \f(CW\*(C`use\ feature\ \*(Aqunicode_strings\*(Aq\*(C'\fR. +.IP \(bu 4 +Within the scope of +\&\f(CW\*(C`use\ locale\ \*(Aqnot_characters\*(Aq\*(C'\fR, +or \f(CW\*(C`use\ locale\*(C'\fR and the current +locale is a UTF\-8 locale. +.Sp +The former is defined to imply Unicode handling; and the latter +indicates a Unicode locale, hence a Unicode interpretation of all +strings within it. +.IP \(bu 4 +When the string contains a Unicode-only code point +.Sp +Perl has never accepted code points above 255 without them being +Unicode, so their use implies Unicode for the whole string. +.IP \(bu 4 +When the string contains a Unicode named code point \f(CW\*(C`\eN{...}\*(C'\fR +.Sp +The \f(CW\*(C`\eN{...}\*(C'\fR construct explicitly refers to a Unicode code point, +even if it is one that is also in ASCII. Therefore the string +containing it must be Unicode. +.IP \(bu 4 +When the string has come from an external source marked as +Unicode +.Sp +The \f(CW\*(C`\-C\*(C'\fR command line option can +specify that certain inputs to the program are Unicode, and the values +of this can be read by your Perl code, see "${^UNICODE}" in perlvar. +.IP \(bu 4 +When the string has been upgraded to UTF\-8 +.Sp +The function \f(CWutf8::utf8_upgrade()\fR +can be explicitly used to permanently (unless a subsequent +\&\f(CWutf8::utf8_downgrade()\fR is called) cause a string to be treated as +Unicode. 
+.IP \(bu 4 +There are additional methods for regular expression patterns +.Sp +A pattern that is compiled with the \f(CW\*(C`/u\*(C'\fR or \f(CW\*(C`/a\*(C'\fR modifiers is +treated as Unicode (though there are some restrictions with \f(CW\*(C`/a\*(C'\fR). +Under the \f(CW\*(C`/d\*(C'\fR and \f(CW\*(C`/l\*(C'\fR modifiers, there are several other +indications for Unicode; see "Character set modifiers" in perlre. +.PP +Note that all of the above are overridden within the scope of +\&\f(CW\*(C`use bytes\*(C'\fR; but you should be using this pragma only for +debugging. +.PP +Note also that some interactions with the platform's operating system +never use Unicode rules. +.PP +When Unicode rules are in effect: +.IP \(bu 4 +Case translation operators use the Unicode case translation tables. +.Sp +Note that \f(CWuc()\fR, or \f(CW\*(C`\eU\*(C'\fR in interpolated strings, translates to +uppercase, while \f(CW\*(C`ucfirst\*(C'\fR, or \f(CW\*(C`\eu\*(C'\fR in interpolated strings, +translates to titlecase in languages that make the distinction (which is +equivalent to uppercase in languages without the distinction). +.Sp +There is a CPAN module, \f(CW\*(C`Unicode::Casing\*(C'\fR, which allows you to +define your own mappings to be used in \f(CWlc()\fR, \f(CWlcfirst()\fR, \f(CWuc()\fR, +\&\f(CWucfirst()\fR, and \f(CW\*(C`fc\*(C'\fR (or their double-quoted string inlined versions +such as \f(CW\*(C`\eU\*(C'\fR). (Prior to Perl 5.16, this functionality was partially +provided in the Perl core, but suffered from a number of insurmountable +drawbacks, so the CPAN module was written instead.) +.IP \(bu 4 +Character classes in regular expressions match based on the character +properties specified in the Unicode properties database. +.Sp +\&\f(CW\*(C`\ew\*(C'\fR can be used to match a Japanese ideograph, for instance; and +\&\f(CW\*(C`[[:digit:]]\*(C'\fR a Bengali number. 
+.IP \(bu 4 +Named Unicode properties, scripts, and block ranges may be used (like +bracketed character classes) by using the \f(CW\*(C`\ep{}\*(C'\fR "matches property" +construct and the \f(CW\*(C`\eP{}\*(C'\fR negation, "doesn't match property". +.Sp +See "Unicode Character Properties" for more details. +.Sp +You can define your own character properties and use them +in the regular expression with the \f(CW\*(C`\ep{}\*(C'\fR or \f(CW\*(C`\eP{}\*(C'\fR construct. +See "User-Defined Character Properties" for more details. +.SS "Extended Grapheme Clusters (Logical characters)" +.IX Subsection "Extended Grapheme Clusters (Logical characters)" +Consider a character, say \f(CW\*(C`H\*(C'\fR. It could appear with various marks around it, +such as an acute accent, or a circumflex, or various hooks, circles, arrows, +\&\fIetc.\fR, above, below, to one side or the other, \fIetc\fR. There are many +possibilities among the world's languages. The number of combinations is +astronomical, and if there were a character for each combination, it would +soon exhaust Unicode's more than a million possible characters. So Unicode +took a different approach: there is a character for the base \f(CW\*(C`H\*(C'\fR, and a +character for each of the possible marks, and these can be variously combined +to get a final logical character. So a logical character\-\-what appears to be a +single character\-\-can be a sequence of more than one individual characters. +The Unicode standard calls these "extended grapheme clusters" (which +is an improved version of the no-longer much used "grapheme cluster"); +Perl furnishes the \f(CW\*(C`\eX\*(C'\fR regular expression construct to match such +sequences in their entirety. +.PP +But Unicode's intent is to unify the existing character set standards and +practices, and several pre-existing standards have single characters that +mean the same thing as some of these combinations, like ISO\-8859\-1, +which has quite a few of them. 
For example, \f(CW"LATIN CAPITAL LETTER E +WITH ACUTE"\fR was already in this standard when Unicode came along. +Unicode therefore added it to its repertoire as that single character. +But this character is considered by Unicode to be equivalent to the +sequence consisting of the character \f(CW"LATIN CAPITAL LETTER E"\fR +followed by the character \f(CW"COMBINING ACUTE ACCENT"\fR. +.PP +\&\f(CW"LATIN CAPITAL LETTER E WITH ACUTE"\fR is called a "pre-composed" +character, and its equivalence with the "E" and the "COMBINING ACUTE ACCENT" +sequence is called canonical equivalence. All pre-composed characters +are said to have a decomposition (into the equivalent sequence), and the +decomposition type is also called canonical. A string may be composed +as much as possible of precomposed characters, or it may be composed +entirely of decomposed characters. Unicode calls these, respectively, +"Normalization Form Composed" (NFC) and "Normalization Form Decomposed" (NFD). +The \f(CW\*(C`Unicode::Normalize\*(C'\fR module contains functions that convert +between the two. A string may also have both composed characters and +decomposed characters; this module can be used to make it all one or the +other. +.PP +You may be presented with strings in any of these equivalent forms. +There is currently nothing in Perl 5 that ignores the differences. So +you'll have to specially handle it. The usual advice is to convert your +inputs to \f(CW\*(C`NFD\*(C'\fR before processing further. +.PP +For more detailed information, see <http://unicode.org/reports/tr15/>. +.SS "Unicode Character Properties" +.IX Subsection "Unicode Character Properties" +(The only time that Perl considers a sequence of individual code +points as a single logical character is in the \f(CW\*(C`\eX\*(C'\fR construct, already +mentioned above. Therefore "character" in this discussion means a single +Unicode code point.) 
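As a small illustration of the \f(CW\eX\fR construct and of canonical equivalence, the following sketch compares the pre-composed and decomposed spellings of "LATIN CAPITAL LETTER E WITH ACUTE". It assumes only the core \f(CWUnicode::Normalize\fR module; the variable names are invented for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\N{U+00C9}";     # LATIN CAPITAL LETTER E WITH ACUTE
my $decomposed  = "E\N{U+0301}";    # LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT

# Perl compares code point by code point, so the two spellings differ
# until both sides are converted to the same normalization form.
my $raw_equal = $precomposed eq $decomposed           ? 1 : 0;
my $nfd_equal = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

# length() counts code points; \X matches a whole extended grapheme cluster.
my $code_points = length $decomposed;
my $clusters    = () = $decomposed =~ /\X/g;

print "equal as given:  $raw_equal\n";
print "equal after NFD: $nfd_equal\n";
print "equal after NFC: $nfc_equal\n";
print "code points: $code_points, grapheme clusters: $clusters\n";
```

Normalizing both operands to the same form before comparing is the handling the text above recommends; which form to pick is a design choice, and NFD is used here only because it is the one the text names.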
+.PP +Very nearly all Unicode character properties are accessible through +regular expressions by using the \f(CW\*(C`\ep{}\*(C'\fR "matches property" construct +and the \f(CW\*(C`\eP{}\*(C'\fR "doesn't match property" for its negation. +.PP +For instance, \f(CW\*(C`\ep{Uppercase}\*(C'\fR matches any single character with the Unicode +\&\f(CW"Uppercase"\fR property, while \f(CW\*(C`\ep{L}\*(C'\fR matches any character with a +\&\f(CW\*(C`General_Category\*(C'\fR of \f(CW"L"\fR (letter) property (see +"General_Category" below). Brackets are not +required for single letter property names, so \f(CW\*(C`\ep{L}\*(C'\fR is equivalent to \f(CW\*(C`\epL\*(C'\fR. +.PP +More formally, \f(CW\*(C`\ep{Uppercase}\*(C'\fR matches any single character whose Unicode +\&\f(CW\*(C`Uppercase\*(C'\fR property value is \f(CW\*(C`True\*(C'\fR, and \f(CW\*(C`\eP{Uppercase}\*(C'\fR matches any character +whose \f(CW\*(C`Uppercase\*(C'\fR property value is \f(CW\*(C`False\*(C'\fR, and they could have been written as +\&\f(CW\*(C`\ep{Uppercase=True}\*(C'\fR and \f(CW\*(C`\ep{Uppercase=False}\*(C'\fR, respectively. +.PP +This formality is needed when properties are not binary; that is, if they can +take on more values than just \f(CW\*(C`True\*(C'\fR and \f(CW\*(C`False\*(C'\fR. For example, the +\&\f(CW\*(C`Bidi_Class\*(C'\fR property (see "Bidirectional Character Types" below), +can take on several different +values, such as \f(CW\*(C`Left\*(C'\fR, \f(CW\*(C`Right\*(C'\fR, \f(CW\*(C`Whitespace\*(C'\fR, and others. To match these, one needs +to specify both the property name (\f(CW\*(C`Bidi_Class\*(C'\fR), AND the value being +matched against +(\f(CW\*(C`Left\*(C'\fR, \f(CW\*(C`Right\*(C'\fR, \fIetc.\fR). This is done, as in the examples above, by having the +two components separated by an equal sign (or interchangeably, a colon), like +\&\f(CW\*(C`\ep{Bidi_Class: Left}\*(C'\fR. 
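The single and compound forms described above can be tried out with a short sketch. The helper \f(CWmatches()\fR and the hash keys are hypothetical names used only here, and the Bidi_Class values are written with their short names \f(CWL\fR and \f(CWR\fR as listed in perluniprops.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string matches $pattern, else 0.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

my $upper = "\N{U+00C9}";   # LATIN CAPITAL LETTER E WITH ACUTE
my $lower = "\N{U+00E9}";   # LATIN SMALL LETTER E WITH ACUTE
my $alef  = "\N{U+05D0}";   # HEBREW LETTER ALEF

my %result = (
    single_form    => matches($upper, qr/\p{Uppercase}/),
    compound_true  => matches($upper, qr/\p{Uppercase=True}/),
    negated_form   => matches($lower, qr/\P{Uppercase}/),
    compound_false => matches($lower, qr/\p{Uppercase=False}/),
    braced_letter  => matches($lower, qr/\p{L}/),
    bare_letter    => matches($lower, qr/\pL/),
    bidi_left      => matches('A',    qr/\p{Bidi_Class: L}/),
    bidi_right     => matches($alef,  qr/\p{Bidi_Class=R}/),
    alef_as_left   => matches($alef,  qr/\p{Bidi_Class: L}/),
);

printf "%-15s %d\n", $_, $result{$_} for sort keys %result;
```

Every entry except \f(CWalef_as_left\fR is expected to be 1, since HEBREW LETTER ALEF has a right-to-left Bidi_Class.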
+.PP +All Unicode-defined character properties may be written in these compound forms +of \f(CW\*(C`\ep{\fR\f(CIproperty\fR\f(CW=\fR\f(CIvalue\fR\f(CW}\*(C'\fR or \f(CW\*(C`\ep{\fR\f(CIproperty\fR\f(CW:\fR\f(CIvalue\fR\f(CW}\*(C'\fR, but Perl provides some +additional properties that are written only in the single form, as well as +single-form short-cuts for all binary properties and certain others described +below, in which you may omit the property name and the equals or colon +separator. +.PP +Most Unicode character properties have at least two synonyms (or aliases if you +prefer): a short one that is easier to type and a longer one that is more +descriptive and hence easier to understand. Thus the \f(CW"L"\fR and +\&\f(CW"Letter"\fR properties above are equivalent and can be used +interchangeably. Likewise, \f(CW"Upper"\fR is a synonym for \f(CW"Uppercase"\fR, +and we could have written \f(CW\*(C`\ep{Uppercase}\*(C'\fR equivalently as \f(CW\*(C`\ep{Upper}\*(C'\fR. +Also, there are typically various synonyms for the values the property +can be. For binary properties, \f(CW"True"\fR has 3 synonyms: \f(CW"T"\fR, +\&\f(CW"Yes"\fR, and \f(CW"Y"\fR; and \f(CW"False"\fR has correspondingly \f(CW"F"\fR, +\&\f(CW"No"\fR, and \f(CW"N"\fR. But be careful. A short form of a value for one +property may not mean the same thing as the short form spelled the same +for another. +Thus, for the \f(CW"General_Category"\fR property, \f(CW"L"\fR means +\&\f(CW"Letter"\fR, but for the \f(CW\*(C`Bidi_Class\*(C'\fR +property, \f(CW"L"\fR means \f(CW"Left"\fR. A complete list of properties and +synonyms is in perluniprops. +.PP +Upper/lower case differences in property names and values are irrelevant; +thus \f(CW\*(C`\ep{Upper}\*(C'\fR means the same thing as \f(CW\*(C`\ep{upper}\*(C'\fR or even \f(CW\*(C`\ep{UpPeR}\*(C'\fR. +Similarly, you can add or subtract underscores anywhere in the middle of a +word, so that these are also equivalent to \f(CW\*(C`\ep{U_p_p_e_r}\*(C'\fR. 
And white space +is generally irrelevant adjacent to non-word characters, such as the +braces and the equals or colon separators, so \f(CW\*(C`\ep{ Upper }\*(C'\fR and +\&\f(CW\*(C`\ep{ Upper_case : Y }\*(C'\fR are equivalent to these as well. In fact, white +space and even hyphens can usually be added or deleted anywhere. So +even \f(CW\*(C`\ep{ Up\-per case = Yes}\*(C'\fR is equivalent. All this is called +"loose-matching" by Unicode. The "name" property has some restrictions +on this due to a few outlier names. Full details are given in +<https://www.unicode.org/reports/tr44/tr44\-24.html#UAX44\-LM2>. +.PP +The few places where stricter matching is +used are the middle of numbers, the "name" property, and the Perl +extension properties that begin or end with an underscore. Stricter +matching cares about white space (except adjacent to non-word +characters), hyphens, and non-interior underscores. +.PP +You can also use negation in both \f(CW\*(C`\ep{}\*(C'\fR and \f(CW\*(C`\eP{}\*(C'\fR by introducing a caret +(\f(CW\*(C`^\*(C'\fR) between the first brace and the property name: \f(CW\*(C`\ep{^Tamil}\*(C'\fR is +equivalent to \f(CW\*(C`\eP{Tamil}\*(C'\fR. +.PP +Almost all properties are immune to case-insensitive matching. That is, +adding a \f(CW\*(C`/i\*(C'\fR regular expression modifier does not change what they +match. There are two sets that are affected. +The first set is +\&\f(CW\*(C`Uppercase_Letter\*(C'\fR, +\&\f(CW\*(C`Lowercase_Letter\*(C'\fR, +and \f(CW\*(C`Titlecase_Letter\*(C'\fR, +all of which match \f(CW\*(C`Cased_Letter\*(C'\fR under \f(CW\*(C`/i\*(C'\fR matching. +And the second set is +\&\f(CW\*(C`Uppercase\*(C'\fR, +\&\f(CW\*(C`Lowercase\*(C'\fR, +and \f(CW\*(C`Titlecase\*(C'\fR, +all of which match \f(CW\*(C`Cased\*(C'\fR under \f(CW\*(C`/i\*(C'\fR matching. +This set also includes its subsets \f(CW\*(C`PosixUpper\*(C'\fR and \f(CW\*(C`PosixLower\*(C'\fR both +of which under \f(CW\*(C`/i\*(C'\fR match \f(CW\*(C`PosixAlpha\*(C'\fR. 
+(The difference between these sets is that some things, such as Roman +numerals, come in both upper and lower case so they are \f(CW\*(C`Cased\*(C'\fR, but +aren't considered letters, so they aren't \f(CW\*(C`Cased_Letter\*(C'\fR's.) +.PP +See "Beyond Unicode code points" for special considerations when +matching Unicode properties against non-Unicode code points. +.PP +\fR\f(BIGeneral_Category\fR\fI\fR +.IX Subsection "General_Category" +.PP +Every Unicode character is assigned a general category, which is the "most +usual categorization of a character" (from +<https://www.unicode.org/reports/tr44>). +.PP +The compound way of writing these is like \f(CW\*(C`\ep{General_Category=Number}\*(C'\fR +(short: \f(CW\*(C`\ep{gc:n}\*(C'\fR). But Perl furnishes shortcuts in which everything up +through the equal or colon separator is omitted. So you can instead just write +\&\f(CW\*(C`\epN\*(C'\fR. +.PP +Here are the short and long forms of the values the \f(CW\*(C`General Category\*(C'\fR property +can have: +.PP +.Vb 1 +\& Short Long +\& +\& L Letter +\& LC, L& Cased_Letter (that is: [\ep{Ll}\ep{Lu}\ep{Lt}]) +\& Lu Uppercase_Letter +\& Ll Lowercase_Letter +\& Lt Titlecase_Letter +\& Lm Modifier_Letter +\& Lo Other_Letter +\& +\& M Mark +\& Mn Nonspacing_Mark +\& Mc Spacing_Mark +\& Me Enclosing_Mark +\& +\& N Number +\& Nd Decimal_Number (also Digit) +\& Nl Letter_Number +\& No Other_Number +\& +\& P Punctuation (also Punct) +\& Pc Connector_Punctuation +\& Pd Dash_Punctuation +\& Ps Open_Punctuation +\& Pe Close_Punctuation +\& Pi Initial_Punctuation +\& (may behave like Ps or Pe depending on usage) +\& Pf Final_Punctuation +\& (may behave like Ps or Pe depending on usage) +\& Po Other_Punctuation +\& +\& S Symbol +\& Sm Math_Symbol +\& Sc Currency_Symbol +\& Sk Modifier_Symbol +\& So Other_Symbol +\& +\& Z Separator +\& Zs Space_Separator +\& Zl Line_Separator +\& Zp Paragraph_Separator +\& +\& C Other +\& Cc Control (also Cntrl) +\& Cf Format +\& Cs Surrogate +\& Co 
Private_Use +\& Cn Unassigned +.Ve +.PP +Single-letter properties match all characters in any of the +two-letter sub-properties starting with the same letter. +\&\f(CW\*(C`LC\*(C'\fR and \f(CW\*(C`L&\*(C'\fR are special: both are aliases for the set consisting of everything matched by \f(CW\*(C`Ll\*(C'\fR, \f(CW\*(C`Lu\*(C'\fR, and \f(CW\*(C`Lt\*(C'\fR. +.PP +\fR\f(BIBidirectional Character Types\fR\fI\fR +.IX Subsection "Bidirectional Character Types" +.PP +Because scripts differ in their directionality (Hebrew and Arabic are +written right to left, for example) Unicode supplies a \f(CW\*(C`Bidi_Class\*(C'\fR property. +Some of the values this property can have are: +.PP +.Vb 1 +\& Value Meaning +\& +\& L Left\-to\-Right +\& LRE Left\-to\-Right Embedding +\& LRO Left\-to\-Right Override +\& R Right\-to\-Left +\& AL Arabic Letter +\& RLE Right\-to\-Left Embedding +\& RLO Right\-to\-Left Override +\& PDF Pop Directional Format +\& EN European Number +\& ES European Separator +\& ET European Terminator +\& AN Arabic Number +\& CS Common Separator +\& NSM Non\-Spacing Mark +\& BN Boundary Neutral +\& B Paragraph Separator +\& S Segment Separator +\& WS Whitespace +\& ON Other Neutrals +.Ve +.PP +This property is always written in the compound form. +For example, \f(CW\*(C`\ep{Bidi_Class:R}\*(C'\fR matches characters that are normally +written right to left. Unlike the +\&\f(CW"General_Category"\fR property, this +property can have more values added in a future Unicode release. Those +listed above comprised the complete set for many Unicode releases, but +others were added in Unicode 6.3; you can always find what the +current ones are in perluniprops. And +<https://www.unicode.org/reports/tr9/> describes how to use them. +.PP +\fR\f(BIScripts\fR\fI\fR +.IX Subsection "Scripts" +.PP +The world's languages are written in many different scripts. 
This sentence +(unless you're reading it in translation) is written in Latin, while Russian is +written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in +Hiragana or Katakana. There are many more. +.PP +The Unicode \f(CW\*(C`Script\*(C'\fR and \f(CW\*(C`Script_Extensions\*(C'\fR properties give what +script a given character is in. The \f(CW\*(C`Script_Extensions\*(C'\fR property is an +improved version of \f(CW\*(C`Script\*(C'\fR, as demonstrated below. Either property +can be specified with the compound form like +\&\f(CW\*(C`\ep{Script=Hebrew}\*(C'\fR (short: \f(CW\*(C`\ep{sc=hebr}\*(C'\fR), or +\&\f(CW\*(C`\ep{Script_Extensions=Javanese}\*(C'\fR (short: \f(CW\*(C`\ep{scx=java}\*(C'\fR). +In addition, Perl furnishes shortcuts for all +\&\f(CW\*(C`Script_Extensions\*(C'\fR property names. You can omit everything up through +the equals (or colon), and simply write \f(CW\*(C`\ep{Latin}\*(C'\fR or \f(CW\*(C`\eP{Cyrillic}\*(C'\fR. +(This is not true for \f(CW\*(C`Script\*(C'\fR, which is required to be +written in the compound form. Prior to Perl v5.26, the single form +returned the plain old \f(CW\*(C`Script\*(C'\fR version, but was changed because +\&\f(CW\*(C`Script_Extensions\*(C'\fR gives better results.) +.PP +The difference between these two properties involves characters that are +used in multiple scripts. For example the digits '0' through '9' are +used in many parts of the world. These are placed in a script named +\&\f(CW\*(C`Common\*(C'\fR. Other characters are used in just a few scripts. For +example, the \f(CW"KATAKANA\-HIRAGANA DOUBLE HYPHEN"\fR is used in both Japanese +scripts, Katakana and Hiragana, but nowhere else. 
The \f(CW\*(C`Script\*(C'\fR +property places all characters that are used in multiple scripts in the +\&\f(CW\*(C`Common\*(C'\fR script, while the \f(CW\*(C`Script_Extensions\*(C'\fR property places those +that are used in only a few scripts into each of those scripts, while +still using \f(CW\*(C`Common\*(C'\fR for those used in many scripts. Thus both these +match: +.PP +.Vb 2 +\& "0" =~ /\ep{sc=Common}/ # Matches +\& "0" =~ /\ep{scx=Common}/ # Matches +.Ve +.PP +and only the first of these matches: +.PP +.Vb 2 +\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{sc=Common}/ # Matches +\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{scx=Common}/ # No match +.Ve +.PP +And only the last two of these match: +.PP +.Vb 4 +\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{sc=Hiragana}/ # No match +\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{sc=Katakana}/ # No match +\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{scx=Hiragana}/ # Matches +\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{scx=Katakana}/ # Matches +.Ve +.PP +\&\f(CW\*(C`Script_Extensions\*(C'\fR is thus an improved \f(CW\*(C`Script\*(C'\fR, in which there are +fewer characters in the \f(CW\*(C`Common\*(C'\fR script, and correspondingly more in +other scripts. It is new in Unicode version 6.0, and its data are likely +to change significantly in later releases, as things get sorted out. +New code should probably be using \f(CW\*(C`Script_Extensions\*(C'\fR and not plain +\&\f(CW\*(C`Script\*(C'\fR. If you compile perl with a Unicode release that doesn't have +\&\f(CW\*(C`Script_Extensions\*(C'\fR, the single form Perl extensions will instead refer +to the plain \f(CW\*(C`Script\*(C'\fR property. If you compile with a version of +Unicode that doesn't have the \f(CW\*(C`Script\*(C'\fR property, these extensions will +not be defined at all. +.PP +(Actually, besides \f(CW\*(C`Common\*(C'\fR, the \f(CW\*(C`Inherited\*(C'\fR script contains +characters that are used in multiple scripts. 
These are modifier +characters which inherit the script value +of the controlling character. Some of these are used in many scripts, +and so go into \f(CW\*(C`Inherited\*(C'\fR in both \f(CW\*(C`Script\*(C'\fR and \f(CW\*(C`Script_Extensions\*(C'\fR. +Others are used in just a few scripts, so are in \f(CW\*(C`Inherited\*(C'\fR in +\&\f(CW\*(C`Script\*(C'\fR, but not in \f(CW\*(C`Script_Extensions\*(C'\fR.) +.PP +It is worth stressing that there are several different sets of digits in +Unicode that are equivalent to 0\-9 and are matchable by \f(CW\*(C`\ed\*(C'\fR in a +regular expression. If they are used in a single language only, they +are in that language's \f(CW\*(C`Script\*(C'\fR and \f(CW\*(C`Script_Extensions\*(C'\fR. If they are +used in more than one script, they will be in \f(CW\*(C`sc=Common\*(C'\fR, but only +if they are used in many scripts should they be in \f(CW\*(C`scx=Common\*(C'\fR. +.PP +The explanation above has omitted some detail; refer to UAX#24 "Unicode +Script Property": <https://www.unicode.org/reports/tr24>. +.PP +A complete list of scripts and their shortcuts is in perluniprops. +.PP +\fR\f(BIUse of the \fR\f(CB"Is"\fR\f(BI Prefix\fR\fI\fR +.IX Subsection "Use of the ""Is"" Prefix" +.PP +For backward compatibility (with ancient Perl 5.6), all properties writable +without using the compound form mentioned +so far may have \f(CW\*(C`Is\*(C'\fR or \f(CW\*(C`Is_\*(C'\fR prepended to their name, so \f(CW\*(C`\eP{Is_Lu}\*(C'\fR, for +example, is equal to \f(CW\*(C`\eP{Lu}\*(C'\fR, and \f(CW\*(C`\ep{IsScript:Arabic}\*(C'\fR is equal to +\&\f(CW\*(C`\ep{Arabic}\*(C'\fR. +.PP +\fR\f(BIBlocks\fR\fI\fR +.IX Subsection "Blocks" +.PP +In addition to \fBscripts\fR, Unicode also defines \fBblocks\fR of +characters. The difference between scripts and blocks is that the +concept of scripts is closer to natural languages, while the concept +of blocks is more of an artificial grouping based on groups of Unicode +characters with consecutive ordinal values. 
For example, the \f(CW"Basic Latin"\fR +block is all the characters whose ordinals are between 0 and 127, inclusive; in +other words, the ASCII characters. The \f(CW"Latin"\fR script contains some letters +from this as well as several other blocks, like \f(CW"Latin\-1 Supplement"\fR, +\&\f(CW"Latin Extended\-A"\fR, \fIetc.\fR, but it does not contain all the characters from +those blocks. It does not, for example, contain the digits 0\-9, because +those digits are shared across many scripts, and hence are in the +\&\f(CW\*(C`Common\*(C'\fR script. +.PP +For more about scripts versus blocks, see UAX#24 "Unicode Script Property": +<https://www.unicode.org/reports/tr24> +.PP +The \f(CW\*(C`Script_Extensions\*(C'\fR or \f(CW\*(C`Script\*(C'\fR properties are likely to be the +ones you want to use when processing +natural language; the \f(CW\*(C`Block\*(C'\fR property may occasionally be useful in working +with the nuts and bolts of Unicode. +.PP +Block names are matched in the compound form, like \f(CW\*(C`\ep{Block: Arrows}\*(C'\fR or +\&\f(CW\*(C`\ep{Blk=Hebrew}\*(C'\fR. Unlike most other properties, only a few block names have a +Unicode-defined short name. +.PP +Perl also defines single form synonyms for the block property in cases +where these do not conflict with something else. But don't use any of +these, because they are unstable. Since these are Perl extensions, they +are subordinate to official Unicode property names; Unicode doesn't know +nor care about Perl's extensions. It may happen that a name that +currently means the Perl extension will later be changed without warning +to mean a different Unicode property in a future version of the perl +interpreter that uses a later Unicode release, and your code would no +longer work. 
The extensions are mentioned here for completeness: Take +the block name and prefix it with one of: \f(CW\*(C`In\*(C'\fR (for example +\&\f(CW\*(C`\ep{Blk=Arrows}\*(C'\fR can currently be written as \f(CW\*(C`\ep{In_Arrows}\*(C'\fR); or +sometimes \f(CW\*(C`Is\*(C'\fR (like \f(CW\*(C`\ep{Is_Arrows}\*(C'\fR); or sometimes no prefix at all +(\f(CW\*(C`\ep{Arrows}\*(C'\fR). As of this writing (Unicode 9.0) there are no +conflicts with using the \f(CW\*(C`In_\*(C'\fR prefix, but there are plenty with the +other two forms. For example, \f(CW\*(C`\ep{Is_Hebrew}\*(C'\fR and \f(CW\*(C`\ep{Hebrew}\*(C'\fR mean +\&\f(CW\*(C`\ep{Script_Extensions=Hebrew}\*(C'\fR which is NOT the same thing as +\&\f(CW\*(C`\ep{Blk=Hebrew}\*(C'\fR. Our +advice used to be to use the \f(CW\*(C`In_\*(C'\fR prefix as a single form way of +specifying a block. But Unicode 8.0 added properties whose names begin +with \f(CW\*(C`In\*(C'\fR, and it's now clear that it's only luck that's so far +prevented a conflict. Using \f(CW\*(C`In\*(C'\fR is only marginally less typing than +\&\f(CW\*(C`Blk:\*(C'\fR, and the latter's meaning is clearer anyway, and guaranteed to +never conflict. So don't take chances. Use \f(CW\*(C`\ep{Blk=foo}\*(C'\fR for new +code. And be sure that block is what you really really want to do. In +most cases scripts are what you want instead. +.PP +A complete list of blocks is in perluniprops. +.PP +\fR\f(BIOther Properties\fR\fI\fR +.IX Subsection "Other Properties" +.PP +There are many more properties than the very basic ones described here. +A complete list is in perluniprops. +.PP +Unicode defines all its properties in the compound form, so all single-form +properties are Perl extensions. Most of these are just synonyms for the +Unicode ones, but some are genuine extensions, including several that are in +the compound form. And quite a few of these are actually recommended by Unicode +(in <https://www.unicode.org/reports/tr18>). 
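As a hedged illustration of the single-form synonyms just described, the following sketch checks that a few single-form names match the same sample characters as their compound forms. It assumes Perl v5.26 or later, where single-form script names refer to \f(CW\*(C`Script_Extensions\*(C'\fR; the helper name \f(CW\*(C`forms_agree\*(C'\fR and the sample characters are chosen here for illustration only.

```perl
use strict;
use warnings;

# Each pair holds a single-form Perl extension and the compound form
# it is expected to stand for (assumes Perl v5.26 or later).
my @pairs = (
    [ qr/\p{Lu}/,    qr/\p{General_Category=Uppercase_Letter}/ ],
    [ qr/\p{Latin}/, qr/\p{Script_Extensions=Latin}/ ],
    [ qr/\p{Greek}/, qr/\p{scx=Greek}/ ],
);

# Illustrative samples: "A", "a", GREEK SMALL LETTER ALPHA, and digit "0".
my @samples = ("A", "a", "\x{3B1}", "0");

# forms_agree() is a hypothetical helper: it returns 1 if every pair
# gives the same match result on every sample, and 0 otherwise.
sub forms_agree {
    for my $pair (@pairs) {
        my ($single, $compound) = @$pair;
        for my $char (@samples) {
            if (($char =~ $single) xor ($char =~ $compound)) {
                return 0;
            }
        }
    }
    return 1;
}

print forms_agree() ? "forms agree\n" : "forms differ\n";
```

The check covers only the listed samples, so it demonstrates the synonymy rather than proving it for every code point.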
+.PP +This section gives some details on all extensions that aren't just +synonyms for compound-form Unicode properties +(for those properties, you'll have to refer to the +Unicode Standard <https://www.unicode.org/reports/tr44>). +.ie n .IP "\fR\fB""\ep{All}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{All}\fR\fB\fR 4 +.IX Item "p{All}" +This matches every possible code point. It is equivalent to \f(CW\*(C`qr/./s\*(C'\fR. +Unlike all the other non-user-defined \f(CW\*(C`\ep{}\*(C'\fR property matches, no +warning is ever generated if this property is matched against a +non-Unicode code point (see "Beyond Unicode code points" below). +.ie n .IP "\fR\fB""\ep{Alnum}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{Alnum}\fR\fB\fR 4 +.IX Item "p{Alnum}" +This matches any \f(CW\*(C`\ep{Alphabetic}\*(C'\fR or \f(CW\*(C`\ep{Decimal_Number}\*(C'\fR character. +.ie n .IP "\fR\fB""\ep{Any}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{Any}\fR\fB\fR 4 +.IX Item "p{Any}" +This matches any of the 1_114_112 Unicode code points. It is a synonym +for \f(CW\*(C`\ep{Unicode}\*(C'\fR. +.ie n .IP "\fR\fB""\ep{ASCII}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{ASCII}\fR\fB\fR 4 +.IX Item "p{ASCII}" +This matches any of the 128 characters in the US-ASCII character set, +which is a subset of Unicode. +.ie n .IP "\fR\fB""\ep{Assigned}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{Assigned}\fR\fB\fR 4 +.IX Item "p{Assigned}" +This matches any assigned code point; that is, any code point whose general +category is not \f(CW\*(C`Unassigned\*(C'\fR (or equivalently, not \f(CW\*(C`Cn\*(C'\fR). +.ie n .IP "\fR\fB""\ep{Blank}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{Blank}\fR\fB\fR 4 +.IX Item "p{Blank}" +This is the same as \f(CW\*(C`\eh\*(C'\fR and \f(CW\*(C`\ep{HorizSpace}\*(C'\fR: A character that changes the +spacing horizontally. 
+.ie n .IP "\fR\fB""\ep{Decomposition_Type: Non_Canonical}""\fR\fB\fR (Short: ""\ep{Dt=NonCanon}"")" 4 +.el .IP "\fR\f(CB\ep{Decomposition_Type: Non_Canonical}\fR\fB\fR (Short: \f(CW\ep{Dt=NonCanon}\fR)" 4 +.IX Item "p{Decomposition_Type: Non_Canonical} (Short: p{Dt=NonCanon})" +Matches a character that has any of the non-canonical decomposition +types. Canonical decompositions are introduced in the +"Extended Grapheme Clusters (Logical characters)" section above. +However, many more characters have a different type of decomposition, +generically called "compatible" decompositions, or "non-canonical". The +sequences that form these decompositions are not considered canonically +equivalent to the pre-composed character. An example is the +\&\f(CW"SUPERSCRIPT ONE"\fR. It is somewhat like a regular digit 1, but not +exactly; its decomposition into the digit 1 is called a "compatible" +decomposition, specifically a "super" (for "superscript") decomposition. +There are several such compatibility decompositions (see +<https://www.unicode.org/reports/tr44>). \f(CW\*(C`\ep{Dt:\ Non_Canon}\*(C'\fR is a +Perl extension that uses just one name to refer to the union of all of +them. +.Sp +Most Unicode characters don't have a decomposition, so their +decomposition type is \f(CW"None"\fR. Hence, \f(CW\*(C`Non_Canonical\*(C'\fR is equivalent +to +.Sp +.Vb 1 +\& qr/(?[ \eP{DT=Canonical} \- \ep{DT=None} ])/ +.Ve +.Sp +(Note that one of the non-canonical decompositions is named "compat", +which could perhaps have been better named "miscellaneous". It includes +just the things that Unicode couldn't figure out a better generic name +for.) +.ie n .IP "\fR\fB""\ep{Graph}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{Graph}\fR\fB\fR 4 +.IX Item "p{Graph}" +Matches any character that is graphic. Theoretically, this means a character +that on a printer would cause ink to be used. 
+.ie n .IP "\fR\fB""\ep{HorizSpace}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{HorizSpace}\fR\fB\fR 4 +.IX Item "p{HorizSpace}" +This is the same as \f(CW\*(C`\eh\*(C'\fR and \f(CW\*(C`\ep{Blank}\*(C'\fR: a character that changes the +spacing horizontally. +.ie n .IP "\fR\fB""\ep{In=*}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{In=*}\fR\fB\fR 4 +.IX Item "p{In=*}" +This is a synonym for \f(CW\*(C`\ep{Present_In=*}\*(C'\fR +.ie n .IP "\fR\fB""\ep{PerlSpace}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{PerlSpace}\fR\fB\fR 4 +.IX Item "p{PerlSpace}" +This is the same as \f(CW\*(C`\es\*(C'\fR, restricted to ASCII, namely \f(CW\*(C`[\ \ef\en\er\et]\*(C'\fR +and starting in Perl v5.18, a vertical tab. +.Sp +Mnemonic: Perl's (original) space +.ie n .IP "\fR\fB""\ep{PerlWord}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{PerlWord}\fR\fB\fR 4 +.IX Item "p{PerlWord}" +This is the same as \f(CW\*(C`\ew\*(C'\fR, restricted to ASCII, namely \f(CW\*(C`[A\-Za\-z0\-9_]\*(C'\fR +.Sp +Mnemonic: Perl's (original) word. +.ie n .IP "\fR\fB""\ep{Posix...}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{Posix...}\fR\fB\fR 4 +.IX Item "p{Posix...}" +There are several of these, which are equivalents, using the \f(CW\*(C`\ep{}\*(C'\fR +notation, for Posix classes and are described in +"POSIX Character Classes" in perlrecharclass. +.ie n .IP "\fR\fB""\ep{Present_In: *}""\fR\fB\fR (Short: ""\ep{In=*}"")" 4 +.el .IP "\fR\f(CB\ep{Present_In: *}\fR\fB\fR (Short: \f(CW\ep{In=*}\fR)" 4 +.IX Item "p{Present_In: *} (Short: p{In=*})" +This property is used when you need to know in what Unicode version(s) a +character is. +.Sp +The "*" above stands for some Unicode version number, such as +\&\f(CW1.1\fR or \f(CW12.0\fR; or the "*" can also be \f(CW\*(C`Unassigned\*(C'\fR. This property will +match the code points whose final disposition has been settled as of the +Unicode release given by the version number; \f(CW\*(C`\ep{Present_In: Unassigned}\*(C'\fR +will match those code points whose meaning has yet to be assigned. 
+.Sp +For example, \f(CW\*(C`U+0041\*(C'\fR \f(CW"LATIN CAPITAL LETTER A"\fR was present in the very first +Unicode release available, which is \f(CW1.1\fR, so this property is true for all +valid "*" versions. On the other hand, \f(CW\*(C`U+1EFF\*(C'\fR was not assigned until version +5.1 when it became \f(CW"LATIN SMALL LETTER Y WITH LOOP"\fR, so the only "*" that +would match it are 5.1, 5.2, and later. +.Sp +Unicode furnishes the \f(CW\*(C`Age\*(C'\fR property from which this is derived. The problem +with Age is that a strict interpretation of it (which Perl takes) has it +matching the precise release a code point's meaning is introduced in. Thus +\&\f(CW\*(C`U+0041\*(C'\fR would match only 1.1; and \f(CW\*(C`U+1EFF\*(C'\fR only 5.1. This is not usually what +you want. +.Sp +Some non-Perl implementations of the Age property may change its meaning to be +the same as the Perl \f(CW\*(C`Present_In\*(C'\fR property; just be aware of that. +.Sp +Another confusion with both these properties is that the definition is not +that the code point has been \fIassigned\fR, but that the meaning of the code point +has been \fIdetermined\fR. This is because 66 code points will always be +unassigned, and so the \f(CW\*(C`Age\*(C'\fR for them is the Unicode version in which the decision +to make them so was made. For example, \f(CW\*(C`U+FDD0\*(C'\fR is to be permanently +unassigned to a character, and the decision to do that was made in version 3.1, +so \f(CW\*(C`\ep{Age=3.1}\*(C'\fR matches this character, as also does \f(CW\*(C`\ep{Present_In: 3.1}\*(C'\fR and up. +.ie n .IP "\fR\fB""\ep{Print}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{Print}\fR\fB\fR 4 +.IX Item "p{Print}" +This matches any character that is graphical or blank, except controls. +.ie n .IP "\fR\fB""\ep{SpacePerl}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{SpacePerl}\fR\fB\fR 4 +.IX Item "p{SpacePerl}" +This is the same as \f(CW\*(C`\es\*(C'\fR, including beyond ASCII. +.Sp +Mnemonic: Space, as modified by Perl. 
(Until v5.18, it didn't include the vertical tab, +which both the Posix standard and Unicode consider white space.) +.ie n .IP "\fR\fB""\ep{Title}""\fR\fB\fR and \fB\fR\fB""\ep{Titlecase}""\fR\fB\fR" 4 +.el .IP "\fR\f(CB\ep{Title}\fR\fB\fR and \fB\fR\f(CB\ep{Titlecase}\fR\fB\fR" 4 +.IX Item "p{Title} and p{Titlecase}" +Under case-sensitive matching, these both match the same code points as +\&\f(CW\*(C`\ep{General Category=Titlecase_Letter}\*(C'\fR (\f(CW\*(C`\ep{gc=lt}\*(C'\fR). The difference +is that under \f(CW\*(C`/i\*(C'\fR caseless matching, these match the same as +\&\f(CW\*(C`\ep{Cased}\*(C'\fR, whereas \f(CW\*(C`\ep{gc=lt}\*(C'\fR matches \f(CW\*(C`\ep{Cased_Letter}\*(C'\fR. +.ie n .IP "\fR\fB""\ep{Unicode}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{Unicode}\fR\fB\fR 4 +.IX Item "p{Unicode}" +This matches any of the 1_114_112 Unicode code points. It is a synonym +for \f(CW\*(C`\ep{Any}\*(C'\fR. +.ie n .IP "\fR\fB""\ep{VertSpace}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{VertSpace}\fR\fB\fR 4 +.IX Item "p{VertSpace}" +This is the same as \f(CW\*(C`\ev\*(C'\fR: A character that changes the spacing vertically. +.ie n .IP "\fR\fB""\ep{Word}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{Word}\fR\fB\fR 4 +.IX Item "p{Word}" +This is the same as \f(CW\*(C`\ew\*(C'\fR, including over 100_000 characters beyond ASCII. +.ie n .IP "\fR\fB""\ep{XPosix...}""\fR\fB\fR" 4 +.el .IP \fR\f(CB\ep{XPosix...}\fR\fB\fR 4 +.IX Item "p{XPosix...}" +There are several of these, which are the standard Posix classes +extended to the full Unicode range. They are described in +"POSIX Character Classes" in perlrecharclass. +.ie n .SS "Comparison of ""\eN{...}"" and ""\ep{name=...}""" +.el .SS "Comparison of \f(CW\eN{...}\fP and \f(CW\ep{name=...}\fP" +.IX Subsection "Comparison of N{...} and p{name=...}" +Starting in Perl 5.32, you can specify a character by its name in +regular expression patterns using \f(CW\*(C`\ep{name=...}\*(C'\fR. This is in addition +to the longstanding method of using \f(CW\*(C`\eN{...}\*(C'\fR. 
The following +summarizes the differences between these two: +.PP +.Vb 6 +\& \eN{...} \ep{Name=...} +\& can interpolate only with eval yes [1] +\& custom names yes no [2] +\& name aliases yes yes [3] +\& named sequences yes yes [4] +\& name value parsing exact Unicode loose [5] +.Ve +.IP [1] 4 +.IX Item "[1]" +The ability to interpolate means you can do something like +.Sp +.Vb 1 +\& qr/\ep{na=latin capital letter $which}/ +.Ve +.Sp +and specify \f(CW$which\fR elsewhere. +.IP [2] 4 +.IX Item "[2]" +You can create your own names for characters, and override official +ones when using \f(CW\*(C`\eN{...}\*(C'\fR. See "CUSTOM ALIASES" in charnames. +.IP [3] 4 +.IX Item "[3]" +Some characters have multiple names (synonyms). +.IP [4] 4 +.IX Item "[4]" +Some particular sequences of characters are given a single name, in +addition to their individual ones. +.IP [5] 4 +.IX Item "[5]" +Exact name value matching means you have to specify case, hyphens, +underscores, and spaces precisely in the name you want. Loose matching +follows the Unicode rules +<https://www.unicode.org/reports/tr44/tr44\-24.html#UAX44\-LM2>, +where these are mostly irrelevant. Except for a few outlier character +names, these are the same rules as are already used for any other +\&\f(CW\*(C`\ep{...}\*(C'\fR property. +.SS "Wildcards in Property Values" +.IX Subsection "Wildcards in Property Values" +Starting in Perl 5.30, it is possible to do something like this: +.PP +.Vb 1 +\& qr!\ep{numeric_value=/\eA[0\-5]\ez/}! +.Ve +.PP +or, by abbreviating and adding \f(CW\*(C`/x\*(C'\fR, +.PP +.Vb 1 +\& qr! \ep{nv= /(?x) \eA [0\-5] \ez / }! +.Ve +.PP +This matches all code points whose numeric value is one of 0, 1, 2, 3, +4, or 5. This particular example could instead have been written as +.PP +.Vb 1 +\& qr! \eA [ \ep{nv=0}\ep{nv=1}\ep{nv=2}\ep{nv=3}\ep{nv=4}\ep{nv=5} ] \ez !xx +.Ve +.PP +in earlier perls, so in this case this feature just makes things easier +and shorter to write. 
If we hadn't included the \f(CW\*(C`\eA\*(C'\fR and \f(CW\*(C`\ez\*(C'\fR, these +would have matched things like \f(CW\*(C`1/2\*(C'\fR because that contains a 1 (as +well as a 2). As written, it matches things like subscripts that have +these numeric values. If we only wanted the decimal digits with those +numeric values, we could say, +.PP +.Vb 1 +\& qr! (?[ \ed & \ep{nv=/[0\-5]/} ]) !x +.Ve +.PP +The \f(CW\*(C`\ed\*(C'\fR gets rid of needing to anchor the pattern, since it forces the +result to only match \f(CW\*(C`[0\-9]\*(C'\fR, and the \f(CW\*(C`[0\-5]\*(C'\fR further restricts it. +.PP +The text in the above examples enclosed between the \f(CW"/"\fR +characters can be just about any regular expression. It is independent +of the main pattern, so doesn't share any capturing groups, \fIetc\fR. The +delimiters for it must be ASCII punctuation, but it may NOT be +delimited by \f(CW"{"\fR, nor \f(CW"}"\fR nor contain a literal \f(CW"}"\fR, as that +delimits the end of the enclosing \f(CW\*(C`\ep{}\*(C'\fR. Like any pattern, certain +other delimiters are terminated by their mirror images. These are +\&\f(CW"("\fR, \f(CW"["\fR, and \f(CW"<"\fR. If the delimiter is any of \f(CW"\-"\fR, +\&\f(CW"_"\fR, \f(CW"+"\fR, or \f(CW"\e"\fR, or is the same delimiter as is used for the +enclosing pattern, it must be preceded by a backslash escape, both +fore and aft. +.PP +Beware of using \f(CW"$"\fR to indicate to match the end of the string. It +can too easily be interpreted as being a punctuation variable, like +\&\f(CW$/\fR. +.PP +No modifiers may follow the final delimiter. Instead, use +"(?adlupimnsx\-imnsx)" in perlre and/or +"(?adluimnsx\-imnsx:pattern)" in perlre to specify modifiers. +However, certain modifiers are illegal in your wildcard subpattern. +The only character set modifier specifiable is \f(CW\*(C`/aa\*(C'\fR; +any other character set, and \f(CW\*(C`\-m\*(C'\fR, and \f(CW\*(C`p\*(C'\fR, and \f(CW\*(C`s\*(C'\fR are all illegal. 
+Specifying modifiers like \f(CW\*(C`qr/.../gc\*(C'\fR that aren't legal in the +\&\f(CW\*(C`(?...)\*(C'\fR notation normally raises a warning, but with wildcard +subpatterns, their use is an error. The \f(CW\*(C`m\*(C'\fR modifier is ineffective; +everything that matches will be a single line. +.PP +By default, your pattern is matched case-insensitively, as if \f(CW\*(C`/i\*(C'\fR had +been specified. You can change this by saying \f(CW\*(C`(?\-i)\*(C'\fR in your pattern. +.PP +There are also certain operations that are illegal. You can't nest +\&\f(CW\*(C`\ep{...}\*(C'\fR and \f(CW\*(C`\eP{...}\*(C'\fR calls within a wildcard subpattern, and \f(CW\*(C`\eG\*(C'\fR +doesn't make sense, so is also prohibited. +.PP +And the \f(CW\*(C`*\*(C'\fR quantifier (or its equivalent \f(CW\*(C`{0,}\*(C'\fR) is illegal. +.PP +This feature is not available when the left-hand side is prefixed by +\&\f(CW\*(C`Is_\*(C'\fR, nor for any form that is marked as "Discouraged" in +"Discouraged" in perluniprops. +.PP +This experimental feature has been added to begin to implement +<https://www.unicode.org/reports/tr18/#Wildcard_Properties>. Using it +will raise a (default-on) warning in the +\&\f(CW\*(C`experimental::uniprop_wildcards\*(C'\fR category. We reserve the right to +change its operation as we gain experience. +.PP +Your subpattern can be just about anything, but for it to have some +utility, it should match when called with either or both of +a) the full name of the property value with underscores (and/or spaces +in the Block property) and some things uppercase; or b) the property +value in all lowercase with spaces and underscores squeezed out. For +example, +.PP +.Vb 2 +\& qr!\ep{Blk=/Old I.*/}! +\& qr!\ep{Blk=/oldi.*/}! +.Ve +.PP +would match the same things. +.PP +Another example that shows that within \f(CW\*(C`\ep{...}\*(C'\fR, \f(CW\*(C`/x\*(C'\fR isn't needed to +have spaces: +.PP +.Vb 1 +\& qr!\ep{scx= /Hebrew|Greek/ }! 
+.Ve +.PP +To be safe, we should have anchored the above example, to prevent +matches for something like \f(CW\*(C`Hebrew_Braille\*(C'\fR, but there aren't +any script names like that, so far. +A warning is issued if none of the legal values for a property are +matched by your pattern. It's likely that a future release will raise a +warning if your pattern ends up causing every possible code point to +match. +.PP +Starting in 5.32, the Name, Name Aliases, and Named Sequences properties +are allowed to be matched. They are considered to be a single +combination property, just as has long been the case for \f(CW\*(C`\eN{}\*(C'\fR. Loose +matching doesn't work in exactly the same way for these as it does for +the values of other properties. The rules are given in +<https://www.unicode.org/reports/tr44/tr44\-24.html#UAX44\-LM2>. As a +result, Perl doesn't try loose matching for you, like it does in other +properties. All letters in names are uppercase, but you can add \f(CW\*(C`(?i)\*(C'\fR +to your subpattern to ignore case. If you're uncertain where a blank +is, you can use \f(CW\*(C` ?\*(C'\fR in your subpattern. No character name contains an +underscore, so don't bother trying to match one. The use of hyphens is +particularly problematic; refer to the above link. But note that, as of +Unicode 13.0, the only script in modern usage which has weirdnesses with +these is Tibetan; also the two Korean characters U+116C HANGUL JUNGSEONG +OE and U+1180 HANGUL JUNGSEONG O\-E. Unicode makes no promises to not +add hyphen-problematic names in the future. +.PP +Using wildcards on these is resource intensive, given the hundreds of +thousands of legal names that must be checked against. +.PP +An example of using Name property wildcards is +.PP +.Vb 1 +\& qr!\ep{name=/(SMILING|GRINNING) FACE/}! +.Ve +.PP +Another is +.PP +.Vb 1 +\& qr/(?[ \ep{name=\e/CJK\e/} \- \ep{ideographic} ])/ +.Ve +.PP +which is the 200\-ish (as of Unicode 13.0) CJK characters that aren't +ideographs. 
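A hedged sketch of the first Name example in use, assuming Perl 5.32 or later (where the Name property accepts a wildcard subpattern). The helper name \f(CW\*(C`is_face\*(C'\fR and the sample characters are chosen here for illustration and are not part of Perl. The default-on experimental warning is switched off, and compiling the pattern may be slow because every character name has to be checked.

```perl
use strict;
use warnings;
no warnings 'experimental::uniprop_wildcards';

# Matches any single character whose name contains "SMILING FACE"
# or "GRINNING FACE" (the subpattern is not anchored).
my $face = qr!\p{name=/(SMILING|GRINNING) FACE/}!;

# is_face() is a hypothetical helper, introduced only for this sketch.
sub is_face { return $_[0] =~ $face ? 1 : 0 }

for my $char ("\N{GRINNING FACE}", "\N{WHITE SMILING FACE}", "A") {
    printf "U+%04X %s\n", ord($char),
        is_face($char) ? "has a matching name" : "does not match";
}
```

Because the subpattern is unanchored, names that merely contain one of the phrases also match; add \f(CW\*(C`\eA\*(C'\fR and \f(CW\*(C`\ez\*(C'\fR inside the subpattern to require the whole name.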
+.PP +There are certain properties that wildcard subpatterns don't currently +work with. These are: +.PP +.Vb 9 +\& Bidi Mirroring Glyph +\& Bidi Paired Bracket +\& Case Folding +\& Decomposition Mapping +\& Equivalent Unified Ideograph +\& Lowercase Mapping +\& NFKC Case Fold +\& Titlecase Mapping +\& Uppercase Mapping +.Ve +.PP +Nor is the \f(CW\*(C`@\fR\f(CIunicode_property\fR\f(CW@\*(C'\fR form implemented. +.PP +Here's a complete example of matching IPv4 internet protocol addresses +in any (single) script: +.PP +.Vb 1 +\& no warnings \*(Aqexperimental::uniprop_wildcards\*(Aq; +\& +\& # Can match a substring, so this intermediate regex needs to have +\& # context or anchoring in its final use. Using nt=de yields decimal +\& # digits. When specifying a subset of these, we must include \ed to +\& # prevent things like U+00B2 SUPERSCRIPT TWO from matching +\& my $zero_through_255 = +\& qr/ \eb (*sr: # All from same script +\& (?[ \ep{nv=0} & \ed ])* # Optional leading zeros +\& ( # Then one of: +\& \ed{1,2} # 0 \- 99 +\& | (?[ \ep{nv=1} & \ed ]) \ed{2} # 100 \- 199 +\& | (?[ \ep{nv=2} & \ed ]) +\& ( (?[ \ep{nv=:[0\-4]:} & \ed ]) \ed # 200 \- 249 +\& | (?[ \ep{nv=5} & \ed ]) +\& (?[ \ep{nv=:[0\-5]:} & \ed ]) # 250 \- 255 +\& ) +\& ) +\& ) +\& \eb +\& /x; +\& +\& my $ipv4 = qr/ \eA (*sr: $zero_through_255 +\& (?: [.] $zero_through_255 ) {3} +\& ) +\& \ez +\& /x; +.Ve +.SS "User-Defined Character Properties" +.IX Subsection "User-Defined Character Properties" +You can define your own binary character properties by defining subroutines +whose names begin with \f(CW"In"\fR or \f(CW"Is"\fR. (The regex sets feature +"(?[ ])" in perlre provides an alternative which allows more complex +definitions.) The subroutines can be defined in any +package. They override any Unicode properties expressed as the same +names. 
The user-defined properties can be used in the regular +expression +\&\f(CW\*(C`\ep{}\*(C'\fR and \f(CW\*(C`\eP{}\*(C'\fR constructs; if you are using a user-defined property from a +package other than the one you are in, you must specify its package in the +\&\f(CW\*(C`\ep{}\*(C'\fR or \f(CW\*(C`\eP{}\*(C'\fR construct. +.PP +.Vb 3 +\& # assuming property IsForeign defined in Lang:: +\& package main; # property package name required +\& if ($txt =~ /\ep{Lang::IsForeign}+/) { ... } +\& +\& package Lang; # property package name not required +\& if ($txt =~ /\ep{IsForeign}+/) { ... } +.Ve +.PP +The subroutines are passed a single parameter, which is 0 if +case-sensitive matching is in effect and non-zero if caseless matching +is in effect. The subroutine may return different values depending on +the value of the flag. But the subroutine is never called more than +once for each flag value (zero vs non-zero). The return value is saved +and used instead of calling the sub ever again. If the sub is defined +at the time the pattern is compiled, it will be called then; if not, it +will be called the first time its value (for that flag) is needed during +execution. +.PP +Note that if the regular expression is tainted, then Perl will die rather +than calling the subroutine when the name of the subroutine is +determined by the tainted data. +.PP +The subroutines must return a specially-formatted string, with one +or more newline-separated lines. Each line must be one of the following: +.IP \(bu 4 +A single hexadecimal number denoting a code point to include. +.IP \(bu 4 +Two hexadecimal numbers separated by horizontal whitespace (space or +tabular characters) denoting a range of code points to include. The +second number must not be smaller than the first. 
+.IP \(bu 4 +Something to include, prefixed by \f(CW"+"\fR: a built-in character +property (prefixed by \f(CW"utf8::"\fR) or a fully qualified (including package +name) user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. +.IP \(bu 4 +Something to exclude, prefixed by \f(CW"\-"\fR: an existing character +property (prefixed by \f(CW"utf8::"\fR) or a fully qualified (including package +name) user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. +.IP \(bu 4 +Something to negate, prefixed by \f(CW"!"\fR: an existing character +property (prefixed by \f(CW"utf8::"\fR) or a fully qualified (including package +name) user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. +.IP \(bu 4 +Something to intersect with, prefixed by \f(CW"&"\fR: an existing character +property (prefixed by \f(CW"utf8::"\fR) or a fully qualified (including package +name) user-defined character property, +to keep only the characters that are also in that property; two +hexadecimal code points for a range; or a single hexadecimal code point. +.PP +For example, to define a property that covers both the Japanese +syllabaries (hiragana and katakana), you can define +.PP +.Vb 6 +\& sub InKana { +\& return <<END; +\& 3040\et309F +\& 30A0\et30FF +\& END +\& } +.Ve +.PP +Imagine that the here-doc end marker is at the beginning of the line. +Now you can use \f(CW\*(C`\ep{InKana}\*(C'\fR and \f(CW\*(C`\eP{InKana}\*(C'\fR. 
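A minimal usage sketch for the definition above, written with the here-doc terminator at the start of its line so that it runs as shown. The sample characters (HIRAGANA LETTER A, KATAKANA LETTER KA, and ASCII "A") are chosen here purely for illustration.

```perl
use strict;
use warnings;

# User-defined property: two tab-separated hexadecimal ranges covering
# the Hiragana and Katakana blocks. The unquoted here-doc interpolates,
# so "\t" becomes a real tab character.
sub InKana {
    return <<END;
3040\t309F
30A0\t30FF
END
}

# Illustrative samples: U+3042, U+30AB, and ASCII "A".
for my $char ("\x{3042}", "\x{30AB}", "A") {
    printf "U+%04X is %s\n", ord($char),
        $char =~ /\p{InKana}/ ? "kana" : "not kana";
}
```

Since the sub and the patterns are in the same package, no package qualifier is needed inside \f(CW\*(C`\ep{}\*(C'\fR.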
+.PP +You could also have used the existing block property names: +.PP +.Vb 6 +\& sub InKana { +\& return <<\*(AqEND\*(Aq; +\& +utf8::InHiragana +\& +utf8::InKatakana +\& END +\& } +.Ve +.PP +Suppose you wanted to match only the allocated characters, +not the raw block ranges: in other words, you want to remove +the unassigned characters: +.PP +.Vb 7 +\& sub InKana { +\& return <<\*(AqEND\*(Aq; +\& +utf8::InHiragana +\& +utf8::InKatakana +\& \-utf8::IsCn +\& END +\& } +.Ve +.PP +The negation is useful for defining (surprise!) negated classes. +.PP +.Vb 7 +\& sub InNotKana { +\& return <<\*(AqEND\*(Aq; +\& !utf8::InHiragana +\& \-utf8::InKatakana +\& +utf8::IsCn +\& END +\& } +.Ve +.PP +This will match all non-Unicode code points, since every one of them is +not in Kana. You can use intersection to exclude these, if desired, as +this modified example shows: +.PP +.Vb 8 +\& sub InNotKana { +\& return <<\*(AqEND\*(Aq; +\& !utf8::InHiragana +\& \-utf8::InKatakana +\& +utf8::IsCn +\& &utf8::Any +\& END +\& } +.Ve +.PP +\&\f(CW&utf8::Any\fR must be the last line in the definition. +.PP +Intersection is used generally for getting the common characters matched +by two (or more) classes. It's important to remember not to use \f(CW"&"\fR for +the first set; that would be intersecting with nothing, resulting in an +empty set. (Similarly using \f(CW"\-"\fR for the first set does nothing). +.PP +Unlike non-user-defined \f(CW\*(C`\ep{}\*(C'\fR property matches, no warning is ever +generated if these properties are matched against a non-Unicode code +point (see "Beyond Unicode code points" below). +.SS "User-Defined Case Mappings (for serious hackers only)" +.IX Subsection "User-Defined Case Mappings (for serious hackers only)" +\&\fBThis feature has been removed as of Perl 5.16.\fR +The CPAN module \f(CW\*(C`Unicode::Casing\*(C'\fR provides better functionality without +the drawbacks that this feature had. 
If you are using a Perl earlier +than 5.16, this feature was most fully documented in the 5.14 version of +this pod: +<http://perldoc.perl.org/5.14.0/perlunicode.html#User\-Defined\-Case\-Mappings\-%28for\-serious\-hackers\-only%29> +.SS "Character Encodings for Input and Output" +.IX Subsection "Character Encodings for Input and Output" +See Encode. +.SS "Unicode Regular Expression Support Level" +.IX Subsection "Unicode Regular Expression Support Level" +The following list of Unicode supported features for regular expressions describes +all features currently directly supported by core Perl. The references +to "Level \fIN\fR" and the section numbers refer to +UTS#18 "Unicode Regular Expressions" <https://www.unicode.org/reports/tr18>, +version 18, October 2016. +.PP +\fILevel 1 \- Basic Unicode Support\fR +.IX Subsection "Level 1 - Basic Unicode Support" +.PP +.Vb 8 +\& RL1.1 Hex Notation \- Done [1] +\& RL1.2 Properties \- Done [2] +\& RL1.2a Compatibility Properties \- Done [3] +\& RL1.3 Subtraction and Intersection \- Done [4] +\& RL1.4 Simple Word Boundaries \- Done [5] +\& RL1.5 Simple Loose Matches \- Done [6] +\& RL1.6 Line Boundaries \- Partial [7] +\& RL1.7 Supplementary Code Points \- Done [8] +.Ve +.ie n .IP "[1] ""\eN{U+...}"" and ""\ex{...}""" 4 +.el .IP "[1] \f(CW\eN{U+...}\fR and \f(CW\ex{...}\fR" 4 +.IX Item "[1] N{U+...} and x{...}" +.PD 0 +.ie n .IP "[2] ""\ep{...}"" ""\eP{...}"". This requirement is for a minimal list of properties. Perl supports these. See R2.7 for other properties." 4 +.el .IP "[2] \f(CW\ep{...}\fR \f(CW\eP{...}\fR. This requirement is for a minimal list of properties. Perl supports these. See R2.7 for other properties." 4 +.IX Item "[2] p{...} P{...}. This requirement is for a minimal list of properties. Perl supports these. See R2.7 for other properties." 
+.IP [3] 4 +.IX Item "[3]" +.PD +Perl has \f(CW\*(C`\ed\*(C'\fR \f(CW\*(C`\eD\*(C'\fR \f(CW\*(C`\es\*(C'\fR \f(CW\*(C`\eS\*(C'\fR \f(CW\*(C`\ew\*(C'\fR \f(CW\*(C`\eW\*(C'\fR \f(CW\*(C`\eX\*(C'\fR \f(CW\*(C`[:\fR\f(CIprop\fR\f(CW:]\*(C'\fR +\&\f(CW\*(C`[:^\fR\f(CIprop\fR\f(CW:]\*(C'\fR, plus all the properties specified by +<https://www.unicode.org/reports/tr18/#Compatibility_Properties>. These +are described above in "Other Properties" +.IP [4] 4 +.IX Item "[4]" +The regex sets feature \f(CW"(?[...])"\fR starting in v5.18 accomplishes +this. See "(?[ ])" in perlre. +.ie n .IP "[5] ""\eb"" ""\eB"" meet most, but not all, the details of this requirement, but ""\eb{wb}"" and ""\eB{wb}"" do, as well as the stricter R2.3." 4 +.el .IP "[5] \f(CW\eb\fR \f(CW\eB\fR meet most, but not all, the details of this requirement, but \f(CW\eb{wb}\fR and \f(CW\eB{wb}\fR do, as well as the stricter R2.3." 4 +.IX Item "[5] b B meet most, but not all, the details of this requirement, but b{wb} and B{wb} do, as well as the stricter R2.3." +.PD 0 +.IP [6] 4 +.IX Item "[6]" +.PD +Note that Perl does Full case-folding in matching, not Simple: +.Sp +For example \f(CW\*(C`U+1F88\*(C'\fR is equivalent to \f(CW\*(C`U+1F00 U+03B9\*(C'\fR, instead of just +\&\f(CW\*(C`U+1F80\*(C'\fR. This difference matters mainly for certain Greek capital +letters with certain modifiers: the Full case-folding decomposes the +letter, while the Simple case-folding would map it to a single +character. +.IP [7] 4 +.IX Item "[7]" +The reason this is considered to be only partially implemented is that +Perl has \f(CW\*(C`qr/\eb{lb}/\*(C'\fR and +\&\f(CW\*(C`Unicode::LineBreak\*(C'\fR that are conformant with +UAX#14 "Unicode Line Breaking Algorithm" <https://www.unicode.org/reports/tr14>. +The regular expression construct provides default behavior, while the +heavier-weight module provides customizable line breaking. 
+.Sp +But Perl treats \f(CW\*(C`\en\*(C'\fR as the start\- and end-line +delimiter, whereas Unicode specifies more characters that should be +so-interpreted. +.Sp +These are: +.Sp +.Vb 6 +\& VT U+000B (\ev in C) +\& FF U+000C (\ef) +\& CR U+000D (\er) +\& NEL U+0085 +\& LS U+2028 +\& PS U+2029 +.Ve +.Sp +\&\f(CW\*(C`^\*(C'\fR and \f(CW\*(C`$\*(C'\fR in regular expression patterns are supposed to match all +these, but don't. +These characters also don't, but should, affect \f(CW\*(C`<>\*(C'\fR, \f(CW$.\fR, and +script line numbers. +.Sp +Also, lines should not be split within \f(CW\*(C`CRLF\*(C'\fR (i.e. there is no +empty line between \f(CW\*(C`\er\*(C'\fR and \f(CW\*(C`\en\*(C'\fR). For \f(CW\*(C`CRLF\*(C'\fR, try the \f(CW\*(C`:crlf\*(C'\fR +layer (see PerlIO). +.ie n .IP "[8] UTF\-8/UTF\-EBCDIC used in Perl allows not only ""U+10000"" to ""U+10FFFF"" but also beyond ""U+10FFFF""" 4 +.el .IP "[8] UTF\-8/UTF\-EBCDIC used in Perl allows not only \f(CWU+10000\fR to \f(CWU+10FFFF\fR but also beyond \f(CWU+10FFFF\fR" 4 +.IX Item "[8] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to U+10FFFF but also beyond U+10FFFF" +.PP +\fILevel 2 \- Extended Unicode Support\fR +.IX Subsection "Level 2 - Extended Unicode Support" +.PP +.Vb 10 +\& RL2.1 Canonical Equivalents \- Retracted [9] +\& by Unicode +\& RL2.2 Extended Grapheme Clusters and \- Partial [10] +\& Character Classes with Strings +\& RL2.3 Default Word Boundaries \- Done [11] +\& RL2.4 Default Case Conversion \- Done +\& RL2.5 Name Properties \- Done +\& RL2.6 Wildcards in Property Values \- Partial [12] +\& RL2.7 Full Properties \- Partial [13] +\& RL2.8 Optional Properties \- Partial [14] +.Ve +.IP "[9] Unicode has rewritten this portion of UTS#18 to say that getting canonical equivalence (see UAX#15 ""Unicode Normalization Forms"" <https://www.unicode.org/reports/tr15>) is basically to be done at the programmer level. 
Use NFD to write both your regular expressions and text to match them against (you can use Unicode::Normalize)." 4 +.IX Item "[9] Unicode has rewritten this portion of UTS#18 to say that getting canonical equivalence (see UAX#15 ""Unicode Normalization Forms"" <https://www.unicode.org/reports/tr15>) is basically to be done at the programmer level. Use NFD to write both your regular expressions and text to match them against (you can use Unicode::Normalize)." +.PD 0 +.ie n .IP "[10] Perl has ""\eX"" and ""\eb{gcb}"". Unicode has retracted their ""Grapheme Cluster Mode"", and recently added string properties, which Perl does not yet support." 4 +.el .IP "[10] Perl has \f(CW\eX\fR and \f(CW\eb{gcb}\fR. Unicode has retracted their ""Grapheme Cluster Mode"", and recently added string properties, which Perl does not yet support." 4 +.IX Item "[10] Perl has X and b{gcb}. Unicode has retracted their ""Grapheme Cluster Mode"", and recently added string properties, which Perl does not yet support." +.IP "[11] see UAX#29 ""Unicode Text Segmentation"" <https://www.unicode.org/reports/tr29>," 4 +.IX Item "[11] see UAX#29 ""Unicode Text Segmentation"" <https://www.unicode.org/reports/tr29>," +.IP "[12] see ""Wildcards in Property Values"" above." 4 +.IX Item "[12] see ""Wildcards in Property Values"" above." +.IP "[13] Perl supports all the properties in the Unicode Character Database (UCD). It does not yet support the listed properties that come from other Unicode sources." 4 +.IX Item "[13] Perl supports all the properties in the Unicode Character Database (UCD). It does not yet support the listed properties that come from other Unicode sources." +.IP "[14] The only optional property that Perl supports is Named Sequence. None of these properties are in the UCD." 4 +.IX Item "[14] The only optional property that Perl supports is Named Sequence. None of these properties are in the UCD." 
+.PD +.PP +\fILevel 3 \- Tailored Support\fR +.IX Subsection "Level 3 - Tailored Support" +.PP +This has been retracted by Unicode. +.SS "Unicode Encodings" +.IX Subsection "Unicode Encodings" +Unicode characters are assigned to \fIcode points\fR, which are abstract +numbers. To use these numbers, various encodings are needed. +.IP \(bu 4 +UTF\-8 +.Sp +UTF\-8 is a variable-length (1 to 4 bytes), byte-order independent +encoding. In most of Perl's documentation, including elsewhere in this +document, the term "UTF\-8" means also "UTF-EBCDIC". But in this section, +"UTF\-8" refers only to the encoding used on ASCII platforms. It is a +superset of 7\-bit US-ASCII, so anything encoded in ASCII has the +identical representation when encoded in UTF\-8. +.Sp +The following table is from Unicode 3.2. +.Sp +.Vb 1 +\& Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte +\& +\& U+0000..U+007F 00..7F +\& U+0080..U+07FF * C2..DF 80..BF +\& U+0800..U+0FFF E0 * A0..BF 80..BF +\& U+1000..U+CFFF E1..EC 80..BF 80..BF +\& U+D000..U+D7FF ED 80..9F 80..BF +\& U+D800..U+DFFF +++++ utf16 surrogates, not legal utf8 +++++ +\& U+E000..U+FFFF EE..EF 80..BF 80..BF +\& U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF +\& U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF +\& U+100000..U+10FFFF F4 80..8F 80..BF 80..BF +.Ve +.Sp +Note the gaps marked by "*" before several of the byte entries above. These are +caused by legal UTF\-8 avoiding non-shortest encodings: it is technically +possible to UTF\-8\-encode a single code point in different ways, but that is +explicitly forbidden, and the shortest possible encoding should always be used +(and that is what Perl does). 
+.Sp +Another way to look at it is via bits: +.Sp +.Vb 1 +\& Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte +\& +\& 0aaaaaaa 0aaaaaaa +\& 00000bbbbbaaaaaa 110bbbbb 10aaaaaa +\& ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa +\& 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa +.Ve +.Sp +As you can see, the continuation bytes all begin with \f(CW"10"\fR, and the +leading bits of the start byte tell how many bytes there are in the +encoded character. +.Sp +The original UTF\-8 specification allowed up to 6 bytes, to allow +encoding of numbers up to \f(CW\*(C`0x7FFF_FFFF\*(C'\fR. Perl continues to allow those, +and has extended that up to 13 bytes to encode code points up to what +can fit in a 64\-bit word. However, Perl will warn if you output any of +these as being non-portable; and under strict UTF\-8 input protocols, +they are forbidden. In addition, it is now illegal to use a code point +larger than what a signed integer variable on your system can hold. On +32\-bit ASCII systems, this means \f(CW\*(C`0x7FFF_FFFF\*(C'\fR is the legal maximum +(much higher on 64\-bit systems). +.IP \(bu 4 +UTF-EBCDIC +.Sp +Like UTF\-8, but EBCDIC-safe, in the way that UTF\-8 is ASCII-safe. +This means that all the basic characters (which include all +those that have ASCII equivalents, like \f(CW"A"\fR, \f(CW"0"\fR, \f(CW"%"\fR, \fIetc.\fR) +are the same in both EBCDIC and UTF-EBCDIC. +.Sp +UTF-EBCDIC is used on EBCDIC platforms. It generally requires more +bytes to represent a given code point than UTF\-8 does; the largest +Unicode code points take 5 bytes to represent (instead of 4 in UTF\-8), +and, extended for 64\-bit words, it uses 14 bytes instead of 13 bytes in +UTF\-8. +.IP \(bu 4 +UTF\-16, UTF\-16BE, UTF\-16LE, Surrogates, and \f(CW\*(C`BOM\*(C'\fR's (Byte Order Marks) +.Sp +The following items are mostly for reference and general Unicode +knowledge; Perl doesn't use these constructs internally. 
+.Sp +Like UTF\-8, UTF\-16 is a variable-width encoding, but where +UTF\-8 uses 8\-bit code units, UTF\-16 uses 16\-bit code units. +All code points occupy either 2 or 4 bytes in UTF\-16: code points +\&\f(CW\*(C`U+0000..U+FFFF\*(C'\fR are stored in a single 16\-bit unit, and code +points \f(CW\*(C`U+10000..U+10FFFF\*(C'\fR in two 16\-bit units. The latter case is +using \fIsurrogates\fR, the first 16\-bit unit being the \fIhigh +surrogate\fR, and the second being the \fIlow surrogate\fR. +.Sp +Surrogates are code points set aside to encode the \f(CW\*(C`U+10000..U+10FFFF\*(C'\fR +range of Unicode code points in pairs of 16\-bit units. The \fIhigh +surrogates\fR are the range \f(CW\*(C`U+D800..U+DBFF\*(C'\fR and the \fIlow surrogates\fR +are the range \f(CW\*(C`U+DC00..U+DFFF\*(C'\fR. The surrogate encoding is +.Sp +.Vb 2 +\& $hi = ($uni \- 0x10000) / 0x400 + 0xD800; +\& $lo = ($uni \- 0x10000) % 0x400 + 0xDC00; +.Ve +.Sp +and the decoding is +.Sp +.Vb 1 +\& $uni = 0x10000 + ($hi \- 0xD800) * 0x400 + ($lo \- 0xDC00); +.Ve +.Sp +Because of the 16\-bitness, UTF\-16 is byte-order dependent. UTF\-16 +itself can be used for in-memory computations, but if storage or +transfer is required either UTF\-16BE (big-endian) or UTF\-16LE +(little-endian) encodings must be chosen. +.Sp +This introduces another problem: what if you just know that your data +is UTF\-16, but you don't know which endianness? Byte Order Marks, or +\&\f(CW\*(C`BOM\*(C'\fR's, are a solution to this. A special character has been reserved +in Unicode to function as a byte order marker: the character with the +code point \f(CW\*(C`U+FEFF\*(C'\fR is the \f(CW\*(C`BOM\*(C'\fR. +.Sp +The trick is that if you read a \f(CW\*(C`BOM\*(C'\fR, you will know the byte order, +since if it was written on a big-endian platform, you will read the +bytes \f(CW\*(C`0xFE 0xFF\*(C'\fR, but if it was written on a little-endian platform, +you will read the bytes \f(CW\*(C`0xFF 0xFE\*(C'\fR. 
(And if the originating platform +was writing UTF\-8 on an ASCII platform, you will read the bytes +\&\f(CW\*(C`0xEF 0xBB 0xBF\*(C'\fR.) +.Sp +The way this trick works is that the character with the code point +\&\f(CW\*(C`U+FFFE\*(C'\fR is not supposed to be in input streams, so the +sequence of bytes \f(CW\*(C`0xFF 0xFE\*(C'\fR is unambiguously "\f(CW\*(C`BOM\*(C'\fR, represented in +little-endian format" and cannot be "\f(CW\*(C`U+FFFE\*(C'\fR, represented in big-endian +format". +.Sp +Surrogates have no meaning in Unicode outside their use in pairs to +represent other code points. However, Perl allows them to be +represented individually internally, for example by saying +\&\f(CWchr(0xD801)\fR, so that all code points, not just those valid for open +interchange, are +representable. Unicode does define semantics for them, such as their +\&\f(CW"General_Category"\fR being \f(CW"Cs"\fR. But because their use is somewhat dangerous, +Perl will warn (using the warning category \f(CW"surrogate"\fR, which is a +sub-category of \f(CW"utf8"\fR) if an attempt is made +to do things like take the lower case of one, or match +case-insensitively, or to output them. (But don't try this on Perls +before 5.14.) +.IP \(bu 4 +UTF\-32, UTF\-32BE, UTF\-32LE +.Sp +The UTF\-32 family is pretty much like the UTF\-16 family, except that +the units are 32\-bit, and therefore the surrogate scheme is not +needed. UTF\-32 is a fixed-width encoding. The \f(CW\*(C`BOM\*(C'\fR signatures are +\&\f(CW\*(C`0x00 0x00 0xFE 0xFF\*(C'\fR for BE and \f(CW\*(C`0xFF 0xFE 0x00 0x00\*(C'\fR for LE. +.IP \(bu 4 +UCS\-2, UCS\-4 +.Sp +Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS\-2 is a 16\-bit +encoding. Unlike UTF\-16, UCS\-2 is not extensible beyond \f(CW\*(C`U+FFFF\*(C'\fR, +because it does not use surrogates. 
UCS\-4 is a 32\-bit encoding, +functionally identical to UTF\-32 (the difference being that +UCS\-4 forbids neither surrogates nor code points larger than \f(CW\*(C`0x10_FFFF\*(C'\fR). +.IP \(bu 4 +UTF\-7 +.Sp +A seven-bit safe (non-eight-bit) encoding, which is useful if the +transport or storage is not eight-bit safe. Defined by RFC 2152. +.SS "Noncharacter code points" +.IX Subsection "Noncharacter code points" +66 code points are set aside in Unicode as "noncharacter code points". +These all have the \f(CW\*(C`Unassigned\*(C'\fR (\f(CW\*(C`Cn\*(C'\fR) \f(CW"General_Category"\fR, and +no character will ever be assigned to any of them. They are the 32 code +points between \f(CW\*(C`U+FDD0\*(C'\fR and \f(CW\*(C`U+FDEF\*(C'\fR inclusive, and the 34 code +points: +.PP +.Vb 7 +\& U+FFFE U+FFFF +\& U+1FFFE U+1FFFF +\& U+2FFFE U+2FFFF +\& ... +\& U+EFFFE U+EFFFF +\& U+FFFFE U+FFFFF +\& U+10FFFE U+10FFFF +.Ve +.PP +Until Unicode 7.0, the noncharacters were "\fBforbidden\fR for use in open +interchange of Unicode text data", so that code that processed those +streams could use these code points as sentinels that could be mixed in +with character data, and would always be distinguishable from that data. +(Emphasis above and in the next paragraph are added in this document.) +.PP +Unicode 7.0 changed the wording so that they are "\fBnot recommended\fR for +use in open interchange of Unicode text data". The 7.0 Standard goes on +to say: +.Sp +.RS 4 +"If a noncharacter is received in open interchange, an application is +not required to interpret it in any way. It is good practice, however, +to recognize it as a noncharacter and to take appropriate action, such +as replacing it with \f(CW\*(C`U+FFFD\*(C'\fR replacement character, to indicate the +problem in the text. It is not recommended to simply delete +noncharacter code points from such text, because of the potential +security issues caused by deleting uninterpreted characters. 
(See +conformance clause C7 in Section 3.2, Conformance Requirements, and +Unicode Technical Report #36, "Unicode Security +Considerations" <https://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences>)." +.RE +.PP +This change was made because it was found that various commercial tools +like editors, or for things like source code control, had been written +so that they would not handle program files that used these code points, +effectively precluding their use almost entirely! And that was never +the intent. They've always been meant to be usable within an +application, or cooperating set of applications, at will. +.PP +If you're writing code, such as an editor, that is supposed to be able +to handle any Unicode text data, then you shouldn't be using these code +points yourself, and instead allow them in the input. If you need +sentinels, they should instead be something that isn't legal Unicode. +For UTF\-8 data, you can use the bytes 0xC0 and 0xC1 as sentinels, as +they never appear in well-formed UTF\-8. (There are equivalents for +UTF-EBCDIC). You can also store your Unicode code points in integer +variables and use negative values as sentinels. +.PP +If you're not writing such a tool, then whether you accept noncharacters +as input is up to you (though the Standard recommends that you not). If +you do strict input stream checking with Perl, these code points +continue to be forbidden. This is to maintain backward compatibility +(otherwise potential security holes could open up, as an unsuspecting +application that was written assuming the noncharacters would be +filtered out before getting to it, could now, without warning, start +getting them). To do strict checking, you can use the layer +\&\f(CW:encoding(\*(AqUTF\-8\*(Aq)\fR. +.PP +Perl continues to warn (using the warning category \f(CW"nonchar"\fR, which +is a sub-category of \f(CW"utf8"\fR) if an attempt is made to output +noncharacters. 
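+.PP
+As a minimal sketch of the enumeration at the start of this section, the
+66 noncharacters can be recognized with plain integer arithmetic. The
+helper name \f(CWis_noncharacter\fR is hypothetical (it is not part of
+Perl), and the code works on code point numbers only, so it never creates
+or outputs the characters themselves:

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl built-in): true if $cp is one of the
# 66 Unicode noncharacter code points.
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp < 0 || $cp > 0x10FFFF;         # not a Unicode code point
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # the contiguous run of 32
    return (($cp & 0xFFFE) == 0xFFFE) ? 1 : 0;     # last two code points of each plane
}

# Count them across the whole Unicode range: 32 + 2 per plane * 17 planes.
my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_noncharacter($cp);
}
print "$count\n";    # prints 66
```

+.PP
+Inside a regular expression, the \f(CW\*(C`\ep{Noncharacter_Code_Point}\*(C'\fR
+property listed in perluniprops should select the same set; the arithmetic
+form above is only meant to make the layout of the 66 code points explicit.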
+.SS "Beyond Unicode code points" +.IX Subsection "Beyond Unicode code points" +The maximum Unicode code point is \f(CW\*(C`U+10FFFF\*(C'\fR, and Unicode only defines +operations on code points up through that. But Perl works on code +points up to the maximum permissible signed number available on the +platform. However, Perl will not accept these from input streams unless +lax rules are being used, and will warn (using the warning category +\&\f(CW"non_unicode"\fR, which is a sub-category of \f(CW"utf8"\fR) if any are output. +.PP +Since Unicode rules are not defined on these code points, if a +Unicode-defined operation is done on them, Perl uses what we believe are +sensible rules, while generally warning, using the \f(CW"non_unicode"\fR +category. For example, \f(CWuc("\ex{11_0000}")\fR will generate such a +warning, returning the input parameter as its result, since Perl defines +the uppercase of every non-Unicode code point to be the code point +itself. (All the case changing operations, not just uppercasing, work +this way.) +.PP +The situation with matching Unicode properties in regular expressions, +the \f(CW\*(C`\ep{}\*(C'\fR and \f(CW\*(C`\eP{}\*(C'\fR constructs, against these code points is not as +clear cut, and how these are handled has changed as we've gained +experience. +.PP +One possibility is to treat any match against these code points as +undefined. But since Perl doesn't have the concept of a match being +undefined, it converts this to failing or \f(CW\*(C`FALSE\*(C'\fR. This is almost, but +not quite, what Perl did from v5.14 (when use of these code points +became generally reliable) through v5.18. The difference is that Perl +treated all \f(CW\*(C`\ep{}\*(C'\fR matches as failing, but all \f(CW\*(C`\eP{}\*(C'\fR matches as +succeeding. 
+.PP +One problem with this is that it leads to unexpected, and confusing +results in some cases: +.PP +.Vb 2 +\& chr(0x110000) =~ \ep{ASCII_Hex_Digit=True} # Failed on <= v5.18 +\& chr(0x110000) =~ \ep{ASCII_Hex_Digit=False} # Failed! on <= v5.18 +.Ve +.PP +That is, it treated both matches as undefined, and converted that to +false (raising a warning on each). The first case is the expected +result, but the second is likely counterintuitive: "How could both be +false when they are complements?" Another problem was that the +implementation optimized many Unicode property matches down to already +existing simpler, faster operations, which don't raise the warning. We +chose to not forgo those optimizations, which help the vast majority of +matches, just to generate a warning for the unlikely event that an +above-Unicode code point is being matched against. +.PP +As a result of these problems, starting in v5.20, what Perl does is +to treat non-Unicode code points as just typical unassigned Unicode +characters, and matches accordingly. (Note: Unicode has atypical +unassigned code points. For example, it has noncharacter code points, +and ones that, when they do get assigned, are destined to be written +Right-to-left, as Arabic and Hebrew are. Perl assumes that no +non-Unicode code point has any atypical properties.) +.PP +Perl, in most cases, will raise a warning when matching an above-Unicode +code point against a Unicode property when the result is \f(CW\*(C`TRUE\*(C'\fR for +\&\f(CW\*(C`\ep{}\*(C'\fR, and \f(CW\*(C`FALSE\*(C'\fR for \f(CW\*(C`\eP{}\*(C'\fR. For example: +.PP +.Vb 2 +\& chr(0x110000) =~ \ep{ASCII_Hex_Digit=True} # Fails, no warning +\& chr(0x110000) =~ \ep{ASCII_Hex_Digit=False} # Succeeds, with warning +.Ve +.PP +In both these examples, the character being matched is non-Unicode, so +Unicode doesn't define how it should match. It clearly isn't an ASCII +hex digit, so the first example clearly should fail, and so it does, +with no warning. 
But it is arguable that the second example should have +an undefined, hence \f(CW\*(C`FALSE\*(C'\fR, result. So a warning is raised for it. +.PP +Thus the warning is raised for many fewer cases than in earlier Perls, +and only when the result is arguable. It turns out that +none of the optimizations made by Perl (or that are ever likely to be made) +cause the warning to be skipped, so it solves both problems of Perl's +earlier approach. The most commonly used property that is affected by +this change is \f(CW\*(C`\ep{Unassigned}\*(C'\fR which is a short form for +\&\f(CW\*(C`\ep{General_Category=Unassigned}\*(C'\fR. Starting in v5.20, all non-Unicode +code points are considered \f(CW\*(C`Unassigned\*(C'\fR. In earlier releases the +matches failed because the result was considered undefined. +.PP +The only place where the warning is not raised when it arguably should +have been is if optimizations cause the whole pattern match to not even +be attempted. For example, Perl may figure out that for a string to +match a certain regular expression pattern, the string has to contain +the substring \f(CW"foobar"\fR. Before attempting the match, Perl may look +for that substring, and if not found, immediately fail the match without +actually trying it; so no warning gets generated even if the string +contains an above-Unicode code point. +.PP +This behavior is more "Do what I mean" than in earlier Perls for most +applications. But it catches fewer issues for code that needs to be +strictly Unicode compliant. Therefore there is an additional mode of +operation available to accommodate such code. This mode is enabled if a +regular expression pattern is compiled within the lexical scope where +the \f(CW"non_unicode"\fR warning class has been made fatal, say by: +.PP +.Vb 1 +\& use warnings FATAL => "non_unicode" +.Ve +.PP +(see warnings). 
In this mode of operation, Perl will raise the +warning for all matches against a non-Unicode code point (not just the +arguable ones), and it skips the optimizations that might cause the +warning to not be output. (It currently still won't warn if the match +isn't even attempted, like in the \f(CW"foobar"\fR example above.) +.PP +In summary, Perl now normally treats non-Unicode code points as typical +Unicode unassigned code points for regular expression matches, raising a +warning only when it is arguable what the result should be. However, if +this warning has been made fatal, it isn't skipped. +.PP +There is one exception to all this. \f(CW\*(C`\ep{All}\*(C'\fR looks like a Unicode +property, but it is a Perl extension that is defined to be true for all +possible code points, Unicode or not, so no warning is ever generated +when matching this against a non-Unicode code point. (Prior to v5.20, +it was an exact synonym for \f(CW\*(C`\ep{Any}\*(C'\fR, matching code points \f(CW0\fR +through \f(CW0x10FFFF\fR.) +.SS "Security Implications of Unicode" +.IX Subsection "Security Implications of Unicode" +First, read +Unicode Security Considerations <https://www.unicode.org/reports/tr36>. +.PP +Also, note the following: +.IP \(bu 4 +Malformed UTF\-8 +.Sp +UTF\-8 is very structured, so many combinations of bytes are invalid. In +the past, Perl tried to soldier on and make some sense of invalid +combinations, but this can lead to security holes, so now, if the Perl +core needs to process an invalid combination, it will either raise a +fatal error, or will replace those bytes by the sequence that forms the +Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it. +.Sp +Every code point can be represented by more than one possible +syntactically valid UTF\-8 sequence. Early on, both Unicode and Perl +considered any of these to be valid, but now, all sequences longer +than the shortest possible one are considered to be malformed. 
+.Sp +Unicode considers many code points to be illegal, or to be avoided. +Perl generally accepts them, once they have passed through any input +filters that may try to exclude them. These have been discussed above +(see "Surrogates" under UTF\-16 in "Unicode Encodings", +"Noncharacter code points", and "Beyond Unicode code points"). +.IP \(bu 4 +Regular expression pattern matching may surprise you if you're not +accustomed to Unicode. Starting in Perl 5.14, several pattern +modifiers are available to control this, called the character set +modifiers. Details are given in "Character set modifiers" in perlre. +.PP +As discussed elsewhere, Perl has one foot (two hooves?) planted in +each of two worlds: the old world of ASCII and single-byte locales, and +the new world of Unicode, upgrading when necessary. +If your legacy code does not explicitly use Unicode, no automatic +switch-over to Unicode should happen. +.SS "Unicode in Perl on EBCDIC" +.IX Subsection "Unicode in Perl on EBCDIC" +Unicode is supported on EBCDIC platforms. See perlebcdic. +.PP +Unless ASCII vs. EBCDIC issues are specifically being discussed, +references to UTF\-8 encoding in this document and elsewhere should be +read as meaning UTF-EBCDIC on EBCDIC platforms. +See "Unicode and UTF" in perlebcdic. +.PP +Because UTF-EBCDIC is so similar to UTF\-8, the differences are mostly +hidden from you; \f(CW\*(C`use\ utf8\*(C'\fR (and NOT something like +\&\f(CW\*(C`use\ utfebcdic\*(C'\fR) declares the script is in the platform's +"native" 8\-bit encoding of Unicode. (Similarly for the \f(CW":utf8"\fR +layer.) 
+.SS Locales +.IX Subsection "Locales" +See "Unicode and UTF\-8" in perllocale +.SS "When Unicode Does Not Happen" +.IX Subsection "When Unicode Does Not Happen" +There are still many places where Unicode (in some encoding or +another) could be given as arguments or received as results, or both in +Perl, but it is not, in spite of Perl having extensive ways to input and +output in Unicode, and a few other "entry points" like the \f(CW@ARGV\fR +array (which can sometimes be interpreted as UTF\-8). +.PP +The following are such interfaces. Also, see "The "Unicode Bug"". +For all of these interfaces Perl +currently (as of v5.16.0) simply assumes byte strings both as arguments +and results, or UTF\-8 strings if the (deprecated) \f(CW\*(C`encoding\*(C'\fR pragma has been used. +.PP +One reason that Perl does not attempt to resolve the role of Unicode in +these situations is that the answers are highly dependent on the operating +system and the file system(s). For example, whether filenames can be +in Unicode and in exactly what kind of encoding, is not exactly a +portable concept. Similarly for \f(CW\*(C`qx\*(C'\fR and \f(CW\*(C`system\*(C'\fR: how well will the +"command-line interface" (and which of them?) handle Unicode? 
+.IP \(bu 4 +\&\f(CW\*(C`chdir\*(C'\fR, \f(CW\*(C`chmod\*(C'\fR, \f(CW\*(C`chown\*(C'\fR, \f(CW\*(C`chroot\*(C'\fR, \f(CW\*(C`exec\*(C'\fR, \f(CW\*(C`link\*(C'\fR, \f(CW\*(C`lstat\*(C'\fR, \f(CW\*(C`mkdir\*(C'\fR, +\&\f(CW\*(C`rename\*(C'\fR, \f(CW\*(C`rmdir\*(C'\fR, \f(CW\*(C`stat\*(C'\fR, \f(CW\*(C`symlink\*(C'\fR, \f(CW\*(C`truncate\*(C'\fR, \f(CW\*(C`unlink\*(C'\fR, \f(CW\*(C`utime\*(C'\fR, \f(CW\*(C`\-X\*(C'\fR +.IP \(bu 4 +\&\f(CW%ENV\fR +.IP \(bu 4 +\&\f(CW\*(C`glob\*(C'\fR (aka the \f(CW\*(C`<*>\*(C'\fR) +.IP \(bu 4 +\&\f(CW\*(C`open\*(C'\fR, \f(CW\*(C`opendir\*(C'\fR, \f(CW\*(C`sysopen\*(C'\fR +.IP \(bu 4 +\&\f(CW\*(C`qx\*(C'\fR (aka the backtick operator), \f(CW\*(C`system\*(C'\fR +.IP \(bu 4 +\&\f(CW\*(C`readdir\*(C'\fR, \f(CW\*(C`readlink\*(C'\fR +.SS "The ""Unicode Bug""" +.IX Subsection "The ""Unicode Bug""" +The term, "Unicode bug" has been applied to an inconsistency with the +code points in the \f(CW\*(C`Latin\-1 Supplement\*(C'\fR block, that is, between +128 and 255. Without a locale specified, unlike all other characters or +code points, these characters can have very different semantics +depending on the rules in effect. (Characters whose code points are +above 255 force Unicode rules; whereas the rules for ASCII characters +are the same under both ASCII and Unicode rules.) +.PP +Under Unicode rules, these upper\-Latin1 characters are interpreted as +Unicode code points, which means they have the same semantics as Latin\-1 +(ISO\-8859\-1) and C1 controls. +.PP +As explained in "ASCII Rules versus Unicode Rules", under ASCII rules, +they are considered to be unassigned characters. +.PP +This can lead to unexpected results. For example, a string's +semantics can suddenly change if a code point above 255 is appended to +it, which changes the rules from ASCII to Unicode. 
As an +example, consider the following program and its output: +.PP +.Vb 11 +\& $ perl \-le\*(Aq +\& no feature "unicode_strings"; +\& $s1 = "\exC2"; +\& $s2 = "\ex{2660}"; +\& for ($s1, $s2, $s1.$s2) { +\& print /\ew/ || 0; +\& } +\& \*(Aq +\& 0 +\& 0 +\& 1 +.Ve +.PP +If there's no \f(CW\*(C`\ew\*(C'\fR in \f(CW\*(C`s1\*(C'\fR nor in \f(CW\*(C`s2\*(C'\fR, why does their concatenation +have one? +.PP +This anomaly stems from Perl's attempt to not disturb older programs that +didn't use Unicode, along with Perl's desire to add Unicode support +seamlessly. But the result turned out to not be seamless. (By the way, +you can choose to be warned when things like this happen. See +\&\f(CW\*(C`encoding::warnings\*(C'\fR.) +.PP +\&\f(CW\*(C`use\ feature\ \*(Aqunicode_strings\*(Aq\*(C'\fR +was added, starting in Perl v5.12, to address this problem. It affects +these things: +.IP \(bu 4 +Changing the case of a scalar, that is, using \f(CWuc()\fR, \f(CWucfirst()\fR, \f(CWlc()\fR, +and \f(CWlcfirst()\fR, or \f(CW\*(C`\eL\*(C'\fR, \f(CW\*(C`\eU\*(C'\fR, \f(CW\*(C`\eu\*(C'\fR and \f(CW\*(C`\el\*(C'\fR in double-quotish +contexts, such as regular expression substitutions. +.Sp +Under \f(CW\*(C`unicode_strings\*(C'\fR starting in Perl 5.12.0, Unicode rules are +generally used. See "lc" in perlfunc for details on how this works +in combination with various other pragmas. +.IP \(bu 4 +Using caseless (\f(CW\*(C`/i\*(C'\fR) regular expression matching. +.Sp +Starting in Perl 5.14.0, regular expressions compiled within +the scope of \f(CW\*(C`unicode_strings\*(C'\fR use Unicode rules +even when executed or compiled into larger +regular expressions outside the scope. +.IP \(bu 4 +Matching any of several properties in regular expressions. 
+.Sp +These properties are \f(CW\*(C`\eb\*(C'\fR (without braces), \f(CW\*(C`\eB\*(C'\fR (without braces), +\&\f(CW\*(C`\es\*(C'\fR, \f(CW\*(C`\eS\*(C'\fR, \f(CW\*(C`\ew\*(C'\fR, \f(CW\*(C`\eW\*(C'\fR, and all the Posix character classes +\&\fIexcept\fR \f(CW\*(C`[[:ascii:]]\*(C'\fR. +.Sp +Starting in Perl 5.14.0, regular expressions compiled within +the scope of \f(CW\*(C`unicode_strings\*(C'\fR use Unicode rules +even when executed or compiled into larger +regular expressions outside the scope. +.IP \(bu 4 +In \f(CW\*(C`quotemeta\*(C'\fR or its inline equivalent \f(CW\*(C`\eQ\*(C'\fR. +.Sp +Starting in Perl 5.16.0, consistent quoting rules are used within the +scope of \f(CW\*(C`unicode_strings\*(C'\fR, as described in "quotemeta" in perlfunc. +Prior to that, or outside its scope, no code points above 127 are quoted +in UTF\-8 encoded strings, but in byte encoded strings, code points +between 128\-255 are always quoted. +.IP \(bu 4 +In the \f(CW\*(C`..\*(C'\fR or range operator. +.Sp +Starting in Perl 5.26.0, the range operator on strings treats their lengths +consistently within the scope of \f(CW\*(C`unicode_strings\*(C'\fR. Prior to that, or +outside its scope, it could produce strings whose length in characters +exceeded that of the right-hand side, where the right-hand side took up more +bytes than the correct range endpoint. +.IP \(bu 4 +In \f(CW\*(C`split\*(C'\fR's special-case whitespace splitting. +.Sp +Starting in Perl 5.28.0, the \f(CW\*(C`split\*(C'\fR function with a pattern specified as +a string containing a single space handles whitespace characters consistently +within the scope of \f(CW\*(C`unicode_strings\*(C'\fR. Prior to that, or outside its scope, +characters that are whitespace according to Unicode rules but not according to +ASCII rules were treated as field contents rather than field separators when +they appear in byte-encoded strings. 
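+.PP
+A minimal sketch of two of the effects listed above (case changing and
+\f(CW\*(C`\ew\*(C'\fR matching), assuming Perl 5.14 or later, an ASCII
+platform, and no locale in effect. The variable names are illustrative
+only. The test string is a single upper\-Latin1 character held as a native
+byte string, which is exactly the case the "Unicode bug" affects:

```perl
use strict;
use warnings;

my $s = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE, stored as one byte

my ($w_off, $uc_off, $w_on, $uc_on);
{
    # ASCII rules: the character is treated as unassigned.
    no feature 'unicode_strings';
    $w_off  = ($s =~ /\w/) ? 1 : 0;
    $uc_off = sprintf "%02X", ord uc $s;
}
{
    # Unicode rules: the same byte string is treated as U+00E9.
    use feature 'unicode_strings';
    $w_on  = ($s =~ /\w/) ? 1 : 0;
    $uc_on = sprintf "%02X", ord uc $s;
}

print "without unicode_strings: \\w=$w_off uc=$uc_off\n";    # \w=0 uc=E9
print "with unicode_strings:    \\w=$w_on uc=$uc_on\n";      # \w=1 uc=C9
```

+.PP
+Both blocks operate on the same scalar; only the lexical scope in which
+the pattern is compiled and \f(CWuc()\fR is called differs.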
+.PP +You can see from the above that the effect of \f(CW\*(C`unicode_strings\*(C'\fR +increased over several Perl releases. (And Perl's support for Unicode +continues to improve; it's best to use the latest available release in +order to get the most complete and accurate results possible.) Note that +\&\f(CW\*(C`unicode_strings\*(C'\fR is automatically chosen if you \f(CW\*(C`use\ v5.12\*(C'\fR or +higher. +.PP +For Perls earlier than those described above, or when a string is passed +to a function outside the scope of \f(CW\*(C`unicode_strings\*(C'\fR, see the next section. +.SS "Forcing Unicode in Perl (Or Unforcing Unicode in Perl)" +.IX Subsection "Forcing Unicode in Perl (Or Unforcing Unicode in Perl)" +Sometimes (see "When Unicode Does Not Happen" or "The "Unicode Bug"") +there are situations where you simply need to force a byte +string into UTF\-8, or vice versa. The standard module Encode can be +used for this, or the low-level calls +\&\f(CWutf8::upgrade($bytestring)\fR and +\&\f(CW\*(C`utf8::downgrade($utf8string[, FAIL_OK])\*(C'\fR. +.PP +Note that \f(CWutf8::downgrade()\fR can fail if the string contains characters +that don't fit into a byte. +.PP +Calling either function on a string that already is in the desired state is a +no-op. +.PP +"ASCII Rules versus Unicode Rules" gives all the ways that a string is +made to use Unicode rules. +.SS "Using Unicode in XS" +.IX Subsection "Using Unicode in XS" +See "Unicode Support" in perlguts for an introduction to Unicode at +the XS level, and "Unicode Support" in perlapi for the API details. +.SS "Hacking Perl to work on earlier Unicode versions (for very serious hackers only)" +.IX Subsection "Hacking Perl to work on earlier Unicode versions (for very serious hackers only)" +Perl by default comes with the latest supported Unicode version built-in, but +the goal is to allow you to change to use any earlier one. In Perls +v5.20 and v5.22, however, the earliest usable version is Unicode 5.1. 
+Perl v5.18 and v5.24 are able to handle all earlier versions. +.PP +Download the files in the desired version of Unicode from the Unicode web +site <https://www.unicode.org>. These should replace the existing files in +\&\fIlib/unicore\fR in the Perl source tree. Follow the instructions in +\&\fIREADME.perl\fR in that directory to change some of their names, and then build +perl (see INSTALL). +.SS "Porting code from perl\-5.6.X" +.IX Subsection "Porting code from perl-5.6.X" +Perls starting in 5.8 have a different Unicode model from 5.6. In 5.6 the +programmer was required to use the \f(CW\*(C`utf8\*(C'\fR pragma to declare that a +given scope expected to deal with Unicode data and had to make sure that +only Unicode data were reaching that scope. If you have code that is +working with 5.6, you will need some of the following adjustments to +your code. The examples are written such that the code will continue to +work under 5.6, so you should be safe to try them out. +.IP \(bu 3 +A filehandle that should read or write UTF\-8 +.Sp +.Vb 3 +\& if ($] > 5.008) { +\& binmode $fh, ":encoding(UTF\-8)"; +\& } +.Ve +.IP \(bu 3 +A scalar that is going to be passed to some extension +.Sp +Be it \f(CW\*(C`Compress::Zlib\*(C'\fR, \f(CW\*(C`Apache::Request\*(C'\fR or any extension that has no +mention of Unicode in the manpage, you need to make sure that the +UTF8 flag is stripped off. Note that at the time of this writing +(January 2012) the mentioned modules are not UTF\-8\-aware. Please +check the documentation to verify if this is still true.
+.Sp +.Vb 4 +\& if ($] > 5.008) { +\& require Encode; +\& $val = Encode::encode("UTF\-8", $val); # make octets +\& } +.Ve +.IP \(bu 3 +A scalar we got back from an extension +.Sp +If you believe the scalar comes back as UTF\-8, you will most likely +want the UTF8 flag restored: +.Sp +.Vb 4 +\& if ($] > 5.008) { +\& require Encode; +\& $val = Encode::decode("UTF\-8", $val); +\& } +.Ve +.IP \(bu 3 +Same thing, if you are really sure it is UTF\-8 +.Sp +.Vb 4 +\& if ($] > 5.008) { +\& require Encode; +\& Encode::_utf8_on($val); +\& } +.Ve +.IP \(bu 3 +A wrapper for DBI \f(CW\*(C`fetchrow_array\*(C'\fR and \f(CW\*(C`fetchrow_hashref\*(C'\fR +.Sp +When the database contains only UTF\-8, a wrapper function or method is +a convenient way to replace all your \f(CW\*(C`fetchrow_array\*(C'\fR and +\&\f(CW\*(C`fetchrow_hashref\*(C'\fR calls. A wrapper function will also make it easier to +adapt to future enhancements in your database driver. Note that at the +time of this writing (January 2012), the DBI has no standardized way +to deal with UTF\-8 data. Please check the DBI documentation to verify if +that is still true. +.Sp +.Vb 10 +\& sub fetchrow { +\& # $what is one of fetchrow_{array,hashref} +\& my($self, $sth, $what) = @_; +\& if ($] < 5.008) { +\& return $sth\->$what; +\& } else { +\& require Encode; +\& if (wantarray) { +\& my @arr = $sth\->$what; +\& for (@arr) { +\& defined && /[^\e000\-\e177]/ && Encode::_utf8_on($_); +\& } +\& return @arr; +\& } else { +\& my $ret = $sth\->$what; +\& if (ref $ret) { +\& for my $k (keys %$ret) { +\& defined +\& && /[^\e000\-\e177]/ +\& && Encode::_utf8_on($_) for $ret\->{$k}; +\& } +\& return $ret; +\& } else { +\& defined && /[^\e000\-\e177]/ && Encode::_utf8_on($_) for $ret; +\& return $ret; +\& } +\& } +\& } +\& } +.Ve +.IP \(bu 3 +A large scalar that you know can only contain ASCII +.Sp +Scalars that contain only ASCII and are marked as UTF\-8 are sometimes +a drag to your program. 
If you recognize such a situation, just remove +the UTF8 flag: +.Sp +.Vb 1 +\& utf8::downgrade($val) if $] > 5.008; +.Ve +.SH BUGS +.IX Header "BUGS" +See also "The "Unicode Bug"" above. +.SS "Interaction with Extensions" +.IX Subsection "Interaction with Extensions" +When Perl exchanges data with an extension, the extension should be +able to understand the UTF8 flag and act accordingly. If the +extension doesn't recognize that flag, it's likely that the extension +will return incorrectly-flagged data. +.PP +So if you're working with Unicode data, consult the documentation of +every module you're using to see whether it has any issues with Unicode data +exchange. If the documentation does not talk about Unicode at all, +suspect the worst and consider reading the source to learn how the +module is implemented. Modules written completely in Perl shouldn't +cause problems. Modules that directly or indirectly access code written +in other programming languages are at risk. +.PP +For affected functions, the simple strategy to avoid data corruption is +to always make the encoding of the exchanged data explicit. Choose an +encoding that you know the extension can handle. Convert arguments passed +to the extensions to that encoding and convert results back from that +encoding. Write wrapper functions that do the conversions for you, so +you can later change the functions when the extension catches up. +.PP +To provide an example, let's say the popular \f(CW\*(C`Foo::Bar::escape_html\*(C'\fR +function doesn't deal with Unicode data yet.
The wrapper function +would convert the argument to raw UTF\-8 and convert the result back to +Perl's internal representation like so: +.PP +.Vb 6 +\& sub my_escape_html ($) { +\& my($what) = shift; +\& return unless defined $what; +\& Encode::decode("UTF\-8", Foo::Bar::escape_html( +\& Encode::encode("UTF\-8", $what))); +\& } +.Ve +.PP +Sometimes, when the extension does not convert data but just stores +and retrieves it, you will be able to use the otherwise +dangerous \f(CWEncode::_utf8_on()\fR function. Let's say +the popular \f(CW\*(C`Foo::Bar\*(C'\fR extension, written in C, provides a \f(CW\*(C`param\*(C'\fR +method that lets you store and retrieve data according to these prototypes: +.PP +.Vb 2 +\& $self\->param($name, $value); # set a scalar +\& $value = $self\->param($name); # retrieve a scalar +.Ve +.PP +If it does not yet provide support for any encoding, one could write a +derived class with such a \f(CW\*(C`param\*(C'\fR method: +.PP +.Vb 12 +\& sub param { +\& my($self,$name,$value) = @_; +\& utf8::upgrade($name); # make sure it is UTF\-8 encoded +\& if (defined $value) { +\& utf8::upgrade($value); # make sure it is UTF\-8 encoded +\& return $self\->SUPER::param($name,$value); +\& } else { +\& my $ret = $self\->SUPER::param($name); +\& Encode::_utf8_on($ret); # we know, it is UTF\-8 encoded +\& return $ret; +\& } +\& } +.Ve +.PP +Some extensions provide filters on data entry/exit points, such as +\&\f(CW\*(C`DB_File::filter_store_key\*(C'\fR and family. Look out for such filters in +the documentation of your extensions; they can make the transition to +Unicode data much easier. +.SS Speed +.IX Subsection "Speed" +Some functions are slower when working on UTF\-8 encoded strings than +on byte encoded strings. All functions that need to hop over +characters such as \f(CWlength()\fR, \f(CWsubstr()\fR or \f(CWindex()\fR, or matching +regular expressions can work \fBmuch\fR faster when the underlying data are +byte-encoded. 
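One way to check this claim on a given Perl build is a small timing sketch like the one below (not part of the original manual page). It compares substr() on two strings holding the same characters, one byte-encoded and one upgraded to UTF\-8, using the core Benchmark module. The string contents, the offset and the labels are arbitrary choices made for illustration, and the measured ratio will vary with the Perl version and with the position caching described in this section.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same 200,000 characters in two internal representations.
my $bytes = "abc\xE9" x 50_000;    # byte-encoded
my $utf8  = $bytes;
utf8::upgrade($utf8);              # UTF-8 encoded internally

# The two strings still compare equal; only the storage differs.
die "strings differ" unless $bytes eq $utf8;

# Run each case for about one CPU second and print a comparison table.
cmpthese(-1, {
    byte_encoded => sub { my $c = substr($bytes, 150_000, 1) },
    utf8_encoded => sub { my $c = substr($utf8,  150_000, 1) },
});
```

If the two rows come out close, try index() or a regular expression match instead, since repeated access to one offset is the case the caching helps most.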
+.PP +In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 +a caching scheme was introduced which improved the situation. In general, +operations with UTF\-8 encoded strings are still slower. As an example, +the Unicode properties (character classes) like \f(CW\*(C`\ep{Nd}\*(C'\fR are known to +be quite a bit slower (5\-20 times) than their simpler counterparts +like \f(CW\*(C`[0\-9]\*(C'\fR (then again, there are hundreds of Unicode characters matching +\&\f(CW\*(C`Nd\*(C'\fR compared with the 10 ASCII characters matching \f(CW\*(C`[0\-9]\*(C'\fR). +.SH "SEE ALSO" +.IX Header "SEE ALSO" +perlunitut, perluniintro, perluniprops, Encode, open, utf8, bytes, +perlretut, "${^UNICODE}" in perlvar, +<https://www.unicode.org/reports/tr44>.