Diffstat (limited to 'upstream/mageia-cauldron/man1/perlunicode.1')
-rw-r--r-- upstream/mageia-cauldron/man1/perlunicode.1 2232
1 files changed, 2232 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man1/perlunicode.1 b/upstream/mageia-cauldron/man1/perlunicode.1
new file mode 100644
index 00000000..ead666c0
--- /dev/null
+++ b/upstream/mageia-cauldron/man1/perlunicode.1
@@ -0,0 +1,2232 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+. ds C` ""
+. ds C' ""
+'br\}
+.el\{\
+. ds C`
+. ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD. Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+. if \nF \{\
+. de IX
+. tm Index:\\$1\t\\n%\t"\\$2"
+..
+. if !\nF==2 \{\
+. nr % 0
+. nr F 2
+. \}
+. \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "PERLUNICODE 1"
+.TH PERLUNICODE 1 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification. Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+perlunicode \- Unicode support in Perl
+.SH DESCRIPTION
+.IX Header "DESCRIPTION"
+If you haven't already, before reading this document, you should become
+familiar with both perlunitut and perluniintro.
+.PP
+Unicode aims to \fBUNI\fR\-fy the en\-\fBCODE\fR\-ings of all the world's
+character sets into a single Standard. For quite a few of the various
+coding standards that existed when Unicode was first created, converting
+from each to Unicode essentially meant adding a constant to each code
+point in the original standard, and converting back meant just
+subtracting that same constant. For ASCII and ISO\-8859\-1, the constant
+is 0. For Cyrillic (ISO\-8859\-5), the constant is 864; for Hebrew
+(ISO\-8859\-8), it's 1264; for Thai (ISO\-8859\-11), 3424; and so forth. This
+made it easy to do the conversions, and facilitated the adoption of
+Unicode.
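+.PP
+As a rough illustration of the constant-offset idea, the sketch below adds
+the offsets quoted above for ISO\-8859\-5 and ISO\-8859\-11 to one sample
+byte from each encoding and compares the result with what the core
+\f(CWEncode\fR module returns. The helper name \f(CWby_offset\fR and the
+sample bytes are chosen here for illustration only, and the offsets apply
+to the letters in the upper half of each table, not to the ASCII range.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: convert one legacy byte value to a Unicode code
# point by plain addition, using the offsets quoted in the text.
sub by_offset {
    my ($encoding, $byte) = @_;
    my %offset = ('iso-8859-5' => 864, 'iso-8859-11' => 3424);
    return $byte + $offset{$encoding};
}

# Sample bytes: 0xB0 is CYRILLIC CAPITAL LETTER A in ISO-8859-5,
# 0xA1 is THAI CHARACTER KO KAI in ISO-8859-11.
for my $case (['iso-8859-5', 0xB0], ['iso-8859-11', 0xA1]) {
    my ($encoding, $byte) = @$case;
    my $via_encode = ord decode($encoding, chr $byte);
    printf "%s byte 0x%02X: offset gives U+%04X, Encode gives U+%04X\n",
        $encoding, $byte, by_offset($encoding, $byte), $via_encode;
}
```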
+.PP
+And it worked; nowadays, those legacy standards are rarely used. Most
+everyone uses Unicode.
+.PP
+Unicode is a comprehensive standard. It specifies many things outside
+the scope of Perl, such as how to display sequences of characters. For
+a full discussion of all aspects of Unicode, see
+<https://www.unicode.org>.
+.SS "Important Caveats"
+.IX Subsection "Important Caveats"
+Even though some of this section may not be understandable to you on
+first reading, we think it's important enough to highlight some of the
+gotchas before delving further, so here goes:
+.PP
+Unicode support is an extensive requirement. While Perl does not
+implement the Unicode standard or the accompanying technical reports
+from cover to cover, Perl does support many Unicode features.
+.PP
+Also, the use of Unicode may present security issues that aren't
+obvious; see "Security Implications of Unicode" below.
+.ie n .IP "Safest if you ""use feature \*(Aqunicode_strings\*(Aq""" 4
+.el .IP "Safest if you \f(CWuse feature \*(Aqunicode_strings\*(Aq\fR" 4
+.IX Item "Safest if you use feature unicode_strings"
+In order to preserve backward compatibility, Perl does not turn
+on full internal Unicode support unless the pragma
+\&\f(CW\*(C`use\ feature\ \*(Aqunicode_strings\*(Aq\*(C'\fR
+is specified. (This is automatically
+selected if you \f(CW\*(C`use\ v5.12\*(C'\fR or higher.) Failure to do this can
+produce surprising results. See "The "Unicode Bug"" below.
+.Sp
+This pragma doesn't affect I/O. Nor does it change the internal
+representation of strings, only their interpretation. There are still
+several places where Unicode isn't fully supported, such as in
+filenames.
+.IP "Input and Output Layers" 4
+.IX Item "Input and Output Layers"
+Use the \f(CW:encoding(...)\fR layer to read from and write to
+filehandles using the specified encoding. (See open.)
+.IP "You must convert your non-ASCII, non\-UTF\-8 Perl scripts to be UTF\-8." 4
+.IX Item "You must convert your non-ASCII, non-UTF-8 Perl scripts to be UTF-8."
+The encoding module has been deprecated since perl 5.18 and the
+perl internals it requires have been removed with perl 5.26.
+.ie n .IP """use utf8"" still needed to enable UTF\-8 in scripts" 4
+.el .IP "\f(CWuse utf8\fR still needed to enable UTF\-8 in scripts" 4
+.IX Item "use utf8 still needed to enable UTF-8 in scripts"
+If your Perl script is itself encoded in UTF\-8,
+the \f(CW\*(C`use\ utf8\*(C'\fR pragma must be explicitly included to enable
+recognition of that (in string or regular expression literals, or in
+identifier names). \fBThis is the only time when an explicit \fR\f(CB\*(C`use\ utf8\*(C'\fR\fB is needed.\fR (See utf8).
+.Sp
+If a Perl script begins with the bytes that form the UTF\-8 encoding of
+the Unicode BYTE ORDER MARK (\f(CW\*(C`BOM\*(C'\fR, see "Unicode Encodings"), those
+bytes are completely ignored.
+.IP "UTF\-16 scripts autodetected" 4
+.IX Item "UTF-16 scripts autodetected"
+If a Perl script begins with the Unicode \f(CW\*(C`BOM\*(C'\fR (UTF\-16LE,
+UTF\-16BE), or if the script looks like non\-\f(CW\*(C`BOM\*(C'\fR\-marked
+UTF\-16 of either endianness, Perl will correctly read in the script as
+the appropriate Unicode encoding.
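+.PP
+The following is a minimal sketch of the \f(CW:encoding(...)\fR layer
+mentioned above. To stay self-contained it reads from and writes to
+in-memory filehandles rather than real files; the helper names
+\f(CWread_utf8\fR and \f(CWwrite_utf8\fR are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: decode UTF-8 bytes by reading them through an
# :encoding layer attached to an in-memory filehandle.
sub read_utf8 {
    my ($bytes) = @_;
    open my $in, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
    my $text = do { local $/; <$in> };
    close $in;
    return $text;
}

# Hypothetical helper: encode characters by writing them through the layer.
sub write_utf8 {
    my ($text) = @_;
    my $bytes = '';
    open my $out, '>:encoding(UTF-8)', \$bytes or die "open failed: $!";
    print {$out} $text;
    close $out;
    return $bytes;
}

my $bytes = "caf\xC3\xA9";    # "caf" followed by the UTF-8 bytes for U+00E9
my $text  = read_utf8($bytes);

printf "%d bytes in, %d characters out\n", length $bytes, length $text;
printf "round trip ok: %s\n", write_utf8($text) eq $bytes ? 'yes' : 'no';
```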
+.SS "Byte and Character Semantics"
+.IX Subsection "Byte and Character Semantics"
+Before Unicode, most encodings used 8 bits (a single byte) to encode
+each character. Thus a character was a byte, and a byte was a
+character, and there could be only 256 or fewer possible characters.
+"Byte Semantics" in the title of this section refers to
+this behavior. There was no need to distinguish between "Byte" and
+"Character".
+.PP
+Then along comes Unicode which has room for over a million characters
+(and Perl allows for even more). This means that a character may
+require more than a single byte to represent it, and so the two terms
+are no longer equivalent. What matter are the characters as whole
+entities, and not usually the bytes that comprise them. That's what the
+term "Character Semantics" in the title of this section refers to.
+.PP
+Perl had to change internally to decouple "bytes" from "characters".
+It is important that you too change your ideas, if you haven't already,
+so that "byte" and "character" no longer mean the same thing in your
+mind.
+.PP
+The basic building block of Perl strings has always been a "character".
+The change essentially is that the implementation no longer
+assumes that a character is always just a single byte.
+.PP
+There are various things to note:
+.IP \(bu 4
+String handling functions, for the most part, continue to operate in
+terms of characters. \f(CWlength()\fR, for example, returns the number of
+characters in a string, just as before. But that number no longer is
+necessarily the same as the number of bytes in the string (there may be
+more bytes than characters). The other such functions include
+\&\f(CWchop()\fR, \f(CWchomp()\fR, \f(CWsubstr()\fR, \f(CWpos()\fR, \f(CWindex()\fR, \f(CWrindex()\fR,
+\&\f(CWsort()\fR, \f(CWsprintf()\fR, and \f(CWwrite()\fR.
+.Sp
+The exceptions are:
+.RS 4
+.IP \(bu 4
+the bit-oriented \f(CW\*(C`vec\*(C'\fR
+.Sp
+\&
+.IP \(bu 4
+the byte-oriented \f(CW\*(C`pack\*(C'\fR/\f(CW\*(C`unpack\*(C'\fR \f(CW"C"\fR format
+.Sp
+However, the \f(CW\*(C`W\*(C'\fR specifier does operate on whole characters, as does the
+\&\f(CW\*(C`U\*(C'\fR specifier.
+.IP \(bu 4
+some operators that interact with the platform's operating system
+.Sp
+Operators dealing with filenames are examples.
+.IP \(bu 4
+when the functions are called from within the scope of the
+\&\f(CW\*(C`use\ bytes\*(C'\fR pragma
+.Sp
+Likely, you should use this only for debugging anyway.
+.RE
+.RS 4
+.RE
+.IP \(bu 4
+Strings\-\-including hash keys\-\-and regular expression patterns may
+contain characters that have ordinal values larger than 255.
+.Sp
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within the literal strings in UTF\-8 encoding, or UTF\-16.
+(The former requires a \f(CW\*(C`use utf8\*(C'\fR, the latter may require a \f(CW\*(C`BOM\*(C'\fR.)
+.Sp
+"Creating Unicode" in perluniintro gives other ways to place non-ASCII
+characters in your strings.
+.IP \(bu 4
+The \f(CWchr()\fR and \f(CWord()\fR functions work on whole characters.
+.IP \(bu 4
+Regular expressions match whole characters. For example, \f(CW"."\fR matches
+a whole character instead of only a single byte.
+.IP \(bu 4
+The \f(CW\*(C`tr///\*(C'\fR operator translates whole characters. (Note that the
+\&\f(CW\*(C`tr///CU\*(C'\fR functionality has been removed. For similar functionality to
+that, see \f(CW\*(C`pack(\*(AqU0\*(Aq,\ ...)\*(C'\fR and \f(CW\*(C`pack(\*(AqC0\*(Aq,\ ...)\*(C'\fR).
+.IP \(bu 4
+\&\f(CW\*(C`scalar reverse()\*(C'\fR reverses by character rather than by byte.
+.IP \(bu 4
+The bit string operators, \f(CW\*(C`& | ^ ~\*(C'\fR and (starting in v5.22)
+\&\f(CW\*(C`&. |. ^. ~.\*(C'\fR can operate on bit strings encoded in UTF\-8, but this
+can give unexpected results if any of the strings contain code points
+above 0xFF. Starting in v5.28, it is a fatal error to have such an
+operand. Otherwise, the operation is performed on a non\-UTF\-8 copy of
+the operand. If you're not sure about the encoding of a string,
+downgrade it before using any of these operators; you can use
+\&\f(CWutf8::downgrade()\fR.
+.PP
+The bottom line is that Perl has always practiced "Character Semantics",
+but with the advent of Unicode, that is now different than "Byte
+Semantics".
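+.PP
+A short sketch of the character-versus-byte distinction described in this
+section. The helper \f(CWutf8_byte_count\fR is a name made up for this
+example; it simply encodes a copy of the string as UTF\-8 and measures it.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes the string occupies as UTF-8.
sub utf8_byte_count {
    my ($string) = @_;
    utf8::encode($string);    # modifies only the local copy
    return length $string;
}

my $string = "\x{263A}abc";   # WHITE SMILING FACE plus three ASCII letters

printf "characters:  %d\n", length $string;
printf "UTF-8 bytes: %d\n", utf8_byte_count($string);
printf "ord of first character: U+%04X\n", ord $string;

# scalar reverse() works by character, so the smiley ends up last, intact.
my $reversed = scalar reverse $string;
printf "last character after reverse: U+%04X\n", ord substr($reversed, -1);
```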
+.SS "ASCII Rules versus Unicode Rules"
+.IX Subsection "ASCII Rules versus Unicode Rules"
+Before Unicode, when a character was a byte was a character,
+Perl knew only about the 128 characters defined by ASCII, code points 0
+through 127 (except for under \f(CW\*(C`use\ locale\*(C'\fR). That
+left the code
+points 128 to 255 as unassigned, and available for whatever use a
+program might want. The only semantics they have is their ordinal
+numbers, and that they are members of none of the non-negative character
+classes. None are considered to match \f(CW\*(C`\ew\*(C'\fR for example, but all match
+\&\f(CW\*(C`\eW\*(C'\fR.
+.PP
+Unicode, of course, assigns each of those code points a particular
+meaning (along with ones above 255). To preserve backward
+compatibility, Perl only uses the Unicode meanings when there is some
+indication that Unicode is what is intended; otherwise the non-ASCII
+code points remain treated as if they are unassigned.
+.PP
+Here are the ways that Perl knows that a string should be treated as
+Unicode:
+.IP \(bu 4
+Within the scope of \f(CW\*(C`use\ utf8\*(C'\fR
+.Sp
+If the whole program is Unicode (signified by using 8\-bit \fBU\fRnicode
+\&\fBT\fRransformation \fBF\fRormat), then all literal strings within it must be
+Unicode.
+.IP \(bu 4
+Within the scope of
+\&\f(CW\*(C`use\ feature\ \*(Aqunicode_strings\*(Aq\*(C'\fR
+.Sp
+This pragma was created so you can explicitly tell Perl that operations
+executed within its scope are to use Unicode rules. More operations are
+affected with newer perls. See "The "Unicode Bug"".
+.IP \(bu 4
+Within the scope of \f(CW\*(C`use\ v5.12\*(C'\fR or higher
+.Sp
+This implicitly turns on \f(CW\*(C`use\ feature\ \*(Aqunicode_strings\*(Aq\*(C'\fR.
+.IP \(bu 4
+Within the scope of
+\&\f(CW\*(C`use\ locale\ \*(Aqnot_characters\*(Aq\*(C'\fR,
+or \f(CW\*(C`use\ locale\*(C'\fR and the current
+locale is a UTF\-8 locale.
+.Sp
+The former is defined to imply Unicode handling; and the latter
+indicates a Unicode locale, hence a Unicode interpretation of all
+strings within it.
+.IP \(bu 4
+When the string contains a Unicode-only code point
+.Sp
+Perl has never accepted code points above 255 without them being
+Unicode, so their use implies Unicode for the whole string.
+.IP \(bu 4
+When the string contains a Unicode named code point \f(CW\*(C`\eN{...}\*(C'\fR
+.Sp
+The \f(CW\*(C`\eN{...}\*(C'\fR construct explicitly refers to a Unicode code point,
+even if it is one that is also in ASCII. Therefore the string
+containing it must be Unicode.
+.IP \(bu 4
+When the string has come from an external source marked as
+Unicode
+.Sp
+The \f(CW\*(C`\-C\*(C'\fR command line option can
+specify that certain inputs to the program are Unicode, and the values
+of this can be read by your Perl code, see "${^UNICODE}" in perlvar.
+.IP \(bu 4
+When the string has been upgraded to UTF\-8
+.Sp
+The function \f(CWutf8::upgrade()\fR
+can be explicitly used to permanently (unless a subsequent
+\&\f(CWutf8::downgrade()\fR is called) cause a string to be treated as
+Unicode.
+.IP \(bu 4
+There are additional methods for regular expression patterns
+.Sp
+A pattern that is compiled with the \f(CW\*(C`/u\*(C'\fR or \f(CW\*(C`/a\*(C'\fR modifiers is
+treated as Unicode (though there are some restrictions with \f(CW\*(C`/a\*(C'\fR).
+Under the \f(CW\*(C`/d\*(C'\fR and \f(CW\*(C`/l\*(C'\fR modifiers, there are several other
+indications for Unicode; see "Character set modifiers" in perlre.
+.PP
+Note that all of the above are overridden within the scope of
+\&\f(CW\*(C`use bytes\*(C'\fR; but you should be using this pragma only for
+debugging.
+.PP
+Note also that some interactions with the platform's operating system
+never use Unicode rules.
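+.PP
+The sketch below tries a few of the indications listed above on the single
+character \f(CW"\exE9"\fR, which Unicode defines as LATIN SMALL LETTER E
+WITH ACUTE. The helper names are invented for this example, and the
+script assumes it is run without \f(CWuse\ locale\fR and without a
+\f(CWuse\ v5.12\fR (or higher) line, so that the first helper really sees
+the default rules.

```perl
use strict;
use warnings;

# A plain 8-bit string: nothing here says that Unicode is intended.
my $e_acute = "\xE9";

# Hypothetical helpers, one per way of signalling (or not) Unicode rules.
sub word_char_by_default { return $_[0] =~ /\w/ ? 1 : 0 }

sub word_char_with_feature {
    use feature 'unicode_strings';    # lexically scoped to this sub
    return $_[0] =~ /\w/ ? 1 : 0;
}

sub word_char_with_u { return $_[0] =~ /\w/u ? 1 : 0 }

sub word_char_when_upgraded {
    my ($copy) = @_;
    utf8::upgrade($copy);             # same characters, UTF-8 internally
    return $copy =~ /\w/ ? 1 : 0;
}

printf "default rules:          %d\n", word_char_by_default($e_acute);
printf "unicode_strings scope:  %d\n", word_char_with_feature($e_acute);
printf "/u modifier:            %d\n", word_char_with_u($e_acute);
printf "upgraded string:        %d\n", word_char_when_upgraded($e_acute);
```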
+.PP
+When Unicode rules are in effect:
+.IP \(bu 4
+Case translation operators use the Unicode case translation tables.
+.Sp
+Note that \f(CWuc()\fR, or \f(CW\*(C`\eU\*(C'\fR in interpolated strings, translates to
+uppercase, while \f(CW\*(C`ucfirst\*(C'\fR, or \f(CW\*(C`\eu\*(C'\fR in interpolated strings,
+translates to titlecase in languages that make the distinction (which is
+equivalent to uppercase in languages without the distinction).
+.Sp
+There is a CPAN module, \f(CW\*(C`Unicode::Casing\*(C'\fR, which allows you to
+define your own mappings to be used in \f(CWlc()\fR, \f(CWlcfirst()\fR, \f(CWuc()\fR,
+\&\f(CWucfirst()\fR, and \f(CW\*(C`fc\*(C'\fR (or their double-quoted string inlined versions
+such as \f(CW\*(C`\eU\*(C'\fR). (Prior to Perl 5.16, this functionality was partially
+provided in the Perl core, but suffered from a number of insurmountable
+drawbacks, so the CPAN module was written instead.)
+.IP \(bu 4
+Character classes in regular expressions match based on the character
+properties specified in the Unicode properties database.
+.Sp
+\&\f(CW\*(C`\ew\*(C'\fR can be used to match a Japanese ideograph, for instance; and
+\&\f(CW\*(C`[[:digit:]]\*(C'\fR a Bengali number.
+.IP \(bu 4
+Named Unicode properties, scripts, and block ranges may be used (like
+bracketed character classes) by using the \f(CW\*(C`\ep{}\*(C'\fR "matches property"
+construct and the \f(CW\*(C`\eP{}\*(C'\fR negation, "doesn't match property".
+.Sp
+See "Unicode Character Properties" for more details.
+.Sp
+You can define your own character properties and use them
+in the regular expression with the \f(CW\*(C`\ep{}\*(C'\fR or \f(CW\*(C`\eP{}\*(C'\fR construct.
+See "User-Defined Character Properties" for more details.
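+.PP
+The effects listed above can be sketched with three sample characters,
+all picked for this example. Each is above code point 255, so Unicode
+rules apply to it without any pragma.

```perl
use strict;
use warnings;

my $dz      = "\x{1C6}";    # LATIN SMALL LETTER DZ WITH CARON
my $kanji   = "\x{65E5}";   # a CJK ideograph
my $bengali = "\x{9E7}";    # BENGALI DIGIT ONE

# uc() maps to uppercase, ucfirst() to titlecase; for this letter the
# two results are different code points.
printf "uc      gives U+%04X\n", ord uc $dz;
printf "ucfirst gives U+%04X\n", ord ucfirst $dz;

# Character classes consult the Unicode properties database.
printf "ideograph matches \\w:              %s\n",
    $kanji =~ /\w/ ? 'yes' : 'no';
printf "Bengali digit matches [[:digit:]]: %s\n",
    $bengali =~ /[[:digit:]]/ ? 'yes' : 'no';
printf "Bengali digit matches [0-9]:       %s\n",
    $bengali =~ /[0-9]/ ? 'yes' : 'no';
```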
+.SS "Extended Grapheme Clusters (Logical characters)"
+.IX Subsection "Extended Grapheme Clusters (Logical characters)"
+Consider a character, say \f(CW\*(C`H\*(C'\fR. It could appear with various marks around it,
+such as an acute accent, or a circumflex, or various hooks, circles, arrows,
+\&\fIetc.\fR, above, below, to one side or the other, \fIetc\fR. There are many
+possibilities among the world's languages. The number of combinations is
+astronomical, and if there were a character for each combination, it would
+soon exhaust Unicode's more than a million possible characters. So Unicode
+took a different approach: there is a character for the base \f(CW\*(C`H\*(C'\fR, and a
+character for each of the possible marks, and these can be variously combined
+to get a final logical character. So a logical character\-\-what appears to be a
+single character\-\-can be a sequence of more than one individual characters.
+The Unicode standard calls these "extended grapheme clusters" (which
+is an improved version of the no-longer much used "grapheme cluster");
+Perl furnishes the \f(CW\*(C`\eX\*(C'\fR regular expression construct to match such
+sequences in their entirety.
+.PP
+But Unicode's intent is to unify the existing character set standards and
+practices, and several pre-existing standards have single characters that
+mean the same thing as some of these combinations, like ISO\-8859\-1,
+which has quite a few of them. For example, \f(CW"LATIN CAPITAL LETTER E
+WITH ACUTE"\fR was already in this standard when Unicode came along.
+Unicode therefore added it to its repertoire as that single character.
+But this character is considered by Unicode to be equivalent to the
+sequence consisting of the character \f(CW"LATIN CAPITAL LETTER E"\fR
+followed by the character \f(CW"COMBINING ACUTE ACCENT"\fR.
+.PP
+\&\f(CW"LATIN CAPITAL LETTER E WITH ACUTE"\fR is called a "pre-composed"
+character, and its equivalence with the "E" and "COMBINING ACUTE ACCENT"
+sequence is called canonical equivalence. All pre-composed characters
+are said to have a decomposition (into the equivalent sequence), and the
+decomposition type is also called canonical. A string may be composed
+as much as possible of pre-composed characters, or it may be composed
+entirely of decomposed characters. Unicode calls these, respectively,
+"Normalization Form Composed" (NFC) and "Normalization Form Decomposed" (NFD).
+The \f(CW\*(C`Unicode::Normalize\*(C'\fR module contains functions that convert
+between the two. A string may also have both composed characters and
+decomposed characters; this module can be used to make it all one or the
+other.
+.PP
+You may be presented with strings in any of these equivalent forms.
+There is currently nothing in Perl 5 that ignores the differences. So
+you'll have to specially handle it. The usual advice is to convert your
+inputs to \f(CW\*(C`NFD\*(C'\fR before processing further.
+.PP
+For more detailed information, see <http://unicode.org/reports/tr15/>.
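+.PP
+A minimal sketch of canonical equivalence, using the
+\f(CWUnicode::Normalize\fR module named above. The helper
+\f(CWcanonically_equal\fR is a name invented for this example; it
+follows the advice of normalizing both inputs to NFD before comparing.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{C9}";     # LATIN CAPITAL LETTER E WITH ACUTE
my $decomposed  = "E\x{301}";   # LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT

# Hypothetical helper: compare two strings up to canonical equivalence.
sub canonically_equal {
    my ($left, $right) = @_;
    return NFD($left) eq NFD($right) ? 1 : 0;
}

printf "eq without normalizing: %s\n",
    $precomposed eq $decomposed ? 'yes' : 'no';
printf "eq after NFD:           %s\n",
    canonically_equal($precomposed, $decomposed) ? 'yes' : 'no';
printf "code points: %d versus %d\n",
    length $precomposed, length $decomposed;

# \X consumes a whole extended grapheme cluster, however many code
# points it is made of.
my @clusters = $decomposed =~ /\X/g;
printf "grapheme clusters in the decomposed form: %d\n", scalar @clusters;
```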
+.SS "Unicode Character Properties"
+.IX Subsection "Unicode Character Properties"
+(The only time that Perl considers a sequence of individual code
+points as a single logical character is in the \f(CW\*(C`\eX\*(C'\fR construct, already
+mentioned above. Therefore "character" in this discussion means a single
+Unicode code point.)
+.PP
+Very nearly all Unicode character properties are accessible through
+regular expressions by using the \f(CW\*(C`\ep{}\*(C'\fR "matches property" construct
+and the \f(CW\*(C`\eP{}\*(C'\fR "doesn't match property" for its negation.
+.PP
+For instance, \f(CW\*(C`\ep{Uppercase}\*(C'\fR matches any single character with the Unicode
+\&\f(CW"Uppercase"\fR property, while \f(CW\*(C`\ep{L}\*(C'\fR matches any character whose
+\&\f(CW\*(C`General_Category\*(C'\fR property is \f(CW"L"\fR (letter) (see
+"General_Category" below). Braces are not
+required for single letter property names, so \f(CW\*(C`\ep{L}\*(C'\fR is equivalent to \f(CW\*(C`\epL\*(C'\fR.
+.PP
+More formally, \f(CW\*(C`\ep{Uppercase}\*(C'\fR matches any single character whose Unicode
+\&\f(CW\*(C`Uppercase\*(C'\fR property value is \f(CW\*(C`True\*(C'\fR, and \f(CW\*(C`\eP{Uppercase}\*(C'\fR matches any character
+whose \f(CW\*(C`Uppercase\*(C'\fR property value is \f(CW\*(C`False\*(C'\fR, and they could have been written as
+\&\f(CW\*(C`\ep{Uppercase=True}\*(C'\fR and \f(CW\*(C`\ep{Uppercase=False}\*(C'\fR, respectively.
+.PP
+This formality is needed when properties are not binary; that is, if they can
+take on more values than just \f(CW\*(C`True\*(C'\fR and \f(CW\*(C`False\*(C'\fR. For example, the
+\&\f(CW\*(C`Bidi_Class\*(C'\fR property (see "Bidirectional Character Types" below),
+can take on several different
+values, such as \f(CW\*(C`Left_To_Right\*(C'\fR, \f(CW\*(C`Right_To_Left\*(C'\fR, \f(CW\*(C`White_Space\*(C'\fR, and others. To match these, one needs
+to specify both the property name (\f(CW\*(C`Bidi_Class\*(C'\fR), AND the value being
+matched against
+(\f(CW\*(C`Left_To_Right\*(C'\fR, \f(CW\*(C`Right_To_Left\*(C'\fR, \fIetc.\fR). This is done, as in the examples above, by having the
+two components separated by an equal sign (or interchangeably, a colon), like
+\&\f(CW\*(C`\ep{Bidi_Class: Left_To_Right}\*(C'\fR.
+.PP
+All Unicode-defined character properties may be written in these compound forms
+of \f(CW\*(C`\ep{\fR\f(CIproperty\fR\f(CW=\fR\f(CIvalue\fR\f(CW}\*(C'\fR or \f(CW\*(C`\ep{\fR\f(CIproperty\fR\f(CW:\fR\f(CIvalue\fR\f(CW}\*(C'\fR, but Perl provides some
+additional properties that are written only in the single form, as well as
+single-form short-cuts for all binary properties and certain others described
+below, in which you may omit the property name and the equals or colon
+separator.
+.PP
+Most Unicode character properties have at least two synonyms (or aliases if you
+prefer): a short one that is easier to type and a longer one that is more
+descriptive and hence easier to understand. Thus the \f(CW"L"\fR and
+\&\f(CW"Letter"\fR properties above are equivalent and can be used
+interchangeably. Likewise, \f(CW"Upper"\fR is a synonym for \f(CW"Uppercase"\fR,
+and we could have written \f(CW\*(C`\ep{Uppercase}\*(C'\fR equivalently as \f(CW\*(C`\ep{Upper}\*(C'\fR.
+Also, there are typically various synonyms for the values the property
+can be. For binary properties, \f(CW"True"\fR has 3 synonyms: \f(CW"T"\fR,
+\&\f(CW"Yes"\fR, and \f(CW"Y"\fR; and \f(CW"False"\fR has correspondingly \f(CW"F"\fR,
+\&\f(CW"No"\fR, and \f(CW"N"\fR. But be careful. A short form of a value for one
+property may not mean the same thing as the short form spelled the same
+for another.
+Thus, for the \f(CW"General_Category"\fR property, \f(CW"L"\fR means
+\&\f(CW"Letter"\fR, but for the \f(CW\*(C`Bidi_Class\*(C'\fR
+property, \f(CW"L"\fR means \f(CW"Left_To_Right"\fR. A complete list of properties and
+synonyms is in perluniprops.
+.PP
+Upper/lower case differences in property names and values are irrelevant;
+thus \f(CW\*(C`\ep{Upper}\*(C'\fR means the same thing as \f(CW\*(C`\ep{upper}\*(C'\fR or even \f(CW\*(C`\ep{UpPeR}\*(C'\fR.
+Similarly, you can add or subtract underscores anywhere in the middle of a
+word, so that these are also equivalent to \f(CW\*(C`\ep{U_p_p_e_r}\*(C'\fR. And white space
+is generally irrelevant adjacent to non-word characters, such as the
+braces and the equals or colon separators, so \f(CW\*(C`\ep{ Upper }\*(C'\fR and
+\&\f(CW\*(C`\ep{ Upper_case : Y }\*(C'\fR are equivalent to these as well. In fact, white
+space and even hyphens can usually be added or deleted anywhere. So
+even \f(CW\*(C`\ep{ Up\-per case = Yes}\*(C'\fR is equivalent. All this is called
+"loose-matching" by Unicode. The "name" property has some restrictions
+on this due to a few outlier names. Full details are given in
+<https://www.unicode.org/reports/tr44/tr44\-24.html#UAX44\-LM2>.
+.PP
+The few places where stricter matching is
+used are in the middle of numbers, the "name" property, and in the Perl
+extension properties that begin or end with an underscore. Stricter
+matching cares about white space (except adjacent to non-word
+characters), hyphens, and non-interior underscores.
+.PP
+You can also use negation in both \f(CW\*(C`\ep{}\*(C'\fR and \f(CW\*(C`\eP{}\*(C'\fR by introducing a caret
+(\f(CW\*(C`^\*(C'\fR) between the first brace and the property name: \f(CW\*(C`\ep{^Tamil}\*(C'\fR is
+equal to \f(CW\*(C`\eP{Tamil}\*(C'\fR.
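+.PP
+The spellings discussed above can be compared side by side. This sketch
+uses only forms that appear in the text or follow directly from it; the
+helper \f(CWcount_matches\fR and the sample characters are chosen for
+illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: how many of the given patterns match the character.
sub count_matches {
    my ($char, @patterns) = @_;
    return scalar grep { $char =~ $_ } @patterns;
}

# Spellings that the text describes as equivalent to one another.
my @uppercase = (
    qr/\p{Uppercase}/,
    qr/\p{Uppercase=True}/,
    qr/\p{Uppercase: Y}/,
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\P{^Upper}/,          # a negated negation
);
my @not_uppercase = (
    qr/\P{Uppercase}/,
    qr/\p{Uppercase=False}/,
    qr/\p{^Upper}/,
);
my @letter = (
    qr/\p{L}/,
    qr/\pL/,
    qr/\p{Letter}/,
    qr/\p{General_Category=Letter}/,
);

# U+0391 is GREEK CAPITAL LETTER ALPHA.
for my $char ("A", "a", "\x{391}", "7") {
    printf "U+%04X: %d/%d uppercase, %d/%d not uppercase, %d/%d letter\n",
        ord $char,
        count_matches($char, @uppercase),     scalar @uppercase,
        count_matches($char, @not_uppercase), scalar @not_uppercase,
        count_matches($char, @letter),        scalar @letter;
}
```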
+.PP
+Almost all properties are immune to case-insensitive matching. That is,
+adding a \f(CW\*(C`/i\*(C'\fR regular expression modifier does not change what they
+match. There are two sets that are affected.
+The first set is
+\&\f(CW\*(C`Uppercase_Letter\*(C'\fR,
+\&\f(CW\*(C`Lowercase_Letter\*(C'\fR,
+and \f(CW\*(C`Titlecase_Letter\*(C'\fR,
+all of which match \f(CW\*(C`Cased_Letter\*(C'\fR under \f(CW\*(C`/i\*(C'\fR matching.
+And the second set is
+\&\f(CW\*(C`Uppercase\*(C'\fR,
+\&\f(CW\*(C`Lowercase\*(C'\fR,
+and \f(CW\*(C`Titlecase\*(C'\fR,
+all of which match \f(CW\*(C`Cased\*(C'\fR under \f(CW\*(C`/i\*(C'\fR matching.
+This set also includes its subsets \f(CW\*(C`PosixUpper\*(C'\fR and \f(CW\*(C`PosixLower\*(C'\fR both
+of which under \f(CW\*(C`/i\*(C'\fR match \f(CW\*(C`PosixAlpha\*(C'\fR.
+(The difference between these sets is that some things, such as Roman
+numerals, come in both upper and lower case so they are \f(CW\*(C`Cased\*(C'\fR, but
+aren't considered letters, so they aren't \f(CW\*(C`Cased_Letter\*(C'\fR's.)
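+.PP
+As a sketch of the two affected sets, the script below checks an ordinary
+lowercase letter and a lowercase Roman numeral with and without
+\f(CW\*(C`/i\*(C'\fR. The helper \f(CWmatches\fR and the choice of
+U+2170 (SMALL ROMAN NUMERAL ONE) as the sample are assumptions made for
+this example.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the pattern matches the character, else 0.
sub matches {
    my ($char, $pattern) = @_;
    return $char =~ $pattern ? 1 : 0;
}

# SMALL ROMAN NUMERAL ONE is cased, but its General_Category is
# Letter_Number, so it is not a Cased_Letter.
my $small_roman_one = "\x{2170}";

my @checks = (
    ['a       vs \p{Lu}',         'a',              qr/\p{Lu}/],
    ['a       vs \p{Lu}/i',       'a',              qr/\p{Lu}/i],
    ['numeral vs \p{Uppercase}',   $small_roman_one, qr/\p{Uppercase}/],
    ['numeral vs \p{Uppercase}/i', $small_roman_one, qr/\p{Uppercase}/i],
    ['numeral vs \p{Lu}/i',        $small_roman_one, qr/\p{Lu}/i],
);

for my $check (@checks) {
    my ($label, $char, $pattern) = @$check;
    printf "%-28s %s\n", $label, matches($char, $pattern) ? 'matches' : 'no match';
}
```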
+.PP
+See "Beyond Unicode code points" for special considerations when
+matching Unicode properties against non-Unicode code points.
+.PP
+\fR\f(BIGeneral_Category\fR\fI\fR
+.IX Subsection "General_Category"
+.PP
+Every Unicode character is assigned a general category, which is the "most
+usual categorization of a character" (from
+<https://www.unicode.org/reports/tr44>).
+.PP
+The compound way of writing these is like \f(CW\*(C`\ep{General_Category=Number}\*(C'\fR
+(short: \f(CW\*(C`\ep{gc:n}\*(C'\fR). But Perl furnishes shortcuts in which everything up
+through the equal or colon separator is omitted. So you can instead just write
+\&\f(CW\*(C`\epN\*(C'\fR.
+.PP
+Here are the short and long forms of the values the \f(CW\*(C`General_Category\*(C'\fR property
+can have:
+.PP
+.Vb 1
+\&    Short       Long
+\&
+\&    L           Letter
+\&    LC, L&      Cased_Letter (that is: [\ep{Ll}\ep{Lu}\ep{Lt}])
+\&    Lu          Uppercase_Letter
+\&    Ll          Lowercase_Letter
+\&    Lt          Titlecase_Letter
+\&    Lm          Modifier_Letter
+\&    Lo          Other_Letter
+\&
+\&    M           Mark
+\&    Mn          Nonspacing_Mark
+\&    Mc          Spacing_Mark
+\&    Me          Enclosing_Mark
+\&
+\&    N           Number
+\&    Nd          Decimal_Number (also Digit)
+\&    Nl          Letter_Number
+\&    No          Other_Number
+\&
+\&    P           Punctuation (also Punct)
+\&    Pc          Connector_Punctuation
+\&    Pd          Dash_Punctuation
+\&    Ps          Open_Punctuation
+\&    Pe          Close_Punctuation
+\&    Pi          Initial_Punctuation
+\&                (may behave like Ps or Pe depending on usage)
+\&    Pf          Final_Punctuation
+\&                (may behave like Ps or Pe depending on usage)
+\&    Po          Other_Punctuation
+\&
+\&    S           Symbol
+\&    Sm          Math_Symbol
+\&    Sc          Currency_Symbol
+\&    Sk          Modifier_Symbol
+\&    So          Other_Symbol
+\&
+\&    Z           Separator
+\&    Zs          Space_Separator
+\&    Zl          Line_Separator
+\&    Zp          Paragraph_Separator
+\&
+\&    C           Other
+\&    Cc          Control (also Cntrl)
+\&    Cf          Format
+\&    Cs          Surrogate
+\&    Co          Private_Use
+\&    Cn          Unassigned
+.Ve
+.PP
+Single-letter properties match all characters in any of the
+two-letter sub-properties starting with the same letter.
+\&\f(CW\*(C`LC\*(C'\fR and \f(CW\*(C`L&\*(C'\fR are special: both are aliases for the set consisting of everything matched by \f(CW\*(C`Ll\*(C'\fR, \f(CW\*(C`Lu\*(C'\fR, and \f(CW\*(C`Lt\*(C'\fR.
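+.PP
+To illustrate how a single-letter category covers its two-letter
+sub-categories, the sketch below classifies three number-like characters.
+The helper \f(CWnumber_subcategory\fR and the sample characters are
+chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: which two-letter Number sub-category a character
# belongs to, or 'none' if it is not in any of them.
sub number_subcategory {
    my ($char) = @_;
    my @subcategories = (
        [Nd => qr/\p{Nd}/],
        [Nl => qr/\p{Nl}/],
        [No => qr/\p{No}/],
    );
    for my $entry (@subcategories) {
        my ($name, $pattern) = @$entry;
        return $name if $char =~ $pattern;
    }
    return 'none';
}

my @samples = (
    ['DIGIT FIVE',                "5"],
    ['ROMAN NUMERAL EIGHT',       "\x{2167}"],
    ['VULGAR FRACTION ONE FIFTH', "\x{2155}"],
    ['LATIN CAPITAL LETTER A',    "A"],
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    # The compound form, its short form, and the single-letter shortcut
    # are three spellings of the same test.
    my $agree = ($char =~ /\p{General_Category=Number}/ ? 1 : 0)
             == ($char =~ /\p{gc:n}/ ? 1 : 0)
             && ($char =~ /\p{gc:n}/ ? 1 : 0) == ($char =~ /\pN/ ? 1 : 0);
    printf "%-26s \\pN: %-3s sub-category: %-4s spellings agree: %s\n",
        $name, ($char =~ /\pN/ ? 'yes' : 'no'),
        number_subcategory($char), ($agree ? 'yes' : 'no');
}
```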
+.PP
+\fR\f(BIBidirectional Character Types\fR\fI\fR
+.IX Subsection "Bidirectional Character Types"
+.PP
+Because scripts differ in their directionality (Hebrew and Arabic are
+written right to left, for example) Unicode supplies a \f(CW\*(C`Bidi_Class\*(C'\fR property.
+Some of the values this property can have are:
+.PP
+.Vb 1
+\&    Value       Meaning
+\&
+\&    L           Left\-to\-Right
+\&    LRE         Left\-to\-Right Embedding
+\&    LRO         Left\-to\-Right Override
+\&    R           Right\-to\-Left
+\&    AL          Arabic Letter
+\&    RLE         Right\-to\-Left Embedding
+\&    RLO         Right\-to\-Left Override
+\&    PDF         Pop Directional Format
+\&    EN          European Number
+\&    ES          European Separator
+\&    ET          European Terminator
+\&    AN          Arabic Number
+\&    CS          Common Separator
+\&    NSM         Non\-Spacing Mark
+\&    BN          Boundary Neutral
+\&    B           Paragraph Separator
+\&    S           Segment Separator
+\&    WS          Whitespace
+\&    ON          Other Neutrals
+.Ve
+.PP
+This property is always written in the compound form.
+For example, \f(CW\*(C`\ep{Bidi_Class:R}\*(C'\fR matches characters that are normally
+written right to left. Unlike the
+\&\f(CW"General_Category"\fR property, this
+property can have more values added in a future Unicode release. Those
+listed above comprised the complete set for many Unicode releases, but
+others were added in Unicode 6.3; you can always find what the
+current ones are in perluniprops. And
+<https://www.unicode.org/reports/tr9/> describes how to use them.
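+.PP
+A small sketch of matching on \f(CW\*(C`Bidi_Class\*(C'\fR in the
+compound form. The helper \f(CWbidi_class\fR is a name invented here, and
+it checks only five of the values from the table above.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of a few Bidi_Class values a
# character has, or 'other' if it has none of the ones checked.
sub bidi_class {
    my ($char) = @_;
    my @classes = (
        [L  => qr/\p{Bidi_Class:L}/],
        [R  => qr/\p{Bidi_Class:R}/],
        [AL => qr/\p{Bidi_Class:AL}/],
        [EN => qr/\p{Bidi_Class:EN}/],
        [WS => qr/\p{Bidi_Class:WS}/],
    );
    for my $class (@classes) {
        return $class->[0] if $char =~ $class->[1];
    }
    return 'other';
}

my @samples = (
    ['LATIN CAPITAL LETTER A', "A"],
    ['HEBREW LETTER ALEF',     "\x{5D0}"],
    ['ARABIC LETTER ALEF',     "\x{627}"],
    ['DIGIT SEVEN',            "7"],
    ['SPACE',                  " "],
);

printf "%-24s %s\n", $_->[0], bidi_class($_->[1]) for @samples;
```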
+.PP
+\fR\f(BIScripts\fR\fI\fR
+.IX Subsection "Scripts"
+.PP
+The world's languages are written in many different scripts. This sentence
+(unless you're reading it in translation) is written in Latin, while Russian is
+written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
+Hiragana or Katakana. There are many more.
+.PP
+The Unicode \f(CW\*(C`Script\*(C'\fR and \f(CW\*(C`Script_Extensions\*(C'\fR properties give what
+script a given character is in. The \f(CW\*(C`Script_Extensions\*(C'\fR property is an
+improved version of \f(CW\*(C`Script\*(C'\fR, as demonstrated below. Either property
+can be specified with the compound form like
+\&\f(CW\*(C`\ep{Script=Hebrew}\*(C'\fR (short: \f(CW\*(C`\ep{sc=hebr}\*(C'\fR), or
+\&\f(CW\*(C`\ep{Script_Extensions=Javanese}\*(C'\fR (short: \f(CW\*(C`\ep{scx=java}\*(C'\fR).
+In addition, Perl furnishes shortcuts for all
+\&\f(CW\*(C`Script_Extensions\*(C'\fR property names. You can omit everything up through
+the equals (or colon), and simply write \f(CW\*(C`\ep{Latin}\*(C'\fR or \f(CW\*(C`\eP{Cyrillic}\*(C'\fR.
+(This is not true for \f(CW\*(C`Script\*(C'\fR, which is required to be
+written in the compound form. Prior to Perl v5.26, the single form
+returned the plain old \f(CW\*(C`Script\*(C'\fR version, but was changed because
+\&\f(CW\*(C`Script_Extensions\*(C'\fR gives better results.)
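+.PP
+The spellings just described can be tried on a single character. In this
+sketch the sample, HEBREW LETTER ALEF, is chosen for illustration; it
+belongs to one script only, so the \f(CW\*(C`Script\*(C'\fR and
+\&\f(CW\*(C`Script_Extensions\*(C'\fR forms agree for it.

```perl
use strict;
use warnings;

my $alef = "\x{5D0}";   # HEBREW LETTER ALEF

# Compound forms, their short forms, and the single-form shortcuts.
my @spellings = (
    ['\p{Script=Hebrew}',            qr/\p{Script=Hebrew}/],
    ['\p{sc=hebr}',                  qr/\p{sc=hebr}/],
    ['\p{Script_Extensions=Hebrew}', qr/\p{Script_Extensions=Hebrew}/],
    ['\p{scx=hebr}',                 qr/\p{scx=hebr}/],
    ['\p{Hebrew}',                   qr/\p{Hebrew}/],
    ['\P{Cyrillic}',                 qr/\P{Cyrillic}/],
);

for my $spelling (@spellings) {
    my ($label, $pattern) = @$spelling;
    printf "%-30s %s\n", $label, $alef =~ $pattern ? 'matches' : 'no match';
}
```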
+.PP
+The difference between these two properties involves characters that are
+used in multiple scripts. For example the digits '0' through '9' are
+used in many parts of the world. These are placed in a script named
+\&\f(CW\*(C`Common\*(C'\fR. Other characters are used in just a few scripts. For
+example, the \f(CW"KATAKANA\-HIRAGANA DOUBLE HYPHEN"\fR is used in both Japanese
+scripts, Katakana and Hiragana, but nowhere else. The \f(CW\*(C`Script\*(C'\fR
+property places all characters that are used in multiple scripts in the
+\&\f(CW\*(C`Common\*(C'\fR script, while the \f(CW\*(C`Script_Extensions\*(C'\fR property places those
+that are used in only a few scripts into each of those scripts; while
+still using \f(CW\*(C`Common\*(C'\fR for those used in many scripts. Thus both these
+match:
+.PP
+.Vb 2
+\& "0" =~ /\ep{sc=Common}/ # Matches
+\& "0" =~ /\ep{scx=Common}/ # Matches
+.Ve
+.PP
+and only the first of these matches:
+.PP
+.Vb 2
+\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{sc=Common}/  # Matches
+\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{scx=Common}/ # No match
+.Ve
+.PP
+And only the last two of these match:
+.PP
+.Vb 4
+\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{sc=Hiragana}/  # No match
+\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{sc=Katakana}/  # No match
+\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{scx=Hiragana}/ # Matches
+\& "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}" =~ /\ep{scx=Katakana}/ # Matches
+.Ve
+.PP
+\&\f(CW\*(C`Script_Extensions\*(C'\fR is thus an improved \f(CW\*(C`Script\*(C'\fR, in which there are
+fewer characters in the \f(CW\*(C`Common\*(C'\fR script, and correspondingly more in
+other scripts. It is new in Unicode version 6.0, and its data are likely
+to change significantly in later releases, as things get sorted out.
+New code should probably be using \f(CW\*(C`Script_Extensions\*(C'\fR and not plain
+\&\f(CW\*(C`Script\*(C'\fR. If you compile perl with a Unicode release that doesn't have
+\&\f(CW\*(C`Script_Extensions\*(C'\fR, the single form Perl extensions will instead refer
+to the plain \f(CW\*(C`Script\*(C'\fR property. If you compile with a version of
+Unicode that doesn't have the \f(CW\*(C`Script\*(C'\fR property, these extensions will
+not be defined at all.
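+.PP
+As a rough, self-contained sketch of the distinction, the comparisons
+above can be run directly. The variable name \f(CW$hyphen\fR is ours, purely
+for illustration, and a perl built with Unicode 6.0 or later data is
+assumed:
+.PP
+.Vb 11
+\& use strict;
+\& use warnings;
+\&
+\& my $hyphen = "\eN{KATAKANA\-HIRAGANA DOUBLE HYPHEN}";
+\&
+\& # Script files the shared character under Common ...
+\& print "sc=Common\en" if $hyphen =~ /\ep{sc=Common}/;
+\& # ... while Script_Extensions files it under each script that uses it.
+\& print "scx=Hiragana\en" if $hyphen =~ /\ep{scx=Hiragana}/;
+\& print "scx=Katakana\en" if $hyphen =~ /\ep{scx=Katakana}/;
+\& print "not scx=Common\en" if $hyphen !~ /\ep{scx=Common}/;
+.Ve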
+.PP
+(Actually, besides \f(CW\*(C`Common\*(C'\fR, the \f(CW\*(C`Inherited\*(C'\fR script contains
+characters that are used in multiple scripts. These are modifier
+characters which inherit the script value
+of the controlling character. Some of these are used in many scripts,
+and so go into \f(CW\*(C`Inherited\*(C'\fR in both \f(CW\*(C`Script\*(C'\fR and \f(CW\*(C`Script_Extensions\*(C'\fR.
+Others are used in just a few scripts, so are in \f(CW\*(C`Inherited\*(C'\fR in
+\&\f(CW\*(C`Script\*(C'\fR, but not in \f(CW\*(C`Script_Extensions\*(C'\fR.)
+.PP
+It is worth stressing that there are several different sets of digits in
+Unicode that are equivalent to 0\-9 and are matchable by \f(CW\*(C`\ed\*(C'\fR in a
+regular expression. If they are used in a single script only, they
+are in that script for both \f(CW\*(C`Script\*(C'\fR and \f(CW\*(C`Script_Extensions\*(C'\fR. If they are
+used in more than one script, they will be in \f(CW\*(C`sc=Common\*(C'\fR, but only
+if they are used in many scripts should they be in \f(CW\*(C`scx=Common\*(C'\fR.
+.PP
+The explanation above has omitted some detail; refer to UAX#24 "Unicode
+Script Property": <https://www.unicode.org/reports/tr24>.
+.PP
+A complete list of scripts and their shortcuts is in perluniprops.
+.PP
+\fR\f(BIUse of the \fR\f(CB"Is"\fR\f(BI Prefix\fR\fI\fR
+.IX Subsection "Use of the ""Is"" Prefix"
+.PP
+For backward compatibility (with ancient Perl 5.6), all properties writable
+without using the compound form mentioned
+so far may have \f(CW\*(C`Is\*(C'\fR or \f(CW\*(C`Is_\*(C'\fR prepended to their name, so \f(CW\*(C`\eP{Is_Lu}\*(C'\fR, for
+example, is equal to \f(CW\*(C`\eP{Lu}\*(C'\fR, and \f(CW\*(C`\ep{IsScript:Arabic}\*(C'\fR is equal to
+\&\f(CW\*(C`\ep{Arabic}\*(C'\fR.
+.PP
+\fR\f(BIBlocks\fR\fI\fR
+.IX Subsection "Blocks"
+.PP
+In addition to \fBscripts\fR, Unicode also defines \fBblocks\fR of
+characters. The difference between scripts and blocks is that the
+concept of scripts is closer to natural languages, while the concept
+of blocks is more of an artificial grouping based on groups of Unicode
+characters with consecutive ordinal values. For example, the \f(CW"Basic Latin"\fR
+block is all the characters whose ordinals are between 0 and 127, inclusive; in
+other words, the ASCII characters. The \f(CW"Latin"\fR script contains some letters
+from this as well as several other blocks, like \f(CW"Latin\-1 Supplement"\fR,
+\&\f(CW"Latin Extended\-A"\fR, \fIetc.\fR, but it does not contain all the characters from
+those blocks. It does not, for example, contain the digits 0\-9, because
+those digits are shared across many scripts, and hence are in the
+\&\f(CW\*(C`Common\*(C'\fR script.
+.PP
+For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
+<https://www.unicode.org/reports/tr24>
+.PP
+The \f(CW\*(C`Script_Extensions\*(C'\fR or \f(CW\*(C`Script\*(C'\fR properties are likely to be the
+ones you want to use when processing
+natural language; the \f(CW\*(C`Block\*(C'\fR property may occasionally be useful in working
+with the nuts and bolts of Unicode.
+.PP
+Block names are matched in the compound form, like \f(CW\*(C`\ep{Block: Arrows}\*(C'\fR or
+\&\f(CW\*(C`\ep{Blk=Hebrew}\*(C'\fR. Unlike most other properties, only a few block names have a
+Unicode-defined short name.
+.PP
+Perl also defines single form synonyms for the block property in cases
+where these do not conflict with something else. But don't use any of
+these, because they are unstable. Since these are Perl extensions, they
+are subordinate to official Unicode property names; Unicode doesn't know
+nor care about Perl's extensions. It may happen that a name that
+currently means the Perl extension will later be changed without warning
+to mean a different Unicode property in a future version of the perl
+interpreter that uses a later Unicode release, and your code would no
+longer work. The extensions are mentioned here for completeness: Take
+the block name and prefix it with one of: \f(CW\*(C`In\*(C'\fR (for example
+\&\f(CW\*(C`\ep{Blk=Arrows}\*(C'\fR can currently be written as \f(CW\*(C`\ep{In_Arrows}\*(C'\fR); or
+sometimes \f(CW\*(C`Is\*(C'\fR (like \f(CW\*(C`\ep{Is_Arrows}\*(C'\fR); or sometimes no prefix at all
+(\f(CW\*(C`\ep{Arrows}\*(C'\fR). As of this writing (Unicode 9.0) there are no
+conflicts with using the \f(CW\*(C`In_\*(C'\fR prefix, but there are plenty with the
+other two forms. For example, \f(CW\*(C`\ep{Is_Hebrew}\*(C'\fR and \f(CW\*(C`\ep{Hebrew}\*(C'\fR mean
+\&\f(CW\*(C`\ep{Script_Extensions=Hebrew}\*(C'\fR which is NOT the same thing as
+\&\f(CW\*(C`\ep{Blk=Hebrew}\*(C'\fR. Our
+advice used to be to use the \f(CW\*(C`In_\*(C'\fR prefix as a single form way of
+specifying a block. But Unicode 8.0 added properties whose names begin
+with \f(CW\*(C`In\*(C'\fR, and it's now clear that it's only luck that's so far
+prevented a conflict. Using \f(CW\*(C`In\*(C'\fR is only marginally less typing than
+\&\f(CW\*(C`Blk:\*(C'\fR, and the latter's meaning is clearer anyway, and guaranteed to
+never conflict. So don't take chances. Use \f(CW\*(C`\ep{Blk=foo}\*(C'\fR for new
+code. And be sure that a block is really what you want to match. In
+most cases scripts are what you want instead.
+.PP
+A complete list of blocks is in perluniprops.
+.PP
+\fR\f(BIOther Properties\fR\fI\fR
+.IX Subsection "Other Properties"
+.PP
+There are many more properties than the very basic ones described here.
+A complete list is in perluniprops.
+.PP
+Unicode defines all its properties in the compound form, so all single-form
+properties are Perl extensions. Most of these are just synonyms for the
+Unicode ones, but some are genuine extensions, including several that are in
+the compound form. And quite a few of these are actually recommended by Unicode
+(in <https://www.unicode.org/reports/tr18>).
+.PP
+This section gives some details on all extensions that aren't just
+synonyms for compound-form Unicode properties
+(for those properties, you'll have to refer to the
+Unicode Standard <https://www.unicode.org/reports/tr44>).
+.ie n .IP "\fR\fB""\ep{All}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{All}\fR\fB\fR 4
+.IX Item "p{All}"
+This matches every possible code point. It is equivalent to \f(CW\*(C`qr/./s\*(C'\fR.
+Unlike all the other non-user-defined \f(CW\*(C`\ep{}\*(C'\fR property matches, no
+warning is ever generated if this property is matched against a
+non-Unicode code point (see "Beyond Unicode code points" below).
+.ie n .IP "\fR\fB""\ep{Alnum}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{Alnum}\fR\fB\fR 4
+.IX Item "p{Alnum}"
+This matches any \f(CW\*(C`\ep{Alphabetic}\*(C'\fR or \f(CW\*(C`\ep{Decimal_Number}\*(C'\fR character.
+.ie n .IP "\fR\fB""\ep{Any}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{Any}\fR\fB\fR 4
+.IX Item "p{Any}"
+This matches any of the 1_114_112 Unicode code points. It is a synonym
+for \f(CW\*(C`\ep{Unicode}\*(C'\fR.
+.ie n .IP "\fR\fB""\ep{ASCII}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{ASCII}\fR\fB\fR 4
+.IX Item "p{ASCII}"
+This matches any of the 128 characters in the US-ASCII character set,
+which is a subset of Unicode.
+.ie n .IP "\fR\fB""\ep{Assigned}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{Assigned}\fR\fB\fR 4
+.IX Item "p{Assigned}"
+This matches any assigned code point; that is, any code point whose general
+category is not \f(CW\*(C`Unassigned\*(C'\fR (or equivalently, not \f(CW\*(C`Cn\*(C'\fR).
+.ie n .IP "\fR\fB""\ep{Blank}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{Blank}\fR\fB\fR 4
+.IX Item "p{Blank}"
+This is the same as \f(CW\*(C`\eh\*(C'\fR and \f(CW\*(C`\ep{HorizSpace}\*(C'\fR: A character that changes the
+spacing horizontally.
+.ie n .IP "\fR\fB""\ep{Decomposition_Type: Non_Canonical}""\fR\fB\fR (Short: ""\ep{Dt=NonCanon}"")" 4
+.el .IP "\fR\f(CB\ep{Decomposition_Type: Non_Canonical}\fR\fB\fR (Short: \f(CW\ep{Dt=NonCanon}\fR)" 4
+.IX Item "p{Decomposition_Type: Non_Canonical} (Short: p{Dt=NonCanon})"
+Matches a character that has any of the non-canonical decomposition
+types. Canonical decompositions are introduced in the
+"Extended Grapheme Clusters (Logical characters)" section above.
+However, many more characters have a different type of decomposition,
+generically called "compatible" decompositions, or "non-canonical". The
+sequences that form these decompositions are not considered canonically
+equivalent to the pre-composed character. An example is the
+\&\f(CW"SUPERSCRIPT ONE"\fR. It is somewhat like a regular digit 1, but not
+exactly; its decomposition into the digit 1 is called a "compatible"
+decomposition, specifically a "super" (for "superscript") decomposition.
+There are several such compatibility decompositions (see
+<https://www.unicode.org/reports/tr44>). \f(CW\*(C`\ep{Dt:\ Non_Canon}\*(C'\fR is a
+Perl extension that uses just one name to refer to the union of all of
+them.
+.Sp
+Most Unicode characters don't have a decomposition, so their
+decomposition type is \f(CW"None"\fR. Hence, \f(CW\*(C`Non_Canonical\*(C'\fR is equivalent
+to
+.Sp
+.Vb 1
+\& qr/(?[ \eP{DT=Canonical} \- \ep{DT=None} ])/
+.Ve
+.Sp
+(Note that one of the non-canonical decompositions is named "compat",
+which could perhaps have been better named "miscellaneous". It includes
+just the things that Unicode couldn't figure out a better generic name
+for.)
+.ie n .IP "\fR\fB""\ep{Graph}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{Graph}\fR\fB\fR 4
+.IX Item "p{Graph}"
+Matches any character that is graphic. Theoretically, this means a character
+that on a printer would cause ink to be used.
+.ie n .IP "\fR\fB""\ep{HorizSpace}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{HorizSpace}\fR\fB\fR 4
+.IX Item "p{HorizSpace}"
+This is the same as \f(CW\*(C`\eh\*(C'\fR and \f(CW\*(C`\ep{Blank}\*(C'\fR: a character that changes the
+spacing horizontally.
+.ie n .IP "\fR\fB""\ep{In=*}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{In=*}\fR\fB\fR 4
+.IX Item "p{In=*}"
+This is a synonym for \f(CW\*(C`\ep{Present_In=*}\*(C'\fR.
+.ie n .IP "\fR\fB""\ep{PerlSpace}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{PerlSpace}\fR\fB\fR 4
+.IX Item "p{PerlSpace}"
+This is the same as \f(CW\*(C`\es\*(C'\fR, restricted to ASCII, namely \f(CW\*(C`[\ \ef\en\er\et]\*(C'\fR
+and, starting in Perl v5.18, a vertical tab.
+.Sp
+Mnemonic: Perl's (original) space
+.ie n .IP "\fR\fB""\ep{PerlWord}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{PerlWord}\fR\fB\fR 4
+.IX Item "p{PerlWord}"
+This is the same as \f(CW\*(C`\ew\*(C'\fR, restricted to ASCII, namely \f(CW\*(C`[A\-Za\-z0\-9_]\*(C'\fR.
+.Sp
+Mnemonic: Perl's (original) word.
+.ie n .IP "\fR\fB""\ep{Posix...}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{Posix...}\fR\fB\fR 4
+.IX Item "p{Posix...}"
+There are several of these, which are equivalents, using the \f(CW\*(C`\ep{}\*(C'\fR
+notation, for Posix classes and are described in
+"POSIX Character Classes" in perlrecharclass.
+.ie n .IP "\fR\fB""\ep{Present_In: *}""\fR\fB\fR (Short: ""\ep{In=*}"")" 4
+.el .IP "\fR\f(CB\ep{Present_In: *}\fR\fB\fR (Short: \f(CW\ep{In=*}\fR)" 4
+.IX Item "p{Present_In: *} (Short: p{In=*})"
+This property is used when you need to know in what Unicode version(s) a
+character is.
+.Sp
+The "*" above stands for some Unicode version number, such as
+\&\f(CW1.1\fR or \f(CW12.0\fR; or the "*" can also be \f(CW\*(C`Unassigned\*(C'\fR. This property will
+match the code points whose final disposition has been settled as of the
+Unicode release given by the version number; \f(CW\*(C`\ep{Present_In: Unassigned}\*(C'\fR
+will match those code points whose meaning has yet to be assigned.
+.Sp
+For example, \f(CW\*(C`U+0041\*(C'\fR \f(CW"LATIN CAPITAL LETTER A"\fR was present in the very first
+Unicode release available, which is \f(CW1.1\fR, so this property is true for all
+valid "*" versions. On the other hand, \f(CW\*(C`U+1EFF\*(C'\fR was not assigned until version
+5.1 when it became \f(CW"LATIN SMALL LETTER Y WITH LOOP"\fR, so the only "*" values that
+match it are 5.1, 5.2, and later.
+.Sp
+Unicode furnishes the \f(CW\*(C`Age\*(C'\fR property from which this is derived. The problem
+with Age is that a strict interpretation of it (which Perl takes) has it
+matching the precise release a code point's meaning is introduced in. Thus
+\&\f(CW\*(C`U+0041\*(C'\fR would match only 1.1; and \f(CW\*(C`U+1EFF\*(C'\fR only 5.1. This is not usually what
+you want.
+.Sp
+Some non-Perl implementations of the Age property may change its meaning to be
+the same as the Perl \f(CW\*(C`Present_In\*(C'\fR property; just be aware of that.
+.Sp
+Another confusion with both these properties is that the definition is not
+that the code point has been \fIassigned\fR, but that the meaning of the code point
+has been \fIdetermined\fR. This is because 66 code points will always be
+unassigned, and so the \f(CW\*(C`Age\*(C'\fR for them is the Unicode version in which the decision
+to make them so was made. For example, \f(CW\*(C`U+FDD0\*(C'\fR is to be permanently
+unassigned to a character, and the decision to do that was made in version 3.1,
+so \f(CW\*(C`\ep{Age=3.1}\*(C'\fR matches this character, as also does \f(CW\*(C`\ep{Present_In: 3.1}\*(C'\fR and up.
+.ie n .IP "\fR\fB""\ep{Print}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{Print}\fR\fB\fR 4
+.IX Item "p{Print}"
+This matches any character that is graphical or blank, except controls.
+.ie n .IP "\fR\fB""\ep{SpacePerl}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{SpacePerl}\fR\fB\fR 4
+.IX Item "p{SpacePerl}"
+This is the same as \f(CW\*(C`\es\*(C'\fR, including beyond ASCII.
+.Sp
+Mnemonic: Space, as modified by Perl. (Until v5.18, it didn't include the
+vertical tab, which both the Posix standard and Unicode consider white space.)
+.ie n .IP "\fR\fB""\ep{Title}""\fR\fB\fR and \fB\fR\fB""\ep{Titlecase}""\fR\fB\fR" 4
+.el .IP "\fR\f(CB\ep{Title}\fR\fB\fR and \fB\fR\f(CB\ep{Titlecase}\fR\fB\fR" 4
+.IX Item "p{Title} and p{Titlecase}"
+Under case-sensitive matching, these both match the same code points as
+\&\f(CW\*(C`\ep{General Category=Titlecase_Letter}\*(C'\fR (\f(CW\*(C`\ep{gc=lt}\*(C'\fR). The difference
+is that under \f(CW\*(C`/i\*(C'\fR caseless matching, these match the same as
+\&\f(CW\*(C`\ep{Cased}\*(C'\fR, whereas \f(CW\*(C`\ep{gc=lt}\*(C'\fR matches \f(CW\*(C`\ep{Cased_Letter}\*(C'\fR.
+.ie n .IP "\fR\fB""\ep{Unicode}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{Unicode}\fR\fB\fR 4
+.IX Item "p{Unicode}"
+This matches any of the 1_114_112 Unicode code points. It is a synonym
+for \f(CW\*(C`\ep{Any}\*(C'\fR.
+.ie n .IP "\fR\fB""\ep{VertSpace}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{VertSpace}\fR\fB\fR 4
+.IX Item "p{VertSpace}"
+This is the same as \f(CW\*(C`\ev\*(C'\fR: A character that changes the spacing vertically.
+.ie n .IP "\fR\fB""\ep{Word}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{Word}\fR\fB\fR 4
+.IX Item "p{Word}"
+This is the same as \f(CW\*(C`\ew\*(C'\fR, including over 100_000 characters beyond ASCII.
+.ie n .IP "\fR\fB""\ep{XPosix...}""\fR\fB\fR" 4
+.el .IP \fR\f(CB\ep{XPosix...}\fR\fB\fR 4
+.IX Item "p{XPosix...}"
+There are several of these, which are the standard Posix classes
+extended to the full Unicode range. They are described in
+"POSIX Character Classes" in perlrecharclass.
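+.PP
+To make the difference between \f(CW\*(C`Age\*(C'\fR and \f(CW\*(C`Present_In\*(C'\fR described
+above concrete, here is a short sketch. The variable name \f(CW$y_loop\fR is
+ours, and a perl built with Unicode 5.2 or later data is assumed:
+.PP
+.Vb 12
+\& use strict;
+\& use warnings;
+\&
+\& my $y_loop = "\ex{1EFF}"; # LATIN SMALL LETTER Y WITH LOOP, new in 5.1
+\&
+\& # Age names only the release that introduced the code point.
+\& print "Age=5.1\en" if $y_loop =~ /\ep{Age=5.1}/;
+\& print "not Age=5.2\en" if $y_loop !~ /\ep{Age=5.2}/;
+\&
+\& # Present_In is true for that release and every later one.
+\& print "Present_In=5.2\en" if $y_loop =~ /\ep{Present_In=5.2}/;
+\& print "not Present_In=5.0\en" if $y_loop !~ /\ep{Present_In=5.0}/;
+.Ve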
+.ie n .SS "Comparison of ""\eN{...}"" and ""\ep{name=...}"""
+.el .SS "Comparison of \f(CW\eN{...}\fP and \f(CW\ep{name=...}\fP"
+.IX Subsection "Comparison of N{...} and p{name=...}"
+Starting in Perl 5.32, you can specify a character by its name in
+regular expression patterns using \f(CW\*(C`\ep{name=...}\*(C'\fR. This is in addition
+to the longstanding method of using \f(CW\*(C`\eN{...}\*(C'\fR. The following
+summarizes the differences between these two:
+.PP
+.Vb 6
+\&                       \eN{...}       \ep{Name=...}
+\&  can interpolate   only with eval       yes         [1]
+\&  custom names           yes             no          [2]
+\&  name aliases           yes             yes         [3]
+\&  named sequences        yes             yes         [4]
+\&  name value parsing    exact       Unicode loose    [5]
+.Ve
+.IP [1] 4
+.IX Item "[1]"
+The ability to interpolate means you can do something like
+.Sp
+.Vb 1
+\& qr/\ep{na=latin capital letter $which}/
+.Ve
+.Sp
+and specify \f(CW$which\fR elsewhere.
+.IP [2] 4
+.IX Item "[2]"
+You can create your own names for characters, and override official
+ones when using \f(CW\*(C`\eN{...}\*(C'\fR. See "CUSTOM ALIASES" in charnames.
+.IP [3] 4
+.IX Item "[3]"
+Some characters have multiple names (synonyms).
+.IP [4] 4
+.IX Item "[4]"
+Some particular sequences of characters are given a single name, in
+addition to their individual ones.
+.IP [5] 4
+.IX Item "[5]"
+Exact name value matching means you have to specify case, hyphens,
+underscores, and spaces precisely in the name you want. Loose matching
+follows the Unicode rules
+<https://www.unicode.org/reports/tr44/tr44\-24.html#UAX44\-LM2>,
+where these are mostly irrelevant. Except for a few outlier character
+names, these are the same rules as are already used for any other
+\&\f(CW\*(C`\ep{...}\*(C'\fR property.
+.SS "Wildcards in Property Values"
+.IX Subsection "Wildcards in Property Values"
+Starting in Perl 5.30, it is possible to do something like this:
+.PP
+.Vb 1
+\& qr!\ep{numeric_value=/\eA[0\-5]\ez/}!
+.Ve
+.PP
+or, by abbreviating and adding \f(CW\*(C`/x\*(C'\fR,
+.PP
+.Vb 1
+\& qr! \ep{nv= /(?x) \eA [0\-5] \ez / }!
+.Ve
+.PP
+This matches all code points whose numeric value is one of 0, 1, 2, 3,
+4, or 5. This particular example could instead have been written as
+.PP
+.Vb 1
+\& qr! \eA [ \ep{nv=0}\ep{nv=1}\ep{nv=2}\ep{nv=3}\ep{nv=4}\ep{nv=5} ] \ez !xx
+.Ve
+.PP
+in earlier perls, so in this case this feature just makes things easier
+and shorter to write. If we hadn't included the \f(CW\*(C`\eA\*(C'\fR and \f(CW\*(C`\ez\*(C'\fR, these
+would have matched things like \f(CW\*(C`1/2\*(C'\fR because that contains a 1 (as
+well as a 2). As written, it matches things like subscripts that have
+these numeric values. If we only wanted the decimal digits with those
+numeric values, we could say,
+.PP
+.Vb 1
+\& qr! (?[ \ed & \ep{nv=/[0\-5]/} ]) !x
+.Ve
+.PP
+The \f(CW\*(C`\ed\*(C'\fR gets rid of needing to anchor the pattern, since it forces the
+result to only match \f(CW\*(C`[0\-9]\*(C'\fR, and the \f(CW\*(C`[0\-5]\*(C'\fR further restricts it.
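+.PP
+The anchored form can be tried out with a small script such as the
+following sketch. It assumes Perl 5.30 or later (earlier perls do not
+know the warnings category), and the variable name \f(CW$zero_to_five\fR is
+ours:
+.PP
+.Vb 12
+\& use strict;
+\& use warnings;
+\& no warnings \*(Aqexperimental::uniprop_wildcards\*(Aq;
+\&
+\& my $zero_to_five = qr!\ep{nv=/\eA[0\-5]\ez/}!;
+\&
+\& # Any code point whose numeric value is 0..5 qualifies, not just digits.
+\& print "digit 3\en" if "3" =~ $zero_to_five;
+\& print "superscript two\en" if "\eN{SUPERSCRIPT TWO}" =~ $zero_to_five;
+\& print "not digit 7\en" if "7" !~ $zero_to_five;
+\& # The anchors keep the value 1/2 from matching on its "1" or "2".
+\& print "not one half\en" if "\eN{VULGAR FRACTION ONE HALF}" !~ $zero_to_five;
+.Ve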
+.PP
+The text in the above examples enclosed between the \f(CW"/"\fR
+characters can be just about any regular expression. It is independent
+of the main pattern, so doesn't share any capturing groups, \fIetc\fR. The
+delimiters for it must be ASCII punctuation, but it may NOT be
+delimited by \f(CW"{"\fR, nor \f(CW"}"\fR nor contain a literal \f(CW"}"\fR, as that
+delimits the end of the enclosing \f(CW\*(C`\ep{}\*(C'\fR. Like any pattern, certain
+other delimiters are terminated by their mirror images. These are
+\&\f(CW"("\fR, \f(CW"["\fR, and \f(CW"<"\fR. If the delimiter is any of \f(CW"\-"\fR,
+\&\f(CW"_"\fR, \f(CW"+"\fR, or \f(CW"\e"\fR, or is the same delimiter as is used for the
+enclosing pattern, it must be preceded by a backslash escape, both
+fore and aft.
+.PP
+Beware of using \f(CW"$"\fR to indicate to match the end of the string. It
+can too easily be interpreted as being a punctuation variable, like
+\&\f(CW$/\fR.
+.PP
+No modifiers may follow the final delimiter. Instead, use
+"(?adlupimnsx\-imnsx)" in perlre and/or
+"(?adluimnsx\-imnsx:pattern)" in perlre to specify modifiers.
+However, certain modifiers are illegal in your wildcard subpattern.
+The only character set modifier specifiable is \f(CW\*(C`/aa\*(C'\fR;
+any other character set, and \f(CW\*(C`\-m\*(C'\fR, and \f(CW\*(C`p\*(C'\fR, and \f(CW\*(C`s\*(C'\fR are all illegal.
+Specifying modifiers like \f(CW\*(C`qr/.../gc\*(C'\fR that aren't legal in the
+\&\f(CW\*(C`(?...)\*(C'\fR notation normally raises a warning, but with wildcard
+subpatterns, their use is an error. The \f(CW\*(C`m\*(C'\fR modifier is ineffective;
+everything that matches will be a single line.
+.PP
+By default, your pattern is matched case-insensitively, as if \f(CW\*(C`/i\*(C'\fR had
+been specified. You can change this by saying \f(CW\*(C`(?\-i)\*(C'\fR in your pattern.
+.PP
+There are also certain operations that are illegal. You can't nest
+\&\f(CW\*(C`\ep{...}\*(C'\fR and \f(CW\*(C`\eP{...}\*(C'\fR calls within a wildcard subpattern, and \f(CW\*(C`\eG\*(C'\fR
+doesn't make sense, so is also prohibited.
+.PP
+And the \f(CW\*(C`*\*(C'\fR quantifier (or its equivalent \f(CW\*(C`{0,}\*(C'\fR) is illegal.
+.PP
+This feature is not available when the left-hand side is prefixed by
+\&\f(CW\*(C`Is_\*(C'\fR, nor for any form that is marked as "Discouraged" in
+"Discouraged" in perluniprops.
+.PP
+This experimental feature has been added to begin to implement
+<https://www.unicode.org/reports/tr18/#Wildcard_Properties>. Using it
+will raise a (default-on) warning in the
+\&\f(CW\*(C`experimental::uniprop_wildcards\*(C'\fR category. We reserve the right to
+change its operation as we gain experience.
+.PP
+Your subpattern can be just about anything, but for it to have some
+utility, it should match when called with either or both of
+a) the full name of the property value with underscores (and/or spaces
+in the Block property) and some things uppercase; or b) the property
+value in all lowercase with spaces and underscores squeezed out. For
+example,
+.PP
+.Vb 2
+\& qr!\ep{Blk=/Old I.*/}!
+\& qr!\ep{Blk=/oldi.*/}!
+.Ve
+.PP
+would match the same things.
+.PP
+Another example that shows that within \f(CW\*(C`\ep{...}\*(C'\fR, \f(CW\*(C`/x\*(C'\fR isn't needed to
+have spaces:
+.PP
+.Vb 1
+\& qr!\ep{scx= /Hebrew|Greek/ }!
+.Ve
+.PP
+To be safe, we should have anchored the above example, to prevent
+matches for something like \f(CW\*(C`Hebrew_Braille\*(C'\fR, but there aren't
+any script names like that, so far.
+A warning is issued if none of the legal values for a property are
+matched by your pattern. It's likely that a future release will raise a
+warning if your pattern ends up causing every possible code point to
+match.
+.PP
+Starting in 5.32, the Name, Name Aliases, and Named Sequences properties
+are allowed to be matched. They are considered to be a single
+combination property, just as has long been the case for \f(CW\*(C`\eN{}\*(C'\fR. Loose
+matching doesn't work in exactly the same way for these as it does for
+the values of other properties. The rules are given in
+<https://www.unicode.org/reports/tr44/tr44\-24.html#UAX44\-LM2>. As a
+result, Perl doesn't try loose matching for you, like it does in other
+properties. All letters in names are uppercase, but you can add \f(CW\*(C`(?i)\*(C'\fR
+to your subpattern to ignore case. If you're uncertain where a blank
+is, you can use \f(CW\*(C` ?\*(C'\fR in your subpattern. No character name contains an
+underscore, so don't bother trying to match one. The use of hyphens is
+particularly problematic; refer to the above link. But note that, as of
+Unicode 13.0, the only script in modern usage which has weirdnesses with
+these is Tibetan; also the two Korean characters U+116C HANGUL JUNGSEONG
+OE and U+1180 HANGUL JUNGSEONG O\-E. Unicode makes no promises to not
+add hyphen-problematic names in the future.
+.PP
+Using wildcards on these is resource intensive, given the hundreds of
+thousands of legal names that must be checked against.
+.PP
+An example of using Name property wildcards is
+.PP
+.Vb 1
+\& qr!\ep{name=/(SMILING|GRINNING) FACE/}!
+.Ve
+.PP
+Another is
+.PP
+.Vb 1
+\& qr/(?[ \ep{name=\e/CJK\e/} \- \ep{ideographic} ])/
+.Ve
+.PP
+which is the 200\-ish (as of Unicode 13.0) CJK characters that aren't
+ideographs.
+.PP
+There are certain properties that wildcard subpatterns don't currently
+work with. These are:
+.PP
+.Vb 9
+\& Bidi Mirroring Glyph
+\& Bidi Paired Bracket
+\& Case Folding
+\& Decomposition Mapping
+\& Equivalent Unified Ideograph
+\& Lowercase Mapping
+\& NFKC Case Fold
+\& Titlecase Mapping
+\& Uppercase Mapping
+.Ve
+.PP
+Nor is the \f(CW\*(C`@\fR\f(CIunicode_property\fR\f(CW@\*(C'\fR form implemented.
+.PP
+Here's a complete example of matching IPv4 internet protocol addresses
+in any (single) script:
+.PP
+.Vb 1
+\& no warnings \*(Aqexperimental::uniprop_wildcards\*(Aq;
+\&
+\& # Can match a substring, so this intermediate regex needs to have
+\& # context or anchoring in its final use. Using nt=de yields decimal
+\& # digits. When specifying a subset of these, we must include \ed to
+\& # prevent things like U+00B2 SUPERSCRIPT TWO from matching
+\& my $zero_through_255 =
+\& qr/ \eb (*sr: # All from same script
+\& (?[ \ep{nv=0} & \ed ])* # Optional leading zeros
+\& ( # Then one of:
+\& \ed{1,2} # 0 \- 99
+\& | (?[ \ep{nv=1} & \ed ]) \ed{2} # 100 \- 199
+\& | (?[ \ep{nv=2} & \ed ])
+\& ( (?[ \ep{nv=:[0\-4]:} & \ed ]) \ed # 200 \- 249
+\& | (?[ \ep{nv=5} & \ed ])
+\& (?[ \ep{nv=:[0\-5]:} & \ed ]) # 250 \- 255
+\& )
+\& )
+\& )
+\& \eb
+\& /x;
+\&
+\& my $ipv4 = qr/ \eA (*sr: $zero_through_255
+\& (?: [.] $zero_through_255 ) {3}
+\& )
+\& \ez
+\& /x;
+.Ve
+.SS "User-Defined Character Properties"
+.IX Subsection "User-Defined Character Properties"
+You can define your own binary character properties by defining subroutines
+whose names begin with \f(CW"In"\fR or \f(CW"Is"\fR. (The regex sets feature
+"(?[ ])" in perlre provides an alternative which allows more complex
+definitions.) The subroutines can be defined in any
+package. They override any Unicode properties expressed as the same
+names. The user-defined properties can be used in the regular
+expression
+\&\f(CW\*(C`\ep{}\*(C'\fR and \f(CW\*(C`\eP{}\*(C'\fR constructs; if you are using a user-defined property from a
+package other than the one you are in, you must specify its package in the
+\&\f(CW\*(C`\ep{}\*(C'\fR or \f(CW\*(C`\eP{}\*(C'\fR construct.
+.PP
+.Vb 3
+\& # assuming property IsForeign defined in Lang::
+\& package main; # property package name required
+\& if ($txt =~ /\ep{Lang::IsForeign}+/) { ... }
+\&
+\& package Lang; # property package name not required
+\& if ($txt =~ /\ep{IsForeign}+/) { ... }
+.Ve
+.PP
+The subroutines are passed a single parameter, which is 0 if
+case-sensitive matching is in effect and non-zero if caseless matching
+is in effect. The subroutine may return different values depending on
+the value of the flag. But the subroutine is never called more than
+once for each flag value (zero vs non-zero). The return value is saved
+and used instead of calling the sub ever again. If the sub is defined
+at the time the pattern is compiled, it will be called then; if not, it
+will be called the first time its value (for that flag) is needed during
+execution.
+.PP
+Note that if the regular expression is tainted, then Perl will die rather
+than calling the subroutine when the name of the subroutine is
+determined by the tainted data.
+.PP
+The subroutines must return a specially-formatted string, with one
+or more newline-separated lines. Each line must be one of the following:
+.IP \(bu 4
+A single hexadecimal number denoting a code point to include.
+.IP \(bu 4
+Two hexadecimal numbers separated by horizontal whitespace (space or
+tab characters) denoting a range of code points to include. The
+second number must not be smaller than the first.
+.IP \(bu 4
+Something to include, prefixed by \f(CW"+"\fR: a built-in character
+property (prefixed by \f(CW"utf8::"\fR) or a fully qualified (including package
+name) user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
+.IP \(bu 4
+Something to exclude, prefixed by \f(CW"\-"\fR: an existing character
+property (prefixed by \f(CW"utf8::"\fR) or a fully qualified (including package
+name) user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
+.IP \(bu 4
+Something to negate, prefixed by \f(CW"!"\fR: an existing character
+property (prefixed by \f(CW"utf8::"\fR) or a fully qualified (including package
+name) user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
+.IP \(bu 4
+Something to intersect with, prefixed by \f(CW"&"\fR: an existing character
+property (prefixed by \f(CW"utf8::"\fR) or a fully qualified (including package
+name) user-defined character property,
+to keep only the characters that are also in that property; two
+hexadecimal code points for a range; or a single hexadecimal code point.
+.PP
+For example, to define a property that covers both the Japanese
+syllabaries (hiragana and katakana), you can define
+.PP
+.Vb 6
+\& sub InKana {
+\& return <<END;
+\& 3040\et309F
+\& 30A0\et30FF
+\& END
+\& }
+.Ve
+.PP
+Imagine that the here-doc end marker is at the beginning of the line.
+Now you can use \f(CW\*(C`\ep{InKana}\*(C'\fR and \f(CW\*(C`\eP{InKana}\*(C'\fR.
+.PP
+You could also have used the existing block property names:
+.PP
+.Vb 6
+\& sub InKana {
+\& return <<\*(AqEND\*(Aq;
+\& +utf8::InHiragana
+\& +utf8::InKatakana
+\& END
+\& }
+.Ve
+.PP
+Suppose you wanted to match only the allocated characters,
+not the raw block ranges: in other words, you want to remove
+the unassigned characters:
+.PP
+.Vb 7
+\& sub InKana {
+\& return <<\*(AqEND\*(Aq;
+\& +utf8::InHiragana
+\& +utf8::InKatakana
+\& \-utf8::IsCn
+\& END
+\& }
+.Ve
+.PP
+The negation is useful for defining (surprise!) negated classes.
+.PP
+.Vb 7
+\& sub InNotKana {
+\& return <<\*(AqEND\*(Aq;
+\& !utf8::InHiragana
+\& \-utf8::InKatakana
+\& +utf8::IsCn
+\& END
+\& }
+.Ve
+.PP
+This will match all non-Unicode code points, since every one of them is
+not in Kana. You can use intersection to exclude these, if desired, as
+this modified example shows:
+.PP
+.Vb 8
+\& sub InNotKana {
+\& return <<\*(AqEND\*(Aq;
+\& !utf8::InHiragana
+\& \-utf8::InKatakana
+\& +utf8::IsCn
+\& &utf8::Any
+\& END
+\& }
+.Ve
+.PP
+\&\f(CW&utf8::Any\fR must be the last line in the definition.
+.PP
+Intersection is used generally for getting the common characters matched
+by two (or more) classes. It's important to remember not to use \f(CW"&"\fR for
+the first set; that would be intersecting with nothing, resulting in an
+empty set. (Similarly, using \f(CW"\-"\fR for the first set does nothing.)
+.PP
+Unlike non-user-defined \f(CW\*(C`\ep{}\*(C'\fR property matches, no warning is ever
+generated if these properties are matched against a non-Unicode code
+point (see "Beyond Unicode code points" below).
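+.PP
+Putting the pieces together, the following is a minimal runnable sketch
+rather than a recommended implementation. It returns plain strings
+instead of here-docs so that no end marker needs to sit at the
+beginning of a line, and the property name \f(CW\*(C`InAssignedKana\*(C'\fR is ours,
+invented for illustration:
+.PP
+.Vb 14
+\& use strict;
+\& use warnings;
+\&
+\& # Raw block ranges, as tab\-separated hexadecimal code points.
+\& sub InKana { return "3040\et309F\en30A0\et30FF\en" }
+\&
+\& # The same set minus the unassigned (Cn) code points.
+\& sub InAssignedKana { return "+main::InKana\en\-utf8::IsCn\en" }
+\&
+\& print "hiragana A\en" if "\ex{3042}" =~ /\ep{InKana}/;
+\& print "not Latin A\en" if "A" =~ /\eP{InKana}/;
+\& # U+3040 lies inside the Hiragana block but is unassigned.
+\& print "U+3040 in block\en" if "\ex{3040}" =~ /\ep{InKana}/;
+\& print "U+3040 unassigned\en" if "\ex{3040}" !~ /\ep{InAssignedKana}/;
+.Ve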
+.SS "User-Defined Case Mappings (for serious hackers only)"
+.IX Subsection "User-Defined Case Mappings (for serious hackers only)"
+\&\fBThis feature has been removed as of Perl 5.16.\fR
+The CPAN module \f(CW\*(C`Unicode::Casing\*(C'\fR provides better functionality without
+the drawbacks that this feature had. If you are using a Perl earlier
+than 5.16, this feature was most fully documented in the 5.14 version of
+this pod:
+<http://perldoc.perl.org/5.14.0/perlunicode.html#User\-Defined\-Case\-Mappings\-%28for\-serious\-hackers\-only%29>
+.SS "Character Encodings for Input and Output"
+.IX Subsection "Character Encodings for Input and Output"
+See Encode.
+.SS "Unicode Regular Expression Support Level"
+.IX Subsection "Unicode Regular Expression Support Level"
+The following list of Unicode supported features for regular expressions describes
+all features currently directly supported by core Perl. The references
+to "Level \fIN\fR" and the section numbers refer to
+UTS#18 "Unicode Regular Expressions" <https://www.unicode.org/reports/tr18>,
+version 18, October 2016.
+.PP
+\fILevel 1 \- Basic Unicode Support\fR
+.IX Subsection "Level 1 - Basic Unicode Support"
+.PP
+.Vb 8
+\& RL1.1   Hex Notation                     \- Done    [1]
+\& RL1.2   Properties                       \- Done    [2]
+\& RL1.2a  Compatibility Properties         \- Done    [3]
+\& RL1.3   Subtraction and Intersection     \- Done    [4]
+\& RL1.4   Simple Word Boundaries           \- Done    [5]
+\& RL1.5   Simple Loose Matches             \- Done    [6]
+\& RL1.6   Line Boundaries                  \- Partial [7]
+\& RL1.7   Supplementary Code Points        \- Done    [8]
+.Ve
+.ie n .IP "[1] ""\eN{U+...}"" and ""\ex{...}""" 4
+.el .IP "[1] \f(CW\eN{U+...}\fR and \f(CW\ex{...}\fR" 4
+.IX Item "[1] N{U+...} and x{...}"
+.PD 0
+.ie n .IP "[2] ""\ep{...}"" ""\eP{...}"". This requirement is for a minimal list of properties. Perl supports these. See R2.7 for other properties." 4
+.el .IP "[2] \f(CW\ep{...}\fR \f(CW\eP{...}\fR. This requirement is for a minimal list of properties. Perl supports these. See R2.7 for other properties." 4
+.IX Item "[2] p{...} P{...}. This requirement is for a minimal list of properties. Perl supports these. See R2.7 for other properties."
+.IP [3] 4
+.IX Item "[3]"
+.PD
+Perl has \f(CW\*(C`\ed\*(C'\fR \f(CW\*(C`\eD\*(C'\fR \f(CW\*(C`\es\*(C'\fR \f(CW\*(C`\eS\*(C'\fR \f(CW\*(C`\ew\*(C'\fR \f(CW\*(C`\eW\*(C'\fR \f(CW\*(C`\eX\*(C'\fR \f(CW\*(C`[:\fR\f(CIprop\fR\f(CW:]\*(C'\fR
+\&\f(CW\*(C`[:^\fR\f(CIprop\fR\f(CW:]\*(C'\fR, plus all the properties specified by
+<https://www.unicode.org/reports/tr18/#Compatibility_Properties>. These
+are described above in "Other Properties".
+.IP [4] 4
+.IX Item "[4]"
+The regex sets feature \f(CW"(?[...])"\fR starting in v5.18 accomplishes
+this. See "(?[ ])" in perlre.
+.ie n .IP "[5] ""\eb"" ""\eB"" meet most, but not all, the details of this requirement, but ""\eb{wb}"" and ""\eB{wb}"" do, as well as the stricter R2.3." 4
+.el .IP "[5] \f(CW\eb\fR \f(CW\eB\fR meet most, but not all, the details of this requirement, but \f(CW\eb{wb}\fR and \f(CW\eB{wb}\fR do, as well as the stricter R2.3." 4
+.IX Item "[5] b B meet most, but not all, the details of this requirement, but b{wb} and B{wb} do, as well as the stricter R2.3."
+.PD 0
+.IP [6] 4
+.IX Item "[6]"
+.PD
+Note that Perl does Full case-folding in matching, not Simple:
+.Sp
+For example \f(CW\*(C`U+1F88\*(C'\fR is equivalent to \f(CW\*(C`U+1F00 U+03B9\*(C'\fR, instead of just
+\&\f(CW\*(C`U+1F80\*(C'\fR. This difference matters mainly for certain Greek capital
+letters with certain modifiers: the Full case-folding decomposes the
+letter, while the Simple case-folding would map it to a single
+character.
+.IP [7] 4
+.IX Item "[7]"
+The reason this is considered to be only partially implemented is that
+Perl has \f(CW\*(C`qr/\eb{lb}/\*(C'\fR and
+\&\f(CW\*(C`Unicode::LineBreak\*(C'\fR that are conformant with
+UAX#14 "Unicode Line Breaking Algorithm" <https://www.unicode.org/reports/tr14>.
+The regular expression construct provides default behavior, while the
+heavier-weight module provides customizable line breaking.
+.Sp
+But Perl treats \f(CW\*(C`\en\*(C'\fR as the start\- and end-line
+delimiter, whereas Unicode specifies more characters that should be
+so-interpreted.
+.Sp
+These are:
+.Sp
+.Vb 6
+\& VT   U+000B  (\ev in C)
+\& FF   U+000C  (\ef)
+\& CR   U+000D  (\er)
+\& NEL  U+0085
+\& LS   U+2028
+\& PS   U+2029
+.Ve
+.Sp
+\&\f(CW\*(C`^\*(C'\fR and \f(CW\*(C`$\*(C'\fR in regular expression patterns are supposed to match all
+these, but don't.
+These characters also don't, but should, affect \f(CW\*(C`<>\*(C'\fR, \f(CW$.\fR, and
+script line numbers.
+.Sp
+Also, lines should not be split within \f(CW\*(C`CRLF\*(C'\fR (i.e. there is no
+empty line between \f(CW\*(C`\er\*(C'\fR and \f(CW\*(C`\en\*(C'\fR). For \f(CW\*(C`CRLF\*(C'\fR, try the \f(CW\*(C`:crlf\*(C'\fR
+layer (see PerlIO).
+.ie n .IP "[8] UTF\-8/UTF\-EBCDIC used in Perl allows not only ""U+10000"" to ""U+10FFFF"" but also beyond ""U+10FFFF""" 4
+.el .IP "[8] UTF\-8/UTF\-EBCDIC used in Perl allows not only \f(CWU+10000\fR to \f(CWU+10FFFF\fR but also beyond \f(CWU+10FFFF\fR" 4
+.IX Item "[8] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to U+10FFFF but also beyond U+10FFFF"
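+.PP
+As a small illustration of note [6] above, the following sketch uses
+\&\f(CW\*(C`fc\*(C'\fR (available through \f(CW\*(C`use\ feature\ \*(Aqfc\*(Aq\*(C'\fR since Perl 5.16) to
+show the Full case fold of \f(CW\*(C`U+1F88\*(C'\fR, and then a caseless match against
+the folded pair. The variable names are invented for this example, and
+the comments restate what the note says rather than captured output.
+.PP
+.Vb 3
+\&    use strict;
+\&    use warnings;
+\&    use feature "fc";
+\&
+\&    my $capital = "\ex{1F88}";            # the single character from note [6]
+\&    my $full    = "\ex{1F00}\ex{03B9}";    # its Full case fold, two characters
+\&
+\&    # fc() applies Full case folding, so the result is two characters long.
+\&    my $folds_to_pair = fc($capital) eq $full ? 1 : 0;
+\&
+\&    # Caseless matching uses the same fold, so the pair matches under /i.
+\&    my $matches_pair = $capital =~ /\eA\ex{1F00}\ex{03B9}\ez/i ? 1 : 0;
+\&
+\&    print "fc gives pair: $folds_to_pair, /i matches pair: $matches_pair\en";
+.Ve
+.PP
+Under Simple case folding the same character would instead fold to the
+single character \f(CW\*(C`U+1F80\*(C'\fR, so the two\-character pattern would not
+match.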
+.PP
+\fILevel 2 \- Extended Unicode Support\fR
+.IX Subsection "Level 2 - Extended Unicode Support"
+.PP
+.Vb 10
+\& RL2.1   Canonical Equivalents           \- Retracted [9]
+\&                                           by Unicode
+\& RL2.2   Extended Grapheme Clusters and  \- Partial   [10]
+\&          Character Classes with Strings
+\& RL2.3   Default Word Boundaries         \- Done      [11]
+\& RL2.4   Default Case Conversion         \- Done
+\& RL2.5   Name Properties                 \- Done
+\& RL2.6   Wildcards in Property Values    \- Partial   [12]
+\& RL2.7   Full Properties                 \- Partial   [13]
+\& RL2.8   Optional Properties             \- Partial   [14]
+.Ve
+.IP "[9] Unicode has rewritten this portion of UTS#18 to say that getting canonical equivalence (see UAX#15 ""Unicode Normalization Forms"" <https://www.unicode.org/reports/tr15>) is basically to be done at the programmer level. Use NFD to write both your regular expressions and text to match them against (you can use Unicode::Normalize)." 4
+.IX Item "[9] Unicode has rewritten this portion of UTS#18 to say that getting canonical equivalence (see UAX#15 ""Unicode Normalization Forms"" <https://www.unicode.org/reports/tr15>) is basically to be done at the programmer level. Use NFD to write both your regular expressions and text to match them against (you can use Unicode::Normalize)."
+.PD 0
+.ie n .IP "[10] Perl has ""\eX"" and ""\eb{gcb}"". Unicode has retracted their ""Grapheme Cluster Mode"", and recently added string properties, which Perl does not yet support." 4
+.el .IP "[10] Perl has \f(CW\eX\fR and \f(CW\eb{gcb}\fR. Unicode has retracted their ""Grapheme Cluster Mode"", and recently added string properties, which Perl does not yet support." 4
+.IX Item "[10] Perl has X and b{gcb}. Unicode has retracted their ""Grapheme Cluster Mode"", and recently added string properties, which Perl does not yet support."
+.IP "[11] see UAX#29 ""Unicode Text Segmentation"" <https://www.unicode.org/reports/tr29>." 4
+.IX Item "[11] see UAX#29 ""Unicode Text Segmentation"" <https://www.unicode.org/reports/tr29>."
+.IP "[12] see ""Wildcards in Property Values"" above." 4
+.IX Item "[12] see ""Wildcards in Property Values"" above."
+.IP "[13] Perl supports all the properties in the Unicode Character Database (UCD). It does not yet support the listed properties that come from other Unicode sources." 4
+.IX Item "[13] Perl supports all the properties in the Unicode Character Database (UCD). It does not yet support the listed properties that come from other Unicode sources."
+.IP "[14] The only optional property that Perl supports is Named Sequence. None of these properties are in the UCD." 4
+.IX Item "[14] The only optional property that Perl supports is Named Sequence. None of these properties are in the UCD."
+.PD
+.PP
+\fILevel 3 \- Tailored Support\fR
+.IX Subsection "Level 3 - Tailored Support"
+.PP
+This has been retracted by Unicode.
+.SS "Unicode Encodings"
+.IX Subsection "Unicode Encodings"
+Unicode characters are assigned to \fIcode points\fR, which are abstract
+numbers. To use these numbers, various encodings are needed.
+.IP \(bu 4
+UTF\-8
+.Sp
+UTF\-8 is a variable-length (1 to 4 bytes), byte-order independent
+encoding. In most of Perl's documentation, including elsewhere in this
+document, the term "UTF\-8" means also "UTF-EBCDIC". But in this section,
+"UTF\-8" refers only to the encoding used on ASCII platforms. It is a
+superset of 7\-bit US-ASCII, so anything encoded in ASCII has the
+identical representation when encoded in UTF\-8.
+.Sp
+The following table is from Unicode 3.2.
+.Sp
+.Vb 1
+\&    Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
+\&
+\&      U+0000..U+007F       00..7F
+\&      U+0080..U+07FF     * C2..DF    80..BF
+\&      U+0800..U+0FFF       E0      * A0..BF    80..BF
+\&      U+1000..U+CFFF       E1..EC    80..BF    80..BF
+\&      U+D000..U+D7FF       ED        80..9F    80..BF
+\&      U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
+\&      U+E000..U+FFFF       EE..EF    80..BF    80..BF
+\&     U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
+\&     U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
+\&    U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
+.Ve
+.Sp
+Note the gaps marked by "*" before several of the byte entries above. These are
+caused by legal UTF\-8 avoiding non-shortest encodings: it is technically
+possible to UTF\-8\-encode a single code point in different ways, but that is
+explicitly forbidden, and the shortest possible encoding should always be used
+(and that is what Perl does).
+.Sp
+Another way to look at it is via bits:
+.Sp
+.Vb 1
+\&                 Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte
+\&
+\&                    0aaaaaaa  0aaaaaaa
+\&            00000bbbbbaaaaaa  110bbbbb  10aaaaaa
+\&            ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
+\&  00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
+.Ve
+.Sp
+As you can see, the continuation bytes all begin with \f(CW"10"\fR, and the
+leading bits of the start byte tell how many bytes there are in the
+encoded character.
+.Sp
+The original UTF\-8 specification allowed up to 6 bytes, to allow
+encoding of numbers up to \f(CW\*(C`0x7FFF_FFFF\*(C'\fR. Perl continues to allow those,
+and has extended that up to 13 bytes to encode code points up to what
+can fit in a 64\-bit word. However, Perl will warn if you output any of
+these as being non-portable; and under strict UTF\-8 input protocols,
+they are forbidden. In addition, it is now illegal to use a code point
+larger than what a signed integer variable on your system can hold. On
+32\-bit ASCII systems, this means \f(CW\*(C`0x7FFF_FFFF\*(C'\fR is the legal maximum
+(much higher on 64\-bit systems).
+.IP \(bu 4
+UTF-EBCDIC
+.Sp
+Like UTF\-8, but EBCDIC-safe, in the way that UTF\-8 is ASCII-safe.
+This means that all the basic characters (which include all
+those that have ASCII equivalents, like \f(CW"A"\fR, \f(CW"0"\fR, \f(CW"%"\fR, \fIetc.\fR)
+are the same in both EBCDIC and UTF-EBCDIC.
+.Sp
+UTF-EBCDIC is used on EBCDIC platforms. It generally requires more
+bytes to represent a given code point than UTF\-8 does; the largest
+Unicode code points take 5 bytes to represent (instead of 4 in UTF\-8),
+and, extended for 64\-bit words, it uses 14 bytes instead of 13 bytes in
+UTF\-8.
+.IP \(bu 4
+UTF\-16, UTF\-16BE, UTF\-16LE, Surrogates, and \f(CW\*(C`BOM\*(C'\fR's (Byte Order Marks)
+.Sp
+The following items are mostly for reference and general Unicode
+knowledge; Perl doesn't use these constructs internally.
+.Sp
+Like UTF\-8, UTF\-16 is a variable-width encoding, but where
+UTF\-8 uses 8\-bit code units, UTF\-16 uses 16\-bit code units.
+All code points occupy either 2 or 4 bytes in UTF\-16: code points
+\&\f(CW\*(C`U+0000..U+FFFF\*(C'\fR are stored in a single 16\-bit unit, and code
+points \f(CW\*(C`U+10000..U+10FFFF\*(C'\fR in two 16\-bit units. The latter case is
+using \fIsurrogates\fR, the first 16\-bit unit being the \fIhigh
+surrogate\fR, and the second being the \fIlow surrogate\fR.
+.Sp
+Surrogates are code points set aside to encode the \f(CW\*(C`U+10000..U+10FFFF\*(C'\fR
+range of Unicode code points in pairs of 16\-bit units. The \fIhigh
+surrogates\fR are the range \f(CW\*(C`U+D800..U+DBFF\*(C'\fR and the \fIlow surrogates\fR
+are the range \f(CW\*(C`U+DC00..U+DFFF\*(C'\fR. The surrogate encoding is
+.Sp
+.Vb 2
+\&    $hi = int(($uni \- 0x10000) / 0x400) + 0xD800;
+\&    $lo = ($uni \- 0x10000) % 0x400 + 0xDC00;
+.Ve
+.Sp
+and the decoding is
+.Sp
+.Vb 1
+\& $uni = 0x10000 + ($hi \- 0xD800) * 0x400 + ($lo \- 0xDC00);
+.Ve
+.Sp
+Because of the 16\-bitness, UTF\-16 is byte-order dependent. UTF\-16
+itself can be used for in-memory computations, but if storage or
+transfer is required either UTF\-16BE (big-endian) or UTF\-16LE
+(little-endian) encodings must be chosen.
+.Sp
+This introduces another problem: what if you just know that your data
+is UTF\-16, but you don't know which endianness? Byte Order Marks, or
+\&\f(CW\*(C`BOM\*(C'\fR's, are a solution to this. A special character has been reserved
+in Unicode to function as a byte order marker: the character with the
+code point \f(CW\*(C`U+FEFF\*(C'\fR is the \f(CW\*(C`BOM\*(C'\fR.
+.Sp
+The trick is that if you read a \f(CW\*(C`BOM\*(C'\fR, you will know the byte order,
+since if it was written on a big-endian platform, you will read the
+bytes \f(CW\*(C`0xFE 0xFF\*(C'\fR, but if it was written on a little-endian platform,
+you will read the bytes \f(CW\*(C`0xFF 0xFE\*(C'\fR. (And if the originating platform
+was writing in ASCII platform UTF\-8, you will read the bytes
+\&\f(CW\*(C`0xEF 0xBB 0xBF\*(C'\fR.)
+.Sp
+The way this trick works is that the character with the code point
+\&\f(CW\*(C`U+FFFE\*(C'\fR is not supposed to be in input streams, so the
+sequence of bytes \f(CW\*(C`0xFF 0xFE\*(C'\fR is unambiguously "\f(CW\*(C`BOM\*(C'\fR, represented in
+little-endian format" and cannot be "\f(CW\*(C`U+FFFE\*(C'\fR, represented in big-endian
+format".
+.Sp
+Surrogates have no meaning in Unicode outside their use in pairs to
+represent other code points. However, Perl allows them to be
+represented individually internally, for example by saying
+\&\f(CWchr(0xD801)\fR, so that all code points, not just those valid for open
+interchange, are
+representable. Unicode does define semantics for them, such as their
+\&\f(CW"General_Category"\fR is \f(CW"Cs"\fR. But because their use is somewhat dangerous,
+Perl will warn (using the warning category \f(CW"surrogate"\fR, which is a
+sub-category of \f(CW"utf8"\fR) if an attempt is made
+to do things like take the lower case of one, or match
+case-insensitively, or to output them. (But don't try this on Perls
+before 5.14.)
+.IP \(bu 4
+UTF\-32, UTF\-32BE, UTF\-32LE
+.Sp
+The UTF\-32 family is pretty much like the UTF\-16 family, except that
+the units are 32\-bit, and therefore the surrogate scheme is not
+needed. UTF\-32 is a fixed-width encoding. The \f(CW\*(C`BOM\*(C'\fR signatures are
+\&\f(CW\*(C`0x00 0x00 0xFE 0xFF\*(C'\fR for BE and \f(CW\*(C`0xFF 0xFE 0x00 0x00\*(C'\fR for LE.
+.IP \(bu 4
+UCS\-2, UCS\-4
+.Sp
+Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS\-2 is a 16\-bit
+encoding. Unlike UTF\-16, UCS\-2 is not extensible beyond \f(CW\*(C`U+FFFF\*(C'\fR,
+because it does not use surrogates. UCS\-4 is a 32\-bit encoding,
+functionally identical to UTF\-32 (the difference being that
+UCS\-4 forbids neither surrogates nor code points larger than \f(CW\*(C`0x10_FFFF\*(C'\fR).
+.IP \(bu 4
+UTF\-7
+.Sp
+A seven-bit safe (non-eight-bit) encoding, which is useful if the
+transport or storage is not eight-bit safe. Defined by RFC 2152.
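+.PP
+To make the UTF\-8 tables and the surrogate formulas above concrete, here
+is a minimal sketch in core Perl. The helper names \f(CW\*(C`to_surrogates\*(C'\fR,
+\&\f(CW\*(C`from_surrogates\*(C'\fR and \f(CW\*(C`utf8_hex\*(C'\fR are made up for this example;
+\&\f(CWutf8::encode()\fR is the core function that converts a character string
+to its UTF\-8 bytes in place. Note the \f(CWint()\fR around the division,
+since Perl's \f(CW\*(C`/\*(C'\fR operator is not an integer division by default.
+.PP
+.Vb 2
+\&    use strict;
+\&    use warnings;
+\&
+\&    # Split a code point in U+10000..U+10FFFF into (high, low) surrogates.
+\&    sub to_surrogates {
+\&        my ($uni) = @_;
+\&        my $hi = int(($uni \- 0x10000) / 0x400) + 0xD800;
+\&        my $lo = ($uni \- 0x10000) % 0x400 + 0xDC00;
+\&        return ($hi, $lo);
+\&    }
+\&
+\&    # Recombine a surrogate pair into the code point it encodes.
+\&    sub from_surrogates {
+\&        my ($hi, $lo) = @_;
+\&        return 0x10000 + ($hi \- 0xD800) * 0x400 + ($lo \- 0xDC00);
+\&    }
+\&
+\&    # Return the UTF\-8 bytes of a code point as space\-separated hex.
+\&    sub utf8_hex {
+\&        my ($uni) = @_;
+\&        my $bytes = chr($uni);
+\&        utf8::encode($bytes);
+\&        return join " ", map { sprintf "%02X", ord } split //, $bytes;
+\&    }
+\&
+\&    my ($hi, $lo) = to_surrogates(0x1F600);
+\&    printf "UTF\-16: %04X %04X\en", $hi, $lo;               # D83D DE00
+\&    printf "UTF\-8:  %s\en", utf8_hex(0x1F600);             # F0 9F 98 80
+\&    printf "back:   U+%X\en", from_surrogates($hi, $lo);   # U+1F600
+.Ve
+.PP
+The four UTF\-8 bytes follow the last row of the bit table: the lead byte
+starts with \f(CW"11110"\fR and each continuation byte starts with \f(CW"10"\fR.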
+.SS "Noncharacter code points"
+.IX Subsection "Noncharacter code points"
+66 code points are set aside in Unicode as "noncharacter code points".
+These all have the \f(CW\*(C`Unassigned\*(C'\fR (\f(CW\*(C`Cn\*(C'\fR) \f(CW"General_Category"\fR, and
+no character will ever be assigned to any of them. They are the 32 code
+points between \f(CW\*(C`U+FDD0\*(C'\fR and \f(CW\*(C`U+FDEF\*(C'\fR inclusive, and the 34 code
+points:
+.PP
+.Vb 7
+\& U+FFFE U+FFFF
+\& U+1FFFE U+1FFFF
+\& U+2FFFE U+2FFFF
+\& ...
+\& U+EFFFE U+EFFFF
+\& U+FFFFE U+FFFFF
+\& U+10FFFE U+10FFFF
+.Ve
+.PP
+Until Unicode 7.0, the noncharacters were "\fBforbidden\fR for use in open
+interchange of Unicode text data", so that code that processed those
+streams could use these code points as sentinels that could be mixed in
+with character data, and would always be distinguishable from that data.
+(The emphasis above and in the next paragraph has been added in this document.)
+.PP
+Unicode 7.0 changed the wording so that they are "\fBnot recommended\fR for
+use in open interchange of Unicode text data". The 7.0 Standard goes on
+to say:
+.Sp
+.RS 4
+"If a noncharacter is received in open interchange, an application is
+not required to interpret it in any way. It is good practice, however,
+to recognize it as a noncharacter and to take appropriate action, such
+as replacing it with \f(CW\*(C`U+FFFD\*(C'\fR replacement character, to indicate the
+problem in the text. It is not recommended to simply delete
+noncharacter code points from such text, because of the potential
+security issues caused by deleting uninterpreted characters. (See
+conformance clause C7 in Section 3.2, Conformance Requirements, and
+Unicode Technical Report #36, "Unicode Security
+Considerations" <https://www.unicode.org/reports/tr36/#Substituting_for_Ill_Formed_Subsequences>)."
+.RE
+.PP
+This change was made because it was found that various commercial tools
+like editors, or for things like source code control, had been written
+so that they would not handle program files that used these code points,
+effectively precluding their use almost entirely! And that was never
+the intent. They've always been meant to be usable within an
+application, or cooperating set of applications, at will.
+.PP
+If you're writing code, such as an editor, that is supposed to be able
+to handle any Unicode text data, then you shouldn't be using these code
+points yourself, and should instead allow them in the input. If you need
+sentinels, they should instead be something that isn't legal Unicode.
+For UTF\-8 data, you can use the bytes 0xC0 and 0xC1 as sentinels, as
+they never appear in well-formed UTF\-8. (There are equivalents for
+UTF-EBCDIC). You can also store your Unicode code points in integer
+variables and use negative values as sentinels.
+.PP
+If you're not writing such a tool, then whether you accept noncharacters
+as input is up to you (though the Standard recommends that you not). If
+you do strict input stream checking with Perl, these code points
+continue to be forbidden. This is to maintain backward compatibility
+(otherwise potential security holes could open up, as an unsuspecting
+application that was written assuming the noncharacters would be
+filtered out before getting to it, could now, without warning, start
+getting them). To do strict checking, you can use the layer
+\&\f(CW:encoding(\*(AqUTF\-8\*(Aq)\fR.
+.PP
+Perl continues to warn (using the warning category \f(CW"nonchar"\fR, which
+is a sub-category of \f(CW"utf8"\fR) if an attempt is made to output
+noncharacters.
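+.PP
+The Standard's advice quoted above, to replace noncharacters rather than
+delete them, could be followed with something like the sketch below. It
+relies on the \f(CW\*(C`\ep{Noncharacter_Code_Point}\*(C'\fR property (see
+perluniprops); the sample string and the variable names are invented
+for illustration.
+.PP
+.Vb 2
+\&    use strict;
+\&    use warnings;
+\&
+\&    # Sample input containing two noncharacters, U+FDD0 and U+FFFE.
+\&    my $text = "ab\ex{FDD0}cd\ex{FFFE}";
+\&
+\&    # Replace each noncharacter with U+FFFD REPLACEMENT CHARACTER.
+\&    (my $clean = $text) =~ s/\ep{Noncharacter_Code_Point}/\ex{FFFD}/g;
+\&
+\&    # Show the code points in hex, so no unusual character is output.
+\&    print join(" ", map { sprintf "U+%04X", ord } split //, $clean), "\en";
+\&    # prints: U+0061 U+0062 U+FFFD U+0063 U+0064 U+FFFD
+.Ve
+.PP
+The string keeps its length, which is the point of substituting instead
+of deleting.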
+.SS "Beyond Unicode code points"
+.IX Subsection "Beyond Unicode code points"
+The maximum Unicode code point is \f(CW\*(C`U+10FFFF\*(C'\fR, and Unicode only defines
+operations on code points up through that. But Perl works on code
+points up to the maximum permissible signed number available on the
+platform. However, Perl will not accept these from input streams unless
+lax rules are being used, and will warn (using the warning category
+\&\f(CW"non_unicode"\fR, which is a sub-category of \f(CW"utf8"\fR) if any are output.
+.PP
+Since Unicode rules are not defined on these code points, if a
+Unicode-defined operation is done on them, Perl uses what we believe are
+sensible rules, while generally warning, using the \f(CW"non_unicode"\fR
+category. For example, \f(CWuc("\ex{11_0000}")\fR will generate such a
+warning, returning the input parameter as its result, since Perl defines
+the uppercase of every non-Unicode code point to be the code point
+itself. (All the case changing operations, not just uppercasing, work
+this way.)
+.PP
+The situation with matching Unicode properties in regular expressions,
+the \f(CW\*(C`\ep{}\*(C'\fR and \f(CW\*(C`\eP{}\*(C'\fR constructs, against these code points is not as
+clear cut, and how these are handled has changed as we've gained
+experience.
+.PP
+One possibility is to treat any match against these code points as
+undefined. But since Perl doesn't have the concept of a match being
+undefined, it converts this to failing or \f(CW\*(C`FALSE\*(C'\fR. This is almost, but
+not quite, what Perl did from v5.14 (when use of these code points
+became generally reliable) through v5.18. The difference is that Perl
+treated all \f(CW\*(C`\ep{}\*(C'\fR matches as failing, but all \f(CW\*(C`\eP{}\*(C'\fR matches as
+succeeding.
+.PP
+One problem with this is that it leads to unexpected and confusing
+results in some cases:
+.PP
+.Vb 2
+\&    chr(0x110000) =~ /\ep{ASCII_Hex_Digit=True}/     # Failed on <= v5.18
+\&    chr(0x110000) =~ /\ep{ASCII_Hex_Digit=False}/    # Failed! on <= v5.18
+.Ve
+.PP
+That is, it treated both matches as undefined, and converted that to
+false (raising a warning on each). The first case is the expected
+result, but the second is likely counterintuitive: "How could both be
+false when they are complements?" Another problem was that the
+implementation optimized many Unicode property matches down to already
+existing simpler, faster operations, which don't raise the warning. We
+chose to not forgo those optimizations, which help the vast majority of
+matches, just to generate a warning for the unlikely event that an
+above-Unicode code point is being matched against.
+.PP
+As a result of these problems, starting in v5.20, what Perl does is
+to treat non-Unicode code points as just typical unassigned Unicode
+characters, and matches accordingly. (Note: Unicode has atypical
+unassigned code points. For example, it has noncharacter code points,
+and ones that, when they do get assigned, are destined to be written
+Right-to-left, as Arabic and Hebrew are. Perl assumes that no
+non-Unicode code point has any atypical properties.)
+.PP
+Perl, in most cases, will raise a warning when matching an above-Unicode
+code point against a Unicode property when the result is \f(CW\*(C`TRUE\*(C'\fR for
+\&\f(CW\*(C`\ep{}\*(C'\fR, and \f(CW\*(C`FALSE\*(C'\fR for \f(CW\*(C`\eP{}\*(C'\fR. For example:
+.PP
+.Vb 2
+\&    chr(0x110000) =~ /\ep{ASCII_Hex_Digit=True}/     # Fails, no warning
+\&    chr(0x110000) =~ /\ep{ASCII_Hex_Digit=False}/    # Succeeds, with warning
+.Ve
+.PP
+In both these examples, the character being matched is non-Unicode, so
+Unicode doesn't define how it should match. It clearly isn't an ASCII
+hex digit, so the first example clearly should fail, and so it does,
+with no warning. But it is arguable that the second example should have
+an undefined, hence \f(CW\*(C`FALSE\*(C'\fR, result. So a warning is raised for it.
+.PP
+Thus the warning is raised for many fewer cases than in earlier Perls,
+and only when what the result is could be arguable. It turns out that
+none of the optimizations made by Perl (or are ever likely to be made)
+cause the warning to be skipped, so it solves both problems of Perl's
+earlier approach. The most commonly used property that is affected by
+this change is \f(CW\*(C`\ep{Unassigned}\*(C'\fR which is a short form for
+\&\f(CW\*(C`\ep{General_Category=Unassigned}\*(C'\fR. Starting in v5.20, all non-Unicode
+code points are considered \f(CW\*(C`Unassigned\*(C'\fR. In earlier releases the
+matches failed because the result was considered undefined.
+.PP
+The only place where the warning is not raised when it arguably should
+have been is if optimizations cause the whole pattern match to not even
+be attempted. For example, Perl may figure out that for a string to
+match a certain regular expression pattern, the string has to contain
+the substring \f(CW"foobar"\fR. Before attempting the match, Perl may look
+for that substring, and if not found, immediately fail the match without
+actually trying it; so no warning gets generated even if the string
+contains an above-Unicode code point.
+.PP
+This behavior is more "Do what I mean" than in earlier Perls for most
+applications. But it catches fewer issues for code that needs to be
+strictly Unicode compliant. Therefore there is an additional mode of
+operation available to accommodate such code. This mode is enabled if a
+regular expression pattern is compiled within the lexical scope where
+the \f(CW"non_unicode"\fR warning class has been made fatal, say by:
+.PP
+.Vb 1
+\& use warnings FATAL => "non_unicode"
+.Ve
+.PP
+(see warnings). In this mode of operation, Perl will raise the
+warning for all matches against a non-Unicode code point (not just the
+arguable ones), and it skips the optimizations that might cause the
+warning to not be output. (It currently still won't warn if the match
+isn't even attempted, like in the \f(CW"foobar"\fR example above.)
+.PP
+In summary, Perl now normally treats non-Unicode code points as typical
+Unicode unassigned code points for regular expression matches, raising a
+warning only when it is arguable what the result should be. However, if
+this warning has been made fatal, it isn't skipped.
+.PP
+There is one exception to all this. \f(CW\*(C`\ep{All}\*(C'\fR looks like a Unicode
+property, but it is a Perl extension that is defined to be true for all
+possible code points, Unicode or not, so no warning is ever generated
+when matching this against a non-Unicode code point. (Prior to v5.20,
+it was an exact synonym for \f(CW\*(C`\ep{Any}\*(C'\fR, matching code points \f(CW0\fR
+through \f(CW0x10FFFF\fR.)
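+.PP
+The behavior described in this section can be observed with a short
+script such as the following sketch, which assumes Perl v5.20 or later.
+The warning handler is just one way to collect warnings for inspection,
+and the variable names are invented for this example. The comments
+restate the text above; the number of warnings collected is printed
+rather than assumed.
+.PP
+.Vb 2
+\&    use strict;
+\&    use warnings;
+\&
+\&    my $above = chr(0x110000);    # first code point beyond Unicode
+\&    my @seen;                     # warnings raised while matching
+\&    local $SIG{_\|_WARN_\|_} = sub { push @seen, $_[0] };
+\&
+\&    # Per the text: fails, with no warning.
+\&    my $is_hex  = $above =~ /\ep{ASCII_Hex_Digit=True}/  ? 1 : 0;
+\&
+\&    # Per the text: succeeds, normally with a "non_unicode" warning.
+\&    my $not_hex = $above =~ /\ep{ASCII_Hex_Digit=False}/ ? 1 : 0;
+\&
+\&    # \ep{All} is defined for every code point and never warns.
+\&    my $in_all  = $above =~ /\ep{All}/ ? 1 : 0;
+\&
+\&    print "True: $is_hex, False: $not_hex, All: $in_all\en";
+\&    print "warnings collected: ", scalar(@seen), "\en";
+.Ve
+.PP
+Adding \f(CW\*(C`use\ warnings\ FATAL\ =>\ "non_unicode"\*(C'\fR to the script would
+select the stricter mode described above, in which the first warning
+raised becomes an exception.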
+.SS "Security Implications of Unicode"
+.IX Subsection "Security Implications of Unicode"
+First, read
+Unicode Security Considerations <https://www.unicode.org/reports/tr36>.
+.PP
+Also, note the following:
+.IP \(bu 4
+Malformed UTF\-8
+.Sp
+UTF\-8 is very structured, so many combinations of bytes are invalid. In
+the past, Perl tried to soldier on and make some sense of invalid
+combinations, but this can lead to security holes, so now, if the Perl
+core needs to process an invalid combination, it will either raise a
+fatal error, or will replace those bytes by the sequence that forms the
+Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it.
+.Sp
+Every code point can be represented by more than one possible
+syntactically valid UTF\-8 sequence. Early on, both Unicode and Perl
+considered any of these to be valid, but now, all sequences longer
+than the shortest possible one are considered to be malformed.
+.Sp
+Unicode considers many code points to be illegal, or to be avoided.
+Perl generally accepts them, once they have passed through any input
+filters that may try to exclude them. These have been discussed above
+(see "Surrogates" under UTF\-16 in "Unicode Encodings",
+"Noncharacter code points", and "Beyond Unicode code points").
+.IP \(bu 4
+Regular expression pattern matching may surprise you if you're not
+accustomed to Unicode. Starting in Perl 5.14, several pattern
+modifiers are available to control this, called the character set
+modifiers. Details are given in "Character set modifiers" in perlre.
+.PP
+As discussed elsewhere, Perl has one foot (two hooves?) planted in
+each of two worlds: the old world of ASCII and single-byte locales, and
+the new world of Unicode, upgrading when necessary.
+If your legacy code does not explicitly use Unicode, no automatic
+switch-over to Unicode should happen.
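+.PP
+Application code can apply the same strictness to its own input. The
+following sketch uses the \f(CW\*(C`Encode\*(C'\fR module with \f(CW\*(C`FB_CROAK\*(C'\fR so that a
+non-shortest sequence is rejected instead of being silently accepted.
+It illustrates checking at the application level only, not how the Perl
+core handles malformations internally; the variable names are invented
+for this example.
+.PP
+.Vb 3
+\&    use strict;
+\&    use warnings;
+\&    use Encode qw(decode);
+\&
+\&    # 0xC0 0xAF is a non\-shortest ("overlong") encoding of "/".
+\&    my $overlong = "\exC0\exAF";
+\&
+\&    # FB_CROAK makes decode() die on malformed input.  Work on a copy,
+\&    # because decode() may modify its input when a CHECK value is given.
+\&    my $copy     = $overlong;
+\&    my $accepted = eval {
+\&        decode("UTF\-8", $copy, Encode::FB_CROAK());
+\&        1;
+\&    } ? 1 : 0;
+\&
+\&    print $accepted ? "accepted\en" : "rejected as malformed\en";
+.Ve
+.PP
+Without a CHECK argument, \f(CWdecode()\fR substitutes the REPLACEMENT
+CHARACTER for the offending bytes instead of dying.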
+.SS "Unicode in Perl on EBCDIC"
+.IX Subsection "Unicode in Perl on EBCDIC"
+Unicode is supported on EBCDIC platforms. See perlebcdic.
+.PP
+Unless ASCII vs. EBCDIC issues are specifically being discussed,
+references to UTF\-8 encoding in this document and elsewhere should be
+read as meaning UTF-EBCDIC on EBCDIC platforms.
+See "Unicode and UTF" in perlebcdic.
+.PP
+Because UTF-EBCDIC is so similar to UTF\-8, the differences are mostly
+hidden from you; \f(CW\*(C`use\ utf8\*(C'\fR (and NOT something like
+\&\f(CW\*(C`use\ utfebcdic\*(C'\fR) declares the script is in the platform's
+"native" 8\-bit encoding of Unicode. (Similarly for the \f(CW":utf8"\fR
+layer.)
+.SS Locales
+.IX Subsection "Locales"
+See "Unicode and UTF\-8" in perllocale.
+.SS "When Unicode Does Not Happen"
+.IX Subsection "When Unicode Does Not Happen"
+There are still many places where Unicode (in some encoding or
+another) could be given as arguments or received as results, or both in
+Perl, but it is not, in spite of Perl having extensive ways to input and
+output in Unicode, and a few other "entry points" like the \f(CW@ARGV\fR
+array (which can sometimes be interpreted as UTF\-8).
+.PP
+The following are such interfaces. Also, see "The "Unicode Bug"".
+For all of these interfaces Perl
+currently (as of v5.16.0) simply assumes byte strings both as arguments
+and results, or UTF\-8 strings if the (deprecated) \f(CW\*(C`encoding\*(C'\fR pragma has been used.
+.PP
+One reason that Perl does not attempt to resolve the role of Unicode in
+these situations is that the answers are highly dependent on the operating
+system and the file system(s). For example, whether filenames can be
+in Unicode and in exactly what kind of encoding, is not exactly a
+portable concept. Similarly for \f(CW\*(C`qx\*(C'\fR and \f(CW\*(C`system\*(C'\fR: how well will the
+"command-line interface" (and which of them?) handle Unicode?
+.IP \(bu 4
+\&\f(CW\*(C`chdir\*(C'\fR, \f(CW\*(C`chmod\*(C'\fR, \f(CW\*(C`chown\*(C'\fR, \f(CW\*(C`chroot\*(C'\fR, \f(CW\*(C`exec\*(C'\fR, \f(CW\*(C`link\*(C'\fR, \f(CW\*(C`lstat\*(C'\fR, \f(CW\*(C`mkdir\*(C'\fR,
+\&\f(CW\*(C`rename\*(C'\fR, \f(CW\*(C`rmdir\*(C'\fR, \f(CW\*(C`stat\*(C'\fR, \f(CW\*(C`symlink\*(C'\fR, \f(CW\*(C`truncate\*(C'\fR, \f(CW\*(C`unlink\*(C'\fR, \f(CW\*(C`utime\*(C'\fR, \f(CW\*(C`\-X\*(C'\fR
+.IP \(bu 4
+\&\f(CW%ENV\fR
+.IP \(bu 4
+\&\f(CW\*(C`glob\*(C'\fR (aka the \f(CW\*(C`<*>\*(C'\fR)
+.IP \(bu 4
+\&\f(CW\*(C`open\*(C'\fR, \f(CW\*(C`opendir\*(C'\fR, \f(CW\*(C`sysopen\*(C'\fR
+.IP \(bu 4
+\&\f(CW\*(C`qx\*(C'\fR (aka the backtick operator), \f(CW\*(C`system\*(C'\fR
+.IP \(bu 4
+\&\f(CW\*(C`readdir\*(C'\fR, \f(CW\*(C`readlink\*(C'\fR
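+.PP
+Because these interfaces deal in bytes, one common approach is to encode
+and decode explicitly at the boundary. The sketch below assumes a file
+system that stores UTF\-8 encoded names unchanged, which is not true
+everywhere (some systems normalize or reject such names), so treat it as
+an illustration rather than portable code. The file name is invented
+for this example.
+.PP
+.Vb 4
+\&    use strict;
+\&    use warnings;
+\&    use Encode qw(encode decode);
+\&    use File::Temp qw(tempdir);
+\&
+\&    my $dir  = tempdir(CLEANUP => 1);
+\&    my $name = "r\ex{E9}sum\ex{E9}.txt";    # a character string
+\&
+\&    # Encode explicitly before handing the name to open().
+\&    my $path = "$dir/" . encode("UTF\-8", $name);
+\&    open(my $fh, ">", $path) or die "Cannot create file: $!";
+\&    close($fh);
+\&
+\&    # readdir returns bytes; decode them to get characters back.
+\&    opendir(my $dh, $dir) or die "Cannot read directory: $!";
+\&    my @names = map  { decode("UTF\-8", $_) }
+\&                grep { !/^\e.\e.?\ez/ } readdir($dh);
+\&    closedir($dh);
+\&
+\&    print "round trip ok\en" if @names == 1 && $names[0] eq $name;
+.Ve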
+.SS "The ""Unicode Bug"""
+.IX Subsection "The ""Unicode Bug"""
+The term "Unicode bug" has been applied to an inconsistency with the
+code points in the \f(CW\*(C`Latin\-1 Supplement\*(C'\fR block, that is, between
+128 and 255. Without a locale specified, unlike all other characters or
+code points, these characters can have very different semantics
+depending on the rules in effect. (Characters whose code points are
+above 255 force Unicode rules; whereas the rules for ASCII characters
+are the same under both ASCII and Unicode rules.)
+.PP
+Under Unicode rules, these upper\-Latin1 characters are interpreted as
+Unicode code points, which means they have the same semantics as Latin\-1
+(ISO\-8859\-1) and C1 controls.
+.PP
+As explained in "ASCII Rules versus Unicode Rules", under ASCII rules,
+they are considered to be unassigned characters.
+.PP
+This can lead to unexpected results. For example, a string's
+semantics can suddenly change if a code point above 255 is appended to
+it, which changes the rules from ASCII to Unicode. As an
+example, consider the following program and its output:
+.PP
+.Vb 11
+\&    $ perl \-le\*(Aq
+\&        no feature "unicode_strings";
+\&        $s1 = "\exC2";
+\&        $s2 = "\ex{2660}";
+\&        for ($s1, $s2, $s1.$s2) {
+\&            print /\ew/ || 0;
+\&        }
+\&    \*(Aq
+\&    0
+\&    0
+\&    1
+.Ve
+.PP
+If there's no \f(CW\*(C`\ew\*(C'\fR in either \f(CW$s1\fR or \f(CW$s2\fR, why does their concatenation
+have one?
+.PP
+This anomaly stems from Perl's attempt to not disturb older programs that
+didn't use Unicode, along with Perl's desire to add Unicode support
+seamlessly. But the result turned out to not be seamless. (By the way,
+you can choose to be warned when things like this happen. See
+\&\f(CW\*(C`encoding::warnings\*(C'\fR.)
+.PP
+\&\f(CW\*(C`use\ feature\ \*(Aqunicode_strings\*(Aq\*(C'\fR
+was added, starting in Perl v5.12, to address this problem. It affects
+these things:
+.IP \(bu 4
+Changing the case of a scalar, that is, using \f(CWuc()\fR, \f(CWucfirst()\fR, \f(CWlc()\fR,
+and \f(CWlcfirst()\fR, or \f(CW\*(C`\eL\*(C'\fR, \f(CW\*(C`\eU\*(C'\fR, \f(CW\*(C`\eu\*(C'\fR and \f(CW\*(C`\el\*(C'\fR in double-quotish
+contexts, such as regular expression substitutions.
+.Sp
+Under \f(CW\*(C`unicode_strings\*(C'\fR starting in Perl 5.12.0, Unicode rules are
+generally used. See "lc" in perlfunc for details on how this works
+in combination with various other pragmas.
+.IP \(bu 4
+Using caseless (\f(CW\*(C`/i\*(C'\fR) regular expression matching.
+.Sp
+Starting in Perl 5.14.0, regular expressions compiled within
+the scope of \f(CW\*(C`unicode_strings\*(C'\fR use Unicode rules
+even when executed or compiled into larger
+regular expressions outside the scope.
+.IP \(bu 4
+Matching any of several properties in regular expressions.
+.Sp
+These properties are \f(CW\*(C`\eb\*(C'\fR (without braces), \f(CW\*(C`\eB\*(C'\fR (without braces),
+\&\f(CW\*(C`\es\*(C'\fR, \f(CW\*(C`\eS\*(C'\fR, \f(CW\*(C`\ew\*(C'\fR, \f(CW\*(C`\eW\*(C'\fR, and all the Posix character classes
+\&\fIexcept\fR \f(CW\*(C`[[:ascii:]]\*(C'\fR.
+.Sp
+Starting in Perl 5.14.0, regular expressions compiled within
+the scope of \f(CW\*(C`unicode_strings\*(C'\fR use Unicode rules
+even when executed or compiled into larger
+regular expressions outside the scope.
+.IP \(bu 4
+In \f(CW\*(C`quotemeta\*(C'\fR or its inline equivalent \f(CW\*(C`\eQ\*(C'\fR.
+.Sp
+Starting in Perl 5.16.0, consistent quoting rules are used within the
+scope of \f(CW\*(C`unicode_strings\*(C'\fR, as described in "quotemeta" in perlfunc.
+Prior to that, or outside its scope, no code points above 127 are quoted
+in UTF\-8 encoded strings, but in byte encoded strings, code points
+between 128\-255 are always quoted.
+.IP \(bu 4
+In the \f(CW\*(C`..\*(C'\fR or range operator.
+.Sp
+Starting in Perl 5.26.0, the range operator on strings treats their lengths
+consistently within the scope of \f(CW\*(C`unicode_strings\*(C'\fR. Prior to that, or
+outside its scope, it could produce strings whose length in characters
+exceeded that of the right-hand side, where the right-hand side took up more
+bytes than the correct range endpoint.
+.IP \(bu 4
+In \f(CW\*(C`split\*(C'\fR's special-case whitespace splitting.
+.Sp
+Starting in Perl 5.28.0, the \f(CW\*(C`split\*(C'\fR function with a pattern specified as
+a string containing a single space handles whitespace characters consistently
+within the scope of \f(CW\*(C`unicode_strings\*(C'\fR. Prior to that, or outside its scope,
+characters that are whitespace according to Unicode rules but not according to
+ASCII rules were treated as field contents rather than field separators when
+they appear in byte-encoded strings.
+.PP
+You can see from the above that the effect of \f(CW\*(C`unicode_strings\*(C'\fR
+increased over several Perl releases. (And Perl's support for Unicode
+continues to improve; it's best to use the latest available release in
+order to get the most complete and accurate results possible.) Note that
+\&\f(CW\*(C`unicode_strings\*(C'\fR is automatically chosen if you \f(CW\*(C`use\ v5.12\*(C'\fR or
+higher.
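+.PP
+As a minimal sketch (not part of the original example, and assuming Perl
+v5.14 or later so that the regular expression rules listed above apply),
+rerunning the earlier program with the feature enabled should make the
+anomaly disappear: the first string now matches on its own, because U+00C2
+is a letter under Unicode rules.
+.PP
+.Vb 9
+\&    use strict;
+\&    use warnings;
+\&    use feature \*(Aqunicode_strings\*(Aq;
+\&
+\&    my $s1 = "\exC2";        # U+00C2, held as a single byte
+\&    my $s2 = "\ex{2660}";    # U+2660, not a word character
+\&    for my $s ($s1, $s2, $s1 . $s2) {
+\&        print $s =~ /\ew/ ? 1 : 0, "\en";   # prints 1, 0, 1
+\&    }
+.Ve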
+.PP
+For Perls earlier than those described above, or when a string is passed
+to a function outside the scope of \f(CW\*(C`unicode_strings\*(C'\fR, see the next section.
+.SS "Forcing Unicode in Perl (Or Unforcing Unicode in Perl)"
+.IX Subsection "Forcing Unicode in Perl (Or Unforcing Unicode in Perl)"
+Sometimes (see "When Unicode Does Not Happen" or "The "Unicode Bug"")
+there are situations where you simply need to force a byte
+string into UTF\-8, or vice versa. The standard module Encode can be
+used for this, or the low-level calls
+\&\f(CWutf8::upgrade($bytestring)\fR and
+\&\f(CW\*(C`utf8::downgrade($utf8string[, FAIL_OK])\*(C'\fR.
+.PP
+Note that \f(CWutf8::downgrade()\fR can fail if the string contains characters
+that don't fit into a byte.
+.PP
+Calling either function on a string that already is in the desired state is a
+no-op.
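+.PP
+A minimal sketch of the two low-level calls follows; the variable names are
+illustrative only. It uses \f(CWutf8::is_utf8()\fR to inspect the internal
+representation, and passes a true \f(CW\*(C`FAIL_OK\*(C'\fR argument so that a failed
+downgrade returns false instead of dying:
+.PP
+.Vb 14
+\&    use strict;
+\&    use warnings;
+\&
+\&    my $str = "\exE9";                   # one character, stored as one byte
+\&    utf8::upgrade($str);                 # same character, UTF\-8 internally
+\&    print utf8::is_utf8($str) ? "upgraded\en" : "bytes\en";
+\&    print length($str), "\en";           # still one character
+\&
+\&    utf8::downgrade($str);               # back to the byte representation
+\&    print utf8::is_utf8($str) ? "upgraded\en" : "bytes\en";
+\&
+\&    my $wide = "\ex{2660}";              # does not fit into a byte
+\&    my $ok = utf8::downgrade($wide, 1);  # FAIL_OK is true: no exception
+\&    print $ok ? "downgraded\en" : "left as is\en";
+.Ve
+.PP
+Neither call changes the characters the string contains; only the internal
+representation differs.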
+.PP
+"ASCII Rules versus Unicode Rules" gives all the ways that a string is
+made to use Unicode rules.
+.SS "Using Unicode in XS"
+.IX Subsection "Using Unicode in XS"
+See "Unicode Support" in perlguts for an introduction to Unicode at
+the XS level, and "Unicode Support" in perlapi for the API details.
+.SS "Hacking Perl to work on earlier Unicode versions (for very serious hackers only)"
+.IX Subsection "Hacking Perl to work on earlier Unicode versions (for very serious hackers only)"
+Perl by default comes with the latest supported Unicode version built-in, but
+the goal is to allow you to change to use any earlier one. In Perls
+v5.20 and v5.22, however, the earliest usable version is Unicode 5.1.
+Perl v5.18 and v5.24 are able to handle all earlier versions.
+.PP
+Download the files in the desired version of Unicode from the Unicode web
+site (<https://www.unicode.org>). These should replace the existing files in
+\&\fIlib/unicore\fR in the Perl source tree. Follow the instructions in
+\&\fIREADME.perl\fR in that directory to change some of their names, and then build
+perl (see INSTALL).
+.SS "Porting code from perl\-5.6.X"
+.IX Subsection "Porting code from perl-5.6.X"
+Perls starting in 5.8 have a different Unicode model from 5.6. In 5.6 the
+programmer was required to use the \f(CW\*(C`utf8\*(C'\fR pragma to declare that a
+given scope expected to deal with Unicode data and had to make sure that
+only Unicode data were reaching that scope. If you have code that is
+working with 5.6, you will need some of the following adjustments to
+your code. The examples are written such that the code will continue to
+work under 5.6, so you should be safe to try them out.
+.IP \(bu 3
+A filehandle that should read or write UTF\-8
+.Sp
+.Vb 3
+\&    if ($] >= 5.008) {
+\&        binmode $fh, ":encoding(UTF\-8)";
+\&    }
+.Ve
+.IP \(bu 3
+A scalar that is going to be passed to some extension
+.Sp
+Be it \f(CW\*(C`Compress::Zlib\*(C'\fR, \f(CW\*(C`Apache::Request\*(C'\fR or any extension that has no
+mention of Unicode in the manpage, you need to make sure that the
+UTF8 flag is stripped off. Note that at the time of this writing
+(January 2012) the mentioned modules are not UTF\-8\-aware. Please
+check the documentation to verify if this is still true.
+.Sp
+.Vb 4
+\&    if ($] >= 5.008) {
+\&        require Encode;
+\&        $val = Encode::encode("UTF\-8", $val); # make octets
+\&    }
+.Ve
+.IP \(bu 3
+A scalar we got back from an extension
+.Sp
+If you believe the scalar comes back as UTF\-8, you will most likely
+want the UTF8 flag restored:
+.Sp
+.Vb 4
+\&    if ($] >= 5.008) {
+\&        require Encode;
+\&        $val = Encode::decode("UTF\-8", $val);
+\&    }
+.Ve
+.IP \(bu 3
+Same thing, if you are really sure it is UTF\-8
+.Sp
+.Vb 4
+\&    if ($] >= 5.008) {
+\&        require Encode;
+\&        Encode::_utf8_on($val);
+\&    }
+.Ve
+.IP \(bu 3
+A wrapper for DBI \f(CW\*(C`fetchrow_array\*(C'\fR and \f(CW\*(C`fetchrow_hashref\*(C'\fR
+.Sp
+When the database contains only UTF\-8, a wrapper function or method is
+a convenient way to replace all your \f(CW\*(C`fetchrow_array\*(C'\fR and
+\&\f(CW\*(C`fetchrow_hashref\*(C'\fR calls. A wrapper function will also make it easier to
+adapt to future enhancements in your database driver. Note that at the
+time of this writing (January 2012), the DBI has no standardized way
+to deal with UTF\-8 data. Please check the DBI documentation to verify if
+that is still true.
+.Sp
+.Vb 10
+\&    sub fetchrow {
+\&        # $what is one of fetchrow_{array,hashref}
+\&        my($self, $sth, $what) = @_;
+\&        if ($] < 5.008) {
+\&            return $sth\->$what;
+\&        } else {
+\&            require Encode;
+\&            if (wantarray) {
+\&                my @arr = $sth\->$what;
+\&                for (@arr) {
+\&                    defined && /[^\e000\-\e177]/ && Encode::_utf8_on($_);
+\&                }
+\&                return @arr;
+\&            } else {
+\&                my $ret = $sth\->$what;
+\&                if (ref $ret) {
+\&                    for my $k (keys %$ret) {
+\&                        defined
+\&                        && /[^\e000\-\e177]/
+\&                        && Encode::_utf8_on($_) for $ret\->{$k};
+\&                    }
+\&                    return $ret;
+\&                } else {
+\&                    defined && /[^\e000\-\e177]/ && Encode::_utf8_on($_) for $ret;
+\&                    return $ret;
+\&                }
+\&            }
+\&        }
+\&    }
+.Ve
+.IP \(bu 3
+A large scalar that you know can only contain ASCII
+.Sp
+Scalars that contain only ASCII and are marked as UTF\-8 can sometimes
+slow your program down. If you recognize such a situation, just remove
+the UTF8 flag:
+.Sp
+.Vb 1
+\&    utf8::downgrade($val) if $] >= 5.008;
+.Ve
+.SH BUGS
+.IX Header "BUGS"
+See also "The "Unicode Bug"" above.
+.SS "Interaction with Extensions"
+.IX Subsection "Interaction with Extensions"
+When Perl exchanges data with an extension, the extension should be
+able to understand the UTF8 flag and act accordingly. If the
+extension doesn't recognize that flag, it's likely that the extension
+will return incorrectly-flagged data.
+.PP
+So if you're working with Unicode data, consult the documentation of
+every module you're using to see whether there are any issues with Unicode
+data exchange. If the documentation does not talk about Unicode at all,
+suspect the worst and probably look at the source to learn how the
+module is implemented. Modules written completely in Perl shouldn't
+cause problems. Modules that directly or indirectly access code written
+in other programming languages are at risk.
+.PP
+For affected functions, the simple strategy to avoid data corruption is
+to always make the encoding of the exchanged data explicit. Choose an
+encoding that you know the extension can handle. Convert arguments passed
+to the extensions to that encoding and convert results back from that
+encoding. Write wrapper functions that do the conversions for you, so
+you can later change the functions when the extension catches up.
+.PP
+To provide an example, let's say the popular \f(CW\*(C`Foo::Bar::escape_html\*(C'\fR
+function doesn't deal with Unicode data yet. The wrapper function
+would convert the argument to raw UTF\-8 and convert the result back to
+Perl's internal representation like so:
+.PP
+.Vb 6
+\&    sub my_escape_html ($) {
+\&        my($what) = shift;
+\&        return unless defined $what;
+\&        Encode::decode("UTF\-8", Foo::Bar::escape_html(
+\&                                    Encode::encode("UTF\-8", $what)));
+\&    }
+.Ve
+.PP
+Sometimes, when the extension does not convert data but just stores
+and retrieves it, you will be able to use the otherwise
+dangerous \f(CWEncode::_utf8_on()\fR function. Let's say
+the popular \f(CW\*(C`Foo::Bar\*(C'\fR extension, written in C, provides a \f(CW\*(C`param\*(C'\fR
+method that lets you store and retrieve data according to these prototypes:
+.PP
+.Vb 2
+\&    $self\->param($name, $value);            # set a scalar
+\&    $value = $self\->param($name);           # retrieve a scalar
+.Ve
+.PP
+If it does not yet provide support for any encoding, one could write a
+derived class with such a \f(CW\*(C`param\*(C'\fR method:
+.PP
+.Vb 12
+\&    sub param {
+\&        my($self,$name,$value) = @_;
+\&        utf8::upgrade($name);     # make sure it is UTF\-8 encoded
+\&        if (defined $value) {
+\&            utf8::upgrade($value);  # make sure it is UTF\-8 encoded
+\&            return $self\->SUPER::param($name,$value);
+\&        } else {
+\&            my $ret = $self\->SUPER::param($name);
+\&            Encode::_utf8_on($ret); # we know, it is UTF\-8 encoded
+\&            return $ret;
+\&        }
+\&    }
+.Ve
+.PP
+Some extensions provide filters on data entry/exit points, such as
+\&\f(CW\*(C`DB_File::filter_store_key\*(C'\fR and family. Look out for such filters in
+the documentation of your extensions; they can make the transition to
+Unicode data much easier.
+.SS Speed
+.IX Subsection "Speed"
+Some functions are slower when working on UTF\-8 encoded strings than
+on byte encoded strings. All functions that need to hop over
+characters such as \f(CWlength()\fR, \f(CWsubstr()\fR or \f(CWindex()\fR, or matching
+regular expressions can work \fBmuch\fR faster when the underlying data are
+byte-encoded.
+.PP
+In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
+a caching scheme was introduced which improved the situation. In general,
+operations with UTF\-8 encoded strings are still slower. As an example,
+the Unicode properties (character classes) like \f(CW\*(C`\ep{Nd}\*(C'\fR are known to
+be quite a bit slower (5\-20 times) than their simpler counterparts
+like \f(CW\*(C`[0\-9]\*(C'\fR (then again, there are hundreds of Unicode characters matching
+\&\f(CW\*(C`Nd\*(C'\fR compared with the 10 ASCII characters matching \f(CW\*(C`[0\-9]\*(C'\fR).
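+.PP
+The difference in what the two classes accept can be seen in a small
+sketch. The choice of U+0663 (ARABIC\-INDIC DIGIT THREE) is only an
+illustration; any decimal digit outside ASCII would behave the same way:
+.PP
+.Vb 6
+\&    use strict;
+\&    use warnings;
+\&
+\&    my $digit = "\ex{0663}";    # a decimal digit outside the ASCII range
+\&    print $digit =~ /\ep{Nd}/ ? "Nd matches\en" : "Nd fails\en";
+\&    print $digit =~ /[0\-9]/ ? "[0\-9] matches\en" : "[0\-9] fails\en";
+.Ve
+.PP
+If only ASCII digits are expected in the input, the bracketed class is both
+the faster and the stricter choice.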
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+perlunitut, perluniintro, perluniprops, Encode, open, utf8, bytes,
+perlretut, "${^UNICODE}" in perlvar,
+<https://www.unicode.org/reports/tr44>.