Adding upstream version 4.22.0. (upstream/4.22.0)

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 19:43:11 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-15 19:43:11 +0000
commit: fc22b3d6507c6745911b9dfcc68f1e665ae13dbc (patch)
tree: ce1e3bce06471410239a6f41282e328770aa404a /upstream/debian-unstable/man1/perlrecharclass.1
parent: Initial commit. (diff)
1 files changed, 1301 insertions, 0 deletions
diff --git a/upstream/debian-unstable/man1/perlrecharclass.1 b/upstream/debian-unstable/man1/perlrecharclass.1
new file mode 100644
index 00000000..b58bd05e
--- /dev/null
+++ b/upstream/debian-unstable/man1/perlrecharclass.1
@@ -0,0 +1,1301 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+.    ds C` ""
+.    ds C' ""
+'br\}
+.el\{\
+.    ds C`
+.    ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el       .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD.  Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+.    if \nF \{\
+.        de IX
+.        tm Index:\\$1\t\\n%\t"\\$2"
+..
+.        if !\nF==2 \{\
+.            nr % 0
+.            nr F 2
+.        \}
+.    \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "PERLRECHARCLASS 1"
+.TH PERLRECHARCLASS 1 2024-01-12 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+perlrecharclass \- Perl Regular Expression Character Classes
+.IX Xref "character class"
+.SH DESCRIPTION
+.IX Header "DESCRIPTION"
+The top level documentation about Perl regular expressions
+is found in perlre.
+.PP
+This manual page discusses the syntax and use of character
+classes in Perl regular expressions.
+.PP
+A character class is a way of denoting a set of characters
+in such a way that one character of the set is matched.
+It's important to remember that matching a character class
+consumes exactly one character in the source string. (The source
+string is the string the regular expression is matched against.)
+.PP
+There are three types of character classes in Perl regular
+expressions: the dot, backslash sequences, and the form enclosed in square
+brackets.  Keep in mind, though, that often the term "character class" is used
+to mean just the bracketed form.  Certainly, most Perl documentation does that.
+.SS "The dot"
+.IX Subsection "The dot"
+The dot (or period), \f(CW\*(C`.\*(C'\fR, is probably the most used, and certainly
+the most well-known character class. By default, a dot matches any
+character, except for the newline. That default can be changed to
+add matching the newline by using the \fIsingle line\fR modifier:
+for the entire regular expression with the \f(CW\*(C`/s\*(C'\fR modifier, or
+locally with \f(CW\*(C`(?s)\*(C'\fR  (and even globally within the scope of
+\&\f(CW\*(C`use re \*(Aq/s\*(Aq\*(C'\fR).  (The \f(CW"\eN"\fR backslash
+sequence, described
+below, matches any character except newline without regard to the
+\&\fIsingle line\fR modifier.)
+.PP
+Here are some examples:
+.PP
+.Vb 7
+\& "a"  =~  /./       # Match
+\& "."  =~  /./       # Match
+\& ""   =~  /./       # No match (dot has to match a character)
+\& "\en" =~  /./       # No match (dot does not match a newline)
+\& "\en" =~  /./s      # Match (global \*(Aqsingle line\*(Aq modifier)
+\& "\en" =~  /(?s:.)/  # Match (local \*(Aqsingle line\*(Aq modifier)
+\& "ab" =~  /^.$/     # No match (dot matches one character)
+.Ve
+.SS "Backslash sequences"
+.IX Xref "\\w \\W \\s \\S \\d \\D \\p \\P \\N \\v \\V \\h \\H word whitespace"
+.IX Subsection "Backslash sequences"
+A backslash sequence is a sequence of characters, the first one of which is a
+backslash.  Perl ascribes special meaning to many such sequences, and some of
+these are character classes.  That is, they match a single character each,
+provided that the character belongs to the specific set of characters defined
+by the sequence.
+.PP
+Here's a list of the backslash sequences that are character classes.  They
+are discussed in more detail below.  (For the backslash sequences that aren't
+character classes, see perlrebackslash.)
+.PP
+.Vb 10
+\& \ed             Match a decimal digit character.
+\& \eD             Match a non\-decimal\-digit character.
+\& \ew             Match a "word" character.
+\& \eW             Match a non\-"word" character.
+\& \es             Match a whitespace character.
+\& \eS             Match a non\-whitespace character.
+\& \eh             Match a horizontal whitespace character.
+\& \eH             Match a character that isn\*(Aqt horizontal whitespace.
+\& \ev             Match a vertical whitespace character.
+\& \eV             Match a character that isn\*(Aqt vertical whitespace.
+\& \eN             Match a character that isn\*(Aqt a newline.
+\& \epP, \ep{Prop}  Match a character that has the given Unicode property.
+\& \ePP, \eP{Prop}  Match a character that doesn\*(Aqt have the Unicode property
+.Ve
+.PP
+\fI\eN\fR
+.IX Subsection "N"
+.PP
+\&\f(CW\*(C`\eN\*(C'\fR, available starting in v5.12, like the dot, matches any
+character that is not a newline. The difference is that \f(CW\*(C`\eN\*(C'\fR is not influenced
+by the \fIsingle line\fR regular expression modifier (see "The dot" above).  Note
+that the form \f(CW\*(C`\eN{...}\*(C'\fR may mean something completely different.  When the
+\&\f(CW\*(C`{...}\*(C'\fR is a quantifier, it means to match a non-newline
+character that many times.  For example, \f(CW\*(C`\eN{3}\*(C'\fR means to match 3
+non-newlines; \f(CW\*(C`\eN{5,}\*(C'\fR means to match 5 or more non-newlines.  But if \f(CW\*(C`{...}\*(C'\fR
+is not a legal quantifier, it is presumed to be a named character.  See
+charnames for those.  For example, none of \f(CW\*(C`\eN{COLON}\*(C'\fR, \f(CW\*(C`\eN{4F}\*(C'\fR, and
+\&\f(CW\*(C`\eN{F4}\*(C'\fR contain legal quantifiers, so Perl will try to find characters whose
+names are respectively \f(CW\*(C`COLON\*(C'\fR, \f(CW\*(C`4F\*(C'\fR, and \f(CW\*(C`F4\*(C'\fR.
+.PP
+\fIDigits\fR
+.IX Subsection "Digits"
+.PP
+\&\f(CW\*(C`\ed\*(C'\fR matches a single character considered to be a decimal \fIdigit\fR.
+If the \f(CW\*(C`/a\*(C'\fR regular expression modifier is in effect, it matches [0\-9].
+Otherwise, it
+matches anything that is matched by \f(CW\*(C`\ep{Digit}\*(C'\fR, which includes [0\-9].
+(An unlikely possible exception is that under locale matching rules, the
+current locale might not have \f(CW\*(C`[0\-9]\*(C'\fR matched by \f(CW\*(C`\ed\*(C'\fR, and/or might match
+other characters whose code point is less than 256.  The only such locale
+definitions that are legal would be to match \f(CW\*(C`[0\-9]\*(C'\fR plus another set of
+10 consecutive digit characters;  anything else would be in violation of
+the C language standard, but Perl doesn't currently assume anything in
+regard to this.)
+.PP
+What this means is that, unless the \f(CW\*(C`/a\*(C'\fR modifier is in effect, \f(CW\*(C`\ed\*(C'\fR not
+only matches the digits '0' \- '9', but also Arabic, Devanagari, and
+digits from other languages.  This may cause some confusion, and some
+security issues.
+.PP
+Some digits that \f(CW\*(C`\ed\*(C'\fR matches look like some of the [0\-9] ones, but
+have different values.  For example, BENGALI DIGIT FOUR (U+09EA) looks
+very much like an ASCII DIGIT EIGHT (U+0038), and LEPCHA DIGIT SIX
+(U+1C46) looks very much like an ASCII DIGIT FIVE (U+0035).  An
+application that
+is expecting only the ASCII digits might be misled, or if the match is
+\&\f(CW\*(C`\ed+\*(C'\fR, the matched string might contain a mixture of digits from
+different writing systems that look like they signify a number different
+than they actually do.  "\fBnum()\fR" in Unicode::UCD can
+be used to safely
+calculate the value, returning \f(CW\*(C`undef\*(C'\fR if the input string contains
+such a mixture.  Otherwise, for example, the actual value of a displayed
+price might be deliberately made different from what it appears to be.
+.PP
+What \f(CW\*(C`\ep{Digit}\*(C'\fR means (and hence \f(CW\*(C`\ed\*(C'\fR except under the \f(CW\*(C`/a\*(C'\fR
+modifier) is \f(CW\*(C`\ep{General_Category=Decimal_Number}\*(C'\fR, or synonymously,
+\&\f(CW\*(C`\ep{General_Category=Digit}\*(C'\fR.  Starting with Unicode version 4.1, this
+is the same set of characters matched by \f(CW\*(C`\ep{Numeric_Type=Decimal}\*(C'\fR.
+But Unicode also has a different property with a similar name,
+\&\f(CW\*(C`\ep{Numeric_Type=Digit}\*(C'\fR, which matches a completely different set of
+characters.  These characters are things such as \f(CW\*(C`CIRCLED DIGIT ONE\*(C'\fR
+or subscripts, or are from writing systems that lack all ten digits.
+.PP
+The design intent is for \f(CW\*(C`\ed\*(C'\fR to exactly match the set of characters
+that can safely be used with "normal" big-endian positional decimal
+syntax, where, for example 123 means one 'hundred', plus two 'tens',
+plus three 'ones'.  This positional notation does not necessarily apply
+to characters that match the other type of "digit",
+\&\f(CW\*(C`\ep{Numeric_Type=Digit}\*(C'\fR, and so \f(CW\*(C`\ed\*(C'\fR doesn't match them.
+.PP
+The Tamil digits (U+0BE6 \- U+0BEF) can also legally be
+used in old-style Tamil numbers in which they would appear no more than
+one in a row, separated by characters that mean "times 10", "times 100",
+etc.  (See <https://www.unicode.org/notes/tn21>.)
+.PP
+Any character not matched by \f(CW\*(C`\ed\*(C'\fR is matched by \f(CW\*(C`\eD\*(C'\fR.
+.PP
+\fIWord characters\fR
+.IX Subsection "Word characters"
+.PP
+A \f(CW\*(C`\ew\*(C'\fR matches a single alphanumeric character (an alphabetic character, or a
+decimal digit); or a connecting punctuation character, such as an
+underscore ("_"); or a "mark" character (like some sort of accent) that
+attaches to one of those.  It does not match a whole word.  To match a
+whole word, use \f(CW\*(C`\ew+\*(C'\fR.  This isn't the same thing as matching an
+English word, but in the ASCII range it is the same as a string of
+Perl-identifier characters.
+.ie n .IP "If the ""/a"" modifier is in effect ..." 4
+.el .IP "If the \f(CW/a\fR modifier is in effect ..." 4
+.IX Item "If the /a modifier is in effect ..."
+\&\f(CW\*(C`\ew\*(C'\fR matches the 63 characters [a\-zA\-Z0\-9_].
+.IP "otherwise ..." 4
+.IX Item "otherwise ..."
+.RS 4
+.PD 0
+.IP "For code points above 255 ..." 4
+.IX Item "For code points above 255 ..."
+.PD
+\&\f(CW\*(C`\ew\*(C'\fR matches the same as \f(CW\*(C`\ep{Word}\*(C'\fR matches in this range.  That is,
+it matches Thai letters, Greek letters, etc.  This includes connector
+punctuation (like the underscore) which connect two words together, or
+diacritics, such as a \f(CW\*(C`COMBINING TILDE\*(C'\fR and the modifier letters, which
+are generally used to add auxiliary markings to letters.
+.IP "For code points below 256 ..." 4
+.IX Item "For code points below 256 ..."
+.RS 4
+.PD 0
+.IP "if locale rules are in effect ..." 4
+.IX Item "if locale rules are in effect ..."
+.PD
+\&\f(CW\*(C`\ew\*(C'\fR matches the platform's native underscore character plus whatever
+the locale considers to be alphanumeric.
+.IP "if, instead, Unicode rules are in effect ..." 4
+.IX Item "if, instead, Unicode rules are in effect ..."
+\&\f(CW\*(C`\ew\*(C'\fR matches exactly what \f(CW\*(C`\ep{Word}\*(C'\fR matches.
+.IP "otherwise ..." 4
+.IX Item "otherwise ..."
+\&\f(CW\*(C`\ew\*(C'\fR matches [a\-zA\-Z0\-9_].
+.RE
+.RS 4
+.RE
+.RE
+.RS 4
+.RE
+.PP
+Which rules apply is determined as described in "Which character set modifier is in effect?" in perlre.
+.PP
+There are a number of security issues with the full Unicode list of word
+characters.  See <http://unicode.org/reports/tr36>.
+.PP
+Also, for a somewhat finer-grained set of characters that are in programming
+language identifiers beyond the ASCII range, you may wish to instead use the
+more customized "Unicode Properties", \f(CW\*(C`\ep{ID_Start}\*(C'\fR,
+\&\f(CW\*(C`\ep{ID_Continue}\*(C'\fR, \f(CW\*(C`\ep{XID_Start}\*(C'\fR, and \f(CW\*(C`\ep{XID_Continue}\*(C'\fR.  See
+<http://unicode.org/reports/tr31>.
+.PP
+Any character not matched by \f(CW\*(C`\ew\*(C'\fR is matched by \f(CW\*(C`\eW\*(C'\fR.
+.PP
+\fIWhitespace\fR
+.IX Subsection "Whitespace"
+.PP
+\&\f(CW\*(C`\es\*(C'\fR matches any single character considered whitespace.
+.ie n .IP "If the ""/a"" modifier is in effect ..." 4
+.el .IP "If the \f(CW/a\fR modifier is in effect ..." 4
+.IX Item "If the /a modifier is in effect ..."
+In all Perl versions, \f(CW\*(C`\es\*(C'\fR matches the 5 characters [\et\en\ef\er ]; that
+is, the horizontal tab,
+the newline, the form feed, the carriage return, and the space.
+Starting in Perl v5.18, it also matches the vertical tab, \f(CW\*(C`\ecK\*(C'\fR.
+See note \f(CW\*(C`[1]\*(C'\fR below for a discussion of this.
+.IP "otherwise ..." 4
+.IX Item "otherwise ..."
+.RS 4
+.PD 0
+.IP "For code points above 255 ..." 4
+.IX Item "For code points above 255 ..."
+.PD
+\&\f(CW\*(C`\es\*(C'\fR matches exactly the code points above 255 shown with an "s" column
+in the table below.
+.IP "For code points below 256 ..." 4
+.IX Item "For code points below 256 ..."
+.RS 4
+.PD 0
+.IP "if locale rules are in effect ..." 4
+.IX Item "if locale rules are in effect ..."
+.PD
+\&\f(CW\*(C`\es\*(C'\fR matches whatever the locale considers to be whitespace.
+.IP "if, instead, Unicode rules are in effect ..." 4
+.IX Item "if, instead, Unicode rules are in effect ..."
+\&\f(CW\*(C`\es\*(C'\fR matches exactly the characters shown with an "s" column in the
+table below.
+.IP "otherwise ..." 4
+.IX Item "otherwise ..."
+\&\f(CW\*(C`\es\*(C'\fR matches [\et\en\ef\er ] and, starting in Perl
+v5.18, the vertical tab, \f(CW\*(C`\ecK\*(C'\fR.
+(See note \f(CW\*(C`[1]\*(C'\fR below for a discussion of this.)
+Note that this list doesn't include the non-breaking space.
+.RE
+.RS 4
+.RE
+.RE
+.RS 4
+.RE
+.PP
+Which rules apply is determined as described in "Which character set modifier is in effect?" in perlre.
+.PP
+Any character not matched by \f(CW\*(C`\es\*(C'\fR is matched by \f(CW\*(C`\eS\*(C'\fR.
+.PP
+\&\f(CW\*(C`\eh\*(C'\fR matches any character considered horizontal whitespace;
+this includes the platform's space and tab characters and several others
+listed in the table below.  \f(CW\*(C`\eH\*(C'\fR matches any character
+not considered horizontal whitespace.  They use the platform's native
+character set, and do not consider any locale that may otherwise be in
+use.
+.PP
+\&\f(CW\*(C`\ev\*(C'\fR matches any character considered vertical whitespace;
+this includes the platform's carriage return and line feed characters (newline)
+plus several other characters, all listed in the table below.
+\&\f(CW\*(C`\eV\*(C'\fR matches any character not considered vertical whitespace.
+They use the platform's native character set, and do not consider any
+locale that may otherwise be in use.
+.PP
+\&\f(CW\*(C`\eR\*(C'\fR matches anything that can be considered a newline under Unicode
+rules. It can match a multi-character sequence. It cannot be used inside
+a bracketed character class; use \f(CW\*(C`\ev\*(C'\fR instead (vertical whitespace).
+It uses the platform's
+native character set, and does not consider any locale that may
+otherwise be in use.
+Details are discussed in perlrebackslash.
+.PP
+Note that unlike \f(CW\*(C`\es\*(C'\fR (and \f(CW\*(C`\ed\*(C'\fR and \f(CW\*(C`\ew\*(C'\fR), \f(CW\*(C`\eh\*(C'\fR and \f(CW\*(C`\ev\*(C'\fR always match
+the same characters, without regard to other factors, such as the active
+locale or whether the source string is in UTF\-8 format.
+.PP
+One might think that \f(CW\*(C`\es\*(C'\fR is equivalent to \f(CW\*(C`[\eh\ev]\*(C'\fR. This is indeed true
+starting in Perl v5.18, but prior to that, the sole difference was that the
+vertical tab (\f(CW"\ecK"\fR) was not matched by \f(CW\*(C`\es\*(C'\fR.
+.PP
+The following table is a complete listing of characters matched by
+\&\f(CW\*(C`\es\*(C'\fR, \f(CW\*(C`\eh\*(C'\fR and \f(CW\*(C`\ev\*(C'\fR as of Unicode 14.0.
+.PP
+The first column gives the Unicode code point of the character (in hex format),
+the second column gives the (Unicode) name. The third column indicates
+by which class(es) the character is matched (assuming no locale is in
+effect that changes the \f(CW\*(C`\es\*(C'\fR matching).
+.PP
+.Vb 10
+\& 0x0009        CHARACTER TABULATION   h s
+\& 0x000a              LINE FEED (LF)    vs
+\& 0x000b             LINE TABULATION    vs  [1]
+\& 0x000c              FORM FEED (FF)    vs
+\& 0x000d        CARRIAGE RETURN (CR)    vs
+\& 0x0020                       SPACE   h s
+\& 0x0085             NEXT LINE (NEL)    vs  [2]
+\& 0x00a0              NO\-BREAK SPACE   h s  [2]
+\& 0x1680            OGHAM SPACE MARK   h s
+\& 0x2000                     EN QUAD   h s
+\& 0x2001                     EM QUAD   h s
+\& 0x2002                    EN SPACE   h s
+\& 0x2003                    EM SPACE   h s
+\& 0x2004          THREE\-PER\-EM SPACE   h s
+\& 0x2005           FOUR\-PER\-EM SPACE   h s
+\& 0x2006            SIX\-PER\-EM SPACE   h s
+\& 0x2007                FIGURE SPACE   h s
+\& 0x2008           PUNCTUATION SPACE   h s
+\& 0x2009                  THIN SPACE   h s
+\& 0x200a                  HAIR SPACE   h s
+\& 0x2028              LINE SEPARATOR    vs
+\& 0x2029         PARAGRAPH SEPARATOR    vs
+\& 0x202f       NARROW NO\-BREAK SPACE   h s
+\& 0x205f   MEDIUM MATHEMATICAL SPACE   h s
+\& 0x3000           IDEOGRAPHIC SPACE   h s
+.Ve
+.IP [1] 4
+.IX Item "[1]"
+Prior to Perl v5.18, \f(CW\*(C`\es\*(C'\fR did not match the vertical tab.
+\&\f(CW\*(C`[^\eS\ecK]\*(C'\fR (obscurely) matches what \f(CW\*(C`\es\*(C'\fR traditionally did.
+.IP [2] 4
+.IX Item "[2]"
+NEXT LINE and NO-BREAK SPACE may or may not match \f(CW\*(C`\es\*(C'\fR depending
+on the rules in effect.  See
+the beginning of this section.
+.PP
+\fIUnicode Properties\fR
+.IX Subsection "Unicode Properties"
+.PP
+\&\f(CW\*(C`\epP\*(C'\fR and \f(CW\*(C`\ep{Prop}\*(C'\fR are character classes to match characters that fit given
+Unicode properties.  One-letter property names can be used in the \f(CW\*(C`\epP\*(C'\fR form,
+with the property name following the \f(CW\*(C`\ep\*(C'\fR; otherwise, braces are required.
+When using braces, there is a single form, which is just the property name
+enclosed in the braces, and a compound form which looks like \f(CW\*(C`\ep{name=value}\*(C'\fR,
+which means to match if the property "name" for the character has that particular
+"value".
+For instance, a match for a number can be written as \f(CW\*(C`/\epN/\*(C'\fR or as
+\&\f(CW\*(C`/\ep{Number}/\*(C'\fR, or as \f(CW\*(C`/\ep{Number=True}/\*(C'\fR.
+Lowercase letters are matched by the property \fILowercase_Letter\fR which
+has the short form \fILl\fR. They need the braces, so are written as \f(CW\*(C`/\ep{Ll}/\*(C'\fR or
+\&\f(CW\*(C`/\ep{Lowercase_Letter}/\*(C'\fR, or \f(CW\*(C`/\ep{General_Category=Lowercase_Letter}/\*(C'\fR
+(the underscores are optional).
+\&\f(CW\*(C`/\epLl/\*(C'\fR is valid, but means something different.
+It matches a two character string: a letter (Unicode property \f(CW\*(C`\epL\*(C'\fR),
+followed by a lowercase \f(CW\*(C`l\*(C'\fR.
+.PP
+What a Unicode property matches is never subject to locale rules, and
+if locale rules are not otherwise in effect, the use of a Unicode
+property will force the regular expression into using Unicode rules, if
+it isn't already.
+.PP
+Note that almost all properties are immune to case-insensitive matching.
+That is, adding a \f(CW\*(C`/i\*(C'\fR regular expression modifier does not change what
+they match.  But there are two sets that are affected.  The first set is
+\&\f(CW\*(C`Uppercase_Letter\*(C'\fR,
+\&\f(CW\*(C`Lowercase_Letter\*(C'\fR,
+and \f(CW\*(C`Titlecase_Letter\*(C'\fR,
+all of which match \f(CW\*(C`Cased_Letter\*(C'\fR under \f(CW\*(C`/i\*(C'\fR matching.
+The second set is
+\&\f(CW\*(C`Uppercase\*(C'\fR,
+\&\f(CW\*(C`Lowercase\*(C'\fR,
+and \f(CW\*(C`Titlecase\*(C'\fR,
+all of which match \f(CW\*(C`Cased\*(C'\fR under \f(CW\*(C`/i\*(C'\fR matching.
+(The difference between these sets is that some things, such as Roman
+numerals, come in both upper and lower case, so they are \f(CW\*(C`Cased\*(C'\fR, but
+aren't considered to be letters, so they aren't \f(CW\*(C`Cased_Letter\*(C'\fRs. They're
+actually \f(CW\*(C`Letter_Number\*(C'\fRs.)
+This set also includes its subsets \f(CW\*(C`PosixUpper\*(C'\fR and \f(CW\*(C`PosixLower\*(C'\fR, both
+of which under \f(CW\*(C`/i\*(C'\fR match \f(CW\*(C`PosixAlpha\*(C'\fR.
+.PP
+For more details on Unicode properties, see "Unicode
+Character Properties" in perlunicode; for a
+complete list of possible properties, see
+"Properties accessible through \ep{} and \eP{}" in perluniprops,
+which notes all forms that have \f(CW\*(C`/i\*(C'\fR differences.
+It is also possible to define your own properties. This is discussed in
+"User-Defined Character Properties" in perlunicode.
+.PP
+Unicode properties are defined (surprise!) only on Unicode code points.
+Starting in v5.20, when matching against \f(CW\*(C`\ep\*(C'\fR and \f(CW\*(C`\eP\*(C'\fR, Perl treats
+non-Unicode code points (those above the legal Unicode maximum of
+0x10FFFF) as if they were typical unassigned Unicode code points.
+.PP
+Prior to v5.20, Perl raised a warning and made all matches fail on
+non-Unicode code points.  This could be somewhat surprising:
+.PP
+.Vb 3
+\& chr(0x110000) =~ /\ep{ASCII_Hex_Digit=True}/   # Fails on Perls < v5.20.
+\& chr(0x110000) =~ /\ep{ASCII_Hex_Digit=False}/  # Also fails on Perls
+\&                                               # < v5.20
+.Ve
+.PP
+Even though these two matches might be thought of as complements, until
+v5.20 they were so only on Unicode code points.
+.PP
+Starting in perl v5.30, wildcards are allowed in Unicode property
+values.  See "Wildcards in Property Values" in perlunicode.
+.PP
+Examples
+.IX Subsection "Examples"
+.PP
+.Vb 8
+\& "a"  =~  /\ew/      # Match, "a" is a \*(Aqword\*(Aq character.
+\& "7"  =~  /\ew/      # Match, "7" is a \*(Aqword\*(Aq character as well.
+\& "a"  =~  /\ed/      # No match, "a" isn\*(Aqt a digit.
+\& "7"  =~  /\ed/      # Match, "7" is a digit.
+\& " "  =~  /\es/      # Match, a space is whitespace.
+\& "a"  =~  /\eD/      # Match, "a" is a non\-digit.
+\& "7"  =~  /\eD/      # No match, "7" is not a non\-digit.
+\& " "  =~  /\eS/      # No match, a space is not non\-whitespace.
+\&
+\& " "  =~  /\eh/      # Match, space is horizontal whitespace.
+\& " "  =~  /\ev/      # No match, space is not vertical whitespace.
+\& "\er" =~  /\ev/      # Match, a return is vertical whitespace.
+\&
+\& "a"  =~  /\epL/     # Match, "a" is a letter.
+\& "a"  =~  /\ep{Lu}/  # No match, /\ep{Lu}/ matches upper case letters.
+\&
+\& "\ex{0e0b}" =~ /\ep{Thai}/  # Match, \ex{0e0b} is the character
+\&                           # \*(AqTHAI CHARACTER SO SO\*(Aq, and that\*(Aqs in
+\&                           # Thai Unicode class.
+\& "a"  =~  /\eP{Lao}/ # Match, as "a" is not a Laotian character.
+.Ve
+.PP
+It is worth emphasizing that \f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\ew\*(C'\fR, etc., match single characters, not
+complete numbers or words. To match a number (that consists of digits),
+use \f(CW\*(C`\ed+\*(C'\fR; to match a word, use \f(CW\*(C`\ew+\*(C'\fR.  But be aware of the security
+considerations in doing so, as mentioned above.
+.SS "Bracketed Character Classes"
+.IX Subsection "Bracketed Character Classes"
+The third form of character class you can use in Perl regular expressions
+is the bracketed character class.  In its simplest form, it lists the characters
+that may be matched, surrounded by square brackets, like this: \f(CW\*(C`[aeiou]\*(C'\fR.
+This matches one of \f(CW\*(C`a\*(C'\fR, \f(CW\*(C`e\*(C'\fR, \f(CW\*(C`i\*(C'\fR, \f(CW\*(C`o\*(C'\fR or \f(CW\*(C`u\*(C'\fR.  Like the other
+character classes, exactly one character is matched.* To match
+a longer string consisting of characters mentioned in the character
+class, follow the character class with a quantifier.  For
+instance, \f(CW\*(C`[aeiou]+\*(C'\fR matches one or more lowercase English vowels.
+.PP
+Repeating a character in a character class has no
+effect; it's considered to be in the set only once.
+.PP
+Examples:
+.PP
+.Vb 5
+\& "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
+\& "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
+\& "ae" =~  /^[aeiou]$/      # No match, a character class only matches
+\&                           # a single character.
+\& "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.
+\&
+\& \-\-\-\-\-\-\-
+.Ve
+.PP
+* There are two exceptions to a bracketed character class matching a
+single character only.  Each requires special handling by Perl to make
+things work:
+.IP \(bu 4
+When the class is to match caselessly under \f(CW\*(C`/i\*(C'\fR matching rules, and a
+character that is explicitly mentioned inside the class matches a
+multiple-character sequence caselessly under Unicode rules, the class
+will also match that sequence.  For example, Unicode says that the
+letter \f(CW\*(C`LATIN SMALL LETTER SHARP S\*(C'\fR should match the sequence \f(CW\*(C`ss\*(C'\fR
+under \f(CW\*(C`/i\*(C'\fR rules.  Thus,
+.Sp
+.Vb 2
+\& \*(Aqss\*(Aq =~ /\eA\eN{LATIN SMALL LETTER SHARP S}\ez/i             # Matches
+\& \*(Aqss\*(Aq =~ /\eA[aeioust\eN{LATIN SMALL LETTER SHARP S}]\ez/i    # Matches
+.Ve
+.Sp
+For this to happen, the class must not be inverted (see "Negation")
+and the character must be explicitly specified, and not be part of a
+multi-character range (not even as one of its endpoints).  ("Character
+Ranges" will be explained shortly.) Therefore,
+.Sp
+.Vb 6
+\& \*(Aqss\*(Aq =~ /\eA[\e0\-\ex{ff}]\ez/ui       # Doesn\*(Aqt match
+\& \*(Aqss\*(Aq =~ /\eA[\e0\-\eN{LATIN SMALL LETTER SHARP S}]\ez/ui   # No match
+\& \*(Aqss\*(Aq =~ /\eA[\exDF\-\exDF]\ez/ui   # Matches on ASCII platforms, since
+\&                               # \exDF is LATIN SMALL LETTER SHARP S,
+\&                               # and the range is just a single
+\&                               # element
+.Ve
+.Sp
+Note that it isn't a good idea to specify these types of ranges anyway.
+.IP \(bu 4
+Some names known to \f(CW\*(C`\eN{...}\*(C'\fR refer to a sequence of multiple characters,
+instead of the usual single character.  When one of these is included in
+the class, the entire sequence is matched.  For example,
+.Sp
+.Vb 2
+\&  "\eN{TAMIL LETTER KA}\eN{TAMIL VOWEL SIGN AU}"
+\&                              =~ / ^ [\eN{TAMIL SYLLABLE KAU}]  $ /x;
+.Ve
+.Sp
+matches, because \f(CW\*(C`\eN{TAMIL SYLLABLE KAU}\*(C'\fR is a named sequence
+consisting of the two characters matched against.  Like the other
+instance where a bracketed class can match multiple characters, and for
+similar reasons, the class must not be inverted, and the named sequence
+may not appear in a range, even one where it is both endpoints.  If
+these happen, it is a fatal error if the character class is within the
+scope of \f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR, or within an extended
+\&\f(CW\*(C`(?[...])\*(C'\fR class; otherwise
+only the first code point is used (with a \f(CW\*(C`regexp\*(C'\fR\-type warning
+raised).
+.PP
+\fISpecial Characters Inside a Bracketed Character Class\fR
+.IX Subsection "Special Characters Inside a Bracketed Character Class"
+.PP
+Most characters that are meta characters in regular expressions (that
+is, characters that carry a special meaning like \f(CW\*(C`.\*(C'\fR, \f(CW\*(C`*\*(C'\fR, or \f(CW\*(C`(\*(C'\fR) lose
+their special meaning and can be used inside a character class without
+the need to escape them. For instance, \f(CW\*(C`[()]\*(C'\fR matches either an opening
+parenthesis, or a closing parenthesis, and the parens inside the character
+class don't group or capture.  Be aware that, unless the pattern is
+evaluated in single-quotish context, variable interpolation will take
+place before the bracketed class is parsed:
+.PP
+.Vb 6
+\& $, = "\et| ";
+\& $a =~ m\*(Aq[$,]\*(Aq;        # single\-quotish: matches \*(Aq$\*(Aq or \*(Aq,\*(Aq
+\& $a =~ q{[$,]};        # same
+\& $a =~ m/[$,]/;        # double\-quotish: Because we made an
+\&                       #   assignment to $, above, this now
+\&                       #   matches "\et", "|", or " "
+.Ve
+.PP
+Characters that may carry a special meaning inside a character class are:
+\&\f(CW\*(C`\e\*(C'\fR, \f(CW\*(C`^\*(C'\fR, \f(CW\*(C`\-\*(C'\fR, \f(CW\*(C`[\*(C'\fR and \f(CW\*(C`]\*(C'\fR, and are discussed below. They can be
+escaped with a backslash, although this is sometimes not needed, in which
+case the backslash may be omitted.
+.PP
+The sequence \f(CW\*(C`\eb\*(C'\fR is special inside a bracketed character class.
+Outside a character class, \f(CW\*(C`\eb\*(C'\fR is an assertion indicating a point
+that has neither two word characters nor two non-word characters
+on either side of it.  Inside a bracketed character class, \f(CW\*(C`\eb\*(C'\fR matches a
+backspace character.
+.PP
+The sequences
+\&\f(CW\*(C`\ea\*(C'\fR,
+\&\f(CW\*(C`\ec\*(C'\fR,
+\&\f(CW\*(C`\ee\*(C'\fR,
+\&\f(CW\*(C`\ef\*(C'\fR,
+\&\f(CW\*(C`\en\*(C'\fR,
+\&\f(CW\*(C`\eN{\fR\f(CINAME\fR\f(CW}\*(C'\fR,
+\&\f(CW\*(C`\eN{U+\fR\f(CIhex char\fR\f(CW}\*(C'\fR,
+\&\f(CW\*(C`\er\*(C'\fR,
+\&\f(CW\*(C`\et\*(C'\fR,
+and
+\&\f(CW\*(C`\ex\*(C'\fR
+are also special and have the same meanings as they do outside a
+bracketed character class.
+.PP
+Also, a backslash followed by two or three octal digits is considered an octal
+number.
+.PP
+A \f(CW\*(C`[\*(C'\fR is not special inside a character class, unless it's the start of a
+POSIX character class (see "POSIX Character Classes" below). It normally does
+not need escaping.
+.PP
+A \f(CW\*(C`]\*(C'\fR is normally either the end of a POSIX character class (see
+"POSIX Character Classes" below), or it signals the end of the bracketed
+character class.  If you want to include a \f(CW\*(C`]\*(C'\fR in the set of characters, you
+must generally escape it.
+.PP
+However, if the \f(CW\*(C`]\*(C'\fR is the \fIfirst\fR (or the second if the first
+character is a caret) character of a bracketed character class, it
+does not denote the end of the class (as you cannot have an empty class)
+and is considered part of the set of characters that can be matched without
+escaping.
+.PP
+Examples:
+.PP
+.Vb 8
+\& "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
+\& "\ecH" =~ /[\eb]/      #  Match, \eb inside a character class
+\&                      #  is equivalent to a backspace.
+\& "]"   =~ /[][]/      #  Match, as the character class contains
+\&                      #  both [ and ].
+\& "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
+\&                      #  containing just [, and the character class is
+\&                      #  followed by a ].
+.Ve
+.PP
+\fIBracketed Character Classes and the \fR\f(CI\*(C`/xx\*(C'\fR\fI pattern modifier\fR
+.IX Subsection "Bracketed Character Classes and the /xx pattern modifier"
+.PP
+Normally SPACE and TAB characters have no special meaning inside a
+bracketed character class; they are just added to the list of characters
+matched by the class.  But if the \f(CW\*(C`/xx\*(C'\fR
+pattern modifier is in effect, they are generally ignored and can be
+added to improve readability.  They can't be added in the middle of a
+single construct:
+.PP
+.Vb 1
+\& / [ \ex{10 FFFF} ] /xx  # WRONG!
+.Ve
+.PP
+The SPACE in the middle of the hex constant is illegal.
+.PP
+To specify a literal SPACE character, you can escape it with a
+backslash, like:
+.PP
+.Vb 1
+\& /[ a e i o u \e  ]/xx
+.Ve
+.PP
+This matches the English vowels plus the SPACE character.
+.PP
+For clarity, you should already have been using \f(CW\*(C`\et\*(C'\fR to specify a
+literal tab, and \f(CW\*(C`\et\*(C'\fR is unaffected by \f(CW\*(C`/xx\*(C'\fR.
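+.PP
+A minimal sketch of this behavior follows.  It assumes Perl 5.26 or
+later, the release that introduced \f(CW\*(C`/xx\*(C'\fR; the variable names and
+sample strings are illustrative only.
+.PP
+.Vb 6
+\& use strict; use warnings;
+\& my $vowels  = qr/[ a e i o u ]/xx;      # blanks ignored: same as [aeiou]
+\& my $with_sp = qr/[ a e i o u \e  ]/xx;   # escaped SPACE joins the class
+\& my @plain = map { $_ =~ $vowels  ? 1 : 0 } ("e", " ", "\et");
+\& my @space = map { $_ =~ $with_sp ? 1 : 0 } ("e", " ", "\et");
+\& print "@plain / @space\en";             # prints "1 0 0 / 1 1 0"
+.Ve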
+.PP
+\fICharacter Ranges\fR
+.IX Subsection "Character Ranges"
+.PP
+It is not uncommon to want to match a range of characters. Luckily, instead
+of listing all characters in the range, one may use the hyphen (\f(CW\*(C`\-\*(C'\fR).
+If inside a bracketed character class you have two characters separated
+by a hyphen, it's treated as if all characters between the two were in
+the class. For instance, \f(CW\*(C`[0\-9]\*(C'\fR matches any ASCII digit, and \f(CW\*(C`[a\-m]\*(C'\fR
+matches any lowercase letter from the first half of the ASCII alphabet.
+.PP
+Note that the two characters on either side of the hyphen are not
+necessarily both letters or both digits. Any character is possible,
+although not advisable.  \f(CW\*(C`[\*(Aq\-?]\*(C'\fR contains a range of characters, but
+most people will not know which characters that means.  Furthermore,
+such ranges may lead to portability problems if the code has to run on
+a platform that uses a different character set, such as EBCDIC.
+.PP
+If a hyphen in a character class cannot syntactically be part of a range, for
+instance because it is the first or the last character of the character class,
+or if it immediately follows a range, the hyphen isn't special, and so is
+considered a character to be matched literally.  If you want a hyphen in
+your set of characters to be matched and its position in the class is such
+that it could be considered part of a range, you must escape that hyphen
+with a backslash.
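+.PP
+To make these rules concrete, here is a small sketch that tests four
+classes against the same inputs.  The inputs and variable names are
+hypothetical, and ASCII letter ordering is assumed.
+.PP
+.Vb 8
+\& use strict; use warnings;
+\& my %hits;
+\& for my $class (\*(Aq[a\-z]\*(Aq, \*(Aq[a\e\-z]\*(Aq, \*(Aq[\-az]\*(Aq, \*(Aq[a\-f\-m]\*(Aq) {
+\&     # Keep only the sample strings that the class matches in full.
+\&     $hits{$class} = join " ", grep { /^$class$/ } ("a", "g", "m", "\-");
+\&     print "$class matches: $hits{$class}\en";
+\& }
+\& # [a\-z]: a g m;  [a\e\-z] and [\-az]: a \-;  [a\-f\-m]: a m \-
+.Ve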
+.PP
+Examples:
+.PP
+.Vb 12
+\& [a\-z]       #  Matches a character that is a lower case ASCII letter.
+\& [a\-fz]      #  Matches any letter between \*(Aqa\*(Aq and \*(Aqf\*(Aq (inclusive) or
+\&             #  the letter \*(Aqz\*(Aq.
+\& [\-z]        #  Matches either a hyphen (\*(Aq\-\*(Aq) or the letter \*(Aqz\*(Aq.
+\& [a\-f\-m]     #  Matches any letter between \*(Aqa\*(Aq and \*(Aqf\*(Aq (inclusive), the
+\&             #  hyphen (\*(Aq\-\*(Aq), or the letter \*(Aqm\*(Aq.
+\& [\*(Aq\-?]       #  Matches any of the characters  \*(Aq()*+,\-./0123456789:;<=>?
+\&             #  (But not on an EBCDIC platform).
+\& [\eN{APOSTROPHE}\-\eN{QUESTION MARK}]
+\&             #  Matches any of the characters  \*(Aq()*+,\-./0123456789:;<=>?
+\&             #  even on an EBCDIC platform.
+\& [\eN{U+27}\-\eN{U+3F}] # Same. (U+27 is "\*(Aq", and U+3F is "?")
+.Ve
+.PP
+As the final two examples above show, you can achieve portability to
+non-ASCII platforms by using the \f(CW\*(C`\eN{...}\*(C'\fR form for the range
+endpoints.  These indicate that the specified range is to be interpreted
+using Unicode values, so \f(CW\*(C`[\eN{U+27}\-\eN{U+3F}]\*(C'\fR means to match
+\&\f(CW\*(C`\eN{U+27}\*(C'\fR, \f(CW\*(C`\eN{U+28}\*(C'\fR, \f(CW\*(C`\eN{U+29}\*(C'\fR, ..., \f(CW\*(C`\eN{U+3D}\*(C'\fR, \f(CW\*(C`\eN{U+3E}\*(C'\fR,
+and \f(CW\*(C`\eN{U+3F}\*(C'\fR, whatever the native code point versions for those are.
+These are called "Unicode" ranges.  If either end is of the \f(CW\*(C`\eN{...}\*(C'\fR
+form, the range is considered Unicode.  A \f(CW\*(C`regexp\*(C'\fR warning is raised
+under \f(CW"use\ re\ \*(Aqstrict\*(Aq"\fR if the other endpoint is specified
+non-portably:
+.PP
+.Vb 2
+\& [\eN{U+00}\-\ex09]    # Warning under re \*(Aqstrict\*(Aq; \ex09 is non\-portable
+\& [\eN{U+00}\-\et]      # No warning;
+.Ve
+.PP
+Both of the above match the characters \f(CW\*(C`\eN{U+00}\*(C'\fR, \f(CW\*(C`\eN{U+01}\*(C'\fR, ...
+\&\f(CW\*(C`\eN{U+08}\*(C'\fR, \f(CW\*(C`\eN{U+09}\*(C'\fR, but the \f(CW\*(C`\ex09\*(C'\fR looks like it could be a
+mistake so the warning is raised (under \f(CW\*(C`re \*(Aqstrict\*(Aq\*(C'\fR) for it.
+.PP
+Perl also guarantees that the ranges \f(CW\*(C`A\-Z\*(C'\fR, \f(CW\*(C`a\-z\*(C'\fR, \f(CW\*(C`0\-9\*(C'\fR, and any
+subranges of these match what an English-only speaker would expect them
+to match on any platform.  That is, \f(CW\*(C`[A\-Z]\*(C'\fR matches the 26 ASCII
+uppercase letters;
+\&\f(CW\*(C`[a\-z]\*(C'\fR matches the 26 lowercase letters; and \f(CW\*(C`[0\-9]\*(C'\fR matches the 10
+digits.  Subranges, like \f(CW\*(C`[h\-k]\*(C'\fR, match correspondingly, in this case
+just the four letters \f(CW"h"\fR, \f(CW"i"\fR, \f(CW"j"\fR, and \f(CW"k"\fR.  This is the
+natural behavior on ASCII platforms where the code points (ordinal
+values) for \f(CW"h"\fR through \f(CW"k"\fR are consecutive integers (0x68 through
+0x6B).  But special handling to achieve this may be needed on platforms
+with a non-ASCII native character set.  For example, on EBCDIC
+platforms, the code point for \f(CW"h"\fR is 0x88, \f(CW"i"\fR is 0x89, \f(CW"j"\fR is
+0x91, and \f(CW"k"\fR is 0x92.   Perl specially treats \f(CW\*(C`[h\-k]\*(C'\fR to exclude the
+seven code points in the gap: 0x8A through 0x90.  This special handling is
+only invoked when the range is a subrange of one of the ASCII uppercase,
+lowercase, and digit ranges, AND each end of the range is expressed
+either as a literal, like \f(CW"A"\fR, or as a named character (\f(CW\*(C`\eN{...}\*(C'\fR,
+including the \f(CW\*(C`\eN{U+...}\*(C'\fR form).
+.PP
+EBCDIC Examples:
+.PP
+.Vb 10
+\& [i\-j]               #  Matches either "i" or "j"
+\& [i\-\eN{LATIN SMALL LETTER J}]  # Same
+\& [i\-\eN{U+6A}]        #  Same
+\& [\eN{U+69}\-\eN{U+6A}] #  Same
+\& [\ex{89}\-\ex{91}]     #  Matches 0x89 ("i"), 0x8A .. 0x90, 0x91 ("j")
+\& [i\-\ex{91}]          #  Same
+\& [\ex{89}\-j]          #  Same
+\& [i\-J]               #  Matches 0x89 ("i") .. 0xD1 ("J"); special
+\&                     #  handling doesn\*(Aqt apply because range is mixed
+\&                     #  case
+.Ve
+.PP
+\fINegation\fR
+.IX Subsection "Negation"
+.PP
+It is also possible to instead list the characters you do not want to
+match. You can do so by using a caret (\f(CW\*(C`^\*(C'\fR) as the first character in the
+character class. For instance, \f(CW\*(C`[^a\-z]\*(C'\fR matches any character that is not a
+lowercase ASCII letter, which therefore includes more than a million
+Unicode code points.  The class is said to be "negated" or "inverted".
+.PP
+This syntax makes the caret a special character inside a bracketed character
+class, but only if it is the first character of the class. So if you want
+the caret as one of the characters to match, either escape the caret or
+else don't list it first.
+.PP
+In inverted bracketed character classes, Perl ignores the Unicode rules
+that normally say that named sequences and certain characters should
+match a sequence of multiple characters under caseless \f(CW\*(C`/i\*(C'\fR
+matching.  Following those rules could lead to highly confusing
+situations:
+.PP
+.Vb 1
+\& "ss" =~ /^[^\exDF]+$/ui;   # Matches!
+.Ve
+.PP
+This should match any sequence of characters that are neither \f(CW\*(C`\exDF\*(C'\fR nor
+what \f(CW\*(C`\exDF\*(C'\fR matches under \f(CW\*(C`/i\*(C'\fR.  \f(CW"s"\fR isn't \f(CW\*(C`\exDF\*(C'\fR, but Unicode
+says that \f(CW"ss"\fR is what \f(CW\*(C`\exDF\*(C'\fR matches under \f(CW\*(C`/i\*(C'\fR.  So which one
+"wins"? Do you fail the match because the string has \f(CW\*(C`ss\*(C'\fR or accept it
+because it has an \f(CW\*(C`s\*(C'\fR followed by another \f(CW\*(C`s\*(C'\fR?  Perl has chosen the
+latter.  (See note in "Bracketed Character Classes" above.)
+.PP
+Examples:
+.PP
+.Vb 4
+\& "e"  =~  /[^aeiou]/   #  No match, the \*(Aqe\*(Aq is listed.
+\& "x"  =~  /[^aeiou]/   #  Match, as \*(Aqx\*(Aq isn\*(Aqt a lowercase vowel.
+\& "^"  =~  /[^^]/       #  No match, matches anything that isn\*(Aqt a caret.
+\& "^"  =~  /[x^]/       #  Match, caret is not special here.
+.Ve
+.PP
+\fIBackslash Sequences\fR
+.IX Subsection "Backslash Sequences"
+.PP
+You can put any backslash sequence character class (with the exception of
+\&\f(CW\*(C`\eN\*(C'\fR and \f(CW\*(C`\eR\*(C'\fR) inside a bracketed character class, and it will act just
+as if you had put all characters matched by the backslash sequence inside the
+character class. For instance, \f(CW\*(C`[a\-f\ed]\*(C'\fR matches any decimal digit, or any
+of the lowercase letters between 'a' and 'f' inclusive.
+.PP
+\&\f(CW\*(C`\eN\*(C'\fR within a bracketed character class must be of the forms \f(CW\*(C`\eN{\fR\f(CIname\fR\f(CW}\*(C'\fR
+or \f(CW\*(C`\eN{U+\fR\f(CIhex char\fR\f(CW}\*(C'\fR, and NOT be the form that matches non-newlines,
+for the same reason that a dot \f(CW\*(C`.\*(C'\fR inside a bracketed character class loses
+its special meaning: it matches nearly anything, which generally isn't what you
+want to happen.
+.PP
+Examples:
+.PP
+.Vb 4
+\& /[\ep{Thai}\ed]/     # Matches a character that is either a Thai
+\&                    # character, or a digit.
+\& /[^\ep{Arabic}()]/  # Matches a character that is neither an Arabic
+\&                    # character, nor a parenthesis.
+.Ve
+.PP
+Backslash sequence character classes cannot form one of the endpoints
+of a range.  Thus, you can't say:
+.PP
+.Vb 1
+\& /[\ep{Thai}\-\ed]/     # Wrong!
+.Ve
+.PP
+\fIPOSIX Character Classes\fR
+.IX Xref "character class \\p \\p{} alpha alnum ascii blank cntrl digit graph lower print punct space upper word xdigit"
+.IX Subsection "POSIX Character Classes"
+.PP
+POSIX character classes have the form \f(CW\*(C`[:class:]\*(C'\fR, where \fIclass\fR is the
+name, and \f(CW\*(C`[:\*(C'\fR and \f(CW\*(C`:]\*(C'\fR are the delimiters. POSIX character classes only appear
+\&\fIinside\fR bracketed character classes, and are a convenient and descriptive
+way of listing a group of characters.
+.PP
+Be careful about the syntax:
+.PP
+.Vb 2
+\& # Correct:
+\& $string =~ /[[:alpha:]]/
+\&
+\& # Incorrect (will warn):
+\& $string =~ /[:alpha:]/
+.Ve
+.PP
+The latter pattern would be a character class consisting of a colon,
+and the letters \f(CW\*(C`a\*(C'\fR, \f(CW\*(C`l\*(C'\fR, \f(CW\*(C`p\*(C'\fR and \f(CW\*(C`h\*(C'\fR.
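+.PP
+The difference can be checked with a short sketch such as the following,
+in which the test string is an arbitrary choice and the \f(CW\*(C`regexp\*(C'\fR
+warning category is disabled only to keep the output quiet:
+.PP
+.Vb 5
+\& use strict; use warnings;
+\& no warnings \*(Aqregexp\*(Aq;   # silence the warning about the second pattern
+\& my $right = ":" =~ /[[:alpha:]]/ ? 1 : 0;   # a colon is not alphabetic
+\& my $wrong = ":" =~ /[:alpha:]/   ? 1 : 0;   # plain class of : a l p h
+\& print "$right $wrong\en";                    # prints "0 1"
+.Ve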
+.PP
+POSIX character classes can be part of a larger bracketed character class.
+For example,
+.PP
+.Vb 1
+\& [01[:alpha:]%]
+.Ve
+.PP
+is valid and matches '0', '1', any alphabetic character, and the percent sign.
+.PP
+Perl recognizes the following POSIX character classes:
+.PP
+.Vb 10
+\& alpha  Any alphabetical character (e.g., [A\-Za\-z]).
+\& alnum  Any alphanumeric character (e.g., [A\-Za\-z0\-9]).
+\& ascii  Any character in the ASCII character set.
+\& blank  A GNU extension, equal to a space or a horizontal tab ("\et").
+\& cntrl  Any control character.  See Note [2] below.
+\& digit  Any decimal digit (e.g., [0\-9]), equivalent to "\ed".
+\& graph  Any printable character, excluding a space.  See Note [3] below.
+\& lower  Any lowercase character (e.g., [a\-z]).
+\& print  Any printable character, including a space.  See Note [4] below.
+\& punct  Any graphical character excluding "word" characters.  Note [5].
+\& space  Any whitespace character. "\es" including the vertical tab
+\&        ("\ecK").
+\& upper  Any uppercase character (e.g., [A\-Z]).
+\& word   A Perl extension (e.g., [A\-Za\-z0\-9_]), equivalent to "\ew".
+\& xdigit Any hexadecimal digit (e.g., [0\-9a\-fA\-F]).  Note [7].
+.Ve
+.PP
+Like the Unicode properties, most of the POSIX
+properties match the same regardless of whether case-insensitive (\f(CW\*(C`/i\*(C'\fR)
+matching is in effect or not.  The two exceptions are \f(CW\*(C`[:upper:]\*(C'\fR and
+\&\f(CW\*(C`[:lower:]\*(C'\fR.  Under \f(CW\*(C`/i\*(C'\fR, they each match the union of \f(CW\*(C`[:upper:]\*(C'\fR and
+\&\f(CW\*(C`[:lower:]\*(C'\fR.
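+.PP
+As a quick sketch of this exception (the test character is an arbitrary
+choice), a lowercase letter matches \f(CW\*(C`[:upper:]\*(C'\fR only under \f(CW\*(C`/i\*(C'\fR, while
+\&\f(CW\*(C`[:digit:]\*(C'\fR is unaffected:
+.PP
+.Vb 5
+\& use strict; use warnings;
+\& my $cs  = "a" =~ /[[:upper:]]/  ? 1 : 0;   # case\-sensitive: no match
+\& my $ci  = "a" =~ /[[:upper:]]/i ? 1 : 0;   # /i: union of upper and lower
+\& my $dig = "a" =~ /[[:digit:]]/i ? 1 : 0;   # other classes do not change
+\& print "$cs $ci $dig\en";                    # prints "0 1 0"
+.Ve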
+.PP
+Most POSIX character classes have two Unicode-style \f(CW\*(C`\ep\*(C'\fR property
+counterparts.  (They are not official Unicode properties, but Perl extensions
+derived from official Unicode properties.)  The table below shows the relation
+between POSIX character classes and these counterparts.
+.PP
+One counterpart, in the column labelled "ASCII-range Unicode" in
+the table, matches only characters in the ASCII character set.
+.PP
+The other counterpart, in the column labelled "Full-range Unicode", matches any
+appropriate characters in the full Unicode character set.  For example,
+\&\f(CW\*(C`\ep{Alpha}\*(C'\fR matches not just the ASCII alphabetic characters, but any
+character in the entire Unicode character set considered alphabetic.
+An entry in the column labelled "backslash sequence" is a (short)
+equivalent.
+.PP
+.Vb 10
+\& [[:...:]]      ASCII\-range          Full\-range  backslash  Note
+\&                 Unicode              Unicode     sequence
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\&   alpha      \ep{PosixAlpha}       \ep{XPosixAlpha}
+\&   alnum      \ep{PosixAlnum}       \ep{XPosixAlnum}
+\&   ascii      \ep{ASCII}
+\&   blank      \ep{PosixBlank}       \ep{XPosixBlank}  \eh      [1]
+\&                                   or \ep{HorizSpace}        [1]
+\&   cntrl      \ep{PosixCntrl}       \ep{XPosixCntrl}          [2]
+\&   digit      \ep{PosixDigit}       \ep{XPosixDigit}  \ed
+\&   graph      \ep{PosixGraph}       \ep{XPosixGraph}          [3]
+\&   lower      \ep{PosixLower}       \ep{XPosixLower}
+\&   print      \ep{PosixPrint}       \ep{XPosixPrint}          [4]
+\&   punct      \ep{PosixPunct}       \ep{XPosixPunct}          [5]
+\&              \ep{PerlSpace}        \ep{XPerlSpace}   \es      [6]
+\&   space      \ep{PosixSpace}       \ep{XPosixSpace}          [6]
+\&   upper      \ep{PosixUpper}       \ep{XPosixUpper}
+\&   word       \ep{PosixWord}        \ep{XPosixWord}   \ew
+\&   xdigit     \ep{PosixXDigit}      \ep{XPosixXDigit}         [7]
+.Ve
+.IP [1] 4
+.IX Item "[1]"
+\&\f(CW\*(C`\ep{Blank}\*(C'\fR and \f(CW\*(C`\ep{HorizSpace}\*(C'\fR are synonyms.
+.IP [2] 4
+.IX Item "[2]"
+Control characters don't produce output as such, but instead usually control
+the terminal somehow: for example, newline and backspace are control characters.
+On ASCII platforms, in the ASCII range, characters whose code points are
+between 0 and 31 inclusive, plus 127 (\f(CW\*(C`DEL\*(C'\fR) are control characters; on
+EBCDIC platforms, their counterparts are control characters.
+.IP [3] 4
+.IX Item "[3]"
+Any character that is \fIgraphical\fR, that is, visible. This class consists
+of all alphanumeric characters and all punctuation characters.
+.IP [4] 4
+.IX Item "[4]"
+All printable characters, which is the set of all graphical characters
+plus those whitespace characters which are not also controls.
+.IP [5] 4
+.IX Item "[5]"
+\&\f(CW\*(C`\ep{PosixPunct}\*(C'\fR and \f(CW\*(C`[[:punct:]]\*(C'\fR in the ASCII range match all
+non-controls, non-alphanumeric, non-space characters:
+\&\f(CW\*(C`[\-!"#$%&\*(Aq()*+,./:;<=>?@[\e\e\e]^_\`{|}~]\*(C'\fR (although if a locale is in effect,
+it could alter the behavior of \f(CW\*(C`[[:punct:]]\*(C'\fR).
+.Sp
+The similarly named property, \f(CW\*(C`\ep{Punct}\*(C'\fR, matches a somewhat different
+set in the ASCII range, namely
+\&\f(CW\*(C`[\-!"#%&\*(Aq()*,./:;?@[\e\e\e]_{}]\*(C'\fR.  That is, it is missing the nine
+characters \f(CW\*(C`[$+<=>^\`|~]\*(C'\fR.
+This is because Unicode splits what POSIX considers to be punctuation into two
+categories, Punctuation and Symbols.
+.Sp
+\&\f(CW\*(C`\ep{XPosixPunct}\*(C'\fR and (under Unicode rules) \f(CW\*(C`[[:punct:]]\*(C'\fR, match what
+\&\f(CW\*(C`\ep{PosixPunct}\*(C'\fR matches in the ASCII range, plus what \f(CW\*(C`\ep{Punct}\*(C'\fR
+matches.  This is different from strictly matching according to
+\&\f(CW\*(C`\ep{Punct}\*(C'\fR.  Another way to say it is that
+if Unicode rules are in effect, \f(CW\*(C`[[:punct:]]\*(C'\fR matches all characters
+that Unicode considers punctuation, plus all ASCII-range characters that
+Unicode considers symbols.
+.IP [6] 4
+.IX Item "[6]"
+\&\f(CW\*(C`\ep{XPerlSpace}\*(C'\fR and \f(CW\*(C`\ep{Space}\*(C'\fR match identically starting with Perl
+v5.18.  In earlier versions, these differ only in that in non-locale
+matching, \f(CW\*(C`\ep{XPerlSpace}\*(C'\fR did not match the vertical tab, \f(CW\*(C`\ecK\*(C'\fR.
+Same for the two ASCII-only range forms.
+.IP [7] 4
+.IX Item "[7]"
+Unlike \f(CW\*(C`[[:digit:]]\*(C'\fR which matches digits in many writing systems, such
+as Thai and Devanagari, there are currently only two sets of hexadecimal
+digits, and it is unlikely that more will be added.  This is because you
+not only need the ten digits, but also the six \f(CW\*(C`[A\-F]\*(C'\fR (and \f(CW\*(C`[a\-f]\*(C'\fR)
+to correspond.  That means only the Latin script is suitable for these,
+and Unicode has only two sets of these, the familiar ASCII set, and the
+fullwidth forms starting at U+FF10 (FULLWIDTH DIGIT ZERO).
+.PP
+There are various other synonyms that can be used besides the names
+listed in the table.  For example, \f(CW\*(C`\ep{XPosixAlpha}\*(C'\fR can be written as
+\&\f(CW\*(C`\ep{Alpha}\*(C'\fR.  All are listed in
+"Properties accessible through \ep{} and \eP{}" in perluniprops.
+.PP
+Both the \f(CW\*(C`\ep\*(C'\fR counterparts always assume Unicode rules are in effect.
+On ASCII platforms, this means they assume that the code points from 128
+to 255 are Latin\-1, and that means that using them under locale rules is
+unwise unless the locale is guaranteed to be Latin\-1 or UTF\-8.  In contrast, the
+POSIX character classes are useful under locale rules.  They are
+affected by the actual rules in effect, as follows:
+.ie n .IP "If the ""/a"" modifier is in effect ..." 4
+.el .IP "If the \f(CW/a\fR modifier is in effect ..." 4
+.IX Item "If the /a modifier is in effect ..."
+Each of the POSIX classes matches exactly the same as its ASCII-range
+counterpart.
+.IP "otherwise ..." 4
+.IX Item "otherwise ..."
+.RS 4
+.PD 0
+.IP "For code points above 255 ..." 4
+.IX Item "For code points above 255 ..."
+.PD
+The POSIX class matches the same as its Full-range counterpart.
+.IP "For code points below 256 ..." 4
+.IX Item "For code points below 256 ..."
+.RS 4
+.PD 0
+.IP "if locale rules are in effect ..." 4
+.IX Item "if locale rules are in effect ..."
+.PD
+The POSIX class matches according to the locale, except:
+.RS 4
+.ie n .IP """word""" 4
+.el .IP \f(CWword\fR 4
+.IX Item "word"
+also includes the platform's native underscore character, no matter what
+the locale is.
+.ie n .IP """ascii""" 4
+.el .IP \f(CWascii\fR 4
+.IX Item "ascii"
+on platforms that don't have the POSIX \f(CW\*(C`ascii\*(C'\fR extension, this matches
+just the platform's native ASCII-range characters.
+.ie n .IP """blank""" 4
+.el .IP \f(CWblank\fR 4
+.IX Item "blank"
+on platforms that don't have the POSIX \f(CW\*(C`blank\*(C'\fR extension, this matches
+just the platform's native tab and space characters.
+.RE
+.RS 4
+.RE
+.IP "if, instead, Unicode rules are in effect ..." 4
+.IX Item "if, instead, Unicode rules are in effect ..."
+The POSIX class matches the same as the Full-range counterpart.
+.IP "otherwise ..." 4
+.IX Item "otherwise ..."
+The POSIX class matches the same as the ASCII range counterpart.
+.RE
+.RS 4
+.RE
+.RE
+.RS 4
+.RE
+.PP
+Which rules apply are determined as described in
+"Which character set modifier is in effect?" in perlre.
+.PP
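+The rules above can be sketched as follows.  The example assumes an ASCII
+platform, uses explicit \f(CW\*(C`/u\*(C'\fR and \f(CW\*(C`/a\*(C'\fR modifiers rather than relying on
+defaults for code points below 256, and the two test characters are
+arbitrary choices.
+.PP
+.Vb 10
+\& use strict; use warnings;
+\& my $thai_one = "\ex{0E51}";   # THAI DIGIT ONE, a code point above 255
+\& my $e_acute  = "\exE9";       # LATIN SMALL LETTER E WITH ACUTE
+\& my @r = (
+\&     $thai_one =~ /[[:digit:]]/  ? 1 : 0,   # above 255: Full\-range
+\&     $thai_one =~ /[[:digit:]]/a ? 1 : 0,   # /a: ASCII\-range only
+\&     $e_acute  =~ /[[:alpha:]]/u ? 1 : 0,   # Unicode rules: Full\-range
+\&     $e_acute  =~ /[[:alpha:]]/a ? 1 : 0,   # /a: ASCII\-range only
+\& );
+\& print "@r\en";                              # prints "1 0 1 0"
+.Ve
+.PP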
+Negation of POSIX character classes
+.IX Xref "character class, negation"
+.IX Subsection "Negation of POSIX character classes"
+.PP
+A Perl extension to the POSIX character class is the ability to
+negate it. This is done by prefixing the class name with a caret (\f(CW\*(C`^\*(C'\fR).
+Some examples:
+.PP
+.Vb 7
+\&     POSIX         ASCII\-range     Full\-range  backslash
+\&                    Unicode         Unicode    sequence
+\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
+\& [[:^digit:]]   \eP{PosixDigit}  \eP{XPosixDigit}   \eD
+\& [[:^space:]]   \eP{PosixSpace}  \eP{XPosixSpace}
+\&                \eP{PerlSpace}   \eP{XPerlSpace}    \eS
+\& [[:^word:]]    \eP{PosixWord}   \eP{XPosixWord}    \eW
+.Ve
+.PP
+The backslash sequence can mean either ASCII\- or Full-range Unicode,
+depending on various factors as described in "Which character set modifier is in effect?" in perlre.
+.PP
+[= =] and [. .]
+.IX Subsection "[= =] and [. .]"
+.PP
+Perl recognizes the POSIX character classes \f(CW\*(C`[=class=]\*(C'\fR and
+\&\f(CW\*(C`[.class.]\*(C'\fR, but does not (yet?) support them.  Any attempt to use
+either construct raises an exception.
+.PP
+Examples
+.IX Subsection "Examples"
+.PP
+.Vb 12
+\& /[[:digit:]]/            # Matches a character that is a digit.
+\& /[01[:lower:]]/          # Matches a character that is either a
+\&                          # lowercase letter, or \*(Aq0\*(Aq or \*(Aq1\*(Aq.
+\& /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything
+\&                          # except the letters \*(Aqa\*(Aq to \*(Aqf\*(Aq and \*(AqA\*(Aq to
+\&                          # \*(AqF\*(Aq.  This is because the main character
+\&                          # class is composed of two POSIX character
+\&                          # classes that are ORed together, one that
+\&                          # matches any digit, and the other that
+\&                          # matches anything that isn\*(Aqt a hex digit.
+\&                          # The OR adds the digits, leaving only the
+\&                          # letters \*(Aqa\*(Aq to \*(Aqf\*(Aq and \*(AqA\*(Aq to \*(AqF\*(Aq excluded.
+.Ve
+.PP
+\fIExtended Bracketed Character Classes\fR
+.IX Xref "character class set operations"
+.IX Subsection "Extended Bracketed Character Classes"
+.PP
+This is a fancy bracketed character class that can be used for more
+readable and less error-prone classes, and to perform set operations,
+such as intersection. An example is
+.PP
+.Vb 1
+\& /(?[ \ep{Thai} & \ep{Digit} ])/
+.Ve
+.PP
+This will match all the digit characters that are in the Thai script.
+.PP
+This feature became available in Perl 5.18 as an experimental feature; it
+has no longer been considered experimental since 5.36.
+.PP
+The rules used by \f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR apply to this
+construct.
+.PP
+We can extend the example above:
+.PP
+.Vb 1
+\& /(?[ ( \ep{Thai} + \ep{Lao} ) & \ep{Digit} ])/
+.Ve
+.PP
+This matches digits that are in either the Thai or Laotian scripts.
+.PP
+Notice the white space in these examples.  This construct always has
+the \f(CW\*(C`/xx\*(C'\fR modifier turned on within it.
+.PP
+The available binary operators are:
+.PP
+.Vb 10
+\& &    intersection
+\& +    union
+\& |    another name for \*(Aq+\*(Aq, hence means union
+\& \-    subtraction (the result matches the set consisting of those
+\&      code points matched by the first operand, excluding any that
+\&      are also matched by the second operand)
+\& ^    symmetric difference (the union minus the intersection).  This
+\&      is like an exclusive or, in that the result is the set of code
+\&      points that are matched by either, but not both, of the
+\&      operands.
+.Ve
+.PP
+There is one unary operator:
+.PP
+.Vb 1
+\& !    complement
+.Ve
+.PP
+All the binary operators left associate; \f(CW"&"\fR is higher precedence
+than the others, which all have equal precedence.  The unary operator
+right associates, and has highest precedence.  Thus this follows the
+normal Perl precedence rules for logical operators.  Use parentheses to
+override the default precedence and associativity.
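+.PP
+The effect of these precedence rules can be seen in a small sketch.  It
+assumes Perl 5.18 or later, the single-letter sets are hypothetical, and
+the \f(CW\*(C`no warnings\*(C'\fR line is only needed on versions before 5.36, where
+this construct was still experimental.
+.PP
+.Vb 9
+\& use strict; use warnings;
+\& no warnings \*(Aqexperimental::regex_sets\*(Aq;   # only needed before 5.36
+\& my $default = qr/^(?[ [a] + [ab] & [b] ])$/;      # [a] + ([ab] & [b])
+\& my $grouped = qr/^(?[ ( [a] + [ab] ) & [b] ])$/;  # union first, then &
+\& my @r;
+\& for my $re ($default, $grouped) {
+\&     push @r, map { $_ =~ $re ? 1 : 0 } ("a", "b");
+\& }
+\& print "@r\en";   # prints "1 1 0 1"
+.Ve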
+.PP
+The main restriction is that everything is a metacharacter.  Thus,
+you cannot refer to single characters by doing something like this:
+.PP
+.Vb 1
+\& /(?[ a + b ])/ # Syntax error!
+.Ve
+.PP
+The easiest way to specify an individual typable character is to enclose
+it in brackets:
+.PP
+.Vb 1
+\& /(?[ [a] + [b] ])/
+.Ve
+.PP
+(This is the same thing as \f(CW\*(C`[ab]\*(C'\fR.)  You could also have said the
+equivalent:
+.PP
+.Vb 1
+\& /(?[[ a b ]])/
+.Ve
+.PP
+(You can, of course, specify single characters by using \f(CW\*(C`\ex{...}\*(C'\fR,
+\&\f(CW\*(C`\eN{...}\*(C'\fR, etc.)
+.PP
+This last example shows the use of this construct to specify an ordinary
+bracketed character class without additional set operations.  Note the
+white space within it.  This is allowed because \f(CW\*(C`/xx\*(C'\fR is
+automatically turned on within this construct.
+.PP
+All the other escapes accepted by normal bracketed character classes are
+accepted here as well.
+.PP
+Because this construct compiles under
+\&\f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR, unrecognized escapes that
+generate warnings in normal classes are fatal errors here, as are
+all other warnings from these class elements and some
+practices that don't currently warn outside \f(CW\*(C`re \*(Aqstrict\*(Aq\*(C'\fR.  For example
+you cannot say
+.PP
+.Vb 1
+\& /(?[ [ \exF ] ])/     # Syntax error!
+.Ve
+.PP
+You have to have two hex digits after a braceless \f(CW\*(C`\ex\*(C'\fR (use a leading
+zero to make two).  These restrictions are to lower the incidence of
+typos causing the class to not match what you thought it would.
+.PP
+If a regular bracketed character class contains a \f(CW\*(C`\ep{}\*(C'\fR or \f(CW\*(C`\eP{}\*(C'\fR and
+is matched against a non-Unicode code point, a warning may be
+raised, as the result is not Unicode-defined.  No such warning will come
+when using this extended form.
+.PP
+The final difference between regular bracketed character classes and
+these is that it is not possible to get these to match a
+multi-character fold.  Thus,
+.PP
+.Vb 1
+\& /(?[ [\exDF] ])/iu
+.Ve
+.PP
+does not match the string \f(CW\*(C`ss\*(C'\fR.
+.PP
+You don't have to enclose POSIX class names inside double brackets,
+hence both of the following work:
+.PP
+.Vb 2
+\& /(?[ [:word:] \- [:lower:] ])/
+\& /(?[ [[:word:]] \- [[:lower:]] ])/
+.Ve
+.PP
+Any contained POSIX character classes, including things like \f(CW\*(C`\ew\*(C'\fR and \f(CW\*(C`\eD\*(C'\fR,
+respect the \f(CW\*(C`/a\*(C'\fR (and \f(CW\*(C`/aa\*(C'\fR) modifiers.
+.PP
+Note that \f(CW\*(C`(?[ ])\*(C'\fR is a regex-compile-time construct.  Any attempt
+to use something which isn't knowable at the time the containing regular
+expression is compiled is a fatal error.  In practice, this means
+just three limitations:
+.IP 1. 4
+When compiled within the scope of \f(CW\*(C`use locale\*(C'\fR (or the \f(CW\*(C`/l\*(C'\fR regex
+modifier), this construct assumes that the execution-time locale will be
+a UTF\-8 one, and the generated pattern always uses Unicode rules.  What
+gets matched or not thus isn't dependent on the actual runtime locale, so
+tainting is not enabled.  But a \f(CW\*(C`locale\*(C'\fR category warning is raised
+if the runtime locale turns out to not be UTF\-8.
+.IP 2. 4
+Any
+user-defined property
+used must be already defined by the time the regular expression is
+compiled (but note that this construct can be used instead of such
+properties).
+.IP 3. 4
+A regular expression that otherwise would compile
+using \f(CW\*(C`/d\*(C'\fR rules, and which uses this construct will instead
+use \f(CW\*(C`/u\*(C'\fR.  Thus this construct tells Perl that you don't want
+\&\f(CW\*(C`/d\*(C'\fR rules for the entire regular expression containing it.
+.PP
+Note that skipping white space applies only to the interior of this
+construct.  There must not be any space between any of the characters
+that form the initial \f(CW\*(C`(?[\*(C'\fR.  Nor may there be space between the
+closing \f(CW\*(C`])\*(C'\fR characters.
+.PP
+Just as in all regular expressions, the pattern can be built up by
+including variables that are interpolated at regex compilation time.
+But currently each such sub-component should be an already-compiled
+extended bracketed character class.
+.PP
+.Vb 3
+\& my $thai_or_lao = qr/(?[ \ep{Thai} + \ep{Lao} ])/;
+\& ...
+\& qr/(?[ \ep{Digit} & $thai_or_lao ])/;
+.Ve
+.PP
+If you interpolate something else, the pattern may still compile (or it
+may die), but if it compiles, it very well may not behave as you would
+expect:
+.PP
+.Vb 2
+\& my $thai_or_lao = \*(Aq\ep{Thai} + \ep{Lao}\*(Aq;
+\& qr/(?[ \ep{Digit} & $thai_or_lao ])/;
+.Ve
+.PP
+compiles to
+.PP
+.Vb 1
+\& qr/(?[ \ep{Digit} & \ep{Thai} + \ep{Lao} ])/;
+.Ve
+.PP
+This does not have the effect that someone reading the source code
+would likely expect, as the intersection applies just to \f(CW\*(C`\ep{Thai}\*(C'\fR
+and not to \f(CW\*(C`\ep{Lao}\*(C'\fR, so every Laotian character matches, digit or not.
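+.PP
+The difference between the two forms of interpolation can be checked
+with a sketch like the following.  It assumes Perl 5.18 or later; the
+variable names and the test character are illustrative assumptions, and
+the \f(CW\*(C`no warnings\*(C'\fR line is only needed on versions before 5.36.
+.PP
+.Vb 9
+\& use strict; use warnings;
+\& no warnings \*(Aqexperimental::regex_sets\*(Aq;   # only needed before 5.36
+\& my $lao_ko   = "\ex{0E81}";                 # LAO LETTER KO, not a digit
+\& my $compiled = qr/(?[ \ep{Thai} + \ep{Lao} ])/;
+\& my $string   = \*(Aq\ep{Thai} + \ep{Lao}\*(Aq;
+\& my $safe     = qr/(?[ \ep{Digit} & $compiled ])/;  # Digit & (Thai + Lao)
+\& my $unsafe   = qr/(?[ \ep{Digit} & $string ])/;    # (Digit & Thai) + Lao
+\& my @r = map { $lao_ko =~ $_ ? 1 : 0 } ($safe, $unsafe);
+\& print "@r\en";                              # prints "0 1"
+.Ve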
+.PP
+Due to the way that Perl parses things, your parentheses and brackets
+may need to be balanced, even including comments.  If you run into any
+examples, please submit them to <https://github.com/Perl/perl5/issues>,
+so that we can have a concrete example for this man page.