Diffstat (limited to 'upstream/mageia-cauldron/man1/perlrecharclass.1')
-rw-r--r-- | upstream/mageia-cauldron/man1/perlrecharclass.1 | 1301 |
1 files changed, 1301 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man1/perlrecharclass.1 b/upstream/mageia-cauldron/man1/perlrecharclass.1 new file mode 100644 index 00000000..83c9dbea --- /dev/null +++ b/upstream/mageia-cauldron/man1/perlrecharclass.1 @@ -0,0 +1,1301 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLRECHARCLASS 1" +.TH PERLRECHARCLASS 1 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. 
+.if n .ad l +.nh +.SH NAME +perlrecharclass \- Perl Regular Expression Character Classes +.IX Xref "character class" +.SH DESCRIPTION +.IX Header "DESCRIPTION" +The top level documentation about Perl regular expressions +is found in perlre. +.PP +This manual page discusses the syntax and use of character +classes in Perl regular expressions. +.PP +A character class is a way of denoting a set of characters +in such a way that one character of the set is matched. +It's important to remember that: matching a character class +consumes exactly one character in the source string. (The source +string is the string the regular expression is matched against.) +.PP +There are three types of character classes in Perl regular +expressions: the dot, backslash sequences, and the form enclosed in square +brackets. Keep in mind, though, that often the term "character class" is used +to mean just the bracketed form. Certainly, most Perl documentation does that. +.SS "The dot" +.IX Subsection "The dot" +The dot (or period), \f(CW\*(C`.\*(C'\fR is probably the most used, and certainly +the most well-known character class. By default, a dot matches any +character, except for the newline. That default can be changed to +add matching the newline by using the \fIsingle line\fR modifier: +for the entire regular expression with the \f(CW\*(C`/s\*(C'\fR modifier, or +locally with \f(CW\*(C`(?s)\*(C'\fR (and even globally within the scope of +\&\f(CW\*(C`use re \*(Aq/s\*(Aq\*(C'\fR). (The \f(CW"\eN"\fR backslash +sequence, described +below, matches any character except newline without regard to the +\&\fIsingle line\fR modifier.) +.PP +Here are some examples: +.PP +.Vb 7 +\& "a" =~ /./ # Match +\& "." 
=~ /./ # Match +\& "" =~ /./ # No match (dot has to match a character) +\& "\en" =~ /./ # No match (dot does not match a newline) +\& "\en" =~ /./s # Match (global \*(Aqsingle line\*(Aq modifier) +\& "\en" =~ /(?s:.)/ # Match (local \*(Aqsingle line\*(Aq modifier) +\& "ab" =~ /^.$/ # No match (dot matches one character) +.Ve +.SS "Backslash sequences" +.IX Xref "\\w \\W \\s \\S \\d \\D \\p \\P \\N \\v \\V \\h \\H word whitespace" +.IX Subsection "Backslash sequences" +A backslash sequence is a sequence of characters, the first one of which is a +backslash. Perl ascribes special meaning to many such sequences, and some of +these are character classes. That is, they match a single character each, +provided that the character belongs to the specific set of characters defined +by the sequence. +.PP +Here's a list of the backslash sequences that are character classes. They +are discussed in more detail below. (For the backslash sequences that aren't +character classes, see perlrebackslash.) +.PP +.Vb 10 +\& \ed Match a decimal digit character. +\& \eD Match a non\-decimal\-digit character. +\& \ew Match a "word" character. +\& \eW Match a non\-"word" character. +\& \es Match a whitespace character. +\& \eS Match a non\-whitespace character. +\& \eh Match a horizontal whitespace character. +\& \eH Match a character that isn\*(Aqt horizontal whitespace. +\& \ev Match a vertical whitespace character. +\& \eV Match a character that isn\*(Aqt vertical whitespace. +\& \eN Match a character that isn\*(Aqt a newline. +\& \epP, \ep{Prop} Match a character that has the given Unicode property. +\& \ePP, \eP{Prop} Match a character that doesn\*(Aqt have the Unicode property +.Ve +.PP +\fI\eN\fR +.IX Subsection "N" +.PP +\&\f(CW\*(C`\eN\*(C'\fR, available starting in v5.12, like the dot, matches any +character that is not a newline. The difference is that \f(CW\*(C`\eN\*(C'\fR is not influenced +by the \fIsingle line\fR regular expression modifier (see "The dot" above). 
Note +that the form \f(CW\*(C`\eN{...}\*(C'\fR may mean something completely different. When the +\&\f(CW\*(C`{...}\*(C'\fR is a quantifier, it means to match a non-newline +character that many times. For example, \f(CW\*(C`\eN{3}\*(C'\fR means to match 3 +non-newlines; \f(CW\*(C`\eN{5,}\*(C'\fR means to match 5 or more non-newlines. But if \f(CW\*(C`{...}\*(C'\fR +is not a legal quantifier, it is presumed to be a named character. See +charnames for those. For example, none of \f(CW\*(C`\eN{COLON}\*(C'\fR, \f(CW\*(C`\eN{4F}\*(C'\fR, and +\&\f(CW\*(C`\eN{F4}\*(C'\fR contain legal quantifiers, so Perl will try to find characters whose +names are respectively \f(CW\*(C`COLON\*(C'\fR, \f(CW\*(C`4F\*(C'\fR, and \f(CW\*(C`F4\*(C'\fR. +.PP +\fIDigits\fR +.IX Subsection "Digits" +.PP +\&\f(CW\*(C`\ed\*(C'\fR matches a single character considered to be a decimal \fIdigit\fR. +If the \f(CW\*(C`/a\*(C'\fR regular expression modifier is in effect, it matches [0\-9]. +Otherwise, it +matches anything that is matched by \f(CW\*(C`\ep{Digit}\*(C'\fR, which includes [0\-9]. +(An unlikely possible exception is that under locale matching rules, the +current locale might not have \f(CW\*(C`[0\-9]\*(C'\fR matched by \f(CW\*(C`\ed\*(C'\fR, and/or might match +other characters whose code point is less than 256. The only such locale +definitions that are legal would be to match \f(CW\*(C`[0\-9]\*(C'\fR plus another set of +10 consecutive digit characters; anything else would be in violation of +the C language standard, but Perl doesn't currently assume anything in +regard to this.) +.PP +What this means is that unless the \f(CW\*(C`/a\*(C'\fR modifier is in effect \f(CW\*(C`\ed\*(C'\fR not +only matches the digits '0' \- '9', but also Arabic, Devanagari, and +digits from other languages. This may cause some confusion, and some +security issues. +.PP +Some digits that \f(CW\*(C`\ed\*(C'\fR matches look like some of the [0\-9] ones, but +have different values. 
For example, BENGALI DIGIT FOUR (U+09EA) looks +very much like an ASCII DIGIT EIGHT (U+0038), and LEPCHA DIGIT SIX +(U+1C46) looks very much like an ASCII DIGIT FIVE (U+0035). An +application that +is expecting only the ASCII digits might be misled, or if the match is +\&\f(CW\*(C`\ed+\*(C'\fR, the matched string might contain a mixture of digits from +different writing systems that look like they signify a number different +than they actually do. "\fBnum()\fR" in Unicode::UCD can +be used to safely +calculate the value, returning \f(CW\*(C`undef\*(C'\fR if the input string contains +such a mixture. Otherwise, for example, a displayed price might be +deliberately different than it appears. +.PP +What \f(CW\*(C`\ep{Digit}\*(C'\fR means (and hence \f(CW\*(C`\ed\*(C'\fR except under the \f(CW\*(C`/a\*(C'\fR +modifier) is \f(CW\*(C`\ep{General_Category=Decimal_Number}\*(C'\fR, or synonymously, +\&\f(CW\*(C`\ep{General_Category=Digit}\*(C'\fR. Starting with Unicode version 4.1, this +is the same set of characters matched by \f(CW\*(C`\ep{Numeric_Type=Decimal}\*(C'\fR. +But Unicode also has a different property with a similar name, +\&\f(CW\*(C`\ep{Numeric_Type=Digit}\*(C'\fR, which matches a completely different set of +characters. These characters are things such as \f(CW\*(C`CIRCLED DIGIT ONE\*(C'\fR +or subscripts, or are from writing systems that lack all ten digits. +.PP +The design intent is for \f(CW\*(C`\ed\*(C'\fR to exactly match the set of characters +that can safely be used with "normal" big-endian positional decimal +syntax, where, for example 123 means one 'hundred', plus two 'tens', +plus three 'ones'. This positional notation does not necessarily apply +to characters that match the other type of "digit", +\&\f(CW\*(C`\ep{Numeric_Type=Digit}\*(C'\fR, and so \f(CW\*(C`\ed\*(C'\fR doesn't match them. 
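The digit rules described above can be checked with a short script. This is only a sketch: the sample strings and variable names are chosen here for illustration, and it assumes a Perl recent enough to provide both the /a modifier and num() in Unicode::UCD.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Sample inputs (illustrative only).
my $bengali = "\x{09E7}\x{09E8}";   # BENGALI DIGIT ONE, BENGALI DIGIT TWO
my $mixed   = "1\x{09EA}";          # ASCII "1" followed by BENGALI DIGIT FOUR

# Without /a, \d matches decimal digits from any script.
my $unicode_match = $bengali =~ /\A\d+\z/  ? 1 : 0;

# Under /a, \d is restricted to [0-9].
my $ascii_match   = $bengali =~ /\A\d+\z/a ? 1 : 0;

# \d+ also accepts the mixed-script string, which is the risk described above.
my $mixed_match   = $mixed =~ /\A\d+\z/    ? 1 : 0;

# num() returns the value of a single-script digit string,
# and undef when the digits come from different scripts.
my $value       = num($bengali);
my $mixed_value = num($mixed);

printf "unicode=%d ascii=%d mixed=%d value=%s mixed_value=%s\n",
    $unicode_match, $ascii_match, $mixed_match,
    defined $value       ? $value       : "undef",
    defined $mixed_value ? $mixed_value : "undef";
```

When only ASCII digits are acceptable, writing [0\-9] or adding /a is usually simpler than validating the match afterwards with num().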
+.PP +The Tamil digits (U+0BE6 \- U+0BEF) can also legally be +used in old-style Tamil numbers in which they would appear no more than +one in a row, separated by characters that mean "times 10", "times 100", +etc. (See <https://www.unicode.org/notes/tn21>.) +.PP +Any character not matched by \f(CW\*(C`\ed\*(C'\fR is matched by \f(CW\*(C`\eD\*(C'\fR. +.PP +\fIWord characters\fR +.IX Subsection "Word characters" +.PP +A \f(CW\*(C`\ew\*(C'\fR matches a single alphanumeric character (an alphabetic character, or a +decimal digit); or a connecting punctuation character, such as an +underscore ("_"); or a "mark" character (like some sort of accent) that +attaches to one of those. It does not match a whole word. To match a +whole word, use \f(CW\*(C`\ew+\*(C'\fR. This isn't the same thing as matching an +English word, but in the ASCII range it is the same as a string of +Perl-identifier characters. +.ie n .IP "If the ""/a"" modifier is in effect ..." 4 +.el .IP "If the \f(CW/a\fR modifier is in effect ..." 4 +.IX Item "If the /a modifier is in effect ..." +\&\f(CW\*(C`\ew\*(C'\fR matches the 63 characters [a\-zA\-Z0\-9_]. +.IP "otherwise ..." 4 +.IX Item "otherwise ..." +.RS 4 +.PD 0 +.IP "For code points above 255 ..." 4 +.IX Item "For code points above 255 ..." +.PD +\&\f(CW\*(C`\ew\*(C'\fR matches the same as \f(CW\*(C`\ep{Word}\*(C'\fR matches in this range. That is, +it matches Thai letters, Greek letters, etc. This includes connector +punctuation (like the underscore) which connect two words together, or +diacritics, such as a \f(CW\*(C`COMBINING TILDE\*(C'\fR and the modifier letters, which +are generally used to add auxiliary markings to letters. +.IP "For code points below 256 ..." 4 +.IX Item "For code points below 256 ..." +.RS 4 +.PD 0 +.IP "if locale rules are in effect ..." 4 +.IX Item "if locale rules are in effect ..." +.PD +\&\f(CW\*(C`\ew\*(C'\fR matches the platform's native underscore character plus whatever +the locale considers to be alphanumeric. 
+.IP "if, instead, Unicode rules are in effect ..." 4 +.IX Item "if, instead, Unicode rules are in effect ..." +\&\f(CW\*(C`\ew\*(C'\fR matches exactly what \f(CW\*(C`\ep{Word}\*(C'\fR matches. +.IP "otherwise ..." 4 +.IX Item "otherwise ..." +\&\f(CW\*(C`\ew\*(C'\fR matches [a\-zA\-Z0\-9_]. +.RE +.RS 4 +.RE +.RE +.RS 4 +.RE +.PP +Which rules apply are determined as described in "Which character set modifier is in effect?" in perlre. +.PP +There are a number of security issues with the full Unicode list of word +characters. See <http://unicode.org/reports/tr36>. +.PP +Also, for a somewhat finer-grained set of characters that are in programming +language identifiers beyond the ASCII range, you may wish to instead use the +more customized "Unicode Properties", \f(CW\*(C`\ep{ID_Start}\*(C'\fR, +\&\f(CW\*(C`\ep{ID_Continue}\*(C'\fR, \f(CW\*(C`\ep{XID_Start}\*(C'\fR, and \f(CW\*(C`\ep{XID_Continue}\*(C'\fR. See +<http://unicode.org/reports/tr31>. +.PP +Any character not matched by \f(CW\*(C`\ew\*(C'\fR is matched by \f(CW\*(C`\eW\*(C'\fR. +.PP +\fIWhitespace\fR +.IX Subsection "Whitespace" +.PP +\&\f(CW\*(C`\es\*(C'\fR matches any single character considered whitespace. +.ie n .IP "If the ""/a"" modifier is in effect ..." 4 +.el .IP "If the \f(CW/a\fR modifier is in effect ..." 4 +.IX Item "If the /a modifier is in effect ..." +In all Perl versions, \f(CW\*(C`\es\*(C'\fR matches the 5 characters [\et\en\ef\er ]; that +is, the horizontal tab, +the newline, the form feed, the carriage return, and the space. +Starting in Perl v5.18, it also matches the vertical tab, \f(CW\*(C`\ecK\*(C'\fR. +See note \f(CW\*(C`[1]\*(C'\fR below for a discussion of this. +.IP "otherwise ..." 4 +.IX Item "otherwise ..." +.RS 4 +.PD 0 +.IP "For code points above 255 ..." 4 +.IX Item "For code points above 255 ..." +.PD +\&\f(CW\*(C`\es\*(C'\fR matches exactly the code points above 255 shown with an "s" column +in the table below. +.IP "For code points below 256 ..." 
4 +.IX Item "For code points below 256 ..." +.RS 4 +.PD 0 +.IP "if locale rules are in effect ..." 4 +.IX Item "if locale rules are in effect ..." +.PD +\&\f(CW\*(C`\es\*(C'\fR matches whatever the locale considers to be whitespace. +.IP "if, instead, Unicode rules are in effect ..." 4 +.IX Item "if, instead, Unicode rules are in effect ..." +\&\f(CW\*(C`\es\*(C'\fR matches exactly the characters shown with an "s" column in the +table below. +.IP "otherwise ..." 4 +.IX Item "otherwise ..." +\&\f(CW\*(C`\es\*(C'\fR matches [\et\en\ef\er ] and, starting in Perl +v5.18, the vertical tab, \f(CW\*(C`\ecK\*(C'\fR. +(See note \f(CW\*(C`[1]\*(C'\fR below for a discussion of this.) +Note that this list doesn't include the non-breaking space. +.RE +.RS 4 +.RE +.RE +.RS 4 +.RE +.PP +Which rules apply are determined as described in "Which character set modifier is in effect?" in perlre. +.PP +Any character not matched by \f(CW\*(C`\es\*(C'\fR is matched by \f(CW\*(C`\eS\*(C'\fR. +.PP +\&\f(CW\*(C`\eh\*(C'\fR matches any character considered horizontal whitespace; +this includes the platform's space and tab characters and several others +listed in the table below. \f(CW\*(C`\eH\*(C'\fR matches any character +not considered horizontal whitespace. They use the platform's native +character set, and do not consider any locale that may otherwise be in +use. +.PP +\&\f(CW\*(C`\ev\*(C'\fR matches any character considered vertical whitespace; +this includes the platform's carriage return and line feed characters (newline) +plus several other characters, all listed in the table below. +\&\f(CW\*(C`\eV\*(C'\fR matches any character not considered vertical whitespace. +They use the platform's native character set, and do not consider any +locale that may otherwise be in use. +.PP +\&\f(CW\*(C`\eR\*(C'\fR matches anything that can be considered a newline under Unicode +rules. It can match a multi-character sequence. 
It cannot be used inside +a bracketed character class; use \f(CW\*(C`\ev\*(C'\fR instead (vertical whitespace). +It uses the platform's +native character set, and does not consider any locale that may +otherwise be in use. +Details are discussed in perlrebackslash. +.PP +Note that unlike \f(CW\*(C`\es\*(C'\fR (and \f(CW\*(C`\ed\*(C'\fR and \f(CW\*(C`\ew\*(C'\fR), \f(CW\*(C`\eh\*(C'\fR and \f(CW\*(C`\ev\*(C'\fR always match +the same characters, without regard to other factors, such as the active +locale or whether the source string is in UTF\-8 format. +.PP +One might think that \f(CW\*(C`\es\*(C'\fR is equivalent to \f(CW\*(C`[\eh\ev]\*(C'\fR. This is indeed true +starting in Perl v5.18, but prior to that, the sole difference was that the +vertical tab (\f(CW"\ecK"\fR) was not matched by \f(CW\*(C`\es\*(C'\fR. +.PP +The following table is a complete listing of characters matched by +\&\f(CW\*(C`\es\*(C'\fR, \f(CW\*(C`\eh\*(C'\fR and \f(CW\*(C`\ev\*(C'\fR as of Unicode 14.0. +.PP +The first column gives the Unicode code point of the character (in hex format), +the second column gives the (Unicode) name. The third column indicates +by which class(es) the character is matched (assuming no locale is in +effect that changes the \f(CW\*(C`\es\*(C'\fR matching). 
+.PP +.Vb 10 +\& 0x0009 CHARACTER TABULATION h s +\& 0x000a LINE FEED (LF) vs +\& 0x000b LINE TABULATION vs [1] +\& 0x000c FORM FEED (FF) vs +\& 0x000d CARRIAGE RETURN (CR) vs +\& 0x0020 SPACE h s +\& 0x0085 NEXT LINE (NEL) vs [2] +\& 0x00a0 NO\-BREAK SPACE h s [2] +\& 0x1680 OGHAM SPACE MARK h s +\& 0x2000 EN QUAD h s +\& 0x2001 EM QUAD h s +\& 0x2002 EN SPACE h s +\& 0x2003 EM SPACE h s +\& 0x2004 THREE\-PER\-EM SPACE h s +\& 0x2005 FOUR\-PER\-EM SPACE h s +\& 0x2006 SIX\-PER\-EM SPACE h s +\& 0x2007 FIGURE SPACE h s +\& 0x2008 PUNCTUATION SPACE h s +\& 0x2009 THIN SPACE h s +\& 0x200a HAIR SPACE h s +\& 0x2028 LINE SEPARATOR vs +\& 0x2029 PARAGRAPH SEPARATOR vs +\& 0x202f NARROW NO\-BREAK SPACE h s +\& 0x205f MEDIUM MATHEMATICAL SPACE h s +\& 0x3000 IDEOGRAPHIC SPACE h s +.Ve +.IP [1] 4 +.IX Item "[1]" +Prior to Perl v5.18, \f(CW\*(C`\es\*(C'\fR did not match the vertical tab. +\&\f(CW\*(C`[^\eS\ecK]\*(C'\fR (obscurely) matches what \f(CW\*(C`\es\*(C'\fR traditionally did. +.IP [2] 4 +.IX Item "[2]" +NEXT LINE and NO-BREAK SPACE may or may not match \f(CW\*(C`\es\*(C'\fR depending +on the rules in effect. See +the beginning of this section. +.PP +\fIUnicode Properties\fR +.IX Subsection "Unicode Properties" +.PP +\&\f(CW\*(C`\epP\*(C'\fR and \f(CW\*(C`\ep{Prop}\*(C'\fR are character classes to match characters that fit given +Unicode properties. One letter property names can be used in the \f(CW\*(C`\epP\*(C'\fR form, +with the property name following the \f(CW\*(C`\ep\*(C'\fR, otherwise, braces are required. +When using braces, there is a single form, which is just the property name +enclosed in the braces, and a compound form which looks like \f(CW\*(C`\ep{name=value}\*(C'\fR, +which means to match if the property "name" for the character has that particular +"value". +For instance, a match for a number can be written as \f(CW\*(C`/\epN/\*(C'\fR or as +\&\f(CW\*(C`/\ep{Number}/\*(C'\fR, or as \f(CW\*(C`/\ep{Number=True}/\*(C'\fR. 
+Lowercase letters are matched by the property \fILowercase_Letter\fR which +has the short form \fILl\fR. They need the braces, so are written as \f(CW\*(C`/\ep{Ll}/\*(C'\fR or +\&\f(CW\*(C`/\ep{Lowercase_Letter}/\*(C'\fR, or \f(CW\*(C`/\ep{General_Category=Lowercase_Letter}/\*(C'\fR +(the underscores are optional). +\&\f(CW\*(C`/\epLl/\*(C'\fR is valid, but means something different. +It matches a two character string: a letter (Unicode property \f(CW\*(C`\epL\*(C'\fR), +followed by a lowercase \f(CW\*(C`l\*(C'\fR. +.PP +What a Unicode property matches is never subject to locale rules, and +if locale rules are not otherwise in effect, the use of a Unicode +property will force the regular expression into using Unicode rules, if +it isn't already. +.PP +Note that almost all properties are immune to case-insensitive matching. +That is, adding a \f(CW\*(C`/i\*(C'\fR regular expression modifier does not change what +they match. But there are two sets that are affected. The first set is +\&\f(CW\*(C`Uppercase_Letter\*(C'\fR, +\&\f(CW\*(C`Lowercase_Letter\*(C'\fR, +and \f(CW\*(C`Titlecase_Letter\*(C'\fR, +all of which match \f(CW\*(C`Cased_Letter\*(C'\fR under \f(CW\*(C`/i\*(C'\fR matching. +The second set is +\&\f(CW\*(C`Uppercase\*(C'\fR, +\&\f(CW\*(C`Lowercase\*(C'\fR, +and \f(CW\*(C`Titlecase\*(C'\fR, +all of which match \f(CW\*(C`Cased\*(C'\fR under \f(CW\*(C`/i\*(C'\fR matching. +(The difference between these sets is that some things, such as Roman +numerals, come in both upper and lower case, so they are \f(CW\*(C`Cased\*(C'\fR, but +aren't considered to be letters, so they aren't \f(CW\*(C`Cased_Letter\*(C'\fRs. They're +actually \f(CW\*(C`Letter_Number\*(C'\fRs.) +This set also includes its subsets \f(CW\*(C`PosixUpper\*(C'\fR and \f(CW\*(C`PosixLower\*(C'\fR, both +of which under \f(CW\*(C`/i\*(C'\fR match \f(CW\*(C`PosixAlpha\*(C'\fR. 
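The property forms and the /i behaviour described above can be tried out as follows. This is a minimal sketch; the test strings and variable names are illustrative assumptions, not part of the original documentation.

```perl
use strict;
use warnings;

# One-letter property name: braces are optional.
my $number_short  = "7"  =~ /\A\pN\z/                                   ? 1 : 0;

# Two-letter short name and the long forms need braces.
my $lower_braces  = "a"  =~ /\A\p{Ll}\z/                                ? 1 : 0;
my $lower_long    = "a"  =~ /\A\p{General_Category=Lowercase_Letter}\z/ ? 1 : 0;

# Without /i, an uppercase letter is not a Lowercase_Letter.
my $upper_plain   = "A"  =~ /\A\p{Ll}\z/                                ? 1 : 0;

# Under /i, Lowercase_Letter matches what Cased_Letter matches.
my $upper_caseins = "A"  =~ /\A\p{Ll}\z/i                               ? 1 : 0;

# \pLl without braces is \pL (any letter) followed by a literal "l".
my $two_chars     = "xl" =~ /\A\pLl\z/                                  ? 1 : 0;

print join(" ", $number_short, $lower_braces, $lower_long,
                $upper_plain, $upper_caseins, $two_chars), "\n";
```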
+.PP +For more details on Unicode properties, see "Unicode +Character Properties" in perlunicode; for a +complete list of possible properties, see +"Properties accessible through \ep{} and \eP{}" in perluniprops, +which notes all forms that have \f(CW\*(C`/i\*(C'\fR differences. +It is also possible to define your own properties. This is discussed in +"User-Defined Character Properties" in perlunicode. +.PP +Unicode properties are defined (surprise!) only on Unicode code points. +Starting in v5.20, when matching against \f(CW\*(C`\ep\*(C'\fR and \f(CW\*(C`\eP\*(C'\fR, Perl treats +non-Unicode code points (those above the legal Unicode maximum of +0x10FFFF) as if they were typical unassigned Unicode code points. +.PP +Prior to v5.20, Perl raised a warning and made all matches fail on +non-Unicode code points. This could be somewhat surprising: +.PP +.Vb 3 +\& chr(0x110000) =~ \ep{ASCII_Hex_Digit=True} # Fails on Perls < v5.20. +\& chr(0x110000) =~ \ep{ASCII_Hex_Digit=False} # Also fails on Perls +\& # < v5.20 +.Ve +.PP +Even though these two matches might be thought of as complements, until +v5.20 they were so only on Unicode code points. +.PP +Starting in perl v5.30, wildcards are allowed in Unicode property +values. See "Wildcards in Property Values" in perlunicode. +.PP +Examples +.IX Subsection "Examples" +.PP +.Vb 8 +\& "a" =~ /\ew/ # Match, "a" is a \*(Aqword\*(Aq character. +\& "7" =~ /\ew/ # Match, "7" is a \*(Aqword\*(Aq character as well. +\& "a" =~ /\ed/ # No match, "a" isn\*(Aqt a digit. +\& "7" =~ /\ed/ # Match, "7" is a digit. +\& " " =~ /\es/ # Match, a space is whitespace. +\& "a" =~ /\eD/ # Match, "a" is a non\-digit. +\& "7" =~ /\eD/ # No match, "7" is not a non\-digit. +\& " " =~ /\eS/ # No match, a space is not non\-whitespace. +\& +\& " " =~ /\eh/ # Match, space is horizontal whitespace. +\& " " =~ /\ev/ # No match, space is not vertical whitespace. +\& "\er" =~ /\ev/ # Match, a return is vertical whitespace. 
+\& +\& "a" =~ /\epL/ # Match, "a" is a letter. +\& "a" =~ /\ep{Lu}/ # No match, /\ep{Lu}/ matches upper case letters. +\& +\& "\ex{0e0b}" =~ /\ep{Thai}/ # Match, \ex{0e0b} is the character +\& # \*(AqTHAI CHARACTER SO SO\*(Aq, and that\*(Aqs in +\& # Thai Unicode class. +\& "a" =~ /\eP{Lao}/ # Match, as "a" is not a Laotian character. +.Ve +.PP +It is worth emphasizing that \f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\ew\*(C'\fR, etc, match single characters, not +complete numbers or words. To match a number (that consists of digits), +use \f(CW\*(C`\ed+\*(C'\fR; to match a word, use \f(CW\*(C`\ew+\*(C'\fR. But be aware of the security +considerations in doing so, as mentioned above. +.SS "Bracketed Character Classes" +.IX Subsection "Bracketed Character Classes" +The third form of character class you can use in Perl regular expressions +is the bracketed character class. In its simplest form, it lists the characters +that may be matched, surrounded by square brackets, like this: \f(CW\*(C`[aeiou]\*(C'\fR. +This matches one of \f(CW\*(C`a\*(C'\fR, \f(CW\*(C`e\*(C'\fR, \f(CW\*(C`i\*(C'\fR, \f(CW\*(C`o\*(C'\fR or \f(CW\*(C`u\*(C'\fR. Like the other +character classes, exactly one character is matched.* To match +a longer string consisting of characters mentioned in the character +class, follow the character class with a quantifier. For +instance, \f(CW\*(C`[aeiou]+\*(C'\fR matches one or more lowercase English vowels. +.PP +Repeating a character in a character class has no +effect; it's considered to be in the set only once. +.PP +Examples: +.PP +.Vb 5 +\& "e" =~ /[aeiou]/ # Match, as "e" is listed in the class. +\& "p" =~ /[aeiou]/ # No match, "p" is not listed in the class. +\& "ae" =~ /^[aeiou]$/ # No match, a character class only matches +\& # a single character. +\& "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier. +\& +\& \-\-\-\-\-\-\- +.Ve +.PP +* There are two exceptions to a bracketed character class matching a +single character only. 
Each requires special handling by Perl to make +things work: +.IP \(bu 4 +When the class is to match caselessly under \f(CW\*(C`/i\*(C'\fR matching rules, and a +character that is explicitly mentioned inside the class matches a +multiple-character sequence caselessly under Unicode rules, the class +will also match that sequence. For example, Unicode says that the +letter \f(CW\*(C`LATIN SMALL LETTER SHARP S\*(C'\fR should match the sequence \f(CW\*(C`ss\*(C'\fR +under \f(CW\*(C`/i\*(C'\fR rules. Thus, +.Sp +.Vb 2 +\& \*(Aqss\*(Aq =~ /\eA\eN{LATIN SMALL LETTER SHARP S}\ez/i # Matches +\& \*(Aqss\*(Aq =~ /\eA[aeioust\eN{LATIN SMALL LETTER SHARP S}]\ez/i # Matches +.Ve +.Sp +For this to happen, the class must not be inverted (see "Negation") +and the character must be explicitly specified, and not be part of a +multi-character range (not even as one of its endpoints). ("Character +Ranges" will be explained shortly.) Therefore, +.Sp +.Vb 6 +\& \*(Aqss\*(Aq =~ /\eA[\e0\-\ex{ff}]\ez/ui # Doesn\*(Aqt match +\& \*(Aqss\*(Aq =~ /\eA[\e0\-\eN{LATIN SMALL LETTER SHARP S}]\ez/ui # No match +\& \*(Aqss\*(Aq =~ /\eA[\exDF\-\exDF]\ez/ui # Matches on ASCII platforms, since +\& # \exDF is LATIN SMALL LETTER SHARP S, +\& # and the range is just a single +\& # element +.Ve +.Sp +Note that it isn't a good idea to specify these types of ranges anyway. +.IP \(bu 4 +Some names known to \f(CW\*(C`\eN{...}\*(C'\fR refer to a sequence of multiple characters, +instead of the usual single character. When one of these is included in +the class, the entire sequence is matched. For example, +.Sp +.Vb 2 +\& "\eN{TAMIL LETTER KA}\eN{TAMIL VOWEL SIGN AU}" +\& =~ / ^ [\eN{TAMIL SYLLABLE KAU}] $ /x; +.Ve +.Sp +matches, because \f(CW\*(C`\eN{TAMIL SYLLABLE KAU}\*(C'\fR is a named sequence +consisting of the two characters matched against. 
Like the other +instance where a bracketed class can match multiple characters, and for +similar reasons, the class must not be inverted, and the named sequence +may not appear in a range, even one where it is both endpoints. If +these happen, it is a fatal error if the character class is within the +scope of \f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR, or within an extended +\&\f(CW\*(C`(?[...])\*(C'\fR class; otherwise +only the first code point is used (with a \f(CW\*(C`regexp\*(C'\fR\-type warning +raised). +.PP +\fISpecial Characters Inside a Bracketed Character Class\fR +.IX Subsection "Special Characters Inside a Bracketed Character Class" +.PP +Most characters that are meta characters in regular expressions (that +is, characters that carry a special meaning like \f(CW\*(C`.\*(C'\fR, \f(CW\*(C`*\*(C'\fR, or \f(CW\*(C`(\*(C'\fR) lose +their special meaning and can be used inside a character class without +the need to escape them. For instance, \f(CW\*(C`[()]\*(C'\fR matches either an opening +parenthesis, or a closing parenthesis, and the parens inside the character +class don't group or capture. Be aware that, unless the pattern is +evaluated in single-quotish context, variable interpolation will take +place before the bracketed class is parsed: +.PP +.Vb 6 +\& $, = "\et| "; +\& $a =~ m\*(Aq[$,]\*(Aq; # single\-quotish: matches \*(Aq$\*(Aq or \*(Aq,\*(Aq +\& $a =~ q{[$,]}; # same +\& $a =~ m/[$,]/; # double\-quotish: Because we made an +\& # assignment to $, above, this now +\& # matches "\et", "|", or " " +.Ve +.PP +Characters that may carry a special meaning inside a character class are: +\&\f(CW\*(C`\e\*(C'\fR, \f(CW\*(C`^\*(C'\fR, \f(CW\*(C`\-\*(C'\fR, \f(CW\*(C`[\*(C'\fR and \f(CW\*(C`]\*(C'\fR, and are discussed below. They can be +escaped with a backslash, although this is sometimes not needed, in which +case the backslash may be omitted. +.PP +The sequence \f(CW\*(C`\eb\*(C'\fR is special inside a bracketed character class. 
While +outside the character class, \f(CW\*(C`\eb\*(C'\fR is an assertion indicating a point +that does not have either two word characters or two non-word characters +on either side, inside a bracketed character class, \f(CW\*(C`\eb\*(C'\fR matches a +backspace character. +.PP +The sequences +\&\f(CW\*(C`\ea\*(C'\fR, +\&\f(CW\*(C`\ec\*(C'\fR, +\&\f(CW\*(C`\ee\*(C'\fR, +\&\f(CW\*(C`\ef\*(C'\fR, +\&\f(CW\*(C`\en\*(C'\fR, +\&\f(CW\*(C`\eN{\fR\f(CINAME\fR\f(CW}\*(C'\fR, +\&\f(CW\*(C`\eN{U+\fR\f(CIhex char\fR\f(CW}\*(C'\fR, +\&\f(CW\*(C`\er\*(C'\fR, +\&\f(CW\*(C`\et\*(C'\fR, +and +\&\f(CW\*(C`\ex\*(C'\fR +are also special and have the same meanings as they do outside a +bracketed character class. +.PP +Also, a backslash followed by two or three octal digits is considered an octal +number. +.PP +A \f(CW\*(C`[\*(C'\fR is not special inside a character class, unless it's the start of a +POSIX character class (see "POSIX Character Classes" below). It normally does +not need escaping. +.PP +A \f(CW\*(C`]\*(C'\fR is normally either the end of a POSIX character class (see +"POSIX Character Classes" below), or it signals the end of the bracketed +character class. If you want to include a \f(CW\*(C`]\*(C'\fR in the set of characters, you +must generally escape it. +.PP +However, if the \f(CW\*(C`]\*(C'\fR is the \fIfirst\fR (or the second if the first +character is a caret) character of a bracketed character class, it +does not denote the end of the class (as you cannot have an empty class) +and is considered part of the set of characters that can be matched without +escaping. +.PP +Examples: +.PP +.Vb 8 +\& "+" =~ /[+?*]/ # Match, "+" in a character class is not special. +\& "\ecH" =~ /[\eb]/ # Match, \eb inside in a character class +\& # is equivalent to a backspace. +\& "]" =~ /[][]/ # Match, as the character class contains +\& # both [ and ]. 
+\& "[]" =~ /[[]]/ # Match, the pattern contains a character class +\& # containing just [, and the character class is +\& # followed by a ]. +.Ve +.PP +\fIBracketed Character Classes and the \fR\f(CI\*(C`/xx\*(C'\fR\fI pattern modifier\fR +.IX Subsection "Bracketed Character Classes and the /xx pattern modifier" +.PP +Normally SPACE and TAB characters have no special meaning inside a +bracketed character class; they are just added to the list of characters +matched by the class. But if the \f(CW\*(C`/xx\*(C'\fR +pattern modifier is in effect, they are generally ignored and can be +added to improve readability. They can't be added in the middle of a +single construct: +.PP +.Vb 1 +\& / [ \ex{10 FFFF} ] /xx # WRONG! +.Ve +.PP +The SPACE in the middle of the hex constant is illegal. +.PP +To specify a literal SPACE character, you can escape it with a +backslash, like: +.PP +.Vb 1 +\& /[ a e i o u \e ]/xx +.Ve +.PP +This matches the English vowels plus the SPACE character. +.PP +For clarity, you should already have been using \f(CW\*(C`\et\*(C'\fR to specify a +literal tab, and \f(CW\*(C`\et\*(C'\fR is unaffected by \f(CW\*(C`/xx\*(C'\fR. +.PP +\fICharacter Ranges\fR +.IX Subsection "Character Ranges" +.PP +It is not uncommon to want to match a range of characters. Luckily, instead +of listing all characters in the range, one may use the hyphen (\f(CW\*(C`\-\*(C'\fR). +If inside a bracketed character class you have two characters separated +by a hyphen, it's treated as if all characters between the two were in +the class. For instance, \f(CW\*(C`[0\-9]\*(C'\fR matches any ASCII digit, and \f(CW\*(C`[a\-m]\*(C'\fR +matches any lowercase letter from the first half of the ASCII alphabet. +.PP +Note that the two characters on either side of the hyphen are not +necessarily both letters or both digits. Any character is possible, +although not advisable. \f(CW\*(C`[\*(Aq\-?]\*(C'\fR contains a range of characters, but +most people will not know which characters that means. 
Furthermore, +such ranges may lead to portability problems if the code has to run on +a platform that uses a different character set, such as EBCDIC. +.PP +If a hyphen in a character class cannot syntactically be part of a range, for +instance because it is the first or the last character of the character class, +or if it immediately follows a range, the hyphen isn't special, and so is +considered a character to be matched literally. If you want a hyphen in +your set of characters to be matched and its position in the class is such +that it could be considered part of a range, you must escape that hyphen +with a backslash. +.PP +Examples: +.PP +.Vb 12 +\& [a\-z] # Matches a character that is a lower case ASCII letter. +\& [a\-fz] # Matches any letter between \*(Aqa\*(Aq and \*(Aqf\*(Aq (inclusive) or +\& # the letter \*(Aqz\*(Aq. +\& [\-z] # Matches either a hyphen (\*(Aq\-\*(Aq) or the letter \*(Aqz\*(Aq. +\& [a\-f\-m] # Matches any letter between \*(Aqa\*(Aq and \*(Aqf\*(Aq (inclusive), the +\& # hyphen (\*(Aq\-\*(Aq), or the letter \*(Aqm\*(Aq. +\& [\*(Aq\-?] # Matches any of the characters \*(Aq()*+,\-./0123456789:;<=>? +\& # (But not on an EBCDIC platform). +\& [\eN{APOSTROPHE}\-\eN{QUESTION MARK}] +\& # Matches any of the characters \*(Aq()*+,\-./0123456789:;<=>? +\& # even on an EBCDIC platform. +\& [\eN{U+27}\-\eN{U+3F}] # Same. (U+27 is "\*(Aq", and U+3F is "?") +.Ve +.PP +As the final two examples above show, you can achieve portability to +non-ASCII platforms by using the \f(CW\*(C`\eN{...}\*(C'\fR form for the range +endpoints. These indicate that the specified range is to be interpreted +using Unicode values, so \f(CW\*(C`[\eN{U+27}\-\eN{U+3F}]\*(C'\fR means to match +\&\f(CW\*(C`\eN{U+27}\*(C'\fR, \f(CW\*(C`\eN{U+28}\*(C'\fR, \f(CW\*(C`\eN{U+29}\*(C'\fR, ..., \f(CW\*(C`\eN{U+3D}\*(C'\fR, \f(CW\*(C`\eN{U+3E}\*(C'\fR, +and \f(CW\*(C`\eN{U+3F}\*(C'\fR, whatever the native code point versions for those are. +These are called "Unicode" ranges. 
If either end is of the \f(CW\*(C`\eN{...}\*(C'\fR +form, the range is considered Unicode. A \f(CW\*(C`regexp\*(C'\fR warning is raised +under \f(CW"use\ re\ \*(Aqstrict\*(Aq"\fR if the other endpoint is specified +non-portably: +.PP +.Vb 2 +\& [\eN{U+00}\-\ex09] # Warning under re \*(Aqstrict\*(Aq; \ex09 is non\-portable +\& [\eN{U+00}\-\et] # No warning; +.Ve +.PP +Both of the above match the characters \f(CW\*(C`\eN{U+00}\*(C'\fR \f(CW\*(C`\eN{U+01}\*(C'\fR, ... +\&\f(CW\*(C`\eN{U+08}\*(C'\fR, \f(CW\*(C`\eN{U+09}\*(C'\fR, but the \f(CW\*(C`\ex09\*(C'\fR looks like it could be a +mistake so the warning is raised (under \f(CW\*(C`re \*(Aqstrict\*(Aq\*(C'\fR) for it. +.PP +Perl also guarantees that the ranges \f(CW\*(C`A\-Z\*(C'\fR, \f(CW\*(C`a\-z\*(C'\fR, \f(CW\*(C`0\-9\*(C'\fR, and any +subranges of these match what an English-only speaker would expect them +to match on any platform. That is, \f(CW\*(C`[A\-Z]\*(C'\fR matches the 26 ASCII +uppercase letters; +\&\f(CW\*(C`[a\-z]\*(C'\fR matches the 26 lowercase letters; and \f(CW\*(C`[0\-9]\*(C'\fR matches the 10 +digits. Subranges, like \f(CW\*(C`[h\-k]\*(C'\fR, match correspondingly, in this case +just the four letters \f(CW"h"\fR, \f(CW"i"\fR, \f(CW"j"\fR, and \f(CW"k"\fR. This is the +natural behavior on ASCII platforms where the code points (ordinal +values) for \f(CW"h"\fR through \f(CW"k"\fR are consecutive integers (0x68 through +0x6B). But special handling to achieve this may be needed on platforms +with a non-ASCII native character set. For example, on EBCDIC +platforms, the code point for \f(CW"h"\fR is 0x88, \f(CW"i"\fR is 0x89, \f(CW"j"\fR is +0x91, and \f(CW"k"\fR is 0x92. Perl specially treats \f(CW\*(C`[h\-k]\*(C'\fR to exclude the +seven code points in the gap: 0x8A through 0x90. 
This special handling is +only invoked when the range is a subrange of one of the ASCII uppercase, +lowercase, and digit ranges, AND each end of the range is expressed +either as a literal, like \f(CW"A"\fR, or as a named character (\f(CW\*(C`\eN{...}\*(C'\fR, +including the \f(CW\*(C`\eN{U+...}\*(C'\fR form). +.PP +EBCDIC Examples: +.PP +.Vb 10 +\& [i\-j] # Matches either "i" or "j" +\& [i\-\eN{LATIN SMALL LETTER J}] # Same +\& [i\-\eN{U+6A}] # Same +\& [\eN{U+69}\-\eN{U+6A}] # Same +\& [\ex{89}\-\ex{91}] # Matches 0x89 ("i"), 0x8A .. 0x90, 0x91 ("j") +\& [i\-\ex{91}] # Same +\& [\ex{89}\-j] # Same +\& [i\-J] # Matches 0x89 ("i") .. 0xD1 ("J"); special +\& # handling doesn\*(Aqt apply because range is mixed +\& # case +.Ve +.PP +\fINegation\fR +.IX Subsection "Negation" +.PP +It is also possible to instead list the characters you do not want to +match. You can do so by using a caret (\f(CW\*(C`^\*(C'\fR) as the first character in the +character class. For instance, \f(CW\*(C`[^a\-z]\*(C'\fR matches any character that is not a +lowercase ASCII letter, which therefore includes more than a million +Unicode code points. The class is said to be "negated" or "inverted". +.PP +This syntax makes the caret a special character inside a bracketed character +class, but only if it is the first character of the class. So if you want +the caret as one of the characters to match, either escape the caret or +else don't list it first. +.PP +In inverted bracketed character classes, Perl ignores the Unicode rules +that normally say that named sequences, and certain characters, should +match a sequence of multiple characters under caseless \f(CW\*(C`/i\*(C'\fR +matching. Following those rules could lead to highly confusing +situations: +.PP +.Vb 1 +\& "ss" =~ /^[^\exDF]+$/ui; # Matches! +.Ve +.PP +This should match any sequences of characters that aren't \f(CW\*(C`\exDF\*(C'\fR nor +what \f(CW\*(C`\exDF\*(C'\fR matches under \f(CW\*(C`/i\*(C'\fR. 
\f(CW"s"\fR isn't \f(CW\*(C`\exDF\*(C'\fR, but Unicode +says that \f(CW"ss"\fR is what \f(CW\*(C`\exDF\*(C'\fR matches under \f(CW\*(C`/i\*(C'\fR. So which one +"wins"? Do you fail the match because the string has \f(CW\*(C`ss\*(C'\fR or accept it +because it has an \f(CW\*(C`s\*(C'\fR followed by another \f(CW\*(C`s\*(C'\fR? Perl has chosen the +latter. (See note in "Bracketed Character Classes" above.) +.PP +Examples: +.PP +.Vb 4 +\& "e" =~ /[^aeiou]/ # No match, the \*(Aqe\*(Aq is listed. +\& "x" =~ /[^aeiou]/ # Match, as \*(Aqx\*(Aq isn\*(Aqt a lowercase vowel. +\& "^" =~ /[^^]/ # No match, matches anything that isn\*(Aqt a caret. +\& "^" =~ /[x^]/ # Match, caret is not special here. +.Ve +.PP +\fIBackslash Sequences\fR +.IX Subsection "Backslash Sequences" +.PP +You can put any backslash sequence character class (with the exception of +\&\f(CW\*(C`\eN\*(C'\fR and \f(CW\*(C`\eR\*(C'\fR) inside a bracketed character class, and it will act just +as if you had put all characters matched by the backslash sequence inside the +character class. For instance, \f(CW\*(C`[a\-f\ed]\*(C'\fR matches any decimal digit, or any +of the lowercase letters between 'a' and 'f' inclusive. +.PP +\&\f(CW\*(C`\eN\*(C'\fR within a bracketed character class must be of the forms \f(CW\*(C`\eN{\fR\f(CIname\fR\f(CW}\*(C'\fR +or \f(CW\*(C`\eN{U+\fR\f(CIhex char\fR\f(CW}\*(C'\fR, and NOT be the form that matches non-newlines, +for the same reason that a dot \f(CW\*(C`.\*(C'\fR inside a bracketed character class loses +its special meaning: it matches nearly anything, which generally isn't what you +want to happen. +.PP +Examples: +.PP +.Vb 4 +\& /[\ep{Thai}\ed]/ # Matches a character that is either a Thai +\& # character, or a digit. +\& /[^\ep{Arabic}()]/ # Matches a character that is neither an Arabic +\& # character, nor a parenthesis. +.Ve +.PP +Backslash sequence character classes cannot form one of the endpoints +of a range. 
Thus, you can't say: +.PP +.Vb 1 +\& /[\ep{Thai}\-\ed]/ # Wrong! +.Ve +.PP +\fIPOSIX Character Classes\fR +.IX Xref "character class \\p \\p{} alpha alnum ascii blank cntrl digit graph lower print punct space upper word xdigit" +.IX Subsection "POSIX Character Classes" +.PP +POSIX character classes have the form \f(CW\*(C`[:class:]\*(C'\fR, where \fIclass\fR is the +name, and \f(CW\*(C`[:\*(C'\fR and \f(CW\*(C`:]\*(C'\fR are the delimiters. POSIX character classes only appear +\&\fIinside\fR bracketed character classes, and are a convenient and descriptive +way of listing a group of characters. +.PP +Be careful about the syntax: +.PP +.Vb 2 +\& # Correct: +\& $string =~ /[[:alpha:]]/ +\& +\& # Incorrect (will warn): +\& $string =~ /[:alpha:]/ +.Ve +.PP +The latter pattern would be a character class consisting of a colon, +and the letters \f(CW\*(C`a\*(C'\fR, \f(CW\*(C`l\*(C'\fR, \f(CW\*(C`p\*(C'\fR and \f(CW\*(C`h\*(C'\fR. +.PP +POSIX character classes can be part of a larger bracketed character class. +For example, +.PP +.Vb 1 +\& [01[:alpha:]%] +.Ve +.PP +is valid and matches '0', '1', any alphabetic character, and the percent sign. +.PP +Perl recognizes the following POSIX character classes: +.PP +.Vb 10 +\& alpha Any alphabetical character (e.g., [A\-Za\-z]). +\& alnum Any alphanumeric character (e.g., [A\-Za\-z0\-9]). +\& ascii Any character in the ASCII character set. +\& blank A GNU extension, equal to a space or a horizontal tab ("\et"). +\& cntrl Any control character. See Note [2] below. +\& digit Any decimal digit (e.g., [0\-9]), equivalent to "\ed". +\& graph Any printable character, excluding a space. See Note [3] below. +\& lower Any lowercase character (e.g., [a\-z]). +\& print Any printable character, including a space. See Note [4] below. +\& punct Any graphical character excluding "word" characters. Note [5]. +\& space Any whitespace character. "\es" including the vertical tab +\& ("\ecK"). +\& upper Any uppercase character (e.g., [A\-Z]). 
+\& word A Perl extension (e.g., [A\-Za\-z0\-9_]), equivalent to "\ew". +\& xdigit Any hexadecimal digit (e.g., [0\-9a\-fA\-F]). Note [7]. +.Ve +.PP +Like the Unicode properties, most of the POSIX +properties match the same regardless of whether case-insensitive (\f(CW\*(C`/i\*(C'\fR) +matching is in effect or not. The two exceptions are \f(CW\*(C`[:upper:]\*(C'\fR and +\&\f(CW\*(C`[:lower:]\*(C'\fR. Under \f(CW\*(C`/i\*(C'\fR, they each match the union of \f(CW\*(C`[:upper:]\*(C'\fR and +\&\f(CW\*(C`[:lower:]\*(C'\fR. +.PP +Most POSIX character classes have two Unicode-style \f(CW\*(C`\ep\*(C'\fR property +counterparts. (They are not official Unicode properties, but Perl extensions +derived from official Unicode properties.) The table below shows the relation +between POSIX character classes and these counterparts. +.PP +One counterpart, in the column labelled "ASCII-range Unicode" in +the table, matches only characters in the ASCII character set. +.PP +The other counterpart, in the column labelled "Full-range Unicode", matches any +appropriate characters in the full Unicode character set. For example, +\&\f(CW\*(C`\ep{Alpha}\*(C'\fR matches not just the ASCII alphabetic characters, but any +character in the entire Unicode character set considered alphabetic. +An entry in the column labelled "backslash sequence" is a (short) +equivalent. 
+.PP +.Vb 10 +\& [[:...:]] ASCII\-range Full\-range backslash Note +\& Unicode Unicode sequence +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& alpha \ep{PosixAlpha} \ep{XPosixAlpha} +\& alnum \ep{PosixAlnum} \ep{XPosixAlnum} +\& ascii \ep{ASCII} +\& blank \ep{PosixBlank} \ep{XPosixBlank} \eh [1] +\& or \ep{HorizSpace} [1] +\& cntrl \ep{PosixCntrl} \ep{XPosixCntrl} [2] +\& digit \ep{PosixDigit} \ep{XPosixDigit} \ed +\& graph \ep{PosixGraph} \ep{XPosixGraph} [3] +\& lower \ep{PosixLower} \ep{XPosixLower} +\& print \ep{PosixPrint} \ep{XPosixPrint} [4] +\& punct \ep{PosixPunct} \ep{XPosixPunct} [5] +\& \ep{PerlSpace} \ep{XPerlSpace} \es [6] +\& space \ep{PosixSpace} \ep{XPosixSpace} [6] +\& upper \ep{PosixUpper} \ep{XPosixUpper} +\& word \ep{PosixWord} \ep{XPosixWord} \ew +\& xdigit \ep{PosixXDigit} \ep{XPosixXDigit} [7] +.Ve +.IP [1] 4 +.IX Item "[1]" +\&\f(CW\*(C`\ep{Blank}\*(C'\fR and \f(CW\*(C`\ep{HorizSpace}\*(C'\fR are synonyms. +.IP [2] 4 +.IX Item "[2]" +Control characters don't produce output as such, but instead usually control +the terminal somehow: for example, newline and backspace are control characters. +On ASCII platforms, in the ASCII range, characters whose code points are +between 0 and 31 inclusive, plus 127 (\f(CW\*(C`DEL\*(C'\fR) are control characters; on +EBCDIC platforms, their counterparts are control characters. +.IP [3] 4 +.IX Item "[3]" +Any character that is \fIgraphical\fR, that is, visible. This class consists +of all alphanumeric characters and all punctuation characters. +.IP [4] 4 +.IX Item "[4]" +All printable characters, which is the set of all graphical characters +plus those whitespace characters which are not also controls. 
+.IP [5] 4 +.IX Item "[5]" +\&\f(CW\*(C`\ep{PosixPunct}\*(C'\fR and \f(CW\*(C`[[:punct:]]\*(C'\fR in the ASCII range match all +non-controls, non-alphanumeric, non-space characters: +\&\f(CW\*(C`[\-!"#$%&\*(Aq()*+,./:;<=>?@[\e\e\e]^_\`{|}~]\*(C'\fR (although if a locale is in effect, +it could alter the behavior of \f(CW\*(C`[[:punct:]]\*(C'\fR). +.Sp +The similarly named property, \f(CW\*(C`\ep{Punct}\*(C'\fR, matches a somewhat different +set in the ASCII range, namely +\&\f(CW\*(C`[\-!"#%&\*(Aq()*,./:;?@[\e\e\e]_{}]\*(C'\fR. That is, it is missing the nine +characters \f(CW\*(C`[$+<=>^\`|~]\*(C'\fR. +This is because Unicode splits what POSIX considers to be punctuation into two +categories, Punctuation and Symbols. +.Sp +\&\f(CW\*(C`\ep{XPosixPunct}\*(C'\fR and (under Unicode rules) \f(CW\*(C`[[:punct:]]\*(C'\fR, match what +\&\f(CW\*(C`\ep{PosixPunct}\*(C'\fR matches in the ASCII range, plus what \f(CW\*(C`\ep{Punct}\*(C'\fR +matches. This is different than strictly matching according to +\&\f(CW\*(C`\ep{Punct}\*(C'\fR. Another way to say it is that +if Unicode rules are in effect, \f(CW\*(C`[[:punct:]]\*(C'\fR matches all characters +that Unicode considers punctuation, plus all ASCII-range characters that +Unicode considers symbols. +.IP [6] 4 +.IX Item "[6]" +\&\f(CW\*(C`\ep{XPerlSpace}\*(C'\fR and \f(CW\*(C`\ep{Space}\*(C'\fR match identically starting with Perl +v5.18. In earlier versions, these differ only in that in non-locale +matching, \f(CW\*(C`\ep{XPerlSpace}\*(C'\fR did not match the vertical tab, \f(CW\*(C`\ecK\*(C'\fR. +Same for the two ASCII-only range forms. +.IP [7] 4 +.IX Item "[7]" +Unlike \f(CW\*(C`[[:digit:]]\*(C'\fR which matches digits in many writing systems, such +as Thai and Devanagari, there are currently only two sets of hexadecimal +digits, and it is unlikely that more will be added. This is because you +not only need the ten digits, but also the six \f(CW\*(C`[A\-F]\*(C'\fR (and \f(CW\*(C`[a\-f]\*(C'\fR) +to correspond. 
That means only the Latin script is suitable for these, +and Unicode has only two sets of these, the familiar ASCII set, and the +fullwidth forms starting at U+FF10 (FULLWIDTH DIGIT ZERO). +.PP +There are various other synonyms that can be used besides the names +listed in the table. For example, \f(CW\*(C`\ep{XPosixAlpha}\*(C'\fR can be written as +\&\f(CW\*(C`\ep{Alpha}\*(C'\fR. All are listed in +"Properties accessible through \ep{} and \eP{}" in perluniprops. +.PP +Both of the \f(CW\*(C`\ep\*(C'\fR counterparts always assume Unicode rules are in effect. +On ASCII platforms, this means they assume that the code points from 128 +to 255 are Latin\-1, and that means that using them under locale rules is +unwise unless the locale is guaranteed to be Latin\-1 or UTF\-8. In contrast, the +POSIX character classes are useful under locale rules. They are +affected by the actual rules in effect, as follows: +.ie n .IP "If the ""/a"" modifier is in effect ..." 4 +.el .IP "If the \f(CW/a\fR modifier is in effect ..." 4 +.IX Item "If the /a modifier is in effect ..." +Each of the POSIX classes matches exactly the same as its ASCII-range +counterpart. +.IP "otherwise ..." 4 +.IX Item "otherwise ..." +.RS 4 +.PD 0 +.IP "For code points above 255 ..." 4 +.IX Item "For code points above 255 ..." +.PD +The POSIX class matches the same as its Full-range counterpart. +.IP "For code points below 256 ..." 4 +.IX Item "For code points below 256 ..." +.RS 4 +.PD 0 +.IP "if locale rules are in effect ..." 4 +.IX Item "if locale rules are in effect ..." +.PD +The POSIX class matches according to the locale, except: +.RS 4 +.ie n .IP """word""" 4 +.el .IP \f(CWword\fR 4 +.IX Item "word" +also includes the platform's native underscore character, no matter what +the locale is. +.ie n .IP """ascii""" 4 +.el .IP \f(CWascii\fR 4 +.IX Item "ascii" +on platforms that don't have the POSIX \f(CW\*(C`ascii\*(C'\fR extension, this matches +just the platform's native ASCII-range characters. 
+.ie n .IP """blank""" 4 +.el .IP \f(CWblank\fR 4 +.IX Item "blank" +on platforms that don't have the POSIX \f(CW\*(C`blank\*(C'\fR extension, this matches +just the platform's native tab and space characters. +.RE +.RS 4 +.RE +.IP "if, instead, Unicode rules are in effect ..." 4 +.IX Item "if, instead, Unicode rules are in effect ..." +The POSIX class matches the same as the Full-range counterpart. +.IP "otherwise ..." 4 +.IX Item "otherwise ..." +The POSIX class matches the same as the ASCII range counterpart. +.RE +.RS 4 +.RE +.RE +.RS 4 +.RE +.PP +Which rules apply are determined as described in +"Which character set modifier is in effect?" in perlre. +.PP +Negation of POSIX character classes +.IX Xref "character class, negation" +.IX Subsection "Negation of POSIX character classes" +.PP +A Perl extension to the POSIX character class is the ability to +negate it. This is done by prefixing the class name with a caret (\f(CW\*(C`^\*(C'\fR). +Some examples: +.PP +.Vb 7 +\& POSIX ASCII\-range Full\-range backslash +\& Unicode Unicode sequence +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& [[:^digit:]] \eP{PosixDigit} \eP{XPosixDigit} \eD +\& [[:^space:]] \eP{PosixSpace} \eP{XPosixSpace} +\& \eP{PerlSpace} \eP{XPerlSpace} \eS +\& [[:^word:]] \eP{PerlWord} \eP{XPosixWord} \eW +.Ve +.PP +The backslash sequence can mean either ASCII\- or Full-range Unicode, +depending on various factors as described in "Which character set modifier is in effect?" in perlre. +.PP +[= =] and [. .] +.IX Subsection "[= =] and [. .]" +.PP +Perl recognizes the POSIX character classes \f(CW\*(C`[=class=]\*(C'\fR and +\&\f(CW\*(C`[.class.]\*(C'\fR, but does not (yet?) support them. Any attempt to use +either construct raises an exception. +.PP +Examples +.IX Subsection "Examples" +.PP +.Vb 12 +\& /[[:digit:]]/ # Matches a character that is a digit. 
+\& /[01[:lower:]]/ # Matches a character that is either a +\& # lowercase letter, or \*(Aq0\*(Aq or \*(Aq1\*(Aq. +\& /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything +\& # except the letters \*(Aqa\*(Aq to \*(Aqf\*(Aq and \*(AqA\*(Aq to +\& # \*(AqF\*(Aq. This is because the main character +\& # class is composed of two POSIX character +\& # classes that are ORed together, one that +\& # matches any digit, and the other that +\& # matches anything that isn\*(Aqt a hex digit. +\& # The OR adds the digits, leaving only the +\& # letters \*(Aqa\*(Aq to \*(Aqf\*(Aq and \*(AqA\*(Aq to \*(AqF\*(Aq excluded. +.Ve +.PP +\fIExtended Bracketed Character Classes\fR +.IX Xref "character class set operations" +.IX Subsection "Extended Bracketed Character Classes" +.PP +This is a fancy bracketed character class that can be used for more +readable and less error-prone classes, and to perform set operations, +such as intersection. An example is +.PP +.Vb 1 +\& /(?[ \ep{Thai} & \ep{Digit} ])/ +.Ve +.PP +This will match all the digit characters that are in the Thai script. +.PP +This feature was introduced as experimental in Perl 5.18 and accepted in +5.36. +.PP +The rules used by \f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR apply to this +construct. +.PP +We can extend the example above: +.PP +.Vb 1 +\& /(?[ ( \ep{Thai} + \ep{Lao} ) & \ep{Digit} ])/ +.Ve +.PP +This matches digits that are in either the Thai or Laotian scripts. +.PP +Notice the white space in these examples. This construct always has +the \f(CW\*(C`/xx\*(C'\fR modifier turned on within it. +.PP +The available binary operators are: +.PP +.Vb 10 +\& & intersection +\& + union +\& | another name for \*(Aq+\*(Aq, hence means union +\& \- subtraction (the result matches the set consisting of those +\& code points matched by the first operand, excluding any that +\& are also matched by the second operand) +\& ^ symmetric difference (the union minus the intersection). 
This +\& is like an exclusive or, in that the result is the set of code +\& points that are matched by either, but not both, of the +\& operands. +.Ve +.PP +There is one unary operator: +.PP +.Vb 1 +\& ! complement +.Ve +.PP +All the binary operators left associate; \f(CW"&"\fR is higher precedence +than the others, which all have equal precedence. The unary operator +right associates, and has highest precedence. Thus this follows the +normal Perl precedence rules for logical operators. Use parentheses to +override the default precedence and associativity. +.PP +The main restriction is that everything is a metacharacter. Thus, +you cannot refer to single characters by doing something like this: +.PP +.Vb 1 +\& /(?[ a + b ])/ # Syntax error! +.Ve +.PP +The easiest way to specify an individual typable character is to enclose +it in brackets: +.PP +.Vb 1 +\& /(?[ [a] + [b] ])/ +.Ve +.PP +(This is the same thing as \f(CW\*(C`[ab]\*(C'\fR.) You could also have said the +equivalent: +.PP +.Vb 1 +\& /(?[[ a b ]])/ +.Ve +.PP +(You can, of course, specify single characters by using \f(CW\*(C`\ex{...}\*(C'\fR, +\&\f(CW\*(C`\eN{...}\*(C'\fR, etc.) +.PP +This last example shows the use of this construct to specify an ordinary +bracketed character class without additional set operations. Note the +white space within it. This is allowed because \f(CW\*(C`/xx\*(C'\fR is +automatically turned on within this construct. +.PP +All the other escapes accepted by normal bracketed character classes are +accepted here as well. +.PP +Because this construct compiles under +\&\f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR, unrecognized escapes that +generate warnings in normal classes are fatal errors here, as are +all other warnings from these class elements and some +practices that don't currently warn outside \f(CW\*(C`re \*(Aqstrict\*(Aq\*(C'\fR. For example, +you cannot say +.PP +.Vb 1 +\& /(?[ [ \exF ] ])/ # Syntax error! 
+.Ve +.PP +You have to have two hex digits after a braceless \f(CW\*(C`\ex\*(C'\fR (use a leading +zero to make two). These restrictions are to lower the incidence of +typos causing the class to not match what you thought it would. +.PP +If a regular bracketed character class contains a \f(CW\*(C`\ep{}\*(C'\fR or \f(CW\*(C`\eP{}\*(C'\fR and +is matched against a non-Unicode code point, a warning may be +raised, as the result is not Unicode-defined. No such warning will come +when using this extended form. +.PP +The final difference between regular bracketed character classes and +these, is that it is not possible to get these to match a +multi-character fold. Thus, +.PP +.Vb 1 +\& /(?[ [\exDF] ])/iu +.Ve +.PP +does not match the string \f(CW\*(C`ss\*(C'\fR. +.PP +You don't have to enclose POSIX class names inside double brackets, +hence both of the following work: +.PP +.Vb 2 +\& /(?[ [:word:] \- [:lower:] ])/ +\& /(?[ [[:word:]] \- [[:lower:]] ])/ +.Ve +.PP +Any contained POSIX character classes, including things like \f(CW\*(C`\ew\*(C'\fR and \f(CW\*(C`\eD\*(C'\fR +respect the \f(CW\*(C`/a\*(C'\fR (and \f(CW\*(C`/aa\*(C'\fR) modifiers. +.PP +Note that \f(CW\*(C`(?[ ])\*(C'\fR is a regex-compile-time construct. Any attempt +to use something which isn't knowable at the time the containing regular +expression is compiled is a fatal error. In practice, this means +just three limitations: +.IP 1. 4 +When compiled within the scope of \f(CW\*(C`use locale\*(C'\fR (or the \f(CW\*(C`/l\*(C'\fR regex +modifier), this construct assumes that the execution-time locale will be +a UTF\-8 one, and the generated pattern always uses Unicode rules. What +gets matched or not thus isn't dependent on the actual runtime locale, so +tainting is not enabled. But a \f(CW\*(C`locale\*(C'\fR category warning is raised +if the runtime locale turns out to not be UTF\-8. +.IP 2. 
4 +Any +user-defined property +used must be already defined by the time the regular expression is +compiled (but note that this construct can be used instead of such +properties). +.IP 3. 4 +A regular expression that otherwise would compile +using \f(CW\*(C`/d\*(C'\fR rules, and which uses this construct will instead +use \f(CW\*(C`/u\*(C'\fR. Thus this construct tells Perl that you don't want +\&\f(CW\*(C`/d\*(C'\fR rules for the entire regular expression containing it. +.PP +Note that skipping white space applies only to the interior of this +construct. There must not be any space between any of the characters +that form the initial \f(CW\*(C`(?[\*(C'\fR. Nor may there be space between the +closing \f(CW\*(C`])\*(C'\fR characters. +.PP +Just as in all regular expressions, the pattern can be built up by +including variables that are interpolated at regex compilation time. +But currently each such sub-component should be an already-compiled +extended bracketed character class. +.PP +.Vb 3 +\& my $thai_or_lao = qr/(?[ \ep{Thai} + \ep{Lao} ])/; +\& ... +\& qr/(?[ \ep{Digit} & $thai_or_lao ])/; +.Ve +.PP +If you interpolate something else, the pattern may still compile (or it +may die), but if it compiles, it very well may not behave as you would +expect: +.PP +.Vb 2 +\& my $thai_or_lao = \*(Aq\ep{Thai} + \ep{Lao}\*(Aq; +\& qr/(?[ \ep{Digit} & $thai_or_lao ])/; +.Ve +.PP +compiles to +.PP +.Vb 1 +\& qr/(?[ \ep{Digit} & \ep{Thai} + \ep{Lao} ])/; +.Ve +.PP +This does not have the effect that someone reading the source code +would likely expect, as the intersection applies just to \f(CW\*(C`\ep{Thai}\*(C'\fR, +excluding the Laotian. +.PP +Due to the way that Perl parses things, your parentheses and brackets +may need to be balanced, even including comments. If you run into any +examples, please submit them to <https://github.com/Perl/perl5/issues>, +so that we can have a concrete example for this man page.
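The set operations and the interpolation rule described in this section can be sketched together in a short script. This is an illustrative sketch only, not part of the upstream manual: the variable names and the sample code points are hypothetical choices made here, and it assumes Perl 5.18 or later (the experimental warning is silenced only on perls older than 5.36).

```perl
use strict;
use warnings;
# (?[ ]) warned as experimental before Perl 5.36; silence that on older perls.
no if $] < 5.036, warnings => 'experimental::regex_sets';

# Digits that belong to either the Thai or the Lao script:
# union of two scripts, then intersection with \p{Digit}.
my $thai_or_lao_digit = qr/(?[ ( \p{Thai} + \p{Lao} ) & \p{Digit} ])/;

# The same set, built by interpolating an already-compiled
# extended bracketed character class (the supported way to compose).
my $thai_or_lao = qr/(?[ \p{Thai} + \p{Lao} ])/;
my $composed    = qr/(?[ \p{Digit} & $thai_or_lao ])/;

# Hypothetical sample data: [ description, character ].
my @samples = (
    [ 'THAI DIGIT ONE',        "\N{U+0E51}" ],
    [ 'LAO DIGIT ONE',         "\N{U+0ED1}" ],
    [ 'THAI CHARACTER KO KAI', "\N{U+0E01}" ],
    [ 'DIGIT ONE (ASCII)',     "1"          ],
);

# Report, by name only, which samples each pattern accepts.
for my $sample (@samples) {
    my ($name, $char) = @$sample;
    printf "%-22s direct=%s composed=%s\n", $name,
        ( $char =~ $thai_or_lao_digit ? 'yes' : 'no' ),
        ( $char =~ $composed          ? 'yes' : 'no' );
}
```

Interpolating the compiled `qr//` object rather than a plain string keeps the union grouped as one operand, which avoids the precedence surprise shown earlier where the intersection applied to only one of the two scripts.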