author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-15 19:43:11 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-15 19:43:11 +0000 |
commit | fc22b3d6507c6745911b9dfcc68f1e665ae13dbc | |
tree | ce1e3bce06471410239a6f41282e328770aa404a /upstream/fedora-rawhide/man1/perlre.1 | |
parent | Initial commit. | |
Adding upstream version 4.22.0. [upstream/4.22.0]
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'upstream/fedora-rawhide/man1/perlre.1')
-rw-r--r-- | upstream/fedora-rawhide/man1/perlre.1 | 3711 |
1 files changed, 3711 insertions, 0 deletions
diff --git a/upstream/fedora-rawhide/man1/perlre.1 b/upstream/fedora-rawhide/man1/perlre.1 new file mode 100644 index 00000000..82e03b65 --- /dev/null +++ b/upstream/fedora-rawhide/man1/perlre.1 @@ -0,0 +1,3711 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLRE 1" +.TH PERLRE 1 2024-01-25 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. +.if n .ad l +.nh +.SH NAME +perlre \- Perl regular expressions +.IX Xref "regular expression regex regexp" +.SH DESCRIPTION +.IX Header "DESCRIPTION" +This page describes the syntax of regular expressions in Perl. 
+.PP +If you haven't used regular expressions before, a tutorial introduction +is available in perlretut. If you know just a little about them, +a quick-start introduction is available in perlrequick. +.PP +Except for "The Basics" section, this page assumes you are familiar +with regular expression basics, like what is a "pattern", what does it +look like, and how it is basically used. For a reference on how they +are used, plus various examples of the same, see discussions of \f(CW\*(C`m//\*(C'\fR, +\&\f(CW\*(C`s///\*(C'\fR, \f(CW\*(C`qr//\*(C'\fR and \f(CW"??"\fR in "Regexp Quote-Like Operators" in perlop. +.PP +New in v5.22, \f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR applies stricter +rules than otherwise when compiling regular expression patterns. It can +find things that, while legal, may not be what you intended. +.SS "The Basics" +.IX Xref "regular expression, version 8 regex, version 8 regexp, version 8" +.IX Subsection "The Basics" +Regular expressions are strings with the very particular syntax and +meaning described in this document and auxiliary documents referred to +by this one. The strings are called "patterns". Patterns are used to +determine if some other string, called the "target", has (or doesn't +have) the characteristics specified by the pattern. We call this +"matching" the target string against the pattern. Usually the match is +done by having the target be the first operand, and the pattern be the +second operand, of one of the two binary operators \f(CW\*(C`=~\*(C'\fR and \f(CW\*(C`!~\*(C'\fR, +listed in "Binding Operators" in perlop; and the pattern will have been +converted from an ordinary string by one of the operators in +"Regexp Quote-Like Operators" in perlop, like so: +.PP +.Vb 1 +\& $foo =~ m/abc/ +.Ve +.PP +This evaluates to true if and only if the string in the variable \f(CW$foo\fR +contains somewhere in it, the sequence of characters "a", "b", then "c". 
+(The \f(CW\*(C`=~ m\*(C'\fR, or match operator, is described in +"m/PATTERN/msixpodualngc" in perlop.) +.PP +Patterns that aren't already stored in some variable must be delimited, +at both ends, by delimiter characters. These are often, as in the +example above, forward slashes, and the typical way a pattern is written +in documentation is with those slashes. In most cases, the delimiter +is the same character, fore and aft, but there are a few cases where a +character looks like it has a mirror-image mate, where the opening +version is the beginning delimiter, and the closing one is the ending +delimiter, like +.PP +.Vb 1 +\& $foo =~ m<abc> +.Ve +.PP +Most times, the pattern is evaluated in double-quotish context, but it +is possible to choose delimiters to force single-quotish, like +.PP +.Vb 1 +\& $foo =~ m\*(Aqabc\*(Aq +.Ve +.PP +If the pattern contains its delimiter within it, that delimiter must be +escaped. Prefixing it with a backslash (\fIe.g.\fR, \f(CW"/foo\e/bar/"\fR) +serves this purpose. +.PP +Any single character in a pattern matches that same character in the +target string, unless the character is a \fImetacharacter\fR with a special +meaning described in this document. A sequence of non-metacharacters +matches the same sequence in the target string, as we saw above with +\&\f(CW\*(C`m/abc/\*(C'\fR. +.PP +Only a few characters (all of them being ASCII punctuation characters) +are metacharacters. The most commonly used one is a dot \f(CW"."\fR, which +normally matches almost any character (including a dot itself). +.PP +You can cause characters that normally function as metacharacters to be +interpreted literally by prefixing them with a \f(CW"\e"\fR, just like the +pattern's delimiter must be escaped if it also occurs within the +pattern. Thus, \f(CW"\e."\fR matches just a literal dot, \f(CW"."\fR instead of +its normal meaning. This means that the backslash is also a +metacharacter, so \f(CW"\e\e"\fR matches a single \f(CW"\e"\fR. 
And a sequence that +contains an escaped metacharacter matches the same sequence (but without +the escape) in the target string. So, the pattern \f(CW\*(C`/blur\e\efl/\*(C'\fR would +match any target string that contains the sequence \f(CW"blur\efl"\fR. +.PP +The metacharacter \f(CW"|"\fR is used to match one thing or another. Thus +.PP +.Vb 1 +\& $foo =~ m/this|that/ +.Ve +.PP +is TRUE if and only if \f(CW$foo\fR contains either the sequence \f(CW"this"\fR or +the sequence \f(CW"that"\fR. Like all metacharacters, prefixing the \f(CW"|"\fR +with a backslash makes it match the plain punctuation character; in its +case, the VERTICAL LINE. +.PP +.Vb 1 +\& $foo =~ m/this\e|that/ +.Ve +.PP +is TRUE if and only if \f(CW$foo\fR contains the sequence \f(CW"this|that"\fR. +.PP +You aren't limited to just a single \f(CW"|"\fR. +.PP +.Vb 1 +\& $foo =~ m/fee|fie|foe|fum/ +.Ve +.PP +is TRUE if and only if \f(CW$foo\fR contains any of those 4 sequences from +the children's story "Jack and the Beanstalk". +.PP +As you can see, the \f(CW"|"\fR binds less tightly than a sequence of +ordinary characters. We can override this by using the grouping +metacharacters, the parentheses \f(CW"("\fR and \f(CW")"\fR. +.PP +.Vb 1 +\& $foo =~ m/th(is|at) thing/ +.Ve +.PP +is TRUE if and only if \f(CW$foo\fR contains either the sequence \f(CW"this\ thing"\fR or the sequence \f(CW"that\ thing"\fR. The portions of the string +that match the portions of the pattern enclosed in parentheses are +normally made available separately for use later in the pattern, +substitution, or program. This is called "capturing", and it can get +complicated. See "Capture groups". +.PP +The first alternative includes everything from the last pattern +delimiter (\f(CW"("\fR, \f(CW"(?:"\fR (described later), \fIetc\fR. or the beginning +of the pattern) up to the first \f(CW"|"\fR, and the last alternative +contains everything from the last \f(CW"|"\fR to the next closing pattern +delimiter. 
That's why it's common practice to include alternatives in +parentheses: to minimize confusion about where they start and end. +.PP +Alternatives are tried from left to right, so the first +alternative found for which the entire expression matches, is the one that +is chosen. This means that alternatives are not necessarily greedy. For +example: when matching \f(CW\*(C`foo|foot\*(C'\fR against \f(CW"barefoot"\fR, only the \f(CW"foo"\fR +part will match, as that is the first alternative tried, and it successfully +matches the target string. (This might not seem important, but it is +important when you are capturing matched text using parentheses.) +.PP +Besides taking away the special meaning of a metacharacter, a prefixed +backslash changes some letter and digit characters away from matching +just themselves to instead have special meaning. These are called +"escape sequences", and all such are described in perlrebackslash. A +backslash sequence (of a letter or digit) that doesn't currently have +special meaning to Perl will raise a warning if warnings are enabled, +as those are reserved for potential future use. +.PP +One such sequence is \f(CW\*(C`\eb\*(C'\fR, which matches a boundary of some sort. +\&\f(CW\*(C`\eb{wb}\*(C'\fR and a few others give specialized types of boundaries. +(They are all described in detail starting at +"\eb{}, \eb, \eB{}, \eB" in perlrebackslash.) Note that these don't match +characters, but the zero-width spaces between characters. They are an +example of a zero-width assertion. Consider again, +.PP +.Vb 1 +\& $foo =~ m/fee|fie|foe|fum/ +.Ve +.PP +It evaluates to TRUE if, besides those 4 words, any of the sequences +"feed", "field", "Defoe", "fume", and many others are in \f(CW$foo\fR. 
By +judicious use of \f(CW\*(C`\eb\*(C'\fR (or better (because it is designed to handle +natural language) \f(CW\*(C`\eb{wb}\*(C'\fR), we can make sure that only the Giant's +words are matched: +.PP +.Vb 2 +\& $foo =~ m/\eb(fee|fie|foe|fum)\eb/ +\& $foo =~ m/\eb{wb}(fee|fie|foe|fum)\eb{wb}/ +.Ve +.PP +The final example shows that the characters \f(CW"{"\fR and \f(CW"}"\fR are +metacharacters. +.PP +Another use for escape sequences is to specify characters that cannot +(or which you prefer not to) be written literally. These are described +in detail in "Character Escapes" in perlrebackslash, but the next three +paragraphs briefly describe some of them. +.PP +Various control characters can be written in C language style: \f(CW"\en"\fR +matches a newline, \f(CW"\et"\fR a tab, \f(CW"\er"\fR a carriage return, \f(CW"\ef"\fR a +form feed, \fIetc\fR. +.PP +More generally, \f(CW\*(C`\e\fR\f(CInnn\fR\f(CW\*(C'\fR, where \fInnn\fR is a string of three octal +digits, matches the character whose native code point is \fInnn\fR. You +can easily run into trouble if you don't have exactly three digits. So +always use three, or since Perl 5.14, you can use \f(CW\*(C`\eo{...}\*(C'\fR to specify +any number of octal digits. +.PP +Similarly, \f(CW\*(C`\ex\fR\f(CInn\fR\f(CW\*(C'\fR, where \fInn\fR are hexadecimal digits, matches the +character whose native ordinal is \fInn\fR. Again, not using exactly two +digits is a recipe for disaster, but you can use \f(CW\*(C`\ex{...}\*(C'\fR to specify +any number of hex digits. +.PP +Besides being a metacharacter, the \f(CW"."\fR is an example of a "character +class", something that can match any single character of a given set of +them. In its case, the set is just about all possible characters. Perl +predefines several character classes besides the \f(CW"."\fR; there is a +separate reference page about just these, perlrecharclass. 
+.PP +You can define your own custom character classes, by putting into your +pattern in the appropriate place(s), a list of all the characters you +want in the set. You do this by enclosing the list within \f(CW\*(C`[]\*(C'\fR bracket +characters. These are called "bracketed character classes" when we are +being precise, but often the word "bracketed" is dropped. (Dropping it +usually doesn't cause confusion.) This means that the \f(CW"["\fR character +is another metacharacter. It doesn't match anything just by itself; it +is used only to tell Perl that what follows it is a bracketed character +class. If you want to match a literal left square bracket, you must +escape it, like \f(CW"\e["\fR. The matching \f(CW"]"\fR is also a metacharacter; +again it doesn't match anything by itself, but just marks the end of +your custom class to Perl. It is an example of a "sometimes +metacharacter". It isn't a metacharacter if there is no corresponding +\&\f(CW"["\fR, and matches its literal self: +.PP +.Vb 1 +\& print "]" =~ /]/; # prints 1 +.Ve +.PP +The list of characters within the character class gives the set of +characters matched by the class. \f(CW"[abc]"\fR matches a single "a" or "b" +or "c". But if the first character after the \f(CW"["\fR is \f(CW"^"\fR, the +class instead matches any character not in the list. Within a list, the +\&\f(CW"\-"\fR character specifies a range of characters, so that \f(CW\*(C`a\-z\*(C'\fR +represents all characters between "a" and "z", inclusive. If you want +either \f(CW"\-"\fR or \f(CW"]"\fR itself to be a member of a class, put it at the +start of the list (possibly after a \f(CW"^"\fR), or escape it with a +backslash. \f(CW"\-"\fR is also taken literally when it is at the end of the +list, just before the closing \f(CW"]"\fR. (The following all specify the +same class of three characters: \f(CW\*(C`[\-az]\*(C'\fR, \f(CW\*(C`[az\-]\*(C'\fR, and \f(CW\*(C`[a\e\-z]\*(C'\fR. 
All +are different from \f(CW\*(C`[a\-z]\*(C'\fR, which specifies a class containing +twenty-six characters, even on EBCDIC-based character sets.) +.PP +There is lots more to bracketed character classes; full details are in +"Bracketed Character Classes" in perlrecharclass. +.PP +\fIMetacharacters\fR +.IX Xref "metacharacter \\ ^ . $ | ( () [ []" +.IX Subsection "Metacharacters" +.PP +"The Basics" introduced some of the metacharacters. This section +gives them all. Most of them have the same meaning as in the \fIegrep\fR +command. +.PP +Only the \f(CW"\e"\fR is always a metacharacter. The others are metacharacters +just sometimes. The following tables lists all of them, summarizes +their use, and gives the contexts where they are metacharacters. +Outside those contexts or if prefixed by a \f(CW"\e"\fR, they match their +corresponding punctuation character. In some cases, their meaning +varies depending on various pattern modifiers that alter the default +behaviors. See "Modifiers". +.PP +.Vb 10 +\& PURPOSE WHERE +\& \e Escape the next character Always, except when +\& escaped by another \e +\& ^ Match the beginning of the string Not in [] +\& (or line, if /m is used) +\& ^ Complement the [] class At the beginning of [] +\& . Match any single character except newline Not in [] +\& (under /s, includes newline) +\& $ Match the end of the string Not in [], but can +\& (or before newline at the end of the mean interpolate a +\& string; or before any newline if /m is scalar +\& used) +\& | Alternation Not in [] +\& () Grouping Not in [] +\& [ Start Bracketed Character class Not in [] +\& ] End Bracketed Character class Only in [], and +\& not first +\& * Matches the preceding element 0 or more Not in [] +\& times +\& + Matches the preceding element 1 or more Not in [] +\& times +\& ? 
Matches the preceding element 0 or 1 Not in [] +\& times +\& { Starts a sequence that gives number(s) Not in [] +\& of times the preceding element can be +\& matched +\& { when following certain escape sequences +\& starts a modifier to the meaning of the +\& sequence +\& } End sequence started by { +\& \- Indicates a range Only in [] interior +\& # Beginning of comment, extends to line end Only with /x modifier +.Ve +.PP +Notice that most of the metacharacters lose their special meaning when +they occur in a bracketed character class, except \f(CW"^"\fR has a different +meaning when it is at the beginning of such a class. And \f(CW"\-"\fR and \f(CW"]"\fR +are metacharacters only at restricted positions within bracketed +character classes; while \f(CW"}"\fR is a metacharacter only when closing a +special construct started by \f(CW"{"\fR. +.PP +In double-quotish context, as is usually the case, you need to be +careful about \f(CW"$"\fR and the non-metacharacter \f(CW"@"\fR. Those could +interpolate variables, which may or may not be what you intended. +.PP +These rules were designed for compactness of expression, rather than +legibility and maintainability. The "/x and /xx" pattern +modifiers allow you to insert white space to improve readability. And +use of \f(CW\*(C`re\ \*(Aqstrict\*(Aq\*(C'\fR adds extra checking to +catch some typos that might silently compile into something unintended. +.PP +By default, the \f(CW"^"\fR character is guaranteed to match only the +beginning of the string, the \f(CW"$"\fR character only the end (or before the +newline at the end), and Perl does certain optimizations with the +assumption that the string contains only one line. Embedded newlines +will not be matched by \f(CW"^"\fR or \f(CW"$"\fR. 
You may, however, wish to treat a +string as a multi-line buffer, such that the \f(CW"^"\fR will match after any +newline within the string (except if the newline is the last character in +the string), and \f(CW"$"\fR will match before any newline. At the +cost of a little more overhead, you can do this by using the +\&\f(CW"/m"\fR modifier on the pattern match operator. (Older programs +did this by setting \f(CW$*\fR, but this option was removed in perl 5.10.) +.IX Xref "^ $ m" +.PP +To simplify multi-line substitutions, the \f(CW"."\fR character never matches a +newline unless you use the \f(CW\*(C`/s\*(C'\fR modifier, which in effect tells +Perl to pretend the string is a single line\-\-even if it isn't. +.IX Xref ". s" +.SS Modifiers +.IX Subsection "Modifiers" +\fIOverview\fR +.IX Subsection "Overview" +.PP +The default behavior for matching can be changed, using various +modifiers. Modifiers that relate to the interpretation of the pattern +are listed just below. Modifiers that alter the way a pattern is used +by Perl are detailed in "Regexp Quote-Like Operators" in perlop and +"Gory details of parsing quoted constructs" in perlop. Modifiers can be added +dynamically; see "Extended Patterns" below. +.ie n .IP "\fR\fB""m""\fR\fB\fR" 4 +.el .IP \fR\f(CBm\fR\fB\fR 4 +.IX Xref " m regex, multiline regexp, multiline regular expression, multiline" +.IX Item "m" +Treat the string being matched against as multiple lines. That is, change \f(CW"^"\fR and \f(CW"$"\fR from matching +the start of the string's first line and the end of its last line to +matching the start and end of each line within the string. +.ie n .IP "\fR\fB""s""\fR\fB\fR" 4 +.el .IP \fR\f(CBs\fR\fB\fR 4 +.IX Xref " s regex, single-line regexp, single-line regular expression, single-line" +.IX Item "s" +Treat the string as single line. That is, change \f(CW"."\fR to match any character +whatsoever, even a newline, which normally it would not match. 
+.Sp +Used together, as \f(CW\*(C`/ms\*(C'\fR, they let the \f(CW"."\fR match any character whatsoever, +while still allowing \f(CW"^"\fR and \f(CW"$"\fR to match, respectively, just after +and just before newlines within the string. +.ie n .IP "\fR\fB""i""\fR\fB\fR" 4 +.el .IP \fR\f(CBi\fR\fB\fR 4 +.IX Xref " i regex, case-insensitive regexp, case-insensitive regular expression, case-insensitive" +.IX Item "i" +Do case-insensitive pattern matching. For example, "A" will match "a" +under \f(CW\*(C`/i\*(C'\fR. +.Sp +If locale matching rules are in effect, the case map is taken from the +current +locale for code points less than 255, and from Unicode rules for larger +code points. However, matches that would cross the Unicode +rules/non\-Unicode rules boundary (ords 255/256) will not succeed, unless +the locale is a UTF\-8 one. See perllocale. +.Sp +There are a number of Unicode characters that match a sequence of +multiple characters under \f(CW\*(C`/i\*(C'\fR. For example, +\&\f(CW\*(C`LATIN SMALL LIGATURE FI\*(C'\fR should match the sequence \f(CW\*(C`fi\*(C'\fR. Perl is not +currently able to do this when the multiple characters are in the pattern and +are split between groupings, or when one or more are quantified. Thus +.Sp +.Vb 3 +\& "\eN{LATIN SMALL LIGATURE FI}" =~ /fi/i; # Matches +\& "\eN{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i; # Doesn\*(Aqt match! +\& "\eN{LATIN SMALL LIGATURE FI}" =~ /fi*/i; # Doesn\*(Aqt match! +\& +\& # The below doesn\*(Aqt match, and it isn\*(Aqt clear what $1 and $2 would +\& # be even if it did!! +\& "\eN{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn\*(Aqt match! +.Ve +.Sp +Perl doesn't match multiple characters in a bracketed +character class unless the character that maps to them is explicitly +mentioned, and it doesn't match them at all if the character class is +inverted, which otherwise could be highly confusing. See +"Bracketed Character Classes" in perlrecharclass, and +"Negation" in perlrecharclass. 
+.ie n .IP "\fR\fB""x""\fR\fB\fR and \fB\fR\fB""xx""\fR\fB\fR" 4 +.el .IP "\fR\f(CBx\fR\fB\fR and \fB\fR\f(CBxx\fR\fB\fR" 4 +.IX Xref " x" +.IX Item "x and xx" +Extend your pattern's legibility by permitting whitespace and comments. +Details in "/x and /xx" +.ie n .IP "\fR\fB""p""\fR\fB\fR" 4 +.el .IP \fR\f(CBp\fR\fB\fR 4 +.IX Xref " p regex, preserve regexp, preserve" +.IX Item "p" +Preserve the string matched such that \f(CW\*(C`${^PREMATCH}\*(C'\fR, \f(CW\*(C`${^MATCH}\*(C'\fR, and +\&\f(CW\*(C`${^POSTMATCH}\*(C'\fR are available for use after matching. +.Sp +In Perl 5.20 and higher this is ignored. Due to a new copy-on-write +mechanism, \f(CW\*(C`${^PREMATCH}\*(C'\fR, \f(CW\*(C`${^MATCH}\*(C'\fR, and \f(CW\*(C`${^POSTMATCH}\*(C'\fR will be available +after the match regardless of the modifier. +.ie n .IP "\fR\fB""a""\fR\fB\fR, \fB\fR\fB""d""\fR\fB\fR, \fB\fR\fB""l""\fR\fB\fR, and \fB\fR\fB""u""\fR\fB\fR" 4 +.el .IP "\fR\f(CBa\fR\fB\fR, \fB\fR\f(CBd\fR\fB\fR, \fB\fR\f(CBl\fR\fB\fR, and \fB\fR\f(CBu\fR\fB\fR" 4 +.IX Xref " a d l u" +.IX Item "a, d, l, and u" +These modifiers, all new in 5.14, affect which character-set rules +(Unicode, \fIetc\fR.) are used, as described below in +"Character set modifiers". +.ie n .IP "\fR\fB""n""\fR\fB\fR" 4 +.el .IP \fR\f(CBn\fR\fB\fR 4 +.IX Xref " n regex, non-capture regexp, non-capture regular expression, non-capture" +.IX Item "n" +Prevent the grouping metacharacters \f(CW\*(C`()\*(C'\fR from capturing. This modifier, +new in 5.22, will stop \f(CW$1\fR, \f(CW$2\fR, \fIetc\fR... from being filled in. +.Sp +.Vb 2 +\& "hello" =~ /(hi|hello)/; # $1 is "hello" +\& "hello" =~ /(hi|hello)/n; # $1 is undef +.Ve +.Sp +This is equivalent to putting \f(CW\*(C`?:\*(C'\fR at the beginning of every capturing group: +.Sp +.Vb 1 +\& "hello" =~ /(?:hi|hello)/; # $1 is undef +.Ve +.Sp +\&\f(CW\*(C`/n\*(C'\fR can be negated on a per-group basis. Alternatively, named captures +may still be used. 
+.Sp +.Vb 3 +\& "hello" =~ /(?\-n:(hi|hello))/n; # $1 is "hello" +\& "hello" =~ /(?<greet>hi|hello)/n; # $1 is "hello", $+{greet} is +\& # "hello" +.Ve +.IP "Other Modifiers" 4 +.IX Item "Other Modifiers" +There are a number of flags that can be found at the end of regular +expression constructs that are \fInot\fR generic regular expression flags, but +apply to the operation being performed, like matching or substitution (\f(CW\*(C`m//\*(C'\fR +or \f(CW\*(C`s///\*(C'\fR respectively). +.Sp +Flags described further in +"Using regular expressions in Perl" in perlretut are: +.Sp +.Vb 2 +\& c \- keep the current position during repeated matching +\& g \- globally match the pattern repeatedly in the string +.Ve +.Sp +Substitution-specific modifiers described in +"s/PATTERN/REPLACEMENT/msixpodualngcer" in perlop are: +.Sp +.Vb 4 +\& e \- evaluate the right\-hand side as an expression +\& ee \- evaluate the right side as a string then eval the result +\& o \- pretend to optimize your code, but actually introduce bugs +\& r \- perform non\-destructive substitution and return the new value +.Ve +.PP +Regular expression modifiers are usually written in documentation +as \fIe.g.\fR, "the \f(CW\*(C`/x\*(C'\fR modifier", even though the delimiter +in question might not really be a slash. The modifiers \f(CW\*(C`/imnsxadlup\*(C'\fR +may also be embedded within the regular expression itself using +the \f(CW\*(C`(?...)\*(C'\fR construct, see "Extended Patterns" below. +.PP +\fIDetails on some modifiers\fR +.IX Subsection "Details on some modifiers" +.PP +Some of the modifiers require more explanation than given in the +"Overview" above. +.PP +\f(CW\*(C`/x\*(C'\fR and \f(CW\*(C`/xx\*(C'\fR +.IX Subsection "/x and /xx" +.PP +A single \f(CW\*(C`/x\*(C'\fR tells +the regular expression parser to ignore most whitespace that is neither +backslashed nor within a bracketed character class, nor within the characters +of a multi-character metapattern like \f(CW\*(C`(?i: ... )\*(C'\fR. 
You can use this to +break up your regular expression into more readable parts. +Also, the \f(CW"#"\fR character is treated as a metacharacter introducing a +comment that runs up to the pattern's closing delimiter, or to the end +of the current line if the pattern extends onto the next line. Hence, +this is very much like an ordinary Perl code comment. (You can include +the closing delimiter within the comment only if you precede it with a +backslash, so be careful!) +.PP +Use of \f(CW\*(C`/x\*(C'\fR means that if you want real +whitespace or \f(CW"#"\fR characters in the pattern (outside a bracketed character +class, which is unaffected by \f(CW\*(C`/x\*(C'\fR), then you'll either have to +escape them (using backslashes or \f(CW\*(C`\eQ...\eE\*(C'\fR) or encode them using octal, +hex, or \f(CW\*(C`\eN{}\*(C'\fR or \f(CW\*(C`\ep{name=...}\*(C'\fR escapes. +It is ineffective to try to continue a comment onto the next line by +escaping the \f(CW\*(C`\en\*(C'\fR with a backslash or \f(CW\*(C`\eQ\*(C'\fR. +.PP +You can use "(?#text)" to create a comment that ends earlier than the +end of the current line, but \f(CW\*(C`text\*(C'\fR also can't contain the closing +delimiter unless escaped with a backslash. +.PP +A common pitfall is to forget that \f(CW"#"\fR characters (outside a +bracketed character class) begin a comment under \f(CW\*(C`/x\*(C'\fR and are not +matched literally. Just keep that in mind when trying to puzzle out why +a particular \f(CW\*(C`/x\*(C'\fR pattern isn't working as expected. +Inside a bracketed character class, \f(CW"#"\fR retains its non-special, +literal meaning. +.PP +Starting in Perl v5.26, if the modifier has a second \f(CW"x"\fR within it, +the effect of a single \f(CW\*(C`/x\*(C'\fR is increased. 
The only difference is that +inside bracketed character classes, non-escaped (by a backslash) SPACE +and TAB characters are not added to the class, and hence can be inserted +to make the classes more readable: +.PP +.Vb 2 +\& / [d\-e g\-i 3\-7]/xx +\& /[ ! @ " # $ % ^ & * () = ? <> \*(Aq ]/xx +.Ve +.PP +may be easier to grasp than the squashed equivalents +.PP +.Vb 2 +\& /[d\-eg\-i3\-7]/ +\& /[!@"#$%^&*()=?<>\*(Aq]/ +.Ve +.PP +Note that this unfortunately doesn't mean that your bracketed classes +can contain comments or extend over multiple lines. A \f(CW\*(C`#\*(C'\fR inside a +character class is still just a literal \f(CW\*(C`#\*(C'\fR, and doesn't introduce a +comment. And, unless the closing bracket is on the same line as the +opening one, the newline character (and everything on the next line(s) +until terminated by a \f(CW\*(C`]\*(C'\fR will be part of the class, just as if you'd +written \f(CW\*(C`\en\*(C'\fR. +.PP +Taken together, these features go a long way towards +making Perl's regular expressions more readable. Here's an example: +.PP +.Vb 6 +\& # Delete (most) C comments. +\& $program =~ s { +\& /\e* # Match the opening delimiter. +\& .*? # Match a minimal number of characters. +\& \e*/ # Match the closing delimiter. +\& } []gsx; +.Ve +.PP +Note that anything inside +a \f(CW\*(C`\eQ...\eE\*(C'\fR stays unaffected by \f(CW\*(C`/x\*(C'\fR. And note that \f(CW\*(C`/x\*(C'\fR doesn't affect +space interpretation within a single multi-character construct. For +example \f(CW\*(C`(?:...)\*(C'\fR can't have a space between the \f(CW"("\fR, +\&\f(CW"?"\fR, and \f(CW":"\fR. Within any delimiters for such a construct, allowed +spaces are not affected by \f(CW\*(C`/x\*(C'\fR, and depend on the construct. For +example, all constructs using curly braces as delimiters, such as +\&\f(CW\*(C`\ex{...}\*(C'\fR can have blanks within but adjacent to the braces, but not +elsewhere, and no non-blank space characters. 
An exception are Unicode +properties which follow Unicode rules, for which see +"Properties accessible through \ep{} and \eP{}" in perluniprops. +.IX Xref " x" +.PP +The set of characters that are deemed whitespace are those that Unicode +calls "Pattern White Space", namely: +.PP +.Vb 11 +\& U+0009 CHARACTER TABULATION +\& U+000A LINE FEED +\& U+000B LINE TABULATION +\& U+000C FORM FEED +\& U+000D CARRIAGE RETURN +\& U+0020 SPACE +\& U+0085 NEXT LINE +\& U+200E LEFT\-TO\-RIGHT MARK +\& U+200F RIGHT\-TO\-LEFT MARK +\& U+2028 LINE SEPARATOR +\& U+2029 PARAGRAPH SEPARATOR +.Ve +.PP +Character set modifiers +.IX Subsection "Character set modifiers" +.PP +\&\f(CW\*(C`/d\*(C'\fR, \f(CW\*(C`/u\*(C'\fR, \f(CW\*(C`/a\*(C'\fR, and \f(CW\*(C`/l\*(C'\fR, available starting in 5.14, are called +the character set modifiers; they affect the character set rules +used for the regular expression. +.PP +The \f(CW\*(C`/d\*(C'\fR, \f(CW\*(C`/u\*(C'\fR, and \f(CW\*(C`/l\*(C'\fR modifiers are not likely to be of much use +to you, and so you need not worry about them very much. They exist for +Perl's internal use, so that complex regular expression data structures +can be automatically serialized and later exactly reconstituted, +including all their nuances. But, since Perl can't keep a secret, and +there may be rare instances where they are useful, they are documented +here. +.PP +The \f(CW\*(C`/a\*(C'\fR modifier, on the other hand, may be useful. Its purpose is to +allow code that is to work mostly on ASCII data to not have to concern +itself with Unicode. +.PP +Briefly, \f(CW\*(C`/l\*(C'\fR sets the character set to that of whatever \fBL\fRocale is in +effect at the time of the execution of the pattern match. +.PP +\&\f(CW\*(C`/u\*(C'\fR sets the character set to \fBU\fRnicode. +.PP +\&\f(CW\*(C`/a\*(C'\fR also sets the character set to Unicode, BUT adds several +restrictions for \fBA\fRSCII-safe matching. 
+.PP +\&\f(CW\*(C`/d\*(C'\fR is the old, problematic, pre\-5.14 \fBD\fRefault character set +behavior. Its only use is to force that old behavior. +.PP +At any given time, exactly one of these modifiers is in effect. Their +existence allows Perl to keep the originally compiled behavior of a +regular expression, regardless of what rules are in effect when it is +actually executed. And if it is interpolated into a larger regex, the +original's rules continue to apply to it, and don't affect the other +parts. +.PP +The \f(CW\*(C`/l\*(C'\fR and \f(CW\*(C`/u\*(C'\fR modifiers are automatically selected for +regular expressions compiled within the scope of various pragmas, +and we recommend that in general, you use those pragmas instead of +specifying these modifiers explicitly. For one thing, the modifiers +affect only pattern matching, and do not extend to even any replacement +done, whereas using the pragmas gives consistent results for all +appropriate operations within their scopes. For example, +.PP +.Vb 1 +\& s/foo/\eUbar/il +.Ve +.PP +will match "foo" using the locale's rules for case-insensitive matching, +but the \f(CW\*(C`/l\*(C'\fR does not affect how the \f(CW\*(C`\eU\*(C'\fR operates. Most likely you +want both of them to use locale rules. To do this, instead compile the +regular expression within the scope of \f(CW\*(C`use locale\*(C'\fR. This both +implicitly adds the \f(CW\*(C`/l\*(C'\fR, and applies locale rules to the \f(CW\*(C`\eU\*(C'\fR. The +lesson is to \f(CW\*(C`use locale\*(C'\fR, and not \f(CW\*(C`/l\*(C'\fR explicitly. +.PP +Similarly, it would be better to use \f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR +instead of, +.PP +.Vb 1 +\& s/foo/\eLbar/iu +.Ve +.PP +to get Unicode rules, as the \f(CW\*(C`\eL\*(C'\fR in the former (but not necessarily +the latter) would also use Unicode rules. +.PP +More detail on each of the modifiers follows. 
Most likely you don't +need to know this detail for \f(CW\*(C`/l\*(C'\fR, \f(CW\*(C`/u\*(C'\fR, and \f(CW\*(C`/d\*(C'\fR, and can skip ahead +to /a. +.PP +/l +.IX Subsection "/l" +.PP +means to use the current locale's rules (see perllocale) when pattern +matching. For example, \f(CW\*(C`\ew\*(C'\fR will match the "word" characters of that +locale, and \f(CW"/i"\fR case-insensitive matching will match according to +the locale's case folding rules. The locale used will be the one in +effect at the time of execution of the pattern match. This may not be +the same as the compilation-time locale, and can differ from one match +to another if there is an intervening call of the +\&\fBsetlocale()\fR function. +.PP +Prior to v5.20, Perl did not support multi-byte locales. Starting then, +UTF\-8 locales are supported. No other multi byte locales are ever +likely to be supported. However, in all locales, one can have code +points above 255 and these will always be treated as Unicode no matter +what locale is in effect. +.PP +Under Unicode rules, there are a few case-insensitive matches that cross +the 255/256 boundary. Except for UTF\-8 locales in Perls v5.20 and +later, these are disallowed under \f(CW\*(C`/l\*(C'\fR. For example, 0xFF (on ASCII +platforms) does not caselessly match the character at 0x178, \f(CW\*(C`LATIN +CAPITAL LETTER Y WITH DIAERESIS\*(C'\fR, because 0xFF may not be \f(CW\*(C`LATIN SMALL +LETTER Y WITH DIAERESIS\*(C'\fR in the current locale, and Perl has no way of +knowing if that character even exists in the locale, much less what code +point it is. +.PP +In a UTF\-8 locale in v5.20 and later, the only visible difference +between locale and non-locale in regular expressions should be tainting, +if your perl supports taint checking (see perlsec). +.PP +This modifier may be specified to be the default by \f(CW\*(C`use locale\*(C'\fR, but +see "Which character set modifier is in effect?". 
+.IX Xref " l" +.PP +/u +.IX Subsection "/u" +.PP +means to use Unicode rules when pattern matching. On ASCII platforms, +this means that the code points between 128 and 255 take on their +Latin\-1 (ISO\-8859\-1) meanings (which are the same as Unicode's). +(Otherwise Perl considers their meanings to be undefined.) Thus, +under this modifier, the ASCII platform effectively becomes a Unicode +platform; and hence, for example, \f(CW\*(C`\ew\*(C'\fR will match any of the more than +100_000 word characters in Unicode. +.PP +Unlike most locales, which are specific to a language and country pair, +Unicode classifies all the characters that are letters \fIsomewhere\fR in +the world as +\&\f(CW\*(C`\ew\*(C'\fR. For example, your locale might not think that \f(CW\*(C`LATIN SMALL +LETTER ETH\*(C'\fR is a letter (unless you happen to speak Icelandic), but +Unicode does. Similarly, all the characters that are decimal digits +somewhere in the world will match \f(CW\*(C`\ed\*(C'\fR; this is hundreds, not 10, +possible matches. And some of those digits look like some of the 10 +ASCII digits, but mean a different number, so a human could easily think +a number is a different quantity than it really is. For example, +\&\f(CW\*(C`BENGALI DIGIT FOUR\*(C'\fR (U+09EA) looks very much like an +\&\f(CW\*(C`ASCII DIGIT EIGHT\*(C'\fR (U+0038), and \f(CW\*(C`LEPCHA DIGIT SIX\*(C'\fR (U+1C46) looks +very much like an \f(CW\*(C`ASCII DIGIT FIVE\*(C'\fR (U+0035). And, \f(CW\*(C`\ed+\*(C'\fR, may match +strings of digits that are a mixture from different writing systems, +creating a security issue. A fraudulent website, for example, could +display the price of something using U+1C46, and it would appear to the +user that something cost 500 units, but it really costs 600. A browser +that enforced script runs ("Script Runs") would prevent that +fraudulent display. "\fBnum()\fR" in Unicode::UCD can also be used to sort this +out. 
Or the \f(CW\*(C`/a\*(C'\fR modifier can be used to force \f(CW\*(C`\ed\*(C'\fR to match just the +ASCII 0 through 9. +.PP +Also, under this modifier, case-insensitive matching works on the full +set of Unicode +characters. The \f(CW\*(C`KELVIN SIGN\*(C'\fR, for example, matches the letters "k" and +"K"; and \f(CW\*(C`LATIN SMALL LIGATURE FF\*(C'\fR matches the sequence "ff", which, +if you're not prepared, might make it look like a hexadecimal constant, +presenting another potential security issue. See +<https://unicode.org/reports/tr36> for a detailed discussion of Unicode +security issues. +.PP +This modifier may be specified to be the default by \f(CW\*(C`use feature +\&\*(Aqunicode_strings\*(Aq\*(C'\fR, \f(CW\*(C`use locale \*(Aq:not_characters\*(Aq\*(C'\fR, or +\&\f(CW\*(C`use v5.12\*(C'\fR (or higher), +but see "Which character set modifier is in effect?". +.IX Xref " u" +.PP +/d +.IX Subsection "/d" +.PP +\&\fBIMPORTANT:\fR Because of the unpredictable behaviors this +modifier causes, only use it to maintain weird backward compatibilities. +Use the +\&\f(CW\*(C`unicode_strings\*(C'\fR +feature +in new code to avoid inadvertently enabling this modifier by default. +.PP +What does this modifier do? It "Depends"! +.PP +This modifier means to use platform-native matching rules +except when there is cause to use Unicode rules instead, as follows: +.IP 1. 4 +the target string's UTF8 flag +(see below) is set; or +.IP 2. 4 +the pattern's UTF8 flag +(see below) is set; or +.IP 3. 4 +the pattern explicitly mentions a code point that is above 255 (say by +\&\f(CW\*(C`\ex{100}\*(C'\fR); or +.IP 4. 4 +the pattern uses a Unicode name (\f(CW\*(C`\eN{...}\*(C'\fR); or +.IP 5. 4 +the pattern uses a Unicode property (\f(CW\*(C`\ep{...}\*(C'\fR or \f(CW\*(C`\eP{...}\*(C'\fR); or +.IP 6. 4 +the pattern uses a Unicode break (\f(CW\*(C`\eb{...}\*(C'\fR or \f(CW\*(C`\eB{...}\*(C'\fR); or +.IP 7. 4 +the pattern uses \f(CW"(?[ ])"\fR; or +.IP 8. 
4 +the pattern uses \f(CW\*(C`(*script_run: ...)\*(C'\fR +.PP +Regarding the "UTF8 flag" references above: normally Perl applications +shouldn't think about that flag. It's part of Perl's internals, +so it can change whenever Perl wants. \f(CW\*(C`/d\*(C'\fR may thus cause unpredictable +results. See "The "Unicode Bug"" in perlunicode. This bug +has become rather infamous, leading to yet other (without swearing) names +for this modifier like "Dicey" and "Dodgy". +.PP +Here are some examples of how that works on an ASCII platform: +.PP +.Vb 3 +\& $str = "\exDF"; # LATIN SMALL LETTER SHARP S +\& utf8::downgrade($str); # $str is not UTF8\-flagged. +\& $str =~ /^\ew/; # No match, since no UTF8 flag. +\& +\& $str .= "\ex{0e0b}"; # Now $str is UTF8\-flagged. +\& $str =~ /^\ew/; # Match! $str is now UTF8\-flagged. +\& chop $str; +\& $str =~ /^\ew/; # Still a match! $str retains its UTF8 flag. +.Ve +.PP +Under Perl's default configuration this modifier is automatically +selected by default when none of the others are, so yet another name +for it (unfortunately) is "Default". +.PP +Whenever you can, use the +\&\f(CW\*(C`unicode_strings\*(C'\fR +feature to cause \f(CW\*(C`/u\*(C'\fR to be the default instead. +.IX Xref " u" +.PP +/a (and /aa) +.IX Subsection "/a (and /aa)" +.PP +This modifier stands for ASCII-restrict (or ASCII-safe). This modifier +may be doubled-up to increase its effect. +.PP +When it appears singly, it causes the sequences \f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\es\*(C'\fR, \f(CW\*(C`\ew\*(C'\fR, and +the Posix character classes to match only in the ASCII range. They thus +revert to their pre\-5.6, pre-Unicode meanings. 
Under \f(CW\*(C`/a\*(C'\fR, \f(CW\*(C`\ed\*(C'\fR +always means precisely the digits \f(CW"0"\fR to \f(CW"9"\fR; \f(CW\*(C`\es\*(C'\fR means the five +characters \f(CW\*(C`[ \ef\en\er\et]\*(C'\fR, and starting in Perl v5.18, the vertical tab; +\&\f(CW\*(C`\ew\*(C'\fR means the 63 characters +\&\f(CW\*(C`[A\-Za\-z0\-9_]\*(C'\fR; and likewise, all the Posix classes such as +\&\f(CW\*(C`[[:print:]]\*(C'\fR match only the appropriate ASCII-range characters. +.PP +This modifier is useful for people who only incidentally use Unicode, +and who do not wish to be burdened with its complexities and security +concerns. +.PP +With \f(CW\*(C`/a\*(C'\fR, one can write \f(CW\*(C`\ed\*(C'\fR with confidence that it will only match +ASCII characters, and should the need arise to match beyond ASCII, you +can instead use \f(CW\*(C`\ep{Digit}\*(C'\fR (or \f(CW\*(C`\ep{Word}\*(C'\fR for \f(CW\*(C`\ew\*(C'\fR). There are +similar \f(CW\*(C`\ep{...}\*(C'\fR constructs that can match beyond ASCII both white +space (see "Whitespace" in perlrecharclass), and Posix classes (see +"POSIX Character Classes" in perlrecharclass). Thus, this modifier +doesn't mean you can't use Unicode, it means that to get Unicode +matching you must explicitly use a construct (\f(CW\*(C`\ep{}\*(C'\fR, \f(CW\*(C`\eP{}\*(C'\fR) that +signals Unicode. +.PP +As you would expect, this modifier causes, for example, \f(CW\*(C`\eD\*(C'\fR to mean +the same thing as \f(CW\*(C`[^0\-9]\*(C'\fR; in fact, all non-ASCII characters match +\&\f(CW\*(C`\eD\*(C'\fR, \f(CW\*(C`\eS\*(C'\fR, and \f(CW\*(C`\eW\*(C'\fR. \f(CW\*(C`\eb\*(C'\fR still means to match at the boundary +between \f(CW\*(C`\ew\*(C'\fR and \f(CW\*(C`\eW\*(C'\fR, using the \f(CW\*(C`/a\*(C'\fR definitions of them (similarly +for \f(CW\*(C`\eB\*(C'\fR). 
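The /a rules described above can be sketched as follows. This is only an illustration, assuming an ASCII platform; the variable names are invented for the example, and the two sample characters are U+0664 (ARABIC-INDIC DIGIT FOUR) and U+00F0 (LATIN SMALL LETTER ETH).

```perl
use strict;
use warnings;

# Sample characters chosen for this sketch (hypothetical variable names).
my $arabic_four = "\x{0664}";   # ARABIC-INDIC DIGIT FOUR: a Unicode decimal digit
my $eth         = "\x{F0}";     # LATIN SMALL LETTER ETH: a Unicode letter

# Each result is 1 if the pattern matched and 0 otherwise.
my $digit_u    = $arabic_four =~ /^\d$/u         ? 1 : 0;  # Unicode rules: a digit
my $digit_a    = $arabic_four =~ /^\d$/a         ? 1 : 0;  # /a: \d is only 0-9
my $nondigit_a = $arabic_four =~ /^\D$/a         ? 1 : 0;  # /a: non-ASCII matches \D
my $prop_a     = $arabic_four =~ /^\p{Digit}$/a  ? 1 : 0;  # \p{} still uses Unicode
my $word_u     = $eth         =~ /^\w$/u         ? 1 : 0;  # Unicode rules: a letter
my $word_a     = $eth         =~ /^\w$/a         ? 1 : 0;  # /a: \w is [A-Za-z0-9_]

print "digit_u=$digit_u digit_a=$digit_a nondigit_a=$nondigit_a ",
      "prop_a=$prop_a word_u=$word_u word_a=$word_a\n";
# prints: digit_u=1 digit_a=0 nondigit_a=1 prop_a=1 word_u=1 word_a=0
```

The fourth test is the point of the modifier: Unicode matching stays available under /a, but only through a construct that asks for it explicitly.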
+.PP +Otherwise, \f(CW\*(C`/a\*(C'\fR behaves like the \f(CW\*(C`/u\*(C'\fR modifier, in that +case-insensitive matching uses Unicode rules; for example, "k" will +match the Unicode \f(CW\*(C`\eN{KELVIN SIGN}\*(C'\fR under \f(CW\*(C`/i\*(C'\fR matching, and code +points in the Latin1 range, above ASCII will have Unicode rules when it +comes to case-insensitive matching. +.PP +To forbid ASCII/non\-ASCII matches (like "k" with \f(CW\*(C`\eN{KELVIN SIGN}\*(C'\fR), +specify the \f(CW"a"\fR twice, for example \f(CW\*(C`/aai\*(C'\fR or \f(CW\*(C`/aia\*(C'\fR. (The first +occurrence of \f(CW"a"\fR restricts the \f(CW\*(C`\ed\*(C'\fR, \fIetc\fR., and the second occurrence +adds the \f(CW\*(C`/i\*(C'\fR restrictions.) But, note that code points outside the +ASCII range will use Unicode rules for \f(CW\*(C`/i\*(C'\fR matching, so the modifier +doesn't really restrict things to just ASCII; it just forbids the +intermixing of ASCII and non-ASCII. +.PP +To summarize, this modifier provides protection for applications that +don't wish to be exposed to all of Unicode. Specifying it twice +gives added protection. +.PP +This modifier may be specified to be the default by \f(CW\*(C`use re \*(Aq/a\*(Aq\*(C'\fR +or \f(CW\*(C`use re \*(Aq/aa\*(Aq\*(C'\fR. If you do so, you may actually have occasion to use +the \f(CW\*(C`/u\*(C'\fR modifier explicitly if there are a few regular expressions +where you do want full Unicode rules (but even here, it's best if +everything were under feature \f(CW"unicode_strings"\fR, along with the +\&\f(CW\*(C`use re \*(Aq/aa\*(Aq\*(C'\fR). Also see "Which character set modifier is in +effect?". +.IX Xref " a aa" +.PP +Which character set modifier is in effect? +.IX Subsection "Which character set modifier is in effect?" +.PP +Which of these modifiers is in effect at any given point in a regular +expression depends on a fairly complex set of interactions. 
These have +been designed so that in general you don't have to worry about it, but +this section gives the gory details. As +explained below in "Extended Patterns", it is possible to explicitly +specify modifiers that apply only to portions of a regular expression. +The innermost always has priority over any outer ones, and one applying +to the whole expression has priority over any of the default settings that are +described in the remainder of this section. +.PP +The \f(CW\*(C`use re \*(Aq/foo\*(Aq\*(C'\fR pragma can be used to set +default modifiers (including these) for regular expressions compiled +within its scope. This pragma has precedence over the other pragmas +listed below that also change the defaults. +.PP +Otherwise, \f(CW\*(C`use locale\*(C'\fR sets the default modifier to \f(CW\*(C`/l\*(C'\fR; +and \f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR, or +\&\f(CW\*(C`use v5.12\*(C'\fR (or higher) set the default to +\&\f(CW\*(C`/u\*(C'\fR when not in the same scope as either \f(CW\*(C`use locale\*(C'\fR +or \f(CW\*(C`use bytes\*(C'\fR. +(\f(CW\*(C`use locale \*(Aq:not_characters\*(Aq\*(C'\fR also +sets the default to \f(CW\*(C`/u\*(C'\fR, overriding any plain \f(CW\*(C`use locale\*(C'\fR.) +Unlike the mechanisms mentioned above, these +affect operations besides regular expression pattern matching, and so +give more consistent results with other operators, including using +\&\f(CW\*(C`\eU\*(C'\fR, \f(CW\*(C`\el\*(C'\fR, \fIetc\fR. in substitution replacements. +.PP +If none of the above apply, for backwards compatibility reasons, the +\&\f(CW\*(C`/d\*(C'\fR modifier is the one in effect by default. As this can lead to +unexpected results, it is best to specify which other rule set should be +used. 
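One way to see which character set modifier a pattern was compiled with is to stringify a qr// object, since the modifier appears after the caret in the result. The sketch below assumes Perl 5.14 or later and uses invented variable names; it is an illustration of the precedence rules above, not an exhaustive test of them.

```perl
use strict;
use warnings;

# No pragma in scope: the /d default applies, and no letter is shown.
my $default = qr/x/;

# unicode_strings selects /u for patterns compiled in its scope.
my $unicode;
{
    use feature 'unicode_strings';
    $unicode = qr/x/;
}

# use re '/a' selects /a for patterns compiled in its scope.
my $ascii;
{
    use re '/a';
    $ascii = qr/x/;
}

# A modifier written on the pattern itself overrides the pragma default.
my $explicit;
{
    use re '/a';
    $explicit = qr/x/u;
}

print "$default $unicode $ascii $explicit\n";
# prints: (?^:x) (?^u:x) (?^a:x) (?^u:x)
```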
+.PP +Character set modifier behavior prior to Perl 5.14 +.IX Subsection "Character set modifier behavior prior to Perl 5.14" +.PP +Prior to 5.14, there were no explicit modifiers, but \f(CW\*(C`/l\*(C'\fR was implied +for regexes compiled within the scope of \f(CW\*(C`use locale\*(C'\fR, and \f(CW\*(C`/d\*(C'\fR was +implied otherwise. However, interpolating a regex into a larger regex +would ignore the original compilation in favor of whatever was in effect +at the time of the second compilation. There were a number of +inconsistencies (bugs) with the \f(CW\*(C`/d\*(C'\fR modifier, where Unicode rules +would be used when inappropriate, and vice versa. \f(CW\*(C`\ep{}\*(C'\fR did not imply +Unicode rules, and neither did all occurrences of \f(CW\*(C`\eN{}\*(C'\fR, until 5.12. +.SS "Regular Expressions" +.IX Subsection "Regular Expressions" +\fIQuantifiers\fR +.IX Subsection "Quantifiers" +.PP +Quantifiers are used when a particular portion of a pattern needs to +match a certain number (or numbers) of times. If there isn't a +quantifier the number of times to match is exactly one. The following +standard quantifiers are recognized: +.IX Xref "metacharacter quantifier * + ? {n} {n,} {n,m}" +.PP +.Vb 7 +\& * Match 0 or more times +\& + Match 1 or more times +\& ? Match 1 or 0 times +\& {n} Match exactly n times +\& {n,} Match at least n times +\& {,n} Match at most n times +\& {n,m} Match at least n but not more than m times +.Ve +.PP +(If a non-escaped curly bracket occurs in a context other than one of +the quantifiers listed above, where it does not form part of a +backslashed sequence like \f(CW\*(C`\ex{...}\*(C'\fR, it is either a fatal syntax error, +or treated as a regular character, generally with a deprecation warning +raised. To escape it, you can precede it with a backslash (\f(CW"\e{"\fR) or +enclose it within square brackets (\f(CW"[{]"\fR). 
+This change will allow for future syntax extensions (like making the +lower bound of a quantifier optional), and better error checking of +quantifiers). +.PP +The \f(CW"*"\fR quantifier is equivalent to \f(CW\*(C`{0,}\*(C'\fR, the \f(CW"+"\fR +quantifier to \f(CW\*(C`{1,}\*(C'\fR, and the \f(CW"?"\fR quantifier to \f(CW\*(C`{0,1}\*(C'\fR. \fIn\fR and \fIm\fR are limited +to non-negative integral values less than a preset limit defined when perl is built. +This is usually 65534 on the most common platforms. The actual limit can +be seen in the error message generated by code such as this: +.PP +.Vb 1 +\& $_ **= $_ , / {$_} / for 2 .. 42; +.Ve +.PP +By default, a quantified subpattern is "greedy", that is, it will match as +many times as possible (given a particular starting location) while still +allowing the rest of the pattern to match. If you want it to match the +minimum number of times possible, follow the quantifier with a \f(CW"?"\fR. Note +that the meanings don't change, just the "greediness": +.IX Xref "metacharacter greedy greediness ? *? +? ?? {n}? {n,}? {,n}? {n,m}?" +.PP +.Vb 7 +\& *? Match 0 or more times, not greedily +\& +? Match 1 or more times, not greedily +\& ?? Match 0 or 1 time, not greedily +\& {n}? Match exactly n times, not greedily (redundant) +\& {n,}? Match at least n times, not greedily +\& {,n}? Match at most n times, not greedily +\& {n,m}? Match at least n but not more than m times, not greedily +.Ve +.PP +Normally when a quantified subpattern does not allow the rest of the +overall pattern to match, Perl will backtrack. However, this behaviour is +sometimes undesirable. Thus Perl provides the "possessive" quantifier form +as well. 
+.PP +.Vb 7 +\& *+ Match 0 or more times and give nothing back +\& ++ Match 1 or more times and give nothing back +\& ?+ Match 0 or 1 time and give nothing back +\& {n}+ Match exactly n times and give nothing back (redundant) +\& {n,}+ Match at least n times and give nothing back +\& {,n}+ Match at most n times and give nothing back +\& {n,m}+ Match at least n but not more than m times and give nothing back +.Ve +.PP +For instance, +.PP +.Vb 1 +\& \*(Aqaaaa\*(Aq =~ /a++a/ +.Ve +.PP +will never match, as the \f(CW\*(C`a++\*(C'\fR will gobble up all the \f(CW"a"\fR's in the +string and won't leave any for the remaining part of the pattern. This +feature can be extremely useful to give perl hints about where it +shouldn't backtrack. For instance, the typical "match a double-quoted +string" problem can be most efficiently performed when written as: +.PP +.Vb 1 +\& /"(?:[^"\e\e]++|\e\e.)*+"/ +.Ve +.PP +as we know that if the final quote does not match, backtracking will not +help. See the independent subexpression +\&\f(CW"(?>\fR\f(CIpattern\fR\f(CW)"\fR for more details; +possessive quantifiers are just syntactic sugar for that construct. For +instance the above example could also be written as follows: +.PP +.Vb 1 +\& /"(?>(?:(?>[^"\e\e]+)|\e\e.)*)"/ +.Ve +.PP +Note that the possessive quantifier modifier can not be combined +with the non-greedy modifier. This is because it would make no sense. 
+Consider the following equivalency table: +.PP +.Vb 5 +\& Illegal Legal +\& \-\-\-\-\-\-\-\-\-\-\-\- \-\-\-\-\-\- +\& X??+ X{0} +\& X+?+ X{1} +\& X{min,max}?+ X{min} +.Ve +.PP +\fIEscape sequences\fR +.IX Subsection "Escape sequences" +.PP +Because patterns are processed as double-quoted strings, the following +also work: +.PP +.Vb 10 +\& \et tab (HT, TAB) +\& \en newline (LF, NL) +\& \er return (CR) +\& \ef form feed (FF) +\& \ea alarm (bell) (BEL) +\& \ee escape (think troff) (ESC) +\& \ecK control char (example: VT) +\& \ex{}, \ex00 character whose ordinal is the given hexadecimal number +\& \eN{name} named Unicode character or character sequence +\& \eN{U+263D} Unicode character (example: FIRST QUARTER MOON) +\& \eo{}, \e000 character whose ordinal is the given octal number +\& \el lowercase next char (think vi) +\& \eu uppercase next char (think vi) +\& \eL lowercase until \eE (think vi) +\& \eU uppercase until \eE (think vi) +\& \eQ quote (disable) pattern metacharacters until \eE +\& \eE end either case modification or quoted section (think vi) +.Ve +.PP +Details are in "Quote and Quote-like Operators" in perlop. +.PP +\fICharacter Classes and other Special Escapes\fR +.IX Subsection "Character Classes and other Special Escapes" +.PP +In addition, Perl defines the following: +.IX Xref "\\g \\k \\K backreference" +.PP +.Vb 10 +\& Sequence Note Description +\& [...] [1] Match a character according to the rules of the +\& bracketed character class defined by the "...". +\& Example: [a\-z] matches "a" or "b" or "c" ... or "z" +\& [[:...:]] [2] Match a character according to the rules of the POSIX +\& character class "..." within the outer bracketed +\& character class. Example: [[:upper:]] matches any +\& uppercase character. 
+\& (?[...]) [8] Extended bracketed character class +\& \ew [3] Match a "word" character (alphanumeric plus "_", plus +\& other connector punctuation chars plus Unicode +\& marks) +\& \eW [3] Match a non\-"word" character +\& \es [3] Match a whitespace character +\& \eS [3] Match a non\-whitespace character +\& \ed [3] Match a decimal digit character +\& \eD [3] Match a non\-digit character +\& \epP [3] Match P, named property. Use \ep{Prop} for longer names +\& \ePP [3] Match non\-P +\& \eX [4] Match Unicode "eXtended grapheme cluster" +\& \e1 [5] Backreference to a specific capture group or buffer. +\& \*(Aq1\*(Aq may actually be any positive integer. +\& \eg1 [5] Backreference to a specific or previous group, +\& \eg{\-1} [5] The number may be negative indicating a relative +\& previous group and may optionally be wrapped in +\& curly brackets for safer parsing. +\& \eg{name} [5] Named backreference +\& \ek<name> [5] Named backreference +\& \ek\*(Aqname\*(Aq [5] Named backreference +\& \ek{name} [5] Named backreference +\& \eK [6] Keep the stuff left of the \eK, don\*(Aqt include it in $& +\& \eN [7] Any character but \en. Not affected by /s modifier +\& \ev [3] Vertical whitespace +\& \eV [3] Not vertical whitespace +\& \eh [3] Horizontal whitespace +\& \eH [3] Not horizontal whitespace +\& \eR [4] Linebreak +.Ve +.IP [1] 4 +.IX Item "[1]" +See "Bracketed Character Classes" in perlrecharclass for details. +.IP [2] 4 +.IX Item "[2]" +See "POSIX Character Classes" in perlrecharclass for details. +.IP [3] 4 +.IX Item "[3]" +See "Unicode Character Properties" in perlunicode for details +.IP [4] 4 +.IX Item "[4]" +See "Misc" in perlrebackslash for details. +.IP [5] 4 +.IX Item "[5]" +See "Capture groups" below for details. +.IP [6] 4 +.IX Item "[6]" +See "Extended Patterns" below for details. +.IP [7] 4 +.IX Item "[7]" +Note that \f(CW\*(C`\eN\*(C'\fR has two meanings. 
When of the form \f(CW\*(C`\eN{\fR\f(CINAME\fR\f(CW}\*(C'\fR, it +matches the character or character sequence whose name is \fINAME\fR; and +similarly +when of the form \f(CW\*(C`\eN{U+\fR\f(CIhex\fR\f(CW}\*(C'\fR, it matches the character whose Unicode +code point is \fIhex\fR. Otherwise it matches any character but \f(CW\*(C`\en\*(C'\fR. +.IP [8] 4 +.IX Item "[8]" +See "Extended Bracketed Character Classes" in perlrecharclass for details. +.PP +\fIAssertions\fR +.IX Subsection "Assertions" +.PP +Besides \f(CW"^"\fR and \f(CW"$"\fR, Perl defines the following +zero-width assertions: +.IX Xref "zero-width assertion assertion regex, zero-width assertion regexp, zero-width assertion regular expression, zero-width assertion \\b \\B \\A \\Z \\z \\G" +.PP +.Vb 9 +\& \eb{} Match at Unicode boundary of specified type +\& \eB{} Match where corresponding \eb{} doesn\*(Aqt match +\& \eb Match a \ew\eW or \eW\ew boundary +\& \eB Match except at a \ew\eW or \eW\ew boundary +\& \eA Match only at beginning of string +\& \eZ Match only at end of string, or before newline at the end +\& \ez Match only at end of string +\& \eG Match only at pos() (e.g. at the end\-of\-match position +\& of prior m//g) +.Ve +.PP +A Unicode boundary (\f(CW\*(C`\eb{}\*(C'\fR), available starting in v5.22, is a spot +between two characters, or before the first character in the string, or +after the final character in the string where certain criteria defined +by Unicode are met. See "\eb{}, \eb, \eB{}, \eB" in perlrebackslash for +details. +.PP +A word boundary (\f(CW\*(C`\eb\*(C'\fR) is a spot between two characters +that has a \f(CW\*(C`\ew\*(C'\fR on one side of it and a \f(CW\*(C`\eW\*(C'\fR on the other side +of it (in either order), counting the imaginary characters off the +beginning and end of the string as matching a \f(CW\*(C`\eW\*(C'\fR. (Within +character classes \f(CW\*(C`\eb\*(C'\fR represents backspace rather than a word +boundary, just as it normally does in any double-quoted string.) 
+The \f(CW\*(C`\eA\*(C'\fR and \f(CW\*(C`\eZ\*(C'\fR are just like \f(CW"^"\fR and \f(CW"$"\fR, except that they +won't match multiple times when the \f(CW\*(C`/m\*(C'\fR modifier is used, while +\&\f(CW"^"\fR and \f(CW"$"\fR will match at every internal line boundary. To match +the actual end of the string and not ignore an optional trailing +newline, use \f(CW\*(C`\ez\*(C'\fR. +.IX Xref "\\b \\A \\Z \\z m" +.PP +The \f(CW\*(C`\eG\*(C'\fR assertion can be used to chain global matches (using +\&\f(CW\*(C`m//g\*(C'\fR), as described in "Regexp Quote-Like Operators" in perlop. +It is also useful when writing \f(CW\*(C`lex\*(C'\fR\-like scanners, when you have +several patterns that you want to match against consequent substrings +of your string; see the previous reference. The actual location +where \f(CW\*(C`\eG\*(C'\fR will match can also be influenced by using \f(CWpos()\fR as +an lvalue: see "pos" in perlfunc. Note that the rule for zero-length +matches (see "Repeated Patterns Matching a Zero-length Substring") +is modified somewhat, in that contents to the left of \f(CW\*(C`\eG\*(C'\fR are +not counted when determining the length of the match. Thus the following +will not match forever: +.IX Xref "\\G" +.PP +.Vb 5 +\& my $string = \*(AqABC\*(Aq; +\& pos($string) = 1; +\& while ($string =~ /(.\eG)/g) { +\& print $1; +\& } +.Ve +.PP +It will print 'A' and then terminate, as it considers the match to +be zero-width, and thus will not match at the same position twice in a +row. +.PP +It is worth noting that \f(CW\*(C`\eG\*(C'\fR improperly used can result in an infinite +loop. Take care when using patterns that include \f(CW\*(C`\eG\*(C'\fR in an alternation. 
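The lex-like use of \G mentioned above might look like the following sketch. The token names and the tiny grammar are invented for the example. It also relies on the /c modifier (documented in perlop), which keeps pos() unchanged when a match fails, so that the next alternative is tried from the same position.

```perl
use strict;
use warnings;

my $input = "x = 42 + y";
my @tokens;

pos($input) = 0;                       # start scanning at the beginning
while (pos($input) < length $input) {
    # Every pattern is anchored with \G, so it can only match exactly
    # where the previous successful match ended.
    if    ($input =~ /\G\s+/gc)             { next }                      # skip blanks
    elsif ($input =~ /\G(\d+)/gc)           { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G([A-Za-z_]\w*)/gc)  { push @tokens, "ID:$1" }
    elsif ($input =~ /\G([=+])/gc)          { push @tokens, "OP:$1" }
    else  { die "unexpected character at offset " . pos($input) . "\n" }
}

print join(" ", @tokens), "\n";
# prints: ID:x OP:= NUM:42 OP:+ ID:y
```

Each branch consumes at least one character on success, which is what keeps a scanner of this shape from looping forever.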
+.PP +Note also that \f(CW\*(C`s///\*(C'\fR will refuse to overwrite part of a substitution +that has already been replaced; so for example this will stop after the +first iteration, rather than iterating its way backwards through the +string: +.PP +.Vb 4 +\& $_ = "123456789"; +\& pos = 6; +\& s/.(?=.\eG)/X/g; +\& print; # prints 1234X6789, not XXXXX6789 +.Ve +.PP +\fICapture groups\fR +.IX Subsection "Capture groups" +.PP +The grouping construct \f(CW\*(C`( ... )\*(C'\fR creates capture groups (also referred to as +capture buffers). To refer to the current contents of a group later on, within +the same pattern, use \f(CW\*(C`\eg1\*(C'\fR (or \f(CW\*(C`\eg{1}\*(C'\fR) for the first, \f(CW\*(C`\eg2\*(C'\fR (or \f(CW\*(C`\eg{2}\*(C'\fR) +for the second, and so on. +This is called a \fIbackreference\fR. + + + + + + + + +There is no limit to the number of captured substrings that you may use. +Groups are numbered with the leftmost open parenthesis being number 1, \fIetc\fR. If +a group did not match, the associated backreference won't match either. (This +can happen if the group is optional, or in a different branch of an +alternation.) +You can omit the \f(CW"g"\fR, and write \f(CW"\e1"\fR, \fIetc\fR, but there are some issues with +this form, described below. +.IX Xref "regex, capture buffer regexp, capture buffer regex, capture group regexp, capture group regular expression, capture buffer backreference regular expression, capture group backreference \\g{1} \\g{-1} \\g{name} relative backreference named backreference named capture buffer regular expression, named capture buffer named capture group regular expression, named capture group %+ $+{name} \\k<name>" +.PP +You can also refer to capture groups relatively, by using a negative number, so +that \f(CW\*(C`\eg\-1\*(C'\fR and \f(CW\*(C`\eg{\-1}\*(C'\fR both refer to the immediately preceding capture +group, and \f(CW\*(C`\eg\-2\*(C'\fR and \f(CW\*(C`\eg{\-2}\*(C'\fR both refer to the group before it. 
For +example: +.PP +.Vb 8 +\& / +\& (Y) # group 1 +\& ( # group 2 +\& (X) # group 3 +\& \eg{\-1} # backref to group 3 +\& \eg{\-3} # backref to group 1 +\& ) +\& /x +.Ve +.PP +would match the same as \f(CW\*(C`/(Y) ( (X) \eg3 \eg1 )/x\*(C'\fR. This allows you to +interpolate regexes into larger regexes and not have to worry about the +capture groups being renumbered. +.PP +You can dispense with numbers altogether and create named capture groups. +The notation is \f(CW\*(C`(?<\fR\f(CIname\fR\f(CW>...)\*(C'\fR to declare and \f(CW\*(C`\eg{\fR\f(CIname\fR\f(CW}\*(C'\fR to +reference. (To be compatible with .Net regular expressions, \f(CW\*(C`\eg{\fR\f(CIname\fR\f(CW}\*(C'\fR may +also be written as \f(CW\*(C`\ek{\fR\f(CIname\fR\f(CW}\*(C'\fR, \f(CW\*(C`\ek<\fR\f(CIname\fR\f(CW>\*(C'\fR or \f(CW\*(C`\ek\*(Aq\fR\f(CIname\fR\f(CW\*(Aq\*(C'\fR.) +\&\fIname\fR must not begin with a number, nor contain hyphens. +When different groups within the same pattern have the same name, any reference +to that name assumes the leftmost defined group. Named groups count in +absolute and relative numbering, and so can also be referred to by those +numbers. +(It's possible to do things with named capture groups that would otherwise +require \f(CW\*(C`(??{})\*(C'\fR.) +.PP +Capture group contents are dynamically scoped and available to you outside the +pattern until the end of the enclosing block or until the next successful +match in the same scope, whichever comes first. +See "Compound Statements" in perlsyn and +"Scoping Rules of Regex Variables" in perlvar for more details. +.PP +You can access the contents of a capture group by absolute number (using +\&\f(CW"$1"\fR instead of \f(CW"\eg1"\fR, \fIetc\fR); or by name via the \f(CW\*(C`%+\*(C'\fR hash, +using \f(CW"$+{\fR\f(CIname\fR\f(CW}"\fR. +.PP +Braces are required in referring to named capture groups, but are optional for +absolute or relative numbered ones. Braces are safer when creating a regex by +concatenating smaller strings. 
For example if you have \f(CW\*(C`qr/$a$b/\*(C'\fR, and \f(CW$a\fR +contained \f(CW"\eg1"\fR, and \f(CW$b\fR contained \f(CW"37"\fR, you would get \f(CW\*(C`/\eg137/\*(C'\fR which +is probably not what you intended. +.PP +If you use braces, you may also optionally add any number of blank +(space or tab) characters within but adjacent to the braces, like +\&\f(CW\*(C`\eg{\ \-1\ }\*(C'\fR, or \f(CW\*(C`\ek{\ \fR\f(CIname\fR\f(CW\ }\*(C'\fR. +.PP +The \f(CW\*(C`\eg\*(C'\fR and \f(CW\*(C`\ek\*(C'\fR notations were introduced in Perl 5.10.0. Prior to that +there were no named nor relative numbered capture groups. Absolute numbered +groups were referred to using \f(CW\*(C`\e1\*(C'\fR, +\&\f(CW\*(C`\e2\*(C'\fR, \fIetc\fR., and this notation is still +accepted (and likely always will be). But it leads to some ambiguities if +there are more than 9 capture groups, as \f(CW\*(C`\e10\*(C'\fR could mean either the tenth +capture group, or the character whose ordinal in octal is 010 (a backspace in +ASCII). Perl resolves this ambiguity by interpreting \f(CW\*(C`\e10\*(C'\fR as a backreference +only if at least 10 left parentheses have opened before it. Likewise \f(CW\*(C`\e11\*(C'\fR is +a backreference only if at least 11 left parentheses have opened before it. +And so on. \f(CW\*(C`\e1\*(C'\fR through \f(CW\*(C`\e9\*(C'\fR are always interpreted as backreferences. +There are several examples below that illustrate these perils. You can avoid +the ambiguity by always using \f(CW\*(C`\eg{}\*(C'\fR or \f(CW\*(C`\eg\*(C'\fR if you mean capturing groups; +and for octal constants always using \f(CW\*(C`\eo{}\*(C'\fR, or for \f(CW\*(C`\e077\*(C'\fR and below, using 3 +digits padded with leading zeros, since a leading zero implies an octal +constant. +.PP +The \f(CW\*(C`\e\fR\f(CIdigit\fR\f(CW\*(C'\fR notation also works in certain circumstances outside +the pattern. See "Warning on \e1 Instead of \f(CW$1\fR" below for details. 
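The advantage of relative backreferences when a regex is interpolated into a larger one can be sketched as follows. The variable names are hypothetical; both fragments are meant to match a doubled word character, and they differ only in how they refer to their own capture group.

```perl
use strict;
use warnings;

# Two fragments intended to match a doubled character such as "aa".
my $doubled_rel = qr/(\w)\g{-1}/;   # refers to the group just before it
my $doubled_abs = qr/(\w)\g{1}/;    # refers to group 1 of the final pattern

# On their own, both fragments behave identically.
my $rel_alone = "aa" =~ $doubled_rel ? 1 : 0;
my $abs_alone = "aa" =~ $doubled_abs ? 1 : 0;

# Once another group precedes the fragment, (\w) becomes group 2.
# \g{-1} still means (\w), but \g{1} now means (x).
my $rel_nested = "x:aa" =~ /(x):$doubled_rel/ ? 1 : 0;
my $abs_nested = "x:aa" =~ /(x):$doubled_abs/ ? 1 : 0;

print "alone: rel=$rel_alone abs=$abs_alone; ",
      "nested: rel=$rel_nested abs=$abs_nested\n";
# prints: alone: rel=1 abs=1; nested: rel=1 abs=0
```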
+.PP +Examples: +.PP +.Vb 1 +\& s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words +\& +\& /(.)\eg1/ # find first doubled char +\& and print "\*(Aq$1\*(Aq is the first doubled character\en"; +\& +\& /(?<char>.)\ek<char>/ # ... a different way +\& and print "\*(Aq$+{char}\*(Aq is the first doubled character\en"; +\& +\& /(?\*(Aqchar\*(Aq.)\eg1/ # ... mix and match +\& and print "\*(Aq$1\*(Aq is the first doubled character\en"; +\& +\& if (/Time: (..):(..):(..)/) { # parse out values +\& $hours = $1; +\& $minutes = $2; +\& $seconds = $3; +\& } +\& +\& /(.)(.)(.)(.)(.)(.)(.)(.)(.)\eg10/ # \eg10 is a backreference +\& /(.)(.)(.)(.)(.)(.)(.)(.)(.)\e10/ # \e10 is octal +\& /((.)(.)(.)(.)(.)(.)(.)(.)(.))\e10/ # \e10 is a backreference +\& /((.)(.)(.)(.)(.)(.)(.)(.)(.))\e010/ # \e010 is octal +\& +\& $a = \*(Aq(.)\e1\*(Aq; # Creates problems when concatenated. +\& $b = \*(Aq(.)\eg{1}\*(Aq; # Avoids the problems. +\& "aa" =~ /${a}/; # True +\& "aa" =~ /${b}/; # True +\& "aa0" =~ /${a}0/; # False! +\& "aa0" =~ /${b}0/; # True +\& "aa\ex08" =~ /${a}0/; # True! +\& "aa\ex08" =~ /${b}0/; # False +.Ve +.PP +Several special variables also refer back to portions of the previous +match. \f(CW$+\fR returns whatever the last bracket match matched. +\&\f(CW$&\fR returns the entire matched string. (At one point \f(CW$0\fR did +also, but now it returns the name of the program.) \f(CW\*(C`$\`\*(C'\fR returns +everything before the matched string. \f(CW\*(C`$\*(Aq\*(C'\fR returns everything +after the matched string. And \f(CW$^N\fR contains whatever was matched by +the most-recently closed group (submatch). \f(CW$^N\fR can be used in +extended patterns (see below), for example to assign a submatch to a +variable. +.IX Xref "$+ $^N $& $` $'" +.PP +These special variables, like the \f(CW\*(C`%+\*(C'\fR hash and the numbered match variables +(\f(CW$1\fR, \f(CW$2\fR, \f(CW$3\fR, \fIetc\fR.) 
are dynamically scoped +until the end of the enclosing block or until the next successful +match, whichever comes first. (See "Compound Statements" in perlsyn.) +.IX Xref "$+ $^N $& $` $' $1 $2 $3 $4 $5 $6 $7 $8 $9 @{^CAPTURE}" +.PP +The \f(CW\*(C`@{^CAPTURE}\*(C'\fR array may be used to access ALL of the capture buffers +as an array without needing to know how many there are. For instance +.PP +.Vb 1 +\& $string=~/$pattern/ and @captured = @{^CAPTURE}; +.Ve +.PP +will place a copy of each capture variable, \f(CW$1\fR, \f(CW$2\fR etc, into the +\&\f(CW@captured\fR array. +.PP +Be aware that when interpolating a subscript of the \f(CW\*(C`@{^CAPTURE}\*(C'\fR +array you must use demarcated curly brace notation: +.PP +.Vb 1 +\& print "@{^CAPTURE[0]}"; +.Ve +.PP +See "Demarcated variable names using braces" in perldata for more on +this notation. +.PP +\&\fBNOTE\fR: Failed matches in Perl do not reset the match variables, +which makes it easier to write code that tests for a series of more +specific cases and remembers the best match. +.PP +\&\fBWARNING\fR: If your code is to run on Perl 5.16 or earlier, +beware that once Perl sees that you need one of \f(CW$&\fR, \f(CW\*(C`$\`\*(C'\fR, or +\&\f(CW\*(C`$\*(Aq\*(C'\fR anywhere in the program, it has to provide them for every +pattern match. This may substantially slow your program. +.PP +Perl uses the same mechanism to produce \f(CW$1\fR, \f(CW$2\fR, \fIetc\fR, so you also +pay a price for each pattern that contains capturing parentheses. +(To avoid this cost while retaining the grouping behaviour, use the +extended regular expression \f(CW\*(C`(?: ... )\*(C'\fR instead.) But if you never +use \f(CW$&\fR, \f(CW\*(C`$\`\*(C'\fR or \f(CW\*(C`$\*(Aq\*(C'\fR, then patterns \fIwithout\fR capturing +parentheses will not be penalized. 
So avoid \f(CW$&\fR, \f(CW\*(C`$\*(Aq\*(C'\fR, and \f(CW\*(C`$\`\*(C'\fR +if you can, but if you can't (and some algorithms really appreciate +them), once you've used them once, use them at will, because you've +already paid the price. +.IX Xref "$& $` $'" +.PP +Perl 5.16 introduced a slightly more efficient mechanism that notes +separately whether each of \f(CW\*(C`$\`\*(C'\fR, \f(CW$&\fR, and \f(CW\*(C`$\*(Aq\*(C'\fR have been seen, and +thus may only need to copy part of the string. Perl 5.20 introduced a +much more efficient copy-on-write mechanism which eliminates any slowdown. +.PP +As another workaround for this problem, Perl 5.10.0 introduced \f(CW\*(C`${^PREMATCH}\*(C'\fR, +\&\f(CW\*(C`${^MATCH}\*(C'\fR and \f(CW\*(C`${^POSTMATCH}\*(C'\fR, which are equivalent to \f(CW\*(C`$\`\*(C'\fR, \f(CW$&\fR +and \f(CW\*(C`$\*(Aq\*(C'\fR, \fBexcept\fR that they are only guaranteed to be defined after a +successful match that was executed with the \f(CW\*(C`/p\*(C'\fR (preserve) modifier. +The use of these variables incurs no global performance penalty, unlike +their punctuation character equivalents, however at the trade-off that you +have to tell perl when you want to use them. As of Perl 5.20, these three +variables are equivalent to \f(CW\*(C`$\`\*(C'\fR, \f(CW$&\fR and \f(CW\*(C`$\*(Aq\*(C'\fR, and \f(CW\*(C`/p\*(C'\fR is ignored. +.IX Xref " p p modifier" +.SS "Quoting metacharacters" +.IX Subsection "Quoting metacharacters" +Backslashed metacharacters in Perl are alphanumeric, such as \f(CW\*(C`\eb\*(C'\fR, +\&\f(CW\*(C`\ew\*(C'\fR, \f(CW\*(C`\en\*(C'\fR. Unlike some other regular expression languages, there +are no backslashed symbols that aren't alphanumeric. So anything +that looks like \f(CW\*(C`\e\e\*(C'\fR, \f(CW\*(C`\e(\*(C'\fR, \f(CW\*(C`\e)\*(C'\fR, \f(CW\*(C`\e[\*(C'\fR, \f(CW\*(C`\e]\*(C'\fR, \f(CW\*(C`\e{\*(C'\fR, or \f(CW\*(C`\e}\*(C'\fR is +always +interpreted as a literal character, not a metacharacter. 
This was +once used in a common idiom to disable or quote the special meanings +of regular expression metacharacters in a string that you want to +use for a pattern. Simply quote all non\-"word" characters: +.PP +.Vb 1 +\& $pattern =~ s/(\eW)/\e\e$1/g; +.Ve +.PP +(If \f(CW\*(C`use locale\*(C'\fR is set, then this depends on the current locale.) +Today it is more common to use the \f(CWquotemeta()\fR +function or the \f(CW\*(C`\eQ\*(C'\fR metaquoting escape sequence to disable all +metacharacters' special meanings like this: +.PP +.Vb 1 +\& /$unquoted\eQ$quoted\eE$unquoted/ +.Ve +.PP +Beware that if you put literal backslashes (those not inside +interpolated variables) between \f(CW\*(C`\eQ\*(C'\fR and \f(CW\*(C`\eE\*(C'\fR, double-quotish +backslash interpolation may lead to confusing results. If you +\&\fIneed\fR to use literal backslashes within \f(CW\*(C`\eQ...\eE\*(C'\fR, +consult "Gory details of parsing quoted constructs" in perlop. +.PP +\&\f(CWquotemeta()\fR and \f(CW\*(C`\eQ\*(C'\fR are fully described in "quotemeta" in perlfunc. +.SS "Extended Patterns" +.IX Subsection "Extended Patterns" +Perl also defines a consistent extension syntax for features not +found in standard tools like \fBawk\fR and +\&\fBlex\fR. The syntax for most of these is a +pair of parentheses with a question mark as the first thing within +the parentheses. The character after the question mark indicates +the extension. +.PP +A question mark was chosen for this and for the minimal-matching +construct because 1) question marks are rare in older regular +expressions, and 2) whenever you see one, you should stop and +"question" exactly what is going on. That's psychology.... +.ie n .IP """(?#\fItext\fR)""" 4 +.el .IP \f(CW(?#\fR\f(CItext\fR\f(CW)\fR 4 +.IX Xref "(?#)" +.IX Item "(?#text)" +A comment. The \fItext\fR is ignored. +Note that Perl closes +the comment as soon as it sees a \f(CW")"\fR, so there is no way to put a literal +\&\f(CW")"\fR in the comment. 
The pattern's closing delimiter must be escaped by +a backslash if it appears in the comment. +.Sp +See "/x" for another way to have comments in patterns. +.Sp +Note that a comment can go just about anywhere, except in the middle of +an escape sequence. Examples: +.Sp +.Vb 1 +\& qr/foo(?#comment)bar/ # Matches \*(Aqfoobar\*(Aq +\& +\& # The pattern below matches \*(Aqabcd\*(Aq, \*(Aqabccd\*(Aq, or \*(Aqabcccd\*(Aq +\& qr/abc(?#comment between literal and its quantifier){1,3}d/ +\& +\& # The pattern below generates a syntax error, because the \*(Aq\ep\*(Aq must +\& # be followed immediately by a \*(Aq{\*(Aq. +\& qr/\ep(?#comment between \ep and its property name){Any}/ +\& +\& # The pattern below generates a syntax error, because the initial +\& # \*(Aq\e(\*(Aq is a literal opening parenthesis, and so there is nothing +\& # for the closing \*(Aq)\*(Aq to match +\& qr/\e(?#the backslash means this isn\*(Aqt a comment)p{Any}/ +\& +\& # Comments can be used to fold long patterns into multiple lines +\& qr/First part of a long regex(?# +\& )remaining part/ +.Ve +.ie n .IP """(?adlupimnsx\-imnsx)""" 4 +.el .IP \f(CW(?adlupimnsx\-imnsx)\fR 4 +.IX Item "(?adlupimnsx-imnsx)" +.PD 0 +.ie n .IP """(?^alupimnsx)""" 4 +.el .IP \f(CW(?^alupimnsx)\fR 4 +.IX Xref "(?) (?^)" +.IX Item "(?^alupimnsx)" +.PD +Zero or more embedded pattern-match modifiers, to be turned on (or +turned off if preceded by \f(CW"\-"\fR) for the remainder of the pattern or +the remainder of the enclosing pattern group (if any). +.Sp +This is particularly useful for dynamically-generated patterns, +such as those read in from a +configuration file, taken from an argument, or specified in a table +somewhere. Consider the case where some patterns want to be +case-sensitive and some do not: The case-insensitive ones merely need to +include \f(CW\*(C`(?i)\*(C'\fR at the front of the pattern.
For example: +.Sp +.Vb 2 +\& $pattern = "foobar"; +\& if ( /$pattern/i ) { } +\& +\& # more flexible: +\& +\& $pattern = "(?i)foobar"; +\& if ( /$pattern/ ) { } +.Ve +.Sp +These modifiers are restored at the end of the enclosing group. For example, +.Sp +.Vb 1 +\& ( (?i) blah ) \es+ \eg1 +.Ve +.Sp +will match \f(CW\*(C`blah\*(C'\fR in any case, some spaces, and an exact (\fIincluding the case\fR!) +repetition of the previous word, assuming the \f(CW\*(C`/x\*(C'\fR modifier, and no \f(CW\*(C`/i\*(C'\fR +modifier outside this group. +.Sp +These modifiers do not carry over into named subpatterns called in the +enclosing group. In other words, a pattern such as \f(CW\*(C`((?i)(?&\fR\f(CINAME\fR\f(CW))\*(C'\fR does not +change the case-sensitivity of the \fINAME\fR pattern. +.Sp +A modifier is overridden by later occurrences of this construct in the +same scope containing the same modifier, so that +.Sp +.Vb 1 +\& /((?im)foo(?\-m)bar)/ +.Ve +.Sp +matches all of \f(CW\*(C`foobar\*(C'\fR case insensitively, but uses \f(CW\*(C`/m\*(C'\fR rules for +only the \f(CW\*(C`foo\*(C'\fR portion. The \f(CW"a"\fR flag overrides \f(CW\*(C`aa\*(C'\fR as well; +likewise \f(CW\*(C`aa\*(C'\fR overrides \f(CW"a"\fR. The same goes for \f(CW"x"\fR and \f(CW\*(C`xx\*(C'\fR. +Hence, in +.Sp +.Vb 1 +\& /(?\-x)foo/xx +.Ve +.Sp +both \f(CW\*(C`/x\*(C'\fR and \f(CW\*(C`/xx\*(C'\fR are turned off during matching \f(CW\*(C`foo\*(C'\fR. And in +.Sp +.Vb 1 +\& /(?x)foo/x +.Ve +.Sp +\&\f(CW\*(C`/x\*(C'\fR but NOT \f(CW\*(C`/xx\*(C'\fR is turned on for matching \f(CW\*(C`foo\*(C'\fR. (One might +mistakenly think that since the inner \f(CW\*(C`(?x)\*(C'\fR is already in the scope of +\&\f(CW\*(C`/x\*(C'\fR, that the result would effectively be the sum of them, yielding +\&\f(CW\*(C`/xx\*(C'\fR. It doesn't work that way.) 
Similarly, doing something like +\&\f(CW\*(C`(?xx\-x)foo\*(C'\fR turns off all \f(CW"x"\fR behavior for matching \f(CW\*(C`foo\*(C'\fR, it is not +that you subtract 1 \f(CW"x"\fR from 2 to get 1 \f(CW"x"\fR remaining. +.Sp +Any of these modifiers can be set to apply globally to all regular +expressions compiled within the scope of a \f(CW\*(C`use re\*(C'\fR. See +"'/flags' mode" in re. +.Sp +Starting in Perl 5.14, a \f(CW"^"\fR (caret or circumflex accent) immediately +after the \f(CW"?"\fR is a shorthand equivalent to \f(CW\*(C`d\-imnsx\*(C'\fR. Flags (except +\&\f(CW"d"\fR) may follow the caret to override it. +But a minus sign is not legal with it. +.Sp +Note that the \f(CW"a"\fR, \f(CW"d"\fR, \f(CW"l"\fR, \f(CW"p"\fR, and \f(CW"u"\fR modifiers are special in +that they can only be enabled, not disabled, and the \f(CW"a"\fR, \f(CW"d"\fR, \f(CW"l"\fR, and +\&\f(CW"u"\fR modifiers are mutually exclusive: specifying one de-specifies the +others, and a maximum of one (or two \f(CW"a"\fR's) may appear in the +construct. Thus, for +example, \f(CW\*(C`(?\-p)\*(C'\fR will warn when compiled under \f(CW\*(C`use warnings\*(C'\fR; +\&\f(CW\*(C`(?\-d:...)\*(C'\fR and \f(CW\*(C`(?dl:...)\*(C'\fR are fatal errors. +.Sp +Note also that the \f(CW"p"\fR modifier is special in that its presence +anywhere in a pattern has a global effect. +.Sp +Having zero modifiers makes this a no-op (so why did you specify it, +unless it's generated code), and starting in v5.30, warns under \f(CW\*(C`use +re \*(Aqstrict\*(Aq\*(C'\fR. 
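+.Sp
+As a small illustration of the scoping rule described above, the following
+sketch (the sample strings are made up for this example) shows that
+\&\f(CW\*(C`(?i)\*(C'\fR stops applying at the end of its enclosing group, so a
+backreference outside the group has to repeat the captured text exactly:
+.Sp
+.Vb 3
+\&    my $re = qr/((?i)blah)\es+\eg1/;
+\&    print "BLAH BLAH" =~ $re ? "match\en" : "no match\en";   # match
+\&    print "BLAH blah" =~ $re ? "match\en" : "no match\en";   # no match
+.Ve
+.Sp
+The second string fails because \f(CW\*(C`\eg1\*(C'\fR is compared case-sensitively
+against the captured \f(CW"BLAH"\fR.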
+.ie n .IP """(?:\fIpattern\fR)""" 4 +.el .IP \f(CW(?:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Xref "(?:)" +.IX Item "(?:pattern)" +.PD 0 +.ie n .IP """(?adluimnsx\-imnsx:\fIpattern\fR)""" 4 +.el .IP \f(CW(?adluimnsx\-imnsx:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(?adluimnsx-imnsx:pattern)" +.ie n .IP """(?^aluimnsx:\fIpattern\fR)""" 4 +.el .IP \f(CW(?^aluimnsx:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Xref "(?^:)" +.IX Item "(?^aluimnsx:pattern)" +.PD +This is for clustering, not capturing; it groups subexpressions like +\&\f(CW"()"\fR, but doesn't make backreferences as \f(CW"()"\fR does. So +.Sp +.Vb 1 +\& @fields = split(/\eb(?:a|b|c)\eb/) +.Ve +.Sp +matches the same field delimiters as +.Sp +.Vb 1 +\& @fields = split(/\eb(a|b|c)\eb/) +.Ve +.Sp +but doesn't spit out the delimiters themselves as extra fields (even though +that's the behaviour of "split" in perlfunc when its pattern contains capturing +groups). It's also cheaper not to capture +characters if you don't need to. +.Sp +Any letters between \f(CW"?"\fR and \f(CW":"\fR act as flags modifiers as with +\&\f(CW\*(C`(?adluimnsx\-imnsx)\*(C'\fR. For example, +.Sp +.Vb 1 +\& /(?s\-i:more.*than).*million/i +.Ve +.Sp +is equivalent to the more verbose +.Sp +.Vb 1 +\& /(?:(?s\-i)more.*than).*million/i +.Ve +.Sp +Note that any \f(CW\*(C`()\*(C'\fR constructs enclosed within this one will still +capture unless the \f(CW\*(C`/n\*(C'\fR modifier is in effect. +.Sp +Like the "(?adlupimnsx\-imnsx)" construct, \f(CW\*(C`aa\*(C'\fR and \f(CW"a"\fR override each +other, as do \f(CW\*(C`xx\*(C'\fR and \f(CW"x"\fR. They are not additive. So, doing +something like \f(CW\*(C`(?xx\-x:foo)\*(C'\fR turns off all \f(CW"x"\fR behavior for matching +\&\f(CW\*(C`foo\*(C'\fR. +.Sp +Starting in Perl 5.14, a \f(CW"^"\fR (caret or circumflex accent) immediately +after the \f(CW"?"\fR is a shorthand equivalent to \f(CW\*(C`d\-imnsx\*(C'\fR. 
Any positive +flags (except \f(CW"d"\fR) may follow the caret, so +.Sp +.Vb 1 +\& (?^x:foo) +.Ve +.Sp +is equivalent to +.Sp +.Vb 1 +\& (?x\-imns:foo) +.Ve +.Sp +The caret tells Perl that this cluster doesn't inherit the flags of any +surrounding pattern, but uses the system defaults (\f(CW\*(C`d\-imnsx\*(C'\fR), +modified by any flags specified. +.Sp +The caret allows for simpler stringification of compiled regular +expressions. These look like +.Sp +.Vb 1 +\& (?^:pattern) +.Ve +.Sp +with any non-default flags appearing between the caret and the colon. +A test that looks at such stringification thus doesn't need to have the +system default flags hard-coded in it, just the caret. If new flags are +added to Perl, the meaning of the caret's expansion will change to include +the default for those flags, so the test will still work, unchanged. +.Sp +Specifying a negative flag after the caret is an error, as the flag is +redundant. +.Sp +Mnemonic for \f(CW\*(C`(?^...)\*(C'\fR: A fresh beginning since the usual use of a caret is +to match at the beginning. +.ie n .IP """(?|\fIpattern\fR)""" 4 +.el .IP \f(CW(?|\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Xref "(?|) Branch reset" +.IX Item "(?|pattern)" +This is the "branch reset" pattern, which has the special property +that the capture groups are numbered from the same starting point +in each alternation branch. It is available starting from perl 5.10.0. +.Sp +Capture groups are numbered from left to right, but inside this +construct the numbering is restarted for each branch. +.Sp +The numbering within each branch will be as normal, and any groups +following this construct will be numbered as though the construct +contained only one branch, that being the one with the most capture +groups in it. +.Sp +This construct is useful when you want to capture one of a +number of alternative matches. +.Sp +Consider the following pattern. The numbers underneath show in +which group the captured content will be stored. 
+.Sp +.Vb 3 +\& # before \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-branch\-reset\-\-\-\-\-\-\-\-\-\-\- after +\& / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x +\& # 1 2 2 3 2 3 4 +.Ve +.Sp +Be careful when using the branch reset pattern in combination with +named captures. Named captures are implemented as being aliases to +numbered groups holding the captures, and that interferes with the +implementation of the branch reset pattern. If you are using named +captures in a branch reset pattern, it's best to use the same names, +in the same order, in each of the alternations: +.Sp +.Vb 2 +\& /(?| (?<a> x ) (?<b> y ) +\& | (?<a> z ) (?<b> w )) /x +.Ve +.Sp +Not doing so may lead to surprises: +.Sp +.Vb 3 +\& "12" =~ /(?| (?<a> \ed+ ) | (?<b> \eD+))/x; +\& say $+{a}; # Prints \*(Aq12\*(Aq +\& say $+{b}; # *Also* prints \*(Aq12\*(Aq. +.Ve +.Sp +The problem here is that both the group named \f(CW\*(C`a\*(C'\fR and the group +named \f(CW\*(C`b\*(C'\fR are aliases for the group belonging to \f(CW$1\fR. +.IP "Lookaround Assertions" 4 +.IX Xref "look-around assertion lookaround assertion look-around lookaround" +.IX Item "Lookaround Assertions" +Lookaround assertions are zero-width patterns which match a specific +pattern without including it in \f(CW$&\fR. Positive assertions match when +their subpattern matches, negative assertions match when their subpattern +fails. Lookbehind matches text up to the current match position, +lookahead matches text following the current match position. 
+.RS 4 +.ie n .IP """(?=\fIpattern\fR)""" 4 +.el .IP \f(CW(?=\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(?=pattern)" +.PD 0 +.ie n .IP """(*pla:\fIpattern\fR)""" 4 +.el .IP \f(CW(*pla:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(*pla:pattern)" +.ie n .IP """(*positive_lookahead:\fIpattern\fR)""" 4 +.el .IP \f(CW(*positive_lookahead:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Xref "(?=) (*pla (*positive_lookahead look-ahead, positive lookahead, positive" +.IX Item "(*positive_lookahead:pattern)" +.PD +A zero-width positive lookahead assertion. For example, \f(CW\*(C`/\ew+(?=\et)/\*(C'\fR +matches a word followed by a tab, without including the tab in \f(CW$&\fR. +.ie n .IP """(?!\fIpattern\fR)""" 4 +.el .IP \f(CW(?!\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(?!pattern)" +.PD 0 +.ie n .IP """(*nla:\fIpattern\fR)""" 4 +.el .IP \f(CW(*nla:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(*nla:pattern)" +.ie n .IP """(*negative_lookahead:\fIpattern\fR)""" 4 +.el .IP \f(CW(*negative_lookahead:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Xref "(?!) (*nla (*negative_lookahead look-ahead, negative lookahead, negative" +.IX Item "(*negative_lookahead:pattern)" +.PD +A zero-width negative lookahead assertion. For example \f(CW\*(C`/foo(?!bar)/\*(C'\fR +matches any occurrence of "foo" that isn't followed by "bar". Note +however that lookahead and lookbehind are NOT the same thing. You cannot +use this for lookbehind. +.Sp +If you are looking for a "bar" that isn't preceded by a "foo", \f(CW\*(C`/(?!foo)bar/\*(C'\fR +will not do what you want. That's because the \f(CW\*(C`(?!foo)\*(C'\fR is just saying that +the next thing cannot be "foo"\-\-and it's not, it's a "bar", so "foobar" will +match. Use lookbehind instead (see below). 
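+.Sp
+The difference is easy to check directly. A minimal sketch, using two
+made-up sample strings:
+.Sp
+.Vb 5
+\&    for my $s ("foobar", "bazbar") {
+\&        my $ahead  = $s =~ /(?!foo)bar/  ? "matches" : "fails";
+\&        my $behind = $s =~ /(?<!foo)bar/ ? "matches" : "fails";
+\&        print "$s: lookahead $ahead, lookbehind $behind\en";
+\&    }
+.Ve
+.Sp
+For \f(CW"foobar"\fR the lookahead form matches while the lookbehind form
+fails; for \f(CW"bazbar"\fR both forms match.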
+.ie n .IP """(?<=\fIpattern\fR)""" 4 +.el .IP \f(CW(?<=\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(?<=pattern)" +.PD 0 +.ie n .IP """\eK""" 4 +.el .IP \f(CW\eK\fR 4 +.IX Item "K" +.ie n .IP """(*plb:\fIpattern\fR)""" 4 +.el .IP \f(CW(*plb:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(*plb:pattern)" +.ie n .IP """(*positive_lookbehind:\fIpattern\fR)""" 4 +.el .IP \f(CW(*positive_lookbehind:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Xref "(?<=) (*plb (*positive_lookbehind look-behind, positive lookbehind, positive \\K" +.IX Item "(*positive_lookbehind:pattern)" +.PD +A zero-width positive lookbehind assertion. For example, \f(CW\*(C`/(?<=\et)\ew+/\*(C'\fR +matches a word that follows a tab, without including the tab in \f(CW$&\fR. +.Sp +Prior to Perl 5.30, it worked only for fixed-width lookbehind, but +starting in that release, it can handle variable lengths from 1 to 255 +characters as an experimental feature. The feature is enabled +automatically if you use a variable length positive lookbehind assertion. +.Sp +In Perl 5.35.10 the scope of the experimental nature of this construct +has been reduced, and experimental warnings will only be produced when +the construct contains capturing parentheses. The warnings will be +raised at pattern compilation time, unless turned off, in the +\&\f(CW\*(C`experimental::vlb\*(C'\fR category. This is to warn you that the exact +contents of capturing buffers in a variable length positive lookbehind +are not well defined and are subject to change in a future release of perl. +.Sp +Currently if you use capture buffers inside of a positive variable length +lookbehind the result will be the longest and thus leftmost match possible. +This means that +.Sp +.Vb 4 +\& "aax" =~ /(?=x)(?<=(a|aa))/ +\& "aax" =~ /(?=x)(?<=(aa|a))/ +\& "aax" =~ /(?=x)(?<=(a{1,2}?))/ +\& "aax" =~ /(?=x)(?<=(a{1,2}))/ +.Ve +.Sp +will all result in \f(CW$1\fR containing \f(CW"aa"\fR. It is possible in a future +release of perl we will change this behavior.
+.Sp +There is a special form of this construct, called \f(CW\*(C`\eK\*(C'\fR +(available since Perl 5.10.0), which causes the +regex engine to "keep" everything it had matched prior to the \f(CW\*(C`\eK\*(C'\fR and +not include it in \f(CW$&\fR. This effectively provides non-experimental +variable-length lookbehind of any length. +.Sp +And, there is a technique that can be used to handle variable length +lookbehinds on earlier releases, and longer than 255 characters. It is +described in +<http://www.drregex.com/2019/02/variable\-length\-lookbehinds\-actually.html>. +.Sp +Note that under \f(CW\*(C`/i\*(C'\fR, a few single characters match two or three other +characters. This makes them variable length, and the 255 length applies +to the maximum number of characters in the match. For +example \f(CW\*(C`qr/\eN{LATIN SMALL LETTER SHARP S}/i\*(C'\fR matches the sequence +\&\f(CW"ss"\fR. Your lookbehind assertion could contain 127 Sharp S +characters under \f(CW\*(C`/i\*(C'\fR, but adding a 128th would generate a compilation +error, as that could match 256 \f(CW"s"\fR characters in a row. +.Sp +The use of \f(CW\*(C`\eK\*(C'\fR inside of another lookaround assertion +is allowed, but the behaviour is currently not well defined. +.Sp +For various reasons \f(CW\*(C`\eK\*(C'\fR may be significantly more efficient than the +equivalent \f(CW\*(C`(?<=...)\*(C'\fR construct, and it is especially useful in +situations where you want to efficiently remove something following +something else in a string. For instance +.Sp +.Vb 1 +\& s/(foo)bar/$1/g; +.Ve +.Sp +can be rewritten as the much more efficient +.Sp +.Vb 1 +\& s/foo\eKbar//g; +.Ve +.Sp +Use of the non-greedy modifier \f(CW"?"\fR may not give you the expected +results if it is within a capturing group within the construct. 
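+.Sp
+As a further sketch of \f(CW\*(C`\eK\*(C'\fR standing in for a lookbehind of
+unbounded length, the following replaces only the value part of a
+\&\f(CW\*(C`key = value\*(C'\fR line. The variable names and the sample line are
+invented for this illustration:
+.Sp
+.Vb 3
+\&    my $line = "password = hunter2";
+\&    (my $masked = $line) =~ s/\ew+\es*=\es*\eK.*/REDACTED/;
+\&    print "$masked\en";    # password = REDACTED
+.Ve
+.Sp
+Everything matched before \f(CW\*(C`\eK\*(C'\fR is kept in the string; only the text
+matched after it is replaced.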
+.ie n .IP """(?<!\fIpattern\fR)""" 4 +.el .IP \f(CW(?<!\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(?<!pattern)" +.PD 0 +.ie n .IP """(*nlb:\fIpattern\fR)""" 4 +.el .IP \f(CW(*nlb:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(*nlb:pattern)" +.ie n .IP """(*negative_lookbehind:\fIpattern\fR)""" 4 +.el .IP \f(CW(*negative_lookbehind:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Xref "(?<!) (*nlb (*negative_lookbehind look-behind, negative lookbehind, negative" +.IX Item "(*negative_lookbehind:pattern)" +.PD +A zero-width negative lookbehind assertion. For example \f(CW\*(C`/(?<!bar)foo/\*(C'\fR +matches any occurrence of "foo" that does not follow "bar". +.Sp +Prior to Perl 5.30, it worked only for fixed-width lookbehind, but +starting in that release, it can handle variable lengths from 1 to 255 +characters as an experimental feature. The feature is enabled +automatically if you use a variable length negative lookbehind assertion. +.Sp +In Perl 5.35.10 the scope of the experimental nature of this construct +has been reduced, and experimental warnings will only be produced when +the construct contains capturing parentheses. The warnings will be +raised at pattern compilation time, unless turned off, in the +\&\f(CW\*(C`experimental::vlb\*(C'\fR category. This is to warn you that the exact +contents of capturing buffers in a variable length negative lookbehind +are not well defined and are subject to change in a future release of perl. +.Sp +Currently if you use capture buffers inside of a negative variable length +lookbehind the result may not be what you expect, for instance: +.Sp +.Vb 1 +\& say "axfoo"=~/(?=foo)(?<!(a|ax)(?{ say $1 }))/ ? "y" : "n"; +.Ve +.Sp +will output the following: +.Sp +.Vb 2 +\& a +\& n +.Ve +.Sp +which does not make sense as this should print out "ax" as the "a" does +not line up at the correct place.
As another example, +.Sp +.Vb 1 +\& say "yes: \*(Aq$1\-$2\*(Aq" if "aayfoo"=~/(?=foo)(?<!(a|aa)(a|aa)x)/; +.Ve +.Sp +will output the following: +.Sp +.Vb 1 +\& yes: \*(Aqaa\-a\*(Aq +.Ve +.Sp +It is possible in a future release of perl we will change this behavior +so both of these examples produce more reasonable output. +.Sp +Note that we are confident that the construct will match and reject +patterns appropriately; the undefined behavior strictly relates to the +value of the capture buffer during or after matching. +.Sp +There is a technique that can be used to handle variable length +lookbehind on earlier releases, and longer than 255 characters. It is +described in +<http://www.drregex.com/2019/02/variable\-length\-lookbehinds\-actually.html>. +.Sp +Note that under \f(CW\*(C`/i\*(C'\fR, a few single characters match two or three other +characters. This makes them variable length, and the 255 length applies +to the maximum number of characters in the match. For +example \f(CW\*(C`qr/\eN{LATIN SMALL LETTER SHARP S}/i\*(C'\fR matches the sequence +\&\f(CW"ss"\fR. Your lookbehind assertion could contain 127 Sharp S +characters under \f(CW\*(C`/i\*(C'\fR, but adding a 128th would generate a compilation +error, as that could match 256 \f(CW"s"\fR characters in a row. +.Sp +Use of the non-greedy modifier \f(CW"?"\fR may not give you the expected +results if it is within a capturing group within the construct. +.RE +.RS 4 +.RE +.ie n .IP """(?<\fINAME\fR>\fIpattern\fR)""" 4 +.el .IP \f(CW(?<\fR\f(CINAME\fR\f(CW>\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(?<NAME>pattern)" +.PD 0 +.ie n .IP """(?\*(Aq\fINAME\fR\*(Aq\fIpattern\fR)""" 4 +.el .IP \f(CW(?\*(Aq\fR\f(CINAME\fR\f(CW\*(Aq\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Xref "(?<NAME>) (?'NAME') named capture capture" +.IX Item "(?NAMEpattern)" +.PD +A named capture group.
Identical in every respect to normal capturing +parentheses \f(CW\*(C`()\*(C'\fR but for the additional fact that the group +can be referred to by name in various regular expression +constructs (like \f(CW\*(C`\eg{\fR\f(CINAME\fR\f(CW}\*(C'\fR) and can be accessed by name +after a successful match via \f(CW\*(C`%+\*(C'\fR or \f(CW\*(C`%\-\*(C'\fR. See perlvar +for more details on the \f(CW\*(C`%+\*(C'\fR and \f(CW\*(C`%\-\*(C'\fR hashes. +.Sp +If multiple distinct capture groups have the same name, then +\&\f(CW$+{\fR\f(CINAME\fR\f(CW}\fR will refer to the leftmost defined group in the match. +.Sp +The forms \f(CW\*(C`(?\*(Aq\fR\f(CINAME\fR\f(CW\*(Aq\fR\f(CIpattern\fR\f(CW)\*(C'\fR and \f(CW\*(C`(?<\fR\f(CINAME\fR\f(CW>\fR\f(CIpattern\fR\f(CW)\*(C'\fR +are equivalent. +.Sp +\&\fBNOTE:\fR While the notation of this construct is the same as the similar +function in .NET regexes, the behavior is not. In Perl the groups are +numbered sequentially regardless of being named or not. Thus in the +pattern +.Sp +.Vb 1 +\& /(x)(?<foo>y)(z)/ +.Ve +.Sp +\&\f(CW$+{foo}\fR will be the same as \f(CW$2\fR, and \f(CW$3\fR will contain 'z' instead of +the opposite which is what a .NET regex hacker might expect. +.Sp +Currently \fINAME\fR is restricted to simple identifiers only. +In other words, it must match \f(CW\*(C`/^[_A\-Za\-z][_A\-Za\-z0\-9]*\ez/\*(C'\fR or +its Unicode extension (see utf8), +though it isn't extended by the locale (see perllocale). +.Sp +\&\fBNOTE:\fR In order to make things easier for programmers with experience +with the Python or PCRE regex engines, the pattern \f(CW\*(C`(?P<\fR\f(CINAME\fR\f(CW>\fR\f(CIpattern\fR\f(CW)\*(C'\fR +may be used instead of \f(CW\*(C`(?<\fR\f(CINAME\fR\f(CW>\fR\f(CIpattern\fR\f(CW)\*(C'\fR; however this form does not +support the use of single quotes as a delimiter for the name. 
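+.Sp
+A short sketch of named groups in use, assuming a made-up date string;
+the group names \f(CW\*(C`year\*(C'\fR, \f(CW\*(C`month\*(C'\fR and \f(CW\*(C`day\*(C'\fR are chosen only for
+this illustration. It also shows that a named group still receives its
+ordinary number:
+.Sp
+.Vb 4
+\&    if ("2024\-04\-15" =~ /(?<year>\ed{4})\-(?<month>\ed\ed)\-(?<day>\ed\ed)/) {
+\&        print "$+{year}/$+{month}/$+{day}\en";    # 2024/04/15
+\&        print "second group: $2\en";             # second group: 04
+\&    }
+.Ve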
+.ie n .IP """\ek<\fINAME\fR>""" 4 +.el .IP \f(CW\ek<\fR\f(CINAME\fR\f(CW>\fR 4 +.IX Item "k<NAME>" +.PD 0 +.ie n .IP """\ek\*(Aq\fINAME\fR\*(Aq""" 4 +.el .IP \f(CW\ek\*(Aq\fR\f(CINAME\fR\f(CW\*(Aq\fR 4 +.IX Item "kNAME" +.ie n .IP """\ek{\fINAME\fR}""" 4 +.el .IP \f(CW\ek{\fR\f(CINAME\fR\f(CW}\fR 4 +.IX Item "k{NAME}" +.PD +Named backreference. Similar to numeric backreferences, except that +the group is designated by name and not number. If multiple groups +have the same name then it refers to the leftmost defined group in +the current match. +.Sp +It is an error to refer to a name not defined by a \f(CW\*(C`(?<\fR\f(CINAME\fR\f(CW>)\*(C'\fR +earlier in the pattern. +.Sp +All three forms are equivalent, although with \f(CW\*(C`\ek{ \fR\f(CINAME\fR\f(CW }\*(C'\fR, +you may optionally have blanks within but adjacent to the braces, as +shown. +.Sp +\&\fBNOTE:\fR In order to make things easier for programmers with experience +with the Python or PCRE regex engines, the pattern \f(CW\*(C`(?P=\fR\f(CINAME\fR\f(CW)\*(C'\fR +may be used instead of \f(CW\*(C`\ek<\fR\f(CINAME\fR\f(CW>\*(C'\fR. +.ie n .IP """(?{ \fIcode\fR })""" 4 +.el .IP "\f(CW(?{ \fR\f(CIcode\fR\f(CW })\fR" 4 +.IX Xref "(?{}) regex, code in regexp, code in regular expression, code in" +.IX Item "(?{ code })" +\&\fBWARNING\fR: Using this feature safely requires that you understand its +limitations. Code executed that has side effects may not perform identically +from version to version due to the effect of future optimisations in the regex +engine. For more information on this, see "Embedded Code Execution +Frequency". +.Sp +This zero-width assertion executes any embedded Perl code. It always +succeeds, and its return value is set as \f(CW$^R\fR. +.Sp +In literal patterns, the code is parsed at the same time as the +surrounding code. While within the pattern, control is passed temporarily +back to the perl parser, until the logically-balancing closing brace is +encountered. 
This is similar to the way that an array index expression in +a literal string is handled, for example +.Sp +.Vb 1 +\& "abc$array[ 1 + f(\*(Aq[\*(Aq) + g()]def" +.Ve +.Sp +In particular, braces do not need to be balanced: +.Sp +.Vb 1 +\& s/abc(?{ f(\*(Aq{\*(Aq); })/def/ +.Ve +.Sp +Even in a pattern that is interpolated and compiled at run-time, literal +code blocks will be compiled once, at perl compile time; the following +prints "ABCD": +.Sp +.Vb 5 +\& print "D"; +\& my $qr = qr/(?{ BEGIN { print "A" } })/; +\& my $foo = "foo"; +\& /$foo$qr(?{ BEGIN { print "B" } })/; +\& BEGIN { print "C" } +.Ve +.Sp +In patterns where the text of the code is derived from run-time +information rather than appearing literally in a source code /pattern/, +the code is compiled at the same time that the pattern is compiled, and +for reasons of security, \f(CW\*(C`use re \*(Aqeval\*(Aq\*(C'\fR must be in scope. This is to +stop user-supplied patterns containing code snippets from being +executable. +.Sp +In situations where you need to enable this with \f(CW\*(C`use re \*(Aqeval\*(Aq\*(C'\fR, you should +also have taint checking enabled, if your perl supports it. +Better yet, use the carefully constrained evaluation within a Safe compartment. +See perlsec for details about both these mechanisms. +.Sp +From the viewpoint of parsing, lexical variable scope and closures, +.Sp +.Vb 1 +\& /AAA(?{ BBB })CCC/ +.Ve +.Sp +behaves approximately like +.Sp +.Vb 1 +\& /AAA/ && do { BBB } && /CCC/ +.Ve +.Sp +Similarly, +.Sp +.Vb 1 +\& qr/AAA(?{ BBB })CCC/ +.Ve +.Sp +behaves approximately like +.Sp +.Vb 1 +\& sub { /AAA/ && do { BBB } && /CCC/ } +.Ve +.Sp +In particular: +.Sp +.Vb 3 +\& { my $i = 1; $r = qr/(?{ print $i })/ } +\& my $i = 2; +\& /$r/; # prints "1" +.Ve +.Sp +Inside a \f(CW\*(C`(?{...})\*(C'\fR block, \f(CW$_\fR refers to the string the regular +expression is matching against. You can also use \f(CWpos()\fR to know what is +the current position of matching within this string. 
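+.Sp
+As a rough sketch of \f(CWpos()\fR inside a code block, the following records
+the match position each time the block runs. The array name \f(CW@ends\fR and
+the sample string are invented for this example, and no expected output is
+given because how often a code block runs is not guaranteed (see
+"Embedded Code Execution Frequency"):
+.Sp
+.Vb 3
+\&    my @ends;
+\&    "abc" =~ /(?:\ew(?{ push @ends, pos() }))+/;
+\&    print "@ends\en";
+.Ve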
+.Sp +The code block introduces a new scope from the perspective of lexical +variable declarations, but \fBnot\fR from the perspective of \f(CW\*(C`local\*(C'\fR and +similar localizing behaviours. So later code blocks within the same +pattern will still see the values which were localized in earlier blocks. +These accumulated localizations are undone either at the end of a +successful match, or if the assertion is backtracked (compare +"Backtracking"). For example, +.Sp +.Vb 10 +\& $_ = \*(Aqa\*(Aq x 8; +\& m< +\& (?{ $cnt = 0 }) # Initialize $cnt. +\& ( +\& a +\& (?{ +\& local $cnt = $cnt + 1; # Update $cnt, +\& # backtracking\-safe. +\& }) +\& )* +\& aaaa +\& (?{ $res = $cnt }) # On success copy to +\& # non\-localized location. +\& >x; +.Ve +.Sp +will initially increment \f(CW$cnt\fR up to 8; then during backtracking, its +value will be unwound back to 4, which is the value assigned to \f(CW$res\fR. +At the end of the regex execution, \f(CW$cnt\fR will be wound back to its initial +value of 0. +.Sp +This assertion may be used as the condition in a +.Sp +.Vb 1 +\& (?(condition)yes\-pattern|no\-pattern) +.Ve +.Sp +switch. If \fInot\fR used in this way, the result of evaluation of \fIcode\fR +is put into the special variable \f(CW$^R\fR. This happens immediately, so +\&\f(CW$^R\fR can be used from other \f(CW\*(C`(?{ \fR\f(CIcode\fR\f(CW })\*(C'\fR assertions inside the same +regular expression. +.Sp +The assignment to \f(CW$^R\fR above is properly localized, so the old +value of \f(CW$^R\fR is restored if the assertion is backtracked; compare +"Backtracking". +.Sp +Note that the special variable \f(CW$^N\fR is particularly useful with code +blocks to capture the results of submatches in variables without having to +keep track of the number of nested parentheses. 
For example: +.Sp +.Vb 3 +\& $_ = "The brown fox jumps over the lazy dog"; +\& /the (\eS+)(?{ $color = $^N }) (\eS+)(?{ $animal = $^N })/i; +\& print "color = $color, animal = $animal\en"; +.Ve +.Sp +The use of this construct disables some optimisations globally in the +pattern, and the pattern may execute much more slowly as a consequence. +Use a \f(CW\*(C`*\*(C'\fR instead of the \f(CW\*(C`?\*(C'\fR block to create an optimistic form of +this construct. \f(CW\*(C`(*{ ... })\*(C'\fR should not disable any optimisations. +.ie n .IP """(*{ \fIcode\fR })""" 4 +.el .IP "\f(CW(*{ \fR\f(CIcode\fR\f(CW })\fR" 4 +.IX Xref "(*{}) regex, optimistic code" +.IX Item "(*{ code })" +This is *exactly* the same as \f(CW\*(C`(?{ \fR\f(CIcode\fR\f(CW })\*(C'\fR with the exception +that it does not disable \fBany\fR optimisations at all in the regex engine. +How often it is executed may vary from perl release to perl release. +In a failing match it may not even be executed at all. +.ie n .IP """(??{ \fIcode\fR })""" 4 +.el .IP "\f(CW(??{ \fR\f(CIcode\fR\f(CW })\fR" 4 +.IX Xref "(??{}) regex, postponed regexp, postponed regular expression, postponed" +.IX Item "(??{ code })" +\&\fBWARNING\fR: Using this feature safely requires that you understand its +limitations. Code executed that has side effects may not perform +identically from version to version due to the effect of future +optimisations in the regex engine. For more information on this, see +"Embedded Code Execution Frequency". +.Sp +This is a "postponed" regular subexpression. It behaves in \fIexactly\fR the +same way as a \f(CW\*(C`(?{ \fR\f(CIcode\fR\f(CW })\*(C'\fR code block as described above, except that +its return value, rather than being assigned to \f(CW$^R\fR, is treated as a +pattern, compiled if it's a string (or used as-is if it's a qr// object), +then matched as if it were inserted instead of this construct.
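As an illustration of a return value being used as a pattern, here is a small sketch built around a hypothetical "count:payload" format (the format and the name $counted are assumptions made for this example, not something perlre defines). The code block returns a string such as ".{3}", which is compiled and matched at that point:

```perl
use strict;
use warnings;

# Hypothetical format: "<count>:<payload>", where the payload must be
# exactly <count> characters long. The code block reads the digits
# captured by the enclosing pattern and returns the sub-pattern ".{N}".
my $counted = qr/^(\d+):(??{ ".{$1}" })$/;

for my $input ('3:abc', '3:ab', '5:hello') {
    my $verdict = $input =~ $counted ? 'matches' : 'does not match';
    printf "%-8s %s\n", $input, $verdict;
}
```

Because the returned string contains no code blocks of its own, this sketch does not need `use re 'eval'`.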
+.Sp +During the matching of this sub-pattern, it has its own set of +captures which are valid during the sub-match, but are discarded once +control returns to the main pattern. For example, the following matches, +with the inner pattern capturing "B" and matching "BB", while the outer +pattern captures "A"; +.Sp +.Vb 3 +\& my $inner = \*(Aq(.)\e1\*(Aq; +\& "ABBA" =~ /^(.)(??{ $inner })\e1/; +\& print $1; # prints "A"; +.Ve +.Sp +Note that this means that there is no way for the inner pattern to refer +to a capture group defined outside. (The code block itself can use \f(CW$1\fR, +\&\fIetc\fR., to refer to the enclosing pattern's capture groups.) Thus, although +.Sp +.Vb 1 +\& (\*(Aqa\*(Aq x 100)=~/(??{\*(Aq(.)\*(Aq x 100})/ +.Ve +.Sp +\&\fIwill\fR match, it will \fInot\fR set \f(CW$1\fR on exit. +.Sp +The following pattern matches a parenthesized group: +.Sp +.Vb 9 +\& $re = qr{ +\& \e( +\& (?: +\& (?> [^()]+ ) # Non\-parens without backtracking +\& | +\& (??{ $re }) # Group with matching parens +\& )* +\& \e) +\& }x; +.Ve +.Sp +See also +\&\f(CW\*(C`(?\fR\f(CIPARNO\fR\f(CW)\*(C'\fR +for a different, more efficient way to accomplish +the same task. +.Sp +Executing a postponed regular expression too many times without +consuming any input string will also result in a fatal error. The depth +at which that happens is compiled into perl, so it can be changed with a +custom build. +.Sp +The use of this construct disables some optimisations globally in the pattern, +and the pattern may execute much slower as a consequence. 
+.ie n .IP """(?\fIPARNO\fR)"" ""(?\-\fIPARNO\fR)"" ""(?+\fIPARNO\fR)"" ""(?R)"" ""(?0)""" 4 +.el .IP "\f(CW(?\fR\f(CIPARNO\fR\f(CW)\fR \f(CW(?\-\fR\f(CIPARNO\fR\f(CW)\fR \f(CW(?+\fR\f(CIPARNO\fR\f(CW)\fR \f(CW(?R)\fR \f(CW(?0)\fR" 4 +.IX Xref "(?PARNO) (?1) (?R) (?0) (?-1) (?+1) (?-PARNO) (?+PARNO) regex, recursive regexp, recursive regular expression, recursive regex, relative recursion GOSUB GOSTART" +.IX Item "(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)" +Recursive subpattern. Treat the contents of a given capture buffer in the +current pattern as an independent subpattern and attempt to match it at +the current position in the string. Information about capture state from +the caller for things like backreferences is available to the subpattern, +but capture buffers set by the subpattern are not visible to the caller. +.Sp +Similar to \f(CW\*(C`(??{ \fR\f(CIcode\fR\f(CW })\*(C'\fR except that it does not involve executing any +code or potentially compiling a returned pattern string; instead it treats +the part of the current pattern contained within a specified capture group +as an independent pattern that must match at the current position. Also +different is the treatment of capture buffers, unlike \f(CW\*(C`(??{ \fR\f(CIcode\fR\f(CW })\*(C'\fR +recursive patterns have access to their caller's match state, so one can +use backreferences safely. +.Sp +\&\fIPARNO\fR is a sequence of digits (not starting with 0) whose value reflects +the paren-number of the capture group to recurse to. \f(CW\*(C`(?R)\*(C'\fR recurses to +the beginning of the whole pattern. \f(CW\*(C`(?0)\*(C'\fR is an alternate syntax for +\&\f(CW\*(C`(?R)\*(C'\fR. If \fIPARNO\fR is preceded by a plus or minus sign then it is assumed +to be relative, with negative numbers indicating preceding capture groups +and positive ones following. Thus \f(CW\*(C`(?\-1)\*(C'\fR refers to the most recently +declared group, and \f(CW\*(C`(?+1)\*(C'\fR indicates the next group to be declared. 
+Note that the counting for relative recursion differs from that of +relative backreferences, in that with recursion unclosed groups \fBare\fR +included. +.Sp +The following pattern matches a function \f(CWfoo()\fR which may contain +balanced parentheses as the argument. +.Sp +.Vb 10 +\& $re = qr{ ( # paren group 1 (full function) +\& foo +\& ( # paren group 2 (parens) +\& \e( +\& ( # paren group 3 (contents of parens) +\& (?: +\& (?> [^()]+ ) # Non\-parens without backtracking +\& | +\& (?2) # Recurse to start of paren group 2 +\& )* +\& ) +\& \e) +\& ) +\& ) +\& }x; +.Ve +.Sp +If the pattern was used as follows +.Sp +.Vb 4 +\& \*(Aqfoo(bar(baz)+baz(bop))\*(Aq=~/$re/ +\& and print "\e$1 = $1\en", +\& "\e$2 = $2\en", +\& "\e$3 = $3\en"; +.Ve +.Sp +the output produced should be the following: +.Sp +.Vb 3 +\& $1 = foo(bar(baz)+baz(bop)) +\& $2 = (bar(baz)+baz(bop)) +\& $3 = bar(baz)+baz(bop) +.Ve +.Sp +If there is no corresponding capture group defined, then it is a +fatal error. Recursing deeply without consuming any input string will +also result in a fatal error. The depth at which that happens is +compiled into perl, so it can be changed with a custom build. +.Sp +The following shows how using negative indexing can make it +easier to embed recursive patterns inside of a \f(CW\*(C`qr//\*(C'\fR construct +for later use: +.Sp +.Vb 4 +\& my $parens = qr/(\e((?:[^()]++|(?\-1))*+\e))/; +\& if (/foo $parens \es+ \e+ \es+ bar $parens/x) { +\& # do something here... +\& } +.Ve +.Sp +\&\fBNote\fR that this pattern does not behave the same way as the equivalent +PCRE or Python construct of the same form. In Perl you can backtrack into +a recursed group, in PCRE and Python the recursed into group is treated +as atomic. Also, modifiers are resolved at compile time, so constructs +like \f(CW\*(C`(?i:(?1))\*(C'\fR or \f(CW\*(C`(?:(?i)(?1))\*(C'\fR do not affect how the sub-pattern will +be processed. 
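The relative-recursion idiom with $parens can be exercised as in the following sketch. The input string and the variables $text, $first and $second are assumptions chosen for illustration. Because the pattern is interpolated twice, each copy's (?-1) refers to its own enclosing group, which is what makes the qr// object reusable:

```perl
use strict;
use warnings;

# Same pattern as in the text: a balanced parenthesized group that
# recurses into itself through the relative reference (?-1).
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

my $text = 'foo(a(b)) + bar(c)';
my ($first, $second);

if ($text =~ /foo $parens \s+ \+ \s+ bar $parens/x) {
    ($first, $second) = ($1, $2);
    print "first = $first, second = $second\n";
}
```

With an absolute reference such as (?1), the second copy would recurse into the first copy's group instead of its own.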
+.ie n .IP """(?&\fINAME\fR)""" 4 +.el .IP \f(CW(?&\fR\f(CINAME\fR\f(CW)\fR 4 +.IX Xref "(?&NAME)" +.IX Item "(?&NAME)" +Recurse to a named subpattern. Identical to \f(CW\*(C`(?\fR\f(CIPARNO\fR\f(CW)\*(C'\fR except that the +parenthesis to recurse to is determined by name. If multiple parentheses have +the same name, then it recurses to the leftmost. +.Sp +It is an error to refer to a name that is not declared somewhere in the +pattern. +.Sp +\&\fBNOTE:\fR In order to make things easier for programmers with experience +with the Python or PCRE regex engines the pattern \f(CW\*(C`(?P>\fR\f(CINAME\fR\f(CW)\*(C'\fR +may be used instead of \f(CW\*(C`(?&\fR\f(CINAME\fR\f(CW)\*(C'\fR. +.ie n .IP """(?(\fIcondition\fR)\fIyes\-pattern\fR|\fIno\-pattern\fR)""" 4 +.el .IP \f(CW(?(\fR\f(CIcondition\fR\f(CW)\fR\f(CIyes\-pattern\fR\f(CW|\fR\f(CIno\-pattern\fR\f(CW)\fR 4 +.IX Xref "(?()" +.IX Item "(?(condition)yes-pattern|no-pattern)" +.PD 0 +.ie n .IP """(?(\fIcondition\fR)\fIyes\-pattern\fR)""" 4 +.el .IP \f(CW(?(\fR\f(CIcondition\fR\f(CW)\fR\f(CIyes\-pattern\fR\f(CW)\fR 4 +.IX Item "(?(condition)yes-pattern)" +.PD +Conditional expression. Matches \fIyes-pattern\fR if \fIcondition\fR yields +a true value, matches \fIno-pattern\fR otherwise. A missing pattern always +matches. +.Sp +\&\f(CW\*(C`(\fR\f(CIcondition\fR\f(CW)\*(C'\fR should be one of: +.RS 4 +.IP "an integer in parentheses" 4 +.IX Item "an integer in parentheses" +(which is valid if the corresponding pair of parentheses +matched); +.IP "a lookahead/lookbehind/evaluate zero-width assertion;" 4 +.IX Item "a lookahead/lookbehind/evaluate zero-width assertion;" +.PD 0 +.IP "a name in angle brackets or single quotes" 4 +.IX Item "a name in angle brackets or single quotes" +.PD +(which is valid if a group with the given name matched); +.ie n .IP "the special symbol ""(R)""" 4 +.el .IP "the special symbol \f(CW(R)\fR" 4 +.IX Item "the special symbol (R)" +(true when evaluated inside of recursion or eval). 
Additionally the +\&\f(CW"R"\fR may be +followed by a number, (which will be true when evaluated when recursing +inside of the appropriate group), or by \f(CW\*(C`&\fR\f(CINAME\fR\f(CW\*(C'\fR, in which case it will +be true only when evaluated during recursion in the named group. +.RE +.RS 4 +.Sp +Here's a summary of the possible predicates: +.ie n .IP """(1)"" ""(2)"" ..." 4 +.el .IP "\f(CW(1)\fR \f(CW(2)\fR ..." 4 +.IX Item "(1) (2) ..." +Checks if the numbered capturing group has matched something. +Full syntax: \f(CW\*(C`(?(1)then|else)\*(C'\fR +.ie n .IP """(<\fINAME\fR>)"" ""(\*(Aq\fINAME\fR\*(Aq)""" 4 +.el .IP "\f(CW(<\fR\f(CINAME\fR\f(CW>)\fR \f(CW(\*(Aq\fR\f(CINAME\fR\f(CW\*(Aq)\fR" 4 +.IX Item "(<NAME>) (NAME)" +Checks if a group with the given name has matched something. +Full syntax: \f(CW\*(C`(?(<name>)then|else)\*(C'\fR +.ie n .IP """(?=...)"" ""(?!...)"" ""(?<=...)"" ""(?<!...)""" 4 +.el .IP "\f(CW(?=...)\fR \f(CW(?!...)\fR \f(CW(?<=...)\fR \f(CW(?<!...)\fR" 4 +.IX Item "(?=...) (?!...) (?<=...) (?<!...)" +Checks whether the pattern matches (or does not match, for the \f(CW"!"\fR +variants). +Full syntax: \f(CW\*(C`(?(?=\fR\f(CIlookahead\fR\f(CW)\fR\f(CIthen\fR\f(CW|\fR\f(CIelse\fR\f(CW)\*(C'\fR +.ie n .IP """(?{ \fICODE\fR })""" 4 +.el .IP "\f(CW(?{ \fR\f(CICODE\fR\f(CW })\fR" 4 +.IX Item "(?{ CODE })" +Treats the return value of the code block as the condition. +Full syntax: \f(CW\*(C`(?(?{ \fR\f(CICODE\fR\f(CW })\fR\f(CIthen\fR\f(CW|\fR\f(CIelse\fR\f(CW)\*(C'\fR +.Sp +Note use of this construct may globally affect the performance +of the pattern. Consider using \f(CW\*(C`(*{ \fR\f(CICODE\fR\f(CW })\*(C'\fR +.ie n .IP """(*{ \fICODE\fR })""" 4 +.el .IP "\f(CW(*{ \fR\f(CICODE\fR\f(CW })\fR" 4 +.IX Item "(*{ CODE })" +Treats the return value of the code block as the condition. 
+Full syntax: \f(CW\*(C`(?(*{ \fR\f(CICODE\fR\f(CW })\fR\f(CIthen\fR\f(CW|\fR\f(CIelse\fR\f(CW)\*(C'\fR +.ie n .IP """(R)""" 4 +.el .IP \f(CW(R)\fR 4 +.IX Item "(R)" +Checks if the expression has been evaluated inside of recursion. +Full syntax: \f(CW\*(C`(?(R)\fR\f(CIthen\fR\f(CW|\fR\f(CIelse\fR\f(CW)\*(C'\fR +.ie n .IP """(R1)"" ""(R2)"" ..." 4 +.el .IP "\f(CW(R1)\fR \f(CW(R2)\fR ..." 4 +.IX Item "(R1) (R2) ..." +Checks if the expression has been evaluated while executing directly +inside of the n\-th capture group. This check is the regex equivalent of +.Sp +.Vb 1 +\& if ((caller(0))[3] eq \*(Aqsubname\*(Aq) { ... } +.Ve +.Sp +In other words, it does not check the full recursion stack. +.Sp +Full syntax: \f(CW\*(C`(?(R1)\fR\f(CIthen\fR\f(CW|\fR\f(CIelse\fR\f(CW)\*(C'\fR +.ie n .IP """(R&\fINAME\fR)""" 4 +.el .IP \f(CW(R&\fR\f(CINAME\fR\f(CW)\fR 4 +.IX Item "(R&NAME)" +Similar to \f(CW\*(C`(R1)\*(C'\fR, this predicate checks to see if we're executing +directly inside of the leftmost group with a given name (this is the same +logic used by \f(CW\*(C`(?&\fR\f(CINAME\fR\f(CW)\*(C'\fR to disambiguate). It does not check the full +stack, but only the name of the innermost active recursion. +Full syntax: \f(CW\*(C`(?(R&\fR\f(CIname\fR\f(CW)\fR\f(CIthen\fR\f(CW|\fR\f(CIelse\fR\f(CW)\*(C'\fR +.ie n .IP """(DEFINE)""" 4 +.el .IP \f(CW(DEFINE)\fR 4 +.IX Item "(DEFINE)" +In this case, the yes-pattern is never directly executed, and no +no-pattern is allowed. Similar in spirit to \f(CW\*(C`(?{0})\*(C'\fR but more efficient. +See below for details. +Full syntax: \f(CW\*(C`(?(DEFINE)\fR\f(CIdefinitions\fR\f(CW...)\*(C'\fR +.RE +.RS 4 +.Sp +For example: +.Sp +.Vb 4 +\& m{ ( \e( )? +\& [^()]+ +\& (?(1) \e) ) +\& }x +.Ve +.Sp +matches a chunk of non-parentheses, possibly included in parentheses +themselves. +.Sp +A special form is the \f(CW\*(C`(DEFINE)\*(C'\fR predicate, which never executes its +yes-pattern directly, and does not allow a no-pattern. 
This allows one to +define subpatterns which will be executed only by the recursion mechanism. +This way, you can define a set of regular expression rules that can be +bundled into any pattern you choose. +.Sp +It is recommended that for this usage you put the DEFINE block at the +end of the pattern, and that you name any subpatterns defined within it. +.Sp +Also, it's worth noting that patterns defined this way probably will +not be as efficient, as the optimizer is not very clever about +handling them. +.Sp +An example of how this might be used is as follows: +.Sp +.Vb 5 +\& /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT)) +\& (?(DEFINE) +\& (?<NAME_PAT>....) +\& (?<ADDRESS_PAT>....) +\& )/x +.Ve +.Sp +Note that capture groups matched inside of recursion are not accessible +after the recursion returns, so the extra layer of capturing groups is +necessary. Thus \f(CW$+{NAME_PAT}\fR would not be defined even though +\&\f(CW$+{NAME}\fR would be. +.Sp +Finally, keep in mind that subpatterns created inside a DEFINE block +count towards the absolute and relative number of captures, so this: +.Sp +.Vb 5 +\& my @captures = "a" =~ /(.) # First capture +\& (?(DEFINE) +\& (?<EXAMPLE> 1 ) # Second capture +\& )/x; +\& say scalar @captures; +.Ve +.Sp +Will output 2, not 1. This is particularly important if you intend to +compile the definitions with the \f(CW\*(C`qr//\*(C'\fR operator, and later +interpolate them in another pattern. +.RE +.ie n .IP """(?>\fIpattern\fR)""" 4 +.el .IP \f(CW(?>\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(?>pattern)" +.PD 0 +.ie n .IP """(*atomic:\fIpattern\fR)""" 4 +.el .IP \f(CW(*atomic:\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Xref "(?>pattern) (*atomic backtrack backtracking atomic possessive" +.IX Item "(*atomic:pattern)" +.PD +An "independent" subexpression, one which matches the substring +that a standalone \fIpattern\fR would match if anchored at the given +position, and it matches \fInothing other than this substring\fR. 
This +construct is useful for optimizations of what would otherwise be +"eternal" matches, because it will not backtrack (see "Backtracking"). +It may also be useful in places where the "grab all you can, and do not +give anything back" semantic is desirable. +.Sp +For example: \f(CW\*(C`^(?>a*)ab\*(C'\fR will never match, since \f(CW\*(C`(?>a*)\*(C'\fR +(anchored at the beginning of string, as above) will match \fIall\fR +characters \f(CW"a"\fR at the beginning of string, leaving no \f(CW"a"\fR for +\&\f(CW\*(C`ab\*(C'\fR to match. In contrast, \f(CW\*(C`a*ab\*(C'\fR will match the same as \f(CW\*(C`a+b\*(C'\fR, +since the match of the subgroup \f(CW\*(C`a*\*(C'\fR is influenced by the following +group \f(CW\*(C`ab\*(C'\fR (see "Backtracking"). In particular, \f(CW\*(C`a*\*(C'\fR inside +\&\f(CW\*(C`a*ab\*(C'\fR will match fewer characters than a standalone \f(CW\*(C`a*\*(C'\fR, since +this makes the tail match. +.Sp +\&\f(CW\*(C`(?>\fR\f(CIpattern\fR\f(CW)\*(C'\fR does not disable backtracking altogether once it has +matched. It is still possible to backtrack past the construct, but not +into it. So \f(CW\*(C`((?>a*)|(?>b*))ar\*(C'\fR will still match "bar". +.Sp +An effect similar to \f(CW\*(C`(?>\fR\f(CIpattern\fR\f(CW)\*(C'\fR may be achieved by writing +\&\f(CW\*(C`(?=(\fR\f(CIpattern\fR\f(CW))\eg{\-1}\*(C'\fR. This matches the same substring as a standalone +\&\f(CW\*(C`a+\*(C'\fR, and the following \f(CW\*(C`\eg{\-1}\*(C'\fR eats the matched string; it therefore +makes a zero-length assertion into an analogue of \f(CW\*(C`(?>...)\*(C'\fR. +(The difference between these two constructs is that the second one +uses a capturing group, thus shifting ordinals of backreferences +in the rest of a regular expression.) +.Sp +Consider this pattern: +.Sp +.Vb 8 +\& m{ \e( +\& ( +\& [^()]+ # x+ +\& | +\& \e( [^()]* \e) +\& )+ +\& \e) +\& }x +.Ve +.Sp +That will efficiently match a nonempty group with matching parentheses +two levels deep or less. 
However, if there is no such group, it +will take virtually forever on a long string. That's because there +are so many different ways to split a long string into several +substrings. This is what \f(CW\*(C`(.+)+\*(C'\fR is doing, and \f(CW\*(C`(.+)+\*(C'\fR is similar +to a subpattern of the above pattern. Consider how the pattern +above detects no-match on \f(CW\*(C`((()aaaaaaaaaaaaaaaaaa\*(C'\fR in several +seconds, but that each extra letter doubles this time. This +exponential performance will make it appear that your program has +hung. However, a tiny change to this pattern +.Sp +.Vb 8 +\& m{ \e( +\& ( +\& (?> [^()]+ ) # change x+ above to (?> x+ ) +\& | +\& \e( [^()]* \e) +\& )+ +\& \e) +\& }x +.Ve +.Sp +which uses \f(CW\*(C`(?>...)\*(C'\fR matches exactly when the one above does (verifying +this yourself would be a productive exercise), but finishes in a quarter of +the time when used on a similar string with 1000000 \f(CW"a"\fRs. Be aware, +however, that when this construct is followed by a +quantifier, it currently triggers a warning message under +the \f(CW\*(C`use warnings\*(C'\fR pragma or \fB\-w\fR switch saying it +\&\f(CW"matches null string many times in regex"\fR. +.Sp +On simple groups, such as the pattern \f(CW\*(C`(?> [^()]+ )\*(C'\fR, a comparable +effect may be achieved by negative lookahead, as in \f(CW\*(C`[^()]+ (?! [^()] )\*(C'\fR. +This was only 4 times slower on a string with 1000000 \f(CW"a"\fRs. +.Sp +The "grab all you can, and do not give anything back" semantic is desirable +in many situations where at first sight a simple \f(CW\*(C`()*\*(C'\fR looks like +the correct solution. Suppose we parse text with comments being delimited +by \f(CW"#"\fR followed by some optional (horizontal) whitespace. Contrary to +its appearance, \f(CW\*(C`#[ \et]*\*(C'\fR \fIis not\fR the correct subexpression to match +the comment delimiter, because it may "give up" some whitespace if +the remainder of the pattern can be made to match that way.
The correct +answer is either one of these: +.Sp +.Vb 2 +\& (?>#[ \et]*) +\& #[ \et]*(?![ \et]) +.Ve +.Sp +For example, to grab non-empty comments into \f(CW$1\fR, one should use either +one of these: +.Sp +.Vb 2 +\& / (?> \e# [ \et]* ) ( .+ ) /x; +\& / \e# [ \et]* ( [^ \et] .* ) /x; +.Ve +.Sp +Which one you pick depends on which of these expressions better reflects +the above specification of comments. +.Sp +In some literature this construct is called "atomic matching" or +"possessive matching". +.Sp +Possessive quantifiers are equivalent to putting the item they are applied +to inside of one of these constructs. The following equivalences apply: +.Sp +.Vb 6 +\& Quantifier Form Bracketing Form +\& \-\-\-\-\-\-\-\-\-\-\-\-\-\-\- \-\-\-\-\-\-\-\-\-\-\-\-\-\-\- +\& PAT*+ (?>PAT*) +\& PAT++ (?>PAT+) +\& PAT?+ (?>PAT?) +\& PAT{min,max}+ (?>PAT{min,max}) +.Ve +.Sp +Nested \f(CW\*(C`(?>...)\*(C'\fR constructs are not no-ops, even if at first glance +they might seem to be. This is because the nested \f(CW\*(C`(?>...)\*(C'\fR can +restrict internal backtracking that otherwise might occur. For example, +.Sp +.Vb 1 +\& "abc" =~ /(?>a[bc]*c)/ +.Ve +.Sp +matches, but +.Sp +.Vb 1 +\& "abc" =~ /(?>a(?>[bc]*)c)/ +.Ve +.Sp +does not. +.ie n .IP """(?[ ])""" 4 +.el .IP "\f(CW(?[ ])\fR" 4 +.IX Item "(?[ ])" +See "Extended Bracketed Character Classes" in perlrecharclass. +.SS Backtracking +.IX Xref "backtrack backtracking" +.IX Subsection "Backtracking" +NOTE: This section presents an abstract approximation of regular +expression behavior. For a more rigorous (and complicated) view of +the rules involved in selecting a match among possible alternatives, +see "Combining RE Pieces". 
+.PP +A fundamental feature of regular expression matching involves the +notion called \fIbacktracking\fR, which is currently used (when needed) +by all regular non-possessive expression quantifiers, namely \f(CW"*"\fR, +\&\f(CW\*(C`*?\*(C'\fR, \f(CW"+"\fR, \f(CW\*(C`+?\*(C'\fR, \f(CW\*(C`{n,m}\*(C'\fR, and \f(CW\*(C`{n,m}?\*(C'\fR. Backtracking is often +optimized internally, but the general principle outlined here is valid. +.PP +For a regular expression to match, the \fIentire\fR regular expression must +match, not just part of it. So if the beginning of a pattern containing a +quantifier succeeds in a way that causes later parts in the pattern to +fail, the matching engine backs up and recalculates the beginning +part\-\-that's why it's called backtracking. +.PP +Here is an example of backtracking: Let's say you want to find the +word following "foo" in the string "Food is on the foo table.": +.PP +.Vb 4 +\& $_ = "Food is on the foo table."; +\& if ( /\eb(foo)\es+(\ew+)/i ) { +\& print "$2 follows $1.\en"; +\& } +.Ve +.PP +When the match runs, the first part of the regular expression (\f(CW\*(C`\eb(foo)\*(C'\fR) +finds a possible match right at the beginning of the string, and loads up +\&\f(CW$1\fR with "Foo". However, as soon as the matching engine sees that there's +no whitespace following the "Foo" that it had saved in \f(CW$1\fR, it realizes its +mistake and starts over again one character after where it had the +tentative match. This time it goes all the way until the next occurrence +of "foo". The complete regular expression matches this time, and you get +the expected output of "table follows foo." +.PP +Sometimes minimal matching can help a lot. Imagine you'd like to match +everything between "foo" and "bar". 
Initially, you write something +like this: +.PP +.Vb 4 +\& $_ = "The food is under the bar in the barn."; +\& if ( /foo(.*)bar/ ) { +\& print "got <$1>\en"; +\& } +.Ve +.PP +Which perhaps unexpectedly yields: +.PP +.Vb 1 +\& got <d is under the bar in the > +.Ve +.PP +That's because \f(CW\*(C`.*\*(C'\fR was greedy, so you get everything between the +\&\fIfirst\fR "foo" and the \fIlast\fR "bar". Here it's more effective +to use minimal matching to make sure you get the text between a "foo" +and the first "bar" thereafter. +.PP +.Vb 2 +\& if ( /foo(.*?)bar/ ) { print "got <$1>\en" } +\& got <d is under the > +.Ve +.PP +Here's another example. Let's say you'd like to match a number at the end +of a string, and you also want to keep the preceding part of the match. +So you write this: +.PP +.Vb 4 +\& $_ = "I have 2 numbers: 53147"; +\& if ( /(.*)(\ed*)/ ) { # Wrong! +\& print "Beginning is <$1>, number is <$2>.\en"; +\& } +.Ve +.PP +That won't work at all, because \f(CW\*(C`.*\*(C'\fR was greedy and gobbled up the +whole string. As \f(CW\*(C`\ed*\*(C'\fR can match on an empty string the complete +regular expression matched successfully. +.PP +.Vb 1 +\& Beginning is <I have 2 numbers: 53147>, number is <>. 
+.Ve +.PP +Here are some variants, most of which don't work: +.PP +.Vb 11 +\& $_ = "I have 2 numbers: 53147"; +\& @pats = qw{ +\& (.*)(\ed*) +\& (.*)(\ed+) +\& (.*?)(\ed*) +\& (.*?)(\ed+) +\& (.*)(\ed+)$ +\& (.*?)(\ed+)$ +\& (.*)\eb(\ed+)$ +\& (.*\eD)(\ed+)$ +\& }; +\& +\& for $pat (@pats) { +\& printf "%\-12s ", $pat; +\& if ( /$pat/ ) { +\& print "<$1> <$2>\en"; +\& } else { +\& print "FAIL\en"; +\& } +\& } +.Ve +.PP +That will print out: +.PP +.Vb 8 +\& (.*)(\ed*) <I have 2 numbers: 53147> <> +\& (.*)(\ed+) <I have 2 numbers: 5314> <7> +\& (.*?)(\ed*) <> <> +\& (.*?)(\ed+) <I have > <2> +\& (.*)(\ed+)$ <I have 2 numbers: 5314> <7> +\& (.*?)(\ed+)$ <I have 2 numbers: > <53147> +\& (.*)\eb(\ed+)$ <I have 2 numbers: > <53147> +\& (.*\eD)(\ed+)$ <I have 2 numbers: > <53147> +.Ve +.PP +As you see, this can be a bit tricky. It's important to realize that a +regular expression is merely a set of assertions that gives a definition +of success. There may be 0, 1, or several different ways that the +definition might succeed against a particular string. And if there are +multiple ways it might succeed, you need to understand backtracking to +know which variety of success you will achieve. +.PP +When using lookahead assertions and negations, this can all get even +trickier. Imagine you'd like to find a sequence of non-digits not +followed by "123". You might try to write that as +.PP +.Vb 4 +\& $_ = "ABC123"; +\& if ( /^\eD*(?!123)/ ) { # Wrong! +\& print "Yup, no 123 in $_\en"; +\& } +.Ve +.PP +But that isn't going to match; at least, not the way you're hoping. It +claims that there is no 123 in the string. 
Here's a clearer picture of +why that pattern matches, contrary to popular expectations: +.PP +.Vb 2 +\& $x = \*(AqABC123\*(Aq; +\& $y = \*(AqABC445\*(Aq; +\& +\& print "1: got $1\en" if $x =~ /^(ABC)(?!123)/; +\& print "2: got $1\en" if $y =~ /^(ABC)(?!123)/; +\& +\& print "3: got $1\en" if $x =~ /^(\eD*)(?!123)/; +\& print "4: got $1\en" if $y =~ /^(\eD*)(?!123)/; +.Ve +.PP +This prints +.PP +.Vb 3 +\& 2: got ABC +\& 3: got AB +\& 4: got ABC +.Ve +.PP +You might have expected test 3 to fail because it seems to be a more +general-purpose version of test 1. The important difference between +them is that test 3 contains a quantifier (\f(CW\*(C`\eD*\*(C'\fR) and so can use +backtracking, whereas test 1 cannot. What's happening is +that you've asked "Is it true that at the start of \f(CW$x\fR, following 0 or more +non-digits, you have something that's not 123?" If the pattern matcher had +let \f(CW\*(C`\eD*\*(C'\fR expand to "ABC", this would have caused the whole pattern to +fail. +.PP +The search engine will initially match \f(CW\*(C`\eD*\*(C'\fR with "ABC". Then it will +try to match \f(CW\*(C`(?!123)\*(C'\fR with "123", which fails. But because +a quantifier (\f(CW\*(C`\eD*\*(C'\fR) has been used in the regular expression, the +search engine can backtrack and retry the match differently +in the hope of matching the complete regular expression. +.PP +The pattern really, \fIreally\fR wants to succeed, so it uses the +standard pattern back-off-and-retry and lets \f(CW\*(C`\eD*\*(C'\fR expand to just "AB" this +time. Now there's indeed something following "AB" that is not +"123". It's "C123", which suffices. +.PP +We can deal with this by using both an assertion and a negation. +We'll say that the first part in \f(CW$1\fR must be followed both by a digit +and by something that's not "123". Remember that the lookaheads +are zero-width expressions\-\-they only look, but don't consume any +of the string in their match.
So rewriting this way produces what +you'd expect; that is, case 5 will fail, but case 6 succeeds: +.PP +.Vb 2 +\& print "5: got $1\en" if $x =~ /^(\eD*)(?=\ed)(?!123)/; +\& print "6: got $1\en" if $y =~ /^(\eD*)(?=\ed)(?!123)/; +\& +\& 6: got ABC +.Ve +.PP +In other words, the two zero-width assertions next to each other work as though +they're ANDed together, just as you'd use any built-in assertions: \f(CW\*(C`/^$/\*(C'\fR +matches only if you're at the beginning of the line AND the end of the +line simultaneously. The deeper underlying truth is that juxtaposition in +regular expressions always means AND, except when you write an explicit OR +using the vertical bar. \f(CW\*(C`/ab/\*(C'\fR means match "a" AND (then) match "b", +although the attempted matches are made at different positions because "a" +is not a zero-width assertion, but a one-width assertion. +.PP +\&\fBWARNING\fR: Particularly complicated regular expressions can take +exponential time to solve because of the immense number of possible +ways they can use backtracking to try for a match. For example, without +internal optimizations done by the regular expression engine, this will +take a painfully long time to run: +.PP +.Vb 1 +\& \*(Aqaaaaaaaaaaaa\*(Aq =~ /((a{0,5}){0,5})*[c]/ +.Ve +.PP +And if you used \f(CW"*"\fR's in the internal groups instead of limiting them +to 0 through 5 matches, then it would take forever\-\-or until you ran +out of stack space. Moreover, these internal optimizations are not +always applicable. For example, if you put \f(CW\*(C`{0,5}\*(C'\fR instead of \f(CW"*"\fR +on the external group, no current optimization is applicable, and the +match takes a long time to finish. +.PP +A powerful tool for optimizing such beasts is what is known as an +"independent group", +which does not backtrack (see \f(CW"(?>pattern)"\fR). 
Note also that +zero-length lookahead/lookbehind assertions will not backtrack to make +the tail match, since they are in "logical" context: only +whether they match is considered relevant. For an example +where side-effects of lookahead \fImight\fR have influenced the +following match, see \f(CW"(?>pattern)"\fR. +.SS "Script Runs" +.IX Xref "(*script_run:...) (sr:...) (*atomic_script_run:...) (asr:...)" +.IX Subsection "Script Runs" +A script run is basically a sequence of characters, all from the same +Unicode script (see "Scripts" in perlunicode), such as Latin or Greek. In +most places a single word would never be written in multiple scripts, +unless it is a spoofing attack. An infamous example is +.PP +.Vb 1 +\& paypal.com +.Ve +.PP +Those letters could all be Latin (as in the example just above), or they +could be all Cyrillic (except for the dot), or they could be a mixture +of the two. In the case of an internet address the \f(CW\*(C`.com\*(C'\fR would be in +Latin, and any Cyrillic letters would cause it to be a mixture, not a +script run. Someone clicking on such a link would not be directed to +the real Paypal website, but to a look-alike one that an attacker has +crafted to gather sensitive information from the person. +.PP +Starting in Perl 5.28, it is easy to detect strings that aren't +script runs. Simply enclose just about any pattern like either of +these: +.PP +.Vb 2 +\& (*script_run:pattern) +\& (*sr:pattern) +.Ve +.PP +What happens is that after \fIpattern\fR succeeds in matching, it is +subjected to the additional criterion that every character in it must be +from the same script (see exceptions below). If this isn't true, +backtracking occurs until something all in the same script is found that +matches, or all possibilities are exhausted. This can cause a lot of +backtracking, but generally, only malicious input will result in this, +though the slowdown could cause a denial of service attack.
If your +needs permit, it is best to make the pattern atomic to cut down on the +amount of backtracking. This is so likely to be what you want, that +instead of writing this: +.PP +.Vb 1 +\& (*script_run:(?>pattern)) +.Ve +.PP +you can write either of these: +.PP +.Vb 2 +\& (*atomic_script_run:pattern) +\& (*asr:pattern) +.Ve +.PP +(See \f(CW"(?>\fR\f(CIpattern\fR\f(CW)"\fR.) +.PP +In Taiwan, Japan, and Korea, it is common for text to have a mixture of +characters from their native scripts and base Chinese. Perl follows +Unicode's UTS 39 (<https://unicode.org/reports/tr39/>) Unicode Security +Mechanisms in allowing such mixtures. For example, the Japanese scripts +Katakana and Hiragana are commonly mixed together in practice, along +with some Chinese characters, and hence are treated as being in a single +script run by Perl. +.PP +The rules used for matching decimal digits are slightly stricter. Many +scripts have their own sets of digits equivalent to the Western \f(CW0\fR +through \f(CW9\fR ones. A few, such as Arabic, have more than one set. For +a string to be considered a script run, all digits in it must come from +the same set of ten, as determined by the first digit encountered. +As an example, +.PP +.Vb 1 +\& qr/(*script_run: \ed+ \eb )/x +.Ve +.PP +guarantees that the digits matched will all be from the same set of 10. +You won't get a look-alike digit from a different script that has a +different value than what it appears to be. +.PP +Unicode has three pseudo scripts that are handled specially. +.PP +"Unknown" is applied to code points whose meaning has yet to be +determined. Perl currently will match as a script run, any single +character string consisting of one of these code points. But any string +longer than one code point containing one of these will not be +considered a script run. +.PP +"Inherited" is applied to characters that modify another, such as an +accent of some type. 
These are considered to be in the script of the +master character, and so never cause a script run to not match. +.PP +The other one is "Common". This consists of mostly punctuation, emoji, +characters used in mathematics and music, the ASCII digits \f(CW0\fR +through \f(CW9\fR, and full-width forms of these digits. These characters +can appear intermixed in text in many of the world's scripts. These +also don't cause a script run to not match. But like other scripts, all +digits in a run must come from the same set of 10. +.PP +This construct is non-capturing. You can add parentheses to \fIpattern\fR +to capture, if desired. You will have to do this if you plan to use +"(*ACCEPT) (*ACCEPT:arg)" and not have it bypass the script run +checking. +.PP +The \f(CW\*(C`Script_Extensions\*(C'\fR property as modified by UTS 39 +(<https://unicode.org/reports/tr39/>) is used as the basis for this +feature. +.PP +To summarize, +.IP \(bu 4 +All length 0 or length 1 sequences are script runs. +.IP \(bu 4 +A longer sequence is a script run if and only if \fBall\fR of the following +conditions are met: +.Sp + +.RS 4 +.IP 1. 4 +No code point in the sequence has the \f(CW\*(C`Script_Extension\*(C'\fR property of +\&\f(CW\*(C`Unknown\*(C'\fR. +.Sp +This currently means that all code points in the sequence have been +assigned by Unicode to be characters that aren't private use nor +surrogate code points. +.IP 2. 4 +All characters in the sequence come from the Common script and/or the +Inherited script and/or a single other script. +.Sp +The script of a character is determined by the \f(CW\*(C`Script_Extensions\*(C'\fR +property as modified by UTS 39 (<https://unicode.org/reports/tr39/>), as +described above. +.IP 3. 4 +All decimal digits in the sequence come from the same block of 10 +consecutive digits. 
+.RE +.RS 4 +.RE +.SS "Special Backtracking Control Verbs" +.IX Subsection "Special Backtracking Control Verbs" +These special patterns are generally of the form \f(CW\*(C`(*\fR\f(CIVERB\fR\f(CW:\fR\f(CIarg\fR\f(CW)\*(C'\fR. Unless +otherwise stated the \fIarg\fR argument is optional; in some cases, it is +mandatory. +.PP +Any pattern containing a special backtracking verb that allows an argument +has the special behaviour that when executed it sets the current package's +\&\f(CW$REGERROR\fR and \f(CW$REGMARK\fR variables. When doing so the following +rules apply: +.PP +On failure, the \f(CW$REGERROR\fR variable will be set to the \fIarg\fR value of the +verb pattern, if the verb was involved in the failure of the match. If the +\&\fIarg\fR part of the pattern was omitted, then \f(CW$REGERROR\fR will be set to the +name of the last \f(CW\*(C`(*MARK:\fR\f(CINAME\fR\f(CW)\*(C'\fR pattern executed, or to TRUE if there was +none. Also, the \f(CW$REGMARK\fR variable will be set to FALSE. +.PP +On a successful match, the \f(CW$REGERROR\fR variable will be set to FALSE, and +the \f(CW$REGMARK\fR variable will be set to the name of the last +\&\f(CW\*(C`(*MARK:\fR\f(CINAME\fR\f(CW)\*(C'\fR pattern executed. See the explanation for the +\&\f(CW\*(C`(*MARK:\fR\f(CINAME\fR\f(CW)\*(C'\fR verb below for more details. +.PP +\&\fBNOTE:\fR \f(CW$REGERROR\fR and \f(CW$REGMARK\fR are not magic variables like \f(CW$1\fR +and most other regex-related variables. They are not local to a scope, nor +readonly, but instead are volatile package variables similar to \f(CW$AUTOLOAD\fR. +They are set in the package containing the code that \fIexecuted\fR the regex +(rather than the one that compiled it, where those differ). If necessary, you +can use \f(CW\*(C`local\*(C'\fR to localize changes to these variables to a specific scope +before executing a regex. 
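The note above can be illustrated with a minimal sketch. It assumes the code runs in package main; the helper name branch_of and the two-branch pattern are hypothetical and exist only for this illustration. Because \f(CW$REGMARK\fR and \f(CW$REGERROR\fR are ordinary package variables, the helper localizes them so that the engine's updates do not leak into the caller.

```perl
use strict;
use warnings;

# Plain package variables; "our" is only needed to satisfy "use strict".
our ($REGMARK, $REGERROR);

# Hypothetical helper: report which branch matched, without leaking the
# regex engine's updates to $REGMARK/$REGERROR into the caller.
sub branch_of {
    my ($text) = @_;
    local ($REGMARK, $REGERROR);
    return undef
        unless $text =~ /^(?:\d+(*MARK:number)|[a-z]+(*MARK:word))$/;
    return "$REGMARK";    # copy the name before "local" is undone
}

$REGMARK = 'caller value';
for my $text ('123', 'abc', '12a') {
    my $branch = branch_of($text);
    printf "%-3s => %s\n", $text, defined $branch ? $branch : '(no match)';
}
print "REGMARK afterwards: $REGMARK\n";
```

Without the \f(CW\*(C`local\*(C'\fR, the last line would report whatever the final match attempt left behind rather than the caller's own value.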
+.PP +If a pattern does not contain a special backtracking verb that allows an +argument, then \f(CW$REGERROR\fR and \f(CW$REGMARK\fR are not touched at all. +.IP Verbs 3 +.IX Item "Verbs" +.RS 3 +.PD 0 +.ie n .IP """(*PRUNE)"" ""(*PRUNE:\fINAME\fR)""" 4 +.el .IP "\f(CW(*PRUNE)\fR \f(CW(*PRUNE:\fR\f(CINAME\fR\f(CW)\fR" 4 +.IX Xref "(*PRUNE) (*PRUNE:NAME)" +.IX Item "(*PRUNE) (*PRUNE:NAME)" +.PD +This zero-width pattern prunes the backtracking tree at the current point +when backtracked into on failure. Consider the pattern \f(CW\*(C`/\fR\f(CIA\fR\f(CW (*PRUNE) \fR\f(CIB\fR\f(CW/\*(C'\fR, +where \fIA\fR and \fIB\fR are complex patterns. Until the \f(CW\*(C`(*PRUNE)\*(C'\fR verb is reached, +\&\fIA\fR may backtrack as necessary to match. Once it is reached, matching +continues in \fIB\fR, which may also backtrack as necessary; however, should B +not match, then no further backtracking will take place, and the pattern +will fail outright at the current starting position. +.Sp +The following example counts all the possible matching strings in a +pattern (without actually matching any of them). +.Sp +.Vb 2 +\& \*(Aqaaab\*(Aq =~ /a+b?(?{print "$&\en"; $count++})(*FAIL)/; +\& print "Count=$count\en"; +.Ve +.Sp +which produces: +.Sp +.Vb 10 +\& aaab +\& aaa +\& aa +\& a +\& aab +\& aa +\& a +\& ab +\& a +\& Count=9 +.Ve +.Sp +If we add a \f(CW\*(C`(*PRUNE)\*(C'\fR before the count like the following +.Sp +.Vb 2 +\& \*(Aqaaab\*(Aq =~ /a+b?(*PRUNE)(?{print "$&\en"; $count++})(*FAIL)/; +\& print "Count=$count\en"; +.Ve +.Sp +we prevent backtracking and find the count of the longest matching string +at each matching starting point like so: +.Sp +.Vb 4 +\& aaab +\& aab +\& ab +\& Count=3 +.Ve +.Sp +Any number of \f(CW\*(C`(*PRUNE)\*(C'\fR assertions may be used in a pattern. +.Sp +See also \f(CW"(?>\fR\f(CIpattern\fR\f(CW)"\fR and possessive quantifiers for +other ways to +control backtracking. 
In some cases, the use of \f(CW\*(C`(*PRUNE)\*(C'\fR can be +replaced with a \f(CW\*(C`(?>pattern)\*(C'\fR with no functional difference; however, +\&\f(CW\*(C`(*PRUNE)\*(C'\fR can be used to handle cases that cannot be expressed using a +\&\f(CW\*(C`(?>pattern)\*(C'\fR alone. +.ie n .IP """(*SKIP)"" ""(*SKIP:\fINAME\fR)""" 4 +.el .IP "\f(CW(*SKIP)\fR \f(CW(*SKIP:\fR\f(CINAME\fR\f(CW)\fR" 4 +.IX Xref "(*SKIP)" +.IX Item "(*SKIP) (*SKIP:NAME)" +This zero-width pattern is similar to \f(CW\*(C`(*PRUNE)\*(C'\fR, except that on +failure it also signifies that whatever text that was matched leading up +to the \f(CW\*(C`(*SKIP)\*(C'\fR pattern being executed cannot be part of \fIany\fR match +of this pattern. This effectively means that the regex engine "skips" forward +to this position on failure and tries to match again, (assuming that +there is sufficient room to match). +.Sp +The name of the \f(CW\*(C`(*SKIP:\fR\f(CINAME\fR\f(CW)\*(C'\fR pattern has special significance. If a +\&\f(CW\*(C`(*MARK:\fR\f(CINAME\fR\f(CW)\*(C'\fR was encountered while matching, then it is that position +which is used as the "skip point". If no \f(CW\*(C`(*MARK)\*(C'\fR of that name was +encountered, then the \f(CW\*(C`(*SKIP)\*(C'\fR operator has no effect. When used +without a name the "skip point" is where the match point was when +executing the \f(CW\*(C`(*SKIP)\*(C'\fR pattern. +.Sp +Compare the following to the examples in \f(CW\*(C`(*PRUNE)\*(C'\fR; note the string +is twice as long: +.Sp +.Vb 2 +\& \*(Aqaaabaaab\*(Aq =~ /a+b?(*SKIP)(?{print "$&\en"; $count++})(*FAIL)/; +\& print "Count=$count\en"; +.Ve +.Sp +outputs +.Sp +.Vb 3 +\& aaab +\& aaab +\& Count=2 +.Ve +.Sp +Once the 'aaab' at the start of the string has matched, and the \f(CW\*(C`(*SKIP)\*(C'\fR +executed, the next starting point will be where the cursor was when the +\&\f(CW\*(C`(*SKIP)\*(C'\fR was executed. 
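As a side-by-side sketch of the two verbs, the documented examples can be rerun with the matched text collected into arrays instead of printed. The array names are hypothetical; the patterns are the ones shown above with only the code block changed.

```perl
use strict;
use warnings;

my (@pruned, @skipped);

# (*PRUNE): no backtracking into a+b? once the verb has been passed, so
# each starting position contributes at most one (longest) candidate.
'aaab' =~ /a+b?(*PRUNE)(?{ push @pruned, $& })(*FAIL)/;

# (*SKIP): in addition, the next attempt starts where (*SKIP) was
# executed, so the text already matched is never revisited.
'aaabaaab' =~ /a+b?(*SKIP)(?{ push @skipped, $& })(*FAIL)/;

print "PRUNE: @pruned\n";
print "SKIP:  @skipped\n";
```

Both matches fail overall because of the trailing \f(CW\*(C`(*FAIL)\*(C'\fR; only the side effects of the code blocks are of interest here.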
+.ie n .IP """(*MARK:\fINAME\fR)"" ""(*:\fINAME\fR)""" 4 +.el .IP "\f(CW(*MARK:\fR\f(CINAME\fR\f(CW)\fR \f(CW(*:\fR\f(CINAME\fR\f(CW)\fR" 4 +.IX Xref "(*MARK) (*MARK:NAME) (*:NAME)" +.IX Item "(*MARK:NAME) (*:NAME)" +This zero-width pattern can be used to mark the point reached in a string +when a certain part of the pattern has been successfully matched. This +mark may be given a name. A later \f(CW\*(C`(*SKIP)\*(C'\fR pattern will then skip +forward to that point if backtracked into on failure. Any number of +\&\f(CW\*(C`(*MARK)\*(C'\fR patterns are allowed, and the \fINAME\fR portion may be duplicated. +.Sp +In addition to interacting with the \f(CW\*(C`(*SKIP)\*(C'\fR pattern, \f(CW\*(C`(*MARK:\fR\f(CINAME\fR\f(CW)\*(C'\fR +can be used to "label" a pattern branch, so that after matching, the +program can determine which branches of the pattern were involved in the +match. +.Sp +When a match is successful, the \f(CW$REGMARK\fR variable will be set to the +name of the most recently executed \f(CW\*(C`(*MARK:\fR\f(CINAME\fR\f(CW)\*(C'\fR that was involved +in the match. +.Sp +This can be used to determine which branch of a pattern was matched +without using a separate capture group for each branch, which in turn +can result in a performance improvement, as perl cannot optimize +\&\f(CW\*(C`/(?:(x)|(y)|(z))/\*(C'\fR as efficiently as something like +\&\f(CW\*(C`/(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/\*(C'\fR. +.Sp +When a match has failed, and unless another verb has been involved in +failing the match and has provided its own name to use, the \f(CW$REGERROR\fR +variable will be set to the name of the most recently executed +\&\f(CW\*(C`(*MARK:\fR\f(CINAME\fR\f(CW)\*(C'\fR. +.Sp +See "(*SKIP)" for more details. +.Sp +As a shortcut \f(CW\*(C`(*MARK:\fR\f(CINAME\fR\f(CW)\*(C'\fR can be written \f(CW\*(C`(*:\fR\f(CINAME\fR\f(CW)\*(C'\fR. 
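The branch-labelling use of \f(CW\*(C`(*MARK:\fR\f(CINAME\fR\f(CW)\*(C'\fR described above can be sketched as a tiny tokenizer. This is an illustration under stated assumptions, not a recommended lexer: the token grammar, the mark names and the variable names are all hypothetical, and the code is assumed to run in package main.

```perl
use strict;
use warnings;

our $REGMARK;    # set by the regex engine in the current package

my $source = 'x = 42 + y';

# One capture group for the token text; the branch is identified by the
# mark name rather than by which of several capture groups is defined.
my $token_re = qr{
    \G \s*
    (   [A-Za-z_]\w* (*MARK:ident)
      | \d+          (*MARK:number)
      | [=+]         (*MARK:op)
    )
}x;

my @tokens;
while ($source =~ /$token_re/gc) {
    push @tokens, "$REGMARK:$1";
}
print "@tokens\n";
```

The loop reads \f(CW$REGMARK\fR immediately after each successful match, before any other regex has a chance to overwrite it.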
+.ie n .IP """(*THEN)"" ""(*THEN:\fINAME\fR)""" 4 +.el .IP "\f(CW(*THEN)\fR \f(CW(*THEN:\fR\f(CINAME\fR\f(CW)\fR" 4 +.IX Item "(*THEN) (*THEN:NAME)" +This is similar to the "cut group" operator \f(CW\*(C`::\*(C'\fR from Raku. Like +\&\f(CW\*(C`(*PRUNE)\*(C'\fR, this verb always matches, and when backtracked into on +failure, it causes the regex engine to try the next alternation in the +innermost enclosing group (capturing or otherwise) that has alternations. +The two branches of a \f(CW\*(C`(?(\fR\f(CIcondition\fR\f(CW)\fR\f(CIyes\-pattern\fR\f(CW|\fR\f(CIno\-pattern\fR\f(CW)\*(C'\fR do not +count as an alternation, as far as \f(CW\*(C`(*THEN)\*(C'\fR is concerned. +.Sp +Its name comes from the observation that this operation combined with the +alternation operator (\f(CW"|"\fR) can be used to create what is essentially a +pattern-based if/then/else block: +.Sp +.Vb 1 +\& ( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) +.Ve +.Sp +Note that if this operator is used and NOT inside of an alternation then +it acts exactly like the \f(CW\*(C`(*PRUNE)\*(C'\fR operator. +.Sp +.Vb 1 +\& / A (*PRUNE) B / +.Ve +.Sp +is the same as +.Sp +.Vb 1 +\& / A (*THEN) B / +.Ve +.Sp +but +.Sp +.Vb 1 +\& / ( A (*THEN) B | C ) / +.Ve +.Sp +is not the same as +.Sp +.Vb 1 +\& / ( A (*PRUNE) B | C ) / +.Ve +.Sp +as after matching the \fIA\fR but failing on the \fIB\fR the \f(CW\*(C`(*THEN)\*(C'\fR verb will +backtrack and try \fIC\fR; but the \f(CW\*(C`(*PRUNE)\*(C'\fR verb will simply fail. +.ie n .IP """(*COMMIT)"" ""(*COMMIT:\fIarg\fR)""" 4 +.el .IP "\f(CW(*COMMIT)\fR \f(CW(*COMMIT:\fR\f(CIarg\fR\f(CW)\fR" 4 +.IX Xref "(*COMMIT)" +.IX Item "(*COMMIT) (*COMMIT:arg)" +This is the Raku "commit pattern" \f(CW\*(C`<commit>\*(C'\fR or \f(CW\*(C`:::\*(C'\fR. It's a +zero-width pattern similar to \f(CW\*(C`(*SKIP)\*(C'\fR, except that when backtracked +into on failure it causes the match to fail outright. 
No further attempts +to find a valid match by advancing the start pointer will occur again. +For example, +.Sp +.Vb 2 +\& \*(Aqaaabaaab\*(Aq =~ /a+b?(*COMMIT)(?{print "$&\en"; $count++})(*FAIL)/; +\& print "Count=$count\en"; +.Ve +.Sp +outputs +.Sp +.Vb 2 +\& aaab +\& Count=1 +.Ve +.Sp +In other words, once the \f(CW\*(C`(*COMMIT)\*(C'\fR has been entered, and if the pattern +does not match, the regex engine will not try any further matching on the +rest of the string. +.ie n .IP """(*FAIL)"" ""(*F)"" ""(*FAIL:\fIarg\fR)""" 4 +.el .IP "\f(CW(*FAIL)\fR \f(CW(*F)\fR \f(CW(*FAIL:\fR\f(CIarg\fR\f(CW)\fR" 4 +.IX Xref "(*FAIL) (*F)" +.IX Item "(*FAIL) (*F) (*FAIL:arg)" +This pattern matches nothing and always fails. It can be used to force the +engine to backtrack. It is equivalent to \f(CW\*(C`(?!)\*(C'\fR, but easier to read. In +fact, \f(CW\*(C`(?!)\*(C'\fR gets optimised into \f(CW\*(C`(*FAIL)\*(C'\fR internally. You can provide +an argument so that if the match fails because of this \f(CW\*(C`FAIL\*(C'\fR directive +the argument can be obtained from \f(CW$REGERROR\fR. +.Sp +It is probably useful only when combined with \f(CW\*(C`(?{})\*(C'\fR or \f(CW\*(C`(??{})\*(C'\fR. +.ie n .IP """(*ACCEPT)"" ""(*ACCEPT:\fIarg\fR)""" 4 +.el .IP "\f(CW(*ACCEPT)\fR \f(CW(*ACCEPT:\fR\f(CIarg\fR\f(CW)\fR" 4 +.IX Xref "(*ACCEPT)" +.IX Item "(*ACCEPT) (*ACCEPT:arg)" +This pattern matches nothing and causes the end of successful matching at +the point at which the \f(CW\*(C`(*ACCEPT)\*(C'\fR pattern was encountered, regardless of +whether there is actually more to match in the string. When inside of a +nested pattern, such as recursion, or in a subpattern dynamically generated +via \f(CW\*(C`(??{})\*(C'\fR, only the innermost pattern is ended immediately. +.Sp +If the \f(CW\*(C`(*ACCEPT)\*(C'\fR is inside of capturing groups then the groups are +marked as ended at the point at which the \f(CW\*(C`(*ACCEPT)\*(C'\fR was encountered. 
+For instance: +.Sp +.Vb 1 +\& \*(AqAB\*(Aq =~ /(A (A|B(*ACCEPT)|C) D)(E)/x; +.Ve +.Sp +will match, and \f(CW$1\fR will be \f(CW\*(C`AB\*(C'\fR and \f(CW$2\fR will be \f(CW"B"\fR, \f(CW$3\fR will not +be set. If another branch in the inner parentheses was matched, such as in the +string 'ACDE', then the \f(CW"D"\fR and \f(CW"E"\fR would have to be matched as well. +.Sp +You can provide an argument, which will be available in the var +\&\f(CW$REGMARK\fR after the match completes. +.RE +.RS 3 +.RE +.ie n .SS "Warning on ""\e1"" Instead of $1" +.el .SS "Warning on \f(CW\e1\fP Instead of \f(CW$1\fP" +.IX Subsection "Warning on 1 Instead of $1" +Some people get too used to writing things like: +.PP +.Vb 1 +\& $pattern =~ s/(\eW)/\e\e\e1/g; +.Ve +.PP +This is grandfathered (for \e1 to \e9) for the RHS of a substitute to avoid +shocking the +\&\fBsed\fR addicts, but it's a dirty habit to get into. That's because in +PerlThink, the righthand side of an \f(CW\*(C`s///\*(C'\fR is a double-quoted string. \f(CW\*(C`\e1\*(C'\fR in +the usual double-quoted string means a control-A. The customary Unix +meaning of \f(CW\*(C`\e1\*(C'\fR is kludged in for \f(CW\*(C`s///\*(C'\fR. However, if you get into the habit +of doing that, you get yourself into trouble if you then add an \f(CW\*(C`/e\*(C'\fR +modifier. +.PP +.Vb 1 +\& s/(\ed+)/ \e1 + 1 /eg; # causes warning under \-w +.Ve +.PP +Or if you try to do +.PP +.Vb 1 +\& s/(\ed+)/\e1000/; +.Ve +.PP +You can't disambiguate that by saying \f(CW\*(C`\e{1}000\*(C'\fR, whereas you can fix it with +\&\f(CW\*(C`${1}000\*(C'\fR. The operation of interpolation should not be confused +with the operation of matching a backreference. Certainly they mean two +different things on the \fIleft\fR side of the \f(CW\*(C`s///\*(C'\fR. +.SS "Repeated Patterns Matching a Zero-length Substring" +.IX Subsection "Repeated Patterns Matching a Zero-length Substring" +\&\fBWARNING\fR: Difficult material (and prose) ahead. This section needs a rewrite. 
+.PP +Regular expressions provide a terse and powerful programming language. As +with most other power tools, power comes together with the ability +to wreak havoc. +.PP +A common abuse of this power stems from the ability to make infinite +loops using regular expressions, with something as innocuous as: +.PP +.Vb 1 +\& \*(Aqfoo\*(Aq =~ m{ ( o? )* }x; +.Ve +.PP +The \f(CW\*(C`o?\*(C'\fR matches at the beginning of "\f(CW\*(C`foo\*(C'\fR", and since the position +in the string is not moved by the match, \f(CW\*(C`o?\*(C'\fR would match again and again +because of the \f(CW"*"\fR quantifier. Another common way to create a similar cycle +is with the looping modifier \f(CW\*(C`/g\*(C'\fR: +.PP +.Vb 1 +\& @matches = ( \*(Aqfoo\*(Aq =~ m{ o? }xg ); +.Ve +.PP +or +.PP +.Vb 1 +\& print "match: <$&>\en" while \*(Aqfoo\*(Aq =~ m{ o? }xg; +.Ve +.PP +or the loop implied by \f(CWsplit()\fR. +.PP +However, long experience has shown that many programming tasks may +be significantly simplified by using repeated subexpressions that +may match zero-length substrings. Here is a simple example: +.PP +.Vb 2 +\& @chars = split //, $string; # // is not magic in split +\& ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// / +.Ve +.PP +Thus Perl allows such constructs, by \fIforcefully breaking +the infinite loop\fR. The rules for this are different for lower-level +loops given by the greedy quantifiers \f(CW\*(C`*+{}\*(C'\fR, and for higher-level +ones like the \f(CW\*(C`/g\*(C'\fR modifier or \f(CWsplit()\fR operator. +.PP +The lower-level loops are \fIinterrupted\fR (that is, the loop is +broken) when Perl detects that a repeated expression matched a +zero-length substring. Thus +.PP +.Vb 1 +\& m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x; +.Ve +.PP +is made equivalent to +.PP +.Vb 1 +\& m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )?
}x; +.Ve +.PP +For example, this program +.PP +.Vb 12 +\& #!perl \-l +\& "aaaaab" =~ / +\& (?: +\& a # non\-zero +\& | # or +\& (?{print "hello"}) # print hello whenever this +\& # branch is tried +\& (?=(b)) # zero\-width assertion +\& )* # any number of times +\& /x; +\& print $&; +\& print $1; +.Ve +.PP +prints +.PP +.Vb 3 +\& hello +\& aaaaa +\& b +.Ve +.PP +Notice that "hello" is only printed once, as when Perl sees that the sixth +iteration of the outermost \f(CW\*(C`(?:)*\*(C'\fR matches a zero-length string, it stops +the \f(CW"*"\fR. +.PP +The higher-level loops preserve an additional state between iterations: +whether the last match was zero-length. To break the loop, the following +match after a zero-length match is prohibited to have a length of zero. +This prohibition interacts with backtracking (see "Backtracking"), +and so the \fIsecond best\fR match is chosen if the \fIbest\fR match is of +zero length. +.PP +For example: +.PP +.Vb 2 +\& $_ = \*(Aqbar\*(Aq; +\& s/\ew??/<$&>/g; +.Ve +.PP +results in \f(CW\*(C`<><b><><a><><r><>\*(C'\fR. At each position of the string the best +match given by non-greedy \f(CW\*(C`??\*(C'\fR is the zero-length match, and the \fIsecond +best\fR match is what is matched by \f(CW\*(C`\ew\*(C'\fR. Thus zero-length matches +alternate with one-character-long matches. +.PP +Similarly, for repeated \f(CW\*(C`m/()/g\*(C'\fR the second-best match is the match at the +position one notch further in the string. +.PP +The additional state of being \fImatched with zero-length\fR is associated with +the matched string, and is reset by each assignment to \f(CWpos()\fR. +Zero-length matches at the end of the previous match are ignored +during \f(CW\*(C`split\*(C'\fR. 
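The loop-breaking rules of this section can be checked with a short sketch that reuses the patterns shown above; the variable names are hypothetical. It only demonstrates that each construct terminates and collects the results for inspection.

```perl
use strict;
use warnings;

# Higher-level loop (/g): zero-length and one-character matches alternate,
# because a zero-length match may not directly follow a zero-length match.
(my $marked = 'bar') =~ s/\w??/<$&>/g;
print "$marked\n";

# split with an empty pattern terminates and yields one field per character.
my @chars = split //, 'foo';
print scalar(@chars), " fields: @chars\n";

# Lower-level loop: the quantified group can match the empty string, yet
# the match as a whole still terminates.
my $terminated = 'foo' =~ m{ ( o? )* }x;
print "quantified group terminated\n" if $terminated;

# /g in list context also terminates; print however many matches it found.
my @matches = ('foo' =~ m{ o? }xg);
print scalar(@matches), " matches from m{ o? }xg\n";
```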
+.SS "Combining RE Pieces" +.IX Subsection "Combining RE Pieces" +Each of the elementary pieces of regular expressions which were described +before (such as \f(CW\*(C`ab\*(C'\fR or \f(CW\*(C`\eZ\*(C'\fR) could match at most one substring +at the given position of the input string. However, in a typical regular +expression these elementary pieces are combined into more complicated +patterns using combining operators \f(CW\*(C`ST\*(C'\fR, \f(CW\*(C`S|T\*(C'\fR, \f(CW\*(C`S*\*(C'\fR \fIetc\fR. +(in these examples \f(CW"S"\fR and \f(CW"T"\fR are regular subexpressions). +.PP +Such combinations can include alternatives, leading to a problem of choice: +if we match a regular expression \f(CW\*(C`a|ab\*(C'\fR against \f(CW"abc"\fR, will it match +substring \f(CW"a"\fR or \f(CW"ab"\fR? One way to describe which substring is +actually matched is the concept of backtracking (see "Backtracking"). +However, this description is too low-level and makes you think +in terms of a particular implementation. +.PP +Another description starts with notions of "better"/"worse". All the +substrings which may be matched by the given regular expression can be +sorted from the "best" match to the "worst" match, and it is the "best" +match which is chosen. This substitutes the question of "what is chosen?" +by the question of "which matches are better, and which are worse?". +.PP +Again, for elementary pieces there is no such question, since at most +one match at a given position is possible. This section describes the +notion of better/worse for combining operators. In the description +below \f(CW"S"\fR and \f(CW"T"\fR are regular subexpressions. +.ie n .IP """ST""" 4 +.el .IP \f(CWST\fR 4 +.IX Item "ST" +Consider two possible matches, \f(CW\*(C`AB\*(C'\fR and \f(CW\*(C`A\*(AqB\*(Aq\*(C'\fR, \f(CW"A"\fR and \f(CW\*(C`A\*(Aq\*(C'\fR are +substrings which can be matched by \f(CW"S"\fR, \f(CW"B"\fR and \f(CW\*(C`B\*(Aq\*(C'\fR are substrings +which can be matched by \f(CW"T"\fR. 
+.Sp +If \f(CW"A"\fR is a better match for \f(CW"S"\fR than \f(CW\*(C`A\*(Aq\*(C'\fR, \f(CW\*(C`AB\*(C'\fR is a better +match than \f(CW\*(C`A\*(AqB\*(Aq\*(C'\fR. +.Sp +If \f(CW"A"\fR and \f(CW\*(C`A\*(Aq\*(C'\fR coincide: \f(CW\*(C`AB\*(C'\fR is a better match than \f(CW\*(C`AB\*(Aq\*(C'\fR if +\&\f(CW"B"\fR is a better match for \f(CW"T"\fR than \f(CW\*(C`B\*(Aq\*(C'\fR. +.ie n .IP """S|T""" 4 +.el .IP \f(CWS|T\fR 4 +.IX Item "S|T" +When \f(CW"S"\fR can match, it is a better match than when only \f(CW"T"\fR can match. +.Sp +Two matches for \f(CW"S"\fR are ordered here just as they are for \f(CW"S"\fR alone. The same holds for +two matches for \f(CW"T"\fR. +.ie n .IP """S{REPEAT_COUNT}""" 4 +.el .IP \f(CWS{REPEAT_COUNT}\fR 4 +.IX Item "S{REPEAT_COUNT}" +Matches as \f(CW\*(C`SSS...S\*(C'\fR (repeated as many times as necessary). +.ie n .IP """S{min,max}""" 4 +.el .IP \f(CWS{min,max}\fR 4 +.IX Item "S{min,max}" +Matches as \f(CW\*(C`S{max}|S{max\-1}|...|S{min+1}|S{min}\*(C'\fR. +.ie n .IP """S{min,max}?""" 4 +.el .IP \f(CWS{min,max}?\fR 4 +.IX Item "S{min,max}?" +Matches as \f(CW\*(C`S{min}|S{min+1}|...|S{max\-1}|S{max}\*(C'\fR. +.ie n .IP """S?"", ""S*"", ""S+""" 4 +.el .IP "\f(CWS?\fR, \f(CWS*\fR, \f(CWS+\fR" 4 +.IX Item "S?, S*, S+" +Same as \f(CW\*(C`S{0,1}\*(C'\fR, \f(CW\*(C`S{0,BIG_NUMBER}\*(C'\fR, \f(CW\*(C`S{1,BIG_NUMBER}\*(C'\fR respectively. +.ie n .IP """S??"", ""S*?"", ""S+?""" 4 +.el .IP "\f(CWS??\fR, \f(CWS*?\fR, \f(CWS+?\fR" 4 +.IX Item "S??, S*?, S+?" +Same as \f(CW\*(C`S{0,1}?\*(C'\fR, \f(CW\*(C`S{0,BIG_NUMBER}?\*(C'\fR, \f(CW\*(C`S{1,BIG_NUMBER}?\*(C'\fR respectively. +.ie n .IP """(?>S)""" 4 +.el .IP \f(CW(?>S)\fR 4 +.IX Item "(?>S)" +Matches the best match for \f(CW"S"\fR and only that. +.ie n .IP """(?=S)"", ""(?<=S)""" 4 +.el .IP "\f(CW(?=S)\fR, \f(CW(?<=S)\fR" 4 +.IX Item "(?=S), (?<=S)" +Only the best match for \f(CW"S"\fR is considered.
(This is important only if +\&\f(CW"S"\fR has capturing parentheses, and backreferences are used somewhere +else in the whole regular expression.) +.ie n .IP """(?!S)"", ""(?<!S)""" 4 +.el .IP "\f(CW(?!S)\fR, \f(CW(?<!S)\fR" 4 +.IX Item "(?!S), (?<!S)" +For this grouping operator there is no need to describe the ordering, since +only whether or not \f(CW"S"\fR can match is important. +.ie n .IP """(??{ \fIEXPR\fR })"", ""(?\fIPARNO\fR)""" 4 +.el .IP "\f(CW(??{ \fR\f(CIEXPR\fR\f(CW })\fR, \f(CW(?\fR\f(CIPARNO\fR\f(CW)\fR" 4 +.IX Item "(??{ EXPR }), (?PARNO)" +The ordering is the same as for the regular expression which is +the result of \fIEXPR\fR, or the pattern contained by capture group \fIPARNO\fR. +.ie n .IP """(?(\fIcondition\fR)\fIyes\-pattern\fR|\fIno\-pattern\fR)""" 4 +.el .IP \f(CW(?(\fR\f(CIcondition\fR\f(CW)\fR\f(CIyes\-pattern\fR\f(CW|\fR\f(CIno\-pattern\fR\f(CW)\fR 4 +.IX Item "(?(condition)yes-pattern|no-pattern)" +Recall that which of \fIyes-pattern\fR or \fIno-pattern\fR actually matches is +already determined. The ordering of the matches is the same as for the +chosen subexpression. +.PP +The above recipes describe the ordering of matches \fIat a given position\fR. +One more rule is needed to understand how a match is determined for the +whole regular expression: a match at an earlier position is always better +than a match at a later position. +.SS "Creating Custom RE Engines" +.IX Subsection "Creating Custom RE Engines" +As of Perl 5.10.0, one can create custom regular expression engines. This +is not for the faint of heart, as they have to plug in at the C level. See +perlreapi for more details. +.PP +As an alternative, overloaded constants (see overload) provide a simple +way to extend the functionality of the RE engine, by substituting one +pattern for another. +.PP +Suppose that we want to enable a new RE escape-sequence \f(CW\*(C`\eY|\*(C'\fR which +matches at a boundary between whitespace characters and non-whitespace +characters. 
Note that \f(CW\*(C`(?=\eS)(?<!\eS)|(?!\eS)(?<=\eS)\*(C'\fR matches exactly +at these positions, so we want to have each \f(CW\*(C`\eY|\*(C'\fR in the place of the +more complicated version. We can create a module \f(CW\*(C`customre\*(C'\fR to do +this: +.PP +.Vb 2 +\& package customre; +\& use overload; +\& +\& sub import { +\& shift; +\& die "No argument to customre::import allowed" if @_; +\& overload::constant \*(Aqqr\*(Aq => \e&convert; +\& } +\& +\& sub invalid { die "/$_[0]/: invalid escape \*(Aq\e\e$_[1]\*(Aq"} +\& +\& # We must also take care of not escaping the legitimate \e\eY| +\& # sequence, hence the presence of \*(Aq\e\e\*(Aq in the conversion rules. +\& my %rules = ( \*(Aq\e\e\*(Aq => \*(Aq\e\e\e\e\*(Aq, +\& \*(AqY|\*(Aq => qr/(?=\eS)(?<!\eS)|(?!\eS)(?<=\eS)/ ); +\& sub convert { +\& my $re = shift; +\& $re =~ s{ +\& \e\e ( \e\e | Y . ) +\& } +\& { $rules{$1} or invalid($re,$1) }sgex; +\& return $re; +\& } +.Ve +.PP +Now \f(CW\*(C`use customre\*(C'\fR enables the new escape in constant regular +expressions, \fIi.e.\fR, those without any runtime variable interpolations. +As documented in overload, this conversion will work only over +literal parts of regular expressions. For \f(CW\*(C`\eY|$re\eY|\*(C'\fR the variable +part of this regular expression needs to be converted explicitly +(but only if the special meaning of \f(CW\*(C`\eY|\*(C'\fR should be enabled inside \f(CW$re\fR): +.PP +.Vb 5 +\& use customre; +\& $re = <>; +\& chomp $re; +\& $re = customre::convert $re; +\& /\eY|$re\eY|/; +.Ve +.SS "Embedded Code Execution Frequency" +.IX Subsection "Embedded Code Execution Frequency" +The exact rules for how often \f(CW\*(C`(?{})\*(C'\fR and \f(CW\*(C`(??{})\*(C'\fR are executed in a pattern +are unspecified, and this is even more true of \f(CW\*(C`(*{})\*(C'\fR. 
+In the case of a successful match you can assume that they DWIM and +will be executed in left-to-right order the appropriate number of times in the +accepting path of the pattern as would any other meta-pattern. How non\- +accepting pathways and match failures affect the number of times a pattern is +executed is deliberately unspecified: it may vary depending on which +optimizations can be applied to the pattern, and it is likely to change from +version to version. +.PP +For instance, in +.PP +.Vb 1 +\& "aaabcdeeeee"=~/a(?{print "a"})b(?{print "b"})cde/; +.Ve +.PP +the exact number of times "a" or "b" is printed is unspecified for +failure, but you may assume they will be printed at least once during +a successful match. Additionally, you may assume that if "b" is printed, +it will be preceded by at least one "a". +.PP +In the case of branching constructs like the following: +.PP +.Vb 1 +\& /a(b|(?{ print "a" }))c(?{ print "c" })/; +.Ve +.PP +you can assume that the input "ac" will output "ac", and that "abc" +will output only "c". +.PP +When embedded code is quantified, successful matches will call the +code once for each matched iteration of the quantifier. For +example: +.PP +.Vb 1 +\& "good" =~ /g(?:o(?{print "o"}))*d/; +.Ve +.PP +will output "o" twice. +.PP +For historical and consistency reasons the use of normal code blocks +anywhere in a pattern will disable certain optimisations. As of 5.37.7 +you can use an "optimistic" code block, \f(CW\*(C`(*{ ... })\*(C'\fR, as a replacement +for \f(CW\*(C`(?{ ... })\*(C'\fR if you do \fInot\fR wish to disable these optimisations. +This may result in the code block being called less often than it would +have been had it not been optimistic. +.SS "PCRE/Python Support" +.IX Subsection "PCRE/Python Support" +As of Perl 5.10.0, Perl supports several Python/PCRE\-specific extensions +to the regex syntax.
While Perl programmers are encouraged to use the +Perl-specific syntax, the following are also accepted: +.ie n .IP """(?P<\fINAME\fR>\fIpattern\fR)""" 4 +.el .IP \f(CW(?P<\fR\f(CINAME\fR\f(CW>\fR\f(CIpattern\fR\f(CW)\fR 4 +.IX Item "(?P<NAME>pattern)" +Define a named capture group. Equivalent to \f(CW\*(C`(?<\fR\f(CINAME\fR\f(CW>\fR\f(CIpattern\fR\f(CW)\*(C'\fR. +.ie n .IP """(?P=\fINAME\fR)""" 4 +.el .IP \f(CW(?P=\fR\f(CINAME\fR\f(CW)\fR 4 +.IX Item "(?P=NAME)" +Backreference to a named capture group. Equivalent to \f(CW\*(C`\eg{\fR\f(CINAME\fR\f(CW}\*(C'\fR. +.ie n .IP """(?P>\fINAME\fR)""" 4 +.el .IP \f(CW(?P>\fR\f(CINAME\fR\f(CW)\fR 4 +.IX Item "(?P>NAME)" +Subroutine call to a named capture group. Equivalent to \f(CW\*(C`(?&\fR\f(CINAME\fR\f(CW)\*(C'\fR. +.SH BUGS +.IX Header "BUGS" +There are a number of issues with regard to case-insensitive matching +in Unicode rules. See \f(CW"i"\fR under "Modifiers" above. +.PP +This document varies from difficult to understand to completely +and utterly opaque. The wandering prose riddled with jargon is +hard to fathom in several places. +.PP +This document needs a rewrite that separates the tutorial content +from the reference content. +.SH "SEE ALSO" +.IX Header "SEE ALSO" +The syntax of patterns used in Perl pattern matching evolved from those +supplied in the Bell Labs Research Unix 8th Edition (Version 8) regex +routines. (The code is actually derived (distantly) from Henry +Spencer's freely redistributable reimplementation of those V8 routines.) +.PP +perlrequick. +.PP +perlretut. +.PP +"Regexp Quote-Like Operators" in perlop. +.PP +"Gory details of parsing quoted constructs" in perlop. +.PP +perlfaq6. +.PP +"pos" in perlfunc. +.PP +perllocale. +.PP +perlebcdic. +.PP +\&\fIMastering Regular Expressions\fR by Jeffrey Friedl, published +by O'Reilly and Associates. |