Diffstat (limited to 'upstream/mageia-cauldron/man1/perlrebackslash.1')
-rw-r--r-- | upstream/mageia-cauldron/man1/perlrebackslash.1 | 859 |
1 files changed, 859 insertions, 0 deletions
diff --git a/upstream/mageia-cauldron/man1/perlrebackslash.1 b/upstream/mageia-cauldron/man1/perlrebackslash.1 new file mode 100644 index 00000000..c109cd0f --- /dev/null +++ b/upstream/mageia-cauldron/man1/perlrebackslash.1 @@ -0,0 +1,859 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLREBACKSLASH 1" +.TH PERLREBACKSLASH 1 2023-11-28 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. 
+.if n .ad l +.nh +.SH NAME +perlrebackslash \- Perl Regular Expression Backslash Sequences and Escapes +.SH DESCRIPTION +.IX Header "DESCRIPTION" +The top level documentation about Perl regular expressions +is found in perlre. +.PP +This document describes all backslash and escape sequences. After +explaining the role of the backslash, it lists all the sequences that have +a special meaning in Perl regular expressions (in alphabetical order), +then describes each of them. +.PP +Most sequences are described in detail in different documents; the primary +purpose of this document is to have a quick reference guide describing all +backslash and escape sequences. +.SS "The backslash" +.IX Subsection "The backslash" +In a regular expression, the backslash can perform one of two tasks: +it either takes away the special meaning of the character following it +(for instance, \f(CW\*(C`\e|\*(C'\fR matches a vertical bar, it's not an alternation), +or it is the start of a backslash or escape sequence. +.PP +The rules determining what it is are quite simple: if the character +following the backslash is an ASCII punctuation (non-word) character (that is, +anything that is not a letter, digit, or underscore), then the backslash just +takes away any special meaning of the character following it. +.PP +If the character following the backslash is an ASCII letter or an ASCII digit, +then the sequence may be special; if so, it's listed below. A few letters have +not been used yet, so escaping them with a backslash doesn't change them to be +special. A future version of Perl may assign a special meaning to them, so if +you have warnings turned on, Perl issues a warning if you use such a +sequence. [1]. +.PP +It is however guaranteed that backslash or escape sequences never have a +punctuation character following the backslash, not now, and not in a future +version of Perl 5. So it is safe to put a backslash in front of a non-word +character. 
+.PP +Note that the backslash itself is special; if you want to match a backslash, +you have to escape the backslash with a backslash: \f(CW\*(C`/\e\e/\*(C'\fR matches a single +backslash. +.IP [1] 4 +.IX Item "[1]" +There is one exception. If you use an alphanumeric character as the +delimiter of your pattern (which you probably shouldn't do for readability +reasons), you have to escape the delimiter if you want to match +it. Perl won't warn then. See also "Gory details of parsing +quoted constructs" in perlop. +.SS "All the sequences and escapes" +.IX Subsection "All the sequences and escapes" +Those not usable within a bracketed character class (like \f(CW\*(C`[\eda\-z]\*(C'\fR) are marked +as \f(CW\*(C`Not in [].\*(C'\fR +.PP +.Vb 10 +\& \e000 Octal escape sequence. See also \eo{}. +\& \e1 Absolute backreference. Not in []. +\& \ea Alarm or bell. +\& \eA Beginning of string. Not in []. +\& \eb{}, \eb Boundary. (\eb is a backspace in []). +\& \eB{}, \eB Not a boundary. Not in []. +\& \ecX Control\-X. +\& \ed Match any digit character. +\& \eD Match any character that isn\*(Aqt a digit. +\& \ee Escape character. +\& \eE Turn off \eQ, \eL and \eU processing. Not in []. +\& \ef Form feed. +\& \eF Foldcase till \eE. Not in []. +\& \eg{}, \eg1 Named, absolute or relative backreference. +\& Not in []. +\& \eG Pos assertion. Not in []. +\& \eh Match any horizontal whitespace character. +\& \eH Match any character that isn\*(Aqt horizontal whitespace. +\& \ek{}, \ek<>, \ek\*(Aq\*(Aq Named backreference. Not in []. +\& \eK Keep the stuff left of \eK. Not in []. +\& \el Lowercase next character. Not in []. +\& \eL Lowercase till \eE. Not in []. +\& \en (Logical) newline character. +\& \eN Match any character but newline. Not in []. +\& \eN{} Named or numbered (Unicode) character or sequence. +\& \eo{} Octal escape sequence. +\& \ep{}, \epP Match any character with the given Unicode property. +\& \eP{}, \ePP Match any character without the given property. 
+\& \eQ Quote (disable) pattern metacharacters till \eE. Not +\& in []. +\& \er Return character. +\& \eR Generic new line. Not in []. +\& \es Match any whitespace character. +\& \eS Match any character that isn\*(Aqt a whitespace. +\& \et Tab character. +\& \eu Titlecase next character. Not in []. +\& \eU Uppercase till \eE. Not in []. +\& \ev Match any vertical whitespace character. +\& \eV Match any character that isn\*(Aqt vertical whitespace +\& \ew Match any word character. +\& \eW Match any character that isn\*(Aqt a word character. +\& \ex{}, \ex00 Hexadecimal escape sequence. +\& \eX Unicode "extended grapheme cluster". Not in []. +\& \ez End of string. Not in []. +\& \eZ End of string. Not in []. +.Ve +.SS "Character Escapes" +.IX Subsection "Character Escapes" +\fIFixed characters\fR +.IX Subsection "Fixed characters" +.PP +A handful of characters have a dedicated \fIcharacter escape\fR. The following +table shows them, along with their ASCII code points (in decimal and hex), +their ASCII name, the control escape on ASCII platforms and a short +description. (For EBCDIC platforms, see "OPERATOR DIFFERENCES" in perlebcdic.) +.PP +.Vb 9 +\& Seq. Code Point ASCII Cntrl Description. +\& Dec Hex +\& \ea 7 07 BEL \ecG alarm or bell +\& \eb 8 08 BS \ecH backspace [1] +\& \ee 27 1B ESC \ec[ escape character +\& \ef 12 0C FF \ecL form feed +\& \en 10 0A LF \ecJ line feed [2] +\& \er 13 0D CR \ecM carriage return +\& \et 9 09 TAB \ecI tab +.Ve +.IP [1] 4 +.IX Item "[1]" +\&\f(CW\*(C`\eb\*(C'\fR is the backspace character only inside a character class. Outside a +character class, \f(CW\*(C`\eb\*(C'\fR alone is a word\-character/non\-word\-character +boundary, and \f(CW\*(C`\eb{}\*(C'\fR is some other type of boundary. +.IP [2] 4 +.IX Item "[2]" +\&\f(CW\*(C`\en\*(C'\fR matches a logical newline. Perl converts between \f(CW\*(C`\en\*(C'\fR and your +OS's native newline character when reading from or writing to text files. 
+.PP +Example +.IX Subsection "Example" +.PP +.Vb 1 +\& $str =~ /\et/; # Matches if $str contains a (horizontal) tab. +.Ve +.PP +\fIControl characters\fR +.IX Subsection "Control characters" +.PP +\&\f(CW\*(C`\ec\*(C'\fR is used to denote a control character; the character following \f(CW\*(C`\ec\*(C'\fR +determines the value of the construct. For example the value of \f(CW\*(C`\ecA\*(C'\fR is +\&\f(CWchr(1)\fR, and the value of \f(CW\*(C`\ecb\*(C'\fR is \f(CWchr(2)\fR, etc. +The gory details are in "Regexp Quote-Like Operators" in perlop. A complete +list of what \f(CWchr(1)\fR, etc. means for ASCII and EBCDIC platforms is in +"OPERATOR DIFFERENCES" in perlebcdic. +.PP +Note that \f(CW\*(C`\ec\e\*(C'\fR alone at the end of a regular expression (or double-quoted +string) is not valid. The backslash must be followed by another character. +That is, \f(CW\*(C`\ec\e\fR\f(CIX\fR\f(CW\*(C'\fR means \f(CW\*(C`chr(28) . \*(Aq\fR\f(CIX\fR\f(CW\*(Aq\*(C'\fR for all characters \fIX\fR. +.PP +To write platform-independent code, you must use \f(CW\*(C`\eN{\fR\f(CINAME\fR\f(CW}\*(C'\fR instead, like +\&\f(CW\*(C`\eN{ESCAPE}\*(C'\fR or \f(CW\*(C`\eN{U+001B}\*(C'\fR, see charnames. +.PP +Mnemonic: \fIc\fRontrol character. +.PP +Example +.IX Subsection "Example" +.PP +.Vb 1 +\& $str =~ /\ecK/; # Matches if $str contains a vertical tab (control\-K). +.Ve +.PP +\fINamed or numbered characters and character sequences\fR +.IX Subsection "Named or numbered characters and character sequences" +.PP +Unicode characters have a Unicode name and numeric code point (ordinal) +value. Use the +\&\f(CW\*(C`\eN{}\*(C'\fR construct to specify a character by either of these values. +Certain sequences of characters also have names. +.PP +To specify by name, the name of the character or character sequence goes +between the curly braces. 
+.PP +To specify a character by Unicode code point, use the form \f(CW\*(C`\eN{U+\fR\f(CIcode +point\fR\f(CW}\*(C'\fR, where \fIcode point\fR is a number in hexadecimal that gives the +code point that Unicode has assigned to the desired character. It is +customary but not required to use leading zeros to pad the number to 4 +digits. Thus \f(CW\*(C`\eN{U+0041}\*(C'\fR means \f(CW\*(C`LATIN CAPITAL LETTER A\*(C'\fR, and you will +rarely see it written without the two leading zeros. \f(CW\*(C`\eN{U+0041}\*(C'\fR means +"A" even on EBCDIC machines (where the ordinal value of "A" is not 0x41). +.PP +Blanks may freely be inserted adjacent to but within the braces +enclosing the name or code point. So \f(CW\*(C`\eN{\ U+0041\ }\*(C'\fR is perfectly +legal. +.PP +It is even possible to give your own names to characters and character +sequences by using the charnames module. These custom names are +lexically scoped, and so a given code point may have different names +in different scopes. The name used is what is in effect at the time the +\&\f(CW\*(C`\eN{}\*(C'\fR is expanded. For patterns in double-quotish context, that means +at the time the pattern is parsed. But for patterns that are delimited +by single quotes, the expansion is deferred until pattern compilation +time, which may very well have a different \f(CW\*(C`charnames\*(C'\fR translator in +effect. +.PP +(There is an expanded internal form that you may see in debug output: +\&\f(CW\*(C`\eN{U+\fR\f(CIcode point\fR\f(CW.\fR\f(CIcode point\fR\f(CW...}\*(C'\fR. +The \f(CW\*(C`...\*(C'\fR means any number of these \fIcode point\fRs separated by dots. +This represents the sequence formed by the characters. This is an internal +form only, subject to change, and you should not try to use it yourself.) +.PP +Mnemonic: \fIN\fRamed character. 
+.PP +Note that a character or character sequence expressed as a named +or numbered character is considered a character without special +meaning by the regex engine, and will match "as is". +.PP +Example +.IX Subsection "Example" +.PP +.Vb 1 +\& $str =~ /\eN{THAI CHARACTER SO SO}/; # Matches the Thai SO SO character +\& +\& use charnames \*(AqCyrillic\*(Aq; # Loads Cyrillic names. +\& $str =~ /\eN{ZHE}\eN{KA}/; # Match "ZHE" followed by "KA". +.Ve +.PP +\fIOctal escapes\fR +.IX Subsection "Octal escapes" +.PP +There are two forms of octal escapes. Each is used to specify a character by +its code point specified in base 8. +.PP +One form, available starting in Perl 5.14 looks like \f(CW\*(C`\eo{...}\*(C'\fR, where the dots +represent one or more octal digits. It can be used for any Unicode character. +.PP +It was introduced to avoid the potential problems with the other form, +available in all Perls. That form consists of a backslash followed by three +octal digits. One problem with this form is that it can look exactly like an +old-style backreference (see +"Disambiguation rules between old-style octal escapes and backreferences" +below.) You can avoid this by making the first of the three digits always a +zero, but that makes \e077 the largest code point specifiable. +.PP +In some contexts, a backslash followed by two or even one octal digits may be +interpreted as an octal escape, sometimes with a warning, and because of some +bugs, sometimes with surprising results. Also, if you are creating a regex +out of smaller snippets concatenated together, and you use fewer than three +digits, the beginning of one snippet may be interpreted as adding digits to the +ending of the snippet before it. See "Absolute referencing" for more +discussion and examples of the snippet problem. +.PP +Note that a character expressed as an octal escape is considered +a character without special meaning by the regex engine, and will match +"as is". 
+.PP +To summarize, the \f(CW\*(C`\eo{}\*(C'\fR form is always safe to use, and the other form is +safe to use for code points through \e077 when you use exactly three digits to +specify them. +.PP +Mnemonic: \fI0\fRctal or \fIo\fRctal. +.PP +Examples (assuming an ASCII platform) +.IX Subsection "Examples (assuming an ASCII platform)" +.PP +.Vb 12 +\& $str = "Perl"; +\& $str =~ /\eo{120}/; # Match, "\e120" is "P". +\& $str =~ /\e120/; # Same. +\& $str =~ /\eo{120}+/; # Match, "\e120" is "P", +\& # it\*(Aqs repeated at least once. +\& $str =~ /\e120+/; # Same. +\& $str =~ /P\e053/; # No match, "\e053" is "+" and taken literally. +\& /\eo{23073}/ # Black foreground, white background smiling face. +\& /\eo{4801234567}/ # Raises a warning, and yields chr(4). +\& /\eo{ 400}/ # LATIN CAPITAL LETTER A WITH MACRON +\& /\eo{ 400 }/ # Same. These show blanks are allowed adjacent to +\& # the braces +.Ve +.PP +Disambiguation rules between old-style octal escapes and backreferences +.IX Subsection "Disambiguation rules between old-style octal escapes and backreferences" +.PP +Octal escapes of the \f(CW\*(C`\e000\*(C'\fR form outside of bracketed character classes +potentially clash with old-style backreferences (see "Absolute referencing" +below). They both consist of a backslash followed by numbers. So Perl has to +use heuristics to determine whether it is a backreference or an octal escape. +Perl uses the following rules to disambiguate: +.IP 1. 4 +If the backslash is followed by a single digit, it's a backreference. +.IP 2. 4 +If the first digit following the backslash is a 0, it's an octal escape. +.IP 3. 4 +If the number following the backslash is N (in decimal), and Perl already +has seen N capture groups, Perl considers this a backreference. Otherwise, +it considers it an octal escape. If N has more than three digits, Perl +takes only the first three for the octal escape; the rest are matched as is. 
+.Sp +.Vb 6 +\& my $pat = "(" x 999; +\& $pat .= "a"; +\& $pat .= ")" x 999; +\& /^($pat)\e1000$/; # Matches \*(Aqaa\*(Aq; there are 1000 capture groups. +\& /^$pat\e1000$/; # Matches \*(Aqa@0\*(Aq; there are 999 capture groups +\& # and \e1000 is seen as \e100 (a \*(Aq@\*(Aq) and a \*(Aq0\*(Aq. +.Ve +.PP +You can force a backreference interpretation always by using the \f(CW\*(C`\eg{...}\*(C'\fR +form. You can force an octal interpretation always by using the \f(CW\*(C`\eo{...}\*(C'\fR +form, or for numbers up through \e077 (= 63 decimal), by using three digits, +beginning with a "0". +.PP +\fIHexadecimal escapes\fR +.IX Subsection "Hexadecimal escapes" +.PP +Like octal escapes, there are two forms of hexadecimal escapes, but both start +with the sequence \f(CW\*(C`\ex\*(C'\fR. This is followed by either exactly two hexadecimal +digits forming a number, or a hexadecimal number of arbitrary length surrounded +by curly braces. The hexadecimal number is the code point of the character you +want to express. +.PP +Note that a character expressed as one of these escapes is considered a +character without special meaning by the regex engine, and will match +"as is". +.PP +Mnemonic: he\fIx\fRadecimal. +.PP +Examples (assuming an ASCII platform) +.IX Subsection "Examples (assuming an ASCII platform)" +.PP +.Vb 4 +\& $str = "Perl"; +\& $str =~ /\ex50/; # Match, "\ex50" is "P". +\& $str =~ /\ex50+/; # Match, "\ex50" is "P", it is repeated at least once +\& $str =~ /P\ex2B/; # No match, "\ex2B" is "+" and taken literally. +\& +\& /\ex{2603}\ex{2602}/ # Snowman with an umbrella. +\& # The Unicode character 2603 is a snowman, +\& # the Unicode character 2602 is an umbrella. +\& /\ex{263B}/ # Black smiling face. +\& /\ex{263b}/ # Same, the hex digits A \- F are case insensitive. 
+\& /\ex{ 263b }/ # Same, showing optional blanks adjacent to the +\& # braces +.Ve +.SS Modifiers +.IX Subsection "Modifiers" +A number of backslash sequences have to do with changing the character, +or characters following them. \f(CW\*(C`\el\*(C'\fR will lowercase the character following +it, while \f(CW\*(C`\eu\*(C'\fR will uppercase (or, more accurately, titlecase) the +character following it. They provide functionality similar to the +functions \f(CW\*(C`lcfirst\*(C'\fR and \f(CW\*(C`ucfirst\*(C'\fR. +.PP +To uppercase or lowercase several characters, one might want to use +\&\f(CW\*(C`\eL\*(C'\fR or \f(CW\*(C`\eU\*(C'\fR, which will lowercase/uppercase all characters following +them, until either the end of the pattern or the next occurrence of +\&\f(CW\*(C`\eE\*(C'\fR, whichever comes first. They provide functionality similar to what +the functions \f(CW\*(C`lc\*(C'\fR and \f(CW\*(C`uc\*(C'\fR provide. +.PP +\&\f(CW\*(C`\eQ\*(C'\fR is used to quote (disable) pattern metacharacters, up to the next +\&\f(CW\*(C`\eE\*(C'\fR or the end of the pattern. \f(CW\*(C`\eQ\*(C'\fR adds a backslash to any character +that could have special meaning to Perl. In the ASCII range, it quotes +every character that isn't a letter, digit, or underscore. See +"quotemeta" in perlfunc for details on what gets quoted for non-ASCII +code points. Using this ensures that any character between \f(CW\*(C`\eQ\*(C'\fR and +\&\f(CW\*(C`\eE\*(C'\fR will be matched literally, not interpreted as a metacharacter by +the regex engine. +.PP +\&\f(CW\*(C`\eF\*(C'\fR can be used to casefold all characters following, up to the next \f(CW\*(C`\eE\*(C'\fR +or the end of the pattern. It provides the functionality similar to +the \f(CW\*(C`fc\*(C'\fR function. +.PP +Mnemonic: \fIL\fRowercase, \fIU\fRppercase, \fIF\fRold-case, \fIQ\fRuotemeta, \fIE\fRnd. 
+.PP +Examples +.IX Subsection "Examples" +.PP +.Vb 7 +\& $sid = "sid"; +\& $greg = "GrEg"; +\& $miranda = "(Miranda)"; +\& $str =~ /\eu$sid/; # Matches \*(AqSid\*(Aq +\& $str =~ /\eL$greg/; # Matches \*(Aqgreg\*(Aq +\& $str =~ /\eQ$miranda\eE/; # Matches \*(Aq(Miranda)\*(Aq, as if the pattern +\& # had been written as /\e(Miranda\e)/ +.Ve +.SS "Character classes" +.IX Subsection "Character classes" +Perl regular expressions have a large range of character classes. Some of +the character classes are written as a backslash sequence. We will briefly +discuss those here; full details of character classes can be found in +perlrecharclass. +.PP +\&\f(CW\*(C`\ew\*(C'\fR is a character class that matches any single \fIword\fR character +(letters, digits, Unicode marks, and connector punctuation (like the +underscore)). \f(CW\*(C`\ed\*(C'\fR is a character class that matches any decimal +digit, while the character class \f(CW\*(C`\es\*(C'\fR matches any whitespace character. +New in perl 5.10.0 are the classes \f(CW\*(C`\eh\*(C'\fR and \f(CW\*(C`\ev\*(C'\fR which match horizontal +and vertical whitespace characters. +.PP +The exact set of characters matched by \f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\es\*(C'\fR, and \f(CW\*(C`\ew\*(C'\fR varies +depending on various pragma and regular expression modifiers. It is +possible to restrict the match to the ASCII range by using the \f(CW\*(C`/a\*(C'\fR +regular expression modifier. See perlrecharclass. +.PP +The uppercase variants (\f(CW\*(C`\eW\*(C'\fR, \f(CW\*(C`\eD\*(C'\fR, \f(CW\*(C`\eS\*(C'\fR, \f(CW\*(C`\eH\*(C'\fR, and \f(CW\*(C`\eV\*(C'\fR) are +character classes that match, respectively, any character that isn't a +word character, digit, whitespace, horizontal whitespace, or vertical +whitespace. +.PP +Mnemonics: \fIw\fRord, \fId\fRigit, \fIs\fRpace, \fIh\fRorizontal, \fIv\fRertical. 
+.PP +\fIUnicode classes\fR +.IX Subsection "Unicode classes" +.PP +\&\f(CW\*(C`\epP\*(C'\fR (where \f(CW\*(C`P\*(C'\fR is a single letter) and \f(CW\*(C`\ep{Property}\*(C'\fR are used to +match a character that matches the given Unicode property; properties +include things like "letter", or "thai character". Capitalizing the +sequence to \f(CW\*(C`\ePP\*(C'\fR and \f(CW\*(C`\eP{Property}\*(C'\fR makes the sequence match a character +that doesn't match the given Unicode property. For more details, see +"Backslash sequences" in perlrecharclass and +"Unicode Character Properties" in perlunicode. +.PP +Mnemonic: \fIp\fRroperty. +.SS Referencing +.IX Subsection "Referencing" +If capturing parentheses are used in a regular expression, we can refer +to the part of the source string that was matched, and match exactly the +same thing. There are three ways of referring to such a \fIbackreference\fR: +absolutely, relatively, and by name. +.PP +\fIAbsolute referencing\fR +.IX Subsection "Absolute referencing" +.PP +Either \f(CW\*(C`\eg\fR\f(CIN\fR\f(CW\*(C'\fR (starting in Perl 5.10.0), or \f(CW\*(C`\e\fR\f(CIN\fR\f(CW\*(C'\fR (old-style) where \fIN\fR +is a positive (unsigned) decimal number of any length is an absolute reference +to a capturing group. +.PP +\&\fIN\fR refers to the Nth set of parentheses, so \f(CW\*(C`\eg\fR\f(CIN\fR\f(CW\*(C'\fR refers to whatever has +been matched by that set of parentheses. Thus \f(CW\*(C`\eg1\*(C'\fR refers to the first +capture group in the regex. +.PP +The \f(CW\*(C`\eg\fR\f(CIN\fR\f(CW\*(C'\fR form can be equivalently written as \f(CW\*(C`\eg{\fR\f(CIN\fR\f(CW}\*(C'\fR +which avoids ambiguity when building a regex by concatenating shorter +strings. Otherwise if you had a regex \f(CW\*(C`qr/$a$b/\*(C'\fR, and \f(CW$a\fR contained +\&\f(CW"\eg1"\fR, and \f(CW$b\fR contained \f(CW"37"\fR, you would get \f(CW\*(C`/\eg137/\*(C'\fR which is +probably not what you intended. 
+.PP +In the \f(CW\*(C`\e\fR\f(CIN\fR\f(CW\*(C'\fR form, \fIN\fR must not begin with a "0", and there must be at +least \fIN\fR capturing groups, or else \fIN\fR is considered an octal escape +(but something like \f(CW\*(C`\e18\*(C'\fR is the same as \f(CW\*(C`\e0018\*(C'\fR; that is, the octal escape +\&\f(CW"\e001"\fR followed by a literal digit \f(CW"8"\fR). +.PP +Mnemonic: \fIg\fRroup. +.PP +Examples +.IX Subsection "Examples" +.PP +.Vb 5 +\& /(\ew+) \eg1/; # Finds a duplicated word, (e.g. "cat cat"). +\& /(\ew+) \e1/; # Same thing; written old\-style. +\& /(\ew+) \eg{1}/; # Same, using the safer braced notation +\& /(\ew+) \eg{ 1 }/;# Same, showing optional blanks adjacent to the braces +\& /(.)(.)\eg2\eg1/; # Match a four letter palindrome (e.g. "ABBA"). +.Ve +.PP +\fIRelative referencing\fR +.IX Subsection "Relative referencing" +.PP +\&\f(CW\*(C`\eg\-\fR\f(CIN\fR\f(CW\*(C'\fR (starting in Perl 5.10.0) is used for relative addressing. (It can +be written as \f(CW\*(C`\eg{\-\fR\f(CIN\fR\f(CW}\*(C'\fR.) It refers to the \fIN\fRth group before the +\&\f(CW\*(C`\eg{\-\fR\f(CIN\fR\f(CW}\*(C'\fR. +.PP +The big advantage of this form is that it makes it much easier to write +patterns with references that can be interpolated in larger patterns, +even if the larger pattern also contains capture groups. +.PP +Examples +.IX Subsection "Examples" +.PP +.Vb 8 +\& /(A) # Group 1 +\& ( # Group 2 +\& (B) # Group 3 +\& \eg{\-1} # Refers to group 3 (B) +\& \eg{\-3} # Refers to group 1 (A) +\& \eg{ \-3 } # Same, showing optional blanks adjacent to the braces +\& ) +\& /x; # Matches "ABBA". +\& +\& my $qr = qr /(.)(.)\eg{\-2}\eg{\-1}/; # Matches \*(Aqabab\*(Aq, \*(Aqcdcd\*(Aq, etc. +\& /$qr$qr/ # Matches \*(Aqababcdcd\*(Aq. 
+.Ve +.PP +\fINamed referencing\fR +.IX Subsection "Named referencing" +.PP +\&\f(CW\*(C`\eg{\fR\f(CIname\fR\f(CW}\*(C'\fR (starting in Perl 5.10.0) can be used to back refer to a +named capture group, dispensing completely with having to think about capture +buffer positions. +.PP +To be compatible with .Net regular expressions, \f(CW\*(C`\eg{name}\*(C'\fR may also be +written as \f(CW\*(C`\ek{name}\*(C'\fR, \f(CW\*(C`\ek<name>\*(C'\fR or \f(CW\*(C`\ek\*(Aqname\*(Aq\*(C'\fR. +.PP +To prevent any ambiguity, \fIname\fR must not start with a digit nor contain a +hyphen. +.PP +Examples +.IX Subsection "Examples" +.PP +.Vb 10 +\& /(?<word>\ew+) \eg{word}/ # Finds duplicated word, (e.g. "cat cat") +\& /(?<word>\ew+) \ek{word}/ # Same. +\& /(?<word>\ew+) \eg{ word }/ # Same, showing optional blanks adjacent to +\& # the braces +\& /(?<word>\ew+) \ek{ word }/ # Same. +\& /(?<word>\ew+) \ek<word>/ # Same. There are no braces, so no blanks +\& # are permitted +\& /(?<letter1>.)(?<letter2>.)\eg{letter2}\eg{letter1}/ +\& # Match a four letter palindrome (e.g. +\& # "ABBA") +.Ve +.SS Assertions +.IX Subsection "Assertions" +Assertions are conditions that have to be true; they don't actually +match parts of the substring. There are six assertions that are written as +backslash sequences. +.IP \eA 4 +.IX Item "A" +\&\f(CW\*(C`\eA\*(C'\fR only matches at the beginning of the string. If the \f(CW\*(C`/m\*(C'\fR modifier +isn't used, then \f(CW\*(C`/\eA/\*(C'\fR is equivalent to \f(CW\*(C`/^/\*(C'\fR. However, if the \f(CW\*(C`/m\*(C'\fR +modifier is used, then \f(CW\*(C`/^/\*(C'\fR matches internal newlines, but the meaning +of \f(CW\*(C`/\eA/\*(C'\fR isn't changed by the \f(CW\*(C`/m\*(C'\fR modifier. \f(CW\*(C`\eA\*(C'\fR matches at the beginning +of the string regardless whether the \f(CW\*(C`/m\*(C'\fR modifier is used. +.IP "\ez, \eZ" 4 +.IX Item "z, Z" +\&\f(CW\*(C`\ez\*(C'\fR and \f(CW\*(C`\eZ\*(C'\fR match at the end of the string. 
If the \f(CW\*(C`/m\*(C'\fR modifier isn't +used, then \f(CW\*(C`/\eZ/\*(C'\fR is equivalent to \f(CW\*(C`/$/\*(C'\fR; that is, it matches at the +end of the string, or one before the newline at the end of the string. If the +\&\f(CW\*(C`/m\*(C'\fR modifier is used, then \f(CW\*(C`/$/\*(C'\fR matches at internal newlines, but the +meaning of \f(CW\*(C`/\eZ/\*(C'\fR isn't changed by the \f(CW\*(C`/m\*(C'\fR modifier. \f(CW\*(C`\eZ\*(C'\fR matches at +the end of the string (or just before a trailing newline) regardless whether +the \f(CW\*(C`/m\*(C'\fR modifier is used. +.Sp +\&\f(CW\*(C`\ez\*(C'\fR is just like \f(CW\*(C`\eZ\*(C'\fR, except that it does not match before a trailing +newline. \f(CW\*(C`\ez\*(C'\fR matches at the end of the string only, regardless of the +modifiers used, and not just before a newline. It is how to anchor the +match to the true end of the string under all conditions. +.IP \eG 4 +.IX Item "G" +\&\f(CW\*(C`\eG\*(C'\fR is usually used only in combination with the \f(CW\*(C`/g\*(C'\fR modifier. If the +\&\f(CW\*(C`/g\*(C'\fR modifier is used and the match is done in scalar context, Perl +remembers where in the source string the last match ended, and the next time, +it will start the match from where it ended the previous time. +.Sp +\&\f(CW\*(C`\eG\*(C'\fR matches the point where the previous match on that string ended, +or the beginning of that string if there was no previous match. +.Sp +Mnemonic: \fIG\fRlobal. +.IP "\eb{}, \eb, \eB{}, \eB" 4 +.IX Item "b{}, b, B{}, B" +\&\f(CW\*(C`\eb{...}\*(C'\fR, available starting in v5.22, matches a boundary (between two +characters, or before the first character of the string, or after the +final character of the string) based on the Unicode rules for the +boundary type specified inside the braces. The boundary +types are given a few paragraphs below. \f(CW\*(C`\eB{...}\*(C'\fR matches at any place +between characters where \f(CW\*(C`\eb{...}\*(C'\fR of the same type doesn't match. 
+.Sp +\&\f(CW\*(C`\eb\*(C'\fR when not immediately followed by a \f(CW"{"\fR is available in all +Perls. It matches at any place +between a word (something matched by \f(CW\*(C`\ew\*(C'\fR) and a non-word character +(\f(CW\*(C`\eW\*(C'\fR); \f(CW\*(C`\eB\*(C'\fR when not immediately followed by a \f(CW"{"\fR matches at any +place between characters where \f(CW\*(C`\eb\*(C'\fR doesn't match. To get better +word matching of natural language text, see "\eb{wb}" below. +.Sp +\&\f(CW\*(C`\eb\*(C'\fR +and \f(CW\*(C`\eB\*(C'\fR assume there's a non-word character before the beginning and after +the end of the source string; so \f(CW\*(C`\eb\*(C'\fR will match at the beginning (or end) +of the source string if the source string begins (or ends) with a word +character. Otherwise, \f(CW\*(C`\eB\*(C'\fR will match. +.Sp +Do not use something like \f(CW\*(C`\eb=head\ed\eb\*(C'\fR and expect it to match the +beginning of a line. It can't, because for there to be a boundary before +the non-word "=", there must be a word character immediately previous. +All plain \f(CW\*(C`\eb\*(C'\fR and \f(CW\*(C`\eB\*(C'\fR boundary determinations look for word +characters alone, not for +non-word characters nor for string ends. It may help to understand how +\&\f(CW\*(C`\eb\*(C'\fR and \f(CW\*(C`\eB\*(C'\fR work by equating them as follows: +.Sp +.Vb 2 +\& \eb really means (?:(?<=\ew)(?!\ew)|(?<!\ew)(?=\ew)) +\& \eB really means (?:(?<=\ew)(?=\ew)|(?<!\ew)(?!\ew)) +.Ve +.Sp +In contrast, \f(CW\*(C`\eb{...}\*(C'\fR and \f(CW\*(C`\eB{...}\*(C'\fR may or may not match at the +beginning and end of the line, depending on the boundary type. These +implement the Unicode default boundaries, specified in +<https://www.unicode.org/reports/tr14/> and +<https://www.unicode.org/reports/tr29/>. +The boundary types are: +.RS 4 +.ie n .IP """\eb{gcb}"" or ""\eb{g}""" 4 +.el .IP "\f(CW\eb{gcb}\fR or \f(CW\eb{g}\fR" 4 +.IX Item "b{gcb} or b{g}" +This matches a Unicode "Grapheme Cluster Boundary". 
(Actually Perl +always uses the improved "extended" grapheme cluster). These are +explained below under \f(CW"\eX"\fR. In fact, \f(CW\*(C`\eX\*(C'\fR is another way to get +the same functionality. It is equivalent to \f(CW\*(C`/.+?\eb{gcb}/\*(C'\fR. Use +whichever is most convenient for your situation. +.ie n .IP """\eb{lb}""" 4 +.el .IP \f(CW\eb{lb}\fR 4 +.IX Item "b{lb}" +This matches according to the default Unicode Line Breaking Algorithm +(<https://www.unicode.org/reports/tr14/>), as customized in that +document +(Example 7 of revision 35 <https://www.unicode.org/reports/tr14/tr14-35.html#Example7>) +for better handling of numeric expressions. +.Sp +This is suitable for many purposes, but the Unicode::LineBreak module +is available on CPAN that provides many more features, including +customization. +.ie n .IP """\eb{sb}""" 4 +.el .IP \f(CW\eb{sb}\fR 4 +.IX Item "b{sb}" +This matches a Unicode "Sentence Boundary". This is an aid to parsing +natural language sentences. It gives good, but imperfect results. For +example, it thinks that "Mr. Smith" is two sentences. More details are +at <https://www.unicode.org/reports/tr29/>. Note also that it thinks +that anything matching "\eR" (except form feed and vertical tab) is a +sentence boundary. \f(CW\*(C`\eb{sb}\*(C'\fR works with text designed for +word-processors which wrap lines +automatically for display, but hard-coded line boundaries are considered +to be essentially the ends of text blocks (paragraphs really), and hence +the ends of sentences. \f(CW\*(C`\eb{sb}\*(C'\fR doesn't do well with text containing +embedded newlines, like the source text of the document you are reading. +Such text needs to be preprocessed to get rid of the line separators +before looking for sentence boundaries. Some people view this as a bug +in the Unicode standard, and this behavior is quite subject to change in +future Perl versions. 
+.ie n .IP """\eb{wb}""" 4 +.el .IP \f(CW\eb{wb}\fR 4 +.IX Item "b{wb}" +This matches a Unicode "Word Boundary", but tailored to Perl +expectations. This gives better (though not +perfect) results for natural language processing than plain \f(CW\*(C`\eb\*(C'\fR +(without braces) does. For example, it understands that apostrophes can +be in the middle of words and that parentheses aren't (see the examples +below). More details are at <https://www.unicode.org/reports/tr29/>. +.Sp +The current Unicode definition of a Word Boundary matches between every +white space character. Perl tailors this, starting in version 5.24, to +generally not break up spans of white space, just as plain \f(CW\*(C`\eb\*(C'\fR has +always functioned. This allows \f(CW\*(C`\eb{wb}\*(C'\fR to be a drop-in replacement for +\&\f(CW\*(C`\eb\*(C'\fR, but with generally better results for natural language +processing. (The exception to this tailoring is when a span of white +space is immediately followed by something like U+0303, COMBINING TILDE. +If the final space character in the span is a horizontal white space, it +is broken out so that it attaches instead to the combining character. +To be precise, if a span of white space that ends in a horizontal space +has the character immediately following it have any of the Word +Boundary property values "Extend", "Format" or "ZWJ", the boundary between the +final horizontal space character and the rest of the span matches +\&\f(CW\*(C`\eb{wb}\*(C'\fR. In all other cases the boundary between two white space +characters matches \f(CW\*(C`\eB{wb}\*(C'\fR.) +.RE +.RS 4 +.Sp +It is important to realize when you use these Unicode boundaries, +that you are taking a risk that a future version of Perl which contains +a later version of the Unicode Standard will not work precisely the same +way as it did when your code was written. These rules are not +considered stable and have been somewhat more subject to change than the +rest of the Standard. 
Unicode reserves the right to change them at +will, and Perl reserves the right to update its implementation to +Unicode's new rules. In the past, some changes have been because new +characters have been added to the Standard which have different +characteristics than all previous characters, so new rules are +formulated for handling them. These should not cause any backward +compatibility issues. But some changes have changed the treatment of +existing characters because the Unicode Technical Committee has decided +that the change is warranted for whatever reason. This could be to fix +a bug, or because they think better results are obtained with the new +rule. +.Sp +It is also important to realize that these are default boundary +definitions, and that implementations may wish to tailor the results for +particular purposes and locales. For example, some languages, such as +Japanese and Thai, require dictionary lookup to accurately determine +word boundaries. +.Sp +Mnemonic: \fIb\fRoundary. +.RE +.PP +Examples +.IX Subsection "Examples" +.PP +.Vb 4 +\& "cat" =~ /\eAcat/; # Match. +\& "cat" =~ /cat\eZ/; # Match. +\& "cat\en" =~ /cat\eZ/; # Match. +\& "cat\en" =~ /cat\ez/; # No match. +\& +\& "cat" =~ /\ebcat\eb/; # Matches. +\& "cats" =~ /\ebcat\eb/; # No match. +\& "cat" =~ /\ebcat\eB/; # No match. +\& "cats" =~ /\ebcat\eB/; # Match. +\& +\& while ("cat dog" =~ /(\ew+)/g) { +\& print $1; # Prints \*(Aqcatdog\*(Aq +\& } +\& while ("cat dog" =~ /\eG(\ew+)/g) { +\& print $1; # Prints \*(Aqcat\*(Aq +\& } +\& +\& my $s = "He said, \e"Is pi 3.14? (I\*(Aqm not sure).\e""; +\& print join("|", $s =~ m/ ( .+? \eb ) /xg), "\en"; +\& print join("|", $s =~ m/ ( .+? \eb{wb} ) /xg), "\en"; +\& prints +\& He| |said|, "|Is| |pi| |3|.|14|? (|I|\*(Aq|m| |not| |sure +\& He| |said|,| |"|Is| |pi| |3.14|?| |(|I\*(Aqm| |not| |sure|)|.|" +.Ve +.SS Misc +.IX Subsection "Misc" +Here we document the backslash sequences that don't fall in one of the +categories above. 
These are: +.IP \eK 4 +.IX Item "K" +This appeared in perl 5.10.0. Anything matched left of \f(CW\*(C`\eK\*(C'\fR is +not included in \f(CW$&\fR, and will not be replaced if the pattern is +used in a substitution. This lets you write \f(CW\*(C`s/PAT1 \eK PAT2/REPL/x\*(C'\fR +instead of \f(CW\*(C`s/(PAT1) PAT2/${1}REPL/x\*(C'\fR or \f(CW\*(C`s/(?<=PAT1) PAT2/REPL/x\*(C'\fR. +.Sp +Mnemonic: \fIK\fReep. +.IP \eN 4 +.IX Item "N" +This feature, available starting in v5.12, matches any character +that is \fBnot\fR a newline. It is a short-hand for writing \f(CW\*(C`[^\en]\*(C'\fR, and is +identical to the \f(CW\*(C`.\*(C'\fR metasymbol, except under the \f(CW\*(C`/s\*(C'\fR flag, which changes +the meaning of \f(CW\*(C`.\*(C'\fR, but not \f(CW\*(C`\eN\*(C'\fR. +.Sp +Note that \f(CW\*(C`\eN{...}\*(C'\fR can mean a +named or numbered character +\&. +.Sp +Mnemonic: Complement of \fI\en\fR. +.IP \eR 4 +.IX Xref "\\R" +.IX Item "R" +\&\f(CW\*(C`\eR\*(C'\fR matches a \fIgeneric newline\fR; that is, anything considered a +linebreak sequence by Unicode. This includes all characters matched by +\&\f(CW\*(C`\ev\*(C'\fR (vertical whitespace), and the multi character sequence \f(CW"\ex0D\ex0A"\fR +(carriage return followed by a line feed, sometimes called the network +newline; it's the end of line sequence used in Microsoft text files opened +in binary mode). \f(CW\*(C`\eR\*(C'\fR is equivalent to \f(CW\*(C`(?>\ex0D\ex0A|\ev)\*(C'\fR. (The +reason it doesn't backtrack is that the sequence is considered +inseparable. That means that +.Sp +.Vb 1 +\& "\ex0D\ex0A" =~ /^\eR\ex0A$/ # No match +.Ve +.Sp +fails, because the \f(CW\*(C`\eR\*(C'\fR matches the entire string, and won't backtrack +to match just the \f(CW"\ex0D"\fR.) Since +\&\f(CW\*(C`\eR\*(C'\fR can match a sequence of more than one character, it cannot be put +inside a bracketed character class; \f(CW\*(C`/[\eR]/\*(C'\fR is an error; use \f(CW\*(C`\ev\*(C'\fR +instead. \f(CW\*(C`\eR\*(C'\fR was introduced in perl 5.10.0. 
+.Sp +Note that this does not respect any locale that might be in effect; it +matches according to the platform's native character set. +.Sp +Mnemonic: none really. \f(CW\*(C`\eR\*(C'\fR was picked because PCRE already uses \f(CW\*(C`\eR\*(C'\fR, +and more importantly because Unicode recommends such a regular expression +metacharacter, and suggests \f(CW\*(C`\eR\*(C'\fR as its notation. +.IP \eX 4 +.IX Xref "\\X" +.IX Item "X" +This matches a Unicode \fIextended grapheme cluster\fR. +.Sp +\&\f(CW\*(C`\eX\*(C'\fR matches quite well what normal (non-Unicode-programmer) usage +would consider a single character. As an example, consider a G with some sort +of diacritic mark, such as an arrow. There is no such single character in +Unicode, but one can be composed by using a G followed by a Unicode "COMBINING +UPWARDS ARROW BELOW", and it would be displayed by Unicode-aware software as if it +were a single character. +.Sp +The match is greedy and non-backtracking, so that the cluster is never +broken up into smaller components. +.Sp +See also \f(CW\*(C`\eb{gcb}\*(C'\fR. +.Sp +Mnemonic: e\fIX\fRtended Unicode character. +.PP +Examples +.IX Subsection "Examples" +.PP +.Vb 2 +\& $str =~ s/foo\eKbar/baz/g; # Change any \*(Aqbar\*(Aq following a \*(Aqfoo\*(Aq to \*(Aqbaz\*(Aq +\& $str =~ s/(.)\eK\eg1//g; # Delete duplicated characters. +\& +\& "\en" =~ /^\eR$/; # Match, \en is a generic newline. +\& "\er" =~ /^\eR$/; # Match, \er is a generic newline. +\& "\er\en" =~ /^\eR$/; # Match, \er\en is a generic newline. +\& +\& "P\ex{307}" =~ /^\eX$/; # \eX matches a P with a dot above. +.Ve
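The lookaround equivalents given for plain \b and \B, and the non-backtracking behaviour described for \R, can be checked with a short script. What follows is a minimal sketch rather than part of the man page: the helper name match_offsets and the sample strings are made up for illustration, and it assumes Perl 5.10 or later, since \R and \v are used.

```perl
use strict;
use warnings;

# Lookaround expansions of plain \b and \B, as quoted in the text.
my $b_expanded = qr/(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))/;
my $B_expanded = qr/(?:(?<=\w)(?=\w)|(?<!\w)(?!\w))/;

# Hypothetical helper (not a Perl built-in): returns the offsets at
# which a zero-width pattern matches, joined by spaces.
sub match_offsets {
    my ($re, $str) = @_;
    my @offsets;
    push @offsets, pos($str) while $str =~ /$re/g;
    return "@offsets";
}

for my $str ("cat dog", "=head1 NAME") {
    print qq("$str"\n);
    print "  \\b        at: ", match_offsets(qr/\b/, $str),     "\n";
    print "  expansion at: ", match_offsets($b_expanded, $str), "\n";
    print "  \\B        at: ", match_offsets(qr/\B/, $str),     "\n";
    print "  expansion at: ", match_offsets($B_expanded, $str), "\n";
}

# \R is atomic: once it has consumed CR LF it will not give the LF back.
my $crlf = "\x0D\x0A";
print "atomic \\R:       ",
    ($crlf =~ /^\R\x0A$/ ? "match" : "no match"), "\n";
# The same alternation without (?>...) may backtrack to \v matching CR alone.
print "non-atomic form: ",
    ($crlf =~ /^(?:\x0D\x0A|\v)\x0A$/ ? "match" : "no match"), "\n";
```

For "=head1 NAME" neither \b nor its expansion reports offset 0, which is why a pattern such as \b=head\d\b cannot match at the beginning of a line.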