Diffstat (limited to 'upstream/debian-unstable/man1/perlretut.1')
-rw-r--r-- | upstream/debian-unstable/man1/perlretut.1 | 3219 |
1 files changed, 3219 insertions, 0 deletions
diff --git a/upstream/debian-unstable/man1/perlretut.1 b/upstream/debian-unstable/man1/perlretut.1 new file mode 100644 index 00000000..d73273f5 --- /dev/null +++ b/upstream/debian-unstable/man1/perlretut.1 @@ -0,0 +1,3219 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLRETUT 1" +.TH PERLRETUT 1 2024-01-12 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. +.if n .ad l +.nh +.SH NAME +perlretut \- Perl regular expressions tutorial +.SH DESCRIPTION +.IX Header "DESCRIPTION" +This page provides a basic tutorial on understanding, creating and +using regular expressions in Perl. 
It serves as a complement to the +reference page on regular expressions perlre. Regular expressions +are an integral part of the \f(CW\*(C`m//\*(C'\fR, \f(CW\*(C`s///\*(C'\fR, \f(CW\*(C`qr//\*(C'\fR and \f(CW\*(C`split\*(C'\fR +operators and so this tutorial also overlaps with +"Regexp Quote-Like Operators" in perlop and "split" in perlfunc. +.PP +Perl is widely renowned for excellence in text processing, and regular +expressions are one of the big factors behind this fame. Perl regular +expressions display an efficiency and flexibility unknown in most +other computer languages. Mastering even the basics of regular +expressions will allow you to manipulate text with surprising ease. +.PP +What is a regular expression? At its most basic, a regular expression +is a template that is used to determine if a string has certain +characteristics. The string is most often some text, such as a line, +sentence, web page, or even a whole book, but it doesn't have to be. It +could be binary data, for example. Biologists often use Perl to look +for patterns in long DNA sequences. +.PP +Suppose we want to determine if the text in variable, \f(CW$var\fR contains +the sequence of characters \f(CW\*(C`m\ u\ s\ h\ r\ o\ o\ m\*(C'\fR +(blanks added for legibility). We can write in Perl +.PP +.Vb 1 +\& $var =~ m/mushroom/ +.Ve +.PP +The value of this expression will be TRUE if \f(CW$var\fR contains that +sequence of characters anywhere within it, and FALSE otherwise. The +portion enclosed in \f(CW\*(Aq/\*(Aq\fR characters denotes the characteristic we +are looking for. +We use the term \fIpattern\fR for it. The process of looking to see if the +pattern occurs in the string is called \fImatching\fR, and the \f(CW"=~"\fR +operator along with the \f(CW\*(C`m//\*(C'\fR tell Perl to try to match the pattern +against the string. Note that the pattern is also a string, but a very +special kind of one, as we will see. 
Patterns are in common use these +days; +examples are the patterns typed into a search engine to find web pages +and the patterns used to list files in a directory, \fIe.g.\fR, "\f(CW\*(C`ls *.txt\*(C'\fR" +or "\f(CW\*(C`dir *.*\*(C'\fR". In Perl, the patterns described by regular expressions +are used not only to search strings, but also to extract desired parts +of strings, and to do search and replace operations. +.PP +Regular expressions have the undeserved reputation of being abstract +and difficult to understand. This reputation stems simply from the fact that the +notation used to express them tends to be terse and dense, not +from any inherent complexity. We recommend using the \f(CW\*(C`/x\*(C'\fR regular +expression modifier (described below) along with plenty of white space +to make them less dense, and easier to read. Regular expressions are +constructed using +simple concepts like conditionals and loops and are no more difficult +to understand than the corresponding \f(CW\*(C`if\*(C'\fR conditionals and \f(CW\*(C`while\*(C'\fR +loops in the Perl language itself. +.PP +This tutorial flattens the learning curve by discussing regular +expression concepts, along with their notation, one at a time and with +many examples. The first part of the tutorial will progress from the +simplest word searches to the basic regular expression concepts. If +you master the first part, you will have all the tools needed to meet +about 98% of your needs. The second part of the tutorial is for those +comfortable with the basics, and hungry for more power tools. It +discusses the more advanced regular expression operators and +introduces the latest cutting-edge innovations. +.PP +A note: to save time, "regular expression" is often abbreviated as +regexp or regex. Regexp is a more natural abbreviation than regex, but +is harder to pronounce. The Perl pod documentation is evenly split on +regexp vs regex; in Perl, there is more than one way to abbreviate it. 
+We'll use regexp in this tutorial. +.PP +New in v5.22, \f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR applies stricter +rules than otherwise when compiling regular expression patterns. It can +find things that, while legal, may not be what you intended. +.SH "Part 1: The basics" +.IX Header "Part 1: The basics" +.SS "Simple word matching" +.IX Subsection "Simple word matching" +The simplest regexp is simply a word, or more generally, a string of +characters. A regexp consisting of just a word matches any string that +contains that word: +.PP +.Vb 1 +\& "Hello World" =~ /World/; # matches +.Ve +.PP +What is this Perl statement all about? \f(CW"Hello World"\fR is a simple +double-quoted string. \f(CW\*(C`World\*(C'\fR is the regular expression and the +\&\f(CW\*(C`//\*(C'\fR enclosing \f(CW\*(C`/World/\*(C'\fR tells Perl to search a string for a match. +The operator \f(CW\*(C`=~\*(C'\fR associates the string with the regexp match and +produces a true value if the regexp matched, or false if the regexp +did not match. In our case, \f(CW\*(C`World\*(C'\fR matches the second word in +\&\f(CW"Hello World"\fR, so the expression is true. Expressions like this +are useful in conditionals: +.PP +.Vb 6 +\& if ("Hello World" =~ /World/) { +\& print "It matches\en"; +\& } +\& else { +\& print "It doesn\*(Aqt match\en"; +\& } +.Ve +.PP +There are useful variations on this theme. 
The sense of the match can +be reversed by using the \f(CW\*(C`!~\*(C'\fR operator: +.PP +.Vb 6 +\& if ("Hello World" !~ /World/) { +\& print "It doesn\*(Aqt match\en"; +\& } +\& else { +\& print "It matches\en"; +\& } +.Ve +.PP +The literal string in the regexp can be replaced by a variable: +.PP +.Vb 7 +\& my $greeting = "World"; +\& if ("Hello World" =~ /$greeting/) { +\& print "It matches\en"; +\& } +\& else { +\& print "It doesn\*(Aqt match\en"; +\& } +.Ve +.PP +If you're matching against the special default variable \f(CW$_\fR, the +\&\f(CW\*(C`$_ =~\*(C'\fR part can be omitted: +.PP +.Vb 7 +\& $_ = "Hello World"; +\& if (/World/) { +\& print "It matches\en"; +\& } +\& else { +\& print "It doesn\*(Aqt match\en"; +\& } +.Ve +.PP +And finally, the \f(CW\*(C`//\*(C'\fR default delimiters for a match can be changed +to arbitrary delimiters by putting an \f(CW\*(Aqm\*(Aq\fR out front: +.PP +.Vb 4 +\& "Hello World" =~ m!World!; # matches, delimited by \*(Aq!\*(Aq +\& "Hello World" =~ m{World}; # matches, note the paired \*(Aq{}\*(Aq +\& "/usr/bin/perl" =~ m"/perl"; # matches after \*(Aq/usr/bin\*(Aq, +\& # \*(Aq/\*(Aq becomes an ordinary char +.Ve +.PP +\&\f(CW\*(C`/World/\*(C'\fR, \f(CW\*(C`m!World!\*(C'\fR, and \f(CW\*(C`m{World}\*(C'\fR all represent the +same thing. When, \fIe.g.\fR, the quote (\f(CW\*(Aq"\*(Aq\fR) is used as a delimiter, the forward +slash \f(CW\*(Aq/\*(Aq\fR becomes an ordinary character and can be used in this regexp +without trouble. +.PP +Let's consider how different regexps would match \f(CW"Hello World"\fR: +.PP +.Vb 4 +\& "Hello World" =~ /world/; # doesn\*(Aqt match +\& "Hello World" =~ /o W/; # matches +\& "Hello World" =~ /oW/; # doesn\*(Aqt match +\& "Hello World" =~ /World /; # doesn\*(Aqt match +.Ve +.PP +The first regexp \f(CW\*(C`world\*(C'\fR doesn't match because regexps are by default +case-sensitive. The second regexp matches because the substring +\&\f(CW\*(Aqo\ W\*(Aq\fR occurs in the string \f(CW"Hello\ World"\fR. 
The space +character \f(CW\*(Aq \*(Aq\fR is treated like any other character in a regexp and is +needed to match in this case. The lack of a space character is the +reason the third regexp \f(CW\*(AqoW\*(Aq\fR doesn't match. The fourth regexp +"\f(CW\*(C`World \*(C'\fR" doesn't match because there is a space at the end of the +regexp, but not at the end of the string. The lesson here is that +regexps must match a part of the string \fIexactly\fR in order for the +statement to be true. +.PP +If a regexp matches in more than one place in the string, Perl will +always match at the earliest possible point in the string: +.PP +.Vb 2 +\& "Hello World" =~ /o/; # matches \*(Aqo\*(Aq in \*(AqHello\*(Aq +\& "That hat is red" =~ /hat/; # matches \*(Aqhat\*(Aq in \*(AqThat\*(Aq +.Ve +.PP +With respect to character matching, there are a few more points you +need to know about. First of all, not all characters can be used +"as-is" in a match. Some characters, called \fImetacharacters\fR, are +generally reserved for use in regexp notation. The metacharacters are +.PP +.Vb 1 +\& {}[]()^$.|*+?\-#\e +.Ve +.PP +This list is not as definitive as it may appear (or be claimed to be in +other documentation). For example, \f(CW"#"\fR is a metacharacter only when +the \f(CW\*(C`/x\*(C'\fR pattern modifier (described below) is used, and both \f(CW"}"\fR +and \f(CW"]"\fR are metacharacters only when paired with opening \f(CW"{"\fR or +\&\f(CW"["\fR respectively; other gotchas apply. +.PP +The significance of each of these will be explained +in the rest of the tutorial, but for now, it is important only to know +that a metacharacter can be matched as-is by putting a backslash before +it: +.PP +.Vb 5 +\& "2+2=4" =~ /2+2/; # doesn\*(Aqt match, + is a metacharacter +\& "2+2=4" =~ /2\e+2/; # matches, \e+ is treated like an ordinary + +\& "The interval is [0,1)." =~ /[0,1)./ # is a syntax error! +\& "The interval is [0,1)." 
=~ /\e[0,1\e)\e./ # matches +\& "#!/usr/bin/perl" =~ /#!\e/usr\e/bin\e/perl/; # matches +.Ve +.PP +In the last regexp, the forward slash \f(CW\*(Aq/\*(Aq\fR is also backslashed, +because it is used to delimit the regexp. This can lead to LTS +(leaning toothpick syndrome), however, and it is often more readable +to change delimiters. +.PP +.Vb 1 +\& "#!/usr/bin/perl" =~ m!#\e!/usr/bin/perl!; # easier to read +.Ve +.PP +The backslash character \f(CW\*(Aq\e\*(Aq\fR is a metacharacter itself and needs to +be backslashed: +.PP +.Vb 1 +\& \*(AqC:\eWIN32\*(Aq =~ /C:\e\eWIN/; # matches +.Ve +.PP +In situations where it doesn't make sense for a particular metacharacter +to mean what it normally does, it automatically loses its +metacharacter-ness and becomes an ordinary character that is to be +matched literally. For example, the \f(CW\*(Aq}\*(Aq\fR is a metacharacter only when +it is the mate of a \f(CW\*(Aq{\*(Aq\fR metacharacter. Otherwise it is treated as a +literal RIGHT CURLY BRACKET. This may lead to unexpected results. +\&\f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR can catch some of these. +.PP +In addition to the metacharacters, there are some ASCII characters +which don't have printable character equivalents and are instead +represented by \fIescape sequences\fR. Common examples are \f(CW\*(C`\et\*(C'\fR for a +tab, \f(CW\*(C`\en\*(C'\fR for a newline, \f(CW\*(C`\er\*(C'\fR for a carriage return and \f(CW\*(C`\ea\*(C'\fR for a +bell (or alert). If your string is better thought of as a sequence of arbitrary +bytes, the octal escape sequence, \fIe.g.\fR, \f(CW\*(C`\e033\*(C'\fR, or hexadecimal escape +sequence, \fIe.g.\fR, \f(CW\*(C`\ex1B\*(C'\fR may be a more natural representation for your +bytes. 
Here are some examples of escapes: +.PP +.Vb 5 +\& "1000\et2000" =~ m(0\et2) # matches +\& "1000\en2000" =~ /0\en20/ # matches +\& "1000\et2000" =~ /\e000\et2/ # doesn\*(Aqt match, "0" ne "\e000" +\& "cat" =~ /\eo{143}\ex61\ex74/ # matches in ASCII, but a weird way +\& # to spell cat +.Ve +.PP +If you've been around Perl a while, all this talk of escape sequences +may seem familiar. Similar escape sequences are used in double-quoted +strings and in fact the regexps in Perl are mostly treated as +double-quoted strings. This means that variables can be used in +regexps as well. Just like double-quoted strings, the values of the +variables in the regexp will be substituted in before the regexp is +evaluated for matching purposes. So we have: +.PP +.Vb 4 +\& $foo = \*(Aqhouse\*(Aq; +\& \*(Aqhousecat\*(Aq =~ /$foo/; # matches +\& \*(Aqcathouse\*(Aq =~ /cat$foo/; # matches +\& \*(Aqhousecat\*(Aq =~ /${foo}cat/; # matches +.Ve +.PP +So far, so good. With the knowledge above you can already perform +searches with just about any literal string regexp you can dream up. +Here is a \fIvery simple\fR emulation of the Unix grep program: +.PP +.Vb 7 +\& % cat > simple_grep +\& #!/usr/bin/perl +\& $regexp = shift; +\& while (<>) { +\& print if /$regexp/; +\& } +\& ^D +\& +\& % chmod +x simple_grep +\& +\& % simple_grep abba /usr/dict/words +\& Babbage +\& cabbage +\& cabbages +\& sabbath +\& Sabbathize +\& Sabbathizes +\& sabbatical +\& scabbard +\& scabbards +.Ve +.PP +This program is easy to understand. \f(CW\*(C`#!/usr/bin/perl\*(C'\fR is the standard +way to invoke a perl program from the shell. +\&\f(CW\*(C`$regexp\ =\ shift;\*(C'\fR saves the first command line argument as the +regexp to be used, leaving the rest of the command line arguments to +be treated as files. \f(CW\*(C`while\ (<>)\*(C'\fR loops over all the lines in +all the files. For each line, \f(CW\*(C`print\ if\ /$regexp/;\*(C'\fR prints the +line if the regexp matches the line. 
In this line, both \f(CW\*(C`print\*(C'\fR and +\&\f(CW\*(C`/$regexp/\*(C'\fR use the default variable \f(CW$_\fR implicitly. +.PP +With all of the regexps above, if the regexp matched anywhere in the +string, it was considered a match. Sometimes, however, we'd like to +specify \fIwhere\fR in the string the regexp should try to match. To do +this, we would use the \fIanchor\fR metacharacters \f(CW\*(Aq^\*(Aq\fR and \f(CW\*(Aq$\*(Aq\fR. The +anchor \f(CW\*(Aq^\*(Aq\fR means match at the beginning of the string and the anchor +\&\f(CW\*(Aq$\*(Aq\fR means match at the end of the string, or before a newline at the +end of the string. Here is how they are used: +.PP +.Vb 4 +\& "housekeeper" =~ /keeper/; # matches +\& "housekeeper" =~ /^keeper/; # doesn\*(Aqt match +\& "housekeeper" =~ /keeper$/; # matches +\& "housekeeper\en" =~ /keeper$/; # matches +.Ve +.PP +The second regexp doesn't match because \f(CW\*(Aq^\*(Aq\fR constrains \f(CW\*(C`keeper\*(C'\fR to +match only at the beginning of the string, but \f(CW"housekeeper"\fR has +keeper starting in the middle. The third regexp does match, since the +\&\f(CW\*(Aq$\*(Aq\fR constrains \f(CW\*(C`keeper\*(C'\fR to match only at the end of the string. +.PP +When both \f(CW\*(Aq^\*(Aq\fR and \f(CW\*(Aq$\*(Aq\fR are used at the same time, the regexp has to +match both the beginning and the end of the string, \fIi.e.\fR, the regexp +matches the whole string. Consider +.PP +.Vb 3 +\& "keeper" =~ /^keep$/; # doesn\*(Aqt match +\& "keeper" =~ /^keeper$/; # matches +\& "" =~ /^$/; # ^$ matches an empty string +.Ve +.PP +The first regexp doesn't match because the string has more to it than +\&\f(CW\*(C`keep\*(C'\fR. Since the second regexp is exactly the string, it +matches. Using both \f(CW\*(Aq^\*(Aq\fR and \f(CW\*(Aq$\*(Aq\fR in a regexp forces the complete +string to match, so it gives you complete control over which strings +match and which don't. 
Suppose you are looking for a fellow named +bert, off in a string by himself: +.PP +.Vb 1 +\& "dogbert" =~ /bert/; # matches, but not what you want +\& +\& "dilbert" =~ /^bert/; # doesn\*(Aqt match, but .. +\& "bertram" =~ /^bert/; # matches, so still not good enough +\& +\& "bertram" =~ /^bert$/; # doesn\*(Aqt match, good +\& "dilbert" =~ /^bert$/; # doesn\*(Aqt match, good +\& "bert" =~ /^bert$/; # matches, perfect +.Ve +.PP +Of course, in the case of a literal string, one could just as easily +use the string comparison \f(CW\*(C`$string\ eq\ \*(Aqbert\*(Aq\*(C'\fR and it would be +more efficient. The \f(CW\*(C`^...$\*(C'\fR regexp really becomes useful when we +add in the more powerful regexp tools below. +.SS "Using character classes" +.IX Subsection "Using character classes" +Although one can already do quite a lot with the literal string +regexps above, we've only scratched the surface of regular expression +technology. In this and subsequent sections we will introduce regexp +concepts (and associated metacharacter notations) that will allow a +regexp to represent not just a single character sequence, but a \fIwhole +class\fR of them. +.PP +One such concept is that of a \fIcharacter class\fR. A character class +allows a set of possible characters, rather than just a single +character, to match at a particular point in a regexp. You can define +your own custom character classes. These +are denoted by brackets \f(CW\*(C`[...]\*(C'\fR, with the set of characters +to be possibly matched inside. Here are some examples: +.PP +.Vb 4 +\& /cat/; # matches \*(Aqcat\*(Aq +\& /[bcr]at/; # matches \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, or \*(Aqrat\*(Aq +\& /item[0123456789]/; # matches \*(Aqitem0\*(Aq or ... 
or \*(Aqitem9\*(Aq +\& "abc" =~ /[cab]/; # matches \*(Aqa\*(Aq +.Ve +.PP +In the last statement, even though \f(CW\*(Aqc\*(Aq\fR is the first character in +the class, \f(CW\*(Aqa\*(Aq\fR matches because the first character position in the +string is the earliest point at which the regexp can match. +.PP +.Vb 2 +\& /[yY][eE][sS]/; # match \*(Aqyes\*(Aq in a case\-insensitive way +\& # \*(Aqyes\*(Aq, \*(AqYes\*(Aq, \*(AqYES\*(Aq, etc. +.Ve +.PP +This regexp displays a common task: perform a case-insensitive +match. Perl provides a way of avoiding all those brackets by simply +appending an \f(CW\*(Aqi\*(Aq\fR to the end of the match. Then \f(CW\*(C`/[yY][eE][sS]/;\*(C'\fR +can be rewritten as \f(CW\*(C`/yes/i;\*(C'\fR. The \f(CW\*(Aqi\*(Aq\fR stands for +case-insensitive and is an example of a \fImodifier\fR of the matching +operation. We will meet other modifiers later in the tutorial. +.PP +We saw in the section above that there were ordinary characters, which +represented themselves, and special characters, which needed a +backslash \f(CW\*(Aq\e\*(Aq\fR to represent themselves. The same is true in a +character class, but the sets of ordinary and special characters +inside a character class are different than those outside a character +class. The special characters for a character class are \f(CW\*(C`\-]\e^$\*(C'\fR (and +the pattern delimiter, whatever it is). +\&\f(CW\*(Aq]\*(Aq\fR is special because it denotes the end of a character class. \f(CW\*(Aq$\*(Aq\fR is +special because it denotes a scalar variable. \f(CW\*(Aq\e\*(Aq\fR is special because +it is used in escape sequences, just like above. 
Here is how the +special characters \f(CW\*(C`]$\e\*(C'\fR are handled: +.PP +.Vb 5 +\& /[\e]c]def/; # matches \*(Aq]def\*(Aq or \*(Aqcdef\*(Aq +\& $x = \*(Aqbcr\*(Aq; +\& /[$x]at/; # matches \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, or \*(Aqrat\*(Aq +\& /[\e$x]at/; # matches \*(Aq$at\*(Aq or \*(Aqxat\*(Aq +\& /[\e\e$x]at/; # matches \*(Aq\eat\*(Aq, \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, or \*(Aqrat\*(Aq +.Ve +.PP +The last two are a little tricky. In \f(CW\*(C`[\e$x]\*(C'\fR, the backslash protects +the dollar sign, so the character class has two members \f(CW\*(Aq$\*(Aq\fR and \f(CW\*(Aqx\*(Aq\fR. +In \f(CW\*(C`[\e\e$x]\*(C'\fR, the backslash is protected, so \f(CW$x\fR is treated as a +variable and substituted in double quote fashion. +.PP +The special character \f(CW\*(Aq\-\*(Aq\fR acts as a range operator within character +classes, so that a contiguous set of characters can be written as a +range. With ranges, the unwieldy \f(CW\*(C`[0123456789]\*(C'\fR and \f(CW\*(C`[abc...xyz]\*(C'\fR +become the svelte \f(CW\*(C`[0\-9]\*(C'\fR and \f(CW\*(C`[a\-z]\*(C'\fR. Some examples are +.PP +.Vb 6 +\& /item[0\-9]/; # matches \*(Aqitem0\*(Aq or ... or \*(Aqitem9\*(Aq +\& /[0\-9bx\-z]aa/; # matches \*(Aq0aa\*(Aq, ..., \*(Aq9aa\*(Aq, +\& # \*(Aqbaa\*(Aq, \*(Aqxaa\*(Aq, \*(Aqyaa\*(Aq, or \*(Aqzaa\*(Aq +\& /[0\-9a\-fA\-F]/; # matches a hexadecimal digit +\& /[0\-9a\-zA\-Z_]/; # matches a "word" character, +\& # like those in a Perl variable name +.Ve +.PP +If \f(CW\*(Aq\-\*(Aq\fR is the first or last character in a character class, it is +treated as an ordinary character; \f(CW\*(C`[\-ab]\*(C'\fR, \f(CW\*(C`[ab\-]\*(C'\fR and \f(CW\*(C`[a\e\-b]\*(C'\fR are +all equivalent. +.PP +The special character \f(CW\*(Aq^\*(Aq\fR in the first position of a character class +denotes a \fInegated character class\fR, which matches any character but +those in the brackets. Both \f(CW\*(C`[...]\*(C'\fR and \f(CW\*(C`[^...]\*(C'\fR must match a +character, or the match fails. 
Then +.PP +.Vb 4 +\& /[^a]at/; # doesn\*(Aqt match \*(Aqaat\*(Aq or \*(Aqat\*(Aq, but matches +\& # all other \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, \*(Aq0at\*(Aq, \*(Aq%at\*(Aq, etc. +\& /[^0\-9]/; # matches a non\-numeric character +\& /[a^]at/; # matches \*(Aqaat\*(Aq or \*(Aq^at\*(Aq; here \*(Aq^\*(Aq is ordinary +.Ve +.PP +Now, even \f(CW\*(C`[0\-9]\*(C'\fR can be a bother to write multiple times, so in the +interest of saving keystrokes and making regexps more readable, Perl +has several abbreviations for common character classes, as shown below. +Since the introduction of Unicode, unless the \f(CW\*(C`/a\*(C'\fR modifier is in +effect, these character classes match more than just a few characters in +the ASCII range. +.IP \(bu 4 +\&\f(CW\*(C`\ed\*(C'\fR matches a digit, not just \f(CW\*(C`[0\-9]\*(C'\fR but also digits from non-roman scripts +.IP \(bu 4 +\&\f(CW\*(C`\es\*(C'\fR matches a whitespace character, the set \f(CW\*(C`[\e \et\er\en\ef]\*(C'\fR and others +.IP \(bu 4 +\&\f(CW\*(C`\ew\*(C'\fR matches a word character (alphanumeric or \f(CW\*(Aq_\*(Aq\fR), not just \f(CW\*(C`[0\-9a\-zA\-Z_]\*(C'\fR +but also digits and characters from non-roman scripts +.IP \(bu 4 +\&\f(CW\*(C`\eD\*(C'\fR is a negated \f(CW\*(C`\ed\*(C'\fR; it represents any character other than a digit, or \f(CW\*(C`[^\ed]\*(C'\fR +.IP \(bu 4 +\&\f(CW\*(C`\eS\*(C'\fR is a negated \f(CW\*(C`\es\*(C'\fR; it represents any non-whitespace character \f(CW\*(C`[^\es]\*(C'\fR +.IP \(bu 4 +\&\f(CW\*(C`\eW\*(C'\fR is a negated \f(CW\*(C`\ew\*(C'\fR; it represents any non-word character \f(CW\*(C`[^\ew]\*(C'\fR +.IP \(bu 4 +The period \f(CW\*(Aq.\*(Aq\fR matches any character but \f(CW"\en"\fR (unless the modifier \f(CW\*(C`/s\*(C'\fR is +in effect, as explained below). +.IP \(bu 4 +\&\f(CW\*(C`\eN\*(C'\fR, like the period, matches any character but \f(CW"\en"\fR, but it does so +regardless of whether the modifier \f(CW\*(C`/s\*(C'\fR is in effect. 
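The abbreviations listed above can be checked with a short script. What follows is only a sketch, not part of the man page: the sample strings and variable names are invented for illustration, and it assumes Perl 5.14 or later so that the /a modifier is available. It compares \ed with and without /a on a non-ASCII digit, and compares '.' with \eN on a lone newline.

```perl
use v5.14;      # /a needs Perl 5.14 or later; this also enables strict
use warnings;

# Hypothetical sample data, not taken from the tutorial.
my $ascii_digit  = "7";
my $arabic_digit = "\x{0663}";    # ARABIC-INDIC DIGIT THREE, a non-ASCII digit

# Without /a, \d follows Unicode rules and accepts both digits.
my $plain_ascii  = ($ascii_digit  =~ /^\d$/) ? 1 : 0;
my $plain_arabic = ($arabic_digit =~ /^\d$/) ? 1 : 0;

# With /a, \d is restricted to the ASCII digits [0-9].
my $a_ascii  = ($ascii_digit  =~ /^\d$/a) ? 1 : 0;
my $a_arabic = ($arabic_digit =~ /^\d$/a) ? 1 : 0;

# '.' rejects "\n" unless /s is in effect; \N rejects "\n" either way.
my $dot_default = ("\n" =~ /^.$/)   ? 1 : 0;
my $dot_s       = ("\n" =~ /^.$/s)  ? 1 : 0;
my $N_s         = ("\n" =~ /^\N$/s) ? 1 : 0;

print "\\d  plain: ascii=$plain_ascii non-ascii=$plain_arabic\n";
print "\\d  /a   : ascii=$a_ascii non-ascii=$a_arabic\n";
print ".   on newline: default=$dot_default /s=$dot_s\n";
print "\\N  on newline with /s: $N_s\n";
```

The results are stored as 1 or 0 rather than printed as raw match values so that each line of output is easy to compare against the list.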
+.PP +The \f(CW\*(C`/a\*(C'\fR modifier, available starting in Perl 5.14, is used to +restrict the matches of \f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\es\*(C'\fR, and \f(CW\*(C`\ew\*(C'\fR to just those in the ASCII range. +It is useful to keep your program from being needlessly exposed to full +Unicode (and its accompanying security considerations) when all you want +is to process English-like text. (The "a" may be doubled, \f(CW\*(C`/aa\*(C'\fR, to +provide even more restrictions, preventing case-insensitive matching of +ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign" +would caselessly match a "k" or "K".) +.PP +The \f(CW\*(C`\ed\es\ew\eD\eS\eW\*(C'\fR abbreviations can be used both inside and outside +of bracketed character classes. Here are some in use: +.PP +.Vb 7 +\& /\ed\ed:\ed\ed:\ed\ed/; # matches a hh:mm:ss time format +\& /[\ed\es]/; # matches any digit or whitespace character +\& /\ew\eW\ew/; # matches a word char, followed by a +\& # non\-word char, followed by a word char +\& /..rt/; # matches any two chars, followed by \*(Aqrt\*(Aq +\& /end\e./; # matches \*(Aqend.\*(Aq +\& /end[.]/; # same thing, matches \*(Aqend.\*(Aq +.Ve +.PP +Because a period is a metacharacter, it needs to be escaped to match +as an ordinary period. Because, for example, \f(CW\*(C`\ed\*(C'\fR and \f(CW\*(C`\ew\*(C'\fR are sets +of characters, it is incorrect to think of \f(CW\*(C`[^\ed\ew]\*(C'\fR as \f(CW\*(C`[\eD\eW]\*(C'\fR; in +fact \f(CW\*(C`[^\ed\ew]\*(C'\fR is the same as \f(CW\*(C`[^\ew]\*(C'\fR, which is the same as +\&\f(CW\*(C`[\eW]\*(C'\fR. Think De Morgan's laws. +.PP +In actuality, the period and \f(CW\*(C`\ed\es\ew\eD\eS\eW\*(C'\fR abbreviations are +themselves types of character classes, so the ones surrounded by +brackets are just one type of character class. When we need to make a +distinction, we refer to them as "bracketed character classes." +.PP +An anchor useful in basic regexps is the \fIword anchor\fR +\&\f(CW\*(C`\eb\*(C'\fR. 
This matches a boundary between a word character and a non-word +character \f(CW\*(C`\ew\eW\*(C'\fR or \f(CW\*(C`\eW\ew\*(C'\fR: +.PP +.Vb 5 +\& $x = "Housecat catenates house and cat"; +\& $x =~ /cat/; # matches cat in \*(Aqhousecat\*(Aq +\& $x =~ /\ebcat/; # matches cat in \*(Aqcatenates\*(Aq +\& $x =~ /cat\eb/; # matches cat in \*(Aqhousecat\*(Aq +\& $x =~ /\ebcat\eb/; # matches \*(Aqcat\*(Aq at end of string +.Ve +.PP +Note in the last example, the end of the string is considered a word +boundary. +.PP +For natural language processing (so that, for example, apostrophes are +included in words), use instead \f(CW\*(C`\eb{wb}\*(C'\fR +.PP +.Vb 1 +\& "don\*(Aqt" =~ / .+? \eb{wb} /x; # matches the whole string +.Ve +.PP +You might wonder why \f(CW\*(Aq.\*(Aq\fR matches everything but \f(CW"\en"\fR \- why not +every character? The reason is that often one is matching against +lines and would like to ignore the newline characters. For instance, +while the string \f(CW"\en"\fR represents one line, we would like to think +of it as empty. Then +.PP +.Vb 2 +\& "" =~ /^$/; # matches +\& "\en" =~ /^$/; # matches, $ anchors before "\en" +\& +\& "" =~ /./; # doesn\*(Aqt match; it needs a char +\& "" =~ /^.$/; # doesn\*(Aqt match; it needs a char +\& "\en" =~ /^.$/; # doesn\*(Aqt match; it needs a char other than "\en" +\& "a" =~ /^.$/; # matches +\& "a\en" =~ /^.$/; # matches, $ anchors before "\en" +.Ve +.PP +This behavior is convenient, because we usually want to ignore +newlines when we count and match characters in a line. Sometimes, +however, we want to keep track of newlines. We might even want \f(CW\*(Aq^\*(Aq\fR +and \f(CW\*(Aq$\*(Aq\fR to anchor at the beginning and end of lines within the +string, rather than just the beginning and end of the string. Perl +allows us to choose between ignoring and paying attention to newlines +by using the \f(CW\*(C`/s\*(C'\fR and \f(CW\*(C`/m\*(C'\fR modifiers. 
\f(CW\*(C`/s\*(C'\fR and \f(CW\*(C`/m\*(C'\fR stand for +single line and multi-line and they determine whether a string is to +be treated as one continuous string, or as a set of lines. The two +modifiers affect two aspects of how the regexp is interpreted: 1) how +the \f(CW\*(Aq.\*(Aq\fR character class is defined, and 2) where the anchors \f(CW\*(Aq^\*(Aq\fR +and \f(CW\*(Aq$\*(Aq\fR are able to match. Here are the four possible combinations: +.IP \(bu 4 +no modifiers: Default behavior. \f(CW\*(Aq.\*(Aq\fR matches any character +except \f(CW"\en"\fR. \f(CW\*(Aq^\*(Aq\fR matches only at the beginning of the string and +\&\f(CW\*(Aq$\*(Aq\fR matches only at the end or before a newline at the end. +.IP \(bu 4 +s modifier (\f(CW\*(C`/s\*(C'\fR): Treat string as a single long line. \f(CW\*(Aq.\*(Aq\fR matches +any character, even \f(CW"\en"\fR. \f(CW\*(Aq^\*(Aq\fR matches only at the beginning of +the string and \f(CW\*(Aq$\*(Aq\fR matches only at the end or before a newline at the +end. +.IP \(bu 4 +m modifier (\f(CW\*(C`/m\*(C'\fR): Treat string as a set of multiple lines. \f(CW\*(Aq.\*(Aq\fR +matches any character except \f(CW"\en"\fR. \f(CW\*(Aq^\*(Aq\fR and \f(CW\*(Aq$\*(Aq\fR are able to match +at the start or end of \fIany\fR line within the string. +.IP \(bu 4 +both s and m modifiers (\f(CW\*(C`/sm\*(C'\fR): Treat string as a single long line, but +detect multiple lines. \f(CW\*(Aq.\*(Aq\fR matches any character, even +\&\f(CW"\en"\fR. \f(CW\*(Aq^\*(Aq\fR and \f(CW\*(Aq$\*(Aq\fR, however, are able to match at the start or end +of \fIany\fR line within the string. 
+.PP +Here are examples of \f(CW\*(C`/s\*(C'\fR and \f(CW\*(C`/m\*(C'\fR in action: +.PP +.Vb 1 +\& $x = "There once was a girl\enWho programmed in Perl\en"; +\& +\& $x =~ /^Who/; # doesn\*(Aqt match, "Who" not at start of string +\& $x =~ /^Who/s; # doesn\*(Aqt match, "Who" not at start of string +\& $x =~ /^Who/m; # matches, "Who" at start of second line +\& $x =~ /^Who/sm; # matches, "Who" at start of second line +\& +\& $x =~ /girl.Who/; # doesn\*(Aqt match, "." doesn\*(Aqt match "\en" +\& $x =~ /girl.Who/s; # matches, "." matches "\en" +\& $x =~ /girl.Who/m; # doesn\*(Aqt match, "." doesn\*(Aqt match "\en" +\& $x =~ /girl.Who/sm; # matches, "." matches "\en" +.Ve +.PP +Most of the time, the default behavior is what is wanted, but \f(CW\*(C`/s\*(C'\fR and +\&\f(CW\*(C`/m\*(C'\fR are occasionally very useful. If \f(CW\*(C`/m\*(C'\fR is being used, the start +of the string can still be matched with \f(CW\*(C`\eA\*(C'\fR and the end of the string +can still be matched with the anchors \f(CW\*(C`\eZ\*(C'\fR (matches both the end and +the newline before, like \f(CW\*(Aq$\*(Aq\fR), and \f(CW\*(C`\ez\*(C'\fR (matches only the end): +.PP +.Vb 2 +\& $x =~ /^Who/m; # matches, "Who" at start of second line +\& $x =~ /\eAWho/m; # doesn\*(Aqt match, "Who" is not at start of string +\& +\& $x =~ /girl$/m; # matches, "girl" at end of first line +\& $x =~ /girl\eZ/m; # doesn\*(Aqt match, "girl" is not at end of string +\& +\& $x =~ /Perl\eZ/m; # matches, "Perl" is at newline before end +\& $x =~ /Perl\ez/m; # doesn\*(Aqt match, "Perl" is not at end of string +.Ve +.PP +We now know how to create choices among classes of characters in a +regexp. What about choices among words or character strings? Such +choices are described in the next section. +.SS "Matching this or that" +.IX Subsection "Matching this or that" +Sometimes we would like our regexp to be able to match different +possible words or character strings. 
This is accomplished by using +the \fIalternation\fR metacharacter \f(CW\*(Aq|\*(Aq\fR. To match \f(CW\*(C`dog\*(C'\fR or \f(CW\*(C`cat\*(C'\fR, we +form the regexp \f(CW\*(C`dog|cat\*(C'\fR. As before, Perl will try to match the +regexp at the earliest possible point in the string. At each +character position, Perl will first try to match the first +alternative, \f(CW\*(C`dog\*(C'\fR. If \f(CW\*(C`dog\*(C'\fR doesn't match, Perl will then try the +next alternative, \f(CW\*(C`cat\*(C'\fR. If \f(CW\*(C`cat\*(C'\fR doesn't match either, then the +match fails and Perl moves to the next position in the string. Some +examples: +.PP +.Vb 2 +\& "cats and dogs" =~ /cat|dog|bird/; # matches "cat" +\& "cats and dogs" =~ /dog|cat|bird/; # matches "cat" +.Ve +.PP +Even though \f(CW\*(C`dog\*(C'\fR is the first alternative in the second regexp, +\&\f(CW\*(C`cat\*(C'\fR is able to match earlier in the string. +.PP +.Vb 2 +\& "cats" =~ /c|ca|cat|cats/; # matches "c" +\& "cats" =~ /cats|cat|ca|c/; # matches "cats" +.Ve +.PP +Here, all the alternatives match at the first string position, so the +first alternative is the one that matches. If some of the +alternatives are truncations of the others, put the longest ones first +to give them a chance to match. +.PP +.Vb 2 +\& "cab" =~ /a|b|c/ # matches "c" +\& # /a|b|c/ == /[abc]/ +.Ve +.PP +The last example points out that character classes are like +alternations of characters. At a given character position, the first +alternative that allows the regexp match to succeed will be the one +that matches. +.SS "Grouping things and hierarchical matching" +.IX Subsection "Grouping things and hierarchical matching" +Alternation allows a regexp to choose among alternatives, but by +itself it is unsatisfying. The reason is that each alternative is a whole +regexp, but sometimes we want alternatives for just part of a +regexp. For instance, suppose we want to search for housecats or +housekeepers. 
The regexp \f(CW\*(C`housecat|housekeeper\*(C'\fR fits the bill, but is +inefficient because we had to type \f(CW\*(C`house\*(C'\fR twice. It would be nice to +have parts of the regexp be constant, like \f(CW\*(C`house\*(C'\fR, and some +parts have alternatives, like \f(CW\*(C`cat|keeper\*(C'\fR. +.PP +The \fIgrouping\fR metacharacters \f(CW\*(C`()\*(C'\fR solve this problem. Grouping +allows parts of a regexp to be treated as a single unit. Parts of a +regexp are grouped by enclosing them in parentheses. Thus we could solve +the \f(CW\*(C`housecat|housekeeper\*(C'\fR problem by forming the regexp as +\&\f(CWhouse(cat|keeper)\fR. The regexp \f(CWhouse(cat|keeper)\fR means match +\&\f(CW\*(C`house\*(C'\fR followed by either \f(CW\*(C`cat\*(C'\fR or \f(CW\*(C`keeper\*(C'\fR. Some more examples +are +.PP +.Vb 4 +\& /(a|b)b/; # matches \*(Aqab\*(Aq or \*(Aqbb\*(Aq +\& /(ac|b)b/; # matches \*(Aqacb\*(Aq or \*(Aqbb\*(Aq +\& /(^a|b)c/; # matches \*(Aqac\*(Aq at start of string or \*(Aqbc\*(Aq anywhere +\& /(a|[bc])d/; # matches \*(Aqad\*(Aq, \*(Aqbd\*(Aq, or \*(Aqcd\*(Aq +\& +\& /house(cat|)/; # matches either \*(Aqhousecat\*(Aq or \*(Aqhouse\*(Aq +\& /house(cat(s|)|)/; # matches either \*(Aqhousecats\*(Aq or \*(Aqhousecat\*(Aq or +\& # \*(Aqhouse\*(Aq. Note groups can be nested. +\& +\& /(19|20|)\ed\ed/; # match years 19xx, 20xx, or the Y2K problem, xx +\& "20" =~ /(19|20|)\ed\ed/; # matches the null alternative \*(Aq()\ed\ed\*(Aq, +\& # because \*(Aq20\ed\ed\*(Aq can\*(Aqt match +.Ve +.PP +Alternations behave the same way in groups as out of them: at a given +string position, the leftmost alternative that allows the regexp to +match is taken. So in the last example at the first string position, +\&\f(CW"20"\fR matches the second alternative, but there is nothing left over +to match the next two digits \f(CW\*(C`\ed\ed\*(C'\fR. So Perl moves on to the next +alternative, which is the null alternative and that works, since +\&\f(CW"20"\fR is two digits.
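One way to watch the null alternative being chosen is to print what the group captured. The following is a minimal sketch that reuses the year regexp from the examples above; the helper name century_prefix and the sample strings are assumptions made for illustration only.

```perl
use strict;
use warnings;

# century_prefix() is a hypothetical helper, not part of the tutorial.
# It returns what the first group captured, or undef if the regexp fails.
sub century_prefix {
    my ($s) = @_;
    return $s =~ /(19|20|)\d\d/ ? $1 : undef;
}

# Sample inputs chosen for illustration.
for my $s ("1999", "2024", "20", "87") {
    # An empty capture means the null alternative was the one that matched.
    printf "%-4s => '%s'\n", $s, century_prefix($s);
}
```

For the two four-digit strings the group should capture the century, while for the two-digit strings the capture should be the empty string, because only the null alternative leaves two digits for \d\d.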
+.PP +The process of trying one alternative, seeing if it matches, and +moving on to the next alternative, while going back in the string +from where the previous alternative was tried, if it doesn't, is called +\&\fIbacktracking\fR. The term "backtracking" comes from the idea that +matching a regexp is like a walk in the woods. Successfully matching +a regexp is like arriving at a destination. There are many possible +trailheads, one for each string position, and each one is tried in +order, left to right. From each trailhead there may be many paths, +some of which get you there, and some which are dead ends. When you +walk along a trail and hit a dead end, you have to backtrack along the +trail to an earlier point to try another trail. If you hit your +destination, you stop immediately and forget about trying all the +other trails. You are persistent, and only if you have tried all the +trails from all the trailheads and not arrived at your destination, do +you declare failure. To be concrete, here is a step-by-step analysis +of what Perl does when it tries to match the regexp +.PP +.Vb 1 +\& "abcde" =~ /(abd|abc)(df|d|de)/; +.Ve +.IP 1. 4 +Start with the first letter in the string \f(CW\*(Aqa\*(Aq\fR. +.IP 2. 4 +Try the first alternative in the first group \f(CW\*(Aqabd\*(Aq\fR. +.IP 3. 4 +Match \f(CW\*(Aqa\*(Aq\fR followed by \f(CW\*(Aqb\*(Aq\fR. So far so good. +.IP 4. 4 +\&\f(CW\*(Aqd\*(Aq\fR in the regexp doesn't match \f(CW\*(Aqc\*(Aq\fR in the string \- a +dead end. So backtrack two characters and pick the second alternative +in the first group \f(CW\*(Aqabc\*(Aq\fR. +.IP 5. 4 +Match \f(CW\*(Aqa\*(Aq\fR followed by \f(CW\*(Aqb\*(Aq\fR followed by \f(CW\*(Aqc\*(Aq\fR. We are on a roll +and have satisfied the first group. Set \f(CW$1\fR to \f(CW\*(Aqabc\*(Aq\fR. +.IP 6. 4 +Move on to the second group and pick the first alternative \f(CW\*(Aqdf\*(Aq\fR. +.IP 7. 4 +Match the \f(CW\*(Aqd\*(Aq\fR. +.IP 8. 
4 +\&\f(CW\*(Aqf\*(Aq\fR in the regexp doesn't match \f(CW\*(Aqe\*(Aq\fR in the string, so a dead +end. Backtrack one character and pick the second alternative in the +second group \f(CW\*(Aqd\*(Aq\fR. +.IP 9. 4 +\&\f(CW\*(Aqd\*(Aq\fR matches. The second grouping is satisfied, so set +\&\f(CW$2\fR to \f(CW\*(Aqd\*(Aq\fR. +.IP 10. 4 +We are at the end of the regexp, so we are done! We have +matched \f(CW\*(Aqabcd\*(Aq\fR out of the string \f(CW"abcde"\fR. +.PP +There are a couple of things to note about this analysis. First, the +third alternative in the second group \f(CW\*(Aqde\*(Aq\fR also allows a match, but we +stopped before we got to it \- at a given character position, leftmost +wins. Second, we were able to get a match at the first character +position of the string \f(CW\*(Aqa\*(Aq\fR. If there were no matches at the first +position, Perl would move to the second character position \f(CW\*(Aqb\*(Aq\fR and +attempt the match all over again. Only when all possible paths at all +possible character positions have been exhausted does Perl give +up and declare \f(CW\*(C`$string\ =~\ /(abd|abc)(df|d|de)/;\*(C'\fR to be false. +.PP +Even with all this work, regexp matching happens remarkably fast. To +speed things up, Perl compiles the regexp into a compact sequence of +opcodes that can often fit inside a processor cache. When the code is +executed, these opcodes can then run at full throttle and search very +quickly. +.SS "Extracting matches" +.IX Subsection "Extracting matches" +The grouping metacharacters \f(CW\*(C`()\*(C'\fR also serve another completely +different function: they allow the extraction of the parts of a string +that matched. This is very useful to find out what matched and for +text processing in general. For each grouping, the part that matched +inside goes into the special variables \f(CW$1\fR, \f(CW$2\fR, \fIetc\fR. 
They can be +used just as ordinary variables: +.PP +.Vb 6 +\& # extract hours, minutes, seconds +\& if ($time =~ /(\ed\ed):(\ed\ed):(\ed\ed)/) { # match hh:mm:ss format +\& $hours = $1; +\& $minutes = $2; +\& $seconds = $3; +\& } +.Ve +.PP +Now, we know that in scalar context, +\&\f(CW\*(C`$time\ =~\ /(\ed\ed):(\ed\ed):(\ed\ed)/\*(C'\fR returns a true or false +value. In list context, however, it returns the list of matched values +\&\f(CW\*(C`($1,$2,$3)\*(C'\fR. So we could write the code more compactly as +.PP +.Vb 2 +\& # extract hours, minutes, seconds +\& ($hours, $minutes, $seconds) = ($time =~ /(\ed\ed):(\ed\ed):(\ed\ed)/); +.Ve +.PP +If the groupings in a regexp are nested, \f(CW$1\fR gets the group with the +leftmost opening parenthesis, \f(CW$2\fR the next opening parenthesis, +\&\fIetc\fR. Here is a regexp with nested groups: +.PP +.Vb 2 +\& /(ab(cd|ef)((gi)|j))/; +\& 1 2 34 +.Ve +.PP +If this regexp matches, \f(CW$1\fR contains a string starting with +\&\f(CW\*(Aqab\*(Aq\fR, \f(CW$2\fR is either set to \f(CW\*(Aqcd\*(Aq\fR or \f(CW\*(Aqef\*(Aq\fR, \f(CW$3\fR equals either +\&\f(CW\*(Aqgi\*(Aq\fR or \f(CW\*(Aqj\*(Aq\fR, and \f(CW$4\fR is either set to \f(CW\*(Aqgi\*(Aq\fR, just like \f(CW$3\fR, +or it remains undefined. +.PP +For convenience, Perl sets \f(CW$+\fR to the string held by the highest numbered +\&\f(CW$1\fR, \f(CW$2\fR,... that got assigned (and, somewhat related, \f(CW$^N\fR to the +value of the \f(CW$1\fR, \f(CW$2\fR,... most-recently assigned; \fIi.e.\fR the \f(CW$1\fR, +\&\f(CW$2\fR,... associated with the rightmost closing parenthesis used in the +match). +.SS Backreferences +.IX Subsection "Backreferences" +Closely associated with the matching variables \f(CW$1\fR, \f(CW$2\fR, ... are +the \fIbackreferences\fR \f(CW\*(C`\eg1\*(C'\fR, \f(CW\*(C`\eg2\*(C'\fR,... Backreferences are simply +matching variables that can be used \fIinside\fR a regexp.
This is a +really nice feature; what matches later in a regexp is made to depend on +what matched earlier in the regexp. Suppose we wanted to look +for doubled words in a text, like "the the". The following regexp finds +all 3\-letter doubles with a space in between: +.PP +.Vb 1 +\& /\eb(\ew\ew\ew)\es\eg1\eb/; +.Ve +.PP +The grouping assigns a value to \f(CW\*(C`\eg1\*(C'\fR, so that the same 3\-letter sequence +is used for both parts. +.PP +A similar task is to find words consisting of two identical parts: +.PP +.Vb 7 +\& % simple_grep \*(Aq^(\ew\ew\ew\ew|\ew\ew\ew|\ew\ew|\ew)\eg1$\*(Aq /usr/dict/words +\& beriberi +\& booboo +\& coco +\& mama +\& murmur +\& papa +.Ve +.PP +The regexp has a single grouping which considers 4\-letter +combinations, then 3\-letter combinations, \fIetc\fR., and uses \f(CW\*(C`\eg1\*(C'\fR to look for +a repeat. Although \f(CW$1\fR and \f(CW\*(C`\eg1\*(C'\fR represent the same thing, care should be +taken to use matched variables \f(CW$1\fR, \f(CW$2\fR,... only \fIoutside\fR a regexp +and backreferences \f(CW\*(C`\eg1\*(C'\fR, \f(CW\*(C`\eg2\*(C'\fR,... only \fIinside\fR a regexp; not doing +so may lead to surprising and unsatisfactory results. +.SS "Relative backreferences" +.IX Subsection "Relative backreferences" +Counting the opening parentheses to get the correct number for a +backreference is error-prone as soon as there is more than one +capturing group. A more convenient technique became available +with Perl 5.10: relative backreferences. To refer to the immediately +preceding capture group one now may write \f(CW\*(C`\eg\-1\*(C'\fR or \f(CW\*(C`\eg{\-1}\*(C'\fR, the next but +last is available via \f(CW\*(C`\eg\-2\*(C'\fR or \f(CW\*(C`\eg{\-2}\*(C'\fR, and so on. 
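As a minimal sketch of relative backreferences, the code below looks for a four-character mirrored sequence such as "abba". The helper name has_mirror and the test strings are assumptions introduced for illustration; only the \g{\-N} syntax comes from the text above.

```perl
use strict;
use warnings;

# has_mirror() is a hypothetical helper: it returns 1 if the string
# contains two word characters followed by the same two in reverse order.
sub has_mirror {
    my ($s) = @_;
    # \g{-1} refers to the group opened most recently before it (the
    # second (\w)); \g{-2} refers to the one before that (the first (\w)).
    return $s =~ /(\w)(\w)\g{-1}\g{-2}/ ? 1 : 0;
}

print has_mirror("abba"), "\n";    # mirrored, so this should print 1
print has_mirror("abab"), "\n";    # not mirrored, so this should print 0
```

Because the references are counted backwards from where they appear, the pattern keeps working if further capture groups are added in front of it.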
+.PP +Another good reason in addition to readability and maintainability +for using relative backreferences is illustrated by the following example, +where a simple pattern for matching peculiar strings is used: +.PP +.Vb 1 +\& $a99a = \*(Aq([a\-z])(\ed)\eg2\eg1\*(Aq; # matches a11a, g22g, x33x, etc. +.Ve +.PP +Now that we have this pattern stored as a handy string, we might feel +tempted to use it as a part of some other pattern: +.PP +.Vb 6 +\& $line = "code=e99e"; +\& if ($line =~ /^(\ew+)=$a99a$/){ # unexpected behavior! +\& print "$1 is valid\en"; +\& } else { +\& print "bad line: \*(Aq$line\*(Aq\en"; +\& } +.Ve +.PP +But this doesn't match, at least not the way one might expect. Only +after inserting the interpolated \f(CW$a99a\fR and looking at the resulting +full text of the regexp is it obvious that the backreferences have +backfired. The subexpression \f(CW\*(C`(\ew+)\*(C'\fR has snatched number 1 and +demoted the groups in \f(CW$a99a\fR by one rank. This can be avoided by +using relative backreferences: +.PP +.Vb 1 +\& $a99a = \*(Aq([a\-z])(\ed)\eg{\-1}\eg{\-2}\*(Aq; # safe for being interpolated +.Ve +.SS "Named backreferences" +.IX Subsection "Named backreferences" +Perl 5.10 also introduced named capture groups and named backreferences. +To attach a name to a capturing group, you write either +\&\f(CW\*(C`(?<name>...)\*(C'\fR or \f(CW\*(C`(?\*(Aqname\*(Aq...)\*(C'\fR. The backreference may +then be written as \f(CW\*(C`\eg{name}\*(C'\fR. It is permissible to attach the +same name to more than one group, but then only the leftmost one of the +eponymous set can be referenced. Outside of the pattern a named +capture group is accessible through the \f(CW\*(C`%+\*(C'\fR hash. 
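To illustrate both halves of this, the sketch below uses a named group, a \g{name} backreference inside the pattern, and the %+ hash outside it. The helper name doubled_word, the group name "word" and the sample sentence are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# doubled_word() is a hypothetical helper that returns the first
# doubled word in a string, or undef if there is none.
sub doubled_word {
    my ($text) = @_;
    # (?<word>...) names the group; \g{word} must repeat what it matched.
    if ($text =~ /\b(?<word>\w+)\s+\g{word}\b/) {
        return $+{word};    # the same value is also available as $1
    }
    return undef;
}

print doubled_word("Paris in the the spring"), "\n";    # prints "the"
```

Naming the group means the pattern stays readable and does not depend on how many other groups surround it.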
+.PP +Assuming that we have to match calendar dates which may be given in one +of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write +three suitable patterns where we use \f(CW\*(Aqd\*(Aq\fR, \f(CW\*(Aqm\*(Aq\fR and \f(CW\*(Aqy\*(Aq\fR respectively as the +names of the groups capturing the pertaining components of a date. The +matching operation combines the three patterns as alternatives: +.PP +.Vb 8 +\& $fmt1 = \*(Aq(?<y>\ed\ed\ed\ed)\-(?<m>\ed\ed)\-(?<d>\ed\ed)\*(Aq; +\& $fmt2 = \*(Aq(?<m>\ed\ed)/(?<d>\ed\ed)/(?<y>\ed\ed\ed\ed)\*(Aq; +\& $fmt3 = \*(Aq(?<d>\ed\ed)\e.(?<m>\ed\ed)\e.(?<y>\ed\ed\ed\ed)\*(Aq; +\& for my $d (qw(2006\-10\-21 15.01.2007 10/31/2005)) { +\& if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){ +\& print "day=$+{d} month=$+{m} year=$+{y}\en"; +\& } +\& } +.Ve +.PP +If any of the alternatives matches, the hash \f(CW\*(C`%+\*(C'\fR is bound to contain the +three key-value pairs. +.SS "Alternative capture group numbering" +.IX Subsection "Alternative capture group numbering" +Yet another capturing group numbering technique (also available since Perl 5.10) +deals with the problem of referring to groups within a set of alternatives. +Consider a pattern for matching a time of the day, civil or military style: +.PP +.Vb 3 +\& if ( $time =~ /(\ed\ed|\ed):(\ed\ed)|(\ed\ed)(\ed\ed)/ ){ +\& # process hour and minute +\& } +.Ve +.PP +Processing the results requires an additional if statement to determine +whether \f(CW$1\fR and \f(CW$2\fR or \f(CW$3\fR and \f(CW$4\fR contain the goodies. It would +be easier if we could use group numbers 1 and 2 in the second alternative as +well, and this is exactly what the parenthesized construct \f(CW\*(C`(?|...)\*(C'\fR, +set around an alternative achieves.
Here is an extended version of the +previous pattern: +.PP +.Vb 3 +\& if($time =~ /(?|(\ed\ed|\ed):(\ed\ed)|(\ed\ed)(\ed\ed))\es+([A\-Z][A\-Z][A\-Z])/){ +\& print "hour=$1 minute=$2 zone=$3\en"; +\& } +.Ve +.PP +Within the alternative numbering group, group numbers start at the same +position for each alternative. After the group, numbering continues +with one higher than the maximum reached across all the alternatives. +.SS "Position information" +.IX Subsection "Position information" +In addition to what was matched, Perl also provides the +positions of what was matched as contents of the \f(CW\*(C`@\-\*(C'\fR and \f(CW\*(C`@+\*(C'\fR +arrays. \f(CW\*(C`$\-[0]\*(C'\fR is the position of the start of the entire match and +\&\f(CW$+[0]\fR is the position of the end. Similarly, \f(CW\*(C`$\-[n]\*(C'\fR is the +position of the start of the \f(CW$n\fR match and \f(CW$+[n]\fR is the position +of the end. If \f(CW$n\fR is undefined, so are \f(CW\*(C`$\-[n]\*(C'\fR and \f(CW$+[n]\fR. Then +this code +.PP +.Vb 6 +\& $x = "Mmm...donut, thought Homer"; +\& $x =~ /^(Mmm|Yech)\e.\e.\e.(donut|peas)/; # matches +\& foreach $exp (1..$#\-) { +\& no strict \*(Aqrefs\*(Aq; +\& print "Match $exp: \*(Aq$$exp\*(Aq at position ($\-[$exp],$+[$exp])\en"; +\& } +.Ve +.PP +prints +.PP +.Vb 2 +\& Match 1: \*(AqMmm\*(Aq at position (0,3) +\& Match 2: \*(Aqdonut\*(Aq at position (6,11) +.Ve +.PP +Even if there are no groupings in a regexp, it is still possible to +find out what exactly matched in a string. If you use them, Perl +will set \f(CW\*(C`$\`\*(C'\fR to the part of the string before the match, will set \f(CW$&\fR +to the part of the string that matched, and will set \f(CW\*(C`$\*(Aq\*(C'\fR to the part +of the string after the match. 
An example: +.PP +.Vb 3 +\& $x = "the cat caught the mouse"; +\& $x =~ /cat/; # $\` = \*(Aqthe \*(Aq, $& = \*(Aqcat\*(Aq, $\*(Aq = \*(Aq caught the mouse\*(Aq +\& $x =~ /the/; # $\` = \*(Aq\*(Aq, $& = \*(Aqthe\*(Aq, $\*(Aq = \*(Aq cat caught the mouse\*(Aq +.Ve +.PP +In the second match, \f(CW\*(C`$\`\*(C'\fR equals \f(CW\*(Aq\*(Aq\fR because the regexp matched at the +first character position in the string and stopped; it never saw the +second "the". +.PP +If your code is to run on Perl versions earlier than +5.20, it is worthwhile to note that using \f(CW\*(C`$\`\*(C'\fR and \f(CW\*(C`$\*(Aq\*(C'\fR +slows down regexp matching quite a bit, while \f(CW$&\fR slows it down to a +lesser extent, because if they are used in one regexp in a program, +they are generated for \fIall\fR regexps in the program. So if raw +performance is a goal of your application, they should be avoided. +If you need to extract the corresponding substrings, use \f(CW\*(C`@\-\*(C'\fR and +\&\f(CW\*(C`@+\*(C'\fR instead: +.PP +.Vb 3 +\& $\` is the same as substr( $x, 0, $\-[0] ) +\& $& is the same as substr( $x, $\-[0], $+[0]\-$\-[0] ) +\& $\*(Aq is the same as substr( $x, $+[0] ) +.Ve +.PP +As of Perl 5.10, the \f(CW\*(C`${^PREMATCH}\*(C'\fR, \f(CW\*(C`${^MATCH}\*(C'\fR and \f(CW\*(C`${^POSTMATCH}\*(C'\fR +variables may be used. These are only set if the \f(CW\*(C`/p\*(C'\fR modifier is +present. Consequently they do not penalize the rest of the program. In +Perl 5.20, \f(CW\*(C`${^PREMATCH}\*(C'\fR, \f(CW\*(C`${^MATCH}\*(C'\fR and \f(CW\*(C`${^POSTMATCH}\*(C'\fR are available +whether the \f(CW\*(C`/p\*(C'\fR has been used or not (the modifier is ignored), and +\&\f(CW\*(C`$\`\*(C'\fR, \f(CW\*(C`$\*(Aq\*(C'\fR and \f(CW$&\fR do not cause any speed difference. +.SS "Non-capturing groupings" +.IX Subsection "Non-capturing groupings" +A group that is required to bundle a set of alternatives may or may not be +useful as a capturing group. 
If it isn't, it just creates a superfluous +addition to the set of available capture group values, inside as well as +outside the regexp. Non-capturing groupings, denoted by \f(CW\*(C`(?:regexp)\*(C'\fR, +still allow the regexp to be treated as a single unit, but don't establish +a capturing group at the same time. Both capturing and non-capturing +groupings are allowed to co-exist in the same regexp. Because there is +no extraction, non-capturing groupings are faster than capturing +groupings. Non-capturing groupings are also handy for choosing exactly +which parts of a regexp are to be extracted to matching variables: +.PP +.Vb 2 +\& # match a number, $1\-$4 are set, but we only want $1 +\& /([+\-]?\e *(\ed+(\e.\ed*)?|\e.\ed+)([eE][+\-]?\ed+)?)/; +\& +\& # match a number faster, only $1 is set +\& /([+\-]?\e *(?:\ed+(?:\e.\ed*)?|\e.\ed+)(?:[eE][+\-]?\ed+)?)/; +\& +\& # match a number, get $1 = whole number, $2 = exponent +\& /([+\-]?\e *(?:\ed+(?:\e.\ed*)?|\e.\ed+)(?:[eE]([+\-]?\ed+))?)/; +.Ve +.PP +Non-capturing groupings are also useful for removing nuisance +elements gathered from a split operation where parentheses are +required for some reason: +.PP +.Vb 3 +\& $x = \*(Aq12aba34ba5\*(Aq; +\& @num = split /(a|b)+/, $x; # @num = (\*(Aq12\*(Aq,\*(Aqa\*(Aq,\*(Aq34\*(Aq,\*(Aqa\*(Aq,\*(Aq5\*(Aq) +\& @num = split /(?:a|b)+/, $x; # @num = (\*(Aq12\*(Aq,\*(Aq34\*(Aq,\*(Aq5\*(Aq) +.Ve +.PP +In Perl 5.22 and later, all groups within a regexp can be set to +non-capturing by using the new \f(CW\*(C`/n\*(C'\fR flag: +.PP +.Vb 1 +\& "hello" =~ /(hi|hello)/n; # $1 is not set! +.Ve +.PP +See "n" in perlre for more information. +.SS "Matching repetitions" +.IX Subsection "Matching repetitions" +The examples in the previous section display an annoying weakness. We +were only matching 3\-letter words, or chunks of words of 4 letters or +less.
We'd like to be able to match words or, more generally, strings +of any length, without writing out tedious alternatives like +\&\f(CW\*(C`\ew\ew\ew\ew|\ew\ew\ew|\ew\ew|\ew\*(C'\fR. +.PP +This is exactly the problem the \fIquantifier\fR metacharacters \f(CW\*(Aq?\*(Aq\fR, +\&\f(CW\*(Aq*\*(Aq\fR, \f(CW\*(Aq+\*(Aq\fR, and \f(CW\*(C`{}\*(C'\fR were created for. They allow us to delimit the +number of repeats for a portion of a regexp we consider to be a +match. Quantifiers are put immediately after the character, character +class, or grouping that we want to specify. They have the following +meanings: +.IP \(bu 4 +\&\f(CW\*(C`a?\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 1 or 0 times +.IP \(bu 4 +\&\f(CW\*(C`a*\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 0 or more times, \fIi.e.\fR, any number of times +.IP \(bu 4 +\&\f(CW\*(C`a+\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 1 or more times, \fIi.e.\fR, at least once +.IP \(bu 4 +\&\f(CW\*(C`a{n,m}\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, but not more than \f(CW\*(C`m\*(C'\fR +times. +.IP \(bu 4 +\&\f(CW\*(C`a{n,}\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR or more times +.IP \(bu 4 +\&\f(CW\*(C`a{,n}\*(C'\fR means: match at most \f(CW\*(C`n\*(C'\fR times, or fewer +.IP \(bu 4 +\&\f(CW\*(C`a{n}\*(C'\fR means: match exactly \f(CW\*(C`n\*(C'\fR times +.PP +If you like, you can add blanks (tab or space characters) within the +braces, but adjacent to them, and/or next to the comma (if any). +.PP +Here are some examples: +.PP +.Vb 10 +\& /[a\-z]+\es+\ed*/; # match a lowercase word, at least one space, and +\& # any number of digits +\& /(\ew+)\es+\eg1/; # match doubled words of arbitrary length +\& /y(es)?/i; # matches \*(Aqy\*(Aq, \*(AqY\*(Aq, or a case\-insensitive \*(Aqyes\*(Aq +\& $year =~ /^\ed{2,4}$/; # make sure year is at least 2 but not more +\& # than 4 digits +\& $year =~ /^\ed{ 2, 4 }$/; # Same; for those who like wide open +\& # spaces. +\& $year =~ /^\ed{2, 4}$/; # Same. 
+\& $year =~ /^\ed{4}$|^\ed{2}$/; # better match; throw out 3\-digit dates +\& $year =~ /^\ed{2}(\ed{2})?$/; # same thing written differently. +\& # However, this captures the last two +\& # digits in $1 and the other does not. +\& +\& % simple_grep \*(Aq^(\ew+)\eg1$\*(Aq /usr/dict/words # isn\*(Aqt this easier? +\& beriberi +\& booboo +\& coco +\& mama +\& murmur +\& papa +.Ve +.PP +For all of these quantifiers, Perl will try to match as much of the +string as possible, while still allowing the regexp to succeed. Thus +with \f(CW\*(C`/a?.../\*(C'\fR, Perl will first try to match the regexp with the \f(CW\*(Aqa\*(Aq\fR +present; if that fails, Perl will try to match the regexp without the +\&\f(CW\*(Aqa\*(Aq\fR present. For the quantifier \f(CW\*(Aq*\*(Aq\fR, we get the following: +.PP +.Vb 5 +\& $x = "the cat in the hat"; +\& $x =~ /^(.*)(cat)(.*)$/; # matches, +\& # $1 = \*(Aqthe \*(Aq +\& # $2 = \*(Aqcat\*(Aq +\& # $3 = \*(Aq in the hat\*(Aq +.Ve +.PP +Which is what we might expect, the match finds the only \f(CW\*(C`cat\*(C'\fR in the +string and locks onto it. Consider, however, this regexp: +.PP +.Vb 4 +\& $x =~ /^(.*)(at)(.*)$/; # matches, +\& # $1 = \*(Aqthe cat in the h\*(Aq +\& # $2 = \*(Aqat\*(Aq +\& # $3 = \*(Aq\*(Aq (0 characters match) +.Ve +.PP +One might initially guess that Perl would find the \f(CW\*(C`at\*(C'\fR in \f(CW\*(C`cat\*(C'\fR and +stop there, but that wouldn't give the longest possible string to the +first quantifier \f(CW\*(C`.*\*(C'\fR. Instead, the first quantifier \f(CW\*(C`.*\*(C'\fR grabs as +much of the string as possible while still having the regexp match. In +this example, that means having the \f(CW\*(C`at\*(C'\fR sequence with the final \f(CW\*(C`at\*(C'\fR +in the string. 
The other important principle illustrated here is that, +when there are two or more elements in a regexp, the \fIleftmost\fR +quantifier, if there is one, gets to grab as much of the string as +possible, leaving the rest of the regexp to fight over scraps. Thus in +our example, the first quantifier \f(CW\*(C`.*\*(C'\fR grabs most of the string, while +the second quantifier \f(CW\*(C`.*\*(C'\fR gets the empty string. Quantifiers that +grab as much of the string as possible are called \fImaximal match\fR or +\&\fIgreedy\fR quantifiers. +.PP +When a regexp can match a string in several different ways, we can use +the principles above to predict which way the regexp will match: +.IP \(bu 4 +Principle 0: Taken as a whole, any regexp will be matched at the +earliest possible position in the string. +.IP \(bu 4 +Principle 1: In an alternation \f(CW\*(C`a|b|c...\*(C'\fR, the leftmost alternative +that allows a match for the whole regexp will be the one used. +.IP \(bu 4 +Principle 2: The maximal matching quantifiers \f(CW\*(Aq?\*(Aq\fR, \f(CW\*(Aq*\*(Aq\fR, \f(CW\*(Aq+\*(Aq\fR and +\&\f(CW\*(C`{n,m}\*(C'\fR will in general match as much of the string as possible while +still allowing the whole regexp to match. +.IP \(bu 4 +Principle 3: If there are two or more elements in a regexp, the +leftmost greedy quantifier, if any, will match as much of the string +as possible while still allowing the whole regexp to match. The next +leftmost greedy quantifier, if any, will try to match as much of the +string remaining available to it as possible, while still allowing the +whole regexp to match. And so on, until all the regexp elements are +satisfied. +.PP +As we have seen above, Principle 0 overrides the others. The regexp +will be matched as early as possible, with the other principles +determining how the regexp matches at that earliest character +position. 
+.PP +Here is an example of these principles in action: +.PP +.Vb 5 +\& $x = "The programming republic of Perl"; +\& $x =~ /^(.+)(e|r)(.*)$/; # matches, +\& # $1 = \*(AqThe programming republic of Pe\*(Aq +\& # $2 = \*(Aqr\*(Aq +\& # $3 = \*(Aql\*(Aq +.Ve +.PP +This regexp matches at the earliest string position, \f(CW\*(AqT\*(Aq\fR. One +might think that \f(CW\*(Aqe\*(Aq\fR, being leftmost in the alternation, would be +matched, but \f(CW\*(Aqr\*(Aq\fR produces the longest string in the first quantifier. +.PP +.Vb 3 +\& $x =~ /(m{1,2})(.*)$/; # matches, +\& # $1 = \*(Aqmm\*(Aq +\& # $2 = \*(Aqing republic of Perl\*(Aq +.Ve +.PP +Here, the earliest possible match is at the first \f(CW\*(Aqm\*(Aq\fR in +\&\f(CW\*(C`programming\*(C'\fR. \f(CW\*(C`m{1,2}\*(C'\fR is the first quantifier, so it gets to match +a maximal \f(CW\*(C`mm\*(C'\fR. +.PP +.Vb 3 +\& $x =~ /.*(m{1,2})(.*)$/; # matches, +\& # $1 = \*(Aqm\*(Aq +\& # $2 = \*(Aqing republic of Perl\*(Aq +.Ve +.PP +Here, the regexp matches at the start of the string. The first +quantifier \f(CW\*(C`.*\*(C'\fR grabs as much as possible, leaving just a single +\&\f(CW\*(Aqm\*(Aq\fR for the second quantifier \f(CW\*(C`m{1,2}\*(C'\fR. +.PP +.Vb 4 +\& $x =~ /(.?)(m{1,2})(.*)$/; # matches, +\& # $1 = \*(Aqa\*(Aq +\& # $2 = \*(Aqmm\*(Aq +\& # $3 = \*(Aqing republic of Perl\*(Aq +.Ve +.PP +Here, \f(CW\*(C`.?\*(C'\fR eats its maximal one character at the earliest possible +position in the string, \f(CW\*(Aqa\*(Aq\fR in \f(CW\*(C`programming\*(C'\fR, leaving \f(CW\*(C`m{1,2}\*(C'\fR +the opportunity to match both \f(CW\*(Aqm\*(Aq\fR's. Finally, +.PP +.Vb 1 +\& "aXXXb" =~ /(X*)/; # matches with $1 = \*(Aq\*(Aq +.Ve +.PP +because it can match zero copies of \f(CW\*(AqX\*(Aq\fR at the beginning of the +string. If you definitely want to match at least one \f(CW\*(AqX\*(Aq\fR, use +\&\f(CW\*(C`X+\*(C'\fR, not \f(CW\*(C`X*\*(C'\fR. +.PP +Sometimes greed is not good.
At times, we would like quantifiers to +match a \fIminimal\fR piece of string, rather than a maximal piece. For +this purpose, Larry Wall created the \fIminimal match\fR or +\&\fInon-greedy\fR quantifiers \f(CW\*(C`??\*(C'\fR, \f(CW\*(C`*?\*(C'\fR, \f(CW\*(C`+?\*(C'\fR, and \f(CW\*(C`{}?\*(C'\fR. These are +the usual quantifiers with a \f(CW\*(Aq?\*(Aq\fR appended to them. They have the +following meanings: +.IP \(bu 4 +\&\f(CW\*(C`a??\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 0 or 1 times. Try 0 first, then 1. +.IP \(bu 4 +\&\f(CW\*(C`a*?\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 0 or more times, \fIi.e.\fR, any number of times, +but as few times as possible +.IP \(bu 4 +\&\f(CW\*(C`a+?\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 1 or more times, \fIi.e.\fR, at least once, but +as few times as possible +.IP \(bu 4 +\&\f(CW\*(C`a{n,m}?\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, not more than \f(CW\*(C`m\*(C'\fR +times, as few times as possible +.IP \(bu 4 +\&\f(CW\*(C`a{n,}?\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, but as few times as +possible +.IP \(bu 4 +\&\f(CW\*(C`a{,n}?\*(C'\fR means: match at most \f(CW\*(C`n\*(C'\fR times, but as few times as +possible +.IP \(bu 4 +\&\f(CW\*(C`a{n}?\*(C'\fR means: match exactly \f(CW\*(C`n\*(C'\fR times. Because we match exactly +\&\f(CW\*(C`n\*(C'\fR times, \f(CW\*(C`a{n}?\*(C'\fR is equivalent to \f(CW\*(C`a{n}\*(C'\fR and is just there for +notational consistency. +.PP +Let's look at the example above, but with minimal quantifiers: +.PP +.Vb 5 +\& $x = "The programming republic of Perl"; +\& $x =~ /^(.+?)(e|r)(.*)$/; # matches, +\& # $1 = \*(AqTh\*(Aq +\& # $2 = \*(Aqe\*(Aq +\& # $3 = \*(Aq programming republic of Perl\*(Aq +.Ve +.PP +The minimal string that will allow both the start of the string \f(CW\*(Aq^\*(Aq\fR +and the alternation to match is \f(CW\*(C`Th\*(C'\fR, with the alternation \f(CW\*(C`e|r\*(C'\fR +matching \f(CW\*(Aqe\*(Aq\fR. 
The second quantifier \f(CW\*(C`.*\*(C'\fR is free to gobble up the +rest of the string. +.PP +.Vb 3 +\& $x =~ /(m{1,2}?)(.*?)$/; # matches, +\& # $1 = \*(Aqm\*(Aq +\& # $2 = \*(Aqming republic of Perl\*(Aq +.Ve +.PP +The first string position that this regexp can match is at the first +\&\f(CW\*(Aqm\*(Aq\fR in \f(CW\*(C`programming\*(C'\fR. At this position, the minimal \f(CW\*(C`m{1,2}?\*(C'\fR +matches just one \f(CW\*(Aqm\*(Aq\fR. Although the second quantifier \f(CW\*(C`.*?\*(C'\fR would +prefer to match no characters, it is constrained by the end-of-string +anchor \f(CW\*(Aq$\*(Aq\fR to match the rest of the string. +.PP +.Vb 4 +\& $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches, +\& # $1 = \*(AqThe progra\*(Aq +\& # $2 = \*(Aqm\*(Aq +\& # $3 = \*(Aqming republic of Perl\*(Aq +.Ve +.PP +In this regexp, you might expect the first minimal quantifier \f(CW\*(C`.*?\*(C'\fR +to match the empty string, because it is not constrained by a \f(CW\*(Aq^\*(Aq\fR +anchor to match the beginning of the word. Principle 0 applies here, +however. Because it is possible for the whole regexp to match at the +start of the string, it \fIwill\fR match at the start of the string. Thus +the first quantifier has to match everything up to the first \f(CW\*(Aqm\*(Aq\fR. The +second minimal quantifier matches just one \f(CW\*(Aqm\*(Aq\fR and the third +quantifier matches the rest of the string. +.PP +.Vb 4 +\& $x =~ /(.??)(m{1,2})(.*)$/; # matches, +\& # $1 = \*(Aqa\*(Aq +\& # $2 = \*(Aqmm\*(Aq +\& # $3 = \*(Aqing republic of Perl\*(Aq +.Ve +.PP +Just as in the previous regexp, the first quantifier \f(CW\*(C`.??\*(C'\fR can match +earliest at position \f(CW\*(Aqa\*(Aq\fR, so it does. The second quantifier is +greedy, so it matches \f(CW\*(C`mm\*(C'\fR, and the third matches the rest of the +string. 
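A common place where the greedy versus non-greedy distinction shows up is in pulling out delimited text. The sketch below is an illustration only: the sample string and the variable names are assumptions, and angle brackets are used merely as convenient delimiters, not as a recommendation for parsing markup with regexps.

```perl
use strict;
use warnings;

# Sample text, assumed for illustration.
my $text = '<b>bold</b> and <i>italic</i>';

# Both regexps start matching at the first '<' (the earliest position).
my ($greedy) = $text =~ /<(.+)>/;     # .+ extends to the last '>'
my ($lazy)   = $text =~ /<(.+?)>/;    # .+? stops at the first '>'

print "greedy: $greedy\n";    # greedy: b>bold</b> and <i>italic</i
print "lazy:   $lazy\n";      # lazy:   b
```

In both cases the match begins at the same position; the quantifier only decides how far to the right the match extends.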
+.PP +We can modify principle 3 above to take into account non-greedy +quantifiers: +.IP \(bu 4 +Principle 3: If there are two or more elements in a regexp, the +leftmost greedy (non-greedy) quantifier, if any, will match as much +(little) of the string as possible while still allowing the whole +regexp to match. The next leftmost greedy (non-greedy) quantifier, if +any, will try to match as much (little) of the string remaining +available to it as possible, while still allowing the whole regexp to +match. And so on, until all the regexp elements are satisfied. +.PP +Just like alternation, quantifiers are also susceptible to +backtracking. Here is a step-by-step analysis of the example +.PP +.Vb 5 +\& $x = "the cat in the hat"; +\& $x =~ /^(.*)(at)(.*)$/; # matches, +\& # $1 = \*(Aqthe cat in the h\*(Aq +\& # $2 = \*(Aqat\*(Aq +\& # $3 = \*(Aq\*(Aq (0 matches) +.Ve +.IP 1. 4 +Start with the first letter in the string \f(CW\*(Aqt\*(Aq\fR. +.IP 2. 4 +The first quantifier \f(CW\*(Aq.*\*(Aq\fR starts out by matching the whole +string \f(CW"the cat in the hat"\fR. +.IP 3. 4 +\&\f(CW\*(Aqa\*(Aq\fR in the regexp element \f(CW\*(Aqat\*(Aq\fR doesn't match the end +of the string. Backtrack one character. +.IP 4. 4 +\&\f(CW\*(Aqa\*(Aq\fR in the regexp element \f(CW\*(Aqat\*(Aq\fR still doesn't match +the last letter of the string \f(CW\*(Aqt\*(Aq\fR, so backtrack one more character. +.IP 5. 4 +Now we can match the \f(CW\*(Aqa\*(Aq\fR and the \f(CW\*(Aqt\*(Aq\fR. +.IP 6. 4 +Move on to the third element \f(CW\*(Aq.*\*(Aq\fR. Since we are at the +end of the string and \f(CW\*(Aq.*\*(Aq\fR can match 0 times, assign it the empty +string. +.IP 7. 4 +We are done! +.PP +Most of the time, all this moving forward and backtracking happens +quickly and searching is fast. There are some pathological regexps, +however, whose execution time exponentially grows with the size of the +string. 
A typical structure that blows up in your face is of the form +.PP +.Vb 1 +\& /(a|b+)*/; +.Ve +.PP +The problem is the nested indeterminate quantifiers. There are many +different ways of partitioning a string of length n between the \f(CW\*(Aq+\*(Aq\fR +and \f(CW\*(Aq*\*(Aq\fR: one repetition with \f(CW\*(C`b+\*(C'\fR of length n, two repetitions with +the first \f(CW\*(C`b+\*(C'\fR length k and the second with length n\-k, m repetitions +whose bits add up to length n, \fIetc\fR. In fact there are an exponential +number of ways to partition a string as a function of its length. A +regexp may get lucky and match early in the process, but if there is +no match, Perl will try \fIevery\fR possibility before giving up. So be +careful with nested \f(CW\*(Aq*\*(Aq\fR's, \f(CW\*(C`{n,m}\*(C'\fR's, and \f(CW\*(Aq+\*(Aq\fR's. The book +\&\fIMastering Regular Expressions\fR by Jeffrey Friedl gives a wonderful +discussion of this and other efficiency issues. +.SS "Possessive quantifiers" +.IX Subsection "Possessive quantifiers" +Backtracking during the relentless search for a match may be a waste +of time, particularly when the match is bound to fail. Consider +the simple pattern +.PP +.Vb 1 +\& /^\ew+\es+\ew+$/; # a word, spaces, a word +.Ve +.PP +Whenever this is applied to a string which doesn't quite meet the +pattern's expectations such as \f(CW"abc\ \ "\fR or \f(CW"abc\ \ def\ "\fR, +the regexp engine will backtrack, approximately once for each character +in the string. But we know that there is no way around taking \fIall\fR +of the initial word characters to match the first repetition, that \fIall\fR +spaces must be eaten by the middle part, and the same goes for the second +word. +.PP +With the introduction of the \fIpossessive quantifiers\fR in Perl 5.10, we +have a way of instructing the regexp engine not to backtrack, with the +usual quantifiers with a \f(CW\*(Aq+\*(Aq\fR appended to them. 
This makes them greedy as +well as stingy; once they succeed they won't give anything back to permit +another solution. They have the following meanings: +.IP \(bu 4 +\&\f(CW\*(C`a{n,m}+\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, not more than \f(CW\*(C`m\*(C'\fR times, +as many times as possible, and don't give anything up. \f(CW\*(C`a?+\*(C'\fR is short +for \f(CW\*(C`a{0,1}+\*(C'\fR. +.IP \(bu 4 +\&\f(CW\*(C`a{n,}+\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, but as many times as possible, +and don't give anything up. \f(CW\*(C`a++\*(C'\fR is short for \f(CW\*(C`a{1,}+\*(C'\fR and +\&\f(CW\*(C`a*+\*(C'\fR is short for \f(CW\*(C`a{0,}+\*(C'\fR. +.IP \(bu 4 +\&\f(CW\*(C`a{,n}+\*(C'\fR means: match as many times as possible up to at most \f(CW\*(C`n\*(C'\fR +times, and don't give anything up. +.IP \(bu 4 +\&\f(CW\*(C`a{n}+\*(C'\fR means: match exactly \f(CW\*(C`n\*(C'\fR times. It is just there for +notational consistency. +.PP +These possessive quantifiers represent a special case of a more general +concept, the \fIindependent subexpression\fR, see below. +.PP +As an example where a possessive quantifier is suitable we consider +matching a quoted string, as it appears in several programming languages. +The backslash is used as an escape character that indicates that the +next character is to be taken literally, as another character for the +string. Therefore, after the opening quote, we expect a (possibly +empty) sequence of alternatives: either some character except an +unescaped quote or backslash or an escaped character. +.PP +.Vb 1 +\& /"(?:[^"\e\e]++|\e\e.)*+"/; +.Ve +.SS "Building a regexp" +.IX Subsection "Building a regexp" +At this point, we have all the basic regexp concepts covered, so let's +give a more involved example of a regular expression. We will build a +regexp that matches numbers. +.PP +The first task in building a regexp is to decide what we want to match +and what we want to exclude. 
In our case, we want to match both +integers and floating point numbers and we want to reject any string +that isn't a number. +.PP +The next task is to break the problem down into smaller problems that +are easily converted into a regexp. +.PP +The simplest case is integers. These consist of a sequence of digits, +with an optional sign in front. The digits we can represent with +\&\f(CW\*(C`\ed+\*(C'\fR and the sign can be matched with \f(CW\*(C`[+\-]\*(C'\fR. Thus the integer +regexp is +.PP +.Vb 1 +\& /[+\-]?\ed+/; # matches integers +.Ve +.PP +A floating point number potentially has a sign, an integral part, a +decimal point, a fractional part, and an exponent. One or more of these +parts is optional, so we need to check out the different +possibilities. Floating point numbers which are in proper form include +123., 0.345, .34, \-1e6, and 25.4E\-72. As with integers, the sign out +front is completely optional and can be matched by \f(CW\*(C`[+\-]?\*(C'\fR. We can +see that if there is no exponent, floating point numbers must have a +decimal point, otherwise they are integers. We might be tempted to +model these with \f(CW\*(C`\ed*\e.\ed*\*(C'\fR, but this would also match just a single +decimal point, which is not a number. So the three cases of floating +point number without exponent are +.PP +.Vb 3 +\& /[+\-]?\ed+\e./; # 1., 321., etc. +\& /[+\-]?\e.\ed+/; # .1, .234, etc. +\& /[+\-]?\ed+\e.\ed+/; # 1.0, 30.56, etc. +.Ve +.PP +These can be combined into a single regexp with a three-way alternation: +.PP +.Vb 1 +\& /[+\-]?(\ed+\e.\ed+|\ed+\e.|\e.\ed+)/; # floating point, no exponent +.Ve +.PP +In this alternation, it is important to put \f(CW\*(Aq\ed+\e.\ed+\*(Aq\fR before +\&\f(CW\*(Aq\ed+\e.\*(Aq\fR. If \f(CW\*(Aq\ed+\e.\*(Aq\fR were first, the regexp would happily match that +and ignore the fractional part of the number. +.PP +Now consider floating point numbers with exponents. 
The key +observation here is that \fIboth\fR integers and numbers with decimal +points are allowed in front of an exponent. Then exponents, like the +overall sign, are independent of whether we are matching numbers with +or without decimal points, and can be "decoupled" from the +mantissa. The overall form of the regexp now becomes clear: +.PP +.Vb 1 +\& /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/; +.Ve +.PP +The exponent is an \f(CW\*(Aqe\*(Aq\fR or \f(CW\*(AqE\*(Aq\fR, followed by an integer. So the +exponent regexp is +.PP +.Vb 1 +\& /[eE][+\-]?\ed+/; # exponent +.Ve +.PP +Putting all the parts together, we get a regexp that matches numbers: +.PP +.Vb 1 +\& /^[+\-]?(\ed+\e.\ed+|\ed+\e.|\e.\ed+|\ed+)([eE][+\-]?\ed+)?$/; # Ta da! +.Ve +.PP +Long regexps like this may impress your friends, but can be hard to +decipher. In complex situations like this, the \f(CW\*(C`/x\*(C'\fR modifier for a +match is invaluable. It allows one to put nearly arbitrary whitespace +and comments into a regexp without affecting its meaning. Using it, +we can rewrite our "extended" regexp in the more pleasing form +.PP +.Vb 10 +\& /^ +\& [+\-]? # first, match an optional sign +\& ( # then match integers or f.p. mantissas: +\& \ed+\e.\ed+ # mantissa of the form a.b +\& |\ed+\e. # mantissa of the form a. +\& |\e.\ed+ # mantissa of the form .b +\& |\ed+ # integer of the form a +\& ) +\& ( [eE] [+\-]? \ed+ )? # finally, optionally match an exponent +\& $/x; +.Ve +.PP +If whitespace is mostly irrelevant, how does one include space +characters in an extended regexp? The answer is to backslash it +\&\f(CW\*(Aq\e\ \*(Aq\fR or put it in a character class \f(CW\*(C`[\ ]\*(C'\fR. The same thing +goes for pound signs: use \f(CW\*(C`\e#\*(C'\fR or \f(CW\*(C`[#]\*(C'\fR. 
For instance, Perl allows +a space between the sign and the mantissa or integer, and we could add +this to our regexp as follows: +.PP +.Vb 10 +\& /^ +\& [+\-]?\e * # first, match an optional sign *and space* +\& ( # then match integers or f.p. mantissas: +\& \ed+\e.\ed+ # mantissa of the form a.b +\& |\ed+\e. # mantissa of the form a. +\& |\e.\ed+ # mantissa of the form .b +\& |\ed+ # integer of the form a +\& ) +\& ( [eE] [+\-]? \ed+ )? # finally, optionally match an exponent +\& $/x; +.Ve +.PP +In this form, it is easier to see a way to simplify the +alternation. Alternatives 1, 2, and 4 all start with \f(CW\*(C`\ed+\*(C'\fR, so it +could be factored out: +.PP +.Vb 11 +\& /^ +\& [+\-]?\e * # first, match an optional sign +\& ( # then match integers or f.p. mantissas: +\& \ed+ # start out with a ... +\& ( +\& \e.\ed* # mantissa of the form a.b or a. +\& )? # ? takes care of integers of the form a +\& |\e.\ed+ # mantissa of the form .b +\& ) +\& ( [eE] [+\-]? \ed+ )? # finally, optionally match an exponent +\& $/x; +.Ve +.PP +Starting in Perl v5.26, specifying \f(CW\*(C`/xx\*(C'\fR changes the square-bracketed +portions of a pattern to ignore tabs and space characters unless they +are escaped by preceding them with a backslash. So, we could write +.PP +.Vb 11 +\& /^ +\& [ + \- ]?\e * # first, match an optional sign +\& ( # then match integers or f.p. mantissas: +\& \ed+ # start out with a ... +\& ( +\& \e.\ed* # mantissa of the form a.b or a. +\& )? # ? takes care of integers of the form a +\& |\e.\ed+ # mantissa of the form .b +\& ) +\& ( [ e E ] [ + \- ]? \ed+ )? # finally, optionally match an exponent +\& $/xx; +.Ve +.PP +This doesn't really improve the legibility of this example, but it's +available in case you want it. Squashing the pattern down to the +compact form, we have +.PP +.Vb 1 +\& /^[+\-]?\e *(\ed+(\e.\ed*)?|\e.\ed+)([eE][+\-]?\ed+)?$/; +.Ve +.PP +This is our final regexp. 
To recap, we built a regexp by +.IP \(bu 4 +specifying the task in detail, +.IP \(bu 4 +breaking down the problem into smaller parts, +.IP \(bu 4 +translating the small parts into regexps, +.IP \(bu 4 +combining the regexps, +.IP \(bu 4 +and optimizing the final combined regexp. +.PP +These are also the typical steps involved in writing a computer +program. This makes perfect sense, because regular expressions are +essentially programs written in a little computer language that specifies +patterns. +.SS "Using regular expressions in Perl" +.IX Subsection "Using regular expressions in Perl" +The last topic of Part 1 briefly covers how regexps are used in Perl +programs. Where do they fit into Perl syntax? +.PP +We have already introduced the matching operator in its default +\&\f(CW\*(C`/regexp/\*(C'\fR and arbitrary delimiter \f(CW\*(C`m!regexp!\*(C'\fR forms. We have used +the binding operator \f(CW\*(C`=~\*(C'\fR and its negation \f(CW\*(C`!~\*(C'\fR to test for string +matches. Associated with the matching operator, we have discussed the +single line \f(CW\*(C`/s\*(C'\fR, multi-line \f(CW\*(C`/m\*(C'\fR, case-insensitive \f(CW\*(C`/i\*(C'\fR and +extended \f(CW\*(C`/x\*(C'\fR modifiers. There are a few more things you might +want to know about matching operators. +.PP +\fIProhibiting substitution\fR +.IX Subsection "Prohibiting substitution" +.PP +Variables in a regexp are normally substituted (interpolated) each time +the matching operator is evaluated. If you don't want any substitutions at all, use the +special delimiter \f(CW\*(C`m\*(Aq\*(Aq\*(C'\fR: +.PP +.Vb 4 +\& @pattern = (\*(AqSeuss\*(Aq); +\& while (<>) { +\& print if m\*(Aq@pattern\*(Aq; # matches literal \*(Aq@pattern\*(Aq, not \*(AqSeuss\*(Aq +\& } +.Ve +.PP +Similar to strings, \f(CW\*(C`m\*(Aq\*(Aq\*(C'\fR acts like apostrophes on a regexp; all other +\&\f(CW\*(Aqm\*(Aq\fR delimiters act like quotes. If the regexp evaluates to the empty string, +the regexp in the \fIlast successful match\fR is used instead. 
So we have +.PP +.Vb 2 +\& "dog" =~ /d/; # \*(Aqd\*(Aq matches +\& "dogbert" =~ //; # this matches the \*(Aqd\*(Aq regexp used before +.Ve +.PP +\fIGlobal matching\fR +.IX Subsection "Global matching" +.PP +The final two modifiers we will discuss here, +\&\f(CW\*(C`/g\*(C'\fR and \f(CW\*(C`/c\*(C'\fR, concern multiple matches. +The modifier \f(CW\*(C`/g\*(C'\fR stands for global matching and allows the +matching operator to match within a string as many times as possible. +In scalar context, successive invocations against a string will have +\&\f(CW\*(C`/g\*(C'\fR jump from match to match, keeping track of position in the +string as it goes along. You can get or set the position with the +\&\f(CWpos()\fR function. +.PP +The use of \f(CW\*(C`/g\*(C'\fR is shown in the following example. Suppose we have +a string that consists of words separated by spaces. If we know how +many words there are in advance, we could extract the words using +groupings: +.PP +.Vb 5 +\& $x = "cat dog house"; # 3 words +\& $x =~ /^\es*(\ew+)\es+(\ew+)\es+(\ew+)\es*$/; # matches, +\& # $1 = \*(Aqcat\*(Aq +\& # $2 = \*(Aqdog\*(Aq +\& # $3 = \*(Aqhouse\*(Aq +.Ve +.PP +But what if we had an indeterminate number of words? This is the sort +of task \f(CW\*(C`/g\*(C'\fR was made for. To extract all words, form the simple +regexp \f(CW\*(C`(\ew+)\*(C'\fR and loop over all matches with \f(CW\*(C`/(\ew+)/g\*(C'\fR: +.PP +.Vb 3 +\& while ($x =~ /(\ew+)/g) { +\& print "Word is $1, ends at position ", pos $x, "\en"; +\& } +.Ve +.PP +prints +.PP +.Vb 3 +\& Word is cat, ends at position 3 +\& Word is dog, ends at position 7 +\& Word is house, ends at position 13 +.Ve +.PP +A failed match or changing the target string resets the position. If +you don't want the position reset after failure to match, add the +\&\f(CW\*(C`/c\*(C'\fR, as in \f(CW\*(C`/regexp/gc\*(C'\fR. The current position in the string is +associated with the string, not the regexp. 
This means that different +strings have different positions and their respective positions can be +set or read independently. +.PP +In list context, \f(CW\*(C`/g\*(C'\fR returns a list of matched groupings, or if +there are no groupings, a list of matches to the whole regexp. So if +we wanted just the words, we could use +.PP +.Vb 4 +\& @words = ($x =~ /(\ew+)/g); # matches, +\& # $words[0] = \*(Aqcat\*(Aq +\& # $words[1] = \*(Aqdog\*(Aq +\& # $words[2] = \*(Aqhouse\*(Aq +.Ve +.PP +Closely associated with the \f(CW\*(C`/g\*(C'\fR modifier is the \f(CW\*(C`\eG\*(C'\fR anchor. The +\&\f(CW\*(C`\eG\*(C'\fR anchor matches at the point where the previous \f(CW\*(C`/g\*(C'\fR match left +off. \f(CW\*(C`\eG\*(C'\fR allows us to easily do context-sensitive matching: +.PP +.Vb 12 +\& $metric = 1; # use metric units +\& ... +\& $x = <FILE>; # read in measurement +\& $x =~ /^([+\-]?\ed+)\es*/g; # get magnitude +\& $weight = $1; +\& if ($metric) { # error checking +\& print "Units error!" unless $x =~ /\eGkg\e./g; +\& } +\& else { +\& print "Units error!" unless $x =~ /\eGlbs\e./g; +\& } +\& $x =~ /\eG\es+(widget|sprocket)/g; # continue processing +.Ve +.PP +The combination of \f(CW\*(C`/g\*(C'\fR and \f(CW\*(C`\eG\*(C'\fR allows us to process the string a +bit at a time and use arbitrary Perl logic to decide what to do next. +Currently, the \f(CW\*(C`\eG\*(C'\fR anchor is only fully supported when used to anchor +to the start of the pattern. +.PP +\&\f(CW\*(C`\eG\*(C'\fR is also invaluable in processing fixed-length records with +regexps. Suppose we have a snippet of coding region DNA, encoded as +base pair letters \f(CW\*(C`ATCGTTGAAT...\*(C'\fR and we want to find all the stop +codons \f(CW\*(C`TGA\*(C'\fR. In a coding region, codons are 3\-letter sequences, so +we can think of the DNA snippet as a sequence of 3\-letter records. 
The +naive regexp +.PP +.Vb 3 +\& # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC" +\& $dna = "ATCGTTGAATGCAAATGACATGAC"; +\& $dna =~ /TGA/; +.Ve +.PP +doesn't work; it may match a \f(CW\*(C`TGA\*(C'\fR, but there is no guarantee that +the match is aligned with codon boundaries, \fIe.g.\fR, the substring +\&\f(CW\*(C`GTT\ GAA\*(C'\fR gives a match. A better solution is +.PP +.Vb 3 +\& while ($dna =~ /(\ew\ew\ew)*?TGA/g) { # note the minimal *? +\& print "Got a TGA stop codon at position ", pos $dna, "\en"; +\& } +.Ve +.PP +which prints +.PP +.Vb 2 +\& Got a TGA stop codon at position 18 +\& Got a TGA stop codon at position 23 +.Ve +.PP +Position 18 is good, but position 23 is bogus. What happened? +.PP +The answer is that our regexp works well until we get past the last +real match. Then the regexp will fail to match a synchronized \f(CW\*(C`TGA\*(C'\fR +and start stepping ahead one character position at a time, which is not what we +want. The solution is to use \f(CW\*(C`\eG\*(C'\fR to anchor the match to the codon +alignment: +.PP +.Vb 3 +\& while ($dna =~ /\eG(\ew\ew\ew)*?TGA/g) { +\& print "Got a TGA stop codon at position ", pos $dna, "\en"; +\& } +.Ve +.PP +This prints +.PP +.Vb 1 +\& Got a TGA stop codon at position 18 +.Ve +.PP +which is the correct answer. This example illustrates that it is +important not only to match what is desired, but to reject what is not +desired. +.PP +(There are other regexp modifiers that are available, such as +\&\f(CW\*(C`/o\*(C'\fR, but their specialized uses are beyond the +scope of this introduction.) +.PP +\fISearch and replace\fR +.IX Subsection "Search and replace" +.PP +Regular expressions also play a big role in \fIsearch and replace\fR +operations in Perl. Search and replace is accomplished with the +\&\f(CW\*(C`s///\*(C'\fR operator. The general form is +\&\f(CW\*(C`s/regexp/replacement/modifiers\*(C'\fR, with everything we know about +regexps and modifiers applying in this case as well. 
The +\&\fIreplacement\fR is a Perl double-quoted string that replaces in the +string whatever is matched with the \f(CW\*(C`regexp\*(C'\fR. The operator \f(CW\*(C`=~\*(C'\fR is +also used here to associate a string with \f(CW\*(C`s///\*(C'\fR. If matching +against \f(CW$_\fR, the \f(CW\*(C`$_\ =~\*(C'\fR can be dropped. If there is a match, +\&\f(CW\*(C`s///\*(C'\fR returns the number of substitutions made; otherwise it returns +false. Here are a few examples: +.PP +.Vb 8 +\& $x = "Time to feed the cat!"; +\& $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" +\& if ($x =~ s/^(Time.*hacker)!$/$1 now!/) { +\& $more_insistent = 1; +\& } +\& $y = "\*(Aqquoted words\*(Aq"; +\& $y =~ s/^\*(Aq(.*)\*(Aq$/$1/; # strip single quotes, +\& # $y contains "quoted words" +.Ve +.PP +In the last example, the whole string was matched, but only the part +inside the single quotes was grouped. With the \f(CW\*(C`s///\*(C'\fR operator, the +matched variables \f(CW$1\fR, \f(CW$2\fR, \fIetc\fR. are immediately available for use +in the replacement expression, so we use \f(CW$1\fR to replace the quoted +string with just what was quoted. With the global modifier, \f(CW\*(C`s///g\*(C'\fR +will search and replace all occurrences of the regexp in the string: +.PP +.Vb 6 +\& $x = "I batted 4 for 4"; +\& $x =~ s/4/four/; # doesn\*(Aqt do it all: +\& # $x contains "I batted four for 4" +\& $x = "I batted 4 for 4"; +\& $x =~ s/4/four/g; # does it all: +\& # $x contains "I batted four for four" +.Ve +.PP +If you prefer "regex" over "regexp" in this tutorial, you could use +the following program to replace it: +.PP +.Vb 9 +\& % cat > simple_replace +\& #!/usr/bin/perl +\& $regexp = shift; +\& $replacement = shift; +\& while (<>) { +\& s/$regexp/$replacement/g; +\& print; +\& } +\& ^D +\& +\& % simple_replace regexp regex perlretut.pod +.Ve +.PP +In \f(CW\*(C`simple_replace\*(C'\fR we used the \f(CW\*(C`s///g\*(C'\fR modifier to replace all +occurrences of the regexp on each line. 
(Even though the regular +expression appears in a loop, Perl is smart enough to compile it +only once.) As with \f(CW\*(C`simple_grep\*(C'\fR, both the +\&\f(CW\*(C`print\*(C'\fR and the \f(CW\*(C`s/$regexp/$replacement/g\*(C'\fR use \f(CW$_\fR implicitly. +.PP +If you don't want \f(CW\*(C`s///\*(C'\fR to change your original variable you can use +the non-destructive substitute modifier, \f(CW\*(C`s///r\*(C'\fR. This changes the +behavior so that \f(CW\*(C`s///r\*(C'\fR returns the final substituted string +(instead of the number of substitutions): +.PP +.Vb 3 +\& $x = "I like dogs."; +\& $y = $x =~ s/dogs/cats/r; +\& print "$x $y\en"; +.Ve +.PP +That example will print "I like dogs. I like cats." Notice the original +\&\f(CW$x\fR variable has not been affected. The overall +result of the substitution is instead stored in \f(CW$y\fR. If the +substitution doesn't affect anything then the original string is +returned: +.PP +.Vb 3 +\& $x = "I like dogs."; +\& $y = $x =~ s/elephants/cougars/r; +\& print "$x $y\en"; # prints "I like dogs. I like dogs." +.Ve +.PP +One other interesting thing that the \f(CW\*(C`s///r\*(C'\fR flag allows is chaining +substitutions: +.PP +.Vb 4 +\& $x = "Cats are great."; +\& print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ +\& s/Frogs/Hedgehogs/r, "\en"; +\& # prints "Hedgehogs are great." +.Ve +.PP +A modifier available specifically to search and replace is the +\&\f(CW\*(C`s///e\*(C'\fR evaluation modifier. \f(CW\*(C`s///e\*(C'\fR treats the +replacement text as Perl code, rather than a double-quoted +string. The value that the code returns is substituted for the +matched substring. \f(CW\*(C`s///e\*(C'\fR is useful if you need to do a bit of +computation in the process of replacing text. 
This example counts +character frequencies in a line: +.PP +.Vb 4 +\& $x = "Bill the cat"; +\& $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself +\& print "frequency of \*(Aq$_\*(Aq is $chars{$_}\en" +\& foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars); +.Ve +.PP +This prints +.PP +.Vb 9 +\& frequency of \*(Aq \*(Aq is 2 +\& frequency of \*(Aqt\*(Aq is 2 +\& frequency of \*(Aql\*(Aq is 2 +\& frequency of \*(AqB\*(Aq is 1 +\& frequency of \*(Aqc\*(Aq is 1 +\& frequency of \*(Aqe\*(Aq is 1 +\& frequency of \*(Aqh\*(Aq is 1 +\& frequency of \*(Aqi\*(Aq is 1 +\& frequency of \*(Aqa\*(Aq is 1 +.Ve +.PP +As with the match \f(CW\*(C`m//\*(C'\fR operator, \f(CW\*(C`s///\*(C'\fR can use other delimiters, +such as \f(CW\*(C`s!!!\*(C'\fR and \f(CW\*(C`s{}{}\*(C'\fR, and even \f(CW\*(C`s{}//\*(C'\fR. If single quotes are +used \f(CW\*(C`s\*(Aq\*(Aq\*(Aq\*(C'\fR, then the regexp and replacement are +treated as single-quoted strings and there are no +variable substitutions. \f(CW\*(C`s///\*(C'\fR in list context +returns the same thing as in scalar context, \fIi.e.\fR, the number of +matches. +.PP +\fIThe split function\fR +.IX Subsection "The split function" +.PP +The \f(CWsplit()\fR function is another place where a regexp is used. +\&\f(CW\*(C`split /regexp/, string, limit\*(C'\fR separates the \f(CW\*(C`string\*(C'\fR operand into +a list of substrings and returns that list. The regexp must be designed +to match whatever constitutes the separators for the desired substrings. +The \f(CW\*(C`limit\*(C'\fR, if present, constrains splitting into no more than \f(CW\*(C`limit\*(C'\fR +number of strings. For example, to split a string into words, use +.PP +.Vb 4 +\& $x = "Calvin and Hobbes"; +\& @words = split /\es+/, $x; # $words[0] = \*(AqCalvin\*(Aq +\& # $words[1] = \*(Aqand\*(Aq +\& # $words[2] = \*(AqHobbes\*(Aq +.Ve +.PP +If the empty regexp \f(CW\*(C`//\*(C'\fR is used, the regexp always matches and +the string is split into individual characters. 
If the regexp has +groupings, then the resulting list contains the matched substrings from the +groupings as well. For instance, +.PP +.Vb 12 +\& $x = "/usr/bin/perl"; +\& @dirs = split m!/!, $x; # $dirs[0] = \*(Aq\*(Aq +\& # $dirs[1] = \*(Aqusr\*(Aq +\& # $dirs[2] = \*(Aqbin\*(Aq +\& # $dirs[3] = \*(Aqperl\*(Aq +\& @parts = split m!(/)!, $x; # $parts[0] = \*(Aq\*(Aq +\& # $parts[1] = \*(Aq/\*(Aq +\& # $parts[2] = \*(Aqusr\*(Aq +\& # $parts[3] = \*(Aq/\*(Aq +\& # $parts[4] = \*(Aqbin\*(Aq +\& # $parts[5] = \*(Aq/\*(Aq +\& # $parts[6] = \*(Aqperl\*(Aq +.Ve +.PP +Since the first character of \f(CW$x\fR matched the regexp, \f(CW\*(C`split\*(C'\fR prepended +an empty initial element to the list. +.PP +If you have read this far, congratulations! You now have all the basic +tools needed to use regular expressions to solve a wide range of text +processing problems. If this is your first time through the tutorial, +why not stop here and play around with regexps a while.... Part\ 2 +concerns the more esoteric aspects of regular expressions and those +concepts certainly aren't needed right at the start. +.SH "Part 2: Power tools" +.IX Header "Part 2: Power tools" +OK, you know the basics of regexps and you want to know more. If +matching regular expressions is analogous to a walk in the woods, then +the tools discussed in Part 1 are analogous to topo maps and a +compass, basic tools we use all the time. Most of the tools in part 2 +are analogous to flare guns and satellite phones. They aren't used +too often on a hike, but when we are stuck, they can be invaluable. +.PP +What follows are the more advanced, less used, or sometimes esoteric +capabilities of Perl regexps. In Part 2, we will assume you are +comfortable with the basics and concentrate on the advanced features. 
+.SS "More on characters, strings, and character classes" +.IX Subsection "More on characters, strings, and character classes" +There are a number of escape sequences and character classes that we +haven't covered yet. +.PP +There are several escape sequences that convert characters or strings +between upper and lower case, and they are also available within +patterns. \f(CW\*(C`\el\*(C'\fR and \f(CW\*(C`\eu\*(C'\fR convert the next character to lower or +upper case, respectively: +.PP +.Vb 4 +\& $x = "perl"; +\& $string =~ /\eu$x/; # matches \*(AqPerl\*(Aq in $string +\& $x = "M(rs?|s)\e\e."; # note the double backslash +\& $string =~ /\el$x/; # matches \*(Aqmr.\*(Aq, \*(Aqmrs.\*(Aq, and \*(Aqms.\*(Aq, +.Ve +.PP +A \f(CW\*(C`\eL\*(C'\fR or \f(CW\*(C`\eU\*(C'\fR indicates a lasting conversion of case, until +terminated by \f(CW\*(C`\eE\*(C'\fR or thrown over by another \f(CW\*(C`\eU\*(C'\fR or \f(CW\*(C`\eL\*(C'\fR: +.PP +.Vb 4 +\& $x = "This word is in lower case:\eL SHOUT\eE"; +\& $x =~ /shout/; # matches +\& $x = "I STILL KEYPUNCH CARDS FOR MY 360"; +\& $x =~ /\eUkeypunch/; # matches punch card string +.Ve +.PP +If there is no \f(CW\*(C`\eE\*(C'\fR, case is converted until the end of the +string. The regexps \f(CW\*(C`\eL\eu$word\*(C'\fR or \f(CW\*(C`\eu\eL$word\*(C'\fR convert the first +character of \f(CW$word\fR to uppercase and the rest of the characters to +lowercase. (Beyond ASCII characters, it gets somewhat more complicated; +\&\f(CW\*(C`\eu\*(C'\fR actually performs \fItitlecase\fR mapping, which for most characters +is the same as uppercase, but not for all; see +<https://unicode.org/faq/casemap_charprop.html#4>.) +.PP +Control characters can be escaped with \f(CW\*(C`\ec\*(C'\fR, so that a control-Z +character would be matched with \f(CW\*(C`\ecZ\*(C'\fR. The escape sequence +\&\f(CW\*(C`\eQ\*(C'\fR...\f(CW\*(C`\eE\*(C'\fR quotes, or protects most non-alphabetic characters. 
For +instance, +.PP +.Vb 2 +\& $x = "\eQThat !^*&%~& cat!"; +\& $x =~ /\eQ!^*&%~&\eE/; # check for rough language +.Ve +.PP +It does not protect \f(CW\*(Aq$\*(Aq\fR or \f(CW\*(Aq@\*(Aq\fR, so that variables can still be +substituted. +.PP +\&\f(CW\*(C`\eQ\*(C'\fR, \f(CW\*(C`\eL\*(C'\fR, \f(CW\*(C`\el\*(C'\fR, \f(CW\*(C`\eU\*(C'\fR, \f(CW\*(C`\eu\*(C'\fR and \f(CW\*(C`\eE\*(C'\fR are actually part of +double-quotish syntax, and not part of regexp syntax proper. They will +work if they appear in a regular expression embedded directly in a +program, but not when contained in a string that is interpolated in a +pattern. +.PP +Perl regexps can handle more than just the +standard ASCII character set. Perl supports \fIUnicode\fR, a standard +for representing the alphabets from virtually all of the world's written +languages, and a host of symbols. Perl's text strings are Unicode strings, so +they can contain characters with a value (codepoint or character number) higher +than 255. +.PP +What does this mean for regexps? Well, regexp users don't need to know +much about Perl's internal representation of strings. But they do need +to know 1) how to represent Unicode characters in a regexp and 2) that +a matching operation will treat the string to be searched as a sequence +of characters, not bytes. The answer to 1) is that Unicode characters +greater than \f(CWchr(255)\fR are represented using the \f(CW\*(C`\ex{hex}\*(C'\fR notation, because +\&\f(CW\*(C`\ex\*(C'\fR\fIXY\fR (without curly braces and \fIXY\fR are two hex digits) doesn't +go further than 255. (Starting in Perl 5.14, if you're an octal fan, +you can also use \f(CW\*(C`\eo{oct}\*(C'\fR.) +.PP +.Vb 2 +\& /\ex{263a}/; # match a Unicode smiley face :) +\& /\ex{ 263a }/; # Same +.Ve +.PP +\&\fBNOTE\fR: In Perl 5.6.0 it used to be that one needed to say \f(CW\*(C`use +utf8\*(C'\fR to use any Unicode features. 
This is no longer the case: for +almost all Unicode processing, the explicit \f(CW\*(C`utf8\*(C'\fR pragma is not +needed. (The only case where it matters is if your Perl script is in +Unicode and encoded in UTF\-8, then an explicit \f(CW\*(C`use utf8\*(C'\fR is needed.) +.PP +Figuring out the hexadecimal sequence of a Unicode character you want +or deciphering someone else's hexadecimal Unicode regexp is about as +much fun as programming in machine code. So another way to specify +Unicode characters is to use the \fInamed character\fR escape +sequence \f(CW\*(C`\eN{\fR\f(CIname\fR\f(CW}\*(C'\fR. \fIname\fR is a name for the Unicode character, as +specified in the Unicode standard. For instance, if we wanted to +represent or match the astrological sign for the planet Mercury, we +could use +.PP +.Vb 3 +\& $x = "abc\eN{MERCURY}def"; +\& $x =~ /\eN{MERCURY}/; # matches +\& $x =~ /\eN{ MERCURY }/; # Also matches +.Ve +.PP +One can also use "short" names: +.PP +.Vb 2 +\& print "\eN{GREEK SMALL LETTER SIGMA} is called sigma.\en"; +\& print "\eN{greek:Sigma} is an upper\-case sigma.\en"; +.Ve +.PP +You can also restrict names to a certain alphabet by specifying the +charnames pragma: +.PP +.Vb 2 +\& use charnames qw(greek); +\& print "\eN{sigma} is Greek sigma\en"; +.Ve +.PP +An index of character names is available on-line from the Unicode +Consortium, <https://www.unicode.org/charts/charindex.html>; explanatory +material with links to other resources at +<https://www.unicode.org/standard/where>. +.PP +Starting in Perl v5.32, an alternative to \f(CW\*(C`\eN{...}\*(C'\fR for full names is +available, and that is to say +.PP +.Vb 1 +\& /\ep{Name=greek small letter sigma}/ +.Ve +.PP +The casing of the character name is irrelevant when used in \f(CW\*(C`\ep{}\*(C'\fR, as +are most spaces, underscores and hyphens. (A few outlier characters +cause problems with ignoring all of them always. 
The details (which you +can look up when you get more proficient, and if ever needed) are in +<https://www.unicode.org/reports/tr44/tr44\-24.html#UAX44\-LM2>). +.PP +The answer to requirement 2) is that a regexp (mostly) +uses Unicode characters. The "mostly" is for messy backward +compatibility reasons, but starting in Perl 5.14, any regexp compiled in +the scope of a \f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR (which is automatically +turned on within the scope of a \f(CW\*(C`use v5.12\*(C'\fR or higher) will turn that +"mostly" into "always". If you want to handle Unicode properly, you +should ensure that \f(CW\*(Aqunicode_strings\*(Aq\fR is turned on. +Internally, this is encoded to bytes using either UTF\-8 or a native 8 +bit encoding, depending on the history of the string, but conceptually +it is a sequence of characters, not bytes. See perlunitut for a +tutorial about that. +.PP +Let us now discuss Unicode character classes, most usually called +"character properties". These are represented by the \f(CW\*(C`\ep{\fR\f(CIname\fR\f(CW}\*(C'\fR +escape sequence. The negation of this is \f(CW\*(C`\eP{\fR\f(CIname\fR\f(CW}\*(C'\fR. For example, +to match lower and uppercase characters, +.PP +.Vb 5 +\& $x = "BOB"; +\& $x =~ /^\ep{IsUpper}/; # matches, uppercase char class +\& $x =~ /^\eP{IsUpper}/; # doesn\*(Aqt match, char class sans uppercase +\& $x =~ /^\ep{IsLower}/; # doesn\*(Aqt match, lowercase char class +\& $x =~ /^\eP{IsLower}/; # matches, char class sans lowercase +.Ve +.PP +(The "\f(CW\*(C`Is\*(C'\fR" is optional.) +.PP +There are many, many Unicode character properties. For the full list +see perluniprops. Most of them have synonyms with shorter names, +also listed there. Some synonyms are a single character. For these, +you can drop the braces. For instance, \f(CW\*(C`\epM\*(C'\fR is the same thing as +\&\f(CW\*(C`\ep{Mark}\*(C'\fR, meaning things like accent marks. 
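The short and long property spellings described above can be compared side by side. What follows is only a minimal sketch, not part of the original tutorial: the variable names and the test string (a letter followed by a combining accent) are chosen for illustration, and the `\p{Upper}` spelling relies on the "Is" prefix being optional.

```perl
use strict;
use warnings;

# "e" followed by COMBINING ACUTE ACCENT (U+0301): one letter, one mark.
my $str = "e\x{301}";

# Count the matches by assigning the match list to an empty list
# in scalar context.
my $letters = () = $str =~ /\pL/g;        # \pL is the short form of \p{Letter}
my $marks   = () = $str =~ /\p{Mark}/g;   # \p{Mark} is the long form of \pM

print "letters=$letters marks=$marks\n";  # prints "letters=1 marks=1"

# \P{...} (capital P) negates the property: no uppercase anywhere.
my $no_upper = "bob" =~ /^\P{Upper}+$/ ? 1 : 0;
print "no_upper=$no_upper\n";             # prints "no_upper=1"
```

The string itself is never printed, so no output layer is needed for the wide character.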
+.PP +The Unicode \f(CW\*(C`\ep{Script}\*(C'\fR and \f(CW\*(C`\ep{Script_Extensions}\*(C'\fR properties are +used to categorize every Unicode character into the language script it +is written in. For example, +English, French, and a bunch of other European languages are written in +the Latin script. But there is also the Greek script, the Thai script, +the Katakana script, \fIetc\fR. (\f(CW\*(C`Script\*(C'\fR is an older, less advanced, +form of \f(CW\*(C`Script_Extensions\*(C'\fR, retained only for backwards +compatibility.) You can test whether a character is in a particular +script with, for example \f(CW\*(C`\ep{Latin}\*(C'\fR, \f(CW\*(C`\ep{Greek}\*(C'\fR, or +\&\f(CW\*(C`\ep{Katakana}\*(C'\fR. To test if it isn't in the Balinese script, you would +use \f(CW\*(C`\eP{Balinese}\*(C'\fR. (These all use \f(CW\*(C`Script_Extensions\*(C'\fR under the +hood, as that gives better results.) +.PP +What we have described so far is the single form of the \f(CW\*(C`\ep{...}\*(C'\fR character +classes. There is also a compound form which you may run into. These +look like \f(CW\*(C`\ep{\fR\f(CIname\fR\f(CW=\fR\f(CIvalue\fR\f(CW}\*(C'\fR or \f(CW\*(C`\ep{\fR\f(CIname\fR\f(CW:\fR\f(CIvalue\fR\f(CW}\*(C'\fR (the equals sign and colon +can be used interchangeably). These are more general than the single form, +and in fact most of the single forms are just Perl-defined shortcuts for common +compound forms. For example, the script examples in the previous paragraph +could be written equivalently as \f(CW\*(C`\ep{Script_Extensions=Latin}\*(C'\fR, \f(CW\*(C`\ep{Script_Extensions:Greek}\*(C'\fR, +\&\f(CW\*(C`\ep{script_extensions=katakana}\*(C'\fR, and \f(CW\*(C`\eP{script_extensions=balinese}\*(C'\fR (case is irrelevant +between the \f(CW\*(C`{}\*(C'\fR braces). You may +never have to use the compound forms, but sometimes it is necessary, and their +use can make your code easier to understand. 
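To make the equivalence of the single and compound forms concrete, here is a small hedged sketch. It assumes a Perl recent enough to know the `Script_Extensions` property, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $alpha = "\x{3b1}";    # GREEK SMALL LETTER ALPHA

# Single form, and two spellings of the compound form; case is
# irrelevant inside the braces, and '=' and ':' are interchangeable.
my $single   = $alpha =~ /^\p{Greek}$/                     ? 1 : 0;
my $compound = $alpha =~ /^\p{Script_Extensions=Greek}$/   ? 1 : 0;
my $colon    = $alpha =~ /^\p{script_extensions:greek}$/   ? 1 : 0;

# The negated form: alpha is not in the Latin script.
my $not_latin = $alpha =~ /^\P{Script_Extensions=Latin}$/  ? 1 : 0;

print "$single $compound $colon $not_latin\n";    # prints "1 1 1 1"
```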
+.PP +\&\f(CW\*(C`\eX\*(C'\fR is an abbreviation for a character class that comprises +a Unicode \fIextended grapheme cluster\fR. This represents a "logical character": +what appears to be a single character, but may be represented internally by more +than one. As an example, using the Unicode full names, \fIe.g.\fR, "A\ +\ COMBINING\ RING" is a grapheme cluster with base character "A" and combining character +"COMBINING\ RING", which translates in Danish to "A" with the circle atop it, +as in the word Ångstrom. +.PP +For the full and latest information about Unicode see the latest +Unicode standard, or the Unicode Consortium's website <https://www.unicode.org>. +.PP +As if all those classes weren't enough, Perl also defines POSIX-style +character classes. These have the form \f(CW\*(C`[:\fR\f(CIname\fR\f(CW:]\*(C'\fR, with \fIname\fR the +name of the POSIX class. The POSIX classes are \f(CW\*(C`alpha\*(C'\fR, \f(CW\*(C`alnum\*(C'\fR, +\&\f(CW\*(C`ascii\*(C'\fR, \f(CW\*(C`cntrl\*(C'\fR, \f(CW\*(C`digit\*(C'\fR, \f(CW\*(C`graph\*(C'\fR, \f(CW\*(C`lower\*(C'\fR, \f(CW\*(C`print\*(C'\fR, \f(CW\*(C`punct\*(C'\fR, +\&\f(CW\*(C`space\*(C'\fR, \f(CW\*(C`upper\*(C'\fR, and \f(CW\*(C`xdigit\*(C'\fR, and two extensions, \f(CW\*(C`word\*(C'\fR (a Perl +extension to match \f(CW\*(C`\ew\*(C'\fR), and \f(CW\*(C`blank\*(C'\fR (a GNU extension). The \f(CW\*(C`/a\*(C'\fR +modifier restricts these to matching just in the ASCII range; otherwise +they can match the same as their corresponding Perl Unicode classes: +\&\f(CW\*(C`[:upper:]\*(C'\fR is the same as \f(CW\*(C`\ep{IsUpper}\*(C'\fR, \fIetc\fR. (There are some +exceptions and gotchas with this; see perlrecharclass for a full +discussion.) The \f(CW\*(C`[:digit:]\*(C'\fR, \f(CW\*(C`[:word:]\*(C'\fR, and +\&\f(CW\*(C`[:space:]\*(C'\fR correspond to the familiar \f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\ew\*(C'\fR, and \f(CW\*(C`\es\*(C'\fR +character classes.
To negate a POSIX class, put a \f(CW\*(Aq^\*(Aq\fR in front of +the name, so that, \fIe.g.\fR, \f(CW\*(C`[:^digit:]\*(C'\fR corresponds to \f(CW\*(C`\eD\*(C'\fR and, under +Unicode, \f(CW\*(C`\eP{IsDigit}\*(C'\fR. The Unicode and POSIX character classes can +be used just like \f(CW\*(C`\ed\*(C'\fR, with the exception that POSIX character +classes can only be used inside of a character class: +.PP +.Vb 6 +\& /\es+[abc[:digit:]xyz]\es*/; # match a,b,c,x,y,z, or a digit +\& /^=item\es[[:digit:]]/; # match \*(Aq=item\*(Aq, +\& # followed by a space and a digit +\& /\es+[abc\ep{IsDigit}xyz]\es+/; # match a,b,c,x,y,z, or a digit +\& /^=item\es\ep{IsDigit}/; # match \*(Aq=item\*(Aq, +\& # followed by a space and a digit +.Ve +.PP +Whew! That is all the rest of the characters and character classes. +.SS "Compiling and saving regular expressions" +.IX Subsection "Compiling and saving regular expressions" +In Part 1 we mentioned that Perl compiles a regexp into a compact +sequence of opcodes. Thus, a compiled regexp is a data structure +that can be stored once and used again and again. The regexp quote +\&\f(CW\*(C`qr//\*(C'\fR does exactly that: \f(CW\*(C`qr/string/\*(C'\fR compiles the \f(CW\*(C`string\*(C'\fR as a +regexp and transforms the result into a form that can be assigned to a +variable: +.PP +.Vb 1 +\& $reg = qr/foo+bar?/; # reg contains a compiled regexp +.Ve +.PP +Then \f(CW$reg\fR can be used as a regexp: +.PP +.Vb 3 +\& $x = "fooooba"; +\& $x =~ $reg; # matches, just like /foo+bar?/ +\& $x =~ /$reg/; # same thing, alternate form +.Ve +.PP +\&\f(CW$reg\fR can also be interpolated into a larger regexp: +.PP +.Vb 1 +\& $x =~ /(abc)?$reg/; # still matches +.Ve +.PP +As with the matching operator, the regexp quote can use different +delimiters, \fIe.g.\fR, \f(CW\*(C`qr!!\*(C'\fR, \f(CW\*(C`qr{}\*(C'\fR or \f(CW\*(C`qr~~\*(C'\fR. Apostrophes +as delimiters (\f(CW\*(C`qr\*(Aq\*(Aq\*(C'\fR) inhibit any interpolation. 
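The behaviour of `qr//` described above, including the effect of apostrophe delimiters, can be checked with a short script. This is a sketch under the assumption of a reasonably modern Perl; the variable names other than `$reg` and `$x` are invented for the illustration.

```perl
use strict;
use warnings;

my $reg = qr/foo+bar?/;    # compiled once, reusable afterwards
my $x   = "fooooba";

my $direct   = $x =~ $reg                        ? 1 : 0;  # used on its own
my $embedded = "abcfooooba" =~ /^(abc)?$reg$/    ? 1 : 0;  # interpolated

my $word    = "cat";
my $interp  = qr{$word};   # $word is interpolated: the pattern is "cat"
my $literal = qr'$word';   # no interpolation: '$' is the end anchor,
                           # followed by the literal text "word"

my $interp_hit  = "cat" =~ $interp  ? 1 : 0;
my $literal_hit = "cat" =~ $literal ? 1 : 0;

print "direct=$direct embedded=$embedded ",
      "interp=$interp_hit literal=$literal_hit\n";
# prints "direct=1 embedded=1 interp=1 literal=0"
```

The last result shows why apostrophe delimiters need care: the text `$word` reaches the regexp compiler unchanged, where `$` is treated as an anchor rather than as the start of a variable name.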
+.PP +Pre-compiled regexps are useful for creating dynamic matches that +don't need to be recompiled each time they are encountered. Using +pre-compiled regexps, we write a \f(CW\*(C`grep_step\*(C'\fR program which greps +for a sequence of patterns, advancing to the next pattern as soon +as one has been satisfied. +.PP +.Vb 4 +\& % cat > grep_step +\& #!/usr/bin/perl +\& # grep_step \- match <number> regexps, one after the other +\& # usage: grep_step <number> regexp1 regexp2 ... file1 file2 ... +\& +\& $number = shift; +\& $regexp[$_] = shift foreach (0..$number\-1); +\& @compiled = map qr/$_/, @regexp; +\& while ($line = <>) { +\& if ($line =~ /$compiled[0]/) { +\& print $line; +\& shift @compiled; +\& last unless @compiled; +\& } +\& } +\& ^D +\& +\& % grep_step 3 shift print last grep_step +\& $number = shift; +\& print $line; +\& last unless @compiled; +.Ve +.PP +Storing pre-compiled regexps in an array \f(CW@compiled\fR allows us to +simply loop through the regexps without any recompilation, thus gaining +flexibility without sacrificing speed. +.SS "Composing regular expressions at runtime" +.IX Subsection "Composing regular expressions at runtime" +Backtracking is more efficient than repeated tries with different regular +expressions. If there are several regular expressions and a match with +any of them is acceptable, then it is possible to combine them into a set +of alternatives. If the individual expressions are input data, this +can be done by programming a join operation. We'll exploit this idea in +an improved version of the \f(CW\*(C`simple_grep\*(C'\fR program: a program that matches +multiple patterns: +.PP +.Vb 4 +\& % cat > multi_grep +\& #!/usr/bin/perl +\& # multi_grep \- match any of <number> regexps +\& # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...
+\& +\& $number = shift; +\& $regexp[$_] = shift foreach (0..$number\-1); +\& $pattern = join \*(Aq|\*(Aq, @regexp; +\& +\& while ($line = <>) { +\& print $line if $line =~ /$pattern/; +\& } +\& ^D +\& +\& % multi_grep 2 shift for multi_grep +\& $number = shift; +\& $regexp[$_] = shift foreach (0..$number\-1); +.Ve +.PP +Sometimes it is advantageous to construct a pattern from the \fIinput\fR +that is to be analyzed and use the permissible values on the left +hand side of the matching operations. As an example for this somewhat +paradoxical situation, let's assume that our input contains a command +verb which should match one out of a set of available command verbs, +with the additional twist that commands may be abbreviated as long as +the given string is unique. The program below demonstrates the basic +algorithm. +.PP +.Vb 10 +\& % cat > keymatch +\& #!/usr/bin/perl +\& $kwds = \*(Aqcopy compare list print\*(Aq; +\& while( $cmd = <> ){ +\& $cmd =~ s/^\es+|\es+$//g; # trim leading and trailing spaces +\& if( ( @matches = $kwds =~ /\eb$cmd\ew*/g ) == 1 ){ +\& print "command: \*(Aq@matches\*(Aq\en"; +\& } elsif( @matches == 0 ){ +\& print "no such command: \*(Aq$cmd\*(Aq\en"; +\& } else { +\& print "not unique: \*(Aq$cmd\*(Aq (could be one of: @matches)\en"; +\& } +\& } +\& ^D +\& +\& % keymatch +\& li +\& command: \*(Aqlist\*(Aq +\& co +\& not unique: \*(Aqco\*(Aq (could be one of: copy compare) +\& printer +\& no such command: \*(Aqprinter\*(Aq +.Ve +.PP +Rather than trying to match the input against the keywords, we match the +combined set of keywords against the input. The pattern matching +operation \f(CW\*(C`$kwds\ =~\ /\eb($cmd\ew*)/g\*(C'\fR does several things at the +same time. It makes sure that the given command begins where a keyword +begins (\f(CW\*(C`\eb\*(C'\fR). It tolerates abbreviations due to the added \f(CW\*(C`\ew*\*(C'\fR. It +tells us the number of matches (\f(CW\*(C`scalar @matches\*(C'\fR) and all the keywords +that were actually matched. 
You could hardly ask for more. +.SS "Embedding comments and modifiers in a regular expression" +.IX Subsection "Embedding comments and modifiers in a regular expression" +Starting with this section, we will be discussing Perl's set of +\&\fIextended patterns\fR. These are extensions to the traditional regular +expression syntax that provide powerful new tools for pattern +matching. We have already seen extensions in the form of the minimal +matching constructs \f(CW\*(C`??\*(C'\fR, \f(CW\*(C`*?\*(C'\fR, \f(CW\*(C`+?\*(C'\fR, \f(CW\*(C`{n,m}?\*(C'\fR, \f(CW\*(C`{n,}?\*(C'\fR, and +\&\f(CW\*(C`{,n}?\*(C'\fR. Most of the extensions below have the form \f(CW\*(C`(?char...)\*(C'\fR, +where the \f(CW\*(C`char\*(C'\fR is a character that determines the type of extension. +.PP +The first extension is an embedded comment \f(CW\*(C`(?#text)\*(C'\fR. This embeds a +comment into the regular expression without affecting its meaning. The +comment should not have any closing parentheses in the text. An +example is +.PP +.Vb 1 +\& /(?# Match an integer:)[+\-]?\ed+/; +.Ve +.PP +This style of commenting has been largely superseded by the raw, +freeform commenting that is allowed with the \f(CW\*(C`/x\*(C'\fR modifier. +.PP +Most modifiers, such as \f(CW\*(C`/i\*(C'\fR, \f(CW\*(C`/m\*(C'\fR, \f(CW\*(C`/s\*(C'\fR and \f(CW\*(C`/x\*(C'\fR (or any +combination thereof) can also be embedded in +a regexp using \f(CW\*(C`(?i)\*(C'\fR, \f(CW\*(C`(?m)\*(C'\fR, \f(CW\*(C`(?s)\*(C'\fR, and \f(CW\*(C`(?x)\*(C'\fR. For instance, +.PP +.Vb 7 +\& /(?i)yes/; # match \*(Aqyes\*(Aq case insensitively +\& /yes/i; # same thing +\& /(?x)( # freeform version of an integer regexp +\& [+\-]? # match an optional sign +\& \ed+ # match a sequence of digits +\& ) +\& /x; +.Ve +.PP +Embedded modifiers can have two important advantages over the usual +modifiers. Embedded modifiers allow a custom set of modifiers for +\&\fIeach\fR regexp pattern. 
This is great for matching an array of regexps +that must have different modifiers: +.PP +.Vb 8 +\& $pattern[0] = \*(Aq(?i)doctor\*(Aq; +\& $pattern[1] = \*(AqJohnson\*(Aq; +\& ... +\& while (<>) { +\& foreach $patt (@pattern) { +\& print if /$patt/; +\& } +\& } +.Ve +.PP +The second advantage is that embedded modifiers (except \f(CW\*(C`/p\*(C'\fR, which +modifies the entire regexp) only affect the regexp +inside the group the embedded modifier is contained in. So grouping +can be used to localize the modifier's effects: +.PP +.Vb 1 +\& /Answer: ((?i)yes)/; # matches \*(AqAnswer: yes\*(Aq, \*(AqAnswer: YES\*(Aq, etc. +.Ve +.PP +Embedded modifiers can also turn off any modifiers already present +by using, \fIe.g.\fR, \f(CW\*(C`(?\-i)\*(C'\fR. Modifiers can also be combined into +a single expression, \fIe.g.\fR, \f(CW\*(C`(?s\-i)\*(C'\fR turns on single line mode and +turns off case insensitivity. +.PP +Embedded modifiers may also be added to a non-capturing grouping. +\&\f(CW\*(C`(?i\-m:regexp)\*(C'\fR is a non-capturing grouping that matches \f(CW\*(C`regexp\*(C'\fR +case insensitively and turns off multi-line mode. +.SS "Looking ahead and looking behind" +.IX Subsection "Looking ahead and looking behind" +This section concerns the lookahead and lookbehind assertions. First, +a little background. +.PP +In Perl regular expressions, most regexp elements "eat up" a certain +amount of string when they match. For instance, the regexp element +\&\f(CW\*(C`[abc]\*(C'\fR eats up one character of the string when it matches, in the +sense that Perl moves to the next character position in the string +after the match. There are some elements, however, that don't eat up +characters (advance the character position) if they match. The examples +we have seen so far are the anchors. The anchor \f(CW\*(Aq^\*(Aq\fR matches the +beginning of the line, but doesn't eat any characters. 
Similarly, the +word boundary anchor \f(CW\*(C`\eb\*(C'\fR matches wherever a character matching \f(CW\*(C`\ew\*(C'\fR +is next to a character that doesn't, but it doesn't eat up any +characters itself. Anchors are examples of \fIzero-width assertions\fR: +zero-width, because they consume +no characters, and assertions, because they test some property of the +string. In the context of our walk in the woods analogy to regexp +matching, most regexp elements move us along a trail, but anchors have +us stop a moment and check our surroundings. If the local environment +checks out, we can proceed forward. But if the local environment +doesn't satisfy us, we must backtrack. +.PP +Checking the environment entails either looking ahead on the trail, +looking behind, or both. \f(CW\*(Aq^\*(Aq\fR looks behind, to see that there are no +characters before. \f(CW\*(Aq$\*(Aq\fR looks ahead, to see that there are no +characters after. \f(CW\*(C`\eb\*(C'\fR looks both ahead and behind, to see if the +characters on either side differ in their "word-ness". +.PP +The lookahead and lookbehind assertions are generalizations of the +anchor concept. Lookahead and lookbehind are zero-width assertions +that let us specify which characters we want to test for. The +lookahead assertion is denoted by \f(CW\*(C`(?=regexp)\*(C'\fR or (starting in 5.32, +experimentally in 5.28) \f(CW\*(C`(*pla:regexp)\*(C'\fR or +\&\f(CW\*(C`(*positive_lookahead:regexp)\*(C'\fR; and the lookbehind assertion is denoted +by \f(CW\*(C`(?<=fixed\-regexp)\*(C'\fR or (starting in 5.32, experimentally in +5.28) \f(CW\*(C`(*plb:fixed\-regexp)\*(C'\fR or \f(CW\*(C`(*positive_lookbehind:fixed\-regexp)\*(C'\fR. 
+Some examples are +.PP +.Vb 8 +\& $x = "I catch the housecat \*(AqTom\-cat\*(Aq with catnip"; +\& $x =~ /cat(*pla:\es)/; # matches \*(Aqcat\*(Aq in \*(Aqhousecat\*(Aq +\& @catwords = ($x =~ /(?<=\es)cat\ew+/g); # matches, +\& # $catwords[0] = \*(Aqcatch\*(Aq +\& # $catwords[1] = \*(Aqcatnip\*(Aq +\& $x =~ /\ebcat\eb/; # matches \*(Aqcat\*(Aq in \*(AqTom\-cat\*(Aq +\& $x =~ /(?<=\es)cat(?=\es)/; # doesn\*(Aqt match; no isolated \*(Aqcat\*(Aq in +\& # middle of $x +.Ve +.PP +Note that the parentheses in these are +non-capturing, since these are zero-width assertions. Thus in the +second regexp, the substrings captured are those of the whole regexp +itself. Lookahead can match arbitrary regexps, but +lookbehind prior to 5.30 \f(CW\*(C`(?<=fixed\-regexp)\*(C'\fR only works for regexps +of fixed width, \fIi.e.\fR, a fixed number of characters long. Thus +\&\f(CW\*(C`(?<=(ab|bc))\*(C'\fR is fine, but \f(CW\*(C`(?<=(ab)*)\*(C'\fR prior to 5.30 is not. +.PP +The negated versions of the lookahead and lookbehind assertions are +denoted by \f(CW\*(C`(?!regexp)\*(C'\fR and \f(CW\*(C`(?<!fixed\-regexp)\*(C'\fR respectively. +Or, starting in 5.32 (experimentally in 5.28), \f(CW\*(C`(*nla:regexp)\*(C'\fR, +\&\f(CW\*(C`(*negative_lookahead:regexp)\*(C'\fR, \f(CW\*(C`(*nlb:regexp)\*(C'\fR, or +\&\f(CW\*(C`(*negative_lookbehind:regexp)\*(C'\fR. +They evaluate true if the regexps do \fInot\fR match: +.PP +.Vb 4 +\& $x = "foobar"; +\& $x =~ /foo(?!bar)/; # doesn\*(Aqt match, \*(Aqbar\*(Aq follows \*(Aqfoo\*(Aq +\& $x =~ /foo(?!baz)/; # matches, \*(Aqbaz\*(Aq doesn\*(Aqt follow \*(Aqfoo\*(Aq +\& $x =~ /(?<!\es)foo/; # matches, there is no \es before \*(Aqfoo\*(Aq +.Ve +.PP +Here is an example where a string containing blank-separated words, +numbers and single dashes is to be split into its components. +Using \f(CW\*(C`/\es+/\*(C'\fR alone won't work, because spaces are not required between +dashes, or a word or a dash. 
Additional places for a split are established +by looking ahead and behind: +.PP +.Vb 5 +\& $str = "one two \- \-\-6\-8"; +\& @toks = split / \es+ # a run of spaces +\& | (?<=\eS) (?=\-) # any non\-space followed by \*(Aq\-\*(Aq +\& | (?<=\-) (?=\eS) # a \*(Aq\-\*(Aq followed by any non\-space +\& /x, $str; # @toks = qw(one two \- \- \- 6 \- 8) +.Ve +.SS "Using independent subexpressions to prevent backtracking" +.IX Subsection "Using independent subexpressions to prevent backtracking" +\&\fIIndependent subexpressions\fR (or atomic subexpressions) are regular +expressions, in the context of a larger regular expression, that +function independently of the larger regular expression. That is, they +consume as much or as little of the string as they wish without regard +for the ability of the larger regexp to match. Independent +subexpressions are represented by +\&\f(CW\*(C`(?>regexp)\*(C'\fR or (starting in 5.32, experimentally in 5.28) +\&\f(CW\*(C`(*atomic:regexp)\*(C'\fR. We can illustrate their behavior by first +considering an ordinary regexp: +.PP +.Vb 2 +\& $x = "ab"; +\& $x =~ /a*ab/; # matches +.Ve +.PP +This obviously matches, but in the process of matching, the +subexpression \f(CW\*(C`a*\*(C'\fR first grabbed the \f(CW\*(Aqa\*(Aq\fR. Doing so, however, +wouldn't allow the whole regexp to match, so after backtracking, \f(CW\*(C`a*\*(C'\fR +eventually gave back the \f(CW\*(Aqa\*(Aq\fR and matched the empty string. Here, what +\&\f(CW\*(C`a*\*(C'\fR matched was \fIdependent\fR on what the rest of the regexp matched. +.PP +Contrast that with an independent subexpression: +.PP +.Vb 1 +\& $x =~ /(?>a*)ab/; # doesn\*(Aqt match! +.Ve +.PP +The independent subexpression \f(CW\*(C`(?>a*)\*(C'\fR doesn't care about the rest +of the regexp, so it sees an \f(CW\*(Aqa\*(Aq\fR and grabs it. Then the rest of the +regexp \f(CW\*(C`ab\*(C'\fR cannot match. 
Because \f(CW\*(C`(?>a*)\*(C'\fR is independent, there +is no backtracking and the independent subexpression does not give +up its \f(CW\*(Aqa\*(Aq\fR. Thus the match of the regexp as a whole fails. A similar +behavior occurs with completely independent regexps: +.PP +.Vb 3 +\& $x = "ab"; +\& $x =~ /a*/g; # matches, eats an \*(Aqa\*(Aq +\& $x =~ /\eGab/g; # doesn\*(Aqt match, no \*(Aqa\*(Aq available +.Ve +.PP +Here \f(CW\*(C`/g\*(C'\fR and \f(CW\*(C`\eG\*(C'\fR create a "tag team" handoff of the string from +one regexp to the other. Regexps with an independent subexpression are +much like this, with a handoff of the string to the independent +subexpression, and a handoff of the string back to the enclosing +regexp. +.PP +The ability of an independent subexpression to prevent backtracking +can be quite useful. Suppose we want to match a non-empty string +enclosed in parentheses up to two levels deep. Then the following +regexp matches: +.PP +.Vb 2 +\& $x = "abc(de(fg)h"; # unbalanced parentheses +\& $x =~ /\e( ( [ ^ () ]+ | \e( [ ^ () ]* \e) )+ \e)/xx; +.Ve +.PP +The regexp matches an open parenthesis, one or more copies of an +alternation, and a close parenthesis. The alternation is two-way, with +the first alternative \f(CW\*(C`[^()]+\*(C'\fR matching a substring with no +parentheses and the second alternative \f(CW\*(C`\e([^()]*\e)\*(C'\fR matching a +substring delimited by parentheses. The problem with this regexp is +that it is pathological: it has nested indeterminate quantifiers +of the form \f(CW\*(C`(a+|b)+\*(C'\fR. We discussed in Part 1 how nested quantifiers +like this could take an exponentially long time to execute if there +is no match possible. To prevent the exponential blowup, we need to +prevent useless backtracking at some point. 
This can be done by +enclosing the inner quantifier as an independent subexpression: +.PP +.Vb 1 +\& $x =~ /\e( ( (?> [ ^ () ]+ ) | \e([ ^ () ]* \e) )+ \e)/xx; +.Ve +.PP +Here, \f(CW\*(C`(?>[^()]+)\*(C'\fR breaks the degeneracy of string partitioning +by gobbling up as much of the string as possible and keeping it. Then +match failures fail much more quickly. +.SS "Conditional expressions" +.IX Subsection "Conditional expressions" +A \fIconditional expression\fR is a form of if-then-else statement +that allows one to choose which patterns are to be matched, based on +some condition. There are two types of conditional expression: +\&\f(CW\*(C`(?(\fR\f(CIcondition\fR\f(CW)\fR\f(CIyes\-regexp\fR\f(CW)\*(C'\fR and +\&\f(CW\*(C`(?(condition)\fR\f(CIyes\-regexp\fR\f(CW|\fR\f(CIno\-regexp\fR\f(CW)\*(C'\fR. +\&\f(CW\*(C`(?(\fR\f(CIcondition\fR\f(CW)\fR\f(CIyes\-regexp\fR\f(CW)\*(C'\fR is +like an \f(CW\*(Aqif\ ()\ {}\*(Aq\fR statement in Perl. If the \fIcondition\fR is true, +the \fIyes-regexp\fR will be matched. If the \fIcondition\fR is false, the +\&\fIyes-regexp\fR will be skipped and Perl will move onto the next regexp +element. The second form is like an \f(CW\*(Aqif\ ()\ {}\ else\ {}\*(Aq\fR statement +in Perl. If the \fIcondition\fR is true, the \fIyes-regexp\fR will be +matched, otherwise the \fIno-regexp\fR will be matched. +.PP +The \fIcondition\fR can have several forms. The first form is simply an +integer in parentheses \f(CW\*(C`(\fR\f(CIinteger\fR\f(CW)\*(C'\fR. It is true if the corresponding +backreference \f(CW\*(C`\e\fR\f(CIinteger\fR\f(CW\*(C'\fR matched earlier in the regexp. The same +thing can be done with a name associated with a capture group, written +as \f(CW\*(C`(<\fR\f(CIname\fR\f(CW>)\*(C'\fR or \f(CW\*(C`(\*(Aq\fR\f(CIname\fR\f(CW\*(Aq)\*(C'\fR. The second form is a bare +zero-width assertion \f(CW\*(C`(?...)\*(C'\fR, either a lookahead, a lookbehind, or a +code assertion (discussed in the next section). 
The third set of forms +provides tests that return true if the expression is executed within +a recursion (\f(CW\*(C`(R)\*(C'\fR) or is being called from some capturing group, +referenced either by number (\f(CW\*(C`(R1)\*(C'\fR, \f(CW\*(C`(R2)\*(C'\fR,...) or by name +(\f(CW\*(C`(R&\fR\f(CIname\fR\f(CW)\*(C'\fR). +.PP +The integer or name form of the \f(CW\*(C`condition\*(C'\fR allows us to choose, +with more flexibility, what to match based on what matched earlier in the +regexp. This searches for words of the form \f(CW"$x$x"\fR or \f(CW"$x$y$y$x"\fR: +.PP +.Vb 9 +\& % simple_grep \*(Aq^(\ew+)(\ew+)?(?(2)\eg2\eg1|\eg1)$\*(Aq /usr/dict/words +\& beriberi +\& coco +\& couscous +\& deed +\& ... +\& toot +\& toto +\& tutu +.Ve +.PP +The lookbehind \f(CW\*(C`condition\*(C'\fR allows, along with backreferences, +an earlier part of the match to influence a later part of the +match. For instance, +.PP +.Vb 1 +\& /[ATGC]+(?(?<=AA)G|C)$/; +.Ve +.PP +matches a DNA sequence such that it either ends in \f(CW\*(C`AAG\*(C'\fR, or some +other base pair combination and \f(CW\*(AqC\*(Aq\fR. Note that the form is +\&\f(CW\*(C`(?(?<=AA)G|C)\*(C'\fR and not \f(CW\*(C`(?((?<=AA))G|C)\*(C'\fR; for the +lookahead, lookbehind or code assertions, the parentheses around the +conditional are not needed. +.SS "Defining named patterns" +.IX Subsection "Defining named patterns" +Some regular expressions use identical subpatterns in several places. +Starting with Perl 5.10, it is possible to define named subpatterns in +a section of the pattern so that they can be called up by name +anywhere in the pattern. The syntactic pattern for this definition +group is \f(CW\*(C`(?(DEFINE)(?<\fR\f(CIname\fR\f(CW>\fR\f(CIpattern\fR\f(CW)...)\*(C'\fR. An insertion +of a named pattern is written as \f(CW\*(C`(?&\fR\f(CIname\fR\f(CW)\*(C'\fR. +.PP +The example below illustrates this feature using the pattern for +floating point numbers that was presented earlier on.
The three +subpatterns that are used more than once are the optional sign, the +digit sequence for an integer and the decimal fraction. The \f(CW\*(C`DEFINE\*(C'\fR +group at the end of the pattern contains their definition. Notice +that the decimal fraction pattern is the first place where we can +reuse the integer pattern. +.PP +.Vb 8 +\& /^ (?&osg)\e * ( (?&int)(?&dec)? | (?&dec) ) +\& (?: [eE](?&osg)(?&int) )? +\& $ +\& (?(DEFINE) +\& (?<osg>[\-+]?) # optional sign +\& (?<int>\ed++) # integer +\& (?<dec>\e.(?&int)) # decimal fraction +\& )/x +.Ve +.SS "Recursive patterns" +.IX Subsection "Recursive patterns" +This feature (introduced in Perl 5.10) significantly extends the +power of Perl's pattern matching. By referring to some other +capture group anywhere in the pattern with the construct +\&\f(CW\*(C`(?\fR\f(CIgroup\-ref\fR\f(CW)\*(C'\fR, the \fIpattern\fR within the referenced group is used +as an independent subpattern in place of the group reference itself. +Because the group reference may be contained \fIwithin\fR the group it +refers to, it is now possible to apply pattern matching to tasks that +hitherto required a recursive parser. +.PP +To illustrate this feature, we'll design a pattern that matches if +a string is a palindrome. (This is a word or a sentence that, +while ignoring spaces, punctuation and case, reads the same backwards +as forwards.) We begin by observing that the empty string or a string +containing just one word character is a palindrome. Otherwise it must +have a word character up front and the same at its end, with another +palindrome in between. +.PP +.Vb 1 +\& /(?: (\ew) (?...Here be a palindrome...) \eg{ \-1 } | \ew? )/x +.Ve +.PP +Adding \f(CW\*(C`\eW*\*(C'\fR at either end to eliminate what is to be ignored, we already +have the full pattern: +.PP +.Vb 4 +\& my $pp = qr/^(\eW* (?: (\ew) (?1) \eg{\-1} | \ew? ) \eW*)$/ix; +\& for $s ( "saippuakauppias", "A man, a plan, a canal: Panama!"
){ +\& print "\*(Aq$s\*(Aq is a palindrome\en" if $s =~ /$pp/; +\& } +.Ve +.PP +In \f(CW\*(C`(?...)\*(C'\fR both absolute and relative backreferences may be used. +The entire pattern can be reinserted with \f(CW\*(C`(?R)\*(C'\fR or \f(CW\*(C`(?0)\*(C'\fR. +If you prefer to name your groups, you can use \f(CW\*(C`(?&\fR\f(CIname\fR\f(CW)\*(C'\fR to +recurse into that group. +.SS "A bit of magic: executing Perl code in a regular expression" +.IX Subsection "A bit of magic: executing Perl code in a regular expression" +Normally, regexps are a part of Perl expressions. +\&\fICode evaluation\fR expressions turn that around by allowing +arbitrary Perl code to be a part of a regexp. A code evaluation +expression is denoted \f(CW\*(C`(?{\fR\f(CIcode\fR\f(CW})\*(C'\fR, with \fIcode\fR a string of Perl +statements. +.PP +Code expressions are zero-width assertions, and the value they return +depends on their environment. There are two possibilities: either the +code expression is used as a conditional in a conditional expression +\&\f(CW\*(C`(?(\fR\f(CIcondition\fR\f(CW)...)\*(C'\fR, or it is not. If the code expression is a +conditional, the code is evaluated and the result (\fIi.e.\fR, the result of +the last statement) is used to determine truth or falsehood. If the +code expression is not used as a conditional, the assertion always +evaluates true and the result is put into the special variable +\&\f(CW$^R\fR. The variable \f(CW$^R\fR can then be used in code expressions later +in the regexp. Here are some silly examples: +.PP +.Vb 5 +\& $x = "abcdef"; +\& $x =~ /abc(?{print "Hi Mom!";})def/; # matches, +\& # prints \*(AqHi Mom!\*(Aq +\& $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn\*(Aqt match, +\& # no \*(AqHi Mom!\*(Aq +.Ve +.PP +Pay careful attention to the next example: +.PP +.Vb 3 +\& $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn\*(Aqt match, +\& # no \*(AqHi Mom!\*(Aq +\& # but why not? 
+.Ve +.PP +At first glance, you'd think that it shouldn't print, because obviously +the \f(CW\*(C`ddd\*(C'\fR isn't going to match the target string. But look at this +example: +.PP +.Vb 2 +\& $x =~ /abc(?{print "Hi Mom!";})[dD]dd/; # doesn\*(Aqt match, +\& # but _does_ print +.Ve +.PP +Hmm. What happened here? If you've been following along, you know that +the above pattern should be effectively (almost) the same as the last one; +enclosing the \f(CW\*(Aqd\*(Aq\fR in a character class isn't going to change what it +matches. So why does the first not print while the second one does? +.PP +The answer lies in the optimizations the regexp engine makes. In the first +case, all the engine sees are plain old characters (aside from the +\&\f(CW\*(C`?{}\*(C'\fR construct). It's smart enough to realize that the string \f(CW\*(Aqddd\*(Aq\fR +doesn't occur in our target string before actually running the pattern +through. But in the second case, we've tricked it into thinking that our +pattern is more complicated. It takes a look, sees our +character class, and decides that it will have to actually run the +pattern to determine whether or not it matches, and in the process of +running it hits the print statement before it discovers that we don't +have a match. +.PP +To take a closer look at how the engine does optimizations, see the +section "Pragmas and debugging" below. +.PP +More fun with \f(CW\*(C`?{}\*(C'\fR: +.PP +.Vb 6 +\& $x =~ /(?{print "Hi Mom!";})/; # matches, +\& # prints \*(AqHi Mom!\*(Aq +\& $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches, +\& # prints \*(Aq1\*(Aq +\& $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches, +\& # prints \*(Aq1\*(Aq +.Ve +.PP +The bit of magic mentioned in the section title occurs when the regexp +backtracks in the process of searching for a match. 
If the regexp +backtracks over a code expression and if the variables used within are +localized using \f(CW\*(C`local\*(C'\fR, the changes in the variables produced by the +code expression are undone! Thus, if we wanted to count how many times +a character got matched inside a group, we could use, \fIe.g.\fR, +.PP +.Vb 11 +\& $x = "aaaa"; +\& $count = 0; # initialize \*(Aqa\*(Aq count +\& $c = "bob"; # test if $c gets clobbered +\& $x =~ /(?{local $c = 0;}) # initialize count +\& ( a # match \*(Aqa\*(Aq +\& (?{local $c = $c + 1;}) # increment count +\& )* # do this any number of times, +\& aa # but match \*(Aqaa\*(Aq at the end +\& (?{$count = $c;}) # copy local $c var into $count +\& /x; +\& print "\*(Aqa\*(Aq count is $count, \e$c variable is \*(Aq$c\*(Aq\en"; +.Ve +.PP +This prints +.PP +.Vb 1 +\& \*(Aqa\*(Aq count is 2, $c variable is \*(Aqbob\*(Aq +.Ve +.PP +If we replace the \f(CW\*(C`\ (?{local\ $c\ =\ $c\ +\ 1;})\*(C'\fR with +\&\f(CW\*(C`\ (?{$c\ =\ $c\ +\ 1;})\*(C'\fR, the variable changes are \fInot\fR undone +during backtracking, and we get +.PP +.Vb 1 +\& \*(Aqa\*(Aq count is 4, $c variable is \*(Aqbob\*(Aq +.Ve +.PP +Note that only localized variable changes are undone. Other side +effects of code expression execution are permanent. Thus +.PP +.Vb 2 +\& $x = "aaaa"; +\& $x =~ /(a(?{print "Yow\en";}))*aa/; +.Ve +.PP +produces +.PP +.Vb 4 +\& Yow +\& Yow +\& Yow +\& Yow +.Ve +.PP +The result \f(CW$^R\fR is automatically localized, so that it will behave +properly in the presence of backtracking. +.PP +This example uses a code expression in a conditional to match a +definite article, either \f(CW\*(Aqthe\*(Aq\fR in English or \f(CW\*(Aqder|die|das\*(Aq\fR in +German: +.PP +.Vb 11 +\& $lang = \*(AqDE\*(Aq; # use German +\& ... +\& $text = "das"; +\& print "matched\en" +\& if $text =~ /(?(?{ +\& $lang eq \*(AqEN\*(Aq; # is the language English? 
+\& }) +\& the | # if so, then match \*(Aqthe\*(Aq +\& (der|die|das) # else, match \*(Aqder|die|das\*(Aq +\& ) +\& /xi; +.Ve +.PP +Note that the syntax here is \f(CW\*(C`(?(?{...})\fR\f(CIyes\-regexp\fR\f(CW|\fR\f(CIno\-regexp\fR\f(CW)\*(C'\fR, not +\&\f(CW\*(C`(?((?{...}))\fR\f(CIyes\-regexp\fR\f(CW|\fR\f(CIno\-regexp\fR\f(CW)\*(C'\fR. In other words, in the case of a +code expression, we don't need the extra parentheses around the +conditional. +.PP +If you try to use code expressions where the code text is contained within +an interpolated variable, rather than appearing literally in the pattern, +Perl may surprise you: +.PP +.Vb 5 +\& $bar = 5; +\& $pat = \*(Aq(?{ 1 })\*(Aq; +\& /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated +\& /foo(?{ 1 })$bar/; # compiles ok, $bar interpolated +\& /foo${pat}bar/; # compile error! +\& +\& $pat = qr/(?{ $foo = 1 })/; # precompile code regexp +\& /foo${pat}bar/; # compiles ok +.Ve +.PP +If a regexp has a variable that interpolates a code expression, Perl +treats the regexp as an error. If the code expression is precompiled into +a variable, however, interpolating is ok. The question is, why is this an +error? +.PP +The reason is that variable interpolation and code expressions +together pose a security risk. The combination is dangerous because +many programmers who write search engines often take user input and +plug it directly into a regexp: +.PP +.Vb 3 +\& $regexp = <>; # read user\-supplied regexp +\& chomp $regexp; # get rid of possible newline +\& $text =~ /$regexp/; # search $text for the $regexp +.Ve +.PP +If the \f(CW$regexp\fR variable contains a code expression, the user could +then execute arbitrary Perl code. For instance, some joker could +search for \f(CW\*(C`system(\*(Aqrm\ \-rf\ *\*(Aq);\*(C'\fR to erase your files. In this +sense, the combination of interpolation and code expressions \fItaints\fR +your regexp.
So by default, using both interpolation and code +expressions in the same regexp is not allowed. If you're not +concerned about malicious users, it is possible to bypass this +security check by invoking \f(CW\*(C`use\ re\ \*(Aqeval\*(Aq\*(C'\fR: +.PP +.Vb 4 +\& use re \*(Aqeval\*(Aq; # throw caution out the door +\& $bar = 5; +\& $pat = \*(Aq(?{ 1 })\*(Aq; +\& /foo${pat}bar/; # compiles ok +.Ve +.PP +Another form of code expression is the \fIpattern code expression\fR. +The pattern code expression is like a regular code expression, except +that the result of the code evaluation is treated as a regular +expression and matched immediately. A simple example is +.PP +.Vb 4 +\& $length = 5; +\& $char = \*(Aqa\*(Aq; +\& $x = \*(Aqaaaaabb\*(Aq; +\& $x =~ /(??{$char x $length})/x; # matches, there are 5 of \*(Aqa\*(Aq +.Ve +.PP +This final example contains both ordinary and pattern code +expressions. It detects whether a binary string \f(CW1101010010001...\fR has a +Fibonacci spacing 0,1,1,2,3,5,... of the \f(CW\*(Aq1\*(Aq\fR's: +.PP +.Vb 12 +\& $x = "1101010010001000001"; +\& $z0 = \*(Aq\*(Aq; $z1 = \*(Aq0\*(Aq; # initial conditions +\& print "It is a Fibonacci sequence\en" +\& if $x =~ /^1 # match an initial \*(Aq1\*(Aq +\& (?: +\& ((??{ $z0 })) # match some \*(Aq0\*(Aq +\& 1 # and then a \*(Aq1\*(Aq +\& (?{ $z0 = $z1; $z1 .= $^N; }) +\& )+ # repeat as needed +\& $ # that is all there is +\& /x; +\& printf "Largest sequence matched was %d\en", length($z1)\-length($z0); +.Ve +.PP +Remember that \f(CW$^N\fR is set to whatever was matched by the last +completed capture group. This prints +.PP +.Vb 2 +\& It is a Fibonacci sequence +\& Largest sequence matched was 5 +.Ve +.PP +Ha! Try that with your garden variety regexp package... +.PP +Note that the variables \f(CW$z0\fR and \f(CW$z1\fR are not substituted when the +regexp is compiled, as happens for ordinary variables outside a code +expression. 
Rather, the whole code block is parsed as perl code at the +same time as perl is compiling the code containing the literal regexp +pattern. +.PP +This regexp without the \f(CW\*(C`/x\*(C'\fR modifier is +.PP +.Vb 1 +\& /^1(?:((??{ $z0 }))1(?{ $z0 = $z1; $z1 .= $^N; }))+$/ +.Ve +.PP +which shows that spaces are still possible in the code parts. Nevertheless, +when working with code and conditional expressions, the extended form of +regexps is almost necessary in creating and debugging regexps. +.SS "Backtracking control verbs" +.IX Subsection "Backtracking control verbs" +Perl 5.10 introduced a number of control verbs intended to provide +detailed control over the backtracking process, by directly influencing +the regexp engine and by providing monitoring techniques. See +"Special Backtracking Control Verbs" in perlre for a detailed +description. +.PP +Below is just one example, illustrating the control verb \f(CW\*(C`(*FAIL)\*(C'\fR, +which may be abbreviated as \f(CW\*(C`(*F)\*(C'\fR. If this is inserted in a regexp +it will cause it to fail, just as it would at some +mismatch between the pattern and the string. Processing +of the regexp continues as it would after any "normal" +failure, so that, for instance, the next position in the string or another +alternative will be tried. As failing to match doesn't preserve capture +groups or produce results, it may be necessary to use this in +combination with embedded code. +.PP +.Vb 4 +\& %count = (); +\& "supercalifragilisticexpialidocious" =~ +\& /([aeiou])(?{ $count{$1}++; })(*FAIL)/i; +\& printf "%3d \*(Aq%s\*(Aq\en", $count{$_}, $_ for (sort keys %count); +.Ve +.PP +The pattern begins with a class matching a subset of letters. Whenever +this matches, a statement like \f(CW\*(C`$count{\*(Aqa\*(Aq}++;\*(C'\fR is executed, incrementing +the letter's counter. 
Then \f(CW\*(C`(*FAIL)\*(C'\fR does what it says, and +the regexp engine proceeds according to the book: as long as the end of +the string hasn't been reached, the position is advanced before looking +for another vowel. Thus, match or no match makes no difference, and the +regexp engine proceeds until the entire string has been inspected. +(It's remarkable that an alternative solution using something like +.PP +.Vb 2 +\& $count2{lc($_)}++ for split(\*(Aq\*(Aq, "supercalifragilisticexpialidocious"); +\& printf "%3d \*(Aq%s\*(Aq\en", $count2{$_}, $_ for ( qw{ a e i o u } ); +.Ve +.PP +is considerably slower.) +.SS "Pragmas and debugging" +.IX Subsection "Pragmas and debugging" +Speaking of debugging, there are several pragmas available to control +and debug regexps in Perl. We have already encountered one pragma in +the previous section, \f(CW\*(C`use\ re\ \*(Aqeval\*(Aq;\*(C'\fR, that allows variable +interpolation and code expressions to coexist in a regexp. The other +pragmas are +.PP +.Vb 3 +\& use re \*(Aqtaint\*(Aq; +\& $tainted = <>; +\& @parts = ($tainted =~ /(\ew+)\es+(\ew+)/); # @parts is now tainted +.Ve +.PP +The \f(CW\*(C`taint\*(C'\fR pragma causes any substrings from a match with a tainted +variable to be tainted as well, if your perl supports tainting +(see perlsec). This is not normally the case, as +regexps are often used to extract the safe bits from a tainted +variable. Use \f(CW\*(C`taint\*(C'\fR when you are not extracting safe bits, but are +performing some other processing. Both \f(CW\*(C`taint\*(C'\fR and \f(CW\*(C`eval\*(C'\fR pragmas +are lexically scoped, which means they are in effect only until +the end of the block enclosing the pragmas. +.PP +.Vb 2 +\& use re \*(Aq/m\*(Aq; # or any other flags +\& $multiline_string =~ /^foo/; # /m is implied +.Ve +.PP +The \f(CW\*(C`re \*(Aq/flags\*(Aq\*(C'\fR pragma (introduced in Perl +5.14) turns on the given regular expression flags +until the end of the lexical scope. 
See +"'/flags' mode" in re for more +detail. +.PP +.Vb 2 +\& use re \*(Aqdebug\*(Aq; +\& /^(.*)$/s; # output debugging info +\& +\& use re \*(Aqdebugcolor\*(Aq; +\& /^(.*)$/s; # output debugging info in living color +.Ve +.PP +The global \f(CW\*(C`debug\*(C'\fR and \f(CW\*(C`debugcolor\*(C'\fR pragmas allow one to get +detailed debugging info about regexp compilation and +execution. \f(CW\*(C`debugcolor\*(C'\fR is the same as debug, except the debugging +information is displayed in color on terminals that can display +termcap color sequences. Here is example output: +.PP +.Vb 10 +\& % perl \-e \*(Aquse re "debug"; "abc" =~ /a*b+c/;\*(Aq +\& Compiling REx \*(Aqa*b+c\*(Aq +\& size 9 first at 1 +\& 1: STAR(4) +\& 2: EXACT <a>(0) +\& 4: PLUS(7) +\& 5: EXACT <b>(0) +\& 7: EXACT <c>(9) +\& 9: END(0) +\& floating \*(Aqbc\*(Aq at 0..2147483647 (checking floating) minlen 2 +\& Guessing start of match, REx \*(Aqa*b+c\*(Aq against \*(Aqabc\*(Aq... +\& Found floating substr \*(Aqbc\*(Aq at offset 1... +\& Guessed: match at offset 0 +\& Matching REx \*(Aqa*b+c\*(Aq against \*(Aqabc\*(Aq +\& Setting an EVAL scope, savestack=3 +\& 0 <> <abc> | 1: STAR +\& EXACT <a> can match 1 times out of 32767... +\& Setting an EVAL scope, savestack=3 +\& 1 <a> <bc> | 4: PLUS +\& EXACT <b> can match 1 times out of 32767... +\& Setting an EVAL scope, savestack=3 +\& 2 <ab> <c> | 7: EXACT <c> +\& 3 <abc> <> | 9: END +\& Match successful! +\& Freeing REx: \*(Aqa*b+c\*(Aq +.Ve +.PP +If you have gotten this far into the tutorial, you can probably guess +what the different parts of the debugging output tell you. The first +part +.PP +.Vb 8 +\& Compiling REx \*(Aqa*b+c\*(Aq +\& size 9 first at 1 +\& 1: STAR(4) +\& 2: EXACT <a>(0) +\& 4: PLUS(7) +\& 5: EXACT <b>(0) +\& 7: EXACT <c>(9) +\& 9: END(0) +.Ve +.PP +describes the compilation stage. \f(CWSTAR(4)\fR means that there is a +starred object, in this case \f(CW\*(Aqa\*(Aq\fR, and if it matches, goto line 4, +\&\fIi.e.\fR, \f(CWPLUS(7)\fR. 
The middle lines describe some heuristics and +optimizations performed before a match: +.PP +.Vb 4 +\& floating \*(Aqbc\*(Aq at 0..2147483647 (checking floating) minlen 2 +\& Guessing start of match, REx \*(Aqa*b+c\*(Aq against \*(Aqabc\*(Aq... +\& Found floating substr \*(Aqbc\*(Aq at offset 1... +\& Guessed: match at offset 0 +.Ve +.PP +Then the match is executed and the remaining lines describe the +process: +.PP +.Vb 12 +\& Matching REx \*(Aqa*b+c\*(Aq against \*(Aqabc\*(Aq +\& Setting an EVAL scope, savestack=3 +\& 0 <> <abc> | 1: STAR +\& EXACT <a> can match 1 times out of 32767... +\& Setting an EVAL scope, savestack=3 +\& 1 <a> <bc> | 4: PLUS +\& EXACT <b> can match 1 times out of 32767... +\& Setting an EVAL scope, savestack=3 +\& 2 <ab> <c> | 7: EXACT <c> +\& 3 <abc> <> | 9: END +\& Match successful! +\& Freeing REx: \*(Aqa*b+c\*(Aq +.Ve +.PP +Each step is of the form \f(CW\*(C`n\ <x>\ <y>\*(C'\fR, with \f(CW\*(C`<x>\*(C'\fR the +part of the string matched and \f(CW\*(C`<y>\*(C'\fR the part not yet +matched. The \f(CW\*(C`|\ \ 1:\ \ STAR\*(C'\fR says that Perl is at line number 1 +in the compilation list above. See +"Debugging Regular Expressions" in perldebguts for much more detail. +.PP +An alternative method of debugging regexps is to embed \f(CW\*(C`print\*(C'\fR +statements within the regexp. This provides a blow-by-blow account of +the backtracking in an alternation: +.PP +.Vb 12 +\& "that this" =~ m@(?{print "Start at position ", pos, "\en";}) +\& t(?{print "t1\en";}) +\& h(?{print "h1\en";}) +\& i(?{print "i1\en";}) +\& s(?{print "s1\en";}) +\& | +\& t(?{print "t2\en";}) +\& h(?{print "h2\en";}) +\& a(?{print "a2\en";}) +\& t(?{print "t2\en";}) +\& (?{print "Done at position ", pos, "\en";}) +\& @x; +.Ve +.PP +prints +.PP +.Vb 8 +\& Start at position 0 +\& t1 +\& h1 +\& t2 +\& h2 +\& a2 +\& t2 +\& Done at position 4 +.Ve +.SH "SEE ALSO" +.IX Header "SEE ALSO" +This is just a tutorial. 
For the full story on Perl regular +expressions, see the perlre regular expressions reference page. +.PP +For more information on the matching \f(CW\*(C`m//\*(C'\fR and substitution \f(CW\*(C`s///\*(C'\fR +operators, see "Regexp Quote-Like Operators" in perlop. For +information on the \f(CW\*(C`split\*(C'\fR operation, see "split" in perlfunc. +.PP +For an excellent all-around resource on the care and feeding of +regular expressions, see the book \fIMastering Regular Expressions\fR by +Jeffrey Friedl (published by O'Reilly, ISBN 1\-56592\-257\-3). +.SH "AUTHOR AND COPYRIGHT" +.IX Header "AUTHOR AND COPYRIGHT" +Copyright (c) 2000 Mark Kvale. +All rights reserved. +Now maintained by Perl porters. +.PP +This document may be distributed under the same terms as Perl itself. +.SS Acknowledgments +.IX Subsection "Acknowledgments" +The inspiration for the stop codon DNA example came from the ZIP +code example in chapter 7 of \fIMastering Regular Expressions\fR. +.PP +The author would like to thank Jeff Pinyan, Andrew Johnson, Peter +Haworth, Ronald J Kimball, and Joe Smith for all their helpful +comments.