Diffstat (limited to 'upstream/archlinux/man1/perlretut.1perl')
-rw-r--r-- upstream/archlinux/man1/perlretut.1perl | 3219
1 files changed, 3219 insertions, 0 deletions
diff --git a/upstream/archlinux/man1/perlretut.1perl b/upstream/archlinux/man1/perlretut.1perl
new file mode 100644
index 00000000..2967ebd2
--- /dev/null
+++ b/upstream/archlinux/man1/perlretut.1perl
@@ -0,0 +1,3219 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+. ds C` ""
+. ds C' ""
+'br\}
+.el\{\
+. ds C`
+. ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD. Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+. if \nF \{\
+. de IX
+. tm Index:\\$1\t\\n%\t"\\$2"
+..
+. if !\nF==2 \{\
+. nr % 0
+. nr F 2
+. \}
+. \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "PERLRETUT 1perl"
+.TH PERLRETUT 1perl 2024-02-11 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification. Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+perlretut \- Perl regular expressions tutorial
+.SH DESCRIPTION
+.IX Header "DESCRIPTION"
+This page provides a basic tutorial on understanding, creating and
+using regular expressions in Perl. It serves as a complement to the
+reference page on regular expressions perlre. Regular expressions
+are an integral part of the \f(CW\*(C`m//\*(C'\fR, \f(CW\*(C`s///\*(C'\fR, \f(CW\*(C`qr//\*(C'\fR and \f(CW\*(C`split\*(C'\fR
+operators and so this tutorial also overlaps with
+"Regexp Quote-Like Operators" in perlop and "split" in perlfunc.
+.PP
+Perl is widely renowned for excellence in text processing, and regular
+expressions are one of the big factors behind this fame. Perl regular
+expressions display an efficiency and flexibility unknown in most
+other computer languages. Mastering even the basics of regular
+expressions will allow you to manipulate text with surprising ease.
+.PP
+What is a regular expression? At its most basic, a regular expression
+is a template that is used to determine if a string has certain
+characteristics. The string is most often some text, such as a line,
+sentence, web page, or even a whole book, but it doesn't have to be. It
+could be binary data, for example. Biologists often use Perl to look
+for patterns in long DNA sequences.
+.PP
+Suppose we want to determine if the text in the variable \f(CW$var\fR contains
+the sequence of characters \f(CW\*(C`m\ u\ s\ h\ r\ o\ o\ m\*(C'\fR
+(blanks added for legibility). We can write in Perl
+.PP
+.Vb 1
+\& $var =~ m/mushroom/
+.Ve
+.PP
+The value of this expression will be TRUE if \f(CW$var\fR contains that
+sequence of characters anywhere within it, and FALSE otherwise. The
+portion enclosed in \f(CW\*(Aq/\*(Aq\fR characters denotes the characteristic we
+are looking for.
+We use the term \fIpattern\fR for it. The process of looking to see if the
+pattern occurs in the string is called \fImatching\fR, and the \f(CW"=~"\fR
+operator along with the \f(CW\*(C`m//\*(C'\fR tell Perl to try to match the pattern
+against the string. Note that the pattern is also a string, but a very
+special kind of one, as we will see. Patterns are in common use these
+days;
+examples are the patterns typed into a search engine to find web pages
+and the patterns used to list files in a directory, \fIe.g.\fR, "\f(CW\*(C`ls *.txt\*(C'\fR"
+or "\f(CW\*(C`dir *.*\*(C'\fR". In Perl, the patterns described by regular expressions
+are used not only to search strings, but also to extract desired parts
+of strings, and to do search and replace operations.
+.PP
+Regular expressions have the undeserved reputation of being abstract
+and difficult to understand. This reputation stems from the terse,
+dense notation used to express them, and not from any inherent
+complexity. We recommend using the \f(CW\*(C`/x\*(C'\fR regular
+expression modifier (described below) along with plenty of white space
+to make them less dense, and easier to read. Regular expressions are
+constructed using
+simple concepts like conditionals and loops and are no more difficult
+to understand than the corresponding \f(CW\*(C`if\*(C'\fR conditionals and \f(CW\*(C`while\*(C'\fR
+loops in the Perl language itself.
+.PP
+This tutorial flattens the learning curve by discussing regular
+expression concepts, along with their notation, one at a time and with
+many examples. The first part of the tutorial will progress from the
+simplest word searches to the basic regular expression concepts. If
+you master the first part, you will have all the tools needed to meet
+about 98% of your needs. The second part of the tutorial is for those
+comfortable with the basics, and hungry for more power tools. It
+discusses the more advanced regular expression operators and
+introduces the latest cutting-edge innovations.
+.PP
+A note: to save time, "regular expression" is often abbreviated as
+regexp or regex. Regexp is a more natural abbreviation than regex, but
+is harder to pronounce. The Perl pod documentation is evenly split on
+regexp vs regex; in Perl, there is more than one way to abbreviate it.
+We'll use regexp in this tutorial.
+.PP
+New in v5.22, \f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR applies stricter
+rules than otherwise when compiling regular expression patterns. It can
+find things that, while legal, may not be what you intended.
+.SH "Part 1: The basics"
+.IX Header "Part 1: The basics"
+.SS "Simple word matching"
+.IX Subsection "Simple word matching"
+The simplest regexp is simply a word, or more generally, a string of
+characters. A regexp consisting of just a word matches any string that
+contains that word:
+.PP
+.Vb 1
+\& "Hello World" =~ /World/; # matches
+.Ve
+.PP
+What is this Perl statement all about? \f(CW"Hello World"\fR is a simple
+double-quoted string. \f(CW\*(C`World\*(C'\fR is the regular expression and the
+\&\f(CW\*(C`//\*(C'\fR enclosing \f(CW\*(C`/World/\*(C'\fR tells Perl to search a string for a match.
+The operator \f(CW\*(C`=~\*(C'\fR associates the string with the regexp match and
+produces a true value if the regexp matched, or false if the regexp
+did not match. In our case, \f(CW\*(C`World\*(C'\fR matches the second word in
+\&\f(CW"Hello World"\fR, so the expression is true. Expressions like this
+are useful in conditionals:
+.PP
+.Vb 6
+\& if ("Hello World" =~ /World/) {
+\& print "It matches\en";
+\& }
+\& else {
+\& print "It doesn\*(Aqt match\en";
+\& }
+.Ve
+.PP
+There are useful variations on this theme. The sense of the match can
+be reversed by using the \f(CW\*(C`!~\*(C'\fR operator:
+.PP
+.Vb 6
+\& if ("Hello World" !~ /World/) {
+\& print "It doesn\*(Aqt match\en";
+\& }
+\& else {
+\& print "It matches\en";
+\& }
+.Ve
+.PP
+The literal string in the regexp can be replaced by a variable:
+.PP
+.Vb 7
+\& my $greeting = "World";
+\& if ("Hello World" =~ /$greeting/) {
+\& print "It matches\en";
+\& }
+\& else {
+\& print "It doesn\*(Aqt match\en";
+\& }
+.Ve
+.PP
+If you're matching against the special default variable \f(CW$_\fR, the
+\&\f(CW\*(C`$_ =~\*(C'\fR part can be omitted:
+.PP
+.Vb 7
+\& $_ = "Hello World";
+\& if (/World/) {
+\& print "It matches\en";
+\& }
+\& else {
+\& print "It doesn\*(Aqt match\en";
+\& }
+.Ve
+.PP
+And finally, the \f(CW\*(C`//\*(C'\fR default delimiters for a match can be changed
+to arbitrary delimiters by putting an \f(CW\*(Aqm\*(Aq\fR out front:
+.PP
+.Vb 4
+\& "Hello World" =~ m!World!; # matches, delimited by \*(Aq!\*(Aq
+\& "Hello World" =~ m{World}; # matches, note the paired \*(Aq{}\*(Aq
+\& "/usr/bin/perl" =~ m"/perl"; # matches after \*(Aq/usr/bin\*(Aq,
+\& # \*(Aq/\*(Aq becomes an ordinary char
+.Ve
+.PP
+\&\f(CW\*(C`/World/\*(C'\fR, \f(CW\*(C`m!World!\*(C'\fR, and \f(CW\*(C`m{World}\*(C'\fR all represent the
+same thing. When, \fIe.g.\fR, the quote (\f(CW\*(Aq"\*(Aq\fR) is used as a delimiter, the forward
+slash \f(CW\*(Aq/\*(Aq\fR becomes an ordinary character and can be used in this regexp
+without trouble.
+.PP
+Let's consider how different regexps would match \f(CW"Hello World"\fR:
+.PP
+.Vb 4
+\& "Hello World" =~ /world/; # doesn\*(Aqt match
+\& "Hello World" =~ /o W/; # matches
+\& "Hello World" =~ /oW/; # doesn\*(Aqt match
+\& "Hello World" =~ /World /; # doesn\*(Aqt match
+.Ve
+.PP
+The first regexp \f(CW\*(C`world\*(C'\fR doesn't match because regexps are by default
+case-sensitive. The second regexp matches because the substring
+\&\f(CW\*(Aqo\ W\*(Aq\fR occurs in the string \f(CW"Hello\ World"\fR. The space
+character \f(CW\*(Aq \*(Aq\fR is treated like any other character in a regexp and is
+needed to match in this case. The lack of a space character is the
+reason the third regexp \f(CW\*(AqoW\*(Aq\fR doesn't match. The fourth regexp
+"\f(CW\*(C`World \*(C'\fR" doesn't match because there is a space at the end of the
+regexp, but not at the end of the string. The lesson here is that
+regexps must match a part of the string \fIexactly\fR in order for the
+statement to be true.
+.PP
+If a regexp matches in more than one place in the string, Perl will
+always match at the earliest possible point in the string:
+.PP
+.Vb 2
+\& "Hello World" =~ /o/; # matches \*(Aqo\*(Aq in \*(AqHello\*(Aq
+\& "That hat is red" =~ /hat/; # matches \*(Aqhat\*(Aq in \*(AqThat\*(Aq
+.Ve
+.PP
+With respect to character matching, there are a few more points you
+need to know about. First of all, not all characters can be used
+"as-is" in a match. Some characters, called \fImetacharacters\fR, are
+generally reserved for use in regexp notation. The metacharacters are
+.PP
+.Vb 1
+\& {}[]()^$.|*+?\-#\e
+.Ve
+.PP
+This list is not as definitive as it may appear (or be claimed to be in
+other documentation). For example, \f(CW"#"\fR is a metacharacter only when
+the \f(CW\*(C`/x\*(C'\fR pattern modifier (described below) is used, and both \f(CW"}"\fR
+and \f(CW"]"\fR are metacharacters only when paired with opening \f(CW"{"\fR or
+\&\f(CW"["\fR respectively; other gotchas apply.
+.PP
+The significance of each of these will be explained
+in the rest of the tutorial, but for now, it is important only to know
+that a metacharacter can be matched as-is by putting a backslash before
+it:
+.PP
+.Vb 5
+\& "2+2=4" =~ /2+2/; # doesn\*(Aqt match, + is a metacharacter
+\& "2+2=4" =~ /2\e+2/; # matches, \e+ is treated like an ordinary +
+\& "The interval is [0,1)." =~ /[0,1)./ # is a syntax error!
+\& "The interval is [0,1)." =~ /\e[0,1\e)\e./ # matches
+\& "#!/usr/bin/perl" =~ /#!\e/usr\e/bin\e/perl/; # matches
+.Ve
+.PP
+In the last regexp, the forward slash \f(CW\*(Aq/\*(Aq\fR is also backslashed,
+because it is used to delimit the regexp. This can lead to LTS
+(leaning toothpick syndrome), however, and it is often more readable
+to change delimiters.
+.PP
+.Vb 1
+\& "#!/usr/bin/perl" =~ m!#\e!/usr/bin/perl!; # easier to read
+.Ve
+.PP
+The backslash character \f(CW\*(Aq\e\*(Aq\fR is a metacharacter itself and needs to
+be backslashed:
+.PP
+.Vb 1
+\& \*(AqC:\eWIN32\*(Aq =~ /C:\e\eWIN/; # matches
+.Ve
+.PP
+In situations where it doesn't make sense for a particular metacharacter
+to mean what it normally does, it automatically loses its
+metacharacter-ness and becomes an ordinary character that is to be
+matched literally. For example, the \f(CW\*(Aq}\*(Aq\fR is a metacharacter only when
+it is the mate of a \f(CW\*(Aq{\*(Aq\fR metacharacter. Otherwise it is treated as a
+literal RIGHT CURLY BRACKET. This may lead to unexpected results.
+\&\f(CW\*(C`use re \*(Aqstrict\*(Aq\*(C'\fR can catch some of these.
+.PP
+In addition to the metacharacters, there are some ASCII characters
+which don't have printable character equivalents and are instead
+represented by \fIescape sequences\fR. Common examples are \f(CW\*(C`\et\*(C'\fR for a
+tab, \f(CW\*(C`\en\*(C'\fR for a newline, \f(CW\*(C`\er\*(C'\fR for a carriage return and \f(CW\*(C`\ea\*(C'\fR for a
+bell (or alert). If your string is better thought of as a sequence of arbitrary
+bytes, the octal escape sequence, \fIe.g.\fR, \f(CW\*(C`\e033\*(C'\fR, or hexadecimal escape
+sequence, \fIe.g.\fR, \f(CW\*(C`\ex1B\*(C'\fR may be a more natural representation for your
+bytes. Here are some examples of escapes:
+.PP
+.Vb 5
+\& "1000\et2000" =~ m(0\et2) # matches
+\& "1000\en2000" =~ /0\en20/ # matches
+\& "1000\et2000" =~ /\e000\et2/ # doesn\*(Aqt match, "0" ne "\e000"
+\& "cat" =~ /\eo{143}\ex61\ex74/ # matches in ASCII, but a weird way
+\& # to spell cat
+.Ve
+.PP
+If you've been around Perl a while, all this talk of escape sequences
+may seem familiar. Similar escape sequences are used in double-quoted
+strings and in fact the regexps in Perl are mostly treated as
+double-quoted strings. This means that variables can be used in
+regexps as well. Just like double-quoted strings, the values of the
+variables in the regexp will be substituted in before the regexp is
+evaluated for matching purposes. So we have:
+.PP
+.Vb 4
+\& $foo = \*(Aqhouse\*(Aq;
+\& \*(Aqhousecat\*(Aq =~ /$foo/; # matches
+\& \*(Aqcathouse\*(Aq =~ /cat$foo/; # matches
+\& \*(Aqhousecat\*(Aq =~ /${foo}cat/; # matches
+.Ve
+.PP
+So far, so good. With the knowledge above you can already perform
+searches with just about any literal string regexp you can dream up.
+Here is a \fIvery simple\fR emulation of the Unix grep program:
+.PP
+.Vb 7
+\& % cat > simple_grep
+\& #!/usr/bin/perl
+\& $regexp = shift;
+\& while (<>) {
+\& print if /$regexp/;
+\& }
+\& ^D
+\&
+\& % chmod +x simple_grep
+\&
+\& % simple_grep abba /usr/dict/words
+\& Babbage
+\& cabbage
+\& cabbages
+\& sabbath
+\& Sabbathize
+\& Sabbathizes
+\& sabbatical
+\& scabbard
+\& scabbards
+.Ve
+.PP
+This program is easy to understand. \f(CW\*(C`#!/usr/bin/perl\*(C'\fR is the standard
+way to invoke a perl program from the shell.
+\&\f(CW\*(C`$regexp\ =\ shift;\*(C'\fR saves the first command line argument as the
+regexp to be used, leaving the rest of the command line arguments to
+be treated as files. \f(CW\*(C`while\ (<>)\*(C'\fR loops over all the lines in
+all the files. For each line, \f(CW\*(C`print\ if\ /$regexp/;\*(C'\fR prints the
+line if the regexp matches the line. In this line, both \f(CW\*(C`print\*(C'\fR and
+\&\f(CW\*(C`/$regexp/\*(C'\fR use the default variable \f(CW$_\fR implicitly.
+.PP
+With all of the regexps above, if the regexp matched anywhere in the
+string, it was considered a match. Sometimes, however, we'd like to
+specify \fIwhere\fR in the string the regexp should try to match. To do
+this, we would use the \fIanchor\fR metacharacters \f(CW\*(Aq^\*(Aq\fR and \f(CW\*(Aq$\*(Aq\fR. The
+anchor \f(CW\*(Aq^\*(Aq\fR means match at the beginning of the string and the anchor
+\&\f(CW\*(Aq$\*(Aq\fR means match at the end of the string, or before a newline at the
+end of the string. Here is how they are used:
+.PP
+.Vb 4
+\& "housekeeper" =~ /keeper/; # matches
+\& "housekeeper" =~ /^keeper/; # doesn\*(Aqt match
+\& "housekeeper" =~ /keeper$/; # matches
+\& "housekeeper\en" =~ /keeper$/; # matches
+.Ve
+.PP
+The second regexp doesn't match because \f(CW\*(Aq^\*(Aq\fR constrains \f(CW\*(C`keeper\*(C'\fR to
+match only at the beginning of the string, but \f(CW"housekeeper"\fR has
+keeper starting in the middle. The third regexp does match, since the
+\&\f(CW\*(Aq$\*(Aq\fR constrains \f(CW\*(C`keeper\*(C'\fR to match only at the end of the string.
+.PP
+When both \f(CW\*(Aq^\*(Aq\fR and \f(CW\*(Aq$\*(Aq\fR are used at the same time, the regexp has to
+match both the beginning and the end of the string, \fIi.e.\fR, the regexp
+matches the whole string. Consider
+.PP
+.Vb 3
+\& "keeper" =~ /^keep$/; # doesn\*(Aqt match
+\& "keeper" =~ /^keeper$/; # matches
+\& "" =~ /^$/; # ^$ matches an empty string
+.Ve
+.PP
+The first regexp doesn't match because the string has more to it than
+\&\f(CW\*(C`keep\*(C'\fR. Since the second regexp is exactly the string, it
+matches. Using both \f(CW\*(Aq^\*(Aq\fR and \f(CW\*(Aq$\*(Aq\fR in a regexp forces the complete
+string to match, so it gives you complete control over which strings
+match and which don't. Suppose you are looking for a fellow named
+bert, off in a string by himself:
+.PP
+.Vb 1
+\& "dogbert" =~ /bert/; # matches, but not what you want
+\&
+\& "dilbert" =~ /^bert/; # doesn\*(Aqt match, but ..
+\& "bertram" =~ /^bert/; # matches, so still not good enough
+\&
+\& "bertram" =~ /^bert$/; # doesn\*(Aqt match, good
+\& "dilbert" =~ /^bert$/; # doesn\*(Aqt match, good
+\& "bert" =~ /^bert$/; # matches, perfect
+.Ve
+.PP
+Of course, in the case of a literal string, one could just as easily
+use the string comparison \f(CW\*(C`$string\ eq\ \*(Aqbert\*(Aq\*(C'\fR and it would be
+more efficient. The \f(CW\*(C`^...$\*(C'\fR regexp really becomes useful when we
+add in the more powerful regexp tools below.
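+.PP
+One caveat is worth a minimal sketch (the loop and variable names here are
+illustrative, not part of the original tutorial): the two tests are not
+quite interchangeable, because \f(CW\*(Aq$\*(Aq\fR also matches before a newline at the
+end of the string, while the string comparison demands an exact match.

```perl
use strict;
use warnings;

# For each candidate, record whether /^bert$/ matches and whether eq agrees.
my @results;
for my $string ("bert", "bert\n", "bertram") {
    my $by_regexp = ($string =~ /^bert$/) ? 1 : 0;
    my $by_eq     = ($string eq 'bert')   ? 1 : 0;
    push @results, "regexp=$by_regexp eq=$by_eq";
}
print "$_\n" for @results;
```

+.PP
+Only the second candidate, which ends in a newline, is treated differently
+by the two tests.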
+.SS "Using character classes"
+.IX Subsection "Using character classes"
+Although one can already do quite a lot with the literal string
+regexps above, we've only scratched the surface of regular expression
+technology. In this and subsequent sections we will introduce regexp
+concepts (and associated metacharacter notations) that will allow a
+regexp to represent not just a single character sequence, but a \fIwhole
+class\fR of them.
+.PP
+One such concept is that of a \fIcharacter class\fR. A character class
+allows a set of possible characters, rather than just a single
+character, to match at a particular point in a regexp. You can define
+your own custom character classes. These
+are denoted by brackets \f(CW\*(C`[...]\*(C'\fR, with the set of characters
+to be possibly matched inside. Here are some examples:
+.PP
+.Vb 4
+\& /cat/; # matches \*(Aqcat\*(Aq
+\& /[bcr]at/; # matches \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, or \*(Aqrat\*(Aq
+\& /item[0123456789]/; # matches \*(Aqitem0\*(Aq or ... or \*(Aqitem9\*(Aq
+\& "abc" =~ /[cab]/; # matches \*(Aqa\*(Aq
+.Ve
+.PP
+In the last statement, even though \f(CW\*(Aqc\*(Aq\fR is the first character in
+the class, \f(CW\*(Aqa\*(Aq\fR matches because the first character position in the
+string is the earliest point at which the regexp can match.
+.PP
+.Vb 2
+\& /[yY][eE][sS]/; # match \*(Aqyes\*(Aq in a case\-insensitive way
+\& # \*(Aqyes\*(Aq, \*(AqYes\*(Aq, \*(AqYES\*(Aq, etc.
+.Ve
+.PP
+This regexp displays a common task: perform a case-insensitive
+match. Perl provides a way of avoiding all those brackets by simply
+appending an \f(CW\*(Aqi\*(Aq\fR to the end of the match. Then \f(CW\*(C`/[yY][eE][sS]/;\*(C'\fR
+can be rewritten as \f(CW\*(C`/yes/i;\*(C'\fR. The \f(CW\*(Aqi\*(Aq\fR stands for
+case-insensitive and is an example of a \fImodifier\fR of the matching
+operation. We will meet other modifiers later in the tutorial.
+.PP
+We saw in the section above that there were ordinary characters, which
+represented themselves, and special characters, which needed a
+backslash \f(CW\*(Aq\e\*(Aq\fR to represent themselves. The same is true in a
+character class, but the sets of ordinary and special characters
+inside a character class are different than those outside a character
+class. The special characters for a character class are \f(CW\*(C`\-]\e^$\*(C'\fR (and
+the pattern delimiter, whatever it is).
+\&\f(CW\*(Aq]\*(Aq\fR is special because it denotes the end of a character class. \f(CW\*(Aq$\*(Aq\fR is
+special because it denotes a scalar variable. \f(CW\*(Aq\e\*(Aq\fR is special because
+it is used in escape sequences, just like above. Here is how the
+special characters \f(CW\*(C`]$\e\*(C'\fR are handled:
+.PP
+.Vb 5
+\& /[\e]c]def/; # matches \*(Aq]def\*(Aq or \*(Aqcdef\*(Aq
+\& $x = \*(Aqbcr\*(Aq;
+\& /[$x]at/; # matches \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, or \*(Aqrat\*(Aq
+\& /[\e$x]at/; # matches \*(Aq$at\*(Aq or \*(Aqxat\*(Aq
+\& /[\e\e$x]at/; # matches \*(Aq\eat\*(Aq, \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, or \*(Aqrat\*(Aq
+.Ve
+.PP
+The last two are a little tricky. In \f(CW\*(C`[\e$x]\*(C'\fR, the backslash protects
+the dollar sign, so the character class has two members \f(CW\*(Aq$\*(Aq\fR and \f(CW\*(Aqx\*(Aq\fR.
+In \f(CW\*(C`[\e\e$x]\*(C'\fR, the backslash is protected, so \f(CW$x\fR is treated as a
+variable and substituted in double quote fashion.
+.PP
+The special character \f(CW\*(Aq\-\*(Aq\fR acts as a range operator within character
+classes, so that a contiguous set of characters can be written as a
+range. With ranges, the unwieldy \f(CW\*(C`[0123456789]\*(C'\fR and \f(CW\*(C`[abc...xyz]\*(C'\fR
+become the svelte \f(CW\*(C`[0\-9]\*(C'\fR and \f(CW\*(C`[a\-z]\*(C'\fR. Some examples are
+.PP
+.Vb 6
+\& /item[0\-9]/; # matches \*(Aqitem0\*(Aq or ... or \*(Aqitem9\*(Aq
+\& /[0\-9bx\-z]aa/; # matches \*(Aq0aa\*(Aq, ..., \*(Aq9aa\*(Aq,
+\& # \*(Aqbaa\*(Aq, \*(Aqxaa\*(Aq, \*(Aqyaa\*(Aq, or \*(Aqzaa\*(Aq
+\& /[0\-9a\-fA\-F]/; # matches a hexadecimal digit
+\& /[0\-9a\-zA\-Z_]/; # matches a "word" character,
+\& # like those in a Perl variable name
+.Ve
+.PP
+If \f(CW\*(Aq\-\*(Aq\fR is the first or last character in a character class, it is
+treated as an ordinary character; \f(CW\*(C`[\-ab]\*(C'\fR, \f(CW\*(C`[ab\-]\*(C'\fR and \f(CW\*(C`[a\e\-b]\*(C'\fR are
+all equivalent.
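+.PP
+A short sketch can confirm this equivalence. The sample characters and
+variable names are arbitrary choices made for illustration; each of the
+three classes is tested against the same four one-character strings.

```perl
use strict;
use warnings;

# Three spellings of the same class: 'a', 'b', or a literal hyphen.
my @classes = (qr/[-ab]/, qr/[ab-]/, qr/[a\-b]/);

# For each class, collect which of the sample characters it matches.
my @results;
for my $class (@classes) {
    push @results, join '', grep { /^$class$/ } ('a', 'b', '-', 'c');
}
print "$_\n" for @results;
```

+.PP
+All three classes accept the same characters and reject the letter c.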
+.PP
+The special character \f(CW\*(Aq^\*(Aq\fR in the first position of a character class
+denotes a \fInegated character class\fR, which matches any character but
+those in the brackets. Both \f(CW\*(C`[...]\*(C'\fR and \f(CW\*(C`[^...]\*(C'\fR must match a
+character, or the match fails. Then
+.PP
+.Vb 4
+\& /[^a]at/; # doesn\*(Aqt match \*(Aqaat\*(Aq or \*(Aqat\*(Aq, but matches
+\& # all other \*(Aqbat\*(Aq, \*(Aqcat\*(Aq, \*(Aq0at\*(Aq, \*(Aq%at\*(Aq, etc.
+\& /[^0\-9]/; # matches a non\-numeric character
+\& /[a^]at/; # matches \*(Aqaat\*(Aq or \*(Aq^at\*(Aq; here \*(Aq^\*(Aq is ordinary
+.Ve
+.PP
+Now, even \f(CW\*(C`[0\-9]\*(C'\fR can be a bother to write multiple times, so in the
+interest of saving keystrokes and making regexps more readable, Perl
+has several abbreviations for common character classes, as shown below.
+Since the introduction of Unicode, unless the \f(CW\*(C`/a\*(C'\fR modifier is in
+effect, these character classes match more than just a few characters in
+the ASCII range.
+.IP \(bu 4
+\&\f(CW\*(C`\ed\*(C'\fR matches a digit, not just \f(CW\*(C`[0\-9]\*(C'\fR but also digits from non-roman scripts
+.IP \(bu 4
+\&\f(CW\*(C`\es\*(C'\fR matches a whitespace character, the set \f(CW\*(C`[\e \et\er\en\ef]\*(C'\fR and others
+.IP \(bu 4
+\&\f(CW\*(C`\ew\*(C'\fR matches a word character (alphanumeric or \f(CW\*(Aq_\*(Aq\fR), not just \f(CW\*(C`[0\-9a\-zA\-Z_]\*(C'\fR
+but also digits and characters from non-roman scripts
+.IP \(bu 4
+\&\f(CW\*(C`\eD\*(C'\fR is a negated \f(CW\*(C`\ed\*(C'\fR; it represents any other character than a digit, or \f(CW\*(C`[^\ed]\*(C'\fR
+.IP \(bu 4
+\&\f(CW\*(C`\eS\*(C'\fR is a negated \f(CW\*(C`\es\*(C'\fR; it represents any non-whitespace character \f(CW\*(C`[^\es]\*(C'\fR
+.IP \(bu 4
+\&\f(CW\*(C`\eW\*(C'\fR is a negated \f(CW\*(C`\ew\*(C'\fR; it represents any non-word character \f(CW\*(C`[^\ew]\*(C'\fR
+.IP \(bu 4
+The period \f(CW\*(Aq.\*(Aq\fR matches any character but \f(CW"\en"\fR (unless the modifier \f(CW\*(C`/s\*(C'\fR is
+in effect, as explained below).
+.IP \(bu 4
+\&\f(CW\*(C`\eN\*(C'\fR, like the period, matches any character but \f(CW"\en"\fR, but it does so
+regardless of whether the modifier \f(CW\*(C`/s\*(C'\fR is in effect.
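+.PP
+The difference between the last two items can be sketched as follows. The
+test string and variable names are assumptions made for illustration only.

```perl
use strict;
use warnings;

my $string = "a\nb";

my $dot_plain = ($string =~ /a.b/)   ? 1 : 0;  # '.' refuses "\n" by default
my $dot_s     = ($string =~ /a.b/s)  ? 1 : 0;  # '.' accepts "\n" under /s
my $bign_s    = ($string =~ /a\Nb/s) ? 1 : 0;  # \N refuses "\n" even under /s

print join(' ', $dot_plain, $dot_s, $bign_s), "\n";
```

+.PP
+Only the period changes its behavior when the modifier is added.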
+.PP
+The \f(CW\*(C`/a\*(C'\fR modifier, available starting in Perl 5.14, is used to
+restrict the matches of \f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\es\*(C'\fR, and \f(CW\*(C`\ew\*(C'\fR to just those in the ASCII range.
+It is useful to keep your program from being needlessly exposed to full
+Unicode (and its accompanying security considerations) when all you want
+is to process English-like text. (The "a" may be doubled, \f(CW\*(C`/aa\*(C'\fR, to
+provide even more restrictions, preventing case-insensitive matching of
+ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign"
+would caselessly match a "k" or "K".)
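+.PP
+As a hedged sketch of these restrictions, the code below tests one non-ASCII
+digit and the Kelvin Sign. The two code points and all variable names are
+chosen purely for illustration, and Perl 5.14 or later is assumed.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";  # ARABIC-INDIC DIGIT THREE, a non-ASCII digit
my $kelvin       = "\x{212A}";  # KELVIN SIGN

my $digit_default = ($arabic_three =~ /^\d$/)  ? 1 : 0;  # Unicode digits allowed
my $digit_ascii   = ($arabic_three =~ /^\d$/a) ? 1 : 0;  # only [0-9] allowed

my $kelvin_i   = ($kelvin =~ /^k$/i)   ? 1 : 0;  # caseless match crosses into ASCII
my $kelvin_iaa = ($kelvin =~ /^k$/iaa) ? 1 : 0;  # /aa forbids that crossing

print join(' ', $digit_default, $digit_ascii, $kelvin_i, $kelvin_iaa), "\n";
```

+.PP
+The modifiers only narrow what matches; ordinary ASCII digits and letters
+behave the same with or without them.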
+.PP
+The \f(CW\*(C`\ed\es\ew\eD\eS\eW\*(C'\fR abbreviations can be used both inside and outside
+of bracketed character classes. Here are some in use:
+.PP
+.Vb 7
+\& /\ed\ed:\ed\ed:\ed\ed/; # matches a hh:mm:ss time format
+\& /[\ed\es]/; # matches any digit or whitespace character
+\& /\ew\eW\ew/; # matches a word char, followed by a
+\& # non\-word char, followed by a word char
+\& /..rt/; # matches any two chars, followed by \*(Aqrt\*(Aq
+\& /end\e./; # matches \*(Aqend.\*(Aq
+\& /end[.]/; # same thing, matches \*(Aqend.\*(Aq
+.Ve
+.PP
+Because a period is a metacharacter, it needs to be escaped to match
+as an ordinary period. Because, for example, \f(CW\*(C`\ed\*(C'\fR and \f(CW\*(C`\ew\*(C'\fR are sets
+of characters, it is incorrect to think of \f(CW\*(C`[^\ed\ew]\*(C'\fR as \f(CW\*(C`[\eD\eW]\*(C'\fR; in
+fact \f(CW\*(C`[^\ed\ew]\*(C'\fR is the same as \f(CW\*(C`[^\ew]\*(C'\fR, which is the same as
+\&\f(CW\*(C`[\eW]\*(C'\fR. Think De Morgan's laws.
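+.PP
+A small sketch of the set arithmetic above; the sample characters and
+variable names are arbitrary and serve only as illustration.

```perl
use strict;
use warnings;

my @chars = ('a', '5', '_', '-', ' ');

# Negated union: characters that are neither digits nor word characters.
my $negated_union = join '', grep { /[^\d\w]/ } @chars;

# Union of negations: characters that are non-digits OR non-word characters.
my $union_of_negations = join '', grep { /[\D\W]/ } @chars;

# Plain non-word class, for comparison with the negated union.
my $non_word = join '', grep { /\W/ } @chars;

print "[$negated_union] [$union_of_negations] [$non_word]\n";
```

+.PP
+The negated union agrees with the plain non-word class, while the union of
+negations also lets through the letter and the underscore.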
+.PP
+In actuality, the period and \f(CW\*(C`\ed\es\ew\eD\eS\eW\*(C'\fR abbreviations are
+themselves types of character classes, so the ones surrounded by
+brackets are just one type of character class. When we need to make a
+distinction, we refer to them as "bracketed character classes."
+.PP
+An anchor useful in basic regexps is the \fIword anchor\fR
+\&\f(CW\*(C`\eb\*(C'\fR. This matches a boundary between a word character and a non-word
+character \f(CW\*(C`\ew\eW\*(C'\fR or \f(CW\*(C`\eW\ew\*(C'\fR:
+.PP
+.Vb 5
+\& $x = "Housecat catenates house and cat";
+\& $x =~ /cat/; # matches cat in \*(AqHousecat\*(Aq
+\& $x =~ /\ebcat/; # matches cat in \*(Aqcatenates\*(Aq
+\& $x =~ /cat\eb/; # matches cat in \*(AqHousecat\*(Aq
+\& $x =~ /\ebcat\eb/; # matches \*(Aqcat\*(Aq at end of string
+.Ve
+.PP
+Note in the last example, the end of the string is considered a word
+boundary.
+.PP
+For natural language processing (so that, for example, apostrophes are
+included in words), use instead \f(CW\*(C`\eb{wb}\*(C'\fR:
+.PP
+.Vb 1
+\& "don\*(Aqt" =~ / .+? \eb{wb} /x; # matches the whole string
+.Ve
+.PP
+You might wonder why \f(CW\*(Aq.\*(Aq\fR matches everything but \f(CW"\en"\fR \- why not
+every character? The reason is that often one is matching against
+lines and would like to ignore the newline characters. For instance,
+while the string \f(CW"\en"\fR represents one line, we would like to think
+of it as empty. Then
+.PP
+.Vb 2
+\& "" =~ /^$/; # matches
+\& "\en" =~ /^$/; # matches, $ anchors before "\en"
+\&
+\& "" =~ /./; # doesn\*(Aqt match; it needs a char
+\& "" =~ /^.$/; # doesn\*(Aqt match; it needs a char
+\& "\en" =~ /^.$/; # doesn\*(Aqt match; it needs a char other than "\en"
+\& "a" =~ /^.$/; # matches
+\& "a\en" =~ /^.$/; # matches, $ anchors before "\en"
+.Ve
+.PP
+This behavior is convenient, because we usually want to ignore
+newlines when we count and match characters in a line. Sometimes,
+however, we want to keep track of newlines. We might even want \f(CW\*(Aq^\*(Aq\fR
+and \f(CW\*(Aq$\*(Aq\fR to anchor at the beginning and end of lines within the
+string, rather than just the beginning and end of the string. Perl
+allows us to choose between ignoring and paying attention to newlines
+by using the \f(CW\*(C`/s\*(C'\fR and \f(CW\*(C`/m\*(C'\fR modifiers. \f(CW\*(C`/s\*(C'\fR and \f(CW\*(C`/m\*(C'\fR stand for
+single line and multi-line and they determine whether a string is to
+be treated as one continuous string, or as a set of lines. The two
+modifiers affect two aspects of how the regexp is interpreted: 1) how
+the \f(CW\*(Aq.\*(Aq\fR character class is defined, and 2) where the anchors \f(CW\*(Aq^\*(Aq\fR
+and \f(CW\*(Aq$\*(Aq\fR are able to match. Here are the four possible combinations:
+.IP \(bu 4
+no modifiers: Default behavior. \f(CW\*(Aq.\*(Aq\fR matches any character
+except \f(CW"\en"\fR. \f(CW\*(Aq^\*(Aq\fR matches only at the beginning of the string and
+\&\f(CW\*(Aq$\*(Aq\fR matches only at the end or before a newline at the end.
+.IP \(bu 4
+s modifier (\f(CW\*(C`/s\*(C'\fR): Treat string as a single long line. \f(CW\*(Aq.\*(Aq\fR matches
+any character, even \f(CW"\en"\fR. \f(CW\*(Aq^\*(Aq\fR matches only at the beginning of
+the string and \f(CW\*(Aq$\*(Aq\fR matches only at the end or before a newline at the
+end.
+.IP \(bu 4
+m modifier (\f(CW\*(C`/m\*(C'\fR): Treat string as a set of multiple lines. \f(CW\*(Aq.\*(Aq\fR
+matches any character except \f(CW"\en"\fR. \f(CW\*(Aq^\*(Aq\fR and \f(CW\*(Aq$\*(Aq\fR are able to match
+at the start or end of \fIany\fR line within the string.
+.IP \(bu 4
+both s and m modifiers (\f(CW\*(C`/sm\*(C'\fR): Treat string as a single long line, but
+detect multiple lines. \f(CW\*(Aq.\*(Aq\fR matches any character, even
+\&\f(CW"\en"\fR. \f(CW\*(Aq^\*(Aq\fR and \f(CW\*(Aq$\*(Aq\fR, however, are able to match at the start or end
+of \fIany\fR line within the string.
+.PP
+Here are examples of \f(CW\*(C`/s\*(C'\fR and \f(CW\*(C`/m\*(C'\fR in action:
+.PP
+.Vb 1
+\& $x = "There once was a girl\enWho programmed in Perl\en";
+\&
+\& $x =~ /^Who/; # doesn\*(Aqt match, "Who" not at start of string
+\& $x =~ /^Who/s; # doesn\*(Aqt match, "Who" not at start of string
+\& $x =~ /^Who/m; # matches, "Who" at start of second line
+\& $x =~ /^Who/sm; # matches, "Who" at start of second line
+\&
+\& $x =~ /girl.Who/; # doesn\*(Aqt match, "." doesn\*(Aqt match "\en"
+\& $x =~ /girl.Who/s; # matches, "." matches "\en"
+\& $x =~ /girl.Who/m; # doesn\*(Aqt match, "." doesn\*(Aqt match "\en"
+\& $x =~ /girl.Who/sm; # matches, "." matches "\en"
+.Ve
+.PP
+Most of the time, the default behavior is what is wanted, but \f(CW\*(C`/s\*(C'\fR and
+\&\f(CW\*(C`/m\*(C'\fR are occasionally very useful. If \f(CW\*(C`/m\*(C'\fR is being used, the start
+of the string can still be matched with \f(CW\*(C`\eA\*(C'\fR and the end of the string
+can still be matched with the anchors \f(CW\*(C`\eZ\*(C'\fR (matches at the end, or
+before a newline at the end, like \f(CW\*(Aq$\*(Aq\fR without \f(CW\*(C`/m\*(C'\fR) and \f(CW\*(C`\ez\*(C'\fR (matches only at the very end):
+.PP
+.Vb 2
+\& $x =~ /^Who/m; # matches, "Who" at start of second line
+\& $x =~ /\eAWho/m; # doesn\*(Aqt match, "Who" is not at start of string
+\&
+\& $x =~ /girl$/m; # matches, "girl" at end of first line
+\& $x =~ /girl\eZ/m; # doesn\*(Aqt match, "girl" is not at end of string
+\&
+\& $x =~ /Perl\eZ/m; # matches, "Perl" is at newline before end
+\& $x =~ /Perl\ez/m; # doesn\*(Aqt match, "Perl" is not at end of string
+.Ve
+.PP
+We now know how to create choices among classes of characters in a
+regexp. What about choices among words or character strings? Such
+choices are described in the next section.
+.SS "Matching this or that"
+.IX Subsection "Matching this or that"
+Sometimes we would like our regexp to be able to match different
+possible words or character strings. This is accomplished by using
+the \fIalternation\fR metacharacter \f(CW\*(Aq|\*(Aq\fR. To match \f(CW\*(C`dog\*(C'\fR or \f(CW\*(C`cat\*(C'\fR, we
+form the regexp \f(CW\*(C`dog|cat\*(C'\fR. As before, Perl will try to match the
+regexp at the earliest possible point in the string. At each
+character position, Perl will first try to match the first
+alternative, \f(CW\*(C`dog\*(C'\fR. If \f(CW\*(C`dog\*(C'\fR doesn't match, Perl will then try the
+next alternative, \f(CW\*(C`cat\*(C'\fR. If \f(CW\*(C`cat\*(C'\fR doesn't match either, then the
+match fails and Perl moves to the next position in the string. Some
+examples:
+.PP
+.Vb 2
+\& "cats and dogs" =~ /cat|dog|bird/; # matches "cat"
+\& "cats and dogs" =~ /dog|cat|bird/; # matches "cat"
+.Ve
+.PP
+Even though \f(CW\*(C`dog\*(C'\fR is the first alternative in the second regexp,
+\&\f(CW\*(C`cat\*(C'\fR is able to match earlier in the string.
+.PP
+.Vb 2
+\& "cats" =~ /c|ca|cat|cats/; # matches "c"
+\& "cats" =~ /cats|cat|ca|c/; # matches "cats"
+.Ve
+.PP
+Here, all the alternatives match at the first string position, so the
+first alternative is the one that matches. If some of the
+alternatives are truncations of the others, put the longest ones first
+to give them a chance to match.
+.PP
+.Vb 2
+\& "cab" =~ /a|b|c/ # matches "c"
+\& # /a|b|c/ == /[abc]/
+.Ve
+.PP
+The last example points out that character classes are like
+alternations of characters. At a given character position, the first
+alternative that allows the regexp match to succeed will be the one
+that matches.
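+.PP
+As a minimal sketch of these ordering rules (the variable \f(CW$word\fR and the
+printed labels are only illustrative), the special variable \f(CW$&\fR, which
+holds the text of the last successful match, shows which alternative won:
+.PP
+.Vb 4
+\& my $word = "cats";
+\& print "short first: $&\en" if $word =~ /c|ca|cat|cats/;    # $& is "c"
+\& print "long first: $&\en" if $word =~ /cats|cat|ca|c/;     # $& is "cats"
+\& print "earliest: $&\en" if "cats and dogs" =~ /dog|cat/;   # $& is "cat"
+.Ve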
+.SS "Grouping things and hierarchical matching"
+.IX Subsection "Grouping things and hierarchical matching"
+Alternation allows a regexp to choose among alternatives, but by
+itself it is unsatisfying. The reason is that each alternative is a whole
+regexp, but sometimes we want alternatives for just part of a
+regexp. For instance, suppose we want to search for housecats or
+housekeepers. The regexp \f(CW\*(C`housecat|housekeeper\*(C'\fR fits the bill, but is
+inefficient because we had to type \f(CW\*(C`house\*(C'\fR twice. It would be nice to
+have parts of the regexp be constant, like \f(CW\*(C`house\*(C'\fR, and some
+parts have alternatives, like \f(CW\*(C`cat|keeper\*(C'\fR.
+.PP
+The \fIgrouping\fR metacharacters \f(CW\*(C`()\*(C'\fR solve this problem. Grouping
+allows parts of a regexp to be treated as a single unit. Parts of a
+regexp are grouped by enclosing them in parentheses. Thus we could solve
+the \f(CW\*(C`housecat|housekeeper\*(C'\fR problem by forming the regexp as
+\&\f(CWhouse(cat|keeper)\fR. The regexp \f(CWhouse(cat|keeper)\fR means match
+\&\f(CW\*(C`house\*(C'\fR followed by either \f(CW\*(C`cat\*(C'\fR or \f(CW\*(C`keeper\*(C'\fR. Some more examples
+are
+.PP
+.Vb 4
+\& /(a|b)b/; # matches \*(Aqab\*(Aq or \*(Aqbb\*(Aq
+\& /(ac|b)b/; # matches \*(Aqacb\*(Aq or \*(Aqbb\*(Aq
+\& /(^a|b)c/; # matches \*(Aqac\*(Aq at start of string or \*(Aqbc\*(Aq anywhere
+\& /(a|[bc])d/; # matches \*(Aqad\*(Aq, \*(Aqbd\*(Aq, or \*(Aqcd\*(Aq
+\&
+\& /house(cat|)/; # matches either \*(Aqhousecat\*(Aq or \*(Aqhouse\*(Aq
+\& /house(cat(s|)|)/; # matches either \*(Aqhousecats\*(Aq or \*(Aqhousecat\*(Aq or
+\& # \*(Aqhouse\*(Aq. Note groups can be nested.
+\&
+\& /(19|20|)\ed\ed/; # match years 19xx, 20xx, or the Y2K problem, xx
+\& "20" =~ /(19|20|)\ed\ed/; # matches the null alternative \*(Aq()\ed\ed\*(Aq,
+\& # because \*(Aq20\ed\ed\*(Aq can\*(Aqt match
+.Ve
+.PP
+Alternations behave the same way in groups as out of them: at a given
+string position, the leftmost alternative that allows the regexp to
+match is taken. So in the last example at the first string position,
+\&\f(CW"20"\fR matches the second alternative, but there is nothing left over
+to match the next two digits \f(CW\*(C`\ed\ed\*(C'\fR. So Perl moves on to the next
+alternative, which is the null alternative and that works, since
+\&\f(CW"20"\fR is two digits.
+.PP
+The process of trying one alternative, seeing whether it matches, and,
+if it doesn't, going back in the string to the point where that
+alternative was tried and moving on to the next alternative is called
+\&\fIbacktracking\fR. The term "backtracking" comes from the idea that
+matching a regexp is like a walk in the woods. Successfully matching
+a regexp is like arriving at a destination. There are many possible
+trailheads, one for each string position, and each one is tried in
+order, left to right. From each trailhead there may be many paths,
+some of which get you there, and some of which are dead ends. When you
+walk along a trail and hit a dead end, you have to backtrack along the
+trail to an earlier point to try another trail. If you hit your
+destination, you stop immediately and forget about trying all the
+other trails. You are persistent, and only if you have tried all the
+trails from all the trailheads and not arrived at your destination, do
+you declare failure. To be concrete, here is a step-by-step analysis
+of what Perl does when it tries to match the regexp
+.PP
+.Vb 1
+\& "abcde" =~ /(abd|abc)(df|d|de)/;
+.Ve
+.IP 1. 4
+Start with the first letter in the string \f(CW\*(Aqa\*(Aq\fR.
+.IP 2. 4
+Try the first alternative in the first group \f(CW\*(Aqabd\*(Aq\fR.
+.IP 3. 4
+Match \f(CW\*(Aqa\*(Aq\fR followed by \f(CW\*(Aqb\*(Aq\fR. So far so good.
+.IP 4. 4
+\&\f(CW\*(Aqd\*(Aq\fR in the regexp doesn't match \f(CW\*(Aqc\*(Aq\fR in the string \- a
+dead end. So backtrack two characters and pick the second alternative
+in the first group \f(CW\*(Aqabc\*(Aq\fR.
+.IP 5. 4
+Match \f(CW\*(Aqa\*(Aq\fR followed by \f(CW\*(Aqb\*(Aq\fR followed by \f(CW\*(Aqc\*(Aq\fR. We are on a roll
+and have satisfied the first group. Set \f(CW$1\fR to \f(CW\*(Aqabc\*(Aq\fR.
+.IP 6. 4
+Move on to the second group and pick the first alternative \f(CW\*(Aqdf\*(Aq\fR.
+.IP 7. 4
+Match the \f(CW\*(Aqd\*(Aq\fR.
+.IP 8. 4
+\&\f(CW\*(Aqf\*(Aq\fR in the regexp doesn't match \f(CW\*(Aqe\*(Aq\fR in the string, so a dead
+end. Backtrack one character and pick the second alternative in the
+second group \f(CW\*(Aqd\*(Aq\fR.
+.IP 9. 4
+\&\f(CW\*(Aqd\*(Aq\fR matches. The second grouping is satisfied, so set
+\&\f(CW$2\fR to \f(CW\*(Aqd\*(Aq\fR.
+.IP 10. 4
+We are at the end of the regexp, so we are done! We have
+matched \f(CW\*(Aqabcd\*(Aq\fR out of the string \f(CW"abcde"\fR.
+.PP
+There are a couple of things to note about this analysis. First, the
+third alternative in the second group \f(CW\*(Aqde\*(Aq\fR also allows a match, but we
+stopped before we got to it \- at a given character position, leftmost
+wins. Second, we were able to get a match at the first character
+position of the string \f(CW\*(Aqa\*(Aq\fR. If there were no matches at the first
+position, Perl would move to the second character position \f(CW\*(Aqb\*(Aq\fR and
+attempt the match all over again. Only when all possible paths at all
+possible character positions have been exhausted does Perl give
+up and declare \f(CW\*(C`$string\ =~\ /(abd|abc)(df|d|de)/;\*(C'\fR to be false.
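+.PP
+The outcome of the step-by-step analysis above can be checked directly.
+This is only a sketch, and the output format is arbitrary:
+.PP
+.Vb 3
+\& if ("abcde" =~ /(abd|abc)(df|d|de)/) {
+\&     print "group 1: $1, group 2: $2\en";   # group 1: abc, group 2: d
+\& }
+.Ve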
+.PP
+Even with all this work, regexp matching happens remarkably fast. To
+speed things up, Perl compiles the regexp into a compact sequence of
+opcodes that can often fit inside a processor cache. When the code is
+executed, these opcodes can then run at full throttle and search very
+quickly.
+.SS "Extracting matches"
+.IX Subsection "Extracting matches"
+The grouping metacharacters \f(CW\*(C`()\*(C'\fR also serve another completely
+different function: they allow the extraction of the parts of a string
+that matched. This is very useful to find out what matched and for
+text processing in general. For each grouping, the part that matched
+inside goes into the special variables \f(CW$1\fR, \f(CW$2\fR, \fIetc\fR. They can be
+used just as ordinary variables:
+.PP
+.Vb 6
+\& # extract hours, minutes, seconds
+\& if ($time =~ /(\ed\ed):(\ed\ed):(\ed\ed)/) { # match hh:mm:ss format
+\&     $hours = $1;
+\&     $minutes = $2;
+\&     $seconds = $3;
+\& }
+.Ve
+.PP
+Now, we know that in scalar context,
+\&\f(CW\*(C`$time\ =~\ /(\ed\ed):(\ed\ed):(\ed\ed)/\*(C'\fR returns a true or false
+value. In list context, however, it returns the list of matched values
+\&\f(CW\*(C`($1,$2,$3)\*(C'\fR. So we could write the code more compactly as
+.PP
+.Vb 2
+\& # extract hours, minutes, seconds
+\& ($hours, $minutes, $seconds) = ($time =~ /(\ed\ed):(\ed\ed):(\ed\ed)/);
+.Ve
+.PP
+If the groupings in a regexp are nested, \f(CW$1\fR gets the group with the
+leftmost opening parenthesis, \f(CW$2\fR the next opening parenthesis,
+\&\fIetc\fR. Here is a regexp with nested groups:
+.PP
+.Vb 2
+\& /(ab(cd|ef)((gi)|j))/;
+\&  1  2      34
+.Ve
+.PP
+If this regexp matches, \f(CW$1\fR contains a string starting with
+\&\f(CW\*(Aqab\*(Aq\fR, \f(CW$2\fR is either set to \f(CW\*(Aqcd\*(Aq\fR or \f(CW\*(Aqef\*(Aq\fR, \f(CW$3\fR equals either
+\&\f(CW\*(Aqgi\*(Aq\fR or \f(CW\*(Aqj\*(Aq\fR, and \f(CW$4\fR is either set to \f(CW\*(Aqgi\*(Aq\fR, just like \f(CW$3\fR,
+or it remains undefined.
+.PP
+For convenience, Perl sets \f(CW$+\fR to the string held by the highest numbered
+\&\f(CW$1\fR, \f(CW$2\fR,... that got assigned (and, somewhat related, \f(CW$^N\fR to the
+value of the \f(CW$1\fR, \f(CW$2\fR,... most-recently assigned; \fIi.e.\fR the \f(CW$1\fR,
+\&\f(CW$2\fR,... associated with the rightmost closing parenthesis used in the
+match).
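+.PP
+As a small illustration (the input string \f(CW"abefj"\fR is chosen here only
+for demonstration), the nested regexp above can be applied to a string in
+which the fourth group does not take part in the match:
+.PP
+.Vb 5
+\& if ("abefj" =~ /(ab(cd|ef)((gi)|j))/) {
+\&     print "1=$1 2=$2 3=$3\en";                     # 1=abefj 2=ef 3=j
+\&     print "4 is undefined\en" unless defined $4;
+\&     print "highest: ", $+, ", last closed: ", $^N, "\en";
+\& }
+.Ve
+.PP
+Here \f(CW$+\fR should hold \f(CW\*(Aqj\*(Aq\fR, the value of \f(CW$3\fR, while \f(CW$^N\fR should hold
+\&\f(CW\*(Aqabefj\*(Aq\fR, because the outermost group is the last one to be closed.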
+.SS Backreferences
+.IX Subsection "Backreferences"
+Closely associated with the matching variables \f(CW$1\fR, \f(CW$2\fR, ... are
+the \fIbackreferences\fR \f(CW\*(C`\eg1\*(C'\fR, \f(CW\*(C`\eg2\*(C'\fR,... Backreferences are simply
+matching variables that can be used \fIinside\fR a regexp. This is a
+really nice feature; what matches later in a regexp is made to depend on
+what matched earlier in the regexp. Suppose we wanted to look
+for doubled words in a text, like "the the". The following regexp finds
+all 3\-letter doubles with a space in between:
+.PP
+.Vb 1
+\& /\eb(\ew\ew\ew)\es\eg1\eb/;
+.Ve
+.PP
+The grouping assigns a value to \f(CW\*(C`\eg1\*(C'\fR, so that the same 3\-letter sequence
+is used for both parts.
+.PP
+A similar task is to find words consisting of two identical parts:
+.PP
+.Vb 7
+\& % simple_grep \*(Aq^(\ew\ew\ew\ew|\ew\ew\ew|\ew\ew|\ew)\eg1$\*(Aq /usr/dict/words
+\& beriberi
+\& booboo
+\& coco
+\& mama
+\& murmur
+\& papa
+.Ve
+.PP
+The regexp has a single grouping which considers 4\-letter
+combinations, then 3\-letter combinations, \fIetc\fR., and uses \f(CW\*(C`\eg1\*(C'\fR to look for
+a repeat. Although \f(CW$1\fR and \f(CW\*(C`\eg1\*(C'\fR represent the same thing, care should be
+taken to use matching variables \f(CW$1\fR, \f(CW$2\fR,... only \fIoutside\fR a regexp
+and backreferences \f(CW\*(C`\eg1\*(C'\fR, \f(CW\*(C`\eg2\*(C'\fR,... only \fIinside\fR a regexp; not doing
+so may lead to surprising and unsatisfactory results.
+.SS "Relative backreferences"
+.IX Subsection "Relative backreferences"
+Counting the opening parentheses to get the correct number for a
+backreference is error-prone as soon as there is more than one
+capturing group. A more convenient technique became available
+with Perl 5.10: relative backreferences. To refer to the immediately
+preceding capture group one may now write \f(CW\*(C`\eg\-1\*(C'\fR or \f(CW\*(C`\eg{\-1}\*(C'\fR; the one before
+that is available via \f(CW\*(C`\eg\-2\*(C'\fR or \f(CW\*(C`\eg{\-2}\*(C'\fR, and so on.
+.PP
+Another good reason in addition to readability and maintainability
+for using relative backreferences is illustrated by the following example,
+where a simple pattern for matching peculiar strings is used:
+.PP
+.Vb 1
+\& $a99a = \*(Aq([a\-z])(\ed)\eg2\eg1\*(Aq; # matches a11a, g22g, x33x, etc.
+.Ve
+.PP
+Now that we have this pattern stored as a handy string, we might feel
+tempted to use it as a part of some other pattern:
+.PP
+.Vb 6
+\& $line = "code=e99e";
+\& if ($line =~ /^(\ew+)=$a99a$/){ # unexpected behavior!
+\&     print "$1 is valid\en";
+\& } else {
+\&     print "bad line: \*(Aq$line\*(Aq\en";
+\& }
+.Ve
+.PP
+But this doesn't match, at least not the way one might expect. Only
+after inserting the interpolated \f(CW$a99a\fR and looking at the resulting
+full text of the regexp is it obvious that the backreferences have
+backfired. The subexpression \f(CW\*(C`(\ew+)\*(C'\fR has snatched number 1 and
+demoted the groups in \f(CW$a99a\fR by one rank. This can be avoided by
+using relative backreferences:
+.PP
+.Vb 1
+\& $a99a = \*(Aq([a\-z])(\ed)\eg{\-1}\eg{\-2}\*(Aq; # safe for being interpolated
+.Ve
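+.PP
+A minimal sketch of the repaired pattern in use follows; the second test
+string, \f(CW"code=e98e"\fR, is an assumption added here to show a failing case:
+.PP
+.Vb 8
+\& my $a99a = \*(Aq([a\-z])(\ed)\eg{\-1}\eg{\-2}\*(Aq;
+\& for my $line ("code=e99e", "code=e98e") {
+\&     if ($line =~ /^(\ew+)=$a99a$/) {
+\&         print "$1 is valid\en";               # reached for "code=e99e"
+\&     } else {
+\&         print "bad line: \*(Aq$line\*(Aq\en";      # reached for "code=e98e"
+\&     }
+\& }
+.Ve
+.PP
+Because \f(CW\*(C`\eg{\-1}\*(C'\fR and \f(CW\*(C`\eg{\-2}\*(C'\fR count backwards from where they appear,
+the extra group \f(CW\*(C`(\ew+)\*(C'\fR in front no longer disturbs them.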
+.SS "Named backreferences"
+.IX Subsection "Named backreferences"
+Perl 5.10 also introduced named capture groups and named backreferences.
+To attach a name to a capturing group, you write either
+\&\f(CW\*(C`(?<name>...)\*(C'\fR or \f(CW\*(C`(?\*(Aqname\*(Aq...)\*(C'\fR. The backreference may
+then be written as \f(CW\*(C`\eg{name}\*(C'\fR. It is permissible to attach the
+same name to more than one group, but then only the leftmost one of the
+eponymous set can be referenced. Outside of the pattern a named
+capture group is accessible through the \f(CW\*(C`%+\*(C'\fR hash.
+.PP
+Assuming that we have to match calendar dates which may be given in one
+of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write
+three suitable patterns where we use \f(CW\*(Aqd\*(Aq\fR, \f(CW\*(Aqm\*(Aq\fR and \f(CW\*(Aqy\*(Aq\fR respectively as the
+names of the groups capturing the pertaining components of a date. The
+matching operation combines the three patterns as alternatives:
+.PP
+.Vb 8
+\& $fmt1 = \*(Aq(?<y>\ed\ed\ed\ed)\-(?<m>\ed\ed)\-(?<d>\ed\ed)\*(Aq;
+\& $fmt2 = \*(Aq(?<m>\ed\ed)/(?<d>\ed\ed)/(?<y>\ed\ed\ed\ed)\*(Aq;
+\& $fmt3 = \*(Aq(?<d>\ed\ed)\e.(?<m>\ed\ed)\e.(?<y>\ed\ed\ed\ed)\*(Aq;
+\& for my $d (qw(2006\-10\-21 15.01.2007 10/31/2005)) {
+\&     if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
+\&         print "day=$+{d} month=$+{m} year=$+{y}\en";
+\&     }
+\& }
+.Ve
+.PP
+If any of the alternatives matches, the hash \f(CW\*(C`%+\*(C'\fR is bound to contain the
+three key-value pairs.
+.SS "Alternative capture group numbering"
+.IX Subsection "Alternative capture group numbering"
+Yet another capturing group numbering technique (also available as of Perl 5.10)
+deals with the problem of referring to groups within a set of alternatives.
+Consider a pattern for matching a time of the day, civil or military style:
+.PP
+.Vb 3
+\& if ( $time =~ /(\ed\ed|\ed):(\ed\ed)|(\ed\ed)(\ed\ed)/ ){
+\&     # process hour and minute
+\& }
+.Ve
+.PP
+Processing the results requires an additional if statement to determine
+whether \f(CW$1\fR and \f(CW$2\fR or \f(CW$3\fR and \f(CW$4\fR contain the goodies. It would
+be easier if we could use group numbers 1 and 2 in the second alternative as
+well, and this is exactly what the parenthesized construct \f(CW\*(C`(?|...)\*(C'\fR,
+placed around a set of alternatives, achieves. Here is an extended version of the
+previous pattern:
+.PP
+.Vb 3
+\& if($time =~ /(?|(\ed\ed|\ed):(\ed\ed)|(\ed\ed)(\ed\ed))\es+([A\-Z][A\-Z][A\-Z])/){
+\&     print "hour=$1 minute=$2 zone=$3\en";
+\& }
+.Ve
+.PP
+Within the alternative numbering group, group numbers start at the same
+position for each alternative. After the group, numbering continues
+with one higher than the maximum reached across all the alternatives.
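+.PP
+To see the shared numbering at work, the extended pattern can be tried on
+one string in each style. The two sample strings below are assumptions made
+up for this sketch:
+.PP
+.Vb 5
+\& for my $time ("9:30 UTC", "1745 CET") {
+\&     if ($time =~ /(?|(\ed\ed|\ed):(\ed\ed)|(\ed\ed)(\ed\ed))\es+([A\-Z][A\-Z][A\-Z])/) {
+\&         print "hour=$1 minute=$2 zone=$3\en";
+\&     }
+\& }
+.Ve
+.PP
+This should print \f(CW\*(C`hour=9 minute=30 zone=UTC\*(C'\fR and then
+\&\f(CW\*(C`hour=17 minute=45 zone=CET\*(C'\fR: whichever alternative matched, the hour
+and minute are in \f(CW$1\fR and \f(CW$2\fR, and the zone is always \f(CW$3\fR.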
+.SS "Position information"
+.IX Subsection "Position information"
+In addition to what was matched, Perl also provides the
+positions of what was matched as contents of the \f(CW\*(C`@\-\*(C'\fR and \f(CW\*(C`@+\*(C'\fR
+arrays. \f(CW\*(C`$\-[0]\*(C'\fR is the position of the start of the entire match and
+\&\f(CW$+[0]\fR is the position of the end. Similarly, \f(CW\*(C`$\-[n]\*(C'\fR is the
+position of the start of the \f(CW$n\fR match and \f(CW$+[n]\fR is the position
+of the end. If \f(CW$n\fR is undefined, so are \f(CW\*(C`$\-[n]\*(C'\fR and \f(CW$+[n]\fR. Then
+this code
+.PP
+.Vb 6
+\& $x = "Mmm...donut, thought Homer";
+\& $x =~ /^(Mmm|Yech)\e.\e.\e.(donut|peas)/; # matches
+\& foreach $exp (1..$#\-) {
+\&     no strict \*(Aqrefs\*(Aq;
+\&     print "Match $exp: \*(Aq$$exp\*(Aq at position ($\-[$exp],$+[$exp])\en";
+\& }
+.Ve
+.PP
+prints
+.PP
+.Vb 2
+\& Match 1: \*(AqMmm\*(Aq at position (0,3)
+\& Match 2: \*(Aqdonut\*(Aq at position (6,11)
+.Ve
+.PP
+Even if there are no groupings in a regexp, it is still possible to
+find out what exactly matched in a string. If you use them, Perl
+will set \f(CW\*(C`$\`\*(C'\fR to the part of the string before the match, will set \f(CW$&\fR
+to the part of the string that matched, and will set \f(CW\*(C`$\*(Aq\*(C'\fR to the part
+of the string after the match. An example:
+.PP
+.Vb 3
+\& $x = "the cat caught the mouse";
+\& $x =~ /cat/; # $\` = \*(Aqthe \*(Aq, $& = \*(Aqcat\*(Aq, $\*(Aq = \*(Aq caught the mouse\*(Aq
+\& $x =~ /the/; # $\` = \*(Aq\*(Aq, $& = \*(Aqthe\*(Aq, $\*(Aq = \*(Aq cat caught the mouse\*(Aq
+.Ve
+.PP
+In the second match, \f(CW\*(C`$\`\*(C'\fR equals \f(CW\*(Aq\*(Aq\fR because the regexp matched at the
+first character position in the string and stopped; it never saw the
+second "the".
+.PP
+If your code is to run on Perl versions earlier than
+5.20, it is worthwhile to note that using \f(CW\*(C`$\`\*(C'\fR and \f(CW\*(C`$\*(Aq\*(C'\fR
+slows down regexp matching quite a bit, while \f(CW$&\fR slows it down to a
+lesser extent, because if they are used in one regexp in a program,
+they are generated for \fIall\fR regexps in the program. So if raw
+performance is a goal of your application, they should be avoided.
+If you need to extract the corresponding substrings, use \f(CW\*(C`@\-\*(C'\fR and
+\&\f(CW\*(C`@+\*(C'\fR instead:
+.PP
+.Vb 3
+\& $\` is the same as substr( $x, 0, $\-[0] )
+\& $& is the same as substr( $x, $\-[0], $+[0]\-$\-[0] )
+\& $\*(Aq is the same as substr( $x, $+[0] )
+.Ve
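+.PP
+As a sketch, the equivalences above can be applied to the earlier example
+string; the variable names \f(CW$pre\fR, \f(CW$match\fR and \f(CW$post\fR are
+illustrative only:
+.PP
+.Vb 7
+\& my $x = "the cat caught the mouse";
+\& if ($x =~ /cat/) {
+\&     my $pre   = substr($x, 0, $\-[0]);               # \*(Aqthe \*(Aq
+\&     my $match = substr($x, $\-[0], $+[0] \- $\-[0]);   # \*(Aqcat\*(Aq
+\&     my $post  = substr($x, $+[0]);                  # \*(Aq caught the mouse\*(Aq
+\&     print "[$pre] [$match] [$post]\en";
+\& }
+.Ve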
+.PP
+As of Perl 5.10, the \f(CW\*(C`${^PREMATCH}\*(C'\fR, \f(CW\*(C`${^MATCH}\*(C'\fR and \f(CW\*(C`${^POSTMATCH}\*(C'\fR
+variables may be used. These are only set if the \f(CW\*(C`/p\*(C'\fR modifier is
+present. Consequently they do not penalize the rest of the program. In
+Perl 5.20, \f(CW\*(C`${^PREMATCH}\*(C'\fR, \f(CW\*(C`${^MATCH}\*(C'\fR and \f(CW\*(C`${^POSTMATCH}\*(C'\fR are available
+whether the \f(CW\*(C`/p\*(C'\fR has been used or not (the modifier is ignored), and
+\&\f(CW\*(C`$\`\*(C'\fR, \f(CW\*(C`$\*(Aq\*(C'\fR and \f(CW$&\fR do not cause any speed difference.
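+.PP
+A short sketch of the \f(CW\*(C`/p\*(C'\fR form, reusing the example string from
+above (the bracketed output format is arbitrary):
+.PP
+.Vb 5
+\& my $x = "the cat caught the mouse";
+\& if ($x =~ /cat/p) {
+\&     print "[${^PREMATCH}] [${^MATCH}] [${^POSTMATCH}]\en";
+\&     # prints "[the ] [cat] [ caught the mouse]"
+\& }
+.Ve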
+.SS "Non-capturing groupings"
+.IX Subsection "Non-capturing groupings"
+A group that is required to bundle a set of alternatives may or may not be
+useful as a capturing group. If it isn't, it just creates a superfluous
+addition to the set of available capture group values, inside as well as
+outside the regexp. Non-capturing groupings, denoted by \f(CW\*(C`(?:regexp)\*(C'\fR,
+still allow the regexp to be treated as a single unit, but don't establish
+a capturing group at the same time. Both capturing and non-capturing
+groupings are allowed to co-exist in the same regexp. Because there is
+no extraction, non-capturing groupings are faster than capturing
+groupings. Non-capturing groupings are also handy for choosing exactly
+which parts of a regexp are to be extracted to matching variables:
+.PP
+.Vb 2
+\& # match a number, $1\-$4 are set, but we only want $1
+\& /([+\-]?\e *(\ed+(\e.\ed*)?|\e.\ed+)([eE][+\-]?\ed+)?)/;
+\&
+\& # match a number faster, only $1 is set
+\& /([+\-]?\e *(?:\ed+(?:\e.\ed*)?|\e.\ed+)(?:[eE][+\-]?\ed+)?)/;
+\&
+\& # match a number, get $1 = whole number, $2 = exponent
+\& /([+\-]?\e *(?:\ed+(?:\e.\ed*)?|\e.\ed+)(?:[eE]([+\-]?\ed+))?)/;
+.Ve
+.PP
+Non-capturing groupings are also useful for removing nuisance
+elements gathered from a split operation where parentheses are
+required for some reason:
+.PP
+.Vb 3
+\& $x = \*(Aq12aba34ba5\*(Aq;
+\& @num = split /(a|b)+/, $x; # @num = (\*(Aq12\*(Aq,\*(Aqa\*(Aq,\*(Aq34\*(Aq,\*(Aqa\*(Aq,\*(Aq5\*(Aq)
+\& @num = split /(?:a|b)+/, $x; # @num = (\*(Aq12\*(Aq,\*(Aq34\*(Aq,\*(Aq5\*(Aq)
+.Ve
+.PP
+In Perl 5.22 and later, all groups within a regexp can be set to
+non-capturing by using the new \f(CW\*(C`/n\*(C'\fR flag:
+.PP
+.Vb 1
+\& "hello" =~ /(hi|hello)/n; # $1 is not set!
+.Ve
+.PP
+See "n" in perlre for more information.
+.SS "Matching repetitions"
+.IX Subsection "Matching repetitions"
+The examples in the previous section display an annoying weakness. We
+were only matching 3\-letter words, or chunks of words of 4 letters or
+fewer. We'd like to be able to match words or, more generally, strings
+of any length, without writing out tedious alternatives like
+\&\f(CW\*(C`\ew\ew\ew\ew|\ew\ew\ew|\ew\ew|\ew\*(C'\fR.
+.PP
+This is exactly the problem the \fIquantifier\fR metacharacters \f(CW\*(Aq?\*(Aq\fR,
+\&\f(CW\*(Aq*\*(Aq\fR, \f(CW\*(Aq+\*(Aq\fR, and \f(CW\*(C`{}\*(C'\fR were created for. They allow us to delimit the
+number of repeats for a portion of a regexp we consider to be a
+match. Quantifiers are put immediately after the character, character
+class, or grouping that we want to specify. They have the following
+meanings:
+.IP \(bu 4
+\&\f(CW\*(C`a?\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 1 or 0 times
+.IP \(bu 4
+\&\f(CW\*(C`a*\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 0 or more times, \fIi.e.\fR, any number of times
+.IP \(bu 4
+\&\f(CW\*(C`a+\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 1 or more times, \fIi.e.\fR, at least once
+.IP \(bu 4
+\&\f(CW\*(C`a{n,m}\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, but not more than \f(CW\*(C`m\*(C'\fR
+times.
+.IP \(bu 4
+\&\f(CW\*(C`a{n,}\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times
+.IP \(bu 4
+\&\f(CW\*(C`a{,n}\*(C'\fR means: match at most \f(CW\*(C`n\*(C'\fR times
+.IP \(bu 4
+\&\f(CW\*(C`a{n}\*(C'\fR means: match exactly \f(CW\*(C`n\*(C'\fR times
+.PP
+If you like, you can add blanks (tab or space characters) within the
+braces, but adjacent to them, and/or next to the comma (if any).
+.PP
+Here are some examples:
+.PP
+.Vb 10
+\& /[a\-z]+\es+\ed*/; # match a lowercase word, at least one space, and
+\& # any number of digits
+\& /(\ew+)\es+\eg1/; # match doubled words of arbitrary length
+\& /y(es)?/i; # matches \*(Aqy\*(Aq, \*(AqY\*(Aq, or a case\-insensitive \*(Aqyes\*(Aq
+\& $year =~ /^\ed{2,4}$/; # make sure year is at least 2 but not more
+\& # than 4 digits
+\& $year =~ /^\ed{ 2, 4 }$/; # Same; for those who like wide open
+\& # spaces.
+\& $year =~ /^\ed{2, 4}$/; # Same.
+\& $year =~ /^\ed{4}$|^\ed{2}$/; # better match; throw out 3\-digit dates
+\& $year =~ /^\ed{2}(\ed{2})?$/; # same thing written differently.
+\& # However, this captures the last two
+\& # digits in $1 and the other does not.
+\&
+\& % simple_grep \*(Aq^(\ew+)\eg1$\*(Aq /usr/dict/words # isn\*(Aqt this easier?
+\& beriberi
+\& booboo
+\& coco
+\& mama
+\& murmur
+\& papa
+.Ve
+.PP
+For all of these quantifiers, Perl will try to match as much of the
+string as possible, while still allowing the regexp to succeed. Thus
+with \f(CW\*(C`/a?.../\*(C'\fR, Perl will first try to match the regexp with the \f(CW\*(Aqa\*(Aq\fR
+present; if that fails, Perl will try to match the regexp without the
+\&\f(CW\*(Aqa\*(Aq\fR present. For the quantifier \f(CW\*(Aq*\*(Aq\fR, we get the following:
+.PP
+.Vb 5
+\& $x = "the cat in the hat";
+\& $x =~ /^(.*)(cat)(.*)$/; # matches,
+\& # $1 = \*(Aqthe \*(Aq
+\& # $2 = \*(Aqcat\*(Aq
+\& # $3 = \*(Aq in the hat\*(Aq
+.Ve
+.PP
+This is what we might expect: the match finds the only \f(CW\*(C`cat\*(C'\fR in the
+string and locks onto it. Consider, however, this regexp:
+.PP
+.Vb 4
+\& $x =~ /^(.*)(at)(.*)$/; # matches,
+\& # $1 = \*(Aqthe cat in the h\*(Aq
+\& # $2 = \*(Aqat\*(Aq
+\& # $3 = \*(Aq\*(Aq (0 characters match)
+.Ve
+.PP
+One might initially guess that Perl would find the \f(CW\*(C`at\*(C'\fR in \f(CW\*(C`cat\*(C'\fR and
+stop there, but that wouldn't give the longest possible string to the
+first quantifier \f(CW\*(C`.*\*(C'\fR. Instead, the first quantifier \f(CW\*(C`.*\*(C'\fR grabs as
+much of the string as possible while still having the regexp match. In
+this example, that means matching the \f(CW\*(C`at\*(C'\fR element against the final \f(CW\*(C`at\*(C'\fR
+in the string. The other important principle illustrated here is that,
+when there are two or more elements in a regexp, the \fIleftmost\fR
+quantifier, if there is one, gets to grab as much of the string as
+possible, leaving the rest of the regexp to fight over scraps. Thus in
+our example, the first quantifier \f(CW\*(C`.*\*(C'\fR grabs most of the string, while
+the second quantifier \f(CW\*(C`.*\*(C'\fR gets the empty string. Quantifiers that
+grab as much of the string as possible are called \fImaximal match\fR or
+\&\fIgreedy\fR quantifiers.
+.PP
+When a regexp can match a string in several different ways, we can use
+the principles above to predict which way the regexp will match:
+.IP \(bu 4
+Principle 0: Taken as a whole, any regexp will be matched at the
+earliest possible position in the string.
+.IP \(bu 4
+Principle 1: In an alternation \f(CW\*(C`a|b|c...\*(C'\fR, the leftmost alternative
+that allows a match for the whole regexp will be the one used.
+.IP \(bu 4
+Principle 2: The maximal matching quantifiers \f(CW\*(Aq?\*(Aq\fR, \f(CW\*(Aq*\*(Aq\fR, \f(CW\*(Aq+\*(Aq\fR and
+\&\f(CW\*(C`{n,m}\*(C'\fR will in general match as much of the string as possible while
+still allowing the whole regexp to match.
+.IP \(bu 4
+Principle 3: If there are two or more elements in a regexp, the
+leftmost greedy quantifier, if any, will match as much of the string
+as possible while still allowing the whole regexp to match. The next
+leftmost greedy quantifier, if any, will try to match as much of the
+string remaining available to it as possible, while still allowing the
+whole regexp to match. And so on, until all the regexp elements are
+satisfied.
+.PP
+As we have seen above, Principle 0 overrides the others. The regexp
+will be matched as early as possible, with the other principles
+determining how the regexp matches at that earliest character
+position.
+.PP
+Here is an example of these principles in action:
+.PP
+.Vb 5
+\& $x = "The programming republic of Perl";
+\& $x =~ /^(.+)(e|r)(.*)$/; # matches,
+\& # $1 = \*(AqThe programming republic of Pe\*(Aq
+\& # $2 = \*(Aqr\*(Aq
+\& # $3 = \*(Aql\*(Aq
+.Ve
+.PP
+This regexp matches at the earliest string position, \f(CW\*(AqT\*(Aq\fR. One
+might think that \f(CW\*(Aqe\*(Aq\fR, being leftmost in the alternation, would be
+matched, but \f(CW\*(Aqr\*(Aq\fR produces the longest string in the first quantifier.
+.PP
+.Vb 3
+\& $x =~ /(m{1,2})(.*)$/; # matches,
+\& # $1 = \*(Aqmm\*(Aq
+\& # $2 = \*(Aqing republic of Perl\*(Aq
+.Ve
+.PP
+Here, the earliest possible match is at the first \f(CW\*(Aqm\*(Aq\fR in
+\&\f(CW\*(C`programming\*(C'\fR. \f(CW\*(C`m{1,2}\*(C'\fR is the first quantifier, so it gets to match
+a maximal \f(CW\*(C`mm\*(C'\fR.
+.PP
+.Vb 3
+\& $x =~ /.*(m{1,2})(.*)$/; # matches,
+\& # $1 = \*(Aqm\*(Aq
+\& # $2 = \*(Aqing republic of Perl\*(Aq
+.Ve
+.PP
+Here, the regexp matches at the start of the string. The first
+quantifier \f(CW\*(C`.*\*(C'\fR grabs as much as possible, leaving just a single
+\&\f(CW\*(Aqm\*(Aq\fR for the second quantifier \f(CW\*(C`m{1,2}\*(C'\fR.
+.PP
+.Vb 4
+\& $x =~ /(.?)(m{1,2})(.*)$/; # matches,
+\& # $1 = \*(Aqa\*(Aq
+\& # $2 = \*(Aqmm\*(Aq
+\& # $3 = \*(Aqing republic of Perl\*(Aq
+.Ve
+.PP
+Here, \f(CW\*(C`.?\*(C'\fR eats its maximal one character at the earliest possible
+position in the string, \f(CW\*(Aqa\*(Aq\fR in \f(CW\*(C`programming\*(C'\fR, leaving \f(CW\*(C`m{1,2}\*(C'\fR
+the opportunity to match both \f(CW\*(Aqm\*(Aq\fR's. Finally,
+.PP
+.Vb 1
+\& "aXXXb" =~ /(X*)/; # matches with $1 = \*(Aq\*(Aq
+.Ve
+.PP
+because it can match zero copies of \f(CW\*(AqX\*(Aq\fR at the beginning of the
+string. If you definitely want to match at least one \f(CW\*(AqX\*(Aq\fR, use
+\&\f(CW\*(C`X+\*(C'\fR, not \f(CW\*(C`X*\*(C'\fR.
+.PP
+Sometimes greed is not good. At times, we would like quantifiers to
+match a \fIminimal\fR piece of string, rather than a maximal piece. For
+this purpose, Larry Wall created the \fIminimal match\fR or
+\&\fInon-greedy\fR quantifiers \f(CW\*(C`??\*(C'\fR, \f(CW\*(C`*?\*(C'\fR, \f(CW\*(C`+?\*(C'\fR, and \f(CW\*(C`{}?\*(C'\fR. These are
+the usual quantifiers with a \f(CW\*(Aq?\*(Aq\fR appended to them. They have the
+following meanings:
+.IP \(bu 4
+\&\f(CW\*(C`a??\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 0 or 1 times. Try 0 first, then 1.
+.IP \(bu 4
+\&\f(CW\*(C`a*?\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 0 or more times, \fIi.e.\fR, any number of times,
+but as few times as possible
+.IP \(bu 4
+\&\f(CW\*(C`a+?\*(C'\fR means: match \f(CW\*(Aqa\*(Aq\fR 1 or more times, \fIi.e.\fR, at least once, but
+as few times as possible
+.IP \(bu 4
+\&\f(CW\*(C`a{n,m}?\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, not more than \f(CW\*(C`m\*(C'\fR
+times, as few times as possible
+.IP \(bu 4
+\&\f(CW\*(C`a{n,}?\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, but as few times as
+possible
+.IP \(bu 4
+\&\f(CW\*(C`a{,n}?\*(C'\fR means: match at most \f(CW\*(C`n\*(C'\fR times, but as few times as
+possible
+.IP \(bu 4
+\&\f(CW\*(C`a{n}?\*(C'\fR means: match exactly \f(CW\*(C`n\*(C'\fR times. Because we match exactly
+\&\f(CW\*(C`n\*(C'\fR times, \f(CW\*(C`a{n}?\*(C'\fR is equivalent to \f(CW\*(C`a{n}\*(C'\fR and is just there for
+notational consistency.
+.PP
+Let's look at the example above, but with minimal quantifiers:
+.PP
+.Vb 5
+\& $x = "The programming republic of Perl";
+\& $x =~ /^(.+?)(e|r)(.*)$/; # matches,
+\& # $1 = \*(AqTh\*(Aq
+\& # $2 = \*(Aqe\*(Aq
+\& # $3 = \*(Aq programming republic of Perl\*(Aq
+.Ve
+.PP
+The minimal string that will allow both the start of the string \f(CW\*(Aq^\*(Aq\fR
+and the alternation to match is \f(CW\*(C`Th\*(C'\fR, with the alternation \f(CW\*(C`e|r\*(C'\fR
+matching \f(CW\*(Aqe\*(Aq\fR. The second quantifier \f(CW\*(C`.*\*(C'\fR is free to gobble up the
+rest of the string.
+.PP
+.Vb 3
+\& $x =~ /(m{1,2}?)(.*?)$/; # matches,
+\& # $1 = \*(Aqm\*(Aq
+\& # $2 = \*(Aqming republic of Perl\*(Aq
+.Ve
+.PP
+The first string position that this regexp can match is at the first
+\&\f(CW\*(Aqm\*(Aq\fR in \f(CW\*(C`programming\*(C'\fR. At this position, the minimal \f(CW\*(C`m{1,2}?\*(C'\fR
+matches just one \f(CW\*(Aqm\*(Aq\fR. Although the second quantifier \f(CW\*(C`.*?\*(C'\fR would
+prefer to match no characters, it is constrained by the end-of-string
+anchor \f(CW\*(Aq$\*(Aq\fR to match the rest of the string.
+.PP
+.Vb 4
+\& $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches,
+\& # $1 = \*(AqThe progra\*(Aq
+\& # $2 = \*(Aqm\*(Aq
+\& # $3 = \*(Aqming republic of Perl\*(Aq
+.Ve
+.PP
+In this regexp, you might expect the first minimal quantifier \f(CW\*(C`.*?\*(C'\fR
+to match the empty string, because it is not constrained by a \f(CW\*(Aq^\*(Aq\fR
+anchor to match the beginning of the word. Principle 0 applies here,
+however. Because it is possible for the whole regexp to match at the
+start of the string, it \fIwill\fR match at the start of the string. Thus
+the first quantifier has to match everything up to the first \f(CW\*(Aqm\*(Aq\fR. The
+second minimal quantifier matches just one \f(CW\*(Aqm\*(Aq\fR and the third
+quantifier matches the rest of the string.
+.PP
+.Vb 4
+\& $x =~ /(.??)(m{1,2})(.*)$/; # matches,
+\& # $1 = \*(Aqa\*(Aq
+\& # $2 = \*(Aqmm\*(Aq
+\& # $3 = \*(Aqing republic of Perl\*(Aq
+.Ve
+.PP
+Just as in the previous regexp, the first quantifier \f(CW\*(C`.??\*(C'\fR can match
+earliest at position \f(CW\*(Aqa\*(Aq\fR, so it does. The second quantifier is
+greedy, so it matches \f(CW\*(C`mm\*(C'\fR, and the third matches the rest of the
+string.
+.PP
+We can modify principle 3 above to take into account non-greedy
+quantifiers:
+.IP \(bu 4
+Principle 3: If there are two or more elements in a regexp, the
+leftmost greedy (non-greedy) quantifier, if any, will match as much
+(little) of the string as possible while still allowing the whole
+regexp to match. The next leftmost greedy (non-greedy) quantifier, if
+any, will try to match as much (little) of the string remaining
+available to it as possible, while still allowing the whole regexp to
+match. And so on, until all the regexp elements are satisfied.
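+.PP
+As a compact sketch of the difference, the same string can be matched with
+a greedy and a non-greedy quantifier in front of \f(CW\*(C`at\*(C'\fR (the labels in
+the output are illustrative):
+.PP
+.Vb 3
+\& my $x = "the cat in the hat";
+\& print "greedy: \*(Aq$1\*(Aq\en"  if $x =~ /^(.*)at/;    # $1 is \*(Aqthe cat in the h\*(Aq
+\& print "minimal: \*(Aq$1\*(Aq\en" if $x =~ /^(.*?)at/;   # $1 is \*(Aqthe c\*(Aq
+.Ve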
+.PP
+Just like alternation, quantifiers are also susceptible to
+backtracking. Here is a step-by-step analysis of the example
+.PP
+.Vb 5
+\& $x = "the cat in the hat";
+\& $x =~ /^(.*)(at)(.*)$/; # matches,
+\& # $1 = \*(Aqthe cat in the h\*(Aq
+\& # $2 = \*(Aqat\*(Aq
+\& # $3 = \*(Aq\*(Aq (0 matches)
+.Ve
+.IP 1. 4
+Start with the first letter in the string \f(CW\*(Aqt\*(Aq\fR.
+.IP 2. 4
+The first quantifier \f(CW\*(Aq.*\*(Aq\fR starts out by matching the whole
+string \f(CW"the cat in the hat"\fR.
+.IP 3. 4
+\&\f(CW\*(Aqa\*(Aq\fR in the regexp element \f(CW\*(Aqat\*(Aq\fR doesn't match the end
+of the string. Backtrack one character.
+.IP 4. 4
+\&\f(CW\*(Aqa\*(Aq\fR in the regexp element \f(CW\*(Aqat\*(Aq\fR still doesn't match
+the last letter of the string \f(CW\*(Aqt\*(Aq\fR, so backtrack one more character.
+.IP 5. 4
+Now we can match the \f(CW\*(Aqa\*(Aq\fR and the \f(CW\*(Aqt\*(Aq\fR.
+.IP 6. 4
+Move on to the third element \f(CW\*(Aq.*\*(Aq\fR. Since we are at the
+end of the string and \f(CW\*(Aq.*\*(Aq\fR can match 0 times, assign it the empty
+string.
+.IP 7. 4
+We are done!
+.PP
+Most of the time, all this moving forward and backtracking happens
+quickly and searching is fast. There are some pathological regexps,
+however, whose execution time grows exponentially with the size of the
+string. A typical structure that blows up in your face is of the form
+.PP
+.Vb 1
+\& /(a|b+)*/;
+.Ve
+.PP
+The problem is the nested indeterminate quantifiers. There are many
+different ways of partitioning a string of length n between the \f(CW\*(Aq+\*(Aq\fR
+and \f(CW\*(Aq*\*(Aq\fR: one repetition with \f(CW\*(C`b+\*(C'\fR of length n, two repetitions with
+the first \f(CW\*(C`b+\*(C'\fR length k and the second with length n\-k, m repetitions
+whose bits add up to length n, \fIetc\fR. In fact there are an exponential
+number of ways to partition a string as a function of its length. A
+regexp may get lucky and match early in the process, but if there is
+no match, Perl will try \fIevery\fR possibility before giving up. So be
+careful with nested \f(CW\*(Aq*\*(Aq\fR's, \f(CW\*(C`{n,m}\*(C'\fR's, and \f(CW\*(Aq+\*(Aq\fR's. The book
+\&\fIMastering Regular Expressions\fR by Jeffrey Friedl gives a wonderful
+discussion of this and other efficiency issues.
+.SS "Possessive quantifiers"
+.IX Subsection "Possessive quantifiers"
+Backtracking during the relentless search for a match may be a waste
+of time, particularly when the match is bound to fail. Consider
+the simple pattern
+.PP
+.Vb 1
+\& /^\ew+\es+\ew+$/; # a word, spaces, a word
+.Ve
+.PP
+Whenever this is applied to a string which doesn't quite meet the
+pattern's expectations such as \f(CW"abc\ \ "\fR or \f(CW"abc\ \ def\ "\fR,
+the regexp engine will backtrack, approximately once for each character
+in the string. But we know that there is no way around taking \fIall\fR
+of the initial word characters to match the first repetition, that \fIall\fR
+spaces must be eaten by the middle part, and the same goes for the second
+word.
+.PP
+With the introduction of the \fIpossessive quantifiers\fR in Perl 5.10, we
+have a way of instructing the regexp engine not to backtrack, with the
+usual quantifiers with a \f(CW\*(Aq+\*(Aq\fR appended to them. This makes them greedy as
+well as stingy; once they succeed they won't give anything back to permit
+another solution. They have the following meanings:
+.IP \(bu 4
+\&\f(CW\*(C`a{n,m}+\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, not more than \f(CW\*(C`m\*(C'\fR times,
+as many times as possible, and don't give anything up. \f(CW\*(C`a?+\*(C'\fR is short
+for \f(CW\*(C`a{0,1}+\*(C'\fR.
+.IP \(bu 4
+\&\f(CW\*(C`a{n,}+\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, but as many times as possible,
+and don't give anything up. \f(CW\*(C`a++\*(C'\fR is short for \f(CW\*(C`a{1,}+\*(C'\fR.
+.IP \(bu 4
+\&\f(CW\*(C`a{,n}+\*(C'\fR means: match as many times as possible up to at most \f(CW\*(C`n\*(C'\fR
+times, and don't give anything up. \f(CW\*(C`a*+\*(C'\fR is short for \f(CW\*(C`a{0,}+\*(C'\fR.
+.IP \(bu 4
+\&\f(CW\*(C`a{n}+\*(C'\fR means: match exactly \f(CW\*(C`n\*(C'\fR times. It is just there for
+notational consistency.
+.PP
+These possessive quantifiers represent a special case of a more general
+concept, the \fIindependent subexpression\fR, see below.
+.PP
+As an example where a possessive quantifier is suitable we consider
+matching a quoted string, as it appears in several programming languages.
+The backslash is used as an escape character that indicates that the
+next character is to be taken literally, as another character for the
+string. Therefore, after the opening quote, we expect a (possibly
+empty) sequence of alternatives: either some character except an
+unescaped quote or backslash or an escaped character.
+.PP
+.Vb 1
+\& /"(?:[^"\e\e]++|\e\e.)*+"/;
+.Ve
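+.PP
+As a rough illustration (the sample strings and variable names below are
+our own and are not part of the original example), this pattern can be
+tried on a string containing escaped quotes and on a string whose quote
+is never closed:
+.PP
+.Vb 5
+\& my $re = qr/"(?:[^"\e\e]++|\e\e.)*+"/;
+\& my $ok = \*(Aqsay "a \e"quoted\e" word" now\*(Aq;
+\& print "found: $&\en" if $ok =~ $re; # found: "a \e"quoted\e" word"
+\& my $bad = \*(Aqsay "never closed\*(Aq;
+\& print "no match\en" unless $bad =~ $re; # no match
+.Ve
+.PP
+In the second case the possessive quantifiers give nothing back once they
+reach the end of the string, so the attempt starting at the opening quote
+fails without retrying shorter matches.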
+.SS "Building a regexp"
+.IX Subsection "Building a regexp"
+At this point, we have all the basic regexp concepts covered, so let's
+give a more involved example of a regular expression. We will build a
+regexp that matches numbers.
+.PP
+The first task in building a regexp is to decide what we want to match
+and what we want to exclude. In our case, we want to match both
+integers and floating point numbers and we want to reject any string
+that isn't a number.
+.PP
+The next task is to break the problem down into smaller problems that
+are easily converted into a regexp.
+.PP
+The simplest case is integers. These consist of a sequence of digits,
+with an optional sign in front. The digits we can represent with
+\&\f(CW\*(C`\ed+\*(C'\fR and the sign can be matched with \f(CW\*(C`[+\-]\*(C'\fR. Thus the integer
+regexp is
+.PP
+.Vb 1
+\& /[+\-]?\ed+/; # matches integers
+.Ve
+.PP
+A floating point number potentially has a sign, an integral part, a
+decimal point, a fractional part, and an exponent. One or more of these
+parts is optional, so we need to check out the different
+possibilities. Floating point numbers which are in proper form include
+123., 0.345, .34, \-1e6, and 25.4E\-72. As with integers, the sign out
+front is completely optional and can be matched by \f(CW\*(C`[+\-]?\*(C'\fR. We can
+see that if there is no exponent, floating point numbers must have a
+decimal point, otherwise they are integers. We might be tempted to
+model these with \f(CW\*(C`\ed*\e.\ed*\*(C'\fR, but this would also match just a single
+decimal point, which is not a number. So the three cases of floating
+point number without exponent are
+.PP
+.Vb 3
+\& /[+\-]?\ed+\e./; # 1., 321., etc.
+\& /[+\-]?\e.\ed+/; # .1, .234, etc.
+\& /[+\-]?\ed+\e.\ed+/; # 1.0, 30.56, etc.
+.Ve
+.PP
+These can be combined into a single regexp with a three-way alternation:
+.PP
+.Vb 1
+\& /[+\-]?(\ed+\e.\ed+|\ed+\e.|\e.\ed+)/; # floating point, no exponent
+.Ve
+.PP
+In this alternation, it is important to put \f(CW\*(Aq\ed+\e.\ed+\*(Aq\fR before
+\&\f(CW\*(Aq\ed+\e.\*(Aq\fR. If \f(CW\*(Aq\ed+\e.\*(Aq\fR were first, the regexp would happily match that
+and ignore the fractional part of the number.
+.PP
+Now consider floating point numbers with exponents. The key
+observation here is that \fIboth\fR integers and numbers with decimal
+points are allowed in front of an exponent. Then exponents, like the
+overall sign, are independent of whether we are matching numbers with
+or without decimal points, and can be "decoupled" from the
+mantissa. The overall form of the regexp now becomes clear:
+.PP
+.Vb 1
+\& /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
+.Ve
+.PP
+The exponent is an \f(CW\*(Aqe\*(Aq\fR or \f(CW\*(AqE\*(Aq\fR, followed by an integer. So the
+exponent regexp is
+.PP
+.Vb 1
+\& /[eE][+\-]?\ed+/; # exponent
+.Ve
+.PP
+Putting all the parts together, we get a regexp that matches numbers:
+.PP
+.Vb 1
+\& /^[+\-]?(\ed+\e.\ed+|\ed+\e.|\e.\ed+|\ed+)([eE][+\-]?\ed+)?$/; # Ta da!
+.Ve
+.PP
+Long regexps like this may impress your friends, but can be hard to
+decipher. In complex situations like this, the \f(CW\*(C`/x\*(C'\fR modifier for a
+match is invaluable. It allows one to put nearly arbitrary whitespace
+and comments into a regexp without affecting its meaning. Using it,
+we can rewrite our "extended" regexp in the more pleasing form
+.PP
+.Vb 10
+\& /^
+\& [+\-]? # first, match an optional sign
+\& ( # then match integers or f.p. mantissas:
+\& \ed+\e.\ed+ # mantissa of the form a.b
+\& |\ed+\e. # mantissa of the form a.
+\& |\e.\ed+ # mantissa of the form .b
+\& |\ed+ # integer of the form a
+\& )
+\& ( [eE] [+\-]? \ed+ )? # finally, optionally match an exponent
+\& $/x;
+.Ve
+.PP
+If whitespace is mostly irrelevant, how does one include a space
+character in an extended regexp? The answer is to backslash it
+\&\f(CW\*(Aq\e\ \*(Aq\fR or put it in a character class \f(CW\*(C`[\ ]\*(C'\fR. The same thing
+goes for pound signs: use \f(CW\*(C`\e#\*(C'\fR or \f(CW\*(C`[#]\*(C'\fR. For instance, Perl allows
+a space between the sign and the mantissa or integer, and we could add
+this to our regexp as follows:
+.PP
+.Vb 10
+\& /^
+\& [+\-]?\e * # first, match an optional sign *and space*
+\& ( # then match integers or f.p. mantissas:
+\& \ed+\e.\ed+ # mantissa of the form a.b
+\& |\ed+\e. # mantissa of the form a.
+\& |\e.\ed+ # mantissa of the form .b
+\& |\ed+ # integer of the form a
+\& )
+\& ( [eE] [+\-]? \ed+ )? # finally, optionally match an exponent
+\& $/x;
+.Ve
+.PP
+In this form, it is easier to see a way to simplify the
+alternation. Alternatives 1, 2, and 4 all start with \f(CW\*(C`\ed+\*(C'\fR, so it
+could be factored out:
+.PP
+.Vb 11
+\& /^
+\& [+\-]?\e * # first, match an optional sign
+\& ( # then match integers or f.p. mantissas:
+\& \ed+ # start out with a ...
+\& (
+\& \e.\ed* # mantissa of the form a.b or a.
+\& )? # ? takes care of integers of the form a
+\& |\e.\ed+ # mantissa of the form .b
+\& )
+\& ( [eE] [+\-]? \ed+ )? # finally, optionally match an exponent
+\& $/x;
+.Ve
+.PP
+Starting in Perl v5.26, specifying \f(CW\*(C`/xx\*(C'\fR changes the square-bracketed
+portions of a pattern to ignore tabs and space characters unless they
+are escaped by preceding them with a backslash. So, we could write
+.PP
+.Vb 11
+\& /^
+\& [ + \- ]?\e * # first, match an optional sign
+\& ( # then match integers or f.p. mantissas:
+\& \ed+ # start out with a ...
+\& (
+\& \e.\ed* # mantissa of the form a.b or a.
+\& )? # ? takes care of integers of the form a
+\& |\e.\ed+ # mantissa of the form .b
+\& )
+\& ( [ e E ] [ + \- ]? \ed+ )? # finally, optionally match an exponent
+\& $/xx;
+.Ve
+.PP
+This doesn't really improve the legibility of this example, but it's
+available in case you want it. Squashing the pattern down to the
+compact form, we have
+.PP
+.Vb 1
+\& /^[+\-]?\e *(\ed+(\e.\ed*)?|\e.\ed+)([eE][+\-]?\ed+)?$/;
+.Ve
+.PP
+This is our final regexp. To recap, we built a regexp by
+.IP \(bu 4
+specifying the task in detail,
+.IP \(bu 4
+breaking down the problem into smaller parts,
+.IP \(bu 4
+translating the small parts into regexps,
+.IP \(bu 4
+combining the regexps,
+.IP \(bu 4
+and optimizing the final combined regexp.
+.PP
+These are also the typical steps involved in writing a computer
+program. This makes perfect sense, because regular expressions are
+essentially programs written in a little computer language that specifies
+patterns.
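+.PP
+A quick way to check the final regexp is to run it over a few sample
+strings. The following is only a sketch; the variable names and the test
+strings (some taken from the examples above, some made up) are ours:
+.PP
+.Vb 5
+\& my $number = qr/^[+\-]?\e *(\ed+(\e.\ed*)?|\e.\ed+)([eE][+\-]?\ed+)?$/;
+\& for my $s (\*(Aq123.\*(Aq, \*(Aq.34\*(Aq, \*(Aq\-1e6\*(Aq, \*(Aq25.4E\-72\*(Aq, \*(Aq.\*(Aq, \*(Aq1e\*(Aq, \*(Aqabc\*(Aq) {
+\& my $verdict = $s =~ $number ? "a number" : "not a number";
+\& print "\*(Aq$s\*(Aq is $verdict\en";
+\& }
+.Ve
+.PP
+The first four strings are reported as numbers; the lone decimal point,
+the dangling exponent and the plain word are rejected.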
+.SS "Using regular expressions in Perl"
+.IX Subsection "Using regular expressions in Perl"
+The last topic of Part 1 briefly covers how regexps are used in Perl
+programs. Where do they fit into Perl syntax?
+.PP
+We have already introduced the matching operator in its default
+\&\f(CW\*(C`/regexp/\*(C'\fR and arbitrary delimiter \f(CW\*(C`m!regexp!\*(C'\fR forms. We have used
+the binding operator \f(CW\*(C`=~\*(C'\fR and its negation \f(CW\*(C`!~\*(C'\fR to test for string
+matches. Associated with the matching operator, we have discussed the
+single line \f(CW\*(C`/s\*(C'\fR, multi-line \f(CW\*(C`/m\*(C'\fR, case-insensitive \f(CW\*(C`/i\*(C'\fR and
+extended \f(CW\*(C`/x\*(C'\fR modifiers. There are a few more things you might
+want to know about matching operators.
+.PP
+\fIProhibiting substitution\fR
+.IX Subsection "Prohibiting substitution"
+.PP
+Variables in a regexp are normally substituted before matching, just as
+in a double-quoted string. If you don't want any substitutions at all,
+use the special delimiter \f(CW\*(C`m\*(Aq\*(Aq\*(C'\fR:
+.PP
+.Vb 4
+\& @pattern = (\*(AqSeuss\*(Aq);
+\& while (<>) {
+\& print if m\*(Aq@pattern\*(Aq; # matches literal \*(Aq@pattern\*(Aq, not \*(AqSeuss\*(Aq
+\& }
+.Ve
+.PP
+Similar to strings, \f(CW\*(C`m\*(Aq\*(Aq\*(C'\fR acts like apostrophes on a regexp; all other
+\&\f(CW\*(Aqm\*(Aq\fR delimiters act like quotes. If the regexp evaluates to the empty string,
+the regexp in the \fIlast successful match\fR is used instead. So we have
+.PP
+.Vb 2
+\& "dog" =~ /d/; # \*(Aqd\*(Aq matches
+\& "dogbert" =~ //; # this matches the \*(Aqd\*(Aq regexp used before
+.Ve
+.PP
+\fIGlobal matching\fR
+.IX Subsection "Global matching"
+.PP
+The final two modifiers we will discuss here,
+\&\f(CW\*(C`/g\*(C'\fR and \f(CW\*(C`/c\*(C'\fR, concern multiple matches.
+The modifier \f(CW\*(C`/g\*(C'\fR stands for global matching and allows the
+matching operator to match within a string as many times as possible.
+In scalar context, successive invocations against a string will have
+\&\f(CW\*(C`/g\*(C'\fR jump from match to match, keeping track of position in the
+string as it goes along. You can get or set the position with the
+\&\f(CWpos()\fR function.
+.PP
+The use of \f(CW\*(C`/g\*(C'\fR is shown in the following example. Suppose we have
+a string that consists of words separated by spaces. If we know how
+many words there are in advance, we could extract the words using
+groupings:
+.PP
+.Vb 5
+\& $x = "cat dog house"; # 3 words
+\& $x =~ /^\es*(\ew+)\es+(\ew+)\es+(\ew+)\es*$/; # matches,
+\& # $1 = \*(Aqcat\*(Aq
+\& # $2 = \*(Aqdog\*(Aq
+\& # $3 = \*(Aqhouse\*(Aq
+.Ve
+.PP
+But what if we had an indeterminate number of words? This is the sort
+of task \f(CW\*(C`/g\*(C'\fR was made for. To extract all words, form the simple
+regexp \f(CW\*(C`(\ew+)\*(C'\fR and loop over all matches with \f(CW\*(C`/(\ew+)/g\*(C'\fR:
+.PP
+.Vb 3
+\& while ($x =~ /(\ew+)/g) {
+\& print "Word is $1, ends at position ", pos $x, "\en";
+\& }
+.Ve
+.PP
+prints
+.PP
+.Vb 3
+\& Word is cat, ends at position 3
+\& Word is dog, ends at position 7
+\& Word is house, ends at position 13
+.Ve
+.PP
+A failed match or changing the target string resets the position. If
+you don't want the position reset after failure to match, add the
+\&\f(CW\*(C`/c\*(C'\fR, as in \f(CW\*(C`/regexp/gc\*(C'\fR. The current position in the string is
+associated with the string, not the regexp. This means that different
+strings have different positions and their respective positions can be
+set or read independently.
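+.PP
+The difference that \f(CW\*(C`/c\*(C'\fR makes can be seen in a small sketch (the string
+and variable names here are only illustrative):
+.PP
+.Vb 6
+\& my $s = "cat dog";
+\& $s =~ /\ew+/g; # matches \*(Aqcat\*(Aq, pos($s) is 3
+\& $s =~ /\ed+/gc; # fails, but /c keeps the position
+\& print "after /gc: ", pos($s), "\en"; # after /gc: 3
+\& $s =~ /\ed+/g; # fails and resets the position
+\& print "after /g: ", defined pos($s) ? "set" : "reset", "\en"; # after /g: reset
+.Ve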
+.PP
+In list context, \f(CW\*(C`/g\*(C'\fR returns a list of matched groupings, or if
+there are no groupings, a list of matches to the whole regexp. So if
+we wanted just the words, we could use
+.PP
+.Vb 4
+\& @words = ($x =~ /(\ew+)/g); # matches,
+\& # $words[0] = \*(Aqcat\*(Aq
+\& # $words[1] = \*(Aqdog\*(Aq
+\& # $words[2] = \*(Aqhouse\*(Aq
+.Ve
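+.PP
+When the regexp has more than one grouping, the list contains all the
+groups of the first match, then all the groups of the second match, and
+so on. As a small sketch (the data and names are made up for
+illustration), this makes it easy to fill a hash from key/value pairs:
+.PP
+.Vb 3
+\& my $pairs = "width=10 height=20";
+\& my %size = $pairs =~ /(\ew+)=(\ed+)/g; # (\*(Aqwidth\*(Aq, 10, \*(Aqheight\*(Aq, 20)
+\& print "$size{width} x $size{height}\en"; # 10 x 20
+.Ve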
+.PP
+Closely associated with the \f(CW\*(C`/g\*(C'\fR modifier is the \f(CW\*(C`\eG\*(C'\fR anchor. The
+\&\f(CW\*(C`\eG\*(C'\fR anchor matches at the point where the previous \f(CW\*(C`/g\*(C'\fR match left
+off. \f(CW\*(C`\eG\*(C'\fR allows us to easily do context-sensitive matching:
+.PP
+.Vb 12
+\& $metric = 1; # use metric units
+\& ...
+\& $x = <FILE>; # read in measurement
+\& $x =~ /^([+\-]?\ed+)\es*/g; # get magnitude
+\& $weight = $1;
+\& if ($metric) { # error checking
+\& print "Units error!" unless $x =~ /\eGkg\e./g;
+\& }
+\& else {
+\& print "Units error!" unless $x =~ /\eGlbs\e./g;
+\& }
+\& $x =~ /\eG\es+(widget|sprocket)/g; # continue processing
+.Ve
+.PP
+The combination of \f(CW\*(C`/g\*(C'\fR and \f(CW\*(C`\eG\*(C'\fR allows us to process the string a
+bit at a time and use arbitrary Perl logic to decide what to do next.
+Currently, the \f(CW\*(C`\eG\*(C'\fR anchor is only fully supported when used to anchor
+to the start of the pattern.
+.PP
+\&\f(CW\*(C`\eG\*(C'\fR is also invaluable in processing fixed-length records with
+regexps. Suppose we have a snippet of coding region DNA, encoded as
+base pair letters \f(CW\*(C`ATCGTTGAAT...\*(C'\fR and we want to find all the stop
+codons \f(CW\*(C`TGA\*(C'\fR. In a coding region, codons are 3\-letter sequences, so
+we can think of the DNA snippet as a sequence of 3\-letter records. The
+naive regexp
+.PP
+.Vb 3
+\& # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
+\& $dna = "ATCGTTGAATGCAAATGACATGAC";
+\& $dna =~ /TGA/;
+.Ve
+.PP
+doesn't work; it may match a \f(CW\*(C`TGA\*(C'\fR, but there is no guarantee that
+the match is aligned with codon boundaries, \fIe.g.\fR, the substring
+\&\f(CW\*(C`GTT\ GAA\*(C'\fR gives a match. A better solution is
+.PP
+.Vb 3
+\& while ($dna =~ /(\ew\ew\ew)*?TGA/g) { # note the minimal *?
+\& print "Got a TGA stop codon at position ", pos $dna, "\en";
+\& }
+.Ve
+.PP
+which prints
+.PP
+.Vb 2
+\& Got a TGA stop codon at position 18
+\& Got a TGA stop codon at position 23
+.Ve
+.PP
+Position 18 is good, but position 23 is bogus. What happened?
+.PP
+The answer is that our regexp works well until we get past the last
+real match. Then the regexp will fail to match a synchronized \f(CW\*(C`TGA\*(C'\fR
+and start stepping ahead one character position at a time, not what we
+want. The solution is to use \f(CW\*(C`\eG\*(C'\fR to anchor the match to the codon
+alignment:
+.PP
+.Vb 3
+\& while ($dna =~ /\eG(\ew\ew\ew)*?TGA/g) {
+\& print "Got a TGA stop codon at position ", pos $dna, "\en";
+\& }
+.Ve
+.PP
+This prints
+.PP
+.Vb 1
+\& Got a TGA stop codon at position 18
+.Ve
+.PP
+which is the correct answer. This example illustrates that it is
+important not only to match what is desired, but to reject what is not
+desired.
+.PP
+(There are other regexp modifiers that are available, such as
+\&\f(CW\*(C`/o\*(C'\fR, but their specialized uses are beyond the
+scope of this introduction.)
+.PP
+\fISearch and replace\fR
+.IX Subsection "Search and replace"
+.PP
+Regular expressions also play a big role in \fIsearch and replace\fR
+operations in Perl. Search and replace is accomplished with the
+\&\f(CW\*(C`s///\*(C'\fR operator. The general form is
+\&\f(CW\*(C`s/regexp/replacement/modifiers\*(C'\fR, with everything we know about
+regexps and modifiers applying in this case as well. The
+\&\fIreplacement\fR is a Perl double-quoted string that replaces in the
+string whatever is matched with the \f(CW\*(C`regexp\*(C'\fR. The operator \f(CW\*(C`=~\*(C'\fR is
+also used here to associate a string with \f(CW\*(C`s///\*(C'\fR. If matching
+against \f(CW$_\fR, the \f(CW\*(C`$_\ =~\*(C'\fR can be dropped. If there is a match,
+\&\f(CW\*(C`s///\*(C'\fR returns the number of substitutions made; otherwise it returns
+false. Here are a few examples:
+.PP
+.Vb 8
+\& $x = "Time to feed the cat!";
+\& $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
+\& if ($x =~ s/^(Time.*hacker)!$/$1 now!/) {
+\& $more_insistent = 1;
+\& }
+\& $y = "\*(Aqquoted words\*(Aq";
+\& $y =~ s/^\*(Aq(.*)\*(Aq$/$1/; # strip single quotes,
+\& # $y contains "quoted words"
+.Ve
+.PP
+In the last example, the whole string was matched, but only the part
+inside the single quotes was grouped. With the \f(CW\*(C`s///\*(C'\fR operator, the
+matched variables \f(CW$1\fR, \f(CW$2\fR, \fIetc\fR. are immediately available for use
+in the replacement expression, so we use \f(CW$1\fR to replace the quoted
+string with just what was quoted. With the global modifier, \f(CW\*(C`s///g\*(C'\fR
+will search and replace all occurrences of the regexp in the string:
+.PP
+.Vb 6
+\& $x = "I batted 4 for 4";
+\& $x =~ s/4/four/; # doesn\*(Aqt do it all:
+\& # $x contains "I batted four for 4"
+\& $x = "I batted 4 for 4";
+\& $x =~ s/4/four/g; # does it all:
+\& # $x contains "I batted four for four"
+.Ve
+.PP
+If you prefer "regex" over "regexp" in this tutorial, you could use
+the following program to replace it:
+.PP
+.Vb 9
+\& % cat > simple_replace
+\& #!/usr/bin/perl
+\& $regexp = shift;
+\& $replacement = shift;
+\& while (<>) {
+\& s/$regexp/$replacement/g;
+\& print;
+\& }
+\& ^D
+\&
+\& % simple_replace regexp regex perlretut.pod
+.Ve
+.PP
+In \f(CW\*(C`simple_replace\*(C'\fR we used the \f(CW\*(C`s///g\*(C'\fR modifier to replace all
+occurrences of the regexp on each line. (Even though the regular
+expression appears in a loop, Perl is smart enough to compile it
+only once.) As with \f(CW\*(C`simple_grep\*(C'\fR, both the
+\&\f(CW\*(C`print\*(C'\fR and the \f(CW\*(C`s/$regexp/$replacement/g\*(C'\fR use \f(CW$_\fR implicitly.
+.PP
+If you don't want \f(CW\*(C`s///\*(C'\fR to change your original variable you can use
+the non-destructive substitute modifier, \f(CW\*(C`s///r\*(C'\fR. This changes the
+behavior so that \f(CW\*(C`s///r\*(C'\fR returns the final substituted string
+(instead of the number of substitutions):
+.PP
+.Vb 3
+\& $x = "I like dogs.";
+\& $y = $x =~ s/dogs/cats/r;
+\& print "$x $y\en";
+.Ve
+.PP
+That example prints "I like dogs. I like cats." Notice that the original
+\&\f(CW$x\fR variable has not been affected. The overall
+result of the substitution is instead stored in \f(CW$y\fR. If the
+substitution doesn't affect anything then the original string is
+returned:
+.PP
+.Vb 3
+\& $x = "I like dogs.";
+\& $y = $x =~ s/elephants/cougars/r;
+\& print "$x $y\en"; # prints "I like dogs. I like dogs."
+.Ve
+.PP
+One other interesting thing that the \f(CW\*(C`s///r\*(C'\fR flag allows is chaining
+substitutions:
+.PP
+.Vb 4
+\& $x = "Cats are great.";
+\& print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
+\& s/Frogs/Hedgehogs/r, "\en";
+\& # prints "Hedgehogs are great."
+.Ve
+.PP
+A modifier available specifically to search and replace is the
+\&\f(CW\*(C`s///e\*(C'\fR evaluation modifier. \f(CW\*(C`s///e\*(C'\fR treats the
+replacement text as Perl code, rather than a double-quoted
+string. The value that the code returns is substituted for the
+matched substring. \f(CW\*(C`s///e\*(C'\fR is useful if you need to do a bit of
+computation in the process of replacing text. This example counts
+character frequencies in a line:
+.PP
+.Vb 4
+\& $x = "Bill the cat";
+\& $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself
+\& print "frequency of \*(Aq$_\*(Aq is $chars{$_}\en"
+\& foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
+.Ve
+.PP
+This prints
+.PP
+.Vb 9
+\& frequency of \*(Aq \*(Aq is 2
+\& frequency of \*(Aqt\*(Aq is 2
+\& frequency of \*(Aql\*(Aq is 2
+\& frequency of \*(AqB\*(Aq is 1
+\& frequency of \*(Aqc\*(Aq is 1
+\& frequency of \*(Aqe\*(Aq is 1
+\& frequency of \*(Aqh\*(Aq is 1
+\& frequency of \*(Aqi\*(Aq is 1
+\& frequency of \*(Aqa\*(Aq is 1
+.Ve
+.PP
+As with the match \f(CW\*(C`m//\*(C'\fR operator, \f(CW\*(C`s///\*(C'\fR can use other delimiters,
+such as \f(CW\*(C`s!!!\*(C'\fR and \f(CW\*(C`s{}{}\*(C'\fR, and even \f(CW\*(C`s{}//\*(C'\fR. If single quotes are
+used \f(CW\*(C`s\*(Aq\*(Aq\*(Aq\*(C'\fR, then the regexp and replacement are
+treated as single-quoted strings and there are no
+variable substitutions. \f(CW\*(C`s///\*(C'\fR in list context
+returns the same thing as in scalar context, \fIi.e.\fR, the number of
+matches.
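+.PP
+A brief sketch of the alternative delimiters (the strings and variable
+names are invented for illustration):
+.PP
+.Vb 6
+\& my $path = "/usr/local/bin";
+\& $path =~ s{/usr/local}{/opt}; # no need to escape the slashes
+\& print "$path\en"; # /opt/bin
+\& my $pet = "my cat";
+\& $pet =~ s\*(Aqcat\*(Aq$dog\*(Aq; # single quotes: $dog is not interpolated
+\& print "$pet\en"; # my $dog
+.Ve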
+.PP
+\fIThe split function\fR
+.IX Subsection "The split function"
+.PP
+The \f(CWsplit()\fR function is another place where a regexp is used.
+\&\f(CW\*(C`split /regexp/, string, limit\*(C'\fR separates the \f(CW\*(C`string\*(C'\fR operand into
+a list of substrings and returns that list. The regexp must be designed
+to match whatever constitutes the separators for the desired substrings.
+The \f(CW\*(C`limit\*(C'\fR, if present, constrains splitting into no more than \f(CW\*(C`limit\*(C'\fR
+number of strings. For example, to split a string into words, use
+.PP
+.Vb 4
+\& $x = "Calvin and Hobbes";
+\& @words = split /\es+/, $x; # $words[0] = \*(AqCalvin\*(Aq
+\& # $words[1] = \*(Aqand\*(Aq
+\& # $words[2] = \*(AqHobbes\*(Aq
+.Ve
+.PP
+If the empty regexp \f(CW\*(C`//\*(C'\fR is used, the regexp always matches and
+the string is split into individual characters. If the regexp has
+groupings, then the resulting list contains the matched substrings from the
+groupings as well. For instance,
+.PP
+.Vb 12
+\& $x = "/usr/bin/perl";
+\& @dirs = split m!/!, $x; # $dirs[0] = \*(Aq\*(Aq
+\& # $dirs[1] = \*(Aqusr\*(Aq
+\& # $dirs[2] = \*(Aqbin\*(Aq
+\& # $dirs[3] = \*(Aqperl\*(Aq
+\& @parts = split m!(/)!, $x; # $parts[0] = \*(Aq\*(Aq
+\& # $parts[1] = \*(Aq/\*(Aq
+\& # $parts[2] = \*(Aqusr\*(Aq
+\& # $parts[3] = \*(Aq/\*(Aq
+\& # $parts[4] = \*(Aqbin\*(Aq
+\& # $parts[5] = \*(Aq/\*(Aq
+\& # $parts[6] = \*(Aqperl\*(Aq
+.Ve
+.PP
+Since the first character of \f(CW$x\fR matched the regexp, \f(CW\*(C`split\*(C'\fR prepended
+an empty initial element to the list.
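+.PP
+The effect of the \f(CW\*(C`limit\*(C'\fR argument can be sketched as follows (the
+sample record and variable names are made up):
+.PP
+.Vb 4
+\& my $line = "root:x:0:0";
+\& my @all = split /:/, $line; # (\*(Aqroot\*(Aq, \*(Aqx\*(Aq, \*(Aq0\*(Aq, \*(Aq0\*(Aq)
+\& my @two = split /:/, $line, 2; # (\*(Aqroot\*(Aq, \*(Aqx:0:0\*(Aq)
+\& print scalar(@all), " ", scalar(@two), "\en"; # 4 2
+.Ve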
+.PP
+If you have read this far, congratulations! You now have all the basic
+tools needed to use regular expressions to solve a wide range of text
+processing problems. If this is your first time through the tutorial,
+why not stop here and play around with regexps a while.... Part\ 2
+concerns the more esoteric aspects of regular expressions and those
+concepts certainly aren't needed right at the start.
+.SH "Part 2: Power tools"
+.IX Header "Part 2: Power tools"
+OK, you know the basics of regexps and you want to know more. If
+matching regular expressions is analogous to a walk in the woods, then
+the tools discussed in Part 1 are analogous to topo maps and a
+compass, basic tools we use all the time. Most of the tools in part 2
+are analogous to flare guns and satellite phones. They aren't used
+too often on a hike, but when we are stuck, they can be invaluable.
+.PP
+What follows are the more advanced, less used, or sometimes esoteric
+capabilities of Perl regexps. In Part 2, we will assume you are
+comfortable with the basics and concentrate on the advanced features.
+.SS "More on characters, strings, and character classes"
+.IX Subsection "More on characters, strings, and character classes"
+There are a number of escape sequences and character classes that we
+haven't covered yet.
+.PP
+There are several escape sequences that convert characters or strings
+between upper and lower case, and they are also available within
+patterns. \f(CW\*(C`\el\*(C'\fR and \f(CW\*(C`\eu\*(C'\fR convert the next character to lower or
+upper case, respectively:
+.PP
+.Vb 4
+\& $x = "perl";
+\& $string =~ /\eu$x/; # matches \*(AqPerl\*(Aq in $string
+\& $x = "M(rs?|s)\e\e."; # note the double backslash
+\& $string =~ /\el$x/; # matches \*(Aqmr.\*(Aq, \*(Aqmrs.\*(Aq, and \*(Aqms.\*(Aq
+.Ve
+.PP
+A \f(CW\*(C`\eL\*(C'\fR or \f(CW\*(C`\eU\*(C'\fR indicates a lasting conversion of case, until
+terminated by \f(CW\*(C`\eE\*(C'\fR or overridden by another \f(CW\*(C`\eU\*(C'\fR or \f(CW\*(C`\eL\*(C'\fR:
+.PP
+.Vb 4
+\& $x = "This word is in lower case:\eL SHOUT\eE";
+\& $x =~ /shout/; # matches
+\& $x = "I STILL KEYPUNCH CARDS FOR MY 360";
+\& $x =~ /\eUkeypunch/; # matches punch card string
+.Ve
+.PP
+If there is no \f(CW\*(C`\eE\*(C'\fR, case is converted until the end of the
+string. The regexps \f(CW\*(C`\eL\eu$word\*(C'\fR or \f(CW\*(C`\eu\eL$word\*(C'\fR convert the first
+character of \f(CW$word\fR to uppercase and the rest of the characters to
+lowercase. (Beyond ASCII characters, it gets somewhat more complicated;
+\&\f(CW\*(C`\eu\*(C'\fR actually performs \fItitlecase\fR mapping, which for most characters
+is the same as uppercase, but not for all; see
+<https://unicode.org/faq/casemap_charprop.html#4>.)
+.PP
+Control characters can be escaped with \f(CW\*(C`\ec\*(C'\fR, so that a control-Z
+character would be matched with \f(CW\*(C`\ecZ\*(C'\fR. The escape sequence
+\&\f(CW\*(C`\eQ\*(C'\fR...\f(CW\*(C`\eE\*(C'\fR quotes, or protects, most non-alphabetic characters. For
+instance,
+.PP
+.Vb 2
+\& $x = "\eQThat !^*&%~& cat!";
+\& $x =~ /\eQ!^*&%~&\eE/; # check for rough language
+.Ve
+.PP
+It does not protect \f(CW\*(Aq$\*(Aq\fR or \f(CW\*(Aq@\*(Aq\fR, so that variables can still be
+substituted.
+.PP
+\&\f(CW\*(C`\eQ\*(C'\fR, \f(CW\*(C`\eL\*(C'\fR, \f(CW\*(C`\el\*(C'\fR, \f(CW\*(C`\eU\*(C'\fR, \f(CW\*(C`\eu\*(C'\fR and \f(CW\*(C`\eE\*(C'\fR are actually part of
+double-quotish syntax, and not part of regexp syntax proper. They will
+work if they appear in a regular expression embedded directly in a
+program, but not when contained in a string that is interpolated in a
+pattern.
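+.PP
+As a small sketch of this distinction (the variable names and strings are
+ours), \f(CW\*(C`\eQ\*(C'\fR written directly in the pattern quotes an interpolated
+variable, and the built-in \f(CW\*(C`quotemeta\*(C'\fR function does the same job for a
+string that is prepared ahead of time:
+.PP
+.Vb 5
+\& my $literal = "1+1=2";
+\& print "unquoted\en" if "1+1=2" =~ /^$literal$/; # prints nothing: + is a quantifier
+\& print "quoted\en" if "1+1=2" =~ /^\eQ$literal\eE$/; # quoted
+\& my $safe = quotemeta $literal; # 1\e+1\e=2
+\& print "quotemeta\en" if "1+1=2" =~ /^$safe$/; # quotemeta
+.Ve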
+.PP
+Perl regexps can handle more than just the
+standard ASCII character set. Perl supports \fIUnicode\fR, a standard
+for representing the alphabets from virtually all of the world's written
+languages, and a host of symbols. Perl's text strings are Unicode strings, so
+they can contain characters with a value (codepoint or character number) higher
+than 255.
+.PP
+What does this mean for regexps? Well, regexp users don't need to know
+much about Perl's internal representation of strings. But they do need
+to know 1) how to represent Unicode characters in a regexp and 2) that
+a matching operation will treat the string to be searched as a sequence
+of characters, not bytes. The answer to 1) is that Unicode characters
+greater than \f(CWchr(255)\fR are represented using the \f(CW\*(C`\ex{hex}\*(C'\fR notation, because
+\&\f(CW\*(C`\ex\*(C'\fR\fIXY\fR (without curly braces and \fIXY\fR are two hex digits) doesn't
+go further than 255. (Starting in Perl 5.14, if you're an octal fan,
+you can also use \f(CW\*(C`\eo{oct}\*(C'\fR.)
+.PP
+.Vb 2
+\& /\ex{263a}/; # match a Unicode smiley face :)
+\& /\ex{ 263a }/; # Same
+.Ve
+.PP
+\&\fBNOTE\fR: In Perl 5.6.0 it used to be that one needed to say \f(CW\*(C`use
+utf8\*(C'\fR to use any Unicode features. This is no longer the case: for
+almost all Unicode processing, the explicit \f(CW\*(C`utf8\*(C'\fR pragma is not
+needed. (The only case where it matters is when your Perl script itself
+is in Unicode and encoded in UTF\-8; then an explicit \f(CW\*(C`use utf8\*(C'\fR is needed.)
+.PP
+Figuring out the hexadecimal sequence of a Unicode character you want
+or deciphering someone else's hexadecimal Unicode regexp is about as
+much fun as programming in machine code. So another way to specify
+Unicode characters is to use the \fInamed character\fR escape
+sequence \f(CW\*(C`\eN{\fR\f(CIname\fR\f(CW}\*(C'\fR. \fIname\fR is a name for the Unicode character, as
+specified in the Unicode standard. For instance, if we wanted to
+represent or match the astrological sign for the planet Mercury, we
+could use
+.PP
+.Vb 3
+\& $x = "abc\eN{MERCURY}def";
+\& $x =~ /\eN{MERCURY}/; # matches
+\& $x =~ /\eN{ MERCURY }/; # Also matches
+.Ve
+.PP
+One can also use "short" names:
+.PP
+.Vb 2
+\& print "\eN{GREEK SMALL LETTER SIGMA} is called sigma.\en";
+\& print "\eN{greek:Sigma} is an upper\-case sigma.\en";
+.Ve
+.PP
+You can also restrict names to a certain alphabet by specifying the
+charnames pragma:
+.PP
+.Vb 2
+\& use charnames qw(greek);
+\& print "\eN{sigma} is Greek sigma\en";
+.Ve
+.PP
+An index of character names is available on-line from the Unicode
+Consortium, <https://www.unicode.org/charts/charindex.html>; explanatory
+material with links to other resources at
+<https://www.unicode.org/standard/where>.
+.PP
+Starting in Perl v5.32, an alternative to \f(CW\*(C`\eN{...}\*(C'\fR for full names is
+available, and that is to say
+.PP
+.Vb 1
+\& /\ep{Name=greek small letter sigma}/
+.Ve
+.PP
+The casing of the character name is irrelevant when used in \f(CW\*(C`\ep{}\*(C'\fR, as
+are most spaces, underscores and hyphens. (For a few outlier characters,
+not all of these can always be ignored. The details, which you can look
+up if you ever need them, are in
+<https://www.unicode.org/reports/tr44/tr44\-24.html#UAX44\-LM2>.)
+.PP
+The answer to requirement 2) is that a regexp (mostly)
+uses Unicode characters. The "mostly" is for messy backward
+compatibility reasons, but starting in Perl 5.14, any regexp compiled in
+the scope of a \f(CW\*(C`use feature \*(Aqunicode_strings\*(Aq\*(C'\fR (which is automatically
+turned on within the scope of a \f(CW\*(C`use v5.12\*(C'\fR or higher) will turn that
+"mostly" into "always". If you want to handle Unicode properly, you
+should ensure that \f(CW\*(Aqunicode_strings\*(Aq\fR is turned on.
+Internally, this is encoded to bytes using either UTF\-8 or a native 8
+bit encoding, depending on the history of the string, but conceptually
+it is a sequence of characters, not bytes. See perlunitut for a
+tutorial about that.
+.PP
+Let us now discuss Unicode character classes, most usually called
+"character properties". These are represented by the \f(CW\*(C`\ep{\fR\f(CIname\fR\f(CW}\*(C'\fR
+escape sequence. The negation of this is \f(CW\*(C`\eP{\fR\f(CIname\fR\f(CW}\*(C'\fR. For example,
+to match lower and uppercase characters,
+.PP
+.Vb 5
+\& $x = "BOB";
+\& $x =~ /^\ep{IsUpper}/; # matches, uppercase char class
+\& $x =~ /^\eP{IsUpper}/; # doesn\*(Aqt match, char class sans uppercase
+\& $x =~ /^\ep{IsLower}/; # doesn\*(Aqt match, lowercase char class
+\& $x =~ /^\eP{IsLower}/; # matches, char class sans lowercase
+.Ve
+.PP
+(The "\f(CW\*(C`Is\*(C'\fR" is optional.)
+.PP
+There are many, many Unicode character properties. For the full list
+see perluniprops. Most of them have synonyms with shorter names,
+also listed there. Some synonyms are a single character. For these,
+you can drop the braces. For instance, \f(CW\*(C`\epM\*(C'\fR is the same thing as
+\&\f(CW\*(C`\ep{Mark}\*(C'\fR, meaning things like accent marks.
+.PP
+The Unicode \f(CW\*(C`\ep{Script}\*(C'\fR and \f(CW\*(C`\ep{Script_Extensions}\*(C'\fR properties are
+used to categorize every Unicode character into the language script it
+is written in. For example,
+English, French, and a bunch of other European languages are written in
+the Latin script. But there is also the Greek script, the Thai script,
+the Katakana script, \fIetc\fR. (\f(CW\*(C`Script\*(C'\fR is an older, less advanced,
+form of \f(CW\*(C`Script_Extensions\*(C'\fR, retained only for backwards
+compatibility.) You can test whether a character is in a particular
+script with, for example \f(CW\*(C`\ep{Latin}\*(C'\fR, \f(CW\*(C`\ep{Greek}\*(C'\fR, or
+\&\f(CW\*(C`\ep{Katakana}\*(C'\fR. To test if it isn't in the Balinese script, you would
+use \f(CW\*(C`\eP{Balinese}\*(C'\fR. (These all use \f(CW\*(C`Script_Extensions\*(C'\fR under the
+hood, as that gives better results.)
+.PP
+What we have described so far is the single form of the \f(CW\*(C`\ep{...}\*(C'\fR character
+classes. There is also a compound form which you may run into. These
+look like \f(CW\*(C`\ep{\fR\f(CIname\fR\f(CW=\fR\f(CIvalue\fR\f(CW}\*(C'\fR or \f(CW\*(C`\ep{\fR\f(CIname\fR\f(CW:\fR\f(CIvalue\fR\f(CW}\*(C'\fR (the equals sign and colon
+can be used interchangeably). These are more general than the single form,
+and in fact most of the single forms are just Perl-defined shortcuts for common
+compound forms. For example, the script examples in the previous paragraph
+could be written equivalently as \f(CW\*(C`\ep{Script_Extensions=Latin}\*(C'\fR, \f(CW\*(C`\ep{Script_Extensions:Greek}\*(C'\fR,
+\&\f(CW\*(C`\ep{script_extensions=katakana}\*(C'\fR, and \f(CW\*(C`\eP{script_extensions=balinese}\*(C'\fR (case is irrelevant
+between the \f(CW\*(C`{}\*(C'\fR braces). You may
+never have to use the compound forms, but sometimes it is necessary, and their
+use can make your code easier to understand.
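+.PP
+The equivalence of the single and compound forms can be checked with a
+short sketch such as the one below. It assumes a perl recent enough to
+know the \f(CW\*(C`Script_Extensions\*(C'\fR property; the test character and the variable
+names are chosen only for illustration.
+.PP
```perl
use strict;
use warnings;

my $sigma = "\x{03C3}";    # GREEK SMALL LETTER SIGMA

# Single form, and the compound forms it abbreviates.
my $single    = $sigma =~ /^\p{Greek}\z/                   ? 1 : 0;
my $compound  = $sigma =~ /^\p{Script_Extensions=Greek}\z/ ? 1 : 0;
my $colon     = $sigma =~ /^\p{script_extensions:greek}\z/ ? 1 : 0;

# \P{...} is the negation: sigma is not in the Latin script.
my $not_latin = $sigma =~ /^\P{Latin}\z/ ? 1 : 0;

print "single=$single compound=$compound colon=$colon not_latin=$not_latin\n";
```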
+.PP
+\&\f(CW\*(C`\eX\*(C'\fR is an abbreviation for a character class that comprises
+a Unicode \fIextended grapheme cluster\fR. This represents a "logical character":
+what appears to be a single character, but may be represented internally by more
+than one. As an example, using the Unicode full names, "A\ +\ COMBINING\ RING" is a grapheme cluster with base character "A" and combining character
+"COMBINING\ RING", which translates in Danish to "A" with the circle atop it,
+as in the word Ångstrom.
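+.PP
+To make the difference between code points and logical characters
+concrete, here is a small sketch. It assumes the combining character
+meant above is U+030A (COMBINING RING ABOVE); the string and variable
+names are illustrative.
+.PP
```perl
use strict;
use warnings;

# 'A' followed by COMBINING RING ABOVE, then the rest of the word.
my $str = "A\x{030A}ngstrom";

my $code_points = length $str;         # counts code points
my @clusters    = $str =~ /\X/g;       # one element per grapheme cluster
my $graphemes   = scalar @clusters;

print "$code_points code points, $graphemes grapheme clusters\n";
```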
+.PP
+For the full and latest information about Unicode see the latest
+Unicode standard, or the Unicode Consortium's website <https://www.unicode.org>.
+.PP
+As if all those classes weren't enough, Perl also defines POSIX-style
+character classes. These have the form \f(CW\*(C`[:\fR\f(CIname\fR\f(CW:]\*(C'\fR, with \fIname\fR the
+name of the POSIX class. The POSIX classes are \f(CW\*(C`alpha\*(C'\fR, \f(CW\*(C`alnum\*(C'\fR,
+\&\f(CW\*(C`ascii\*(C'\fR, \f(CW\*(C`cntrl\*(C'\fR, \f(CW\*(C`digit\*(C'\fR, \f(CW\*(C`graph\*(C'\fR, \f(CW\*(C`lower\*(C'\fR, \f(CW\*(C`print\*(C'\fR, \f(CW\*(C`punct\*(C'\fR,
+\&\f(CW\*(C`space\*(C'\fR, \f(CW\*(C`upper\*(C'\fR, and \f(CW\*(C`xdigit\*(C'\fR, and two extensions, \f(CW\*(C`word\*(C'\fR (a Perl
+extension to match \f(CW\*(C`\ew\*(C'\fR), and \f(CW\*(C`blank\*(C'\fR (a GNU extension). The \f(CW\*(C`/a\*(C'\fR
+modifier restricts these to matching just in the ASCII range; otherwise
+they can match the same as their corresponding Perl Unicode classes:
+\&\f(CW\*(C`[:upper:]\*(C'\fR is the same as \f(CW\*(C`\ep{IsUpper}\*(C'\fR, \fIetc\fR. (There are some
+exceptions and gotchas with this; see perlrecharclass for a full
+discussion.) The \f(CW\*(C`[:digit:]\*(C'\fR, \f(CW\*(C`[:word:]\*(C'\fR, and
+\&\f(CW\*(C`[:space:]\*(C'\fR correspond to the familiar \f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\ew\*(C'\fR, and \f(CW\*(C`\es\*(C'\fR
+character classes. To negate a POSIX class, put a \f(CW\*(Aq^\*(Aq\fR in front of
+the name, so that, \fIe.g.\fR, \f(CW\*(C`[:^digit:]\*(C'\fR corresponds to \f(CW\*(C`\eD\*(C'\fR and, under
+Unicode, \f(CW\*(C`\eP{IsDigit}\*(C'\fR. The Unicode and POSIX character classes can
+be used just like \f(CW\*(C`\ed\*(C'\fR, with the exception that POSIX character
+classes can only be used inside of a character class:
+.PP
+.Vb 6
+\& /\es+[abc[:digit:]xyz]\es*/; # match a,b,c,x,y,z, or a digit
+\& /^=item\es[[:digit:]]/; # match \*(Aq=item\*(Aq,
+\& # followed by a space and a digit
+\& /\es+[abc\ep{IsDigit}xyz]\es+/; # match a,b,c,x,y,z, or a digit
+\& /^=item\es\ep{IsDigit}/; # match \*(Aq=item\*(Aq,
+\& # followed by a space and a digit
+.Ve
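+.PP
+The effect of the \f(CW\*(C`/a\*(C'\fR modifier on a POSIX class can be seen in the
+following sketch. It assumes Perl 5.14 or later (for \f(CW\*(C`/a\*(C'\fR) and uses
+U+0661, ARABIC\-INDIC DIGIT ONE, as a digit outside the ASCII range; the
+variable names are illustrative.
+.PP
```perl
use strict;
use warnings;

my $arabic_one = "\x{0661}";    # ARABIC-INDIC DIGIT ONE

# Without /a the POSIX class follows Unicode rules; with /a it is ASCII-only.
my $unicode_rules = $arabic_one =~ /^[[:digit:]]\z/  ? 1 : 0;
my $ascii_rules   = $arabic_one =~ /^[[:digit:]]\z/a ? 1 : 0;

# A negated POSIX class still has to sit inside a bracketed class.
my $negated = "x" =~ /^[[:^digit:]]\z/ ? 1 : 0;

print "unicode=$unicode_rules ascii=$ascii_rules negated=$negated\n";
```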
+.PP
+Whew! That is all the rest of the characters and character classes.
+.SS "Compiling and saving regular expressions"
+.IX Subsection "Compiling and saving regular expressions"
+In Part 1 we mentioned that Perl compiles a regexp into a compact
+sequence of opcodes. Thus, a compiled regexp is a data structure
+that can be stored once and used again and again. The regexp quote
+\&\f(CW\*(C`qr//\*(C'\fR does exactly that: \f(CW\*(C`qr/string/\*(C'\fR compiles the \f(CW\*(C`string\*(C'\fR as a
+regexp and transforms the result into a form that can be assigned to a
+variable:
+.PP
+.Vb 1
+\& $reg = qr/foo+bar?/; # reg contains a compiled regexp
+.Ve
+.PP
+Then \f(CW$reg\fR can be used as a regexp:
+.PP
+.Vb 3
+\& $x = "fooooba";
+\& $x =~ $reg; # matches, just like /foo+bar?/
+\& $x =~ /$reg/; # same thing, alternate form
+.Ve
+.PP
+\&\f(CW$reg\fR can also be interpolated into a larger regexp:
+.PP
+.Vb 1
+\& $x =~ /(abc)?$reg/; # still matches
+.Ve
+.PP
+As with the matching operator, the regexp quote can use different
+delimiters, \fIe.g.\fR, \f(CW\*(C`qr!!\*(C'\fR, \f(CW\*(C`qr{}\*(C'\fR or \f(CW\*(C`qr~~\*(C'\fR. Apostrophes
+as delimiters (\f(CW\*(C`qr\*(Aq\*(Aq\*(C'\fR) inhibit any interpolation.
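+.PP
+A short sketch of these points, with illustrative variable names: a
+\f(CW\*(C`qr//\*(C'\fR value is a reference blessed into the \f(CW\*(C`Regexp\*(C'\fR class, it can be
+used directly or interpolated, and modifiers given to \f(CW\*(C`qr//\*(C'\fR stay with
+the compiled regexp when it is interpolated into a larger pattern.
+.PP
```perl
use strict;
use warnings;

my $reg = qr/foo+bar?/;
my $x   = "fooooba";

my $kind     = ref $reg;                     # class of the compiled regexp
my $direct   = $x =~ $reg         ? 1 : 0;   # used on its own
my $embedded = $x =~ /(abc)?$reg/ ? 1 : 0;   # interpolated into a larger regexp

# The /i given to qr// applies only to the interpolated part.
my $yes     = qr/yes/i;
my $flagged = "Answer: YES" =~ /^Answer: $yes\z/ ? 1 : 0;
my $outside = "ANSWER: yes" =~ /^Answer: $yes\z/ ? 1 : 0;

print "$kind direct=$direct embedded=$embedded flagged=$flagged outside=$outside\n";
```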
+.PP
+Pre-compiled regexps are useful for creating dynamic matches that
+don't need to be recompiled each time they are encountered. Using
+pre-compiled regexps, we write a \f(CW\*(C`grep_step\*(C'\fR program which greps
+for a sequence of patterns, advancing to the next pattern as soon
+as one has been satisfied.
+.PP
+.Vb 4
+\& % cat > grep_step
+\& #!/usr/bin/perl
+\& # grep_step \- match <number> regexps, one after the other
+\& # usage: grep_step <number> regexp1 regexp2 ... file1 file2 ...
+\&
+\& $number = shift;
+\& $regexp[$_] = shift foreach (0..$number\-1);
+\& @compiled = map qr/$_/, @regexp;
+\& while ($line = <>) {
+\& if ($line =~ /$compiled[0]/) {
+\& print $line;
+\& shift @compiled;
+\& last unless @compiled;
+\& }
+\& }
+\& ^D
+\&
+\& % grep_step 3 shift print last grep_step
+\& $number = shift;
+\& print $line;
+\& last unless @compiled;
+.Ve
+.PP
+Storing pre-compiled regexps in an array \f(CW@compiled\fR allows us to
+simply loop through the regexps without any recompilation, thus gaining
+flexibility without sacrificing speed.
+.SS "Composing regular expressions at runtime"
+.IX Subsection "Composing regular expressions at runtime"
+Backtracking is more efficient than repeated tries with different regular
+expressions. If there are several regular expressions and a match with
+any of them is acceptable, then it is possible to combine them into a set
+of alternatives. If the individual expressions are input data, this
+can be done by programming a join operation. We'll exploit this idea in
+an improved version of the \f(CW\*(C`simple_grep\*(C'\fR program: a program that matches
+multiple patterns:
+.PP
+.Vb 4
+\& % cat > multi_grep
+\& #!/usr/bin/perl
+\& # multi_grep \- match any of <number> regexps
+\& # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ...
+\&
+\& $number = shift;
+\& $regexp[$_] = shift foreach (0..$number\-1);
+\& $pattern = join \*(Aq|\*(Aq, @regexp;
+\&
+\& while ($line = <>) {
+\& print $line if $line =~ /$pattern/;
+\& }
+\& ^D
+\&
+\& % multi_grep 2 shift for multi_grep
+\& $number = shift;
+\& $regexp[$_] = shift foreach (0..$number\-1);
+.Ve
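+.PP
+The join step can also be tried without the command line. The sketch
+below is a variation, not the program above: it wraps each alternative
+in \f(CW\*(C`(?:...)\*(C'\fR, an addition intended to keep an embedded modifier in one
+pattern from affecting the alternatives after it, and it compiles the
+result once with \f(CW\*(C`qr//\*(C'\fR. The sample lines and variable names are
+illustrative.
+.PP
```perl
use strict;
use warnings;

my @regexp = ('shift', 'for');

# Combine the patterns into one alternation and compile it once.
my $pattern  = join '|', map { "(?:$_)" } @regexp;
my $compiled = qr/$pattern/;

my @lines = (
    '$number = shift;',
    'print $line;',
    '$regexp[$_] = shift foreach (0..$number-1);',
);

# Keep every line that matches at least one of the alternatives.
my @hits = grep { $_ =~ $compiled } @lines;
print "$_\n" for @hits;
```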
+.PP
+Sometimes it is advantageous to construct a pattern from the \fIinput\fR
+that is to be analyzed and use the permissible values on the left
+hand side of the matching operations. As an example for this somewhat
+paradoxical situation, let's assume that our input contains a command
+verb which should match one out of a set of available command verbs,
+with the additional twist that commands may be abbreviated as long as
+the given string is unique. The program below demonstrates the basic
+algorithm.
+.PP
+.Vb 10
+\& % cat > keymatch
+\& #!/usr/bin/perl
+\& $kwds = \*(Aqcopy compare list print\*(Aq;
+\& while( $cmd = <> ){
+\& $cmd =~ s/^\es+|\es+$//g; # trim leading and trailing spaces
+\& if( ( @matches = $kwds =~ /\eb$cmd\ew*/g ) == 1 ){
+\& print "command: \*(Aq@matches\*(Aq\en";
+\& } elsif( @matches == 0 ){
+\& print "no such command: \*(Aq$cmd\*(Aq\en";
+\& } else {
+\& print "not unique: \*(Aq$cmd\*(Aq (could be one of: @matches)\en";
+\& }
+\& }
+\& ^D
+\&
+\& % keymatch
+\& li
+\& command: \*(Aqlist\*(Aq
+\& co
+\& not unique: \*(Aqco\*(Aq (could be one of: copy compare)
+\& printer
+\& no such command: \*(Aqprinter\*(Aq
+.Ve
+.PP
+Rather than trying to match the input against the keywords, we match the
+combined set of keywords against the input. The pattern matching
+operation \f(CW\*(C`$kwds\ =~\ /\eb$cmd\ew*/g\*(C'\fR does several things at the
+same time. It makes sure that the given command begins where a keyword
+begins (\f(CW\*(C`\eb\*(C'\fR). It tolerates abbreviations due to the added \f(CW\*(C`\ew*\*(C'\fR. It
+tells us the number of matches (\f(CW\*(C`scalar @matches\*(C'\fR) and all the keywords
+that were actually matched. You could hardly ask for more.
+.SS "Embedding comments and modifiers in a regular expression"
+.IX Subsection "Embedding comments and modifiers in a regular expression"
+Starting with this section, we will be discussing Perl's set of
+\&\fIextended patterns\fR. These are extensions to the traditional regular
+expression syntax that provide powerful new tools for pattern
+matching. We have already seen extensions in the form of the minimal
+matching constructs \f(CW\*(C`??\*(C'\fR, \f(CW\*(C`*?\*(C'\fR, \f(CW\*(C`+?\*(C'\fR, \f(CW\*(C`{n,m}?\*(C'\fR, \f(CW\*(C`{n,}?\*(C'\fR, and
+\&\f(CW\*(C`{,n}?\*(C'\fR. Most of the extensions below have the form \f(CW\*(C`(?char...)\*(C'\fR,
+where the \f(CW\*(C`char\*(C'\fR is a character that determines the type of extension.
+.PP
+The first extension is an embedded comment \f(CW\*(C`(?#text)\*(C'\fR. This embeds a
+comment into the regular expression without affecting its meaning. The
+comment should not have any closing parentheses in the text. An
+example is
+.PP
+.Vb 1
+\& /(?# Match an integer:)[+\-]?\ed+/;
+.Ve
+.PP
+This style of commenting has been largely superseded by the raw,
+freeform commenting that is allowed with the \f(CW\*(C`/x\*(C'\fR modifier.
+.PP
+Most modifiers, such as \f(CW\*(C`/i\*(C'\fR, \f(CW\*(C`/m\*(C'\fR, \f(CW\*(C`/s\*(C'\fR and \f(CW\*(C`/x\*(C'\fR (or any
+combination thereof) can also be embedded in
+a regexp using \f(CW\*(C`(?i)\*(C'\fR, \f(CW\*(C`(?m)\*(C'\fR, \f(CW\*(C`(?s)\*(C'\fR, and \f(CW\*(C`(?x)\*(C'\fR. For instance,
+.PP
+.Vb 7
+\& /(?i)yes/; # match \*(Aqyes\*(Aq case insensitively
+\& /yes/i; # same thing
+\& /(?x)( # freeform version of an integer regexp
+\& [+\-]? # match an optional sign
+\& \ed+ # match a sequence of digits
+\& )
+\& /x;
+.Ve
+.PP
+Embedded modifiers can have two important advantages over the usual
+modifiers. Embedded modifiers allow a custom set of modifiers for
+\&\fIeach\fR regexp pattern. This is great for matching an array of regexps
+that must have different modifiers:
+.PP
+.Vb 8
+\& $pattern[0] = \*(Aq(?i)doctor\*(Aq;
+\& $pattern[1] = \*(AqJohnson\*(Aq;
+\& ...
+\& while (<>) {
+\& foreach $patt (@pattern) {
+\& print if /$patt/;
+\& }
+\& }
+.Ve
+.PP
+The second advantage is that embedded modifiers (except \f(CW\*(C`/p\*(C'\fR, which
+modifies the entire regexp) only affect the regexp
+inside the group the embedded modifier is contained in. So grouping
+can be used to localize the modifier's effects:
+.PP
+.Vb 1
+\& /Answer: ((?i)yes)/; # matches \*(AqAnswer: yes\*(Aq, \*(AqAnswer: YES\*(Aq, etc.
+.Ve
+.PP
+Embedded modifiers can also turn off any modifiers already present
+by using, \fIe.g.\fR, \f(CW\*(C`(?\-i)\*(C'\fR. Modifiers can also be combined into
+a single expression, \fIe.g.\fR, \f(CW\*(C`(?s\-i)\*(C'\fR turns on single line mode and
+turns off case insensitivity.
+.PP
+Embedded modifiers may also be added to a non-capturing grouping.
+\&\f(CW\*(C`(?i\-m:regexp)\*(C'\fR is a non-capturing grouping that matches \f(CW\*(C`regexp\*(C'\fR
+case insensitively and turns off multi-line mode.
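+.PP
+The scoping rules for embedded modifiers can be verified with a few
+matches. This is only a sketch; the test strings and variable names are
+illustrative.
+.PP
```perl
use strict;
use warnings;

# (?i) inside a capturing group affects only that group.
my $scoped = qr/Answer: ((?i)yes)/;
my $m1 = "Answer: YES" =~ $scoped ? 1 : 0;
my $m2 = "ANSWER: yes" =~ $scoped ? 1 : 0;

# The same restriction, using a non-capturing group with modifiers.
my $group = qr/Answer: (?i:yes), sir/;
my $m3 = "Answer: Yes, sir" =~ $group ? 1 : 0;
my $m4 = "Answer: Yes, SIR" =~ $group ? 1 : 0;

# (?-i) switches off a modifier that was given on the outside.
my $off = qr/a(?-i)b/i;
my $m5 = "Ab" =~ $off ? 1 : 0;
my $m6 = "AB" =~ $off ? 1 : 0;

print "$m1 $m2 $m3 $m4 $m5 $m6\n";
```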
+.SS "Looking ahead and looking behind"
+.IX Subsection "Looking ahead and looking behind"
+This section concerns the lookahead and lookbehind assertions. First,
+a little background.
+.PP
+In Perl regular expressions, most regexp elements "eat up" a certain
+amount of string when they match. For instance, the regexp element
+\&\f(CW\*(C`[abc]\*(C'\fR eats up one character of the string when it matches, in the
+sense that Perl moves to the next character position in the string
+after the match. There are some elements, however, that don't eat up
+characters (advance the character position) if they match. The examples
+we have seen so far are the anchors. The anchor \f(CW\*(Aq^\*(Aq\fR matches the
+beginning of the line, but doesn't eat any characters. Similarly, the
+word boundary anchor \f(CW\*(C`\eb\*(C'\fR matches wherever a character matching \f(CW\*(C`\ew\*(C'\fR
+is next to a character that doesn't, but it doesn't eat up any
+characters itself. Anchors are examples of \fIzero-width assertions\fR:
+zero-width, because they consume
+no characters, and assertions, because they test some property of the
+string. In the context of our walk in the woods analogy to regexp
+matching, most regexp elements move us along a trail, but anchors have
+us stop a moment and check our surroundings. If the local environment
+checks out, we can proceed forward. But if the local environment
+doesn't satisfy us, we must backtrack.
+.PP
+Checking the environment entails either looking ahead on the trail,
+looking behind, or both. \f(CW\*(Aq^\*(Aq\fR looks behind, to see that there are no
+characters before. \f(CW\*(Aq$\*(Aq\fR looks ahead, to see that there are no
+characters after. \f(CW\*(C`\eb\*(C'\fR looks both ahead and behind, to see if the
+characters on either side differ in their "word-ness".
+.PP
+The lookahead and lookbehind assertions are generalizations of the
+anchor concept. Lookahead and lookbehind are zero-width assertions
+that let us specify which characters we want to test for. The
+lookahead assertion is denoted by \f(CW\*(C`(?=regexp)\*(C'\fR or (starting in 5.32,
+experimentally in 5.28) \f(CW\*(C`(*pla:regexp)\*(C'\fR or
+\&\f(CW\*(C`(*positive_lookahead:regexp)\*(C'\fR; and the lookbehind assertion is denoted
+by \f(CW\*(C`(?<=fixed\-regexp)\*(C'\fR or (starting in 5.32, experimentally in
+5.28) \f(CW\*(C`(*plb:fixed\-regexp)\*(C'\fR or \f(CW\*(C`(*positive_lookbehind:fixed\-regexp)\*(C'\fR.
+Some examples are
+.PP
+.Vb 8
+\& $x = "I catch the housecat \*(AqTom\-cat\*(Aq with catnip";
+\& $x =~ /cat(*pla:\es)/; # matches \*(Aqcat\*(Aq in \*(Aqhousecat\*(Aq
+\& @catwords = ($x =~ /(?<=\es)cat\ew+/g); # matches,
+\& # $catwords[0] = \*(Aqcatch\*(Aq
+\& # $catwords[1] = \*(Aqcatnip\*(Aq
+\& $x =~ /\ebcat\eb/; # matches \*(Aqcat\*(Aq in \*(AqTom\-cat\*(Aq
+\& $x =~ /(?<=\es)cat(?=\es)/; # doesn\*(Aqt match; no isolated \*(Aqcat\*(Aq in
+\& # middle of $x
+.Ve
+.PP
+Note that the parentheses in these are
+non-capturing, since these are zero-width assertions. Thus in the
+second regexp, the substrings captured are those of the whole regexp
+itself. Lookahead can match arbitrary regexps, but
+lookbehind prior to 5.30 \f(CW\*(C`(?<=fixed\-regexp)\*(C'\fR only works for regexps
+of fixed width, \fIi.e.\fR, a fixed number of characters long. Thus
+\&\f(CW\*(C`(?<=(ab|bc))\*(C'\fR is fine, but \f(CW\*(C`(?<=(ab)*)\*(C'\fR prior to 5.30 is not.
+.PP
+The negated versions of the lookahead and lookbehind assertions are
+denoted by \f(CW\*(C`(?!regexp)\*(C'\fR and \f(CW\*(C`(?<!fixed\-regexp)\*(C'\fR respectively.
+Or, starting in 5.32 (experimentally in 5.28), \f(CW\*(C`(*nla:regexp)\*(C'\fR,
+\&\f(CW\*(C`(*negative_lookahead:regexp)\*(C'\fR, \f(CW\*(C`(*nlb:regexp)\*(C'\fR, or
+\&\f(CW\*(C`(*negative_lookbehind:regexp)\*(C'\fR.
+They evaluate true if the regexps do \fInot\fR match:
+.PP
+.Vb 4
+\& $x = "foobar";
+\& $x =~ /foo(?!bar)/; # doesn\*(Aqt match, \*(Aqbar\*(Aq follows \*(Aqfoo\*(Aq
+\& $x =~ /foo(?!baz)/; # matches, \*(Aqbaz\*(Aq doesn\*(Aqt follow \*(Aqfoo\*(Aq
+\& $x =~ /(?<!\es)foo/; # matches, there is no \es before \*(Aqfoo\*(Aq
+.Ve
+.PP
+Here is an example where a string containing blank-separated words,
+numbers and single dashes is to be split into its components.
+Using \f(CW\*(C`/\es+/\*(C'\fR alone won't work, because spaces are not required between
+dashes, or between a word or number and a dash. Additional places for a split are established
+by looking ahead and behind:
+.PP
+.Vb 5
+\& $str = "one two \- \-\-6\-8";
+\& @toks = split / \es+ # a run of spaces
+\& | (?<=\eS) (?=\-) # any non\-space followed by \*(Aq\-\*(Aq
+\& | (?<=\-) (?=\eS) # a \*(Aq\-\*(Aq followed by any non\-space
+\& /x, $str; # @toks = qw(one two \- \- \- 6 \- 8)
+.Ve
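+.PP
+Because lookaround assertions match positions rather than characters,
+they also work well with substitutions that only insert text. The
+following sketch, which is not part of the original tutorial, inserts
+thousands separators into runs of digits. The \f(CW\*(C`commify\*(C'\fR name is
+hypothetical, and the sub is meant for integers only: it does not treat
+digits after a decimal point specially.
+.PP
```perl
use strict;
use warnings;

# Insert a comma at every position that has a digit behind it and a
# multiple of three digits (and then no further digit) ahead of it.
sub commify {
    my ($text) = @_;
    $text =~ s/(?<=\d)(?=(?:\d{3})+(?!\d))/,/g;
    return $text;
}

print commify("1234567"), "\n";
print commify("total 1000 of 123456"), "\n";
```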
+.SS "Using independent subexpressions to prevent backtracking"
+.IX Subsection "Using independent subexpressions to prevent backtracking"
+\&\fIIndependent subexpressions\fR (or atomic subexpressions) are regular
+expressions, in the context of a larger regular expression, that
+function independently of the larger regular expression. That is, they
+consume as much or as little of the string as they wish without regard
+for the ability of the larger regexp to match. Independent
+subexpressions are represented by
+\&\f(CW\*(C`(?>regexp)\*(C'\fR or (starting in 5.32, experimentally in 5.28)
+\&\f(CW\*(C`(*atomic:regexp)\*(C'\fR. We can illustrate their behavior by first
+considering an ordinary regexp:
+.PP
+.Vb 2
+\& $x = "ab";
+\& $x =~ /a*ab/; # matches
+.Ve
+.PP
+This obviously matches, but in the process of matching, the
+subexpression \f(CW\*(C`a*\*(C'\fR first grabbed the \f(CW\*(Aqa\*(Aq\fR. Doing so, however,
+wouldn't allow the whole regexp to match, so after backtracking, \f(CW\*(C`a*\*(C'\fR
+eventually gave back the \f(CW\*(Aqa\*(Aq\fR and matched the empty string. Here, what
+\&\f(CW\*(C`a*\*(C'\fR matched was \fIdependent\fR on what the rest of the regexp matched.
+.PP
+Contrast that with an independent subexpression:
+.PP
+.Vb 1
+\& $x =~ /(?>a*)ab/; # doesn\*(Aqt match!
+.Ve
+.PP
+The independent subexpression \f(CW\*(C`(?>a*)\*(C'\fR doesn't care about the rest
+of the regexp, so it sees an \f(CW\*(Aqa\*(Aq\fR and grabs it. Then the rest of the
+regexp \f(CW\*(C`ab\*(C'\fR cannot match. Because \f(CW\*(C`(?>a*)\*(C'\fR is independent, there
+is no backtracking and the independent subexpression does not give
+up its \f(CW\*(Aqa\*(Aq\fR. Thus the match of the regexp as a whole fails. A similar
+behavior occurs with completely independent regexps:
+.PP
+.Vb 3
+\& $x = "ab";
+\& $x =~ /a*/g; # matches, eats an \*(Aqa\*(Aq
+\& $x =~ /\eGab/g; # doesn\*(Aqt match, no \*(Aqa\*(Aq available
+.Ve
+.PP
+Here \f(CW\*(C`/g\*(C'\fR and \f(CW\*(C`\eG\*(C'\fR create a "tag team" handoff of the string from
+one regexp to the other. Regexps with an independent subexpression are
+much like this, with a handoff of the string to the independent
+subexpression, and a handoff of the string back to the enclosing
+regexp.
+.PP
+The ability of an independent subexpression to prevent backtracking
+can be quite useful. Suppose we want to match a non-empty string
+enclosed in parentheses up to two levels deep. Then the following
+regexp matches:
+.PP
+.Vb 2
+\& $x = "abc(de(fg)h"; # unbalanced parentheses
+\& $x =~ /\e( ( [ ^ () ]+ | \e( [ ^ () ]* \e) )+ \e)/xx;
+.Ve
+.PP
+The regexp matches an open parenthesis, one or more copies of an
+alternation, and a close parenthesis. The alternation is two-way, with
+the first alternative \f(CW\*(C`[^()]+\*(C'\fR matching a substring with no
+parentheses and the second alternative \f(CW\*(C`\e([^()]*\e)\*(C'\fR matching a
+substring delimited by parentheses. The problem with this regexp is
+that it is pathological: it has nested indeterminate quantifiers
+of the form \f(CW\*(C`(a+|b)+\*(C'\fR. We discussed in Part 1 how nested quantifiers
+like this could take an exponentially long time to execute if there
+is no match possible. To prevent the exponential blowup, we need to
+prevent useless backtracking at some point. This can be done by
+enclosing the inner quantifier as an independent subexpression:
+.PP
+.Vb 1
+\& $x =~ /\e( ( (?> [ ^ () ]+ ) | \e([ ^ () ]* \e) )+ \e)/xx;
+.Ve
+.PP
+Here, \f(CW\*(C`(?>[^()]+)\*(C'\fR breaks the degeneracy of string partitioning
+by gobbling up as much of the string as possible and keeping it. Then
+match failures fail much more quickly.
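+.PP
+The behavior described in this section can be checked with a short
+sketch. The possessive quantifier \f(CW\*(C`a*+\*(C'\fR is used here as an equivalent
+spelling of \f(CW\*(C`(?>a*)\*(C'\fR, and the nested pattern is written with \f(CW\*(C`/x\*(C'\fR and no
+blanks inside the bracketed classes so that it does not depend on \f(CW\*(C`/xx\*(C'\fR;
+the variable names are illustrative.
+.PP
```perl
use strict;
use warnings;

my $x = "ab";
my $plain      = $x =~ /a*ab/     ? 1 : 0;   # a* gives back its 'a'
my $atomic     = $x =~ /(?>a*)ab/ ? 1 : 0;   # (?>a*) keeps its 'a'
my $possessive = $x =~ /a*+ab/    ? 1 : 0;   # same idea, possessive form

# Parentheses up to two levels deep, with the inner run made atomic.
my $nested = qr/\( (?: (?> [^()]+ ) | \( [^()]* \) )+ \)/x;
my ($unbalanced) = "abc(de(fg)h" =~ /($nested)/;
my ($balanced)   = "x(a(b)c)y"   =~ /($nested)/;

print "plain=$plain atomic=$atomic possessive=$possessive\n";
print "unbalanced: $unbalanced, balanced: $balanced\n";
```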
+.SS "Conditional expressions"
+.IX Subsection "Conditional expressions"
+A \fIconditional expression\fR is a form of if-then-else statement
+that allows one to choose which patterns are to be matched, based on
+some condition. There are two types of conditional expression:
+\&\f(CW\*(C`(?(\fR\f(CIcondition\fR\f(CW)\fR\f(CIyes\-regexp\fR\f(CW)\*(C'\fR and
+\&\f(CW\*(C`(?(condition)\fR\f(CIyes\-regexp\fR\f(CW|\fR\f(CIno\-regexp\fR\f(CW)\*(C'\fR.
+\&\f(CW\*(C`(?(\fR\f(CIcondition\fR\f(CW)\fR\f(CIyes\-regexp\fR\f(CW)\*(C'\fR is
+like an \f(CW\*(Aqif\ ()\ {}\*(Aq\fR statement in Perl. If the \fIcondition\fR is true,
+the \fIyes-regexp\fR will be matched. If the \fIcondition\fR is false, the
+\&\fIyes-regexp\fR will be skipped and Perl will move onto the next regexp
+element. The second form is like an \f(CW\*(Aqif\ ()\ {}\ else\ {}\*(Aq\fR statement
+in Perl. If the \fIcondition\fR is true, the \fIyes-regexp\fR will be
+matched, otherwise the \fIno-regexp\fR will be matched.
+.PP
+The \fIcondition\fR can have several forms. The first form is simply an
+integer in parentheses \f(CW\*(C`(\fR\f(CIinteger\fR\f(CW)\*(C'\fR. It is true if the corresponding
+backreference \f(CW\*(C`\e\fR\f(CIinteger\fR\f(CW\*(C'\fR matched earlier in the regexp. The same
+thing can be done with a name associated with a capture group, written
+as \f(CW\*(C`(<\fR\f(CIname\fR\f(CW>)\*(C'\fR or \f(CW\*(C`(\*(Aq\fR\f(CIname\fR\f(CW\*(Aq)\*(C'\fR. The second form is a bare
+zero-width assertion \f(CW\*(C`(?...)\*(C'\fR, either a lookahead, a lookbehind, or a
+code assertion (discussed in the next section). The third set of forms
+provides tests that return true if the expression is executed within
+a recursion (\f(CW\*(C`(R)\*(C'\fR) or is being called from some capturing group,
+referenced either by number (\f(CW\*(C`(R1)\*(C'\fR, \f(CW\*(C`(R2)\*(C'\fR,...) or by name
+(\f(CW\*(C`(R&\fR\f(CIname\fR\f(CW)\*(C'\fR).
+.PP
+The integer or name form of the \f(CW\*(C`condition\*(C'\fR allows us to choose,
+with more flexibility, what to match based on what matched earlier in the
+regexp. This searches for words of the form \f(CW"$x$x"\fR or \f(CW"$x$y$y$x"\fR:
+.PP
+.Vb 9
+\& % simple_grep \*(Aq^(\ew+)(\ew+)?(?(2)\eg2\eg1|\eg1)$\*(Aq /usr/dict/words
+\& beriberi
+\& coco
+\& couscous
+\& deed
+\& ...
+\& toot
+\& toto
+\& tutu
+.Ve
+.PP
+The lookbehind \f(CW\*(C`condition\*(C'\fR allows, along with backreferences,
+an earlier part of the match to influence a later part of the
+match. For instance,
+.PP
+.Vb 1
+\& /[ATGC]+(?(?<=AA)G|C)$/;
+.Ve
+.PP
+matches a DNA sequence such that it either ends in \f(CW\*(C`AAG\*(C'\fR, or some
+other base pair combination and \f(CW\*(AqC\*(Aq\fR. Note that the form is
+\&\f(CW\*(C`(?(?<=AA)G|C)\*(C'\fR and not \f(CW\*(C`(?((?<=AA))G|C)\*(C'\fR; for the
+lookahead, lookbehind or code assertions, the parentheses around the
+conditional are not needed.
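+.PP
+For readers without a words file at hand, the integer form of the
+condition can be tried on a small list. This is a sketch only; the word
+list is made up for the purpose.
+.PP
```perl
use strict;
use warnings;

# If group 2 took part in the match, require $2$1; otherwise require $1 again.
my $re = qr/^(\w+)(\w+)?(?(2)\g2\g1|\g1)$/;

my @words = qw(beriberi coco deed perl toot banana);
my @hits  = grep { $_ =~ $re } @words;

print "@hits\n";
```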
+.SS "Defining named patterns"
+.IX Subsection "Defining named patterns"
+Some regular expressions use identical subpatterns in several places.
+Starting with Perl 5.10, it is possible to define named subpatterns in
+a section of the pattern so that they can be called up by name
+anywhere in the pattern. The syntactic pattern for this definition
+group is \f(CW\*(C`(?(DEFINE)(?<\fR\f(CIname\fR\f(CW>\fR\f(CIpattern\fR\f(CW)...)\*(C'\fR. An insertion
+of a named pattern is written as \f(CW\*(C`(?&\fR\f(CIname\fR\f(CW)\*(C'\fR.
+.PP
+The example below illustrates this feature using the pattern for
+floating point numbers that was presented earlier on. The three
+subpatterns that are used more than once are the optional sign, the
+digit sequence for an integer and the decimal fraction. The \f(CW\*(C`DEFINE\*(C'\fR
+group at the end of the pattern contains their definition. Notice
+that the decimal fraction pattern is the first place where we can
+reuse the integer pattern.
+.PP
+.Vb 8
+\& /^ (?&osg)\e * ( (?&int)(?&dec)? | (?&dec) )
+\& (?: [eE](?&osg)(?&int) )?
+\& $
+\& (?(DEFINE)
+\& (?<osg>[\-+]?) # optional sign
+\& (?<int>\ed++) # integer
+\& (?<dec>\e.(?&int)) # decimal fraction
+\& )/x
+.Ve
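+.PP
+To see the definition group at work, the pattern can be compiled with
+\f(CW\*(C`qr//\*(C'\fR and applied to a few strings. The sample strings below are
+illustrative and are not taken from the original text.
+.PP
```perl
use strict;
use warnings;

my $float = qr/^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
                 (?: [eE](?&osg)(?&int) )?
               $
               (?(DEFINE)
                 (?<osg>[-+]?)         # optional sign
                 (?<int>\d++)          # integer
                 (?<dec>\.(?&int))     # decimal fraction
               )/x;

my @candidates = ('3.14', '-12', '+.5e-3', '- 7', '1e', 'abc', '1.', '');
my @accepted   = grep { $_ =~ $float } @candidates;

print "accepted: '$_'\n" for @accepted;
```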
+.SS "Recursive patterns"
+.IX Subsection "Recursive patterns"
+This feature (introduced in Perl 5.10) significantly extends the
+power of Perl's pattern matching. By referring to some other
+capture group anywhere in the pattern with the construct
+\&\f(CW\*(C`(?\fR\f(CIgroup\-ref\fR\f(CW)\*(C'\fR, the \fIpattern\fR within the referenced group is used
+as an independent subpattern in place of the group reference itself.
+Because the group reference may be contained \fIwithin\fR the group it
+refers to, it is now possible to apply pattern matching to tasks that
+hitherto required a recursive parser.
+.PP
+To illustrate this feature, we'll design a pattern that matches if
+a string is a palindrome. (This is a word or a sentence that,
+while ignoring spaces, punctuation and case, reads the same backwards
+as forwards.) We begin by observing that the empty string or a string
+containing just one word character is a palindrome. Otherwise it must
+have a word character up front and the same at its end, with another
+palindrome in between.
+.PP
+.Vb 1
+\& /(?: (\ew) (?...Here be a palindrome...) \eg{ \-1 } | \ew? )/x
+.Ve
+.PP
+Adding \f(CW\*(C`\eW*\*(C'\fR at either end to eliminate what is to be ignored, we already
+have the full pattern:
+.PP
+.Vb 4
+\& my $pp = qr/^(\eW* (?: (\ew) (?1) \eg{\-1} | \ew? ) \eW*)$/ix;
+\& for $s ( "saippuakauppias", "A man, a plan, a canal: Panama!" ){
+\& print "\*(Aq$s\*(Aq is a palindrome\en" if $s =~ /$pp/;
+\& }
+.Ve
+.PP
+In \f(CW\*(C`(?...)\*(C'\fR both absolute and relative backreferences may be used.
+The entire pattern can be reinserted with \f(CW\*(C`(?R)\*(C'\fR or \f(CW\*(C`(?0)\*(C'\fR.
+If you prefer to name your groups, you can use \f(CW\*(C`(?&\fR\f(CIname\fR\f(CW)\*(C'\fR to
+recurse into that group.
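+.PP
+Another task that traditionally called for a recursive parser is
+checking that parentheses are balanced to any depth. The sketch below is
+not part of the original text; it recurses into group 1 with \f(CW\*(C`(?1)\*(C'\fR, and
+the test strings are illustrative.
+.PP
```perl
use strict;
use warnings;

# Group 1 is an open parenthesis, then any mix of non-parenthesis runs
# and nested copies of group 1 itself, then a close parenthesis.
my $balanced = qr/^ ( \( (?: [^()]++ | (?1) )* \) ) $/x;

my @candidates = ('(a(b)(c(d)))', '()', '(a(b)', '(()', 'a(b)');
my @ok = grep { $_ =~ $balanced } @candidates;

print "balanced: $_\n" for @ok;
```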
+.SS "A bit of magic: executing Perl code in a regular expression"
+.IX Subsection "A bit of magic: executing Perl code in a regular expression"
+Normally, regexps are a part of Perl expressions.
+\&\fICode evaluation\fR expressions turn that around by allowing
+arbitrary Perl code to be a part of a regexp. A code evaluation
+expression is denoted \f(CW\*(C`(?{\fR\f(CIcode\fR\f(CW})\*(C'\fR, with \fIcode\fR a string of Perl
+statements.
+.PP
+Code expressions are zero-width assertions, and the value they return
+depends on their environment. There are two possibilities: either the
+code expression is used as a conditional in a conditional expression
+\&\f(CW\*(C`(?(\fR\f(CIcondition\fR\f(CW)...)\*(C'\fR, or it is not. If the code expression is a
+conditional, the code is evaluated and the result (\fIi.e.\fR, the result of
+the last statement) is used to determine truth or falsehood. If the
+code expression is not used as a conditional, the assertion always
+evaluates true and the result is put into the special variable
+\&\f(CW$^R\fR. The variable \f(CW$^R\fR can then be used in code expressions later
+in the regexp. Here are some silly examples:
+.PP
+.Vb 5
+\& $x = "abcdef";
+\& $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
+\& # prints \*(AqHi Mom!\*(Aq
+\& $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn\*(Aqt match,
+\& # no \*(AqHi Mom!\*(Aq
+.Ve
+.PP
+Pay careful attention to the next example:
+.PP
+.Vb 3
+\& $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn\*(Aqt match,
+\& # no \*(AqHi Mom!\*(Aq
+\& # but why not?
+.Ve
+.PP
+At first glance, you'd think that it shouldn't print, because obviously
+the \f(CW\*(C`ddd\*(C'\fR isn't going to match the target string. But look at this
+example:
+.PP
+.Vb 2
+\& $x =~ /abc(?{print "Hi Mom!";})[dD]dd/; # doesn\*(Aqt match,
+\& # but _does_ print
+.Ve
+.PP
+Hmm. What happened here? If you've been following along, you know that
+the above pattern should be effectively (almost) the same as the last one;
+enclosing the \f(CW\*(Aqd\*(Aq\fR in a character class isn't going to change what it
+matches. So why does the first not print while the second one does?
+.PP
+The answer lies in the optimizations the regexp engine makes. In the first
+case, all the engine sees are plain old characters (aside from the
+\&\f(CW\*(C`?{}\*(C'\fR construct). It's smart enough to realize that the string \f(CW\*(Aqddd\*(Aq\fR
+doesn't occur in our target string before actually running the pattern
+through. But in the second case, we've tricked it into thinking that our
+pattern is more complicated. It takes a look, sees our
+character class, and decides that it will have to actually run the
+pattern to determine whether or not it matches, and in the process of
+running it hits the print statement before it discovers that we don't
+have a match.
+.PP
+To take a closer look at how the engine does optimizations, see the
+section "Pragmas and debugging" below.
+.PP
+More fun with \f(CW\*(C`?{}\*(C'\fR:
+.PP
+.Vb 6
+\& $x =~ /(?{print "Hi Mom!";})/; # matches,
+\& # prints \*(AqHi Mom!\*(Aq
+\& $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
+\& # prints \*(Aq1\*(Aq
+\& $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
+\& # prints \*(Aq1\*(Aq
+.Ve
+.PP
+The bit of magic mentioned in the section title occurs when the regexp
+backtracks in the process of searching for a match. If the regexp
+backtracks over a code expression and if the variables used within are
+localized using \f(CW\*(C`local\*(C'\fR, the changes in the variables produced by the
+code expression are undone! Thus, if we wanted to count how many times
+a character got matched inside a group, we could use, \fIe.g.\fR,
+.PP
+.Vb 11
+\& $x = "aaaa";
+\& $count = 0; # initialize \*(Aqa\*(Aq count
+\& $c = "bob"; # test if $c gets clobbered
+\& $x =~ /(?{local $c = 0;}) # initialize count
+\& ( a # match \*(Aqa\*(Aq
+\& (?{local $c = $c + 1;}) # increment count
+\& )* # do this any number of times,
+\& aa # but match \*(Aqaa\*(Aq at the end
+\& (?{$count = $c;}) # copy local $c var into $count
+\& /x;
+\& print "\*(Aqa\*(Aq count is $count, \e$c variable is \*(Aq$c\*(Aq\en";
+.Ve
+.PP
+This prints
+.PP
+.Vb 1
+\& \*(Aqa\*(Aq count is 2, $c variable is \*(Aqbob\*(Aq
+.Ve
+.PP
+If we replace the \f(CW\*(C`\ (?{local\ $c\ =\ $c\ +\ 1;})\*(C'\fR with
+\&\f(CW\*(C`\ (?{$c\ =\ $c\ +\ 1;})\*(C'\fR, the variable changes are \fInot\fR undone
+during backtracking, and we get
+.PP
+.Vb 1
+\& \*(Aqa\*(Aq count is 4, $c variable is \*(Aqbob\*(Aq
+.Ve
+.PP
+Note that only localized variable changes are undone. Other side
+effects of code expression execution are permanent. Thus
+.PP
+.Vb 2
+\& $x = "aaaa";
+\& $x =~ /(a(?{print "Yow\en";}))*aa/;
+.Ve
+.PP
+produces
+.PP
+.Vb 4
+\& Yow
+\& Yow
+\& Yow
+\& Yow
+.Ve
+.PP
+The result \f(CW$^R\fR is automatically localized, so that it will behave
+properly in the presence of backtracking.
+.PP
+This example uses a code expression in a conditional to match a
+definite article, either \f(CW\*(Aqthe\*(Aq\fR in English or \f(CW\*(Aqder|die|das\*(Aq\fR in
+German:
+.PP
+.Vb 11
+\& $lang = \*(AqDE\*(Aq; # use German
+\& ...
+\& $text = "das";
+\& print "matched\en"
+\& if $text =~ /(?(?{
+\& $lang eq \*(AqEN\*(Aq; # is the language English?
+\& })
+\& the | # if so, then match \*(Aqthe\*(Aq
+\& (der|die|das) # else, match \*(Aqder|die|das\*(Aq
+\& )
+\& /xi;
+.Ve
+.PP
+Note that the syntax here is \f(CW\*(C`(?(?{...})\fR\f(CIyes\-regexp\fR\f(CW|\fR\f(CIno\-regexp\fR\f(CW)\*(C'\fR, not
+\&\f(CW\*(C`(?((?{...}))\fR\f(CIyes\-regexp\fR\f(CW|\fR\f(CIno\-regexp\fR\f(CW)\*(C'\fR. In other words, in the case of a
+code expression, we don't need the extra parentheses around the
+conditional.
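+.PP
+As a minimal sketch of the same idea, the conditional can be wrapped in a
+small subroutine and tried for both languages. The name \f(CW\*(C`is_article\*(C'\fR
+and the package variable \f(CW$lang\fR are made up for this illustration, and the
+anchors \f(CW\*(C`\eA\*(C'\fR and \f(CW\*(C`\ez\*(C'\fR are added so that only a whole word counts:

```perl
use strict;
use warnings;

our $lang;    # package variable read by the code expression

# Hypothetical helper: true if $text is a definite article in $language.
sub is_article {
    my ($language, $text) = @_;
    local $lang = $language;
    return $text =~ /\A
                     (?(?{ $lang eq 'EN' })   # is the language English?
                         the                  # if so, match 'the'
                       | (?:der|die|das)      # else match 'der', 'die' or 'das'
                     )
                     \z/xi ? 1 : 0;
}

for my $case (['EN', 'the'], ['EN', 'das'], ['DE', 'Das'], ['DE', 'the']) {
    my ($language, $text) = @$case;
    printf "%s %-3s %s\n", $language, $text,
        is_article($language, $text) ? 'matched' : 'no match';
}
```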
+.PP
+If you try to use code expressions where the code text is contained within
+an interpolated variable, rather than appearing literally in the pattern,
+Perl may surprise you:
+.PP
+.Vb 5
+\& $bar = 5;
+\& $pat = \*(Aq(?{ 1 })\*(Aq;
+\& /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
+\& /foo(?{ 1 })$bar/; # compiles ok, $bar interpolated
+\& /foo${pat}bar/; # compile error!
+\&
+\& $pat = qr/(?{ $foo = 1 })/; # precompile code regexp
+\& /foo${pat}bar/; # compiles ok
+.Ve
+.PP
+If a regexp has a variable that interpolates a code expression, Perl
+treats the regexp as an error. If the code expression is precompiled into
+a variable, however, interpolating is ok. The question is, why is this an
+error?
+.PP
+The reason is that variable interpolation and code expressions
+together pose a security risk. The combination is dangerous because
+programmers who write search engines often take user input and
+plug it directly into a regexp:
+.PP
+.Vb 3
+\& $regexp = <>; # read user\-supplied regexp
+\& chomp $regexp; # get rid of possible newline
+\& $text =~ /$regexp/; # search $text for the $regexp
+.Ve
+.PP
+If the \f(CW$regexp\fR variable contains a code expression, the user could
+then execute arbitrary Perl code. For instance, some joker could
+search for \f(CW\*(C`(?{\ system(\*(Aqrm\ \-rf\ *\*(Aq);\ })\*(C'\fR to erase your files. In this
+sense, the combination of interpolation and code expressions \fItaints\fR
+your regexp. So by default, using both interpolation and code
+expressions in the same regexp is not allowed. If you're not
+concerned about malicious users, it is possible to bypass this
+security check by invoking \f(CW\*(C`use\ re\ \*(Aqeval\*(Aq\*(C'\fR:
+.PP
+.Vb 4
+\& use re \*(Aqeval\*(Aq; # throw caution out the door
+\& $bar = 5;
+\& $pat = \*(Aq(?{ 1 })\*(Aq;
+\& /foo${pat}bar/; # compiles ok
+.Ve
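+.PP
+If the user input is meant to be searched for literally rather than as a
+pattern, a more cautious route is to leave the check in place and quote
+the input. The sketch below uses made-up sample strings and assumes that
+\f(CW\*(C`use\ re\ \*(Aqeval\*(Aq\*(C'\fR is not in effect. It shows the interpolated code
+expression being rejected at run time, and the same text matching
+literally once it is protected with \f(CW\*(C`\eQ...\eE\*(C'\fR:

```perl
use strict;
use warnings;

my $input = '(?{ die "code ran" })';                 # hostile "search term"
my $text  = 'literal (?{ die "code ran" }) text';

# Interpolating the raw input is refused, so the block never runs.
my $raw_ok = eval { my $found = $text =~ /$input/; 1 };
my $error  = $@;
print "raw interpolation refused: $error" unless $raw_ok;

# \Q...\E (quotemeta) escapes every metacharacter, so the input is
# matched as plain text and no code expression is ever compiled.
my $quoted_match = $text =~ /\Q$input\E/ ? 1 : 0;
print "quoted input matched literally\n" if $quoted_match;
```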
+.PP
+Another form of code expression is the \fIpattern code expression\fR.
+The pattern code expression is like a regular code expression, except
+that the result of the code evaluation is treated as a regular
+expression and matched immediately. A simple example is
+.PP
+.Vb 4
+\& $length = 5;
+\& $char = \*(Aqa\*(Aq;
+\& $x = \*(Aqaaaaabb\*(Aq;
+\& $x =~ /(??{$char x $length})/x; # matches, there are 5 of \*(Aqa\*(Aq
+.Ve
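+.PP
+Because the code runs during the match, a pattern code expression can
+build on what has already been captured. The following sketch, whose
+helper name \f(CW\*(C`count_ok\*(C'\fR and test strings are invented for illustration,
+checks that a leading number is followed by exactly that many \f(CW\*(Aqx\*(Aq\fR
+characters:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $s is a number N followed by exactly N 'x's.
sub count_ok {
    my ($s) = @_;
    # (??{ ... }) returns the string "x{N}", which is then compiled
    # and matched as a regexp at this point in the pattern.
    return $s =~ /^(\d+)(??{ 'x{' . $1 . '}' })$/ ? 1 : 0;
}

print "$_: ", (count_ok($_) ? "ok" : "not ok"), "\n" for qw(3xxx 3xx 2xx);
```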
+.PP
+This final example contains both ordinary and pattern code
+expressions. It detects whether a binary string \f(CW1101010010001...\fR has a
+Fibonacci spacing 0,1,1,2,3,5,... of the \f(CW\*(Aq1\*(Aq\fR's:
+.PP
+.Vb 12
+\& $x = "1101010010001000001";
+\& $z0 = \*(Aq\*(Aq; $z1 = \*(Aq0\*(Aq; # initial conditions
+\& print "It is a Fibonacci sequence\en"
+\& if $x =~ /^1 # match an initial \*(Aq1\*(Aq
+\& (?:
+\& ((??{ $z0 })) # match some \*(Aq0\*(Aq
+\& 1 # and then a \*(Aq1\*(Aq
+\& (?{ $z0 = $z1; $z1 .= $^N; })
+\& )+ # repeat as needed
+\& $ # that is all there is
+\& /x;
+\& printf "Largest sequence matched was %d\en", length($z1)\-length($z0);
+.Ve
+.PP
+Remember that \f(CW$^N\fR is set to whatever was matched by the last
+completed capture group. This prints
+.PP
+.Vb 2
+\& It is a Fibonacci sequence
+\& Largest sequence matched was 5
+.Ve
+.PP
+Ha! Try that with your garden variety regexp package...
+.PP
+Note that the variables \f(CW$z0\fR and \f(CW$z1\fR are not substituted when the
+regexp is compiled, as happens for ordinary variables outside a code
+expression. Rather, the whole code block is parsed as perl code at the
+same time as perl is compiling the code containing the literal regexp
+pattern.
+.PP
+This regexp without the \f(CW\*(C`/x\*(C'\fR modifier is
+.PP
+.Vb 1
+\& /^1(?:((??{ $z0 }))1(?{ $z0 = $z1; $z1 .= $^N; }))+$/
+.Ve
+.PP
+which shows that spaces are still possible in the code parts. Nevertheless,
+when working with code and conditional expressions, the extended form of
+regexps is almost necessary in creating and debugging regexps.
+.SS "Backtracking control verbs"
+.IX Subsection "Backtracking control verbs"
+Perl 5.10 introduced a number of control verbs intended to provide
+detailed control over the backtracking process, by directly influencing
+the regexp engine and by providing monitoring techniques. See
+"Special Backtracking Control Verbs" in perlre for a detailed
+description.
+.PP
+Below is just one example, illustrating the control verb \f(CW\*(C`(*FAIL)\*(C'\fR,
+which may be abbreviated as \f(CW\*(C`(*F)\*(C'\fR. When the regexp engine reaches
+this verb, the match fails at that point, just as it would at a
+mismatch between the pattern and the string. Processing
+of the regexp then continues as it would after any "normal"
+failure, so that, for instance, the next position in the string or another
+alternative will be tried. As failing to match doesn't preserve capture
+groups or produce results, it may be necessary to use this in
+combination with embedded code.
+.PP
+.Vb 4
+\& %count = ();
+\& "supercalifragilisticexpialidocious" =~
+\& /([aeiou])(?{ $count{$1}++; })(*FAIL)/i;
+\& printf "%3d \*(Aq%s\*(Aq\en", $count{$_}, $_ for (sort keys %count);
+.Ve
+.PP
+The pattern begins with a class matching a subset of letters. Whenever
+this matches, a statement like \f(CW\*(C`$count{\*(Aqa\*(Aq}++;\*(C'\fR is executed, incrementing
+the letter's counter. Then \f(CW\*(C`(*FAIL)\*(C'\fR does what it says, and
+the regexp engine proceeds according to the book: as long as the end of
+the string hasn't been reached, the position is advanced before looking
+for another vowel. Thus, match or no match makes no difference, and the
+regexp engine proceeds until the entire string has been inspected.
+(It's remarkable that an alternative solution using something like
+.PP
+.Vb 2
+\& $count2{lc($_)}++ for split(\*(Aq\*(Aq, "supercalifragilisticexpialidocious");
+\& printf "%3d \*(Aq%s\*(Aq\en", $count2{$_}, $_ for ( qw{ a e i o u } );
+.Ve
+.PP
+is considerably slower.)
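+.PP
+The same combination of embedded code and \f(CW\*(C`(*FAIL)\*(C'\fR can also enumerate
+every way a pattern could match, since each forced failure makes the
+engine backtrack into the next possibility. This is only a small sketch;
+the sample string and the array \f(CW@found\fR are assumptions of the example:

```perl
use strict;
use warnings;

my @found;
"abc" =~ /(\w+)(?{ push @found, $1 })(*FAIL)/;

# Greedy \w+ is tried longest first at each start position.
print join(' ', @found), "\n";    # prints "abc ab a bc b c"
```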
+.SS "Pragmas and debugging"
+.IX Subsection "Pragmas and debugging"
+Speaking of debugging, there are several pragmas available to control
+and debug regexps in Perl. We have already encountered one pragma in
+the previous section, \f(CW\*(C`use\ re\ \*(Aqeval\*(Aq;\*(C'\fR, that allows variable
+interpolation and code expressions to coexist in a regexp. The other
+pragmas are
+.PP
+.Vb 3
+\& use re \*(Aqtaint\*(Aq;
+\& $tainted = <>;
+\& @parts = ($tainted =~ /(\ew+)\es+(\ew+)/); # @parts is now tainted
+.Ve
+.PP
+The \f(CW\*(C`taint\*(C'\fR pragma causes any substrings from a match with a tainted
+variable to be tainted as well, if your perl supports tainting
+(see perlsec). This is not normally the case, as
+regexps are often used to extract the safe bits from a tainted
+variable. Use \f(CW\*(C`taint\*(C'\fR when you are not extracting safe bits, but are
+performing some other processing. Both \f(CW\*(C`taint\*(C'\fR and \f(CW\*(C`eval\*(C'\fR pragmas
+are lexically scoped, which means they are in effect only until
+the end of the block enclosing the pragmas.
+.PP
+.Vb 2
+\& use re \*(Aq/m\*(Aq; # or any other flags
+\& $multiline_string =~ /^foo/; # /m is implied
+.Ve
+.PP
+The \f(CW\*(C`re \*(Aq/flags\*(Aq\*(C'\fR pragma (introduced in Perl
+5.14) turns on the given regular expression flags
+until the end of the lexical scope. See
+"'/flags' mode" in re for more
+detail.
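+.PP
+As a rough illustration of the scoping, the sketch below (the sample
+string and variable names are invented) applies the same pattern inside
+and outside a block that enables \f(CW\*(C`/m\*(C'\fR:

```perl
use strict;
use warnings;

my $multiline_string = "bar\nfoo";
my ($inside, $outside);

{
    use re '/m';                                       # in effect to end of block
    $inside = $multiline_string =~ /^foo/ ? 1 : 0;     # ^ matches after newline
}
$outside = $multiline_string =~ /^foo/ ? 1 : 0;        # ^ only at string start

print "inside: $inside, outside: $outside\n";    # prints "inside: 1, outside: 0"
```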
+.PP
+.Vb 2
+\& use re \*(Aqdebug\*(Aq;
+\& /^(.*)$/s; # output debugging info
+\&
+\& use re \*(Aqdebugcolor\*(Aq;
+\& /^(.*)$/s; # output debugging info in living color
+.Ve
+.PP
+The global \f(CW\*(C`debug\*(C'\fR and \f(CW\*(C`debugcolor\*(C'\fR pragmas allow one to get
+detailed debugging info about regexp compilation and
+execution. \f(CW\*(C`debugcolor\*(C'\fR is the same as debug, except the debugging
+information is displayed in color on terminals that can display
+termcap color sequences. Here is example output:
+.PP
+.Vb 10
+\& % perl \-e \*(Aquse re "debug"; "abc" =~ /a*b+c/;\*(Aq
+\& Compiling REx \*(Aqa*b+c\*(Aq
+\& size 9 first at 1
+\& 1: STAR(4)
+\& 2: EXACT <a>(0)
+\& 4: PLUS(7)
+\& 5: EXACT <b>(0)
+\& 7: EXACT <c>(9)
+\& 9: END(0)
+\& floating \*(Aqbc\*(Aq at 0..2147483647 (checking floating) minlen 2
+\& Guessing start of match, REx \*(Aqa*b+c\*(Aq against \*(Aqabc\*(Aq...
+\& Found floating substr \*(Aqbc\*(Aq at offset 1...
+\& Guessed: match at offset 0
+\& Matching REx \*(Aqa*b+c\*(Aq against \*(Aqabc\*(Aq
+\& Setting an EVAL scope, savestack=3
+\& 0 <> <abc> | 1: STAR
+\& EXACT <a> can match 1 times out of 32767...
+\& Setting an EVAL scope, savestack=3
+\& 1 <a> <bc> | 4: PLUS
+\& EXACT <b> can match 1 times out of 32767...
+\& Setting an EVAL scope, savestack=3
+\& 2 <ab> <c> | 7: EXACT <c>
+\& 3 <abc> <> | 9: END
+\& Match successful!
+\& Freeing REx: \*(Aqa*b+c\*(Aq
+.Ve
+.PP
+If you have gotten this far into the tutorial, you can probably guess
+what the different parts of the debugging output tell you. The first
+part
+.PP
+.Vb 8
+\& Compiling REx \*(Aqa*b+c\*(Aq
+\& size 9 first at 1
+\& 1: STAR(4)
+\& 2: EXACT <a>(0)
+\& 4: PLUS(7)
+\& 5: EXACT <b>(0)
+\& 7: EXACT <c>(9)
+\& 9: END(0)
+.Ve
+.PP
+describes the compilation stage. \f(CWSTAR(4)\fR means that there is a
+starred object, in this case \f(CW\*(Aqa\*(Aq\fR, and that once it has matched, the engine goes to line 4,
+\&\fIi.e.\fR, \f(CWPLUS(7)\fR. The middle lines describe some heuristics and
+optimizations performed before a match:
+.PP
+.Vb 4
+\& floating \*(Aqbc\*(Aq at 0..2147483647 (checking floating) minlen 2
+\& Guessing start of match, REx \*(Aqa*b+c\*(Aq against \*(Aqabc\*(Aq...
+\& Found floating substr \*(Aqbc\*(Aq at offset 1...
+\& Guessed: match at offset 0
+.Ve
+.PP
+Then the match is executed and the remaining lines describe the
+process:
+.PP
+.Vb 12
+\& Matching REx \*(Aqa*b+c\*(Aq against \*(Aqabc\*(Aq
+\& Setting an EVAL scope, savestack=3
+\& 0 <> <abc> | 1: STAR
+\& EXACT <a> can match 1 times out of 32767...
+\& Setting an EVAL scope, savestack=3
+\& 1 <a> <bc> | 4: PLUS
+\& EXACT <b> can match 1 times out of 32767...
+\& Setting an EVAL scope, savestack=3
+\& 2 <ab> <c> | 7: EXACT <c>
+\& 3 <abc> <> | 9: END
+\& Match successful!
+\& Freeing REx: \*(Aqa*b+c\*(Aq
+.Ve
+.PP
+Each step is of the form \f(CW\*(C`n\ <x>\ <y>\*(C'\fR, with \f(CW\*(C`<x>\*(C'\fR the
+part of the string matched and \f(CW\*(C`<y>\*(C'\fR the part not yet
+matched. The \f(CW\*(C`|\ \ 1:\ \ STAR\*(C'\fR says that Perl is at line number 1
+in the compilation list above. See
+"Debugging Regular Expressions" in perldebguts for much more detail.
+.PP
+An alternative method of debugging regexps is to embed \f(CW\*(C`print\*(C'\fR
+statements within the regexp. This provides a blow-by-blow account of
+the backtracking in an alternation:
+.PP
+.Vb 12
+\& "that this" =~ m@(?{print "Start at position ", pos, "\en";})
+\& t(?{print "t1\en";})
+\& h(?{print "h1\en";})
+\& i(?{print "i1\en";})
+\& s(?{print "s1\en";})
+\& |
+\& t(?{print "t2\en";})
+\& h(?{print "h2\en";})
+\& a(?{print "a2\en";})
+\& t(?{print "t2\en";})
+\& (?{print "Done at position ", pos, "\en";})
+\& @x;
+.Ve
+.PP
+prints
+.PP
+.Vb 8
+\& Start at position 0
+\& t1
+\& h1
+\& t2
+\& h2
+\& a2
+\& t2
+\& Done at position 4
+.Ve
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+This is just a tutorial. For the full story on Perl regular
+expressions, see the perlre regular expressions reference page.
+.PP
+For more information on the matching \f(CW\*(C`m//\*(C'\fR and substitution \f(CW\*(C`s///\*(C'\fR
+operators, see "Regexp Quote-Like Operators" in perlop. For
+information on the \f(CW\*(C`split\*(C'\fR operation, see "split" in perlfunc.
+.PP
+For an excellent all-around resource on the care and feeding of
+regular expressions, see the book \fIMastering Regular Expressions\fR by
+Jeffrey Friedl (published by O'Reilly, ISBN 1\-56592\-257\-3).
+.SH "AUTHOR AND COPYRIGHT"
+.IX Header "AUTHOR AND COPYRIGHT"
+Copyright (c) 2000 Mark Kvale.
+All rights reserved.
+Now maintained by Perl porters.
+.PP
+This document may be distributed under the same terms as Perl itself.
+.SS Acknowledgments
+.IX Subsection "Acknowledgments"
+The inspiration for the stop codon DNA example came from the ZIP
+code example in chapter 7 of \fIMastering Regular Expressions\fR.
+.PP
+The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
+Haworth, Ronald J Kimball, and Joe Smith for all their helpful
+comments.