summaryrefslogtreecommitdiffstats
path: root/upstream/fedora-40/man1/perlreguts.1
diff options
context:
space:
mode:
Diffstat (limited to 'upstream/fedora-40/man1/perlreguts.1')
-rw-r--r--upstream/fedora-40/man1/perlreguts.11092
1 files changed, 1092 insertions, 0 deletions
diff --git a/upstream/fedora-40/man1/perlreguts.1 b/upstream/fedora-40/man1/perlreguts.1
new file mode 100644
index 00000000..94308646
--- /dev/null
+++ b/upstream/fedora-40/man1/perlreguts.1
@@ -0,0 +1,1092 @@
+.\" -*- mode: troff; coding: utf-8 -*-
+.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
+.\"
+.\" Standard preamble:
+.\" ========================================================================
+.de Sp \" Vertical space (when we can't use .PP)
+.if t .sp .5v
+.if n .sp
+..
+.de Vb \" Begin verbatim text
+.ft CW
+.nf
+.ne \\$1
+..
+.de Ve \" End verbatim text
+.ft R
+.fi
+..
+.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
+.ie n \{\
+. ds C` ""
+. ds C' ""
+'br\}
+.el\{\
+. ds C`
+. ds C'
+'br\}
+.\"
+.\" Escape single quotes in literal strings from groff's Unicode transform.
+.ie \n(.g .ds Aq \(aq
+.el .ds Aq '
+.\"
+.\" If the F register is >0, we'll generate index entries on stderr for
+.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
+.\" entries marked with X<> in POD. Of course, you'll have to process the
+.\" output yourself in some meaningful fashion.
+.\"
+.\" Avoid warning from groff about undefined register 'F'.
+.de IX
+..
+.nr rF 0
+.if \n(.g .if rF .nr rF 1
+.if (\n(rF:(\n(.g==0)) \{\
+. if \nF \{\
+. de IX
+. tm Index:\\$1\t\\n%\t"\\$2"
+..
+. if !\nF==2 \{\
+. nr % 0
+. nr F 2
+. \}
+. \}
+.\}
+.rr rF
+.\" ========================================================================
+.\"
+.IX Title "PERLREGUTS 1"
+.TH PERLREGUTS 1 2024-01-25 "perl v5.38.2" "Perl Programmers Reference Guide"
+.\" For nroff, turn off justification. Always turn off hyphenation; it makes
+.\" way too many mistakes in technical documents.
+.if n .ad l
+.nh
+.SH NAME
+perlreguts \- Description of the Perl regular expression engine.
+.SH DESCRIPTION
+.IX Header "DESCRIPTION"
+This document is an attempt to shine some light on the guts of the regex
+engine and how it works. The regex engine represents a significant chunk
+of the perl codebase, but is relatively poorly understood. This document
+is a meagre attempt at addressing this situation. It is derived from the
+author's experience, comments in the source code, other papers on the
+regex engine, feedback on the perl5\-porters mail list, and no doubt other
+places as well.
+.PP
+\&\fBNOTICE!\fR It should be clearly understood that the behavior and
+structures discussed in this represents the state of the engine as the
+author understood it at the time of writing. It is \fBNOT\fR an API
+definition, it is purely an internals guide for those who want to hack
+the regex engine, or understand how the regex engine works. Readers of
+this document are expected to understand perl's regex syntax and its
+usage in detail. If you want to learn about the basics of Perl's
+regular expressions, see perlre. And if you want to replace the
+regex engine with your own, see perlreapi.
+.SH OVERVIEW
+.IX Header "OVERVIEW"
+.SS "A quick note on terms"
+.IX Subsection "A quick note on terms"
+There is some debate as to whether to say "regexp" or "regex". In this
+document we will use the term "regex" unless there is a special reason
+not to, in which case we will explain why.
+.PP
+When speaking about regexes we need to distinguish between their source
+code form and their internal form. In this document we will use the term
+"pattern" when we speak of their textual, source code form, and the term
+"program" when we speak of their internal representation. These
+correspond to the terms \fIS\-regex\fR and \fIB\-regex\fR that Mark Jason
+Dominus employs in his paper on "Rx" ([1] in "REFERENCES").
+.SS "What is a regular expression engine?"
+.IX Subsection "What is a regular expression engine?"
+A regular expression engine is a program that takes a set of constraints
+specified in a mini-language, and then applies those constraints to a
+target string, and determines whether or not the string satisfies the
+constraints. See perlre for a full definition of the language.
+.PP
+In less grandiose terms, the first part of the job is to turn a pattern into
+something the computer can efficiently use to find the matching point in
+the string, and the second part is performing the search itself.
+.PP
+To do this we need to produce a program by parsing the text. We then
+need to execute the program to find the point in the string that
+matches. And we need to do the whole thing efficiently.
+.SS "Structure of a Regexp Program"
+.IX Subsection "Structure of a Regexp Program"
+\fIHigh Level\fR
+.IX Subsection "High Level"
+.PP
+Although it is a bit confusing and some people object to the terminology, it
+is worth taking a look at a comment that has
+been in \fIregexp.h\fR for years:
+.PP
+\&\fIThis is essentially a linear encoding of a nondeterministic
+finite-state machine (aka syntax charts or "railroad normal form" in
+parsing technology).\fR
+.PP
+The term "railroad normal form" is a bit esoteric, with "syntax
+diagram/charts", or "railroad diagram/charts" being more common terms.
+Nevertheless it provides a useful mental image of a regex program: each
+node can be thought of as a unit of track, with a single entry and in
+most cases a single exit point (there are pieces of track that fork, but
+statistically not many), and the whole forms a layout with a
+single entry and single exit point. The matching process can be thought
+of as a car that moves along the track, with the particular route through
+the system being determined by the character read at each possible
+connector point. A car can fall off the track at any point but it may
+only proceed as long as it matches the track.
+.PP
+Thus the pattern \f(CW\*(C`/foo(?:\ew+|\ed+|\es+)bar/\*(C'\fR can be thought of as the
+following chart:
+.PP
+.Vb 10
+\& [start]
+\& |
+\& <foo>
+\& |
+\& +\-\-\-\-\-+\-\-\-\-\-+
+\& | | |
+\& <\ew+> <\ed+> <\es+>
+\& | | |
+\& +\-\-\-\-\-+\-\-\-\-\-+
+\& |
+\& <bar>
+\& |
+\& [end]
+.Ve
+.PP
+The truth of the matter is that perl's regular expressions these days are
+much more complex than this kind of structure, but visualising it this way
+can help when trying to get your bearings, and it matches the
+current implementation pretty closely.
+.PP
+To be more precise, we will say that a regex program is an encoding
+of a graph. Each node in the graph corresponds to part of
+the original regex pattern, such as a literal string or a branch,
+and has a pointer to the nodes representing the next component
+to be matched. Since "node" and "opcode" already have other meanings in the
+perl source, we will call the nodes in a regex program "regops".
+.PP
+The program is represented by an array of \f(CW\*(C`regnode\*(C'\fR structures, one or
+more of which represent a single regop of the program. Struct
+\&\f(CW\*(C`regnode\*(C'\fR is the smallest struct needed, and has a field structure which is
+shared with all the other larger structures. (Outside this document, the term
+"regnode" is sometimes used to mean "regop", which could be confusing.)
+.PP
+The "next" pointers of all regops except \f(CW\*(C`BRANCH\*(C'\fR implement concatenation;
+a "next" pointer with a \f(CW\*(C`BRANCH\*(C'\fR on both ends of it is connecting two
+alternatives. [Here we have one of the subtle syntax dependencies: an
+individual \f(CW\*(C`BRANCH\*(C'\fR (as opposed to a collection of them) is never
+concatenated with anything because of operator precedence.]
+.PP
+The operand of some types of regop is a literal string; for others,
+it is a regop leading into a sub-program. In particular, the operand
+of a \f(CW\*(C`BRANCH\*(C'\fR node is the first regop of the branch.
+.PP
+\&\fBNOTE\fR: As the railroad metaphor suggests, this is \fBnot\fR a tree
+structure: the tail of the branch connects to the thing following the
+set of \f(CW\*(C`BRANCH\*(C'\fRes. It is a like a single line of railway track that
+splits as it goes into a station or railway yard and rejoins as it comes
+out the other side.
+.PP
+\fIRegops\fR
+.IX Subsection "Regops"
+.PP
+The base structure of a regop is defined in \fIregexp.h\fR as follows:
+.PP
+.Vb 5
+\& struct regnode {
+\& U8 flags; /* Various purposes, sometimes overridden */
+\& U8 type; /* Opcode value as specified by regnodes.h */
+\& U16 next_off; /* Offset in size regnode */
+\& };
+.Ve
+.PP
+Other larger \f(CW\*(C`regnode\*(C'\fR\-like structures are defined in \fIregcomp.h\fR. They
+are almost like subclasses in that they have the same fields as
+\&\f(CW\*(C`regnode\*(C'\fR, with possibly additional fields following in
+the structure, and in some cases the specific meaning (and name)
+of some of base fields are overridden. The following is a more
+complete description.
+.ie n .IP """regnode_1""" 4
+.el .IP \f(CWregnode_1\fR 4
+.IX Item "regnode_1"
+.PD 0
+.ie n .IP """regnode_2""" 4
+.el .IP \f(CWregnode_2\fR 4
+.IX Item "regnode_2"
+.PD
+\&\f(CW\*(C`regnode_1\*(C'\fR structures have the same header, followed by a single
+four-byte argument; \f(CW\*(C`regnode_2\*(C'\fR structures contain two two-byte
+arguments instead:
+.Sp
+.Vb 2
+\& regnode_1 U32 arg1;
+\& regnode_2 U16 arg1; U16 arg2;
+.Ve
+.ie n .IP """regnode_string""" 4
+.el .IP \f(CWregnode_string\fR 4
+.IX Item "regnode_string"
+\&\f(CW\*(C`regnode_string\*(C'\fR structures, used for literal strings, follow the header
+with a one-byte length and then the string data. Strings are padded on
+the tail end with zero bytes so that the total length of the node is a
+multiple of four bytes:
+.Sp
+.Vb 2
+\& regnode_string char string[1];
+\& U8 str_len; /* overrides flags */
+.Ve
+.ie n .IP """regnode_charclass""" 4
+.el .IP \f(CWregnode_charclass\fR 4
+.IX Item "regnode_charclass"
+Bracketed character classes are represented by \f(CW\*(C`regnode_charclass\*(C'\fR
+structures, which have a four-byte argument and then a 32\-byte (256\-bit)
+bitmap indicating which characters in the Latin1 range are included in
+the class.
+.Sp
+.Vb 2
+\& regnode_charclass U32 arg1;
+\& char bitmap[ANYOF_BITMAP_SIZE];
+.Ve
+.Sp
+Various flags whose names begin with \f(CW\*(C`ANYOF_\*(C'\fR are used for special
+situations. Above Latin1 matches and things not known until run-time
+are stored in "Perl's pprivate structure".
+.ie n .IP """regnode_charclass_posixl""" 4
+.el .IP \f(CWregnode_charclass_posixl\fR 4
+.IX Item "regnode_charclass_posixl"
+There is also a larger form of a char class structure used to represent
+POSIX char classes under \f(CW\*(C`/l\*(C'\fR matching,
+called \f(CW\*(C`regnode_charclass_posixl\*(C'\fR which has an
+additional 32\-bit bitmap indicating which POSIX char classes
+have been included.
+.Sp
+.Vb 3
+\& regnode_charclass_posixl U32 arg1;
+\& char bitmap[ANYOF_BITMAP_SIZE];
+\& U32 classflags;
+.Ve
+.PP
+\&\fIregnodes.h\fR defines an array called \f(CW\*(C`PL_regnode_arg_len[]\*(C'\fR which gives the size
+of each opcode in units of \f(CW\*(C`size regnode\*(C'\fR (4\-byte). A macro is used
+to calculate the size of an \f(CW\*(C`EXACT\*(C'\fR node based on its \f(CW\*(C`str_len\*(C'\fR field.
+.PP
+The regops are defined in \fIregnodes.h\fR which is generated from
+\&\fIregcomp.sym\fR by \fIregcomp.pl\fR. Currently the maximum possible number
+of distinct regops is restricted to 256, with about a quarter already
+used.
+.PP
+A set of macros makes accessing the fields
+easier and more consistent. These include \f(CWOP()\fR, which is used to determine
+the type of a \f(CW\*(C`regnode\*(C'\fR\-like structure; \f(CWNEXT_OFF()\fR, which is the offset to
+the next node (more on this later); \f(CWARG()\fR, \f(CWARG1()\fR, \f(CWARG2()\fR, \f(CWARG_SET()\fR,
+and equivalents for reading and setting the arguments; and \f(CWSTR_LEN()\fR,
+\&\f(CWSTRING()\fR and \f(CWOPERAND()\fR for manipulating strings and regop bearing
+types.
+.PP
+\fIWhat regnode is next?\fR
+.IX Subsection "What regnode is next?"
+.PP
+There are two distinct concepts of "next regnode" in the regex engine,
+and it is important to keep them distinct in your thinking as they
+overlap conceptually in many places, but where they don't overlap the
+difference is critical. For the majority of regnode types the two
+concepts are (nearly) identical in practice. The two types are
+\&\f(CW\*(C`REGNODE_AFTER\*(C'\fR which is used heavily during compilation but only
+occasionally during execution and \f(CW\*(C`regnext\*(C'\fR which is used heavily
+during execution, and only occasionally during compilation.
+.IP """REGNODE_AFTER""" 4
+.IX Item """REGNODE_AFTER"""
+This is the "positionally next regnode" in the compiled regex program.
+For the smaller regnode types it is \f(CW\*(C`regnode_ptr+1\*(C'\fR under the hood, but
+as regnode sizes vary and can change over time we offer macros which
+hide the gory details.
+.Sp
+It is heavily used in the compiler phase but is only used by a few
+select regnode types in the execution phase. It is also heavily used in
+the code for dumping the regexp program for debugging.
+.Sp
+There are a selection of macros which can be used to compute this as
+efficiently as possible depending on the circumstances. The canonical
+macro is \f(CWREGNODE_AFTER()\fR, which is the most powerful and should handle
+any case we have, but is also potentially the slowest. There are two
+additional macros for the special case that you KNOW the current regnode
+size is constant, and you know its type or opcode. In which case you can
+use \f(CWREGNODE_AFTER_opcode()\fR or \f(CWREGNODE_AFTER_type()\fR.
+.Sp
+In older versions of the regex engine \f(CWREGNODE_AFTER()\fR was called
+\&\f(CW\*(C`NEXTOPER\*(C'\fR but this was found to be confusing and it was renamed. There
+is also a \f(CWREGNODE_BEFORE()\fR, but it is unsafe and should not be used
+in new code.
+.IP """regnext""" 4
+.IX Item """regnext"""
+This is the regnode which can be reached by jumping forward by the value
+of the \f(CWNEXT_OFF()\fR member of the regnode, or in a few cases for longer
+jumps by the \f(CW\*(C`arg1\*(C'\fR field of the \f(CW\*(C`regnode_1\*(C'\fR structure. The subroutine
+\&\f(CWregnext()\fR handles this transparently. In the majority of cases the
+\&\f(CW\*(C`regnext\*(C'\fR for a regnode is the regnode which should be executed after the
+current one has successfully matched, but in some cases this may not be
+true. In loop control and branch control regnode types the regnext may
+signify something special, for BRANCH nodes \f(CW\*(C`regnext\*(C'\fR is the
+next BRANCH that should be executed if the current one fails execution,
+and some loop control regnodes set the regnext to be the end of the loop
+so they can jump to their cleanup if the current iteration fails to match.
+.PP
+Most regnode types do not create a branch in the execution flow, and
+leaving aside optimizations the two concepts of "next" are the same.
+For instance the \f(CW\*(C`regnext\*(C'\fR and \f(CW\*(C`REGNODE_AFTER\*(C'\fR of a SBOL opcode are
+the same during compilation phase. The main place this is not true is
+\&\f(CW\*(C`BRANCH\*(C'\fR regnodes where the \f(CW\*(C`REGNODE_AFTER\*(C'\fR represents the start of
+the pattern in the branch and the \f(CW\*(C`regnext\*(C'\fR represents the linkage to
+the next BRANCH should this one fail to match, or 0 if it is the last
+branch. The looping logic for quantifiers also makes similar use of
+the distinction between the two types, with \f(CW\*(C`REGNODE_AFTER\*(C'\fR being the
+inside of the loop construct, and the \f(CW\*(C`regnext\*(C'\fR pointing at the end
+of the loop.
+.PP
+During compilation the engine may not know what the regnext is for a
+given node, so during compilation \f(CW\*(C`regnext\*(C'\fR is only used where it must
+be used and is known to be correct. At the very end of the compilation
+phase we walk the regex program and correct the regnext data as
+appropriate, and also perform various optimizations which may result in
+regnodes that were required during construction becoming redundant, or
+we may replace a large regnode with a much smaller one and filling in the
+gap with OPTIMIZED regnodes. Thus we might start with something like
+this:
+.PP
+.Vb 5
+\& BRANCH
+\& EXACT "foo"
+\& BRANCH
+\& EXACT "bar"
+\& EXACT "!"
+.Ve
+.PP
+and replace it with something like:
+.PP
+.Vb 5
+\& TRIE foo|bar
+\& OPTIMIZED
+\& OPTIMIZED
+\& OPTIMIZED
+\& EXACT "!"
+.Ve
+.PP
+the \f(CW\*(C`REGNODE_AFTER\*(C'\fR for the \f(CW\*(C`TRIE\*(C'\fR node would be an \f(CW\*(C`OPTIMIZED\*(C'\fR
+regnode, and in theory the \f(CW\*(C`regnext\*(C'\fR would be the same as the
+\&\f(CW\*(C`REGNODE_AFTER\*(C'\fR. But it would be inefficient to execute the OPTIMIZED
+regnode as a noop three times, so the optimizer fixes the \f(CW\*(C`regnext\*(C'\fR so
+such nodes are skipped during execution phase.
+.PP
+During execution phases we use the \f(CWregnext()\fR almost exclusively, and
+only use \f(CW\*(C`REGNODE_AFTER\*(C'\fR in special cases where it has a well defined
+meaning for a given regnode type. For instance /x+/ results in
+.PP
+.Vb 3
+\& PLUS
+\& EXACT "x"
+\& END
+.Ve
+.PP
+the \f(CW\*(C`regnext\*(C'\fR of the \f(CW\*(C`PLUS\*(C'\fR regnode is the \f(CW\*(C`END\*(C'\fR regnode, and the
+\&\f(CW\*(C`REGNODE_AFTER\*(C'\fR of the \f(CW\*(C`PLUS\*(C'\fR regnode is the \f(CW\*(C`EXACT\*(C'\fR regnode. The
+\&\f(CW\*(C`regnext\*(C'\fR and \f(CW\*(C`REGNODE_AFTER\*(C'\fR of the \f(CW\*(C`EXACT\*(C'\fR regnode is the
+\&\f(CW\*(C`END\*(C'\fR regnode.
+.SH "Process Overview"
+.IX Header "Process Overview"
+Broadly speaking, performing a match of a string against a pattern
+involves the following steps:
+.IP "A. Compilation" 5
+.IX Item "A. Compilation"
+.RS 5
+.PD 0
+.IP "1. Parsing" 5
+.IX Item "1. Parsing"
+.IP "2. Peep-hole optimisation and analysis" 5
+.IX Item "2. Peep-hole optimisation and analysis"
+.RE
+.RS 5
+.RE
+.IP "B. Execution" 5
+.IX Item "B. Execution"
+.RS 5
+.IP "3. Start position and no-match optimisations" 5
+.IX Item "3. Start position and no-match optimisations"
+.IP "4. Program execution" 5
+.IX Item "4. Program execution"
+.RE
+.RS 5
+.RE
+.PD
+.PP
+Where these steps occur in the actual execution of a perl program is
+determined by whether the pattern involves interpolating any string
+variables. If interpolation occurs, then compilation happens at run time. If it
+does not, then compilation is performed at compile time. (The \f(CW\*(C`/o\*(C'\fR modifier changes this,
+as does \f(CW\*(C`qr//\*(C'\fR to a certain extent.) The engine doesn't really care that
+much.
+.SS Compilation
+.IX Subsection "Compilation"
+This code resides primarily in \fIregcomp.c\fR, along with the header files
+\&\fIregcomp.h\fR, \fIregexp.h\fR and \fIregnodes.h\fR.
+.PP
+Compilation starts with \f(CWpregcomp()\fR, which is mostly an initialisation
+wrapper which farms work out to two other routines for the heavy lifting: the
+first is \f(CWreg()\fR, which is the start point for parsing; the second,
+\&\f(CWstudy_chunk()\fR, is responsible for optimisation.
+.PP
+Initialisation in \f(CWpregcomp()\fR mostly involves the creation and data-filling
+of a special structure, \f(CW\*(C`RExC_state_t\*(C'\fR (defined in \fIregcomp.c\fR).
+Almost all internally-used routines in \fIregcomp.h\fR take a pointer to one
+of these structures as their first argument, with the name \f(CW\*(C`pRExC_state\*(C'\fR.
+This structure is used to store the compilation state and contains many
+fields. Likewise there are many macros which operate on this
+variable: anything that looks like \f(CW\*(C`RExC_xxxx\*(C'\fR is a macro that operates on
+this pointer/structure.
+.PP
+\&\f(CWreg()\fR is the start of the parse process. It is responsible for
+parsing an arbitrary chunk of pattern up to either the end of the
+string, or the first closing parenthesis it encounters in the pattern.
+This means it can be used to parse the top-level regex, or any section
+inside of a grouping parenthesis. It also handles the "special parens"
+that perl's regexes have. For instance when parsing \f(CW\*(C`/x(?:foo)y/\*(C'\fR,
+\&\f(CWreg()\fR will at one point be called to parse from the "?" symbol up to
+and including the ")".
+.PP
+Additionally, \f(CWreg()\fR is responsible for parsing the one or more
+branches from the pattern, and for "finishing them off" by correctly
+setting their next pointers. In order to do the parsing, it repeatedly
+calls out to \f(CWregbranch()\fR, which is responsible for handling up to the
+first \f(CW\*(C`|\*(C'\fR symbol it sees.
+.PP
+\&\f(CWregbranch()\fR in turn calls \f(CWregpiece()\fR which
+handles "things" followed by a quantifier. In order to parse the
+"things", \f(CWregatom()\fR is called. This is the lowest level routine, which
+parses out constant strings, character classes, and the
+various special symbols like \f(CW\*(C`$\*(C'\fR. If \f(CWregatom()\fR encounters a "("
+character it in turn calls \f(CWreg()\fR.
+.PP
+There used to be two main passes involved in parsing, the first to
+calculate the size of the compiled program, and the second to actually
+compile it. But now there is only one main pass, with an initial crude
+guess based on the length of the input pattern, which is increased if
+necessary as parsing proceeds, and afterwards, trimmed to the actual
+amount used.
+.PP
+However, it may happen that parsing must be restarted at the beginning
+when various circumstances occur along the way. An example is if the
+program turns out to be so large that there are jumps in it that won't
+fit in the normal 16 bits available. There are two special regops that
+can hold bigger jump destinations, BRANCHJ and LONGBRANCH. The parse is
+restarted, and these are used instead of the normal shorter ones.
+Whenever restarting the parse is required, the function returns failure
+and sets a flag as to what needs to be done. This is passed up to the
+top level routine which takes the appropriate action and restarts from
+scratch. In the case of needing longer jumps, the \f(CW\*(C`RExC_use_BRANCHJ\*(C'\fR
+flag is set in the \f(CW\*(C`RExC_state_t\*(C'\fR structure, which the functions know
+to inspect before deciding how to do branches.
+.PP
+In most instances, the function that discovers the issue sets the causal
+flag and returns failure immediately. "Parsing complications"
+contains an explicit example of how this works. In other cases, such as
+a forward reference to a numbered parenthetical grouping, we need to
+finish the parse to know if that numbered grouping actually appears in
+the pattern. In those cases, the parse is just redone at the end, with
+the knowledge of how many groupings occur in it.
+.PP
+The routine \f(CWregtail()\fR is called by both \f(CWreg()\fR and \f(CWregbranch()\fR
+in order to "set the tail pointer" correctly. When executing and
+we get to the end of a branch, we need to go to the node following the
+grouping parens. When parsing, however, we don't know where the end will
+be until we get there, so when we do we must go back and update the
+offsets as appropriate. \f(CW\*(C`regtail\*(C'\fR is used to make this easier.
+.PP
+A subtlety of the parsing process means that a regex like \f(CW\*(C`/foo/\*(C'\fR is
+originally parsed into an alternation with a single branch. It is only
+afterwards that the optimiser converts single branch alternations into the
+simpler form.
+.PP
+\fIParse Call Graph and a Grammar\fR
+.IX Subsection "Parse Call Graph and a Grammar"
+.PP
+The call graph looks like this:
+.PP
+.Vb 10
+\& reg() # parse a top level regex, or inside of
+\& # parens
+\& regbranch() # parse a single branch of an alternation
+\& regpiece() # parse a pattern followed by a quantifier
+\& regatom() # parse a simple pattern
+\& regclass() # used to handle a class
+\& reg() # used to handle a parenthesised
+\& # subpattern
+\& ....
+\& ...
+\& regtail() # finish off the branch
+\& ...
+\& regtail() # finish off the branch sequence. Tie each
+\& # branch\*(Aqs tail to the tail of the
+\& # sequence
+\& # (NEW) In Debug mode this is
+\& # regtail_study().
+.Ve
+.PP
+A grammar form might be something like this:
+.PP
+.Vb 11
+\& atom : constant | class
+\& quant : \*(Aq*\*(Aq | \*(Aq+\*(Aq | \*(Aq?\*(Aq | \*(Aq{min,max}\*(Aq
+\& _branch: piece
+\& | piece _branch
+\& | nothing
+\& branch: _branch
+\& | _branch \*(Aq|\*(Aq branch
+\& group : \*(Aq(\*(Aq branch \*(Aq)\*(Aq
+\& _piece: atom | group
+\& piece : _piece
+\& | _piece quant
+.Ve
+.PP
+\fIParsing complications\fR
+.IX Subsection "Parsing complications"
+.PP
+The implication of the above description is that a pattern containing nested
+parentheses will result in a call graph which cycles through \f(CWreg()\fR,
+\&\f(CWregbranch()\fR, \f(CWregpiece()\fR, \f(CWregatom()\fR, \f(CWreg()\fR, \f(CWregbranch()\fR \fIetc\fR
+multiple times, until the deepest level of nesting is reached. All the above
+routines return a pointer to a \f(CW\*(C`regnode\*(C'\fR, which is usually the last regnode
+added to the program. However, one complication is that \fBreg()\fR returns NULL
+for parsing \f(CW\*(C`(?:)\*(C'\fR syntax for embedded modifiers, setting the flag
+\&\f(CW\*(C`TRYAGAIN\*(C'\fR. The \f(CW\*(C`TRYAGAIN\*(C'\fR propagates upwards until it is captured, in
+some cases by \f(CWregatom()\fR, but otherwise unconditionally by
+\&\f(CWregbranch()\fR. Hence it will never be returned by \f(CWregbranch()\fR to
+\&\f(CWreg()\fR. This flag permits patterns such as \f(CW\*(C`(?i)+\*(C'\fR to be detected as
+errors (\fIQuantifier follows nothing in regex; marked by <\-\- HERE in m/(?i)+
+<\-\- HERE /\fR).
+.PP
+Another complication is that the representation used for the program differs
+if it needs to store Unicode, but it's not always possible to know for sure
+whether it does until midway through parsing. The Unicode representation for
+the program is larger, and cannot be matched as efficiently. (See "Unicode
+and Localisation Support" below for more details as to why.) If the pattern
+contains literal Unicode, it's obvious that the program needs to store
+Unicode. Otherwise, the parser optimistically assumes that the more
+efficient representation can be used, and starts sizing on this basis.
+However, if it then encounters something in the pattern which must be stored
+as Unicode, such as an \f(CW\*(C`\ex{...}\*(C'\fR escape sequence representing a character
+literal, then this means that all previously calculated sizes need to be
+redone, using values appropriate for the Unicode representation. This
+is another instance where the parsing needs to be restarted, and it can
+and is done immediately. The function returns failure, and sets the
+flag \f(CW\*(C`RESTART_UTF8\*(C'\fR (encapsulated by using the macro \f(CW\*(C`REQUIRE_UTF8\*(C'\fR).
+This restart request is propagated up the call chain in a similar
+fashion, until it is "caught" in \f(CWPerl_re_op_compile()\fR, which marks
+the pattern as containing Unicode, and restarts the sizing pass. It is
+also possible for constructions within run-time code blocks to turn out
+to need Unicode representation., which is signalled by
+\&\f(CWS_compile_runtime_code()\fR returning false to \f(CWPerl_re_op_compile()\fR.
+.PP
+The restart was previously implemented using a \f(CW\*(C`longjmp\*(C'\fR in \f(CWregatom()\fR
+back to a \f(CW\*(C`setjmp\*(C'\fR in \f(CWPerl_re_op_compile()\fR, but this proved to be
+problematic as the latter is a large function containing many automatic
+variables, which interact badly with the emergent control flow of \f(CW\*(C`setjmp\*(C'\fR.
+.PP
+\fIDebug Output\fR
+.IX Subsection "Debug Output"
+.PP
+Starting in the 5.9.x development version of perl you can \f(CW\*(C`use re
+Debug => \*(AqPARSE\*(Aq\*(C'\fR to see some trace information about the parse
+process. We will start with some simple patterns and build up to more
+complex patterns.
+.PP
+So when we parse \f(CW\*(C`/foo/\*(C'\fR we see something like the following table. The
+left shows what is being parsed, and the number indicates where the next regop
+would go. The stuff on the right is the trace output of the graph. The
+names are chosen to be short to make it less dense on the screen. 'tsdy'
+is a special form of \f(CWregtail()\fR which does some extra analysis.
+.PP
+.Vb 6
+\& >foo< 1 reg
+\& brnc
+\& piec
+\& atom
+\& >< 4 tsdy~ EXACT <foo> (EXACT) (1)
+\& ~ attach to END (3) offset to 2
+.Ve
+.PP
+The resulting program then looks like:
+.PP
+.Vb 2
+\& 1: EXACT <foo>(3)
+\& 3: END(0)
+.Ve
+.PP
+As you can see, even though we parsed out a branch and a piece, it was ultimately
+only an atom. The final program shows us how things work. We have an \f(CW\*(C`EXACT\*(C'\fR regop,
+followed by an \f(CW\*(C`END\*(C'\fR regop. The number in parens indicates where the \f(CW\*(C`regnext\*(C'\fR of
+the node goes. The \f(CW\*(C`regnext\*(C'\fR of an \f(CW\*(C`END\*(C'\fR regop is unused, as \f(CW\*(C`END\*(C'\fR regops mean
+we have successfully matched. The number on the left indicates the position of
+the regop in the regnode array.
+.PP
+Now let's try a harder pattern. We will add a quantifier, so now we have the pattern
+\&\f(CW\*(C`/foo+/\*(C'\fR. We will see that \f(CWregbranch()\fR calls \f(CWregpiece()\fR twice.
+.PP
+.Vb 10
+\& >foo+< 1 reg
+\& brnc
+\& piec
+\& atom
+\& >o+< 3 piec
+\& atom
+\& >< 6 tail~ EXACT <fo> (1)
+\& 7 tsdy~ EXACT <fo> (EXACT) (1)
+\& ~ PLUS (END) (3)
+\& ~ attach to END (6) offset to 3
+.Ve
+.PP
+And we end up with the program:
+.PP
+.Vb 4
+\& 1: EXACT <fo>(3)
+\& 3: PLUS(6)
+\& 4: EXACT <o>(0)
+\& 6: END(0)
+.Ve
+.PP
+Now we have a special case. The \f(CW\*(C`EXACT\*(C'\fR regop has a \f(CW\*(C`regnext\*(C'\fR of 0. This is
+because if it matches it should try to match itself again. The \f(CW\*(C`PLUS\*(C'\fR regop
+handles the actual failure of the \f(CW\*(C`EXACT\*(C'\fR regop and acts appropriately (going
+to regnode 6 if the \f(CW\*(C`EXACT\*(C'\fR matched at least once, or failing if it didn't).
+.PP
+Now for something much more complex: \f(CW\*(C`/x(?:foo*|b[a][rR])(foo|bar)$/\*(C'\fR
+.PP
+.Vb 10
+\& >x(?:foo*|b... 1 reg
+\& brnc
+\& piec
+\& atom
+\& >(?:foo*|b[... 3 piec
+\& atom
+\& >?:foo*|b[a... reg
+\& >foo*|b[a][... brnc
+\& piec
+\& atom
+\& >o*|b[a][rR... 5 piec
+\& atom
+\& >|b[a][rR])... 8 tail~ EXACT <fo> (3)
+\& >b[a][rR])(... 9 brnc
+\& 10 piec
+\& atom
+\& >[a][rR])(f... 12 piec
+\& atom
+\& >a][rR])(fo... clas
+\& >[rR])(foo|... 14 tail~ EXACT <b> (10)
+\& piec
+\& atom
+\& >rR])(foo|b... clas
+\& >)(foo|bar)... 25 tail~ EXACT <a> (12)
+\& tail~ BRANCH (3)
+\& 26 tsdy~ BRANCH (END) (9)
+\& ~ attach to TAIL (25) offset to 16
+\& tsdy~ EXACT <fo> (EXACT) (4)
+\& ~ STAR (END) (6)
+\& ~ attach to TAIL (25) offset to 19
+\& tsdy~ EXACT <b> (EXACT) (10)
+\& ~ EXACT <a> (EXACT) (12)
+\& ~ ANYOF[Rr] (END) (14)
+\& ~ attach to TAIL (25) offset to 11
+\& >(foo|bar)$< tail~ EXACT <x> (1)
+\& piec
+\& atom
+\& >foo|bar)$< reg
+\& 28 brnc
+\& piec
+\& atom
+\& >|bar)$< 31 tail~ OPEN1 (26)
+\& >bar)$< brnc
+\& 32 piec
+\& atom
+\& >)$< 34 tail~ BRANCH (28)
+\& 36 tsdy~ BRANCH (END) (31)
+\& ~ attach to CLOSE1 (34) offset to 3
+\& tsdy~ EXACT <foo> (EXACT) (29)
+\& ~ attach to CLOSE1 (34) offset to 5
+\& tsdy~ EXACT <bar> (EXACT) (32)
+\& ~ attach to CLOSE1 (34) offset to 2
+\& >$< tail~ BRANCH (3)
+\& ~ BRANCH (9)
+\& ~ TAIL (25)
+\& piec
+\& atom
+\& >< 37 tail~ OPEN1 (26)
+\& ~ BRANCH (28)
+\& ~ BRANCH (31)
+\& ~ CLOSE1 (34)
+\& 38 tsdy~ EXACT <x> (EXACT) (1)
+\& ~ BRANCH (END) (3)
+\& ~ BRANCH (END) (9)
+\& ~ TAIL (END) (25)
+\& ~ OPEN1 (END) (26)
+\& ~ BRANCH (END) (28)
+\& ~ BRANCH (END) (31)
+\& ~ CLOSE1 (END) (34)
+\& ~ EOL (END) (36)
+\& ~ attach to END (37) offset to 1
+.Ve
+.PP
+Resulting in the program
+.PP
+.Vb 10
+\& 1: EXACT <x>(3)
+\& 3: BRANCH(9)
+\& 4: EXACT <fo>(6)
+\& 6: STAR(26)
+\& 7: EXACT <o>(0)
+\& 9: BRANCH(25)
+\& 10: EXACT <ba>(14)
+\& 12: OPTIMIZED (2 nodes)
+\& 14: ANYOF[Rr](26)
+\& 25: TAIL(26)
+\& 26: OPEN1(28)
+\& 28: TRIE\-EXACT(34)
+\& [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
+\& <foo>
+\& <bar>
+\& 30: OPTIMIZED (4 nodes)
+\& 34: CLOSE1(36)
+\& 36: EOL(37)
+\& 37: END(0)
+.Ve
+.PP
+Here we can see a much more complex program, with various optimisations in
+play. At regnode 10 we see an example where a character class with only
+one character in it was turned into an \f(CW\*(C`EXACT\*(C'\fR node. We can also see where
+an entire alternation was turned into a \f(CW\*(C`TRIE\-EXACT\*(C'\fR node. As a consequence,
+some of the regnodes have been marked as optimised away. We can see that
+the \f(CW\*(C`$\*(C'\fR symbol has been converted into an \f(CW\*(C`EOL\*(C'\fR regop, a special piece of
+code that looks for \f(CW\*(C`\en\*(C'\fR or the end of the string.
+.PP
+The next pointer for \f(CW\*(C`BRANCH\*(C'\fRes is interesting in that it points at where
+execution should go if the branch fails. When executing, if the engine
+tries to traverse from a branch to a \f(CW\*(C`regnext\*(C'\fR that isn't a branch then
+the engine will know that the entire set of branches has failed.
+.PP
+\fIPeep-hole Optimisation and Analysis\fR
+.IX Subsection "Peep-hole Optimisation and Analysis"
+.PP
+The regular expression engine can be a weighty tool to wield. On long
+strings and complex patterns it can end up having to do a lot of work
+to find a match, and even more to decide that no match is possible.
+Consider a situation like the following pattern.
+.PP
+.Vb 1
+\& \*(Aqababababababababababab\*(Aq =~ /(a|b)*z/
+.Ve
+.PP
+The \f(CW\*(C`(a|b)*\*(C'\fR part can match at every char in the string, and then fail
+every time because there is no \f(CW\*(C`z\*(C'\fR in the string. So obviously we can
+avoid using the regex engine unless there is a \f(CW\*(C`z\*(C'\fR in the string.
+Likewise in a pattern like:
+.PP
+.Vb 1
+\& /foo(\ew+)bar/
+.Ve
+.PP
+In this case we know that the string must contain a \f(CW\*(C`foo\*(C'\fR which must be
+followed by \f(CW\*(C`bar\*(C'\fR. We can use Fast Boyer-Moore matching as implemented
+in \f(CWfbm_instr()\fR to find the location of these strings. If they don't exist
+then we don't need to resort to the much more expensive regex engine.
+Even better, if they do exist then we can use their positions to
+reduce the search space that the regex engine needs to cover to determine
+if the entire pattern matches.
+.PP
+There are various aspects of the pattern that can be used to facilitate
+optimisations along these lines:
+.IP \(bu 5
+anchored fixed strings
+.IP \(bu 5
+floating fixed strings
+.IP \(bu 5
+minimum and maximum length requirements
+.IP \(bu 5
+start class
+.IP \(bu 5
+Beginning/End of line positions
+.PP
+Another form of optimisation that can occur is the post-parse "peep-hole"
+optimisation, where inefficient constructs are replaced by more efficient
+constructs. The \f(CW\*(C`TAIL\*(C'\fR regops which are used during parsing to mark the end
+of branches and the end of groups are examples of this. These regops are used
+as place-holders during construction and "always match" so they can be
+"optimised away" by making the things that point to the \f(CW\*(C`TAIL\*(C'\fR point to the
+thing that \f(CW\*(C`TAIL\*(C'\fR points to, thus "skipping" the node.
+.PP
+Another optimisation that can occur is that of "\f(CW\*(C`EXACT\*(C'\fR merging" which is
+where two consecutive \f(CW\*(C`EXACT\*(C'\fR nodes are merged into a single
+regop. An even more aggressive form of this is that a branch
+sequence of the form \f(CW\*(C`EXACT BRANCH ... EXACT\*(C'\fR can be converted into a
+\&\f(CW\*(C`TRIE\-EXACT\*(C'\fR regop.
+.PP
+All of this occurs in the routine \f(CWstudy_chunk()\fR which uses a special
+structure \f(CW\*(C`scan_data_t\*(C'\fR to store the analysis that it has performed, and
+does the "peep-hole" optimisations as it goes.
+.PP
+The code involved in \f(CWstudy_chunk()\fR is extremely cryptic. Be careful. :\-)
+.SS Execution
+.IX Subsection "Execution"
+Execution of a regex generally involves two phases, the first being
+finding the start point in the string where we should match from,
+and the second being running the regop interpreter.
+.PP
+If we can tell that there is no valid start point then we don't bother running
+the interpreter at all. Likewise, if we know from the analysis phase that we
+cannot detect a short-cut to the start position, we go straight to the
+interpreter.
+.PP
+The two entry points are \f(CWre_intuit_start()\fR and \f(CWpregexec()\fR. These routines
+have a somewhat incestuous relationship with overlap between their functions,
+and \f(CWpregexec()\fR may even call \f(CWre_intuit_start()\fR on its own. Nevertheless
+other parts of the perl source code may call into either, or both.
+.PP
+Execution of the interpreter itself used to be recursive, but thanks to the
+efforts of Dave Mitchell in the 5.9.x development track, that has changed: now an
+internal stack is maintained on the heap and the routine is fully
+iterative. This can make it tricky as the code is quite conservative
+about what state it stores, with the result that two consecutive lines in the
+code can actually be running in totally different contexts due to the
+simulated recursion.
+.PP
+\fIStart position and no-match optimisations\fR
+.IX Subsection "Start position and no-match optimisations"
+.PP
+\&\f(CWre_intuit_start()\fR is responsible for handling start points and no-match
+optimisations as determined by the results of the analysis done by
+\&\f(CWstudy_chunk()\fR (and described in "Peep-hole Optimisation and Analysis").
+.PP
+The basic structure of this routine is to try to find the start\- and/or
+end-points of where the pattern could match, and to ensure that the string
+is long enough to match the pattern. It tries to use more efficient
+methods over less efficient methods and may involve considerable
+cross-checking of constraints to find the place in the string that matches.
+For instance it may try to determine that a given fixed string must be
+not only present but a certain number of chars before the end of the
+string, or whatever.
+.PP
+It calls several other routines, such as \f(CWfbm_instr()\fR which does
+Fast Boyer Moore matching and \f(CWfind_byclass()\fR which is responsible for
+finding the start using the first mandatory regop in the program.
+.PP
+When the optimisation criteria have been satisfied, \f(CWreg_try()\fR is called
+to perform the match.
+.PP
+\fIProgram execution\fR
+.IX Subsection "Program execution"
+.PP
+\&\f(CWpregexec()\fR is the main entry point for running a regex. It contains
+support for initialising the regex interpreter's state, running
+\&\f(CWre_intuit_start()\fR if needed, and running the interpreter on the string
+from various start positions as needed. When it is necessary to use
+the regex interpreter \f(CWpregexec()\fR calls \f(CWregtry()\fR.
+.PP
+\&\f(CWregtry()\fR is the entry point into the regex interpreter. It expects
+as arguments a pointer to a \f(CW\*(C`regmatch_info\*(C'\fR structure and a pointer to
+a string. It returns an integer 1 for success and a 0 for failure.
+It is basically a set-up wrapper around \f(CWregmatch()\fR.
+.PP
+\&\f(CW\*(C`regmatch\*(C'\fR is the main "recursive loop" of the interpreter. It is
+basically a giant switch statement that implements a state machine, where
+the possible states are the regops themselves, plus a number of additional
+intermediate and failure states. A few of the states are implemented as
+subroutines but the bulk are inline code.
+.SH MISCELLANEOUS
+.IX Header "MISCELLANEOUS"
+.SS "Unicode and Localisation Support"
+.IX Subsection "Unicode and Localisation Support"
+When dealing with strings containing characters that cannot be represented
+using an eight-bit character set, perl uses an internal representation
+that is a permissive version of Unicode's UTF\-8 encoding[2]. This uses single
+bytes to represent characters from the ASCII character set, and sequences
+of two or more bytes for all other characters. (See perlunitut
+for more information about the relationship between UTF\-8 and perl's
+encoding, utf8. The difference isn't important for this discussion.)
+.PP
+No matter how you look at it, Unicode support is going to be a pain in a
+regex engine. Tricks that might be fine when you have 256 possible
+characters often won't scale to handle the size of the UTF\-8 character
+set. Things you can take for granted with ASCII may not be true with
+Unicode. For instance, in ASCII, it is safe to assume that
+\&\f(CW\*(C`sizeof(char1) == sizeof(char2)\*(C'\fR, but in UTF\-8 it isn't. Unicode case folding is
+vastly more complex than the simple rules of ASCII, and even when not
+using Unicode but only localised single byte encodings, things can get
+tricky (for example, \fBLATIN SMALL LETTER SHARP S\fR (U+00DF, ß)
+should match 'SS' in localised case-insensitive matching).
+.PP
+Making things worse is that UTF\-8 support was a later addition to the
+regex engine (as it was to perl) and this necessarily made things a lot
+more complicated. Obviously it is easier to design a regex engine with
+Unicode support in mind from the beginning than it is to retrofit it to
+one that wasn't.
+.PP
+Nearly all regops that involve looking at the input string have
+two cases, one for UTF\-8, and one not. In fact, it's often more complex
+than that, as the pattern may be UTF\-8 as well.
+.PP
+Care must be taken when making changes to make sure that you handle
+UTF\-8 properly, both at compile time and at execution time, including
+when the string and pattern are mismatched.
+.SS "Base Structures"
+.IX Subsection "Base Structures"
+The \f(CW\*(C`regexp\*(C'\fR structure described in perlreapi is common to all
+regex engines. Two of its fields are intended for the private use
+of the regex engine that compiled the pattern. These are the
+\&\f(CW\*(C`intflags\*(C'\fR and pprivate members. The \f(CW\*(C`pprivate\*(C'\fR is a void pointer to
+an arbitrary structure whose use and management is the responsibility
+of the compiling engine. perl will never modify either of these
+values. In the case of the stock engine the structure pointed to by
+\&\f(CW\*(C`pprivate\*(C'\fR is called \f(CW\*(C`regexp_internal\*(C'\fR.
+.PP
+Its \f(CW\*(C`pprivate\*(C'\fR and \f(CW\*(C`intflags\*(C'\fR fields contain data
+specific to each engine.
+.PP
+There are two structures used to store a compiled regular expression.
+One, the \f(CW\*(C`regexp\*(C'\fR structure described in perlreapi is populated by
+the engine currently being used and some of its fields read by perl to
+implement things such as the stringification of \f(CW\*(C`qr//\*(C'\fR.
+.PP
+The other structure is pointed to by the \f(CW\*(C`regexp\*(C'\fR struct's
+\&\f(CW\*(C`pprivate\*(C'\fR and is in addition to \f(CW\*(C`intflags\*(C'\fR in the same struct
+considered to be the property of the regex engine which compiled the
+regular expression;
+.PP
+The regexp structure contains all the data that perl needs to be aware of
+to properly work with the regular expression. It includes data about
+optimisations that perl can use to determine if the regex engine should
+really be used, and various other control info that is needed to properly
+execute patterns in various contexts such as is the pattern anchored in
+some way, or what flags were used during the compile, or whether the
+program contains special constructs that perl needs to be aware of.
+.PP
+In addition it contains two fields that are intended for the private use
+of the regex engine that compiled the pattern. These are the \f(CW\*(C`intflags\*(C'\fR
+and pprivate members. The \f(CW\*(C`pprivate\*(C'\fR is a void pointer to an arbitrary
+structure whose use and management is the responsibility of the compiling
+engine. perl will never modify either of these values.
+.PP
+As mentioned earlier, in the case of the default engines, the \f(CW\*(C`pprivate\*(C'\fR
+will be a pointer to a regexp_internal structure which holds the compiled
+program and any additional data that is private to the regex engine
+implementation.
+.PP
+\fIPerl's \fR\f(CI\*(C`pprivate\*(C'\fR\fI structure\fR
+.IX Subsection "Perl's pprivate structure"
+.PP
+The following structure is used as the \f(CW\*(C`pprivate\*(C'\fR struct by perl's
+regex engine. Since it is specific to perl it is only of curiosity
+value to other engine implementations.
+.PP
+.Vb 8
+\& typedef struct regexp_internal {
+\& regnode *regstclass;
+\& struct reg_data *data;
+\& struct reg_code_blocks *code_blocks;
+\& U32 proglen;
+\& U32 name_list_idx;
+\& regnode program[1];
+\& } regexp_internal;
+.Ve
+.PP
+Description of the attributes is as follows:
+.ie n .IP """regstclass""" 5
+.el .IP \f(CWregstclass\fR 5
+.IX Item "regstclass"
+Special regop that is used by \f(CWre_intuit_start()\fR to check if a pattern
+can match at a certain position. For instance if the regex engine knows
+that the pattern must start with a 'Z' then it can scan the string until
+it finds one and then launch the regex engine from there. The routine
+that handles this is called \f(CWfind_by_class()\fR. Sometimes this field
+points at a regop embedded in the program, and sometimes it points at
+an independent synthetic regop that has been constructed by the optimiser.
+.ie n .IP """data""" 5
+.el .IP \f(CWdata\fR 5
+.IX Item "data"
+This field points at a \f(CW\*(C`reg_data\*(C'\fR structure, which is defined as follows
+.Sp
+.Vb 5
+\& struct reg_data {
+\& U32 count;
+\& U8 *what;
+\& void* data[1];
+\& };
+.Ve
+.Sp
+This structure is used for handling data structures that the regex engine
+needs to handle specially during a clone or free operation on the compiled
+product. Each element in the data array has a corresponding element in the
+what array. During compilation regops that need special structures stored
+will add an element to each array using the \fBadd_data()\fR routine and then store
+the index in the regop.
+.Sp
+In modern perls the 0th element of this structure is reserved and is NEVER
+used to store anything of use. This is to allow things that need to index
+into this array to represent "no value".
+.ie n .IP """code_blocks""" 5
+.el .IP \f(CWcode_blocks\fR 5
+.IX Item "code_blocks"
+This optional structure is used to manage \f(CW\*(C`(?{})\*(C'\fR constructs in the
+pattern. It is made up of the following structures.
+.Sp
+.Vb 7
+\& /* record the position of a (?{...}) within a pattern */
+\& struct reg_code_block {
+\& STRLEN start;
+\& STRLEN end;
+\& OP *block;
+\& REGEXP *src_regex;
+\& };
+\&
+\& /* array of reg_code_block\*(Aqs plus header info */
+\& struct reg_code_blocks {
+\& int refcnt; /* we may be pointed to from a regex
+\& and from the savestack */
+\& int count; /* how many code blocks */
+\& struct reg_code_block *cb; /* array of reg_code_block\*(Aqs */
+\& };
+.Ve
+.ie n .IP """proglen""" 5
+.el .IP \f(CWproglen\fR 5
+.IX Item "proglen"
+Stores the length of the compiled program in units of regops.
+.ie n .IP """name_list_idx""" 5
+.el .IP \f(CWname_list_idx\fR 5
+.IX Item "name_list_idx"
+This is the index into the data array where an AV is stored that contains
+the names of any named capture buffers in the pattern, should there be
+any. This is only used in the debugging version of the regex engine and
+when RXp_PAREN_NAMES(prog) is true. It will be 0 if there is no such data.
+.ie n .IP """program""" 5
+.el .IP \f(CWprogram\fR 5
+.IX Item "program"
+Compiled program. Inlined into the structure so the entire struct can be
+treated as a single blob.
+.SH "SEE ALSO"
+.IX Header "SEE ALSO"
+perlreapi
+.PP
+perlre
+.PP
+perlunitut
+.SH AUTHOR
+.IX Header "AUTHOR"
+by Yves Orton, 2006.
+.PP
+With excerpts from Perl, and contributions and suggestions from
+Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus,
+Stephen McCamant, and David Landgren.
+.PP
+Now maintained by Perl 5 Porters.
+.SH LICENCE
+.IX Header "LICENCE"
+Same terms as Perl.
+.SH REFERENCES
+.IX Header "REFERENCES"
+[1] <https://perl.plover.com/Rx/paper/>
+.PP
+[2] <https://www.unicode.org/>