diff options
Diffstat (limited to 'upstream/fedora-40/man1/perlreguts.1')
-rw-r--r-- | upstream/fedora-40/man1/perlreguts.1 | 1092 |
1 files changed, 1092 insertions, 0 deletions
diff --git a/upstream/fedora-40/man1/perlreguts.1 b/upstream/fedora-40/man1/perlreguts.1 new file mode 100644 index 00000000..94308646 --- /dev/null +++ b/upstream/fedora-40/man1/perlreguts.1 @@ -0,0 +1,1092 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLREGUTS 1" +.TH PERLREGUTS 1 2024-01-25 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. +.if n .ad l +.nh +.SH NAME +perlreguts \- Description of the Perl regular expression engine. +.SH DESCRIPTION +.IX Header "DESCRIPTION" +This document is an attempt to shine some light on the guts of the regex +engine and how it works. The regex engine represents a significant chunk +of the perl codebase, but is relatively poorly understood. This document +is a meagre attempt at addressing this situation. It is derived from the +author's experience, comments in the source code, other papers on the +regex engine, feedback on the perl5\-porters mail list, and no doubt other +places as well. +.PP +\&\fBNOTICE!\fR It should be clearly understood that the behavior and +structures discussed in this represents the state of the engine as the +author understood it at the time of writing. It is \fBNOT\fR an API +definition, it is purely an internals guide for those who want to hack +the regex engine, or understand how the regex engine works. Readers of +this document are expected to understand perl's regex syntax and its +usage in detail. If you want to learn about the basics of Perl's +regular expressions, see perlre. And if you want to replace the +regex engine with your own, see perlreapi. +.SH OVERVIEW +.IX Header "OVERVIEW" +.SS "A quick note on terms" +.IX Subsection "A quick note on terms" +There is some debate as to whether to say "regexp" or "regex". In this +document we will use the term "regex" unless there is a special reason +not to, in which case we will explain why. +.PP +When speaking about regexes we need to distinguish between their source +code form and their internal form. In this document we will use the term +"pattern" when we speak of their textual, source code form, and the term +"program" when we speak of their internal representation. These +correspond to the terms \fIS\-regex\fR and \fIB\-regex\fR that Mark Jason +Dominus employs in his paper on "Rx" ([1] in "REFERENCES"). +.SS "What is a regular expression engine?" +.IX Subsection "What is a regular expression engine?" +A regular expression engine is a program that takes a set of constraints +specified in a mini-language, and then applies those constraints to a +target string, and determines whether or not the string satisfies the +constraints. See perlre for a full definition of the language. +.PP +In less grandiose terms, the first part of the job is to turn a pattern into +something the computer can efficiently use to find the matching point in +the string, and the second part is performing the search itself. +.PP +To do this we need to produce a program by parsing the text. We then +need to execute the program to find the point in the string that +matches. And we need to do the whole thing efficiently. +.SS "Structure of a Regexp Program" +.IX Subsection "Structure of a Regexp Program" +\fIHigh Level\fR +.IX Subsection "High Level" +.PP +Although it is a bit confusing and some people object to the terminology, it +is worth taking a look at a comment that has +been in \fIregexp.h\fR for years: +.PP +\&\fIThis is essentially a linear encoding of a nondeterministic +finite-state machine (aka syntax charts or "railroad normal form" in +parsing technology).\fR +.PP +The term "railroad normal form" is a bit esoteric, with "syntax +diagram/charts", or "railroad diagram/charts" being more common terms. +Nevertheless it provides a useful mental image of a regex program: each +node can be thought of as a unit of track, with a single entry and in +most cases a single exit point (there are pieces of track that fork, but +statistically not many), and the whole forms a layout with a +single entry and single exit point. The matching process can be thought +of as a car that moves along the track, with the particular route through +the system being determined by the character read at each possible +connector point. A car can fall off the track at any point but it may +only proceed as long as it matches the track. +.PP +Thus the pattern \f(CW\*(C`/foo(?:\ew+|\ed+|\es+)bar/\*(C'\fR can be thought of as the +following chart: +.PP +.Vb 10 +\& [start] +\& | +\& <foo> +\& | +\& +\-\-\-\-\-+\-\-\-\-\-+ +\& | | | +\& <\ew+> <\ed+> <\es+> +\& | | | +\& +\-\-\-\-\-+\-\-\-\-\-+ +\& | +\& <bar> +\& | +\& [end] +.Ve +.PP +The truth of the matter is that perl's regular expressions these days are +much more complex than this kind of structure, but visualising it this way +can help when trying to get your bearings, and it matches the +current implementation pretty closely. +.PP +To be more precise, we will say that a regex program is an encoding +of a graph. Each node in the graph corresponds to part of +the original regex pattern, such as a literal string or a branch, +and has a pointer to the nodes representing the next component +to be matched. Since "node" and "opcode" already have other meanings in the +perl source, we will call the nodes in a regex program "regops". +.PP +The program is represented by an array of \f(CW\*(C`regnode\*(C'\fR structures, one or +more of which represent a single regop of the program. Struct +\&\f(CW\*(C`regnode\*(C'\fR is the smallest struct needed, and has a field structure which is +shared with all the other larger structures. (Outside this document, the term +"regnode" is sometimes used to mean "regop", which could be confusing.) +.PP +The "next" pointers of all regops except \f(CW\*(C`BRANCH\*(C'\fR implement concatenation; +a "next" pointer with a \f(CW\*(C`BRANCH\*(C'\fR on both ends of it is connecting two +alternatives. [Here we have one of the subtle syntax dependencies: an +individual \f(CW\*(C`BRANCH\*(C'\fR (as opposed to a collection of them) is never +concatenated with anything because of operator precedence.] +.PP +The operand of some types of regop is a literal string; for others, +it is a regop leading into a sub-program. In particular, the operand +of a \f(CW\*(C`BRANCH\*(C'\fR node is the first regop of the branch. +.PP +\&\fBNOTE\fR: As the railroad metaphor suggests, this is \fBnot\fR a tree +structure: the tail of the branch connects to the thing following the +set of \f(CW\*(C`BRANCH\*(C'\fRes. It is a like a single line of railway track that +splits as it goes into a station or railway yard and rejoins as it comes +out the other side. +.PP +\fIRegops\fR +.IX Subsection "Regops" +.PP +The base structure of a regop is defined in \fIregexp.h\fR as follows: +.PP +.Vb 5 +\& struct regnode { +\& U8 flags; /* Various purposes, sometimes overridden */ +\& U8 type; /* Opcode value as specified by regnodes.h */ +\& U16 next_off; /* Offset in size regnode */ +\& }; +.Ve +.PP +Other larger \f(CW\*(C`regnode\*(C'\fR\-like structures are defined in \fIregcomp.h\fR. They +are almost like subclasses in that they have the same fields as +\&\f(CW\*(C`regnode\*(C'\fR, with possibly additional fields following in +the structure, and in some cases the specific meaning (and name) +of some of base fields are overridden. The following is a more +complete description. +.ie n .IP """regnode_1""" 4 +.el .IP \f(CWregnode_1\fR 4 +.IX Item "regnode_1" +.PD 0 +.ie n .IP """regnode_2""" 4 +.el .IP \f(CWregnode_2\fR 4 +.IX Item "regnode_2" +.PD +\&\f(CW\*(C`regnode_1\*(C'\fR structures have the same header, followed by a single +four-byte argument; \f(CW\*(C`regnode_2\*(C'\fR structures contain two two-byte +arguments instead: +.Sp +.Vb 2 +\& regnode_1 U32 arg1; +\& regnode_2 U16 arg1; U16 arg2; +.Ve +.ie n .IP """regnode_string""" 4 +.el .IP \f(CWregnode_string\fR 4 +.IX Item "regnode_string" +\&\f(CW\*(C`regnode_string\*(C'\fR structures, used for literal strings, follow the header +with a one-byte length and then the string data. Strings are padded on +the tail end with zero bytes so that the total length of the node is a +multiple of four bytes: +.Sp +.Vb 2 +\& regnode_string char string[1]; +\& U8 str_len; /* overrides flags */ +.Ve +.ie n .IP """regnode_charclass""" 4 +.el .IP \f(CWregnode_charclass\fR 4 +.IX Item "regnode_charclass" +Bracketed character classes are represented by \f(CW\*(C`regnode_charclass\*(C'\fR +structures, which have a four-byte argument and then a 32\-byte (256\-bit) +bitmap indicating which characters in the Latin1 range are included in +the class. +.Sp +.Vb 2 +\& regnode_charclass U32 arg1; +\& char bitmap[ANYOF_BITMAP_SIZE]; +.Ve +.Sp +Various flags whose names begin with \f(CW\*(C`ANYOF_\*(C'\fR are used for special +situations. Above Latin1 matches and things not known until run-time +are stored in "Perl's pprivate structure". +.ie n .IP """regnode_charclass_posixl""" 4 +.el .IP \f(CWregnode_charclass_posixl\fR 4 +.IX Item "regnode_charclass_posixl" +There is also a larger form of a char class structure used to represent +POSIX char classes under \f(CW\*(C`/l\*(C'\fR matching, +called \f(CW\*(C`regnode_charclass_posixl\*(C'\fR which has an +additional 32\-bit bitmap indicating which POSIX char classes +have been included. +.Sp +.Vb 3 +\& regnode_charclass_posixl U32 arg1; +\& char bitmap[ANYOF_BITMAP_SIZE]; +\& U32 classflags; +.Ve +.PP +\&\fIregnodes.h\fR defines an array called \f(CW\*(C`PL_regnode_arg_len[]\*(C'\fR which gives the size +of each opcode in units of \f(CW\*(C`size regnode\*(C'\fR (4\-byte). A macro is used +to calculate the size of an \f(CW\*(C`EXACT\*(C'\fR node based on its \f(CW\*(C`str_len\*(C'\fR field. +.PP +The regops are defined in \fIregnodes.h\fR which is generated from +\&\fIregcomp.sym\fR by \fIregcomp.pl\fR. Currently the maximum possible number +of distinct regops is restricted to 256, with about a quarter already +used. +.PP +A set of macros makes accessing the fields +easier and more consistent. These include \f(CWOP()\fR, which is used to determine +the type of a \f(CW\*(C`regnode\*(C'\fR\-like structure; \f(CWNEXT_OFF()\fR, which is the offset to +the next node (more on this later); \f(CWARG()\fR, \f(CWARG1()\fR, \f(CWARG2()\fR, \f(CWARG_SET()\fR, +and equivalents for reading and setting the arguments; and \f(CWSTR_LEN()\fR, +\&\f(CWSTRING()\fR and \f(CWOPERAND()\fR for manipulating strings and regop bearing +types. +.PP +\fIWhat regnode is next?\fR +.IX Subsection "What regnode is next?" +.PP +There are two distinct concepts of "next regnode" in the regex engine, +and it is important to keep them distinct in your thinking as they +overlap conceptually in many places, but where they don't overlap the +difference is critical. For the majority of regnode types the two +concepts are (nearly) identical in practice. The two types are +\&\f(CW\*(C`REGNODE_AFTER\*(C'\fR which is used heavily during compilation but only +occasionally during execution and \f(CW\*(C`regnext\*(C'\fR which is used heavily +during execution, and only occasionally during compilation. +.IP """REGNODE_AFTER""" 4 +.IX Item """REGNODE_AFTER""" +This is the "positionally next regnode" in the compiled regex program. +For the smaller regnode types it is \f(CW\*(C`regnode_ptr+1\*(C'\fR under the hood, but +as regnode sizes vary and can change over time we offer macros which +hide the gory details. +.Sp +It is heavily used in the compiler phase but is only used by a few +select regnode types in the execution phase. It is also heavily used in +the code for dumping the regexp program for debugging. +.Sp +There are a selection of macros which can be used to compute this as +efficiently as possible depending on the circumstances. The canonical +macro is \f(CWREGNODE_AFTER()\fR, which is the most powerful and should handle +any case we have, but is also potentially the slowest. There are two +additional macros for the special case that you KNOW the current regnode +size is constant, and you know its type or opcode. In which case you can +use \f(CWREGNODE_AFTER_opcode()\fR or \f(CWREGNODE_AFTER_type()\fR. +.Sp +In older versions of the regex engine \f(CWREGNODE_AFTER()\fR was called +\&\f(CW\*(C`NEXTOPER\*(C'\fR but this was found to be confusing and it was renamed. There +is also a \f(CWREGNODE_BEFORE()\fR, but it is unsafe and should not be used +in new code. +.IP """regnext""" 4 +.IX Item """regnext""" +This is the regnode which can be reached by jumping forward by the value +of the \f(CWNEXT_OFF()\fR member of the regnode, or in a few cases for longer +jumps by the \f(CW\*(C`arg1\*(C'\fR field of the \f(CW\*(C`regnode_1\*(C'\fR structure. The subroutine +\&\f(CWregnext()\fR handles this transparently. In the majority of cases the +\&\f(CW\*(C`regnext\*(C'\fR for a regnode is the regnode which should be executed after the +current one has successfully matched, but in some cases this may not be +true. In loop control and branch control regnode types the regnext may +signify something special, for BRANCH nodes \f(CW\*(C`regnext\*(C'\fR is the +next BRANCH that should be executed if the current one fails execution, +and some loop control regnodes set the regnext to be the end of the loop +so they can jump to their cleanup if the current iteration fails to match. +.PP +Most regnode types do not create a branch in the execution flow, and +leaving aside optimizations the two concepts of "next" are the same. +For instance the \f(CW\*(C`regnext\*(C'\fR and \f(CW\*(C`REGNODE_AFTER\*(C'\fR of a SBOL opcode are +the same during compilation phase. The main place this is not true is +\&\f(CW\*(C`BRANCH\*(C'\fR regnodes where the \f(CW\*(C`REGNODE_AFTER\*(C'\fR represents the start of +the pattern in the branch and the \f(CW\*(C`regnext\*(C'\fR represents the linkage to +the next BRANCH should this one fail to match, or 0 if it is the last +branch. The looping logic for quantifiers also makes similar use of +the distinction between the two types, with \f(CW\*(C`REGNODE_AFTER\*(C'\fR being the +inside of the loop construct, and the \f(CW\*(C`regnext\*(C'\fR pointing at the end +of the loop. +.PP +During compilation the engine may not know what the regnext is for a +given node, so during compilation \f(CW\*(C`regnext\*(C'\fR is only used where it must +be used and is known to be correct. At the very end of the compilation +phase we walk the regex program and correct the regnext data as +appropriate, and also perform various optimizations which may result in +regnodes that were required during construction becoming redundant, or +we may replace a large regnode with a much smaller one and filling in the +gap with OPTIMIZED regnodes. Thus we might start with something like +this: +.PP +.Vb 5 +\& BRANCH +\& EXACT "foo" +\& BRANCH +\& EXACT "bar" +\& EXACT "!" +.Ve +.PP +and replace it with something like: +.PP +.Vb 5 +\& TRIE foo|bar +\& OPTIMIZED +\& OPTIMIZED +\& OPTIMIZED +\& EXACT "!" +.Ve +.PP +the \f(CW\*(C`REGNODE_AFTER\*(C'\fR for the \f(CW\*(C`TRIE\*(C'\fR node would be an \f(CW\*(C`OPTIMIZED\*(C'\fR +regnode, and in theory the \f(CW\*(C`regnext\*(C'\fR would be the same as the +\&\f(CW\*(C`REGNODE_AFTER\*(C'\fR. But it would be inefficient to execute the OPTIMIZED +regnode as a noop three times, so the optimizer fixes the \f(CW\*(C`regnext\*(C'\fR so +such nodes are skipped during execution phase. +.PP +During execution phases we use the \f(CWregnext()\fR almost exclusively, and +only use \f(CW\*(C`REGNODE_AFTER\*(C'\fR in special cases where it has a well defined +meaning for a given regnode type. For instance /x+/ results in +.PP +.Vb 3 +\& PLUS +\& EXACT "x" +\& END +.Ve +.PP +the \f(CW\*(C`regnext\*(C'\fR of the \f(CW\*(C`PLUS\*(C'\fR regnode is the \f(CW\*(C`END\*(C'\fR regnode, and the +\&\f(CW\*(C`REGNODE_AFTER\*(C'\fR of the \f(CW\*(C`PLUS\*(C'\fR regnode is the \f(CW\*(C`EXACT\*(C'\fR regnode. The +\&\f(CW\*(C`regnext\*(C'\fR and \f(CW\*(C`REGNODE_AFTER\*(C'\fR of the \f(CW\*(C`EXACT\*(C'\fR regnode is the +\&\f(CW\*(C`END\*(C'\fR regnode. +.SH "Process Overview" +.IX Header "Process Overview" +Broadly speaking, performing a match of a string against a pattern +involves the following steps: +.IP "A. Compilation" 5 +.IX Item "A. Compilation" +.RS 5 +.PD 0 +.IP "1. Parsing" 5 +.IX Item "1. Parsing" +.IP "2. Peep-hole optimisation and analysis" 5 +.IX Item "2. Peep-hole optimisation and analysis" +.RE +.RS 5 +.RE +.IP "B. Execution" 5 +.IX Item "B. Execution" +.RS 5 +.IP "3. Start position and no-match optimisations" 5 +.IX Item "3. Start position and no-match optimisations" +.IP "4. Program execution" 5 +.IX Item "4. Program execution" +.RE +.RS 5 +.RE +.PD +.PP +Where these steps occur in the actual execution of a perl program is +determined by whether the pattern involves interpolating any string +variables. If interpolation occurs, then compilation happens at run time. If it +does not, then compilation is performed at compile time. (The \f(CW\*(C`/o\*(C'\fR modifier changes this, +as does \f(CW\*(C`qr//\*(C'\fR to a certain extent.) The engine doesn't really care that +much. +.SS Compilation +.IX Subsection "Compilation" +This code resides primarily in \fIregcomp.c\fR, along with the header files +\&\fIregcomp.h\fR, \fIregexp.h\fR and \fIregnodes.h\fR. +.PP +Compilation starts with \f(CWpregcomp()\fR, which is mostly an initialisation +wrapper which farms work out to two other routines for the heavy lifting: the +first is \f(CWreg()\fR, which is the start point for parsing; the second, +\&\f(CWstudy_chunk()\fR, is responsible for optimisation. +.PP +Initialisation in \f(CWpregcomp()\fR mostly involves the creation and data-filling +of a special structure, \f(CW\*(C`RExC_state_t\*(C'\fR (defined in \fIregcomp.c\fR). +Almost all internally-used routines in \fIregcomp.h\fR take a pointer to one +of these structures as their first argument, with the name \f(CW\*(C`pRExC_state\*(C'\fR. +This structure is used to store the compilation state and contains many +fields. Likewise there are many macros which operate on this +variable: anything that looks like \f(CW\*(C`RExC_xxxx\*(C'\fR is a macro that operates on +this pointer/structure. +.PP +\&\f(CWreg()\fR is the start of the parse process. It is responsible for +parsing an arbitrary chunk of pattern up to either the end of the +string, or the first closing parenthesis it encounters in the pattern. +This means it can be used to parse the top-level regex, or any section +inside of a grouping parenthesis. It also handles the "special parens" +that perl's regexes have. For instance when parsing \f(CW\*(C`/x(?:foo)y/\*(C'\fR, +\&\f(CWreg()\fR will at one point be called to parse from the "?" symbol up to +and including the ")". +.PP +Additionally, \f(CWreg()\fR is responsible for parsing the one or more +branches from the pattern, and for "finishing them off" by correctly +setting their next pointers. In order to do the parsing, it repeatedly +calls out to \f(CWregbranch()\fR, which is responsible for handling up to the +first \f(CW\*(C`|\*(C'\fR symbol it sees. +.PP +\&\f(CWregbranch()\fR in turn calls \f(CWregpiece()\fR which +handles "things" followed by a quantifier. In order to parse the +"things", \f(CWregatom()\fR is called. This is the lowest level routine, which +parses out constant strings, character classes, and the +various special symbols like \f(CW\*(C`$\*(C'\fR. If \f(CWregatom()\fR encounters a "(" +character it in turn calls \f(CWreg()\fR. +.PP +There used to be two main passes involved in parsing, the first to +calculate the size of the compiled program, and the second to actually +compile it. But now there is only one main pass, with an initial crude +guess based on the length of the input pattern, which is increased if +necessary as parsing proceeds, and afterwards, trimmed to the actual +amount used. +.PP +However, it may happen that parsing must be restarted at the beginning +when various circumstances occur along the way. An example is if the +program turns out to be so large that there are jumps in it that won't +fit in the normal 16 bits available. There are two special regops that +can hold bigger jump destinations, BRANCHJ and LONGBRANCH. The parse is +restarted, and these are used instead of the normal shorter ones. +Whenever restarting the parse is required, the function returns failure +and sets a flag as to what needs to be done. This is passed up to the +top level routine which takes the appropriate action and restarts from +scratch. In the case of needing longer jumps, the \f(CW\*(C`RExC_use_BRANCHJ\*(C'\fR +flag is set in the \f(CW\*(C`RExC_state_t\*(C'\fR structure, which the functions know +to inspect before deciding how to do branches. +.PP +In most instances, the function that discovers the issue sets the causal +flag and returns failure immediately. "Parsing complications" +contains an explicit example of how this works. In other cases, such as +a forward reference to a numbered parenthetical grouping, we need to +finish the parse to know if that numbered grouping actually appears in +the pattern. In those cases, the parse is just redone at the end, with +the knowledge of how many groupings occur in it. +.PP +The routine \f(CWregtail()\fR is called by both \f(CWreg()\fR and \f(CWregbranch()\fR +in order to "set the tail pointer" correctly. When executing and +we get to the end of a branch, we need to go to the node following the +grouping parens. When parsing, however, we don't know where the end will +be until we get there, so when we do we must go back and update the +offsets as appropriate. \f(CW\*(C`regtail\*(C'\fR is used to make this easier. +.PP +A subtlety of the parsing process means that a regex like \f(CW\*(C`/foo/\*(C'\fR is +originally parsed into an alternation with a single branch. It is only +afterwards that the optimiser converts single branch alternations into the +simpler form. +.PP +\fIParse Call Graph and a Grammar\fR +.IX Subsection "Parse Call Graph and a Grammar" +.PP +The call graph looks like this: +.PP +.Vb 10 +\& reg() # parse a top level regex, or inside of +\& # parens +\& regbranch() # parse a single branch of an alternation +\& regpiece() # parse a pattern followed by a quantifier +\& regatom() # parse a simple pattern +\& regclass() # used to handle a class +\& reg() # used to handle a parenthesised +\& # subpattern +\& .... +\& ... +\& regtail() # finish off the branch +\& ... +\& regtail() # finish off the branch sequence. Tie each +\& # branch\*(Aqs tail to the tail of the +\& # sequence +\& # (NEW) In Debug mode this is +\& # regtail_study(). +.Ve +.PP +A grammar form might be something like this: +.PP +.Vb 11 +\& atom : constant | class +\& quant : \*(Aq*\*(Aq | \*(Aq+\*(Aq | \*(Aq?\*(Aq | \*(Aq{min,max}\*(Aq +\& _branch: piece +\& | piece _branch +\& | nothing +\& branch: _branch +\& | _branch \*(Aq|\*(Aq branch +\& group : \*(Aq(\*(Aq branch \*(Aq)\*(Aq +\& _piece: atom | group +\& piece : _piece +\& | _piece quant +.Ve +.PP +\fIParsing complications\fR +.IX Subsection "Parsing complications" +.PP +The implication of the above description is that a pattern containing nested +parentheses will result in a call graph which cycles through \f(CWreg()\fR, +\&\f(CWregbranch()\fR, \f(CWregpiece()\fR, \f(CWregatom()\fR, \f(CWreg()\fR, \f(CWregbranch()\fR \fIetc\fR +multiple times, until the deepest level of nesting is reached. All the above +routines return a pointer to a \f(CW\*(C`regnode\*(C'\fR, which is usually the last regnode +added to the program. However, one complication is that \fBreg()\fR returns NULL +for parsing \f(CW\*(C`(?:)\*(C'\fR syntax for embedded modifiers, setting the flag +\&\f(CW\*(C`TRYAGAIN\*(C'\fR. The \f(CW\*(C`TRYAGAIN\*(C'\fR propagates upwards until it is captured, in +some cases by \f(CWregatom()\fR, but otherwise unconditionally by +\&\f(CWregbranch()\fR. Hence it will never be returned by \f(CWregbranch()\fR to +\&\f(CWreg()\fR. This flag permits patterns such as \f(CW\*(C`(?i)+\*(C'\fR to be detected as +errors (\fIQuantifier follows nothing in regex; marked by <\-\- HERE in m/(?i)+ +<\-\- HERE /\fR). +.PP +Another complication is that the representation used for the program differs +if it needs to store Unicode, but it's not always possible to know for sure +whether it does until midway through parsing. The Unicode representation for +the program is larger, and cannot be matched as efficiently. (See "Unicode +and Localisation Support" below for more details as to why.) If the pattern +contains literal Unicode, it's obvious that the program needs to store +Unicode. Otherwise, the parser optimistically assumes that the more +efficient representation can be used, and starts sizing on this basis. +However, if it then encounters something in the pattern which must be stored +as Unicode, such as an \f(CW\*(C`\ex{...}\*(C'\fR escape sequence representing a character +literal, then this means that all previously calculated sizes need to be +redone, using values appropriate for the Unicode representation. This +is another instance where the parsing needs to be restarted, and it can +and is done immediately. The function returns failure, and sets the +flag \f(CW\*(C`RESTART_UTF8\*(C'\fR (encapsulated by using the macro \f(CW\*(C`REQUIRE_UTF8\*(C'\fR). +This restart request is propagated up the call chain in a similar +fashion, until it is "caught" in \f(CWPerl_re_op_compile()\fR, which marks +the pattern as containing Unicode, and restarts the sizing pass. It is +also possible for constructions within run-time code blocks to turn out +to need Unicode representation., which is signalled by +\&\f(CWS_compile_runtime_code()\fR returning false to \f(CWPerl_re_op_compile()\fR. +.PP +The restart was previously implemented using a \f(CW\*(C`longjmp\*(C'\fR in \f(CWregatom()\fR +back to a \f(CW\*(C`setjmp\*(C'\fR in \f(CWPerl_re_op_compile()\fR, but this proved to be +problematic as the latter is a large function containing many automatic +variables, which interact badly with the emergent control flow of \f(CW\*(C`setjmp\*(C'\fR. +.PP +\fIDebug Output\fR +.IX Subsection "Debug Output" +.PP +Starting in the 5.9.x development version of perl you can \f(CW\*(C`use re +Debug => \*(AqPARSE\*(Aq\*(C'\fR to see some trace information about the parse +process. We will start with some simple patterns and build up to more +complex patterns. +.PP +So when we parse \f(CW\*(C`/foo/\*(C'\fR we see something like the following table. The +left shows what is being parsed, and the number indicates where the next regop +would go. The stuff on the right is the trace output of the graph. The +names are chosen to be short to make it less dense on the screen. 'tsdy' +is a special form of \f(CWregtail()\fR which does some extra analysis. +.PP +.Vb 6 +\& >foo< 1 reg +\& brnc +\& piec +\& atom +\& >< 4 tsdy~ EXACT <foo> (EXACT) (1) +\& ~ attach to END (3) offset to 2 +.Ve +.PP +The resulting program then looks like: +.PP +.Vb 2 +\& 1: EXACT <foo>(3) +\& 3: END(0) +.Ve +.PP +As you can see, even though we parsed out a branch and a piece, it was ultimately +only an atom. The final program shows us how things work. We have an \f(CW\*(C`EXACT\*(C'\fR regop, +followed by an \f(CW\*(C`END\*(C'\fR regop. The number in parens indicates where the \f(CW\*(C`regnext\*(C'\fR of +the node goes. The \f(CW\*(C`regnext\*(C'\fR of an \f(CW\*(C`END\*(C'\fR regop is unused, as \f(CW\*(C`END\*(C'\fR regops mean +we have successfully matched. The number on the left indicates the position of +the regop in the regnode array. +.PP +Now let's try a harder pattern. We will add a quantifier, so now we have the pattern +\&\f(CW\*(C`/foo+/\*(C'\fR. We will see that \f(CWregbranch()\fR calls \f(CWregpiece()\fR twice. +.PP +.Vb 10 +\& >foo+< 1 reg +\& brnc +\& piec +\& atom +\& >o+< 3 piec +\& atom +\& >< 6 tail~ EXACT <fo> (1) +\& 7 tsdy~ EXACT <fo> (EXACT) (1) +\& ~ PLUS (END) (3) +\& ~ attach to END (6) offset to 3 +.Ve +.PP +And we end up with the program: +.PP +.Vb 4 +\& 1: EXACT <fo>(3) +\& 3: PLUS(6) +\& 4: EXACT <o>(0) +\& 6: END(0) +.Ve +.PP +Now we have a special case. The \f(CW\*(C`EXACT\*(C'\fR regop has a \f(CW\*(C`regnext\*(C'\fR of 0. This is +because if it matches it should try to match itself again. The \f(CW\*(C`PLUS\*(C'\fR regop +handles the actual failure of the \f(CW\*(C`EXACT\*(C'\fR regop and acts appropriately (going +to regnode 6 if the \f(CW\*(C`EXACT\*(C'\fR matched at least once, or failing if it didn't). +.PP +Now for something much more complex: \f(CW\*(C`/x(?:foo*|b[a][rR])(foo|bar)$/\*(C'\fR +.PP +.Vb 10 +\& >x(?:foo*|b... 1 reg +\& brnc +\& piec +\& atom +\& >(?:foo*|b[... 3 piec +\& atom +\& >?:foo*|b[a... reg +\& >foo*|b[a][... brnc +\& piec +\& atom +\& >o*|b[a][rR... 5 piec +\& atom +\& >|b[a][rR])... 8 tail~ EXACT <fo> (3) +\& >b[a][rR])(... 9 brnc +\& 10 piec +\& atom +\& >[a][rR])(f... 12 piec +\& atom +\& >a][rR])(fo... clas +\& >[rR])(foo|... 14 tail~ EXACT <b> (10) +\& piec +\& atom +\& >rR])(foo|b... clas +\& >)(foo|bar)... 25 tail~ EXACT <a> (12) +\& tail~ BRANCH (3) +\& 26 tsdy~ BRANCH (END) (9) +\& ~ attach to TAIL (25) offset to 16 +\& tsdy~ EXACT <fo> (EXACT) (4) +\& ~ STAR (END) (6) +\& ~ attach to TAIL (25) offset to 19 +\& tsdy~ EXACT <b> (EXACT) (10) +\& ~ EXACT <a> (EXACT) (12) +\& ~ ANYOF[Rr] (END) (14) +\& ~ attach to TAIL (25) offset to 11 +\& >(foo|bar)$< tail~ EXACT <x> (1) +\& piec +\& atom +\& >foo|bar)$< reg +\& 28 brnc +\& piec +\& atom +\& >|bar)$< 31 tail~ OPEN1 (26) +\& >bar)$< brnc +\& 32 piec +\& atom +\& >)$< 34 tail~ BRANCH (28) +\& 36 tsdy~ BRANCH (END) (31) +\& ~ attach to CLOSE1 (34) offset to 3 +\& tsdy~ EXACT <foo> (EXACT) (29) +\& ~ attach to CLOSE1 (34) offset to 5 +\& tsdy~ EXACT <bar> (EXACT) (32) +\& ~ attach to CLOSE1 (34) offset to 2 +\& >$< tail~ BRANCH (3) +\& ~ BRANCH (9) +\& ~ TAIL (25) +\& piec +\& atom +\& >< 37 tail~ OPEN1 (26) +\& ~ BRANCH (28) +\& ~ BRANCH (31) +\& ~ CLOSE1 (34) +\& 38 tsdy~ EXACT <x> (EXACT) (1) +\& ~ BRANCH (END) (3) +\& ~ BRANCH (END) (9) +\& ~ TAIL (END) (25) +\& ~ OPEN1 (END) (26) +\& ~ BRANCH (END) (28) +\& ~ BRANCH (END) (31) +\& ~ CLOSE1 (END) (34) +\& ~ EOL (END) (36) +\& ~ attach to END (37) offset to 1 +.Ve +.PP +Resulting in the program +.PP +.Vb 10 +\& 1: EXACT <x>(3) +\& 3: BRANCH(9) +\& 4: EXACT <fo>(6) +\& 6: STAR(26) +\& 7: EXACT <o>(0) +\& 9: BRANCH(25) +\& 10: EXACT <ba>(14) +\& 12: OPTIMIZED (2 nodes) +\& 14: ANYOF[Rr](26) +\& 25: TAIL(26) +\& 26: OPEN1(28) +\& 28: TRIE\-EXACT(34) +\& [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf] +\& <foo> +\& <bar> +\& 30: OPTIMIZED (4 nodes) +\& 34: CLOSE1(36) +\& 36: EOL(37) +\& 37: END(0) +.Ve +.PP +Here we can see a much more complex program, with various optimisations in +play. At regnode 10 we see an example where a character class with only +one character in it was turned into an \f(CW\*(C`EXACT\*(C'\fR node. We can also see where +an entire alternation was turned into a \f(CW\*(C`TRIE\-EXACT\*(C'\fR node. As a consequence, +some of the regnodes have been marked as optimised away. We can see that +the \f(CW\*(C`$\*(C'\fR symbol has been converted into an \f(CW\*(C`EOL\*(C'\fR regop, a special piece of +code that looks for \f(CW\*(C`\en\*(C'\fR or the end of the string. +.PP +The next pointer for \f(CW\*(C`BRANCH\*(C'\fRes is interesting in that it points at where +execution should go if the branch fails. When executing, if the engine +tries to traverse from a branch to a \f(CW\*(C`regnext\*(C'\fR that isn't a branch then +the engine will know that the entire set of branches has failed. +.PP +\fIPeep-hole Optimisation and Analysis\fR +.IX Subsection "Peep-hole Optimisation and Analysis" +.PP +The regular expression engine can be a weighty tool to wield. On long +strings and complex patterns it can end up having to do a lot of work +to find a match, and even more to decide that no match is possible. +Consider a situation like the following pattern. +.PP +.Vb 1 +\& \*(Aqababababababababababab\*(Aq =~ /(a|b)*z/ +.Ve +.PP +The \f(CW\*(C`(a|b)*\*(C'\fR part can match at every char in the string, and then fail +every time because there is no \f(CW\*(C`z\*(C'\fR in the string. So obviously we can +avoid using the regex engine unless there is a \f(CW\*(C`z\*(C'\fR in the string. +Likewise in a pattern like: +.PP +.Vb 1 +\& /foo(\ew+)bar/ +.Ve +.PP +In this case we know that the string must contain a \f(CW\*(C`foo\*(C'\fR which must be +followed by \f(CW\*(C`bar\*(C'\fR. We can use Fast Boyer-Moore matching as implemented +in \f(CWfbm_instr()\fR to find the location of these strings. If they don't exist +then we don't need to resort to the much more expensive regex engine. +Even better, if they do exist then we can use their positions to +reduce the search space that the regex engine needs to cover to determine +if the entire pattern matches. +.PP +There are various aspects of the pattern that can be used to facilitate +optimisations along these lines: +.IP \(bu 5 +anchored fixed strings +.IP \(bu 5 +floating fixed strings +.IP \(bu 5 +minimum and maximum length requirements +.IP \(bu 5 +start class +.IP \(bu 5 +Beginning/End of line positions +.PP +Another form of optimisation that can occur is the post-parse "peep-hole" +optimisation, where inefficient constructs are replaced by more efficient +constructs. The \f(CW\*(C`TAIL\*(C'\fR regops which are used during parsing to mark the end +of branches and the end of groups are examples of this. These regops are used +as place-holders during construction and "always match" so they can be +"optimised away" by making the things that point to the \f(CW\*(C`TAIL\*(C'\fR point to the +thing that \f(CW\*(C`TAIL\*(C'\fR points to, thus "skipping" the node. +.PP +Another optimisation that can occur is that of "\f(CW\*(C`EXACT\*(C'\fR merging" which is +where two consecutive \f(CW\*(C`EXACT\*(C'\fR nodes are merged into a single +regop. An even more aggressive form of this is that a branch +sequence of the form \f(CW\*(C`EXACT BRANCH ... EXACT\*(C'\fR can be converted into a +\&\f(CW\*(C`TRIE\-EXACT\*(C'\fR regop. +.PP +All of this occurs in the routine \f(CWstudy_chunk()\fR which uses a special +structure \f(CW\*(C`scan_data_t\*(C'\fR to store the analysis that it has performed, and +does the "peep-hole" optimisations as it goes. +.PP +The code involved in \f(CWstudy_chunk()\fR is extremely cryptic. Be careful. :\-) +.SS Execution +.IX Subsection "Execution" +Execution of a regex generally involves two phases, the first being +finding the start point in the string where we should match from, +and the second being running the regop interpreter. +.PP +If we can tell that there is no valid start point then we don't bother running +the interpreter at all. Likewise, if we know from the analysis phase that we +cannot detect a short-cut to the start position, we go straight to the +interpreter. +.PP +The two entry points are \f(CWre_intuit_start()\fR and \f(CWpregexec()\fR. These routines +have a somewhat incestuous relationship with overlap between their functions, +and \f(CWpregexec()\fR may even call \f(CWre_intuit_start()\fR on its own. Nevertheless +other parts of the perl source code may call into either, or both. +.PP +Execution of the interpreter itself used to be recursive, but thanks to the +efforts of Dave Mitchell in the 5.9.x development track, that has changed: now an +internal stack is maintained on the heap and the routine is fully +iterative. This can make it tricky as the code is quite conservative +about what state it stores, with the result that two consecutive lines in the +code can actually be running in totally different contexts due to the +simulated recursion. +.PP +\fIStart position and no-match optimisations\fR +.IX Subsection "Start position and no-match optimisations" +.PP +\&\f(CWre_intuit_start()\fR is responsible for handling start points and no-match +optimisations as determined by the results of the analysis done by +\&\f(CWstudy_chunk()\fR (and described in "Peep-hole Optimisation and Analysis"). +.PP +The basic structure of this routine is to try to find the start\- and/or +end-points of where the pattern could match, and to ensure that the string +is long enough to match the pattern. It tries to use more efficient +methods over less efficient methods and may involve considerable +cross-checking of constraints to find the place in the string that matches. +For instance it may try to determine that a given fixed string must be +not only present but a certain number of chars before the end of the +string, or whatever. +.PP +It calls several other routines, such as \f(CWfbm_instr()\fR which does +Fast Boyer Moore matching and \f(CWfind_byclass()\fR which is responsible for +finding the start using the first mandatory regop in the program. +.PP +When the optimisation criteria have been satisfied, \f(CWreg_try()\fR is called +to perform the match. +.PP +\fIProgram execution\fR +.IX Subsection "Program execution" +.PP +\&\f(CWpregexec()\fR is the main entry point for running a regex. It contains +support for initialising the regex interpreter's state, running +\&\f(CWre_intuit_start()\fR if needed, and running the interpreter on the string +from various start positions as needed. When it is necessary to use +the regex interpreter \f(CWpregexec()\fR calls \f(CWregtry()\fR. +.PP +\&\f(CWregtry()\fR is the entry point into the regex interpreter. It expects +as arguments a pointer to a \f(CW\*(C`regmatch_info\*(C'\fR structure and a pointer to +a string. It returns an integer 1 for success and a 0 for failure. +It is basically a set-up wrapper around \f(CWregmatch()\fR. +.PP +\&\f(CW\*(C`regmatch\*(C'\fR is the main "recursive loop" of the interpreter. It is +basically a giant switch statement that implements a state machine, where +the possible states are the regops themselves, plus a number of additional +intermediate and failure states. A few of the states are implemented as +subroutines but the bulk are inline code. +.SH MISCELLANEOUS +.IX Header "MISCELLANEOUS" +.SS "Unicode and Localisation Support" +.IX Subsection "Unicode and Localisation Support" +When dealing with strings containing characters that cannot be represented +using an eight-bit character set, perl uses an internal representation +that is a permissive version of Unicode's UTF\-8 encoding[2]. This uses single +bytes to represent characters from the ASCII character set, and sequences +of two or more bytes for all other characters. (See perlunitut +for more information about the relationship between UTF\-8 and perl's +encoding, utf8. The difference isn't important for this discussion.) +.PP +No matter how you look at it, Unicode support is going to be a pain in a +regex engine. Tricks that might be fine when you have 256 possible +characters often won't scale to handle the size of the UTF\-8 character +set. Things you can take for granted with ASCII may not be true with +Unicode. For instance, in ASCII, it is safe to assume that +\&\f(CW\*(C`sizeof(char1) == sizeof(char2)\*(C'\fR, but in UTF\-8 it isn't. Unicode case folding is +vastly more complex than the simple rules of ASCII, and even when not +using Unicode but only localised single byte encodings, things can get +tricky (for example, \fBLATIN SMALL LETTER SHARP S\fR (U+00DF, ß) +should match 'SS' in localised case-insensitive matching). +.PP +Making things worse is that UTF\-8 support was a later addition to the +regex engine (as it was to perl) and this necessarily made things a lot +more complicated. Obviously it is easier to design a regex engine with +Unicode support in mind from the beginning than it is to retrofit it to +one that wasn't. +.PP +Nearly all regops that involve looking at the input string have +two cases, one for UTF\-8, and one not. In fact, it's often more complex +than that, as the pattern may be UTF\-8 as well. +.PP +Care must be taken when making changes to make sure that you handle +UTF\-8 properly, both at compile time and at execution time, including +when the string and pattern are mismatched. +.SS "Base Structures" +.IX Subsection "Base Structures" +The \f(CW\*(C`regexp\*(C'\fR structure described in perlreapi is common to all +regex engines. Two of its fields are intended for the private use +of the regex engine that compiled the pattern. These are the +\&\f(CW\*(C`intflags\*(C'\fR and pprivate members. The \f(CW\*(C`pprivate\*(C'\fR is a void pointer to +an arbitrary structure whose use and management is the responsibility +of the compiling engine. perl will never modify either of these +values. In the case of the stock engine the structure pointed to by +\&\f(CW\*(C`pprivate\*(C'\fR is called \f(CW\*(C`regexp_internal\*(C'\fR. +.PP +Its \f(CW\*(C`pprivate\*(C'\fR and \f(CW\*(C`intflags\*(C'\fR fields contain data +specific to each engine. +.PP +There are two structures used to store a compiled regular expression. +One, the \f(CW\*(C`regexp\*(C'\fR structure described in perlreapi is populated by +the engine currently being used and some of its fields read by perl to +implement things such as the stringification of \f(CW\*(C`qr//\*(C'\fR. +.PP +The other structure is pointed to by the \f(CW\*(C`regexp\*(C'\fR struct's +\&\f(CW\*(C`pprivate\*(C'\fR and is in addition to \f(CW\*(C`intflags\*(C'\fR in the same struct +considered to be the property of the regex engine which compiled the +regular expression; +.PP +The regexp structure contains all the data that perl needs to be aware of +to properly work with the regular expression. It includes data about +optimisations that perl can use to determine if the regex engine should +really be used, and various other control info that is needed to properly +execute patterns in various contexts such as is the pattern anchored in +some way, or what flags were used during the compile, or whether the +program contains special constructs that perl needs to be aware of. +.PP +In addition it contains two fields that are intended for the private use +of the regex engine that compiled the pattern. These are the \f(CW\*(C`intflags\*(C'\fR +and pprivate members. The \f(CW\*(C`pprivate\*(C'\fR is a void pointer to an arbitrary +structure whose use and management is the responsibility of the compiling +engine. perl will never modify either of these values. +.PP +As mentioned earlier, in the case of the default engines, the \f(CW\*(C`pprivate\*(C'\fR +will be a pointer to a regexp_internal structure which holds the compiled +program and any additional data that is private to the regex engine +implementation. +.PP +\fIPerl's \fR\f(CI\*(C`pprivate\*(C'\fR\fI structure\fR +.IX Subsection "Perl's pprivate structure" +.PP +The following structure is used as the \f(CW\*(C`pprivate\*(C'\fR struct by perl's +regex engine. Since it is specific to perl it is only of curiosity +value to other engine implementations. +.PP +.Vb 8 +\& typedef struct regexp_internal { +\& regnode *regstclass; +\& struct reg_data *data; +\& struct reg_code_blocks *code_blocks; +\& U32 proglen; +\& U32 name_list_idx; +\& regnode program[1]; +\& } regexp_internal; +.Ve +.PP +Description of the attributes is as follows: +.ie n .IP """regstclass""" 5 +.el .IP \f(CWregstclass\fR 5 +.IX Item "regstclass" +Special regop that is used by \f(CWre_intuit_start()\fR to check if a pattern +can match at a certain position. For instance if the regex engine knows +that the pattern must start with a 'Z' then it can scan the string until +it finds one and then launch the regex engine from there. The routine +that handles this is called \f(CWfind_by_class()\fR. Sometimes this field +points at a regop embedded in the program, and sometimes it points at +an independent synthetic regop that has been constructed by the optimiser. +.ie n .IP """data""" 5 +.el .IP \f(CWdata\fR 5 +.IX Item "data" +This field points at a \f(CW\*(C`reg_data\*(C'\fR structure, which is defined as follows +.Sp +.Vb 5 +\& struct reg_data { +\& U32 count; +\& U8 *what; +\& void* data[1]; +\& }; +.Ve +.Sp +This structure is used for handling data structures that the regex engine +needs to handle specially during a clone or free operation on the compiled +product. Each element in the data array has a corresponding element in the +what array. During compilation regops that need special structures stored +will add an element to each array using the \fBadd_data()\fR routine and then store +the index in the regop. +.Sp +In modern perls the 0th element of this structure is reserved and is NEVER +used to store anything of use. This is to allow things that need to index +into this array to represent "no value". +.ie n .IP """code_blocks""" 5 +.el .IP \f(CWcode_blocks\fR 5 +.IX Item "code_blocks" +This optional structure is used to manage \f(CW\*(C`(?{})\*(C'\fR constructs in the +pattern. It is made up of the following structures. +.Sp +.Vb 7 +\& /* record the position of a (?{...}) within a pattern */ +\& struct reg_code_block { +\& STRLEN start; +\& STRLEN end; +\& OP *block; +\& REGEXP *src_regex; +\& }; +\& +\& /* array of reg_code_block\*(Aqs plus header info */ +\& struct reg_code_blocks { +\& int refcnt; /* we may be pointed to from a regex +\& and from the savestack */ +\& int count; /* how many code blocks */ +\& struct reg_code_block *cb; /* array of reg_code_block\*(Aqs */ +\& }; +.Ve +.ie n .IP """proglen""" 5 +.el .IP \f(CWproglen\fR 5 +.IX Item "proglen" +Stores the length of the compiled program in units of regops. +.ie n .IP """name_list_idx""" 5 +.el .IP \f(CWname_list_idx\fR 5 +.IX Item "name_list_idx" +This is the index into the data array where an AV is stored that contains +the names of any named capture buffers in the pattern, should there be +any. This is only used in the debugging version of the regex engine and +when RXp_PAREN_NAMES(prog) is true. It will be 0 if there is no such data. +.ie n .IP """program""" 5 +.el .IP \f(CWprogram\fR 5 +.IX Item "program" +Compiled program. Inlined into the structure so the entire struct can be +treated as a single blob. +.SH "SEE ALSO" +.IX Header "SEE ALSO" +perlreapi +.PP +perlre +.PP +perlunitut +.SH AUTHOR +.IX Header "AUTHOR" +by Yves Orton, 2006. +.PP +With excerpts from Perl, and contributions and suggestions from +Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, +Stephen McCamant, and David Landgren. +.PP +Now maintained by Perl 5 Porters. +.SH LICENCE +.IX Header "LICENCE" +Same terms as Perl. +.SH REFERENCES +.IX Header "REFERENCES" +[1] <https://perl.plover.com/Rx/paper/> +.PP +[2] <https://www.unicode.org/> |