summaryrefslogtreecommitdiffstats
path: root/src/backend/regex/README
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 12:15:05 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 12:15:05 +0000
commit46651ce6fe013220ed397add242004d764fc0153 (patch)
tree6e5299f990f88e60174a1d3ae6e48eedd2688b2b /src/backend/regex/README
parentInitial commit. (diff)
downloadpostgresql-14-upstream.tar.xz
postgresql-14-upstream.zip
Adding upstream version 14.5.upstream/14.5upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/backend/regex/README')
-rw-r--r--src/backend/regex/README440
1 files changed, 440 insertions, 0 deletions
diff --git a/src/backend/regex/README b/src/backend/regex/README
new file mode 100644
index 0000000..e4b0836
--- /dev/null
+++ b/src/backend/regex/README
@@ -0,0 +1,440 @@
+Implementation notes about Henry Spencer's regex library
+========================================================
+
+If Henry ever had any internals documentation, he didn't publish it.
+So this file is an attempt to reverse-engineer some docs.
+
+General source-file layout
+--------------------------
+
+There are six separately-compilable source files, five of which expose
+exactly one exported function apiece:
+ regcomp.c: pg_regcomp
+ regexec.c: pg_regexec
+ regerror.c: pg_regerror
+ regfree.c: pg_regfree
+ regprefix.c: pg_regprefix
+(The pg_ prefixes were added by the Postgres project to distinguish this
+library version from any similar one that might be present on a particular
+system. They'd need to be removed or replaced in any standalone version
+of the library.)
+
+The sixth file, regexport.c, exposes multiple functions that allow extraction
+of info about a compiled regex (see regexport.h).
+
+There are additional source files regc_*.c that are #include'd in regcomp,
+and similarly additional source files rege_*.c that are #include'd in
+regexec. This was done to avoid exposing internal symbols globally;
+all functions not meant to be part of the library API are static.
+
+(Actually the above is a lie in one respect: there are two more global
+symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp. These are
+not meant to be part of the API, but they have to be global because both
+regcomp and regexec call them. It'd be better to get rid of
+pg_set_regex_collation, as well as the static variables it sets, in favor of
+keeping the needed locale state in the regex structs. We have not done this
+yet for lack of a design for how to add application-specific state to the
+structs.)
+
+What's where in src/backend/regex/:
+
+regcomp.c Top-level regex compilation code
+regc_color.c Color map management
+regc_cvec.c Character vector (cvec) management
+regc_lex.c Lexer
+regc_nfa.c NFA handling
+regc_locale.c Application-specific locale code from Tcl project
+regc_pg_locale.c Postgres-added application-specific locale code
+regexec.c Top-level regex execution code
+rege_dfa.c DFA creation and execution
+regerror.c pg_regerror: generate text for a regex error code
+regfree.c pg_regfree: API to free a no-longer-needed regex_t
+regexport.c Functions for extracting info from a regex_t
+regprefix.c Code for extracting a common prefix from a regex_t
+
+The locale-specific code is concerned primarily with case-folding and with
+expanding locale-specific character classes, such as [[:alnum:]]. It
+really needs refactoring if this is ever to become a standalone library.
+
+The header files for the library are in src/include/regex/:
+
+regcustom.h Customizes library for particular application
+regerrs.h Error message list
+regex.h Exported API
+regexport.h Exported API for regexport.c
+regguts.h Internals declarations
+
+
+DFAs, NFAs, and all that
+------------------------
+
+This library is a hybrid DFA/NFA regex implementation. (If you've never
+heard either of those terms, get thee to a first-year comp sci textbook.)
+It might not be clear at first glance what that really means and how it
+relates to what you'll see in the code. Here's what really happens:
+
+* Initial parsing of a regex generates an NFA representation, with number
+of states approximately proportional to the length of the regexp.
+
+* The NFA is then optimized into a "compact NFA" representation, which is
+basically the same idea but without fields that are not going to be needed
+at runtime. It is simplified too: the compact format only allows "plain"
+and "LACON" arc types. The cNFA representation is what is passed from
+regcomp to regexec.
+
+* Unlike traditional NFA-based regex engines, we do not execute directly
+from the NFA representation, as that would require backtracking and so be
+very slow in some cases. Rather, we execute a DFA, which ideally can
+process an input string in linear time (O(M) for M characters of input)
+without backtracking. Each state of the DFA corresponds to a set of
+states of the NFA, that is all the states that the NFA might have been in
+upon reaching the current point in the input string. Therefore, an NFA
+with N states might require as many as 2^N states in the corresponding
+DFA, which could easily require unreasonable amounts of memory. We deal
+with this by materializing states of the DFA lazily (only when needed) and
+keeping them in a limited-size cache. The possible need to build the same
+state of the DFA repeatedly makes this approach not truly O(M) time, but
+in the worst case as much as O(M*N). That's still far better than the
+worst case for a backtracking NFA engine.
+
+If that were the end of it, we'd just say this is a DFA engine, with the
+use of NFAs being merely an implementation detail. However, a DFA engine
+cannot handle some important regex features such as capturing parens and
+back-references. If the parser finds that a regex uses these features
+(collectively called "messy cases" in the code), then we have to use
+NFA-style backtracking search after all.
+
+When using the NFA mode, the representation constructed by the parser
+consists of a tree of sub-expressions ("subre"s). Leaf tree nodes are
+either plain regular expressions (which are executed as DFAs in the manner
+described above) or back-references (which try to match the input to some
+previous substring). Non-leaf nodes are capture nodes (which save the
+location of the substring currently matching their child node),
+concatenation, alternation, or iteration nodes. At execution time, the
+executor recursively scans the tree. At concatenation, alternation, or
+iteration nodes, it considers each possible alternative way of matching the
+input string, that is each place where the string could be split for a
+concatenation or iteration, or each child node for an alternation. It
+tries the next alternative if the match fails according to the child nodes.
+This is exactly the sort of backtracking search done by a traditional NFA
+regex engine. If there are many tree levels it can get very slow.
+
+But all is not lost: we can still be smarter than the average pure NFA
+engine. To do this, each subre node has an associated DFA, which
+represents what the node could possibly match insofar as a mathematically
+pure regex can describe that, which basically means "no backrefs".
+Before we perform any search of possible alternative sub-matches, we run
+the DFA to see if it thinks the proposed substring could possibly match.
+If not, we can reject the match immediately without iterating through many
+possibilities.
+
+As an example, consider the regex "(a[bc]+)\1". The compiled
+representation will have a top-level concatenation subre node. Its first
+child is a plain DFA node for "a[bc]+" (which is marked as being a capture
+node). The concatenation's second child is a backref node for \1.
+The DFA associated with the concatenation node will be "a[bc]+a[bc]+",
+where the backref has been replaced by a copy of the DFA for its referent
+expression. When executed, the concatenation node will have to search for
+a possible division of the input string that allows its two child nodes to
+each match their part of the string (and although this specific case can
+only succeed when the division is at the middle, the code does not know
+that, nor would it be true in general). However, we can first run the DFA
+and quickly reject any input that doesn't start with an "a" and contain
+one more "a" plus some number of b's and c's. If the DFA doesn't match,
+there is no need to recurse to the two child nodes for each possible
+string division point. In many cases, this prefiltering makes the search
+run much faster than a pure NFA engine could do. It is this behavior that
+justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's
+library.
+
+It's perhaps worth noting that separate capture subre nodes are a rarity:
+normally, we just mark a subre as capturing and that's it. However, it's
+legal to write a regex like "((x))" in which the same substring has to be
+captured by multiple sets of parentheses. Since a subre has room for only
+one "capno" field, a single subre can't handle that. We handle such cases
+by wrapping the base subre (which captures the innermost parens) in a
+no-op capture node, or even more than one for "(((x)))" etc. This is a
+little bit inefficient because we end up with multiple identical NFAs,
+but since the case is pointless and infrequent, it's not worth working
+harder.
+
+
+Colors and colormapping
+-----------------------
+
+In many common regex patterns, there are large numbers of characters that
+can be treated alike by the execution engine. A simple example is the
+pattern "[[:alpha:]][[:alnum:]]*" for an identifier. Basically the engine
+only needs to care whether an input symbol is a letter, a digit, or other.
+We could build the NFA or DFA with a separate arc for each possible letter
+and digit, but that's very wasteful of space and not so cheap to execute
+either, especially when dealing with Unicode which can have thousands of
+letters. Instead, the parser builds a "color map" that maps each possible
+input symbol to a "color", or equivalence class. The NFA or DFA
+representation then has arcs labeled with colors, not specific input
+symbols. At execution, the first thing the executor does with each input
+symbol is to look up its color in the color map, and then everything else
+works from the color only.
+
+To build the colormap, we start by assigning every possible input symbol
+the color WHITE, which means "other" (that is, at the end of parsing, the
+symbols that are still WHITE are those not explicitly referenced anywhere
+in the regex). When we see a simple literal character or a bracket
+expression in the regex, we want to assign that character, or all the
+characters represented by the bracket expression, a unique new color that
+can be used to label the NFA arc corresponding to the state transition for
+matching this character or bracket expression. The basic idea is:
+first, change the color assigned to a character to some new value;
+second, run through all the existing arcs in the partially-built NFA,
+and for each one referencing the character's old color, add a parallel
+arc referencing its new color (this keeps the reassignment from changing
+the semantics of what we already built); and third, add a new arc with
+the character's new color to the current pair of NFA states, denoting
+that seeing this character allows the state transition to be made.
+
+This is complicated a bit by not wanting to create more colors
+(equivalence classes) than absolutely necessary. In particular, if a
+bracket expression mentions two characters that had the same color before,
+they should still share the same color after we process the bracket, since
+there is still not a need to distinguish them. But we do need to
+distinguish them from other characters that previously had the same color
+yet are not listed in the bracket expression. To mechanize this, the code
+has a concept of "parent colors" and "subcolors", where a color's subcolor
+is the new color that we are giving to any characters of that color while
+parsing the current atom. (The word "parent" is a bit unfortunate here,
+because it suggests a long-lived relationship, but a subcolor link really
+only lasts for the duration of parsing a single atom.) In other words,
+a subcolor link means that we are in process of splitting the parent color
+into two colors (equivalence classes), depending on whether or not each
+member character should be included by the current regex atom.
+
+As an example, suppose we have the regex "a\d\wx". Initially all possible
+character codes are labeled WHITE (color 0). To parse the atom "a", we
+create a new color (1), update "a"'s color map entry to 1, and create an
+arc labeled 1 between the first two states of the NFA. Now we see \d,
+which is really a bracket expression containing the digits "0"-"9".
+First we process "0", which is currently WHITE, so we create a new color
+(2), update "0"'s color map entry to 2, and create an arc labeled 2
+between the second and third states of the NFA. We also mark color WHITE
+as having the subcolor 2, which means that future relabelings of WHITE
+characters should also select 2 as the new color. Thus, when we process
+"1", we won't create a new color but re-use 2. We update "1"'s color map
+entry to 2, and then find that we don't need a new arc because there is
+already one labeled 2 between the second and third states of the NFA.
+Similarly for the other 8 digits, so there will be only one arc labeled 2
+between NFA states 2 and 3 for all members of this bracket expression.
+At completion of processing of the bracket expression, we call okcolors()
+which breaks all the existing parent/subcolor links; there is no longer a
+marker saying that WHITE characters should be relabeled 2. (Note:
+actually, we did the same creation and clearing of a subcolor link for the
+primitive atom "a", but it didn't do anything very interesting.) Now we
+come to the "\w" bracket expression, which for simplicity assume expands
+to just "[a-z0-9]". We process "a", but observe that it is already the
+sole member of its color 1. This means there is no need to subdivide that
+equivalence class more finely, so we do not create any new color. We just
+make an arc labeled 1 between the third and fourth NFA states. Next we
+process "b", which is WHITE and far from the only WHITE character, so we
+create a new color (3), link that as WHITE's subcolor, relabel "b" as
+color 3, and make an arc labeled 3. As we process "c" through "z", each
+is relabeled from WHITE to 3, but no new arc is needed. Now we come to
+"0", which is not the only member of its color 2, so we suppose that a new
+color is needed and create color 4. We link 4 as subcolor of 2, relabel
+"0" as color 4 in the map, and add an arc for color 4. Next "1" through
+"9" are similarly relabeled as color 4, with no additional arcs needed.
+Having finished the bracket expression, we call okcolors(), which breaks
+the subcolor links. okcolors() further observes that we have removed
+every member of color 2 (the previous color of the digit characters).
+Therefore, it runs through the partial NFA built so far and relabels arcs
+labeled 2 to color 4; in particular the arc from NFA state 2 to state 3 is
+relabeled color 4. Then it frees up color 2, since we have no more use
+for that color. We now have an NFA in which transitions for digits are
+consistently labeled with color 4. Last, we come to the atom "x".
+"x" is currently labeled with color 3, and it's not the only member of
+that color, so we realize that we now need to distinguish "x" from other
+letters when we did not before. We create a new color, which might have
+been 5 but instead we recycle the unused color 2. "x" is relabeled 2 in
+the color map and 2 is linked as the subcolor of 3, and we add an arc for
+2 between states 4 and 5 of the NFA. Now we call okcolors(), which breaks
+the subcolor link between colors 3 and 2 and notices that both colors are
+nonempty. Therefore, it also runs through the existing NFA arcs and adds
+an additional arc labeled 2 wherever there is an arc labeled 3; this
+action ensures that characters of color 2 (i.e., "x") will still be
+considered as allowing any transitions they did before. We are now done
+parsing the regex, and we have these final color assignments:
+ color 1: "a"
+ color 2: "x"
+ color 3: other letters
+ color 4: digits
+and the NFA has these arcs:
+ states 1 -> 2 on color 1 (hence, "a" only)
+ states 2 -> 3 on color 4 (digits)
+ states 3 -> 4 on colors 1, 3, 4, and 2 (covering all \w characters)
+ states 4 -> 5 on color 2 ("x" only)
+which can be seen to be a correct representation of the regex.
+
+There is one more complexity, which is how to handle ".", that is a
+match-anything atom. We used to do that by generating a "rainbow"
+of arcs of all live colors between the two NFA states before and after
+the dot. That's expensive in itself when there are lots of colors,
+and it also typically adds lots of follow-on arc-splitting work for the
+color splitting logic. Now we handle this case by generating a single arc
+labeled with the special color RAINBOW, meaning all colors. Such arcs
+never need to be split, so they help keep NFAs small in this common case.
+(Note: this optimization doesn't help in REG_NLSTOP mode, where "." is
+not supposed to match newline. In that case we still handle "." by
+generating an almost-rainbow of all colors except newline's color.)
+
+Given this summary, we can see we need the following operations for
+colors:
+
+* A fast way to look up the current color assignment for any character
+ code. (This is needed during both parsing and execution, while the
+ remaining operations are needed only during parsing.)
+* A way to alter the color assignment for any given character code.
+* We must track the number of characters currently assigned to each
+ color, so that we can detect empty and singleton colors.
+* We must track all existing NFA arcs of a given color, so that we
+ can relabel them at need, or add parallel arcs of a new color when
+ an existing color has to be subdivided.
+
+The last two of these are handled with the "struct colordesc" array and
+the "colorchain" links in NFA arc structs.
+
+Ideally, we'd do the first two operations using a simple linear array
+storing the current color assignment for each character code.
+Unfortunately, that's not terribly workable for large charsets such as
+Unicode. Our solution is to divide the color map into two parts. A simple
+linear array is used for character codes up to MAX_SIMPLE_CHR, which can be
+chosen large enough to include all popular characters (so that the
+significantly-slower code paths about to be described are seldom invoked).
+Characters above that need be considered at compile time only if they
+appear explicitly in the regex pattern. We store each such mentioned
+character or character range as an entry in the "colormaprange" array in
+the colormap. (Overlapping ranges are split into unique subranges, so that
+each range in the finished list needs only a single color that describes
+all its characters.) When mapping a character above MAX_SIMPLE_CHR to a
+color at runtime, we search this list of ranges explicitly.
+
+That's still not quite enough, though, because of locale-dependent
+character classes such as [[:alpha:]]. In Unicode locales these classes
+may have thousands of entries that are above MAX_SIMPLE_CHR, and we
+certainly don't want to be searching large colormaprange arrays at runtime.
+Nor do we even want to spend the time to initialize cvec structures that
+exhaustively describe all of those characters. Our solution is to compute
+exact per-character colors at regex compile time only up to MAX_SIMPLE_CHR.
+For characters above that, we apply the <ctype.h> or <wctype.h> lookup
+functions at runtime for each locale-dependent character class used in the
+regex pattern, constructing a bitmap that describes which classes the
+runtime character belongs to. The per-character-range data structure
+mentioned above actually holds, for each range, a separate color entry
+for each possible combination of character class properties. That is,
+the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
+whose rows correspond to high characters or character ranges that are
+explicitly mentioned in the regex pattern, and whose columns correspond
+to sets of the locale-dependent character classes that are used in the
+regex.
+
+As an example, given the pattern '\w\u1234[\U0001D100-\U0001D1FF]'
+(and supposing that MAX_SIMPLE_CHR is less than 0x1234), we will need
+a high color map with three rows. One row is for the single character
+U+1234 (represented as a single-element range), one is for the range
+U+1D100..U+1D1FF, and the other row represents all remaining high
+characters. The color map has two columns, one for characters that
+satisfy iswalnum() and one for those that don't.
+
+We build this color map in parallel with scanning the regex. Each time
+we detect a new explicit high character (or range) or a locale-dependent
+character class, we split existing entry(s) in the high color map so that
+characters we need to be able to distinguish will have distinct entries
+that can be given separate colors. Often, though, single entries in the
+high color map will represent very large sets of characters.
+
+If there are both explicit high characters/ranges and locale-dependent
+character classes, we may have entries in the high color map array that
+have non-WHITE colors but don't actually represent any real characters.
+(For example, in a row representing a singleton range, only one of the
+columns could possibly be a live entry; it's the one matching the actual
+locale properties for that single character.) We don't currently make
+any effort to reclaim such colors. In principle it could be done, but
+it's not clear that it's worth the trouble.
+
+
+Detailed semantics of an NFA
+----------------------------
+
+When trying to read dumped-out NFAs, it's helpful to know these facts:
+
+State 0 (additionally marked with "@" in dumpnfa's output) is always the
+goal state, and state 1 (additionally marked with ">") is the start state.
+(The code refers to these as the post state and pre state respectively.)
+
+The possible arc types are:
+
+ PLAIN arcs, which specify matching of any character of a given "color"
+ (see above). These are dumped as "[color_number]->to_state".
+ In addition there can be "rainbow" PLAIN arcs, which are dumped as
+ "[*]->to_state".
+
+ EMPTY arcs, which specify a no-op transition to another state. These
+ are dumped as "->to_state".
+
+ AHEAD constraints, which represent a "next character must be of this
+ color" constraint. AHEAD differs from a PLAIN arc in that the input
+ character is not consumed when crossing the arc. These are dumped as
+ ">color_number>->to_state", or possibly ">*>->to_state".
+
+ BEHIND constraints, which represent a "previous character must be of
+ this color" constraint, which likewise consumes no input. These are
+ dumped as "<color_number<->to_state", or possibly "<*<->to_state".
+
+ '^' arcs, which specify a beginning-of-input constraint. These are
+ dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
+ beginning-of-line constraints respectively.
+
+ '$' arcs, which specify an end-of-input constraint. These are dumped
+ as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line
+ constraints respectively.
+
+ LACON constraints, which represent "(?=re)", "(?!re)", "(?<=re)", and
+ "(?<!re)" constraints, i.e. the input starting/ending at this point must
+ match (or not match) a given sub-RE, but the matching input is not
+ consumed. These are dumped as ":subtree_number:->to_state".
+
+If you see anything else (especially any question marks) in the display of
+an arc, it's dumpnfa() trying to tell you that there's something fishy
+about the arc; see the source code.
+
+The regex executor can only handle PLAIN and LACON transitions. The regex
+optimize() function is responsible for transforming the parser's output
+to get rid of all the other arc types. In particular, ^ and $ arcs that
+are not dropped as impossible will always end up adjacent to the pre or
+post state respectively, and then will be converted into PLAIN arcs that
+mention the special "colors" for BOS, BOL, EOS, or EOL.
+
+To decide whether a thus-transformed NFA matches a given substring of the
+input string, the executor essentially follows these rules:
+1. Start the NFA "looking at" the character *before* the given substring,
+or if the substring is at the start of the input, prepend an imaginary BOS
+character instead.
+2. Run the NFA until it has consumed the character *after* the given
+substring, or an imaginary following EOS character if the substring is at
+the end of the input.
+3. If the NFA is (or can be) in the goal state at this point, it matches.
+
+This definition is necessary to support regexes that begin or end with
+constraints such as \m and \M, which imply requirements on the adjacent
+character if any. The executor implements that by checking if the
+adjacent character (or BOS/BOL/EOS/EOL pseudo-character) is of the
+right color, and it does that in the same loop that checks characters
+within the match.
+
+So one can mentally execute an untransformed NFA by taking ^ and $ as
+ordinary constraints that match at start and end of input; but plain
+arcs out of the start state should be taken as matches for the character
+before the target substring, and similarly, plain arcs leading to the
+post state are matches for the character after the target substring.
+After the optimize() transformation, there are explicit arcs mentioning
+BOS/BOL/EOS/EOL adjacent to the pre-state and post-state. So a finished
+NFA for a pattern without anchors or adjacent-character constraints will
+have pre-state outarcs for RAINBOW (all possible character colors) as well
+as BOS and BOL, and likewise post-state inarcs for RAINBOW, EOS, and EOL.