This document describes the internal design of this crate, which is an object
lesson in what happens when you take a fairly simple old algorithm like
Aho-Corasick and make it fast and production ready.

The target audience of this document is Rust programmers who have some
familiarity with string searching. However, one does not need to know the
Aho-Corasick algorithm in order to read this (it is explained below). One
should, however, know what a trie is. (If you don't, go read its Wikipedia
article.)

The centerpiece of this crate is an implementation of Aho-Corasick. On its
own, Aho-Corasick isn't that complicated. The complex pieces come from the
different variants of Aho-Corasick implemented in this crate. Specifically,
they are:

* Aho-Corasick as an NFA, using dense transitions near the root with sparse
  transitions elsewhere.
* Aho-Corasick as a DFA. (An NFA is slower to search, but cheaper to construct
  and uses less memory.)
  * A DFA with premultiplied state identifiers. This saves a multiplication
    instruction in the core search loop.
  * A DFA with equivalence classes of bytes as the alphabet, instead of the
    traditional 256-byte alphabet. This shrinks the size of the DFA in memory,
    but adds an extra lookup in the core search loop to map the input byte to
    an equivalence class.
* The option to choose how state identifiers are represented, via one of
  u8, u16, u32, u64 or usize. This permits creating compact automatons when
  matching a small number of patterns.
* Supporting "standard" match semantics, along with its overlapping variant,
  in addition to leftmost-first and leftmost-longest semantics. The "standard"
  semantics are typically what you see in a textbook description of
  Aho-Corasick. However, Aho-Corasick is also useful as an optimization in
  regex engines, which often use leftmost-first or leftmost-longest semantics.
  Thus, it is useful to implement those semantics here. The "standard" and
  "leftmost" search algorithms are subtly different, and also require slightly
  different construction algorithms.
* Support for ASCII case insensitive matching.
* Support for accelerating searches when the patterns all start with a small
  number of fixed bytes. Or alternatively, when the patterns all contain a
  small number of rare bytes. (Searching for these bytes uses SIMD vectorized
  code courtesy of `memchr`.)
* Transparent support for alternative SIMD vectorized search routines for a
  small number of literals, such as the Teddy algorithm. We call these
  "packed" search routines because they use SIMD. They can often be an order
  of magnitude faster than just Aho-Corasick, but don't scale as well.
* Support for searching streams. This can reuse most of the underlying code,
  but does require careful buffering support.
* Support for anchored searches, which permit efficient `is_prefix` checks for
  a large number of patterns.

When you combine all of this together along with trying to make everything as
fast as possible, what you end up with is entirely too much code with too much
`unsafe`. Alas, I was not smart enough to figure out how to reduce it.
Instead, we will explain it.


# Basics

The fundamental problem this crate is trying to solve is to determine the
occurrences of possibly many patterns in a haystack. The naive way to solve
this is to look for a match for each pattern at each position in the haystack:

    for i in 0..haystack.len() {
        for p in patterns.iter() {
            if haystack[i..].starts_with(p.bytes()) {
                return Match(p.id(), i, i + p.bytes().len());
            }
        }
    }

Those few lines are effectively all this crate does. The problem with them is
that they are very slow, especially when you're searching for a large number
of patterns.

While there are many different algorithms available to solve this, a popular
one is Aho-Corasick. It's a common solution because it's not too hard to
implement, scales quite well even when searching for thousands of patterns and
is generally pretty fast. Aho-Corasick does well here because, regardless of
the number of patterns you're searching for, it always visits each byte in the
haystack exactly once. This means, generally speaking, adding more patterns to
an Aho-Corasick automaton does not make it slower. (Strictly speaking, however,
this is not true, since a larger automaton will make less effective use of the
CPU's cache.)

Aho-Corasick can be succinctly described as a trie with state transitions
between some of the nodes that efficiently instruct the search algorithm to
try matching alternative keys in the automaton. The trick is that these state
transitions are arranged such that each byte of input needs to be inspected
only once. These state transitions are typically called "failure transitions,"
because they instruct the searcher (the thing traversing the automaton while
reading from the haystack) what to do when a byte in the haystack does not
correspond to a valid transition in the current state of the trie.

More formally, a failure transition points to a state in the automaton that may
lead to a match whose prefix is a proper suffix of the path traversed through
the trie so far. (If no such proper suffix exists, then the failure transition
points back to the start state of the trie, effectively restarting the search.)
This is perhaps simpler to explain pictorially. For example, let's say we built
an Aho-Corasick automaton with the following patterns: 'abcd' and 'cef'. The
trie looks like this:

           a - S1 - b - S2 - c - S3 - d - S4*
          /
        S0 - c - S5 - e - S6 - f - S7*

where states marked with a `*` are match states (meaning, the search algorithm
should stop and report a match to the caller).

So given this trie, it should be somewhat straightforward to see how it can
be used to determine whether any particular haystack *starts* with either
`abcd` or `cef`. It's easy to express this in code:

    fn has_prefix(trie: &Trie, haystack: &[u8]) -> bool {
        let mut state_id = trie.start();
        // If the empty pattern is in the trie, then state_id is a match
        // state.
        if trie.is_match(state_id) {
            return true;
        }
        for &b in haystack.iter() {
            state_id = match trie.next_state(state_id, b) {
                Some(id) => id,
                // If there was no transition for this state and byte, then
                // we know the haystack does not start with one of the
                // patterns in our trie.
                None => return false,
            };
            if trie.is_match(state_id) {
                return true;
            }
        }
        false
    }

And that's pretty much it. All we do is move through the trie starting with the
bytes at the beginning of the haystack. If we find ourselves in a position
where we can't move, or if we've looked through the entire haystack without
seeing a match state, then we know the haystack does not start with any of the
patterns in the trie.
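
The `Trie` API used above is deliberately abstract. For concreteness, here is
a minimal sketch of one way to implement it. To be clear, this sketch is
invented for this document; the crate's actual representation (see
`src/nfa.rs`) is considerably more involved, using dense transitions near the
root as mentioned in the introduction:

    /// A state identifier is an index into the list of states.
    type StateID = usize;

    struct State {
        /// Sparse transitions: a byte and the state it leads to.
        transitions: Vec<(u8, StateID)>,
        /// Whether some pattern ends at this state.
        is_match: bool,
    }

    struct Trie {
        /// states[0] is always the start state.
        states: Vec<State>,
    }

    impl Trie {
        fn new(patterns: &[&[u8]]) -> Trie {
            let empty = || State { transitions: vec![], is_match: false };
            let mut trie = Trie { states: vec![empty()] };
            for p in patterns {
                let mut id = trie.start();
                for &b in p.iter() {
                    id = match trie.next_state(id, b) {
                        Some(next) => next,
                        None => {
                            // Add a fresh state and a transition to it.
                            trie.states.push(empty());
                            let next = trie.states.len() - 1;
                            trie.states[id].transitions.push((b, next));
                            next
                        }
                    };
                }
                trie.states[id].is_match = true;
            }
            trie
        }

        fn start(&self) -> StateID {
            0
        }

        fn is_match(&self, id: StateID) -> bool {
            self.states[id].is_match
        }

        fn next_state(&self, id: StateID, b: u8) -> Option<StateID> {
            let ts = &self.states[id].transitions;
            ts.iter().find(|&&(byte, _)| byte == b).map(|&(_, next)| next)
        }
    }

With this representation, the `has_prefix` routine above works as written.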

The meat of the Aho-Corasick algorithm is in how we add failure transitions to
our trie to keep searching efficient. Specifically, it permits us to not only
check whether a haystack *starts* with any one of a number of patterns, but
rather, whether the haystack contains any of a number of patterns *anywhere* in
the haystack.

As mentioned before, a failure transition connects a proper suffix of the path
traversed through the trie so far with a path that leads to a match having that
proper suffix as a prefix. So in our case, for patterns `abcd` and `cef`, with
a haystack `abcef`, we want to transition to state `S5` (from the diagram
above) from `S3` upon seeing that the byte following `c` is not `d`. Namely,
the proper suffix in this example is `c`, which is a prefix of `cef`. So the
modified diagram looks like this:

            a - S1 - b - S2 - c - S3 - d - S4*
           /                      /
          /        ----------------
         /        /
        S0 - c - S5 - e - S6 - f - S7*

One thing that isn't shown in this diagram is that *all* states have a failure
transition, but only `S3` has a *non-trivial* failure transition. That is, all
other states have a failure transition back to the start state. So if our
haystack was `abzabcd`, then the searcher would transition back to `S0` after
seeing `z`, which effectively restarts the search. (Because there is no pattern
in our trie that has a prefix of `bz` or `z`.)

The code for traversing this *automaton* or *finite state machine* (it is no
longer just a trie) is not that much different from the `has_prefix` code
above:

    fn contains(fsm: &FiniteStateMachine, haystack: &[u8]) -> bool {
        let mut state_id = fsm.start();
        // If the empty pattern is in the fsm, then state_id is a match state.
        if fsm.is_match(state_id) {
            return true;
        }
        for &b in haystack.iter() {
            // While the diagram above doesn't show this, we may wind up
            // needing to follow multiple failure transitions before we land
            // on a state in which we can advance. Therefore, when searching
            // for the next state, we need to loop until we don't see a
            // failure transition.
            //
            // This loop terminates because following a failure transition
            // always moves us to a state closer to the start state, and the
            // start state has a transition defined for every byte: each byte
            // either points to another state or loops back to the start
            // state.
            loop {
                match fsm.next_state(state_id, b) {
                    Some(id) => {
                        state_id = id;
                        break;
                    }
                    // Unlike our code above, if there was no transition for
                    // this state, then we don't quit. Instead, we look up
                    // this state's failure transition and follow that.
                    None => {
                        state_id = fsm.next_fail_state(state_id);
                    }
                }
            }
            if fsm.is_match(state_id) {
                return true;
            }
        }
        false
    }

Other than the complication around traversing failure transitions, this code
is still roughly "traverse the automaton with bytes from the haystack, and quit
when a match is seen."

And that concludes our section on the basics. While we didn't go deep into
how the automaton is built (see `src/nfa.rs`, which has detailed comments about
that), the basic structure of Aho-Corasick should be reasonably clear.
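
While this crate's construction code lives in `src/nfa.rs` (and handles many
more concerns), the classic way of computing failure transitions is simple
enough to sketch here. It is a breadth-first traversal of the trie: a state's
failure transition can be derived from its parent's failure transition,
because any proper suffix of the path to a child is a proper suffix of the
path to its parent extended by one byte. The sketch below extends the toy
`Trie` from above with a `fail: StateID` field on each `State` (initialized to
the start state), and is likewise illustrative rather than the crate's actual
code:

    use std::collections::VecDeque;

    fn fill_failure_transitions(trie: &mut Trie) {
        let start = trie.start();
        let mut queue = VecDeque::new();
        // States at depth 1 can only fail back to the start state.
        for (_, next) in trie.states[start].transitions.clone() {
            trie.states[next].fail = start;
            queue.push_back(next);
        }
        // Visit states in breadth-first order, so that a state's failure
        // transition is always computed before its children's.
        while let Some(id) = queue.pop_front() {
            for (b, next) in trie.states[id].transitions.clone() {
                queue.push_back(next);
                // Walk up failure transitions until we find a state with a
                // transition on `b`, or bottom out at the start state.
                let mut f = trie.states[id].fail;
                let fail = loop {
                    if let Some(nf) = trie.next_state(f, b) {
                        break nf;
                    }
                    if f == start {
                        break start;
                    }
                    f = trie.states[f].fail;
                };
                trie.states[next].fail = fail;
                // A complete implementation would also union match
                // information from `fail` into `next` here, so that a match
                // reachable only via failure transitions is still reported.
            }
        }
    }

Note that this textbook construction adds *all* failure transitions. As
described below in the section on match semantics, this crate deliberately
omits some of them when leftmost-first or leftmost-longest semantics are
requested.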


# NFAs and DFAs

There are generally two types of finite automata: non-deterministic finite
automata (NFA) and deterministic finite automata (DFA). The difference between
them is, principally, that an NFA can be in multiple states at once. This is
typically accomplished by things called _epsilon_ transitions, where one could
move to a new state without consuming any bytes from the input. (The other
mechanism by which NFAs can be in more than one state is where the same byte in
a particular state transitions to multiple distinct states.) In contrast, a DFA
can only ever be in one state at a time. A DFA has no epsilon transitions, and
for any given state, a byte transitions to at most one other state.

By this formulation, the Aho-Corasick automaton described in the previous
section is an NFA. This is because failure transitions are, effectively,
epsilon transitions. That is, whenever the automaton is in state `S`, it is
actually in the set of states that are reachable by recursively following
failure transitions from `S`. (This means that, for example, the start state
is always active, since the start state is reachable via failure transitions
from any state in the automaton.)

NFAs have a lot of nice properties. They tend to be easier to construct, and
also tend to use less memory. However, their primary downside is that they are
typically slower to execute. For example, the code above showing how to search
with an Aho-Corasick automaton needs to potentially iterate through many
failure transitions for every byte of input. While this is a fairly small
amount of overhead, this can add up, especially if the automaton has a lot of
overlapping patterns with a lot of failure transitions.

A DFA's search code, by contrast, looks like this:

    fn contains(dfa: &DFA, haystack: &[u8]) -> bool {
        let mut state_id = dfa.start();
        // If the empty pattern is in the dfa, then state_id is a match state.
        if dfa.is_match(state_id) {
            return true;
        }
        for &b in haystack.iter() {
            // An Aho-Corasick DFA *never* has a missing state that requires
            // failure transitions to be followed. One byte of input advances
            // the automaton by one state. Always.
            state_id = dfa.next_state(state_id, b);
            if dfa.is_match(state_id) {
                return true;
            }
        }
        false
    }

The search logic here is much simpler than for the NFA, and this tends to
translate into significant performance benefits as well, since there's a lot
less work being done for each byte in the haystack. How is this accomplished?
It's done by pre-following all failure transitions for all states for all bytes
in the alphabet, and then building a single state transition table. Building
this DFA can be much more costly than building the NFA, and it can use much
more memory, but the better performance can be worth it.

Users of this crate can actually choose between using an NFA or a DFA. By
default, an NFA is used, because it typically strikes the best balance between
space usage and search performance. But the DFA option is available for cases
where a little extra memory and upfront time building the automaton is okay.
For example, the `AhoCorasick::auto_configure` and
`AhoCorasickBuilder::auto_configure` methods will enable the DFA setting if
there are a small number of patterns.
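
To make "pre-following all failure transitions" concrete, here is what that
computation looks like on the toy representation from the previous sections
(including its `fail` field). As before, this is an illustration invented for
this document, not the crate's implementation:

    /// A dense DFA where `transitions[id * 256 + byte]` is the next state.
    struct DFA {
        transitions: Vec<StateID>,
        is_match: Vec<bool>,
    }

    fn determinize(trie: &Trie) -> DFA {
        let start = trie.start();
        let num_states = trie.states.len();
        let mut dfa = DFA {
            transitions: vec![start; num_states * 256],
            // (A full implementation would propagate match information along
            // failure transitions; we elide that here.)
            is_match: trie.states.iter().map(|s| s.is_match).collect(),
        };
        for id in 0..num_states {
            for byte in 0..=255u8 {
                // Resolve failure transitions now, at build time, so the
                // search loop never has to.
                let mut f = id;
                let next = loop {
                    if let Some(next) = trie.next_state(f, byte) {
                        break next;
                    }
                    if f == start {
                        break start;
                    }
                    f = trie.states[f].fail;
                };
                dfa.transitions[id * 256 + byte as usize] = next;
            }
        }
        dfa
    }

The `num_states * 256` table makes the memory cost plain to see, which is why
the alphabet-shrinking trick described in the next section matters.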


# More DFA tricks

As described in the previous section, one of the downsides of using a DFA
is that it uses more memory and can take longer to build. One small way of
mitigating these concerns is to map the alphabet used by the automaton into
a smaller space. Typically, the alphabet of a DFA has 256 elements in it:
one element for each possible value that fits into a byte. However, in many
cases, one does not need the full alphabet. For example, if all patterns in an
Aho-Corasick automaton are ASCII letters, then this only uses up 52 distinct
bytes. As far as the automaton is concerned, the remaining 204 bytes are
indistinguishable from one another: they will never discriminate between a
match or a non-match. Therefore, in cases like that, the alphabet can be shrunk
to just 53 elements. One for each ASCII letter, and then another to serve as a
placeholder for every other unused byte.

In practice, this library doesn't quite compute the optimal set of equivalence
classes, but it's close enough in most cases. The key idea is that this then
allows the transition table for the DFA to be potentially much smaller. The
downside of doing this, however, is that since the transition table is defined
in terms of this smaller alphabet space, every byte in the haystack must be
re-mapped to this smaller space. This requires an additional 256-byte table.
In practice, this can lead to a small search time hit, but it can be difficult
to measure. Moreover, it can sometimes lead to faster search times for bigger
automata, since it could be the difference between more of the automaton
staying in the CPU cache or not.

One other trick for DFAs employed by this crate is the notion of premultiplying
state identifiers. Specifically, the normal way to compute the next transition
in a DFA is via the following (assuming that the transition table is laid out
sequentially in memory, in row-major order, where the rows are states):

    next_state_id = dfa.transitions[current_state_id * 256 + current_byte]

However, since the value `256` is a fixed constant, we can actually premultiply
the state identifiers in the table when we build the table initially. Then, the
next transition computation simply becomes:

    next_state_id = dfa.transitions[current_state_id + current_byte]

This doesn't seem like much, but when this is being executed for every byte of
input that you're searching, saving that extra multiplication instruction can
add up.

The same optimization works even when equivalence classes are enabled, as
described above. The only difference is that the premultiplication is by the
total number of equivalence classes instead of 256.

There isn't much downside to premultiplying state identifiers, other than the
fact that you may need to choose a bigger integer representation than you would
otherwise. For example, if you don't premultiply state identifiers, then an
automaton that uses `u8` as a state identifier can hold up to 256 states.
However, if they are premultiplied, then it can only hold up to
`floor(256 / len(alphabet))` states. Thus premultiplication impacts how compact
your DFA can be. In practice, it's pretty rare to use `u8` as a state
identifier, so premultiplication is usually a good thing to do.

Both equivalence classes and premultiplication are tunable parameters via the
`AhoCorasickBuilder` type, and both are enabled by default.
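
Putting both tricks together, the core lookup does one extra table read but no
multiplication. The type below is a sketch with invented names, not the
crate's actual representation:

    /// A DFA using both equivalence classes and premultiplication.
    struct CompactDFA {
        /// Maps every possible haystack byte to its equivalence class.
        classes: [u8; 256],
        /// Row-major transition table whose stored state identifiers are
        /// premultiplied by the number of equivalence classes.
        transitions: Vec<usize>,
    }

    impl CompactDFA {
        #[inline]
        fn next_state(&self, premultiplied_id: usize, byte: u8) -> usize {
            // One extra load to shrink the alphabet...
            let class = self.classes[byte as usize] as usize;
            // ...and no multiplication, since identifiers in the table are
            // already premultiplied.
            self.transitions[premultiplied_id + class]
        }
    }

Note that `next_state` both accepts and returns a premultiplied identifier, so
the search loop never needs the un-multiplied form.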


# Match semantics

One of the more interesting things about this implementation of Aho-Corasick
that (as far as this author knows) separates it from other implementations is
that it natively supports leftmost-first and leftmost-longest match semantics.
Briefly, match semantics refer to the decision procedure by which searching
will disambiguate matches when there are multiple to choose from:

* **standard** match semantics emits matches as soon as they are detected by
  the automaton. This is typically equivalent to the textbook non-overlapping
  formulation of Aho-Corasick.
* **leftmost-first** match semantics means that 1) the next match is the match
  starting at the leftmost position and 2) among multiple matches starting at
  the same leftmost position, the match corresponding to the pattern provided
  first by the caller is reported.
* **leftmost-longest** is like leftmost-first, except when there are multiple
  matches starting at the same leftmost position, the pattern corresponding to
  the longest match is returned.

(The crate API documentation discusses these differences, with examples, in
more depth on the `MatchKind` type.)

The reason why supporting these match semantics is important is because it
gives the user more control over the match procedure. For example,
leftmost-first permits users to implement match priority by simply putting the
higher priority patterns first. Leftmost-longest, on the other hand, permits
finding the longest possible match, which might be useful when trying to match
words against a dictionary. Additionally, regex engines often want to use
Aho-Corasick as an optimization when searching for an alternation of literals.
In order to preserve correct match semantics, regex engines typically can't use
the standard textbook definition directly, since regex engines will implement
either leftmost-first (Perl-like) or leftmost-longest (POSIX) match semantics.

Supporting leftmost semantics requires a couple of key changes:

* Constructing the Aho-Corasick automaton changes a bit in both how the trie is
  constructed and how failure transitions are found. Namely, only a subset of
  the failure transitions are added. Specifically, only the failure transitions
  that either do not occur after a match, or do occur after a match but
  preserve that match, are kept. (More details on this can be found in
  `src/nfa.rs`.)
* The search algorithm changes slightly. Since we are looking for the leftmost
  match, we cannot quit as soon as a match is detected. Instead, after a match
  is detected, we must keep searching until either the end of the input or
  until a dead state is seen. (Dead states are not used for standard match
  semantics. A dead state indicates that searching should stop because no
  better match can be found after a match has already been detected.)

Other implementations of Aho-Corasick do support leftmost match semantics, but
they do it with more overhead at search time, or even worse, with a queue of
matches and sophisticated hijinks to disambiguate the matches. While our
construction algorithm becomes a bit more complicated, the correct match
semantics fall out from the structure of the automaton itself.
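
To see the difference in action, here is a small example using this crate's
public API. (The `MatchKind` and `AhoCorasickBuilder` types are real; the
patterns are chosen purely for illustration.)

    use aho_corasick::{AhoCorasickBuilder, MatchKind};

    fn main() {
        let patterns = &["Samwise", "Sam"];
        let haystack = "Samwise";

        // Standard semantics report a match as soon as the automaton sees
        // one, so the shorter pattern `Sam` (pattern 1) wins.
        let ac = AhoCorasickBuilder::new()
            .match_kind(MatchKind::Standard)
            .build(patterns);
        let mat = ac.find(haystack).unwrap();
        assert_eq!((mat.pattern(), mat.start(), mat.end()), (1, 0, 3));

        // Leftmost-first prefers the pattern provided earliest by the
        // caller, so `Samwise` (pattern 0) wins.
        let ac = AhoCorasickBuilder::new()
            .match_kind(MatchKind::LeftmostFirst)
            .build(patterns);
        let mat = ac.find(haystack).unwrap();
        assert_eq!((mat.pattern(), mat.start(), mat.end()), (0, 0, 7));
    }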


# Overlapping matches

One of the nice properties of an Aho-Corasick automaton is that it can report
all possible matches, even when they overlap with one another. In this mode,
the match semantics don't matter, since all possible matches are reported.
Overlapping searches work just like regular searches, except the state
identifier at which the previous search left off is carried over to the next
search, so that it can pick up where it left off. If there are additional
matches at that state, then they are reported before resuming the search.

Enabling leftmost-first or leftmost-longest match semantics causes the
automaton to use a subset of all failure transitions, which means that
overlapping searches cannot be used. Therefore, if leftmost match semantics are
used, attempting to do an overlapping search will panic. Thus, to get
overlapping searches, the caller must use the default standard match semantics.
This behavior was chosen because there are only two alternatives, which were
deemed worse:

* Compile two automatons internally, one for standard semantics and one for
  the semantics requested by the caller (if not standard).
* Create a new type, distinct from the `AhoCorasick` type, which has different
  capabilities based on the configuration options.

The first is untenable because of the amount of memory used by the automaton.
The second increases the complexity of the API too much by adding too many
types that do similar things. It is conceptually much simpler to keep all
searching isolated to a single type. Callers may query whether the automaton
supports overlapping searches via the `AhoCorasick::supports_overlapping`
method.
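
For example, here is an overlapping search over the patterns from the
beginning of this document, plus one more. (This is a sketch;
`find_overlapping_iter` is the relevant method on the `AhoCorasick` type.)

    use aho_corasick::AhoCorasick;

    fn main() {
        // Overlapping searches require the default standard semantics.
        let ac = AhoCorasick::new(&["abcd", "cef", "bc"]);
        let matches: Vec<(usize, usize, usize)> = ac
            .find_overlapping_iter("abcef")
            .map(|m| (m.pattern(), m.start(), m.end()))
            .collect();
        // Both `bc` and `cef` are reported, even though they overlap.
        assert_eq!(matches, vec![(2, 1, 3), (1, 2, 5)]);
    }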


# Stream searching

Since Aho-Corasick is an automaton, it is possible to do partial searches on
pieces of the haystack, and then resume that search on subsequent pieces. This
is useful when the haystack you're trying to search is not stored contiguously
in memory, or if one does not want to read the entire haystack into memory at
once.

Currently, only standard semantics are supported for stream searching. This is
some of the more complicated code in this crate, and is something I would very
much like to improve. In particular, it currently has the restriction that it
must buffer at least enough of the haystack in memory to fit the longest
possible match. The difficulty in getting stream searching right is that the
implementation choices (such as the buffer size) often impact what the API
looks like and what it's allowed to do.


# Prefilters

In some cases, Aho-Corasick is not the fastest way to find matches of multiple
patterns. Sometimes, the search can be accelerated using highly optimized SIMD
routines. For example, consider searching for the following patterns:

    Sherlock
    Moriarty
    Watson

It is plausible that it would be much faster to quickly look for occurrences of
the leading bytes, `S`, `M` or `W`, before trying to start searching via the
automaton. Indeed, this is exactly what this crate will do.

When there are more than three distinct starting bytes, then this crate will
look for three distinct bytes occurring at any position in the patterns, while
preferring bytes that are heuristically determined to be rare over others. For
example:

    Abuzz
    Sanchez
    Vasquez
    Topaz
    Waltz

Here, we have more than three distinct starting bytes, but all of the patterns
contain `z`, which is typically a rare byte. In this case, the prefilter will
scan for `z`, back up a bit, and then execute the Aho-Corasick automaton.

If all of that fails, then a packed multiple substring algorithm will be
attempted. Currently, the only algorithm available for this is Teddy, but more
may be added in the future. Teddy is unlike the above prefilters in that it
confirms its own matches, so when Teddy is active, it might not be necessary
for Aho-Corasick to run at all. (See `Automaton::leftmost_find_at_no_state_imp`
in `src/automaton.rs`.) However, the current Teddy implementation only works
on `x86_64` when SSSE3 or AVX2 are available, and moreover, it only works
_well_ when there are a small number of patterns (say, fewer than 100). Teddy
also requires the haystack to be of a certain length (more than 16-34 bytes).
When the haystack is shorter than that, Rabin-Karp is used instead. (See
`src/packed/rabinkarp.rs`.)

There is a more thorough description of Teddy at
[`src/packed/teddy/README.md`](src/packed/teddy/README.md).
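
To make the byte-based prefilter idea above concrete, here is a sketch of the
candidate-finding step using the `memchr` crate mentioned in the introduction.
The function name is invented for this document, and the confirmation step
(resuming the automaton at each candidate) is elided:

    use memchr::memchr3;

    /// Report the next position at or after `at` where one of the leading
    /// bytes of "Sherlock", "Moriarty" or "Watson" occurs. A real prefilter
    /// would then run the Aho-Corasick automaton starting at the candidate
    /// to confirm (or reject) an actual match.
    fn next_candidate(haystack: &[u8], at: usize) -> Option<usize> {
        memchr3(b'S', b'M', b'W', &haystack[at..]).map(|i| at + i)
    }

    fn main() {
        let haystack = b"the game is afoot, said Sherlock";
        let mut at = 0;
        while let Some(candidate) = next_candidate(haystack, at) {
            println!("candidate at byte offset {}", candidate);
            at = candidate + 1;
        }
    }

For the rare-byte variant, the same kind of scan runs on a byte like `z`, and
the automaton is restarted a bounded number of bytes *before* the candidate,
since the rare byte may occur in the middle of a pattern.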