diff --git a/vendor/aho-corasick/DESIGN.md b/vendor/aho-corasick/DESIGN.md
new file mode 100644
index 000000000..367e203df
--- /dev/null
+++ b/vendor/aho-corasick/DESIGN.md
@@ -0,0 +1,483 @@
+This document describes the internal design of this crate, which is an object
+lesson in what happens when you take a fairly simple old algorithm like
+Aho-Corasick and make it fast and production ready.
+
+The target audience of this document is Rust programmers who have some
+familiarity with string searching. One does not need to know the Aho-Corasick
+algorithm in order to read this (it is explained below). One should, however,
+know what a trie is. (If you don't, go read its Wikipedia article.)
+
+The centerpiece of this crate is an implementation of Aho-Corasick. On its
+own, Aho-Corasick isn't that complicated. The complex pieces come from the
+different variants of Aho-Corasick implemented in this crate. Specifically,
+they are:
+
+* Aho-Corasick as an NFA, using dense transitions near the root with sparse
+ transitions elsewhere.
+* Aho-Corasick as a DFA. (An NFA is slower to search, but cheaper to construct
+ and uses less memory.)
+ * A DFA with pre-multiplied state identifiers. This saves a multiplication
+ instruction in the core search loop.
+ * A DFA with equivalence classes of bytes as the alphabet, instead of the
+ traditional 256-byte alphabet. This shrinks the size of the DFA in memory,
+ but adds an extra lookup in the core search loop to map the input byte to
+    its equivalence class.
+* The option to choose how state identifiers are represented, via one of
+ u8, u16, u32, u64 or usize. This permits creating compact automatons when
+ matching a small number of patterns.
+* Supporting "standard" match semantics, along with its overlapping variant,
+ in addition to leftmost-first and leftmost-longest semantics. The "standard"
+ semantics are typically what you see in a textbook description of
+ Aho-Corasick. However, Aho-Corasick is also useful as an optimization in
+ regex engines, which often use leftmost-first or leftmost-longest semantics.
+ Thus, it is useful to implement those semantics here. The "standard" and
+ "leftmost" search algorithms are subtly different, and also require slightly
+ different construction algorithms.
+* Support for ASCII case insensitive matching.
+* Support for accelerating searches when the patterns all start with a small
+ number of fixed bytes. Or alternatively, when the patterns all contain a
+ small number of rare bytes. (Searching for these bytes uses SIMD vectorized
+ code courtesy of `memchr`.)
+* Transparent support for alternative SIMD vectorized search routines for
+  smaller numbers of literals, such as the Teddy algorithm. We call these
+ "packed" search routines because they use SIMD. They can often be an order of
+ magnitude faster than just Aho-Corasick, but don't scale as well.
+* Support for searching streams. This can reuse most of the underlying code,
+ but does require careful buffering support.
+* Support for anchored searches, which permit efficient `is_prefix` checks for
+ a large number of patterns.
+
+When you combine all of this together along with trying to make everything as
+fast as possible, what you end up with is entirely too much code with too much
+`unsafe`. Alas, I was not smart enough to figure out how to reduce it. Instead,
+we will explain it.
+
+
+# Basics
+
+The fundamental problem this crate is trying to solve is to determine the
+occurrences of possibly many patterns in a haystack. The naive way to solve
+this is to look for a match for each pattern at each position in the haystack:
+
+    for i in 0..haystack.len() {
+        for p in patterns.iter() {
+            if haystack[i..].starts_with(p.bytes()) {
+                return Match(p.id(), i, i + p.bytes().len());
+            }
+        }
+    }
+
+That loop is effectively all this crate does. The problem is that it is very
+slow, especially when you're searching for a large number of patterns.
+
+While there are many different algorithms available to solve this, a popular
+one is Aho-Corasick. It's a common solution because it's not too hard to
+implement, scales quite well even when searching for thousands of patterns and
+is generally pretty fast. Aho-Corasick does well here because, regardless of
+the number of patterns you're searching for, it always visits each byte in the
+haystack exactly once. This means, generally speaking, adding more patterns to
+an Aho-Corasick automaton does not make it slower. (Strictly speaking, however,
+this is not true, since a larger automaton will make less effective use of the
+CPU's cache.)
+
+Aho-Corasick can be succinctly described as a trie with state transitions
+between some of the nodes that efficiently instruct the search algorithm to
+try matching alternative keys in the automaton. The trick is that these state
+transitions are arranged such that each byte of input needs to be inspected
+only once. These state transitions are typically called "failure transitions,"
+because they instruct the searcher (the thing traversing the automaton while
+reading from the haystack) what to do when a byte in the haystack does not
+correspond to a valid transition in the current state of the trie.
+
+More formally, a failure transition points to a state in the automaton that may
+lead to a match whose prefix is a proper suffix of the path traversed through
+the trie so far. (If no such proper suffix exists, then the failure transition
+points back to the start state of the trie, effectively restarting the search.)
+This is perhaps simpler to explain pictorially. For example, let's say we built
+an Aho-Corasick automaton with the following patterns: 'abcd' and 'cef'. The
+trie looks like this:
+
+ a - S1 - b - S2 - c - S3 - d - S4*
+ /
+ S0 - c - S5 - e - S6 - f - S7*
+
+where states marked with a `*` are match states (meaning, the search algorithm
+should stop and report a match to the caller).
+
+So given this trie, it should be somewhat straightforward to see how it can
+be used to determine whether any particular haystack *starts* with either
+`abcd` or `cef`. It's easy to express this in code:
+
+ fn has_prefix(trie: &Trie, haystack: &[u8]) -> bool {
+ let mut state_id = trie.start();
+ // If the empty pattern is in trie, then state_id is a match state.
+ if trie.is_match(state_id) {
+ return true;
+ }
+        for &b in haystack.iter() {
+ state_id = match trie.next_state(state_id, b) {
+ Some(id) => id,
+ // If there was no transition for this state and byte, then we know
+ // the haystack does not start with one of the patterns in our trie.
+ None => return false,
+ };
+ if trie.is_match(state_id) {
+ return true;
+ }
+ }
+ false
+ }
+
+And that's pretty much it. All we do is move through the trie starting with the
+bytes at the beginning of the haystack. If we find ourselves in a position
+where we can't move, or if we've looked through the entire haystack without
+seeing a match state, then we know the haystack does not start with any of the
+patterns in the trie.
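+
+To make the interface assumed by `has_prefix` concrete, here is a minimal
+sketch of a trie with map-based transitions. It is illustrative only: the
+`Trie` type and its representation here are hypothetical and much simpler
+than what this crate actually uses.
+
+    use std::collections::HashMap;
+
+    struct Trie {
+        // transitions[id] maps an input byte to the next state id.
+        transitions: Vec<HashMap<u8, usize>>,
+        // matches[id] is true if reaching state id completes a pattern.
+        matches: Vec<bool>,
+    }
+
+    impl Trie {
+        fn new(patterns: &[&[u8]]) -> Trie {
+            let mut trie = Trie {
+                transitions: vec![HashMap::new()],
+                matches: vec![false],
+            };
+            for p in patterns {
+                let mut state = 0;
+                for &b in p.iter() {
+                    // Reuse an existing transition or allocate a new state.
+                    let fresh = trie.transitions.len();
+                    let next = *trie.transitions[state].entry(b).or_insert(fresh);
+                    if next == fresh {
+                        trie.transitions.push(HashMap::new());
+                        trie.matches.push(false);
+                    }
+                    state = next;
+                }
+                trie.matches[state] = true;
+            }
+            trie
+        }
+
+        fn start(&self) -> usize {
+            0
+        }
+
+        fn is_match(&self, id: usize) -> bool {
+            self.matches[id]
+        }
+
+        fn next_state(&self, id: usize, b: u8) -> Option<usize> {
+            self.transitions[id].get(&b).copied()
+        }
+    }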
+
+The meat of the Aho-Corasick algorithm is in how we add failure transitions to
+our trie to keep searching efficient. Specifically, they permit us to not only
+check whether a haystack *starts* with any one of a number of patterns, but
+also whether the haystack contains any of those patterns *anywhere*.
+
+As mentioned before, a failure transition connects a proper suffix of the path
+traversed through the trie so far with a path that leads to a match whose
+prefix corresponds to that proper suffix. So in our case, for patterns `abcd`
+and `cef`, with a haystack `abcef`, we want to transition to state `S5` (from
+the diagram above) from `S3` upon seeing that the byte following `c` is not
+`d`. Namely, the proper suffix in this example is `c`, which is a prefix of
+`cef`. So the modified diagram looks like this:
+
+
+ a - S1 - b - S2 - c - S3 - d - S4*
+ / /
+ / ----------------
+ / /
+ S0 - c - S5 - e - S6 - f - S7*
+
+One thing that isn't shown in this diagram is that *all* states have a failure
+transition, but only `S3` has a *non-trivial* failure transition. That is, all
+other states have a failure transition back to the start state. So if our
+haystack was `abzabcd`, then the searcher would transition back to `S0` after
+seeing `z`, which effectively restarts the search. (Because there is no pattern
+in our trie that has a prefix of `bz` or `z`.)
+
+The code for traversing this *automaton* or *finite state machine* (it is no
+longer just a trie) is not that much different from the `has_prefix` code
+above:
+
+ fn contains(fsm: &FiniteStateMachine, haystack: &[u8]) -> bool {
+ let mut state_id = fsm.start();
+ // If the empty pattern is in fsm, then state_id is a match state.
+ if fsm.is_match(state_id) {
+ return true;
+ }
+        for &b in haystack.iter() {
+ // While the diagram above doesn't show this, we may wind up needing
+ // to follow multiple failure transitions before we land on a state
+ // in which we can advance. Therefore, when searching for the next
+ // state, we need to loop until we don't see a failure transition.
+ //
+            // This loop terminates because the start state has no missing
+            // transitions: every byte maps either to another state or back
+            // to the start state itself, so a failure at the start state is
+            // impossible.
+ loop {
+ match fsm.next_state(state_id, b) {
+ Some(id) => {
+ state_id = id;
+ break;
+ }
+ // Unlike our code above, if there was no transition for this
+ // state, then we don't quit. Instead, we look for this state's
+ // failure transition and follow that instead.
+ None => {
+ state_id = fsm.next_fail_state(state_id);
+ }
+ };
+ }
+ if fsm.is_match(state_id) {
+ return true;
+ }
+ }
+ false
+ }
+
+Other than the complication around traversing failure transitions, this code
+is still roughly "traverse the automaton with bytes from the haystack, and quit
+when a match is seen."
+
+And that concludes our section on the basics. While we didn't go deep into
+how the automaton is built (see `src/nfa.rs`, which has detailed comments about
+that), the basic structure of Aho-Corasick should be reasonably clear.
+
+
+# NFAs and DFAs
+
+There are generally two types of finite automata: non-deterministic finite
+automata (NFA) and deterministic finite automata (DFA). The difference between
+them is, principally, that an NFA can be in multiple states at once. This is
+typically accomplished by things called _epsilon_ transitions, where one could
+move to a new state without consuming any bytes from the input. (The other
+mechanism by which NFAs can be in more than one state is where the same byte in
+a particular state transitions to multiple distinct states.) In contrast, a DFA
+can only ever be in one state at a time. A DFA has no epsilon transitions, and
+for any given state, a byte transitions to at most one other state.
+
+By this formulation, the Aho-Corasick automaton described in the previous
+section is an NFA. This is because failure transitions are, effectively,
+epsilon transitions. That is, whenever the automaton is in state `S`, it is
+actually in the set of states that are reachable by recursively following
+failure transitions from `S`. (This means that, for example, the start state
+is always active since the start state is reachable via failure transitions
+from any state in the automaton.)
+
+NFAs have a lot of nice properties. They tend to be easier to construct, and
+also tend to use less memory. However, their primary downside is that they are
+typically slower to execute. For example, the code above showing how to search
+with an Aho-Corasick automaton needs to potentially iterate through many
+failure transitions for every byte of input. While this is a fairly small
+amount of overhead, this can add up, especially if the automaton has a lot of
+overlapping patterns with a lot of failure transitions.
+
+A DFA's search code, by contrast, looks like this:
+
+ fn contains(dfa: &DFA, haystack: &[u8]) -> bool {
+ let mut state_id = dfa.start();
+ // If the empty pattern is in dfa, then state_id is a match state.
+ if dfa.is_match(state_id) {
+ return true;
+ }
+        for &b in haystack.iter() {
+ // An Aho-Corasick DFA *never* has a missing state that requires
+ // failure transitions to be followed. One byte of input advances the
+ // automaton by one state. Always.
+            state_id = dfa.next_state(state_id, b);
+            if dfa.is_match(state_id) {
+ return true;
+ }
+ }
+ false
+ }
+
+The search logic here is much simpler than for the NFA, and this tends to
+translate into significant performance benefits as well, since there's a lot
+less work being done for each byte in the haystack. How is this accomplished?
+It's done by pre-following all failure transitions for all states for all bytes
+in the alphabet, and then building a single state transition table. Building
+this DFA can be much more costly than building the NFA, and use much more
+memory, but the better performance can be worth it.
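+
+To make "pre-following" concrete, here is a sketch of how one might derive a
+dense transition table from the NFA interface used earlier. The `state_count`
+method is hypothetical, and a real implementation would choose a more compact
+representation:
+
+    // Chase failure transitions once at build time, so that searching
+    // never has to. The resulting table has one dense row per state.
+    fn build_transition_table(fsm: &FiniteStateMachine) -> Vec<[usize; 256]> {
+        let mut table = vec![[0usize; 256]; fsm.state_count()];
+        for state_id in 0..fsm.state_count() {
+            for b in 0..=255u8 {
+                // This is exactly the inner loop of the NFA search above,
+                // except the answer is recorded instead of recomputed.
+                let mut id = state_id;
+                table[state_id][b as usize] = loop {
+                    match fsm.next_state(id, b) {
+                        Some(next) => break next,
+                        None => id = fsm.next_fail_state(id),
+                    }
+                };
+            }
+        }
+        table
+    }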
+
+Users of this crate can actually choose between using an NFA or a DFA. By
+default, an NFA is used, because it typically strikes the best balance between
+space usage and search performance. But the DFA option is available for cases
+where a little extra memory and upfront time building the automaton is okay.
+For example, the `AhoCorasick::new_auto_configured` constructor and the
+`AhoCorasickBuilder::auto_configure` method will enable the DFA setting if
+there are a small number of patterns.
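+
+For callers that want to force the choice rather than rely on the heuristic,
+a build might look like this sketch (the patterns are illustrative):
+
+    use aho_corasick::AhoCorasickBuilder;
+
+    let patterns = &["Sherlock", "Moriarty", "Watson"];
+    // Opt into the DFA explicitly, trading memory and construction time
+    // for faster searching.
+    let ac = AhoCorasickBuilder::new()
+        .dfa(true)
+        .build(patterns);
+    assert!(ac.is_match("Dr. Watson"));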
+
+
+# More DFA tricks
+
+As described in the previous section, one of the downsides of using a DFA
+is that it uses more memory and can take longer to build. One small way of
+mitigating these concerns is to map the alphabet used by the automaton into
+a smaller space. Typically, the alphabet of a DFA has 256 elements in it:
+one element for each possible value that fits into a byte. However, in many
+cases, one does not need the full alphabet. For example, if all patterns in an
+Aho-Corasick automaton are ASCII letters, then this only uses up 52 distinct
+bytes. As far as the automaton is concerned, the remaining 204 bytes are
+indistinguishable from one another: they will never discriminate between a
+match and a non-match. Therefore, in cases like that, the alphabet can be shrunk
+to just 53 elements. One for each ASCII letter, and then another to serve as a
+placeholder for every other unused byte.
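+
+A crude way to compute such a mapping is sketched below. This is simpler than
+what the crate does (it gives every byte that appears in a pattern its own
+class, rather than grouping bytes that behave identically), but it illustrates
+the idea:
+
+    // Map each byte to an equivalence class. Class 0 is shared by all
+    // bytes that appear in no pattern; every other byte gets its own
+    // class. Assumes at least one byte value goes unused.
+    fn byte_classes(patterns: &[&[u8]]) -> [u8; 256] {
+        let mut used = [false; 256];
+        for p in patterns {
+            for &b in p.iter() {
+                used[b as usize] = true;
+            }
+        }
+        let mut classes = [0u8; 256];
+        let mut next = 0u8;
+        for b in 0..256 {
+            if used[b] {
+                next += 1;
+                classes[b] = next;
+            }
+        }
+        classes
+    }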
+
+In practice, this library doesn't quite compute the optimal set of equivalence
+classes, but it's close enough in most cases. The key idea is that this then
+allows the transition table for the DFA to be potentially much smaller. The
+downside of doing this, however, is that since the transition table is defined
+in terms of this smaller alphabet space, every byte in the haystack must be
+re-mapped to this smaller space. This requires an additional 256-byte table.
+In practice, this can lead to a small search time hit, but it can be difficult
+to measure. Moreover, it can sometimes lead to faster search times for bigger
+automata, since it could be the difference between more of the automaton
+staying in the CPU cache or not.
+
+One other trick for DFAs employed by this crate is the notion of premultiplying
+state identifiers. Specifically, the normal way to compute the next transition
+in a DFA is via the following (assuming that the transition table is laid out
+sequentially in memory, in row-major order, where the rows are states):
+
+ next_state_id = dfa.transitions[current_state_id * 256 + current_byte]
+
+However, since the value `256` is a fixed constant, we can actually premultiply
+the state identifiers in the table when we build the table initially. Then, the
+next transition computation simply becomes:
+
+ next_state_id = dfa.transitions[current_state_id + current_byte]
+
+This doesn't seem like much, but when this is being executed for every byte of
+input that you're searching, saving that extra multiplication instruction can
+add up.
+
+The same optimization works even when equivalence classes are enabled, as
+described above. The only difference is that the premultiplication is by the
+total number of equivalence classes instead of 256.
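+
+Putting both tricks together, the hot loop of a DFA search looks roughly like
+the sketch below. (The flat `transitions` table, the `classes` map and the way
+match states are recorded are all illustrative; the real layout is more
+compact. In particular, a real implementation would avoid the division when
+checking for a match.)
+
+    fn find_end(
+        transitions: &[usize], // row-major; state ids are premultiplied
+        classes: &[u8; 256],   // maps a haystack byte to its class
+        matches: &[bool],      // indexed by *unmultiplied* state id
+        alphabet_len: usize,
+        haystack: &[u8],
+    ) -> Option<usize> {
+        let mut state = 0; // premultiplied start state
+        for (i, &b) in haystack.iter().enumerate() {
+            // One lookup to classify the byte, one to advance the state.
+            // No multiplication, since `state` is already premultiplied.
+            state = transitions[state + classes[b as usize] as usize];
+            if matches[state / alphabet_len] {
+                return Some(i + 1); // offset just past the match
+            }
+        }
+        None
+    }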
+
+There isn't much downside to premultiplying state identifiers, other than the
+fact that you may need to choose a bigger integer representation than you would
+otherwise. For example, if you don't premultiply state identifiers, then an
+automaton that uses `u8` as a state identifier can hold up to 256 states.
+However, if they are premultiplied, then it can only hold up to
+`floor(256 / len(alphabet))` states. Thus premultiplication impacts how compact
+your DFA can be. In practice, it's pretty rare to use `u8` as a state
+identifier, so premultiplication is usually a good thing to do.
+
+Both equivalence classes and premultiplication are tuneable parameters via the
+`AhoCorasickBuilder` type, and both are enabled by default.
+
+
+# Match semantics
+
+One of the more interesting things about this implementation of Aho-Corasick,
+one that (as far as this author knows) separates it from other implementations,
+is that it natively supports leftmost-first and leftmost-longest match
+semantics. Briefly, match semantics refer to the decision procedure by which
+searching will disambiguate matches when there are multiple to choose from:
+
+* **standard** match semantics emits matches as soon as they are detected by
+ the automaton. This is typically equivalent to the textbook non-overlapping
+ formulation of Aho-Corasick.
+* **leftmost-first** match semantics means that 1) the next match is the match
+ starting at the leftmost position and 2) among multiple matches starting at
+ the same leftmost position, the match corresponding to the pattern provided
+ first by the caller is reported.
+* **leftmost-longest** is like leftmost-first, except when there are multiple
+ matches starting at the same leftmost position, the pattern corresponding to
+ the longest match is returned.
+
+(The crate API documentation discusses these differences, with examples, in
+more depth on the `MatchKind` type.)
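+
+A short example makes the difference concrete (the patterns are illustrative;
+the API shown is the crate's public builder):
+
+    use aho_corasick::{AhoCorasickBuilder, MatchKind};
+
+    let haystack = "samwise";
+
+    // Leftmost-first: "sam" wins because it was provided first.
+    let ac = AhoCorasickBuilder::new()
+        .match_kind(MatchKind::LeftmostFirst)
+        .build(&["sam", "samwise"]);
+    let mat = ac.find(haystack).unwrap();
+    assert_eq!("sam", &haystack[mat.start()..mat.end()]);
+
+    // Leftmost-longest: "samwise" wins because it is longer.
+    let ac = AhoCorasickBuilder::new()
+        .match_kind(MatchKind::LeftmostLongest)
+        .build(&["sam", "samwise"]);
+    let mat = ac.find(haystack).unwrap();
+    assert_eq!("samwise", &haystack[mat.start()..mat.end()]);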
+
+The reason why supporting these match semantics is important is because it
+gives the user more control over the match procedure. For example,
+leftmost-first permits users to implement match priority by simply putting the
+higher priority patterns first. Leftmost-longest, on the other hand, permits
+finding the longest possible match, which might be useful when trying to find
+words matching a dictionary. Additionally, regex engines often want to use
+Aho-Corasick as an optimization when searching for an alternation of literals.
+In order to preserve correct match semantics, regex engines typically can't use
+the standard textbook definition directly, since regex engines will implement
+either leftmost-first (Perl-like) or leftmost-longest (POSIX) match semantics.
+
+Supporting leftmost semantics requires a couple key changes:
+
+* Constructing the Aho-Corasick automaton changes a bit in both how the trie is
+ constructed and how failure transitions are found. Namely, only a subset of
+ the failure transitions are added. Specifically, only the failure transitions
+ that either do not occur after a match or do occur after a match but preserve
+ that match are kept. (More details on this can be found in `src/nfa.rs`.)
+* The search algorithm changes slightly. Since we are looking for the leftmost
+ match, we cannot quit as soon as a match is detected. Instead, after a match
+ is detected, we must keep searching until either the end of the input or
+ until a dead state is seen. (Dead states are not used for standard match
+ semantics. Dead states mean that searching should stop after a match has been
+  found.) A sketch of this modified search loop follows below.
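+
+Here is what that leftmost search loop might look like, using the same
+hypothetical `FiniteStateMachine` interface as before, extended with two
+illustrative queries: `is_dead` and `match_len` (the length of the match
+ending at a match state):
+
+    fn find_leftmost(
+        fsm: &FiniteStateMachine,
+        haystack: &[u8],
+    ) -> Option<(usize, usize)> {
+        let mut state_id = fsm.start();
+        let mut last_match = None;
+        for (i, &b) in haystack.iter().enumerate() {
+            state_id = loop {
+                match fsm.next_state(state_id, b) {
+                    Some(id) => break id,
+                    None => state_id = fsm.next_fail_state(state_id),
+                }
+            };
+            // A dead state signals that nothing better than the match we
+            // have already seen can follow, so stop and report it.
+            if fsm.is_dead(state_id) {
+                return last_match;
+            }
+            if fsm.is_match(state_id) {
+                // Unlike the standard search, remember the match and keep
+                // going instead of returning immediately.
+                last_match = Some((i + 1 - fsm.match_len(state_id), i + 1));
+            }
+        }
+        last_match
+    }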
+
+Other implementations of Aho-Corasick do support leftmost match semantics, but
+they do it with more overhead at search time, or even worse, with a queue of
+matches and sophisticated hijinks to disambiguate the matches. While our
+construction algorithm becomes a bit more complicated, the correct match
+semantics fall out from the structure of the automaton itself.
+
+
+# Overlapping matches
+
+One of the nice properties of an Aho-Corasick automaton is that it can report
+all possible matches, even when they overlap with one another. In this mode,
+the match semantics don't matter, since all possible matches are reported.
+Overlapping searches work just like regular searches, except the state
+identifier at which the previous search stopped is carried over to the next
+search, so that it can pick up where it left off. If there are additional
+matches at that state, then they are reported before resuming the search.
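+
+For example (the patterns are illustrative; the matches are sorted here
+because the iterator reports them in the order the automaton detects them):
+
+    use aho_corasick::AhoCorasick;
+
+    let ac = AhoCorasick::new(&["abcd", "b", "bc"]);
+    let mut matches: Vec<(usize, usize)> = ac
+        .find_overlapping_iter("abcd")
+        .map(|m| (m.start(), m.end()))
+        .collect();
+    matches.sort();
+    // All of "abcd", "b" and "bc" are reported, even though they overlap.
+    assert_eq!(matches, vec![(0, 4), (1, 2), (1, 3)]);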
+
+Enabling leftmost-first or leftmost-longest match semantics causes the
+automaton to use a subset of all failure transitions, which means that
+overlapping searches cannot be used. Therefore, if leftmost match semantics are
+used, attempting to do an overlapping search will panic. Thus, to get
+overlapping searches, the caller must use the default standard match semantics.
+This behavior was chosen because there are only two alternatives, which were
+deemed worse:
+
+* Compile two automatons internally, one for standard semantics and one for
+ the semantics requested by the caller (if not standard).
+* Create a new type, distinct from the `AhoCorasick` type, which has different
+ capabilities based on the configuration options.
+
+The first is untenable because of the amount of memory used by the automaton.
+The second increases the complexity of the API too much by adding too many
+types that do similar things. It is conceptually much simpler to keep all
+searching isolated to a single type. Callers may query whether the automaton
+supports overlapping searches via the `AhoCorasick::supports_overlapping`
+method.
+
+
+# Stream searching
+
+Since Aho-Corasick is an automaton, it is possible to do a partial search on
+one piece of the haystack and then resume that search on subsequent pieces.
+This is useful when the haystack you're trying to search is not stored
+contiguously in memory, or if one does not want to read the entire haystack
+into memory at once.
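+
+From the caller's perspective, this is exposed as a search over any
+`std::io::Read` (a byte slice stands in for a real stream in this sketch;
+each item is a `Result` because reading can fail):
+
+    use aho_corasick::AhoCorasick;
+
+    let ac = AhoCorasick::new(&["Watson"]);
+    let rdr: &[u8] = b"Sherlock Holmes and Dr. Watson";
+    for result in ac.stream_find_iter(rdr) {
+        let mat = result.expect("reading the stream failed");
+        assert_eq!((24, 30), (mat.start(), mat.end()));
+    }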
+
+Currently, only standard semantics are supported for stream searching. This is
+some of the more complicated code in this crate, and is something I would very
+much like to improve. In particular, it currently has the restriction that it
+must buffer at least enough of the haystack in memory in order to fit the
+longest possible match. The difficulty in getting stream searching right is
+that the implementation choices (such as the buffer size) often impact what the
+API looks like and what it's allowed to do.
+
+
+# Prefilters
+
+In some cases, Aho-Corasick is not the fastest way to search for multiple
+patterns. Sometimes, the search can be accelerated using highly
+optimized SIMD routines. For example, consider searching the following
+patterns:
+
+ Sherlock
+ Moriarty
+ Watson
+
+It is plausible that it would be much faster to quickly look for occurrences of
+the leading bytes, `S`, `M` or `W`, before trying to start searching via the
+automaton. Indeed, this is exactly what this crate will do.
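+
+A sketch of that idea, using `memchr::memchr3` for the SIMD-accelerated scan
+(the confirmation step here is a naive stand-in for running the automaton
+anchored at the candidate position):
+
+    use memchr::memchr3;
+
+    fn prefilter_find(haystack: &[u8]) -> Option<(usize, usize)> {
+        let mut at = 0;
+        while at < haystack.len() {
+            // Skip ahead to the next plausible starting byte.
+            let i = at + memchr3(b'S', b'M', b'W', &haystack[at..])?;
+            if let Some(end) = confirm_match_at(haystack, i) {
+                return Some((i, end));
+            }
+            at = i + 1;
+        }
+        None
+    }
+
+    // Stand-in for confirming a candidate with the automaton.
+    fn confirm_match_at(haystack: &[u8], start: usize) -> Option<usize> {
+        let pats: [&[u8]; 3] = [b"Sherlock", b"Moriarty", b"Watson"];
+        for pat in pats.iter() {
+            if haystack[start..].starts_with(pat) {
+                return Some(start + pat.len());
+            }
+        }
+        None
+    }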
+
+When there are more than three distinct starting bytes, this crate will
+look for three distinct bytes occurring at any position in the patterns, while
+preferring bytes that are heuristically determined to be rare over others. For
+example:
+
+ Abuzz
+ Sanchez
+ Vasquez
+ Topaz
+ Waltz
+
+Here, we have more than three distinct starting bytes, but all of the patterns
+contain `z`, which is typically a rare byte. In this case, the prefilter will
+scan for `z`, back up a bit, and then execute the Aho-Corasick automaton.
+
+If all of that fails, then a packed multiple substring algorithm will be
+attempted. Currently, the only algorithm available for this is Teddy, but more
+may be added in the future. Teddy is unlike the above prefilters in that it
+confirms its own matches, so when Teddy is active, it might not be necessary
+for Aho-Corasick to run at all. (See `Automaton::leftmost_find_at_no_state_imp`
+in `src/automaton.rs`.) However, the current Teddy implementation only works
+on `x86_64` and when SSSE3 or AVX2 are available, and moreover, only works
+_well_ when there are a small number of patterns (say, less than 100). Teddy
+also requires the haystack to be of a certain length (more than 16-34 bytes).
+When the haystack is shorter than that, Rabin-Karp is used instead. (See
+`src/packed/rabinkarp.rs`.)
+
+There is a more thorough description of Teddy at
+[`src/packed/teddy/README.md`](src/packed/teddy/README.md).