diff options
Diffstat (limited to 'vendor/regex-automata/PLANS.md')
-rw-r--r-- | vendor/regex-automata/PLANS.md | 165 |
1 files changed, 0 insertions, 165 deletions
diff --git a/vendor/regex-automata/PLANS.md b/vendor/regex-automata/PLANS.md deleted file mode 100644 index 2fa9392ef..000000000 --- a/vendor/regex-automata/PLANS.md +++ /dev/null @@ -1,165 +0,0 @@ -pattern_limit should not be defined inside nfa::thompson, but rather at the -top-level. - ------ - -Main problem right now is exemplified by the set60 and set70 failing tests. In -particular, when finding the starting position while matching multiple regexes -simultaneously, the reverse search is messed up. The reverse search doesn't -depend on which regex matched in the forward direction, which means it won't -always find the correcting starting location. Unfortunately, the only way to -fix this, as far as I can tell, is to add a group of start states for every -regex in the DFA. Then once we do the reverse search, we need to choose the -correct start state based on which regex matched in the forward direction. - -This is a nasty change. - -So it looks like this only applies when doing an overlapping search in reverse -to find the start of a match. That means we should make this configurable -but enable it by default for the reverse automata. It should be configurable -so that folks can construct a regex that doesn't have the ability to do -overlapping searches correctly. If an overlapping search is attempted with -a reverse automaton that lacks starting states for each pattern, then the -implementation should panic. - -BUT! It is also convenient to provide this option in general for folks that -want a DFA that can match any pattern while also being able to match a specific -pattern. - -Straw man: - -* Update dense::Config to have a `starts_for_each_pattern` option. It should - be disabled by default. -* In `RegexBuilder::build_many_with_size` tweak the reverse DFA configuration - to have the aforementioned option enabled. -* It would be interesting to add new APIs to `Regex` that support matching - specific patterns, but I think this is a complication. If we did want to do - this, then we should just add it to the `_at` variants and leave the rest of - the API untouched. -* Add a `pattern_id: Option<PatternID>` parameter to each of the five - `*_at` methods on the `dfa::Automaton` trait. A value of `None` retains the - existing behavior. A `Some` value means that the starting state for that - specific pattern must be chosen, which in turn implies an anchored search. - (This means `starts_for_each_pattern` has utility for single-pattern DFAs - since it makes it possible to build a DFA that can do both unanchored and - anchored searches.) -* Thread this new parameter down into the various functions in `dfa::search` - all the way down into `init_fwd` and `init_rev`. These functions will then - pass it to `dfa.start_state_{forward,reverse}`. -* This is where things get gruesome since we now need to completely re-work how - start states are represented in dense and sparse DFAs _and_ it needs to be - configurable. It looks like the `Start` type from `dfa::automaton` can - basically remain unchanged, since it still represents one of the four - possible starting states that will need to be applied for every pattern. -* For `dfa::dense`, change `StartList` to `StartTable`. Currently, its only - header is the state ID count, which is always 4. We'll want to change this - to the stride and add a new header value that encodes the number of patterns. - When the number of patterns is zero, then existing behavior is preserved and - represents the case where `starts_for_each_pattern` is disabled (or in the - case of an empty DFA). When non-zero, a table of starting state IDs is - encoded with each row corresponding to the 4 starting states for each - pattern. Before this table (even if it's empty), the 4 starting states for - the entire DFA are encoded. -* For `dfa::sparse`, do the same as above. They are essentially the same right - now anyway, with the only difference being that sparse DFAs use `&[u8]` - instead of `&[S]` (because sparse DFAs don't have any alignment - requirements). -* Modify `DFA::empty` to accept a `starts_for_each_pattern` bool that, when - true, creates a start table with the header, the start states for the entire - DFA and a row of start states for each pattern. When false, no rows are - added. -* Expose whether there are starting states for each pattern via a predicate - on the DFA. -* Modify the determinizer's `add_starts` method to basically do what it does, - but also do it for each pattern when the DFA is configured for it. It should - continue to reuse states as appropriate or not generate new states if they - aren't needed. This will want to use the `NFA::start_pattern` method, which - provides the starting NFA state ID for the given pattern. -* Fix the dense->sparse conversion. At this point, this piece should be fairly - straight-forward since the sparse representation of starting states is - basically identical to the dense representation. - -At this point, I think the bug should resolve itself. - -^^^^ DONE! IT WORKS! - ------ - - -Add top-level SyntaxConfig (or some such) that has all of the regex-syntax -options forwarded, but with automata oriented docs. Then use this for all of -the engines instead of having to repeat every option for every builder. - ------ - -These produce different results. PCRE2 looks correct. Basically, we should be -using the context around the `at` position correctly, which we aren't doing -right now. Seems tricky to get right, particularly when confirming the match -with a reverse DFA. - -Maybe our 'at' functions need to take a full range... Sigh. This is indeed what -RE2 does. GAH. - -fn main() { - let re = regex::Regex::new(r"(?-u)\b\sbar").unwrap(); - let s = "foo bar baz"; - println!("{:?}", re.find_at(s, 3).map(|m| m.as_str())); - - let re = pcre2::bytes::Regex::new(r"\b\sbar").unwrap(); - let s = "foo bar baz"; - println!("{:?}", re.find_at(s.as_bytes(), 3).unwrap()); -} - -^^^^ This is fixed now, but we still need to find a way to add test coverage -for "context" searches. It'd be nice to do this automatically, but we'll -probably just added a new 'context = [start, end]' option. - ------ - - -* Create regex-test crate, based on glob-test. Try to anticipate the needs for - the full regex test suite. - * See if we can clean up tests. - * Provide a way to mark a test as expensive. - * Provide a way to test is_match_at and find_at. - * Test shortest_match_at too? Huge pain. Add tests for it. - * Port ALL tests from the regex crate. Will probably need a way to mark a - test as skipped. - * Document tests better. -* Find a way to remove byteorder dependency. -* Reorganize crate API: - * Have errors contain `Box<Error+Send+Sync>` instead of `String`. - * Make errors non-exhaustive. - * Audit `StateID` trait for safety. - * Brainstorm hard about `DFA` trait and the fact that DenseDFA and SparseDFA - have inefficient implementations of some methods. Maybe use multiple - traits? Answer: get rid of premultiply/classes knobs and just enable - them by default. Should remove a huge amount of code. - * Check whether `unsafe` is really needed to eliminate bounds checks. Use - micro-benchmarks and bigger CLI workloads using `regex-automata-debug`. - * Re-write module docs for `dfa` as they are no longer top-level. Keep most. - * Retain any pertinent top-level crate docs, but don't rewrite yet. - * Clean up builders if we can. e.g., Determinizer, minimizer, it's all a mess - right now. - * Clean up and add 'always_match' and 'never_match' constructors for every - regex engine. - * See about supporting ^, $, \A, \z, \b and \B in DFAs. Do the non-Unicode - version of \b unfortunately. Carefully scrutinize how the regex crate's - lazy DFA does it and try to make it comprehensible. Done! Except for the - part about making it comprehensible. -* Rethink prefilters? -* Add `regex-automata-generate` CLI tool. This should just be a copy of - the `ucd-generate dfa` and `ucd-generate regex` commands. - -Then build new public `nfa` sub-module. - * For Unicode \b, generate \w DFA (forwards and reverse) and embed it into - source for fast checking. That way, we don't need to ever do explicit UTF-8 - decoding anywhere. Yay. - -Then `lazy` sub-module. - -Then `onepass`. - -Then `jit`. - -... and beyond? CRAZY. But it can be done! Build STRONG base layers. |