diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-30 18:31:44 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-30 18:31:44 +0000 |
commit | c23a457e72abe608715ac76f076f47dc42af07a5 (patch) | |
tree | 2772049aaf84b5c9d0ed12ec8d86812f7a7904b6 /vendor/regex-automata-0.2.0/PLANS.md | |
parent | Releasing progress-linux version 1.73.0+dfsg1-1~progress7.99u1. (diff) | |
download | rustc-c23a457e72abe608715ac76f076f47dc42af07a5.tar.xz rustc-c23a457e72abe608715ac76f076f47dc42af07a5.zip |
Merging upstream version 1.74.1+dfsg1.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'vendor/regex-automata-0.2.0/PLANS.md')
-rw-r--r-- | vendor/regex-automata-0.2.0/PLANS.md | 165 |
1 files changed, 165 insertions, 0 deletions
diff --git a/vendor/regex-automata-0.2.0/PLANS.md b/vendor/regex-automata-0.2.0/PLANS.md new file mode 100644 index 000000000..2fa9392ef --- /dev/null +++ b/vendor/regex-automata-0.2.0/PLANS.md @@ -0,0 +1,165 @@ +pattern_limit should not be defined inside nfa::thompson, but rather at the +top-level. + +----- + +Main problem right now is exemplified by the set60 and set70 failing tests. In +particular, when finding the starting position while matching multiple regexes +simultaneously, the reverse search is messed up. The reverse search doesn't +depend on which regex matched in the forward direction, which means it won't +always find the correcting starting location. Unfortunately, the only way to +fix this, as far as I can tell, is to add a group of start states for every +regex in the DFA. Then once we do the reverse search, we need to choose the +correct start state based on which regex matched in the forward direction. + +This is a nasty change. + +So it looks like this only applies when doing an overlapping search in reverse +to find the start of a match. That means we should make this configurable +but enable it by default for the reverse automata. It should be configurable +so that folks can construct a regex that doesn't have the ability to do +overlapping searches correctly. If an overlapping search is attempted with +a reverse automaton that lacks starting states for each pattern, then the +implementation should panic. + +BUT! It is also convenient to provide this option in general for folks that +want a DFA that can match any pattern while also being able to match a specific +pattern. + +Straw man: + +* Update dense::Config to have a `starts_for_each_pattern` option. It should + be disabled by default. +* In `RegexBuilder::build_many_with_size` tweak the reverse DFA configuration + to have the aforementioned option enabled. +* It would be interesting to add new APIs to `Regex` that support matching + specific patterns, but I think this is a complication. If we did want to do + this, then we should just add it to the `_at` variants and leave the rest of + the API untouched. +* Add a `pattern_id: Option<PatternID>` parameter to each of the five + `*_at` methods on the `dfa::Automaton` trait. A value of `None` retains the + existing behavior. A `Some` value means that the starting state for that + specific pattern must be chosen, which in turn implies an anchored search. + (This means `starts_for_each_pattern` has utility for single-pattern DFAs + since it makes it possible to build a DFA that can do both unanchored and + anchored searches.) +* Thread this new parameter down into the various functions in `dfa::search` + all the way down into `init_fwd` and `init_rev`. These functions will then + pass it to `dfa.start_state_{forward,reverse}`. +* This is where things get gruesome since we now need to completely re-work how + start states are represented in dense and sparse DFAs _and_ it needs to be + configurable. It looks like the `Start` type from `dfa::automaton` can + basically remain unchanged, since it still represents one of the four + possible starting states that will need to be applied for every pattern. +* For `dfa::dense`, change `StartList` to `StartTable`. Currently, its only + header is the state ID count, which is always 4. We'll want to change this + to the stride and add a new header value that encodes the number of patterns. + When the number of patterns is zero, then existing behavior is preserved and + represents the case where `starts_for_each_pattern` is disabled (or in the + case of an empty DFA). When non-zero, a table of starting state IDs is + encoded with each row corresponding to the 4 starting states for each + pattern. Before this table (even if it's empty), the 4 starting states for + the entire DFA are encoded. +* For `dfa::sparse`, do the same as above. They are essentially the same right + now anyway, with the only difference being that sparse DFAs use `&[u8]` + instead of `&[S]` (because sparse DFAs don't have any alignment + requirements). +* Modify `DFA::empty` to accept a `starts_for_each_pattern` bool that, when + true, creates a start table with the header, the start states for the entire + DFA and a row of start states for each pattern. When false, no rows are + added. +* Expose whether there are starting states for each pattern via a predicate + on the DFA. +* Modify the determinizer's `add_starts` method to basically do what it does, + but also do it for each pattern when the DFA is configured for it. It should + continue to reuse states as appropriate or not generate new states if they + aren't needed. This will want to use the `NFA::start_pattern` method, which + provides the starting NFA state ID for the given pattern. +* Fix the dense->sparse conversion. At this point, this piece should be fairly + straight-forward since the sparse representation of starting states is + basically identical to the dense representation. + +At this point, I think the bug should resolve itself. + +^^^^ DONE! IT WORKS! + +----- + + +Add top-level SyntaxConfig (or some such) that has all of the regex-syntax +options forwarded, but with automata oriented docs. Then use this for all of +the engines instead of having to repeat every option for every builder. + +----- + +These produce different results. PCRE2 looks correct. Basically, we should be +using the context around the `at` position correctly, which we aren't doing +right now. Seems tricky to get right, particularly when confirming the match +with a reverse DFA. + +Maybe our 'at' functions need to take a full range... Sigh. This is indeed what +RE2 does. GAH. + +fn main() { + let re = regex::Regex::new(r"(?-u)\b\sbar").unwrap(); + let s = "foo bar baz"; + println!("{:?}", re.find_at(s, 3).map(|m| m.as_str())); + + let re = pcre2::bytes::Regex::new(r"\b\sbar").unwrap(); + let s = "foo bar baz"; + println!("{:?}", re.find_at(s.as_bytes(), 3).unwrap()); +} + +^^^^ This is fixed now, but we still need to find a way to add test coverage +for "context" searches. It'd be nice to do this automatically, but we'll +probably just added a new 'context = [start, end]' option. + +----- + + +* Create regex-test crate, based on glob-test. Try to anticipate the needs for + the full regex test suite. + * See if we can clean up tests. + * Provide a way to mark a test as expensive. + * Provide a way to test is_match_at and find_at. + * Test shortest_match_at too? Huge pain. Add tests for it. + * Port ALL tests from the regex crate. Will probably need a way to mark a + test as skipped. + * Document tests better. +* Find a way to remove byteorder dependency. +* Reorganize crate API: + * Have errors contain `Box<Error+Send+Sync>` instead of `String`. + * Make errors non-exhaustive. + * Audit `StateID` trait for safety. + * Brainstorm hard about `DFA` trait and the fact that DenseDFA and SparseDFA + have inefficient implementations of some methods. Maybe use multiple + traits? Answer: get rid of premultiply/classes knobs and just enable + them by default. Should remove a huge amount of code. + * Check whether `unsafe` is really needed to eliminate bounds checks. Use + micro-benchmarks and bigger CLI workloads using `regex-automata-debug`. + * Re-write module docs for `dfa` as they are no longer top-level. Keep most. + * Retain any pertinent top-level crate docs, but don't rewrite yet. + * Clean up builders if we can. e.g., Determinizer, minimizer, it's all a mess + right now. + * Clean up and add 'always_match' and 'never_match' constructors for every + regex engine. + * See about supporting ^, $, \A, \z, \b and \B in DFAs. Do the non-Unicode + version of \b unfortunately. Carefully scrutinize how the regex crate's + lazy DFA does it and try to make it comprehensible. Done! Except for the + part about making it comprehensible. +* Rethink prefilters? +* Add `regex-automata-generate` CLI tool. This should just be a copy of + the `ucd-generate dfa` and `ucd-generate regex` commands. + +Then build new public `nfa` sub-module. + * For Unicode \b, generate \w DFA (forwards and reverse) and embed it into + source for fast checking. That way, we don't need to ever do explicit UTF-8 + decoding anywhere. Yay. + +Then `lazy` sub-module. + +Then `onepass`. + +Then `jit`. + +... and beyond? CRAZY. But it can be done! Build STRONG base layers. |