summaryrefslogtreecommitdiffstats
path: root/vendor/regex-automata/PLANS.md
diff options
context:
space:
mode:
Diffstat (limited to 'vendor/regex-automata/PLANS.md')
-rw-r--r--vendor/regex-automata/PLANS.md165
1 files changed, 0 insertions, 165 deletions
diff --git a/vendor/regex-automata/PLANS.md b/vendor/regex-automata/PLANS.md
deleted file mode 100644
index 2fa9392ef..000000000
--- a/vendor/regex-automata/PLANS.md
+++ /dev/null
@@ -1,165 +0,0 @@
-pattern_limit should not be defined inside nfa::thompson, but rather at the
-top-level.
-
------
-
-Main problem right now is exemplified by the set60 and set70 failing tests. In
-particular, when finding the starting position while matching multiple regexes
-simultaneously, the reverse search is messed up. The reverse search doesn't
-depend on which regex matched in the forward direction, which means it won't
-always find the correcting starting location. Unfortunately, the only way to
-fix this, as far as I can tell, is to add a group of start states for every
-regex in the DFA. Then once we do the reverse search, we need to choose the
-correct start state based on which regex matched in the forward direction.
-
-This is a nasty change.
-
-So it looks like this only applies when doing an overlapping search in reverse
-to find the start of a match. That means we should make this configurable
-but enable it by default for the reverse automata. It should be configurable
-so that folks can construct a regex that doesn't have the ability to do
-overlapping searches correctly. If an overlapping search is attempted with
-a reverse automaton that lacks starting states for each pattern, then the
-implementation should panic.
-
-BUT! It is also convenient to provide this option in general for folks that
-want a DFA that can match any pattern while also being able to match a specific
-pattern.
-
-Straw man:
-
-* Update dense::Config to have a `starts_for_each_pattern` option. It should
- be disabled by default.
-* In `RegexBuilder::build_many_with_size` tweak the reverse DFA configuration
- to have the aforementioned option enabled.
-* It would be interesting to add new APIs to `Regex` that support matching
- specific patterns, but I think this is a complication. If we did want to do
- this, then we should just add it to the `_at` variants and leave the rest of
- the API untouched.
-* Add a `pattern_id: Option<PatternID>` parameter to each of the five
- `*_at` methods on the `dfa::Automaton` trait. A value of `None` retains the
- existing behavior. A `Some` value means that the starting state for that
- specific pattern must be chosen, which in turn implies an anchored search.
- (This means `starts_for_each_pattern` has utility for single-pattern DFAs
- since it makes it possible to build a DFA that can do both unanchored and
- anchored searches.)
-* Thread this new parameter down into the various functions in `dfa::search`
- all the way down into `init_fwd` and `init_rev`. These functions will then
- pass it to `dfa.start_state_{forward,reverse}`.
-* This is where things get gruesome since we now need to completely re-work how
- start states are represented in dense and sparse DFAs _and_ it needs to be
- configurable. It looks like the `Start` type from `dfa::automaton` can
- basically remain unchanged, since it still represents one of the four
- possible starting states that will need to be applied for every pattern.
-* For `dfa::dense`, change `StartList` to `StartTable`. Currently, its only
- header is the state ID count, which is always 4. We'll want to change this
- to the stride and add a new header value that encodes the number of patterns.
- When the number of patterns is zero, then existing behavior is preserved and
- represents the case where `starts_for_each_pattern` is disabled (or in the
- case of an empty DFA). When non-zero, a table of starting state IDs is
- encoded with each row corresponding to the 4 starting states for each
- pattern. Before this table (even if it's empty), the 4 starting states for
- the entire DFA are encoded.
-* For `dfa::sparse`, do the same as above. They are essentially the same right
- now anyway, with the only difference being that sparse DFAs use `&[u8]`
- instead of `&[S]` (because sparse DFAs don't have any alignment
- requirements).
-* Modify `DFA::empty` to accept a `starts_for_each_pattern` bool that, when
- true, creates a start table with the header, the start states for the entire
- DFA and a row of start states for each pattern. When false, no rows are
- added.
-* Expose whether there are starting states for each pattern via a predicate
- on the DFA.
-* Modify the determinizer's `add_starts` method to basically do what it does,
- but also do it for each pattern when the DFA is configured for it. It should
- continue to reuse states as appropriate or not generate new states if they
- aren't needed. This will want to use the `NFA::start_pattern` method, which
- provides the starting NFA state ID for the given pattern.
-* Fix the dense->sparse conversion. At this point, this piece should be fairly
- straight-forward since the sparse representation of starting states is
- basically identical to the dense representation.
-
-At this point, I think the bug should resolve itself.
-
-^^^^ DONE! IT WORKS!
-
------
-
-
-Add top-level SyntaxConfig (or some such) that has all of the regex-syntax
-options forwarded, but with automata oriented docs. Then use this for all of
-the engines instead of having to repeat every option for every builder.
-
------
-
-These produce different results. PCRE2 looks correct. Basically, we should be
-using the context around the `at` position correctly, which we aren't doing
-right now. Seems tricky to get right, particularly when confirming the match
-with a reverse DFA.
-
-Maybe our 'at' functions need to take a full range... Sigh. This is indeed what
-RE2 does. GAH.
-
-fn main() {
- let re = regex::Regex::new(r"(?-u)\b\sbar").unwrap();
- let s = "foo bar baz";
- println!("{:?}", re.find_at(s, 3).map(|m| m.as_str()));
-
- let re = pcre2::bytes::Regex::new(r"\b\sbar").unwrap();
- let s = "foo bar baz";
- println!("{:?}", re.find_at(s.as_bytes(), 3).unwrap());
-}
-
-^^^^ This is fixed now, but we still need to find a way to add test coverage
-for "context" searches. It'd be nice to do this automatically, but we'll
-probably just added a new 'context = [start, end]' option.
-
------
-
-
-* Create regex-test crate, based on glob-test. Try to anticipate the needs for
- the full regex test suite.
- * See if we can clean up tests.
- * Provide a way to mark a test as expensive.
- * Provide a way to test is_match_at and find_at.
- * Test shortest_match_at too? Huge pain. Add tests for it.
- * Port ALL tests from the regex crate. Will probably need a way to mark a
- test as skipped.
- * Document tests better.
-* Find a way to remove byteorder dependency.
-* Reorganize crate API:
- * Have errors contain `Box<Error+Send+Sync>` instead of `String`.
- * Make errors non-exhaustive.
- * Audit `StateID` trait for safety.
- * Brainstorm hard about `DFA` trait and the fact that DenseDFA and SparseDFA
- have inefficient implementations of some methods. Maybe use multiple
- traits? Answer: get rid of premultiply/classes knobs and just enable
- them by default. Should remove a huge amount of code.
- * Check whether `unsafe` is really needed to eliminate bounds checks. Use
- micro-benchmarks and bigger CLI workloads using `regex-automata-debug`.
- * Re-write module docs for `dfa` as they are no longer top-level. Keep most.
- * Retain any pertinent top-level crate docs, but don't rewrite yet.
- * Clean up builders if we can. e.g., Determinizer, minimizer, it's all a mess
- right now.
- * Clean up and add 'always_match' and 'never_match' constructors for every
- regex engine.
- * See about supporting ^, $, \A, \z, \b and \B in DFAs. Do the non-Unicode
- version of \b unfortunately. Carefully scrutinize how the regex crate's
- lazy DFA does it and try to make it comprehensible. Done! Except for the
- part about making it comprehensible.
-* Rethink prefilters?
-* Add `regex-automata-generate` CLI tool. This should just be a copy of
- the `ucd-generate dfa` and `ucd-generate regex` commands.
-
-Then build new public `nfa` sub-module.
- * For Unicode \b, generate \w DFA (forwards and reverse) and embed it into
- source for fast checking. That way, we don't need to ever do explicit UTF-8
- decoding anywhere. Yay.
-
-Then `lazy` sub-module.
-
-Then `onepass`.
-
-Then `jit`.
-
-... and beyond? CRAZY. But it can be done! Build STRONG base layers.