summaryrefslogtreecommitdiffstats
path: root/vendor/regex-automata/PLANS.md
diff options
context:
space:
mode:
Diffstat (limited to 'vendor/regex-automata/PLANS.md')
-rw-r--r--vendor/regex-automata/PLANS.md165
1 files changed, 165 insertions, 0 deletions
diff --git a/vendor/regex-automata/PLANS.md b/vendor/regex-automata/PLANS.md
new file mode 100644
index 000000000..2fa9392ef
--- /dev/null
+++ b/vendor/regex-automata/PLANS.md
@@ -0,0 +1,165 @@
+pattern_limit should not be defined inside nfa::thompson, but rather at the
+top-level.
+
+-----
+
+Main problem right now is exemplified by the set60 and set70 failing tests. In
+particular, when finding the starting position while matching multiple regexes
+simultaneously, the reverse search is messed up. The reverse search doesn't
+depend on which regex matched in the forward direction, which means it won't
+always find the correcting starting location. Unfortunately, the only way to
+fix this, as far as I can tell, is to add a group of start states for every
+regex in the DFA. Then once we do the reverse search, we need to choose the
+correct start state based on which regex matched in the forward direction.
+
+This is a nasty change.
+
+So it looks like this only applies when doing an overlapping search in reverse
+to find the start of a match. That means we should make this configurable
+but enable it by default for the reverse automata. It should be configurable
+so that folks can construct a regex that doesn't have the ability to do
+overlapping searches correctly. If an overlapping search is attempted with
+a reverse automaton that lacks starting states for each pattern, then the
+implementation should panic.
+
+BUT! It is also convenient to provide this option in general for folks that
+want a DFA that can match any pattern while also being able to match a specific
+pattern.
+
+Straw man:
+
+* Update dense::Config to have a `starts_for_each_pattern` option. It should
+ be disabled by default.
+* In `RegexBuilder::build_many_with_size` tweak the reverse DFA configuration
+ to have the aforementioned option enabled.
+* It would be interesting to add new APIs to `Regex` that support matching
+ specific patterns, but I think this is a complication. If we did want to do
+ this, then we should just add it to the `_at` variants and leave the rest of
+ the API untouched.
+* Add a `pattern_id: Option<PatternID>` parameter to each of the five
+ `*_at` methods on the `dfa::Automaton` trait. A value of `None` retains the
+ existing behavior. A `Some` value means that the starting state for that
+ specific pattern must be chosen, which in turn implies an anchored search.
+ (This means `starts_for_each_pattern` has utility for single-pattern DFAs
+ since it makes it possible to build a DFA that can do both unanchored and
+ anchored searches.)
+* Thread this new parameter down into the various functions in `dfa::search`
+ all the way down into `init_fwd` and `init_rev`. These functions will then
+ pass it to `dfa.start_state_{forward,reverse}`.
+* This is where things get gruesome since we now need to completely re-work how
+ start states are represented in dense and sparse DFAs _and_ it needs to be
+ configurable. It looks like the `Start` type from `dfa::automaton` can
+ basically remain unchanged, since it still represents one of the four
+ possible starting states that will need to be applied for every pattern.
+* For `dfa::dense`, change `StartList` to `StartTable`. Currently, its only
+ header is the state ID count, which is always 4. We'll want to change this
+ to the stride and add a new header value that encodes the number of patterns.
+ When the number of patterns is zero, then existing behavior is preserved and
+ represents the case where `starts_for_each_pattern` is disabled (or in the
+ case of an empty DFA). When non-zero, a table of starting state IDs is
+ encoded with each row corresponding to the 4 starting states for each
+ pattern. Before this table (even if it's empty), the 4 starting states for
+ the entire DFA are encoded.
+* For `dfa::sparse`, do the same as above. They are essentially the same right
+ now anyway, with the only difference being that sparse DFAs use `&[u8]`
+ instead of `&[S]` (because sparse DFAs don't have any alignment
+ requirements).
+* Modify `DFA::empty` to accept a `starts_for_each_pattern` bool that, when
+ true, creates a start table with the header, the start states for the entire
+ DFA and a row of start states for each pattern. When false, no rows are
+ added.
+* Expose whether there are starting states for each pattern via a predicate
+ on the DFA.
+* Modify the determinizer's `add_starts` method to basically do what it does,
+ but also do it for each pattern when the DFA is configured for it. It should
+ continue to reuse states as appropriate or not generate new states if they
+ aren't needed. This will want to use the `NFA::start_pattern` method, which
+ provides the starting NFA state ID for the given pattern.
+* Fix the dense->sparse conversion. At this point, this piece should be fairly
+ straight-forward since the sparse representation of starting states is
+ basically identical to the dense representation.
+
+At this point, I think the bug should resolve itself.
+
+^^^^ DONE! IT WORKS!
+
+-----
+
+
+Add top-level SyntaxConfig (or some such) that has all of the regex-syntax
+options forwarded, but with automata oriented docs. Then use this for all of
+the engines instead of having to repeat every option for every builder.
+
+-----
+
+These produce different results. PCRE2 looks correct. Basically, we should be
+using the context around the `at` position correctly, which we aren't doing
+right now. Seems tricky to get right, particularly when confirming the match
+with a reverse DFA.
+
+Maybe our 'at' functions need to take a full range... Sigh. This is indeed what
+RE2 does. GAH.
+
+fn main() {
+ let re = regex::Regex::new(r"(?-u)\b\sbar").unwrap();
+ let s = "foo bar baz";
+ println!("{:?}", re.find_at(s, 3).map(|m| m.as_str()));
+
+ let re = pcre2::bytes::Regex::new(r"\b\sbar").unwrap();
+ let s = "foo bar baz";
+ println!("{:?}", re.find_at(s.as_bytes(), 3).unwrap());
+}
+
+^^^^ This is fixed now, but we still need to find a way to add test coverage
+for "context" searches. It'd be nice to do this automatically, but we'll
+probably just added a new 'context = [start, end]' option.
+
+-----
+
+
+* Create regex-test crate, based on glob-test. Try to anticipate the needs for
+ the full regex test suite.
+ * See if we can clean up tests.
+ * Provide a way to mark a test as expensive.
+ * Provide a way to test is_match_at and find_at.
+ * Test shortest_match_at too? Huge pain. Add tests for it.
+ * Port ALL tests from the regex crate. Will probably need a way to mark a
+ test as skipped.
+ * Document tests better.
+* Find a way to remove byteorder dependency.
+* Reorganize crate API:
+ * Have errors contain `Box<Error+Send+Sync>` instead of `String`.
+ * Make errors non-exhaustive.
+ * Audit `StateID` trait for safety.
+ * Brainstorm hard about `DFA` trait and the fact that DenseDFA and SparseDFA
+ have inefficient implementations of some methods. Maybe use multiple
+ traits? Answer: get rid of premultiply/classes knobs and just enable
+ them by default. Should remove a huge amount of code.
+ * Check whether `unsafe` is really needed to eliminate bounds checks. Use
+ micro-benchmarks and bigger CLI workloads using `regex-automata-debug`.
+ * Re-write module docs for `dfa` as they are no longer top-level. Keep most.
+ * Retain any pertinent top-level crate docs, but don't rewrite yet.
+ * Clean up builders if we can. e.g., Determinizer, minimizer, it's all a mess
+ right now.
+ * Clean up and add 'always_match' and 'never_match' constructors for every
+ regex engine.
+ * See about supporting ^, $, \A, \z, \b and \B in DFAs. Do the non-Unicode
+ version of \b unfortunately. Carefully scrutinize how the regex crate's
+ lazy DFA does it and try to make it comprehensible. Done! Except for the
+ part about making it comprehensible.
+* Rethink prefilters?
+* Add `regex-automata-generate` CLI tool. This should just be a copy of
+ the `ucd-generate dfa` and `ucd-generate regex` commands.
+
+Then build new public `nfa` sub-module.
+ * For Unicode \b, generate \w DFA (forwards and reverse) and embed it into
+ source for fast checking. That way, we don't need to ever do explicit UTF-8
+ decoding anywhere. Yay.
+
+Then `lazy` sub-module.
+
+Then `onepass`.
+
+Then `jit`.
+
+... and beyond? CRAZY. But it can be done! Build STRONG base layers.