Merging upstream version 1.74.1+dfsg1.

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-30 18:31:44 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-30 18:31:44 +0000
commit: c23a457e72abe608715ac76f076f47dc42af07a5 (patch)
tree: 2772049aaf84b5c9d0ed12ec8d86812f7a7904b6 /vendor/regex-automata-0.2.0/PLANS.md
parent: Releasing progress-linux version 1.73.0+dfsg1-1~progress7.99u1. (diff)
download: rustc-c23a457e72abe608715ac76f076f47dc42af07a5.tar.xz
rustc-c23a457e72abe608715ac76f076f47dc42af07a5.zip
1 files changed, 165 insertions, 0 deletions
diff --git a/vendor/regex-automata-0.2.0/PLANS.md b/vendor/regex-automata-0.2.0/PLANS.md
new file mode 100644
index 000000000..2fa9392ef
--- /dev/null
+++ b/vendor/regex-automata-0.2.0/PLANS.md
@@ -0,0 +1,165 @@
+pattern_limit should not be defined inside nfa::thompson, but rather at the
+top-level.
+
+-----
+
+Main problem right now is exemplified by the set60 and set70 failing tests. In
+particular, when finding the starting position while matching multiple regexes
+simultaneously, the reverse search is messed up. The reverse search doesn't
+depend on which regex matched in the forward direction, which means it won't
+always find the correcting starting location. Unfortunately, the only way to
+fix this, as far as I can tell, is to add a group of start states for every
+regex in the DFA. Then once we do the reverse search, we need to choose the
+correct start state based on which regex matched in the forward direction.
+
+This is a nasty change.
+
+So it looks like this only applies when doing an overlapping search in reverse
+to find the start of a match. That means we should make this configurable
+but enable it by default for the reverse automata. It should be configurable
+so that folks can construct a regex that doesn't have the ability to do
+overlapping searches correctly. If an overlapping search is attempted with
+a reverse automaton that lacks starting states for each pattern, then the
+implementation should panic.
+
+BUT! It is also convenient to provide this option in general for folks that
+want a DFA that can match any pattern while also being able to match a specific
+pattern.
+
+Straw man:
+
+* Update dense::Config to have a `starts_for_each_pattern` option. It should
+  be disabled by default.
+* In `RegexBuilder::build_many_with_size` tweak the reverse DFA configuration
+  to have the aforementioned option enabled.
+* It would be interesting to add new APIs to `Regex` that support matching
+  specific patterns, but I think this is a complication. If we did want to do
+  this, then we should just add it to the `_at` variants and leave the rest of
+  the API untouched.
+* Add a `pattern_id: Option<PatternID>` parameter to each of the five
+  `*_at` methods on the `dfa::Automaton` trait. A value of `None` retains the
+  existing behavior. A `Some` value means that the starting state for that
+  specific pattern must be chosen, which in turn implies an anchored search.
+  (This means `starts_for_each_pattern` has utility for single-pattern DFAs
+  since it makes it possible to build a DFA that can do both unanchored and
+  anchored searches.)
+* Thread this new parameter down into the various functions in `dfa::search`
+  all the way down into `init_fwd` and `init_rev`. These functions will then
+  pass it to `dfa.start_state_{forward,reverse}`.
+* This is where things get gruesome since we now need to completely re-work how
+  start states are represented in dense and sparse DFAs _and_ it needs to be
+  configurable. It looks like the `Start` type from `dfa::automaton` can
+  basically remain unchanged, since it still represents one of the four
+  possible starting states that will need to be applied for every pattern.
+* For `dfa::dense`, change `StartList` to `StartTable`. Currently, its only
+  header is the state ID count, which is always 4. We'll want to change this
+  to the stride and add a new header value that encodes the number of patterns.
+  When the number of patterns is zero, then existing behavior is preserved and
+  represents the case where `starts_for_each_pattern` is disabled (or in the
+  case of an empty DFA). When non-zero, a table of starting state IDs is
+  encoded with each row corresponding to the 4 starting states for each
+  pattern. Before this table (even if it's empty), the 4 starting states for
+  the entire DFA are encoded.
+* For `dfa::sparse`, do the same as above. They are essentially the same right
+  now anyway, with the only difference being that sparse DFAs use `&[u8]`
+  instead of `&[S]` (because sparse DFAs don't have any alignment
+  requirements).
+* Modify `DFA::empty` to accept a `starts_for_each_pattern` bool that, when
+  true, creates a start table with the header, the start states for the entire
+  DFA and a row of start states for each pattern. When false, no rows are
+  added.
+* Expose whether there are starting states for each pattern via a predicate
+  on the DFA.
+* Modify the determinizer's `add_starts` method to basically do what it does,
+  but also do it for each pattern when the DFA is configured for it. It should
+  continue to reuse states as appropriate or not generate new states if they
+  aren't needed. This will want to use the `NFA::start_pattern` method, which
+  provides the starting NFA state ID for the given pattern.
+* Fix the dense->sparse conversion. At this point, this piece should be fairly
+  straight-forward since the sparse representation of starting states is
+  basically identical to the dense representation.
+
+At this point, I think the bug should resolve itself.
+
+^^^^ DONE! IT WORKS!
+
+-----
+
+
+Add top-level SyntaxConfig (or some such) that has all of the regex-syntax
+options forwarded, but with automata oriented docs. Then use this for all of
+the engines instead of having to repeat every option for every builder.
+
+-----
+
+These produce different results. PCRE2 looks correct. Basically, we should be
+using the context around the `at` position correctly, which we aren't doing
+right now. Seems tricky to get right, particularly when confirming the match
+with a reverse DFA.
+
+Maybe our 'at' functions need to take a full range... Sigh. This is indeed what
+RE2 does. GAH.
+
+fn main() {
+    let re = regex::Regex::new(r"(?-u)\b\sbar").unwrap();
+    let s = "foo bar baz";
+    println!("{:?}", re.find_at(s, 3).map(|m| m.as_str()));
+
+    let re = pcre2::bytes::Regex::new(r"\b\sbar").unwrap();
+    let s = "foo bar baz";
+    println!("{:?}", re.find_at(s.as_bytes(), 3).unwrap());
+}
+
+^^^^ This is fixed now, but we still need to find a way to add test coverage
+for "context" searches. It'd be nice to do this automatically, but we'll
+probably just added a new 'context = [start, end]' option.
+
+-----
+
+
+* Create regex-test crate, based on glob-test. Try to anticipate the needs for
+  the full regex test suite.
+  * See if we can clean up tests.
+    * Provide a way to mark a test as expensive.
+    * Provide a way to test is_match_at and find_at.
+    * Test shortest_match_at too? Huge pain. Add tests for it.
+    * Port ALL tests from the regex crate. Will probably need a way to mark a
+      test as skipped.
+    * Document tests better.
+* Find a way to remove byteorder dependency.
+* Reorganize crate API:
+  * Have errors contain `Box<Error+Send+Sync>` instead of `String`.
+  * Make errors non-exhaustive.
+  * Audit `StateID` trait for safety.
+  * Brainstorm hard about `DFA` trait and the fact that DenseDFA and SparseDFA
+    have inefficient implementations of some methods. Maybe use multiple
+    traits? Answer: get rid of premultiply/classes knobs and just enable
+    them by default. Should remove a huge amount of code.
+  * Check whether `unsafe` is really needed to eliminate bounds checks. Use
+    micro-benchmarks and bigger CLI workloads using `regex-automata-debug`.
+  * Re-write module docs for `dfa` as they are no longer top-level. Keep most.
+  * Retain any pertinent top-level crate docs, but don't rewrite yet.
+  * Clean up builders if we can. e.g., Determinizer, minimizer, it's all a mess
+    right now.
+  * Clean up and add 'always_match' and 'never_match' constructors for every
+    regex engine.
+  * See about supporting ^, $, \A, \z, \b and \B in DFAs. Do the non-Unicode
+    version of \b unfortunately. Carefully scrutinize how the regex crate's
+    lazy DFA does it and try to make it comprehensible. Done! Except for the
+    part about making it comprehensible.
+* Rethink prefilters?
+* Add `regex-automata-generate` CLI tool. This should just be a copy of
+  the `ucd-generate dfa` and `ucd-generate regex` commands.
+
+Then build new public `nfa` sub-module.
+  * For Unicode \b, generate \w DFA (forwards and reverse) and embed it into
+    source for fast checking. That way, we don't need to ever do explicit UTF-8
+    decoding anywhere. Yay.
+
+Then `lazy` sub-module.
+
+Then `onepass`.
+
+Then `jit`.
+
+... and beyond? CRAZY. But it can be done! Build STRONG base layers.
author	Daniel Baumann <daniel.baumann@progress-linux.org>	2024-05-30 18:31:44 +0000
committer	Daniel Baumann <daniel.baumann@progress-linux.org>	2024-05-30 18:31:44 +0000
commit	c23a457e72abe608715ac76f076f47dc42af07a5 (patch)
tree	2772049aaf84b5c9d0ed12ec8d86812f7a7904b6 /vendor/regex-automata-0.2.0/PLANS.md
parent	Releasing progress-linux version 1.73.0+dfsg1-1~progress7.99u1. (diff)
download	rustc-c23a457e72abe608715ac76f076f47dc42af07a5.tar.xz rustc-c23a457e72abe608715ac76f076f47dc42af07a5.zip