author    Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-30 18:31:44 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-30 18:31:44 +0000
commit    c23a457e72abe608715ac76f076f47dc42af07a5 (patch)
tree      2772049aaf84b5c9d0ed12ec8d86812f7a7904b6 /vendor/regex-automata/README.md
parent    Releasing progress-linux version 1.73.0+dfsg1-1~progress7.99u1. (diff)
Merging upstream version 1.74.1+dfsg1.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'vendor/regex-automata/README.md')
-rw-r--r-- vendor/regex-automata/README.md | 307
1 file changed, 101 insertions(+), 206 deletions(-)
diff --git a/vendor/regex-automata/README.md b/vendor/regex-automata/README.md
index 23e0bffe0..c12b07012 100644
--- a/vendor/regex-automata/README.md
+++ b/vendor/regex-automata/README.md
@@ -1,15 +1,13 @@
regex-automata
==============
-A low level regular expression library that uses deterministic finite automata.
-It supports a rich syntax with Unicode support, has extensive options for
-configuring the best space vs time trade off for your use case and provides
-support for cheap deserialization of automata for use in `no_std` environments.
+This crate exposes a variety of regex engines used by the `regex` crate.
+It provides a vast, sprawling and "expert" level API to each regex engine.
+The regex engines provided by this crate focus heavily on finite automata
+implementations and specifically guarantee worst case `O(m * n)` time
+complexity for all searches. (Where `m ~ len(regex)` and `n ~ len(haystack)`.)
-[![Build status](https://github.com/BurntSushi/regex-automata/workflows/ci/badge.svg)](https://github.com/BurntSushi/regex-automata/actions)
+[![Build status](https://github.com/rust-lang/regex/workflows/ci/badge.svg)](https://github.com/rust-lang/regex/actions)
[![Crates.io](https://img.shields.io/crates/v/regex-automata.svg)](https://crates.io/crates/regex-automata)
-![Minimum Supported Rust Version 1.41](https://img.shields.io/badge/rustc-1.41-green)
-
-Dual-licensed under MIT or the [UNLICENSE](https://unlicense.org/).
### Documentation
@@ -17,206 +15,103 @@ Dual-licensed under MIT or the [UNLICENSE](https://unlicense.org/).
https://docs.rs/regex-automata
-### Usage
-
-Add this to your `Cargo.toml`:
-
-```toml
-[dependencies]
-regex-automata = "0.1"
-```
-
-**WARNING**: The `master` branch currently contains code for the `0.2` release,
-but this README still targets the `0.1` release. Namely, it is recommended to
-stick with the `0.1` release. The `0.2` release was made prematurely in order
-to unblock some folks.
+### Example
-
-### Example: basic regex searching
-
-This example shows how to compile a regex using the default configuration
-and then use it to find matches in a byte string:
+This example shows how to search for matches of multiple regexes, where each
+regex uses the same capture group names to parse different key-value formats.
```rust
-use regex_automata::Regex;
-
-let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
-let text = b"2018-12-24 2016-10-08";
-let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
-assert_eq!(matches, vec![(0, 10), (11, 21)]);
+use regex_automata::{meta::Regex, PatternID};
+
+let re = Regex::new_many(&[
+ r#"(?m)^(?<key>[[:word:]]+)=(?<val>[[:word:]]+)$"#,
+ r#"(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$"#,
+ r#"(?m)^(?<key>[[:word:]]+)='(?<val>[^']+)'$"#,
+ r#"(?m)^(?<key>[[:word:]]+):\s*(?<val>[[:word:]]+)$"#,
+]).unwrap();
+let hay = r#"
+best_album="Blow Your Face Out"
+best_quote='"then as it was, then again it will be"'
+best_year=1973
+best_simpsons_episode: HOMR
+"#;
+let mut kvs = vec![];
+for caps in re.captures_iter(hay) {
+ // N.B. One could use capture indices '1' and '2' here
+ // as well. Capture indices are local to each pattern.
+ // (Just like names are.)
+ let key = &hay[caps.get_group_by_name("key").unwrap()];
+ let val = &hay[caps.get_group_by_name("val").unwrap()];
+ kvs.push((key, val));
+}
+assert_eq!(kvs, vec![
+ ("best_album", "Blow Your Face Out"),
+ ("best_quote", "\"then as it was, then again it will be\""),
+ ("best_year", "1973"),
+ ("best_simpsons_episode", "HOMR"),
+]);
```
-For more examples and information about the various knobs that can be turned,
-please see the [docs](https://docs.rs/regex-automata/0.1).
-
-
-### Support for `no_std`
-
-This crate comes with a `std` feature that is enabled by default. When the
-`std` feature is enabled, the API of this crate will include the facilities
-necessary for compiling, serializing, deserializing and searching with regular
-expressions. When the `std` feature is disabled, the API of this crate will
-shrink such that it only includes the facilities necessary for deserializing
-and searching with regular expressions.
-
-The intended workflow for `no_std` environments is thus as follows:
-
-* Write a program with the `std` feature that compiles and serializes a
- regular expression. Serialization should only happen after first converting
- the DFAs to use a fixed size state identifier instead of the default `usize`.
- You may also need to serialize both little and big endian versions of each
- DFA. (So that's 4 DFAs in total for each regex.)
-* In your `no_std` environment, follow the examples above for deserializing
- your previously serialized DFAs into regexes. You can then search with them
- as you would any regex.
-
-Deserialization can happen anywhere. For example, with bytes embedded into a
-binary or with a file memory mapped at runtime.
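-
-Here is a minimal sketch of that workflow, assuming the `0.1` API (`to_u16`,
-`to_bytes_little_endian`, `DenseDFA::from_bytes` and `Regex::from_dfas`). In a
-real deployment, the two halves below would live in separate programs:
-
-```rust
-use regex_automata::{DenseDFA, Regex};
-
-// Step 1 (with `std`): compile, shrink state identifiers to `u16` and
-// serialize the forward and reverse DFAs (little endian only, for brevity).
-let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
-let fwd_bytes = re.forward().to_u16().unwrap().to_bytes_little_endian().unwrap();
-let rev_bytes = re.reverse().to_u16().unwrap().to_bytes_little_endian().unwrap();
-
-// Step 2 (in `no_std`): deserialize and search. `from_bytes` is unsafe
-// because it trusts that the given bytes encode a valid DFA.
-let fwd: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&fwd_bytes) };
-let rev: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&rev_bytes) };
-let re = Regex::from_dfas(fwd, rev);
-assert_eq!(Some((0, 10)), re.find(b"2018-12-24"));
-```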
-
-Note that the
-[`ucd-generate`](https://github.com/BurntSushi/ucd-generate)
-tool will do the first step for you with its `dfa` or `regex` sub-commands.
-
-
-### Cargo features
-
-* `std` - **Enabled** by default. This enables the ability to compile finite
- automata. This requires the `regex-syntax` dependency. Without this feature
- enabled, finite automata can only be used for searching (using the approach
- described above; see the example configuration after this list).
-* `transducer` - **Disabled** by default. This provides implementations of the
- `Automaton` trait found in the `fst` crate. This permits using finite
- automata generated by this crate to search finite state transducers. This
- requires the `fst` dependency.
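-
-For example, a build that only needs to deserialize and search (the `no_std`
-workflow above) might disable the default features. A sketch for the `0.1`
-release:
-
-```toml
-[dependencies]
-regex-automata = { version = "0.1", default-features = false }
-```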
-
-
-### Differences with the regex crate
-
-The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a
-general purpose regular expression engine. It aims to automatically balance low
-compile times, fast search times and low memory usage, while also providing
-a convenient API for users. In contrast, this crate provides a lower level
-regular expression interface that is a bit less convenient while providing more
-explicit control over memory usage and search times.
-
-Here are some specific negative differences:
-
-* **Compilation can take an exponential amount of time and space** in the size
- of the regex pattern. While most patterns do not exhibit worst case
- exponential time, such patterns do exist. For example, `[01]*1[01]{N}` will
- build a DFA with `2^(N+1)` states; see the sketch after this list. For this
- reason, untrusted patterns should not be compiled with this library. (In the
- future, the API may expose an option to return an error if the DFA gets too
- big.)
-* This crate does not support sub-match extraction, which can be achieved with
- the regex crate's "captures" API. This may be added in the future, but is
- unlikely.
-* While the regex crate doesn't necessarily sport fast compilation times, the
- regexes in this crate are almost universally slow to compile, especially when
- they contain large Unicode character classes. For example, on my system,
- compiling `\w{3}` with byte classes enabled takes just over 1 second and
- almost 5MB of memory! (Compiling a sparse regex takes about the same time
- but only uses about 500KB of memory.) Conversely, compiling the same regex
- without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and
- less than 5KB of memory. For this reason, you should only use Unicode
- character classes if you absolutely need them!
-* This crate does not support regex sets.
-* This crate does not support zero-width assertions such as `^`, `$`, `\b` or
- `\B`.
-* As a lower level crate, this library does not do literal optimizations. In
- exchange, you get predictable performance regardless of input. The
- philosophy here is that literal optimizations should be applied at a higher
- level, although there is no easy support for this in the ecosystem yet.
-* There is no `&str` API like in the regex crate. In this crate, all APIs
- operate on `&[u8]`. By default, match indices are guaranteed to fall on
- UTF-8 boundaries, unless `RegexBuilder::allow_invalid_utf8` is enabled.
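-
-As a sketch of the compilation blow up point above (assuming the `0.1` API,
-where the `DFA` trait provides the search routines):
-
-```rust
-use regex_automata::{DenseDFA, DFA};
-
-// For N = 15, this pattern forces on the order of 2^16 DFA states, and
-// increasing N quickly exhausts time and memory. This is why untrusted
-// patterns should not be compiled with this crate.
-let dfa = DenseDFA::new(r"[01]*1[01]{15}").unwrap();
-assert!(dfa.is_match(b"1000000000000000"));
-```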
-
-With some of the downsides out of the way, here are some positive differences:
-
-* Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply
- deserialized. Deserialization always takes constant time since searching can
- be performed directly on the raw serialized bytes of a DFA.
-* This crate was specifically designed so that the searching phase of a DFA has
- minimal runtime requirements, and can therefore be used in `no_std`
- environments. While `no_std` environments cannot compile regexes, they can
- deserialize pre-compiled regexes.
-* Since this crate builds DFAs ahead of time, it will generally out-perform
- the `regex` crate on equivalent tasks. The performance difference is likely
- not large. However, because of a complex set of optimizations in the regex
- crate (like literal optimizations), an accurate performance comparison may be
- difficult to do.
-* Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search
- performance a small amount, but uses much less storage space. Potentially
- even less than what the regex crate uses.
-* This crate exposes DFAs directly, such as `DenseDFA` and `SparseDFA`,
- which enables one to do less work in some cases. For example, if you only
- need the end of a match and not the start of a match, then you can use a DFA
- directly without building a `Regex`, which always requires a second DFA to
- find the start of a match. (See the sketch after this list.)
-* Aside from choosing between dense and sparse DFAs, there are several options
- for configuring the space usage vs search time trade off. These include
- things like choosing a smaller state identifier representation,
- premultiplying state identifiers and splitting a DFA's alphabet into
- equivalence classes. Finally, DFA minimization is also provided, but can
- increase compilation times dramatically.
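-
-And a sketch of using a DFA directly to get just the end of a match (again
-assuming the `0.1` API):
-
-```rust
-use regex_automata::{DenseDFA, DFA};
-
-// A single forward DFA reports only where a match ends. Finding the start
-// would additionally require the reverse DFA that a full `Regex` builds.
-let dfa = DenseDFA::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
-assert_eq!(Some(10), dfa.find(b"2018-12-24 and more"));
-```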
-
-
-### Future work
-
-* Look into being smarter about generating NFA states for large Unicode
- character classes. These can create a lot of additional work for both the
- determinizer and the minimizer, and I suspect this is the key thing we'll
- want to improve if we want to make DFA compile times faster. I *believe*
- it's possible to build minimal or nearly minimal NFAs for the
- special case of Unicode character classes by leveraging Daciuk's algorithms
- for building minimal automata in linear time for sets of strings. See
- https://blog.burntsushi.net/transducers/#construction for more details. The
- key adaptation I think we need to make is to modify the algorithm to operate
- on byte ranges instead of enumerating every codepoint in the set. Otherwise,
- it might not be worth doing.
-* Add support for regex sets. It should be possible to do this by "simply"
- introducing more match states. I think we can also report the positions at
- each match, similar to how Aho-Corasick works. I think the long pole in the
- tent here is probably the API design work and arranging it so that we don't
- introduce extra overhead into the non-regex-set case without duplicating a
- lot of code. It seems doable.
-* Stretch goal: support capturing groups by implementing "tagged" DFA
- (transducers). Laurikari's paper is the usual reference here, but Trofimovich
- has a much more thorough treatment here:
- https://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf
- I've only read the paper once. I suspect it will require at least a few more
- read throughs before I understand it.
- See also: https://re2c.org
-* Possibly less ambitious goal: can we select a portion of Trofimovich's work
- to make small fixed length look-around work? It would be really nice to
- support ^, $ and \b, especially the Unicode variant of \b and CRLF aware $.
-* Experiment with code generating Rust code. There is an early experiment in
- src/codegen.rs that is thoroughly bit-rotted. At the time, I was
- experimenting with whether or not codegen would significantly decrease the size
- of a DFA, since if you squint hard enough, it's kind of like a sparse
- representation. However, it didn't shrink as much as I thought it would, so
- I gave up. The other problem is that Rust doesn't support gotos, so I don't
- even know whether the "match on each state" in a loop thing will be fast
- enough. Either way, it's probably a good option to have. For one thing, it
- would be endian independent whereas the serialization format of the DFAs in
- this crate is endian dependent (so you need two versions of every DFA, but
- you only need to compile one of them for any given arch).
-* Experiment with unrolling the match loops and fill out the benchmarks.
-* Add some kind of streaming API. I believe users of the library can already
- implement something for this outside of the crate, but it would be good to
- provide an official API. The key thing here is figuring out the API. I
- suspect we might want to support several variants.
-* Make a decision on whether or not there is room for literal optimizations
- in this crate. My original intent was to not let this crate sink down into
- that very very very deep rabbit hole. But instead, we might want to provide
- some way for literal optimizations to hook into the match routines. The right
- path forward here is to probably build something outside of the crate and
- then see about integrating it. After all, users can implement their own
- match routines just as efficiently as what the crate provides.
-* A key downside of DFAs is that they can take up a lot of memory and can be
- quite costly to build. Their worst case compilation time is O(2^n), where
- n is the number of NFA states. A paper by Yang and Prasanna (2011) actually
- seems to provide a way to characterize state blow up such that it is detectable.
- If we could know whether a regex will exhibit state explosion or not, then
- we could make an intelligent decision about whether to ahead-of-time compile
- a DFA.
+
+### Safety
+
+**I welcome audits of `unsafe` code.**
+
+This crate tries to be extremely conservative in its use of `unsafe`, but does
+use it in a few spots. In general, I am very open to removing uses of `unsafe`
+if it doesn't result in measurable performance regressions and doesn't result
+in significantly more complex code.
+
+Below is an outline of how `unsafe` is used in this crate.
+
+* `util::pool::Pool` makes use of `unsafe` to implement a fast path for
+accessing an element of the pool. The fast path applies to the first thread
+that uses the pool. In effect, the fast path is fast because it avoids a mutex
+lock. `unsafe` is also used in the no-std version of `Pool` to implement a spin
+lock for synchronization.
+* `util::lazy::Lazy` uses `unsafe` to implement a variant of
+`once_cell::sync::Lazy` that works in no-std environments. A no-std no-alloc
+implementation is also provided that requires use of `unsafe`.
+* The `dfa` module makes extensive use of `unsafe` to support zero-copy
+deserialization of DFAs. The high level problem is that you need to get from
+`&[u8]` to the internal representation of a DFA without doing any copies.
+This is required for support in no-std no-alloc environments. It also makes
+deserialization extremely cheap. (A sketch of this appears after this list.)
+* The `dfa` and `hybrid` modules use `unsafe` to explicitly elide bounds checks
+in the core search loops. This makes the codegen tighter and typically leads to
+consistent 5-10% performance improvements on some workloads.
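+
+As a sketch of the zero-copy deserialization point (the pattern and offsets
+here are illustrative; see the `dfa` module docs for the exact alignment and
+endianness requirements):
+
+```rust
+use regex_automata::{dfa::{dense, Automaton}, HalfMatch, Input};
+
+// Compile and serialize a dense DFA. `pad` is the number of padding
+// bytes written at the start to fix up alignment.
+let original = dense::DFA::new(r"[0-9]{4}").unwrap();
+let (bytes, pad) = original.to_bytes_little_endian();
+// Deserialize without copying: the new DFA borrows from `bytes` directly.
+let dfa: dense::DFA<&[u32]> = dense::DFA::from_bytes(&bytes[pad..]).unwrap().0;
+// "2024" ends at byte offset 9 of the haystack.
+let expected = Some(HalfMatch::must(0, 9));
+assert_eq!(expected, dfa.try_search_fwd(&Input::new("year 2024")).unwrap());
+```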
+
+In general, the above reflect the only uses of `unsafe` throughout the entire
+`regex` crate. At present, there are no plans to meaningfully expand the use
+of `unsafe`. With that said, one thing folks have been asking for is cheap
+deserialization of a `regex::Regex`. My sense is that this feature will require
+a lot more `unsafe` in places to support zero-copy deserialization. It is
+unclear at this point whether this will be pursued.
+
+
+### Motivation
+
+I started out building this crate because I wanted to re-work the `regex`
+crate internals to make it more amenable to optimizations. It turns out that
+there are a lot of different ways to build regex engines and even more ways to
+compose them. Moreover, heuristic literal optimizations are often tricky to
+get correct, but the fruit they bear is attractive. All of these things were
+difficult to expand upon without risking the introduction of more bugs. So I
+decided to tear things down and start fresh.
+
+In the course of doing so, I ended up designing strong boundaries between each
+component so that each component could be reasoned about and tested independently.
+This also made it somewhat natural to expose the components as a library unto
+itself. Namely, folks have been asking for more capabilities in the regex
+crate for a long time, but these capabilities usually come with additional API
+complexity that I didn't want to introduce in the `regex` crate proper. But
+exposing them in an "expert" level crate like `regex-automata` seemed quite
+fine.
+
+In the end, I do still somewhat consider this crate an experiment. It is
+unclear whether the strong boundaries between components will be an impediment
+to ongoing development or not. De-coupling tends to lead to slower development
+in my experience, and when you mix in the added cost of not introducing
+breaking changes all of the time, things can get quite complicated. But I
+don't think anyone has ever released the internals of a regex engine as a
+library before. So it will be interesting to see how it plays out!