| author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-17 12:02:58 +0000 |
|---|---|---|
| committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-17 12:02:58 +0000 |
| commit | 698f8c2f01ea549d77d7dc3338a12e04c11057b9 (patch) | |
| tree | 173a775858bd501c378080a10dca74132f05bc50 /vendor/regex-automata/README.md | |
| parent | Initial commit. (diff) | |
| download | rustc-698f8c2f01ea549d77d7dc3338a12e04c11057b9.tar.xz rustc-698f8c2f01ea549d77d7dc3338a12e04c11057b9.zip | |
Adding upstream version 1.64.0+dfsg1.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'vendor/regex-automata/README.md')
-rw-r--r-- | vendor/regex-automata/README.md | 223 |
1 file changed, 223 insertions(+), 0 deletions(-)

diff --git a/vendor/regex-automata/README.md b/vendor/regex-automata/README.md
new file mode 100644
index 000000000..8eaf03f04
--- /dev/null
+++ b/vendor/regex-automata/README.md
@@ -0,0 +1,223 @@

regex-automata
==============
A low level regular expression library that uses deterministic finite automata.
It supports a rich syntax with Unicode support, has extensive options for
configuring the best space vs time trade off for your use case and provides
support for cheap deserialization of automata for use in `no_std` environments.

[![Build status](https://github.com/BurntSushi/regex-automata/workflows/ci/badge.svg)](https://github.com/BurntSushi/regex-automata/actions)
[![on crates.io](https://meritbadge.herokuapp.com/regex-automata)](https://crates.io/crates/regex-automata)
![Minimum Supported Rust Version 1.41](https://img.shields.io/badge/rustc-1.41-green)

Dual-licensed under MIT or the [UNLICENSE](https://unlicense.org/).


### Documentation

https://docs.rs/regex-automata


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
regex-automata = "0.1"
```

and this to your crate root (if you're using Rust 2015):

```rust
extern crate regex_automata;
```


### Example: basic regex searching

This example shows how to compile a regex using the default configuration
and then use it to find matches in a byte string:

```rust
use regex_automata::Regex;

let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let text = b"2018-12-24 2016-10-08";
let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
assert_eq!(matches, vec![(0, 10), (11, 21)]);
```

For more examples and information about the various knobs that can be turned,
please see the [docs](https://docs.rs/regex-automata).


### Support for `no_std`

This crate comes with a `std` feature that is enabled by default. When the
`std` feature is enabled, the API of this crate will include the facilities
necessary for compiling, serializing, deserializing and searching with regular
expressions. When the `std` feature is disabled, the API of this crate will
shrink such that it only includes the facilities necessary for deserializing
and searching with regular expressions.

The intended workflow for `no_std` environments is thus as follows:

* Write a program with the `std` feature that compiles and serializes a
  regular expression. Serialization should only happen after first converting
  the DFAs to use a fixed size state identifier instead of the default `usize`.
  You may also need to serialize both little and big endian versions of each
  DFA. (So that's 4 DFAs in total for each regex.)
* In your `no_std` environment, follow the examples above for deserializing
  your previously serialized DFAs into regexes. You can then search with them
  as you would any regex (see the sketch at the end of this section).

Deserialization can happen anywhere. For example, with bytes embedded into a
binary or with a file memory mapped at runtime.

Note that the
[`ucd-generate`](https://github.com/BurntSushi/ucd-generate)
tool will do the first step for you with its `dfa` or `regex` sub-commands.
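
Roughly, that workflow looks like the following sketch against the 0.1 API.
The method names used here (`to_u16`, `to_bytes_native_endian`,
`DenseDFA::from_bytes` and `Regex::from_dfas`) should be confirmed against the
docs, and a real `no_std` deployment would serialize both little and big
endian variants ahead of time rather than native endianness only:

```rust
use regex_automata::{DenseDFA, Regex};

// Step 1 (with `std`): compile, shrink the state identifiers to a fixed
// size representation and serialize the forward and reverse DFAs.
let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let fwd_bytes = re.forward().to_u16().unwrap().to_bytes_native_endian().unwrap();
let rev_bytes = re.reverse().to_u16().unwrap().to_bytes_native_endian().unwrap();

// Step 2 (works without `std`): deserialize the raw bytes (e.g. embedded
// with `include_bytes!`) and rebuild a `Regex` that searches as usual.
let fwd: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&fwd_bytes) };
let rev: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&rev_bytes) };
let re2 = Regex::from_dfas(fwd, rev);
assert_eq!(vec![(0, 10)], re2.find_iter(b"2018-12-24").collect::<Vec<_>>());
```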


### Cargo features

* `std` - **Enabled** by default. This enables the ability to compile finite
  automata. This requires the `regex-syntax` dependency. Without this feature
  enabled, finite automata can only be used for searching (using the approach
  described above).
* `transducer` - **Disabled** by default. This provides implementations of the
  `Automaton` trait found in the `fst` crate. This permits using finite
  automata generated by this crate to search finite state transducers. This
  requires the `fst` dependency.
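
For a `no_std` consumer that only needs the deserialize-and-search subset, the
dependency would typically be declared with the default features turned off.
This is an illustrative `Cargo.toml` entry, not something mandated by this
crate:

```toml
[dependencies]
# Opt out of the default `std` feature: compiling and serializing automata
# become unavailable, but deserializing and searching still work.
regex-automata = { version = "0.1", default-features = false }
```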


### Differences with the regex crate

The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a
general purpose regular expression engine. It aims to automatically balance low
compile times, fast search times and low memory usage, while also providing
a convenient API for users. In contrast, this crate provides a lower level
regular expression interface that is a bit less convenient while providing more
explicit control over memory usage and search times.

Here are some specific negative differences:

* **Compilation can take an exponential amount of time and space** in the size
  of the regex pattern. While most patterns do not exhibit worst case
  exponential time, such patterns do exist. For example, `[01]*1[01]{N}` will
  build a DFA with `2^(N+1)` states. For this reason, untrusted patterns should
  not be compiled with this library. (In the future, the API may expose an
  option to return an error if the DFA gets too big.)
* This crate does not support sub-match extraction, which can be achieved with
  the regex crate's "captures" API. This may be added in the future, but is
  unlikely.
* While the regex crate doesn't necessarily sport fast compilation times, the
  regexes in this crate are almost universally slow to compile, especially when
  they contain large Unicode character classes. For example, on my system,
  compiling `\w{3}` with byte classes enabled takes just over 1 second and
  almost 5MB of memory! (Compiling a sparse regex takes about the same time
  but only uses about 500KB of memory.) Conversely, compiling the same regex
  without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and
  less than 5KB of memory. For this reason, you should only use Unicode
  character classes if you absolutely need them!
* This crate does not support regex sets.
* This crate does not support zero-width assertions such as `^`, `$`, `\b` or
  `\B`.
* As a lower level crate, this library does not do literal optimizations. In
  exchange, you get predictable performance regardless of input. The
  philosophy here is that literal optimizations should be applied at a higher
  level, although there is no easy support for this in the ecosystem yet.
* There is no `&str` API like in the regex crate. In this crate, all APIs
  operate on `&[u8]`. By default, match indices are guaranteed to fall on
  UTF-8 boundaries, unless `RegexBuilder::allow_invalid_utf8` is enabled.

With some of the downsides out of the way, here are some positive differences:

* Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply
  deserialized. Deserialization always takes constant time since searching can
  be performed directly on the raw serialized bytes of a DFA.
* This crate was specifically designed so that the searching phase of a DFA has
  minimal runtime requirements, and can therefore be used in `no_std`
  environments. While `no_std` environments cannot compile regexes, they can
  deserialize pre-compiled regexes.
* Since this crate builds DFAs ahead of time, it will generally out-perform
  the `regex` crate on equivalent tasks. The performance difference is likely
  not large. However, because of a complex set of optimizations in the regex
  crate (like literal optimizations), an accurate performance comparison may be
  difficult to do.
* Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search
  performance a small amount, but uses much less storage space. Potentially
  even less than what the regex crate uses.
* This crate exposes DFAs directly, such as `DenseDFA` and `SparseDFA`,
  which enables one to do less work in some cases. For example, if you only
  need the end of a match and not the start of a match, then you can use a DFA
  directly without building a `Regex`, which always requires a second DFA to
  find the start of a match. (See the sketch after this list.)
* Aside from choosing between dense and sparse DFAs, there are several options
  for configuring the space usage vs search time trade off. These include
  things like choosing a smaller state identifier representation,
  premultiplying state identifiers and splitting a DFA's alphabet into
  equivalence classes. Finally, DFA minimization is also provided, but can
  increase compilation times dramatically.
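
To make the "end of a match only" point concrete, here is a small sketch
against the 0.1 API, in which the `DFA` trait's `find` method reports the end
offset of the leftmost match; treat the exact trait and method names as
something to verify against the docs:

```rust
use regex_automata::{DFA, DenseDFA};

// A single forward DFA suffices when only the end offset of a match is
// needed; a full `Regex` would also compile a reverse DFA for the start.
let dfa = DenseDFA::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
assert_eq!(Some(10), dfa.find(b"2018-12-24 and more"));
```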


### Future work

* Look into being smarter about generating NFA states for large Unicode
  character classes. These can create a lot of additional work for both the
  determinizer and the minimizer, and I suspect this is the key thing we'll
  want to improve if we want to make DFA compile times faster. I *believe*
  it's possible to potentially build minimal or nearly minimal NFAs for the
  special case of Unicode character classes by leveraging Daciuk's algorithms
  for building minimal automata in linear time for sets of strings. See
  https://blog.burntsushi.net/transducers/#construction for more details. The
  key adaptation I think we need to make is to modify the algorithm to operate
  on byte ranges instead of enumerating every codepoint in the set. Otherwise,
  it might not be worth doing.
* Add support for regex sets. It should be possible to do this by "simply"
  introducing more match states. I think we can also report the positions at
  each match, similar to how Aho-Corasick works. I think the long pole in the
  tent here is probably the API design work and arranging it so that we don't
  introduce extra overhead into the non-regex-set case without duplicating a
  lot of code. It seems doable.
* Stretch goal: support capturing groups by implementing "tagged" DFAs
  (transducers). Laurikari's paper is the usual reference here, but Trofimovich
  has a much more thorough treatment here:
  https://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf
  I've only read the paper once. I suspect it will require at least a few more
  read-throughs before I understand it.
  See also: https://re2c.org
* Possibly less ambitious goal: can we select a portion of Trofimovich's work
  to make small fixed length look-around work? It would be really nice to
  support `^`, `$` and `\b`, especially the Unicode variant of `\b` and a CRLF
  aware `$`.
* Experiment with code generating Rust code. There is an early experiment in
  src/codegen.rs that is thoroughly bit-rotted. At the time, I was
  experimenting with whether or not codegen would significantly decrease the
  size of a DFA, since if you squint hard enough, it's kind of like a sparse
  representation. However, it didn't shrink as much as I thought it would, so
  I gave up. The other problem is that Rust doesn't support gotos, so I don't
  even know whether the "match on each state" in a loop thing will be fast
  enough. Either way, it's probably a good option to have. For one thing, it
  would be endian independent, whereas the serialization format of the DFAs in
  this crate is endian dependent (so you need two versions of every DFA, but
  you only need to compile one of them for any given arch).
* Experiment with unrolling the match loops and fill out the benchmarks.
* Add some kind of streaming API. I believe users of the library can already
  implement something for this outside of the crate, but it would be good to
  provide an official API. The key thing here is figuring out the API. I
  suspect we might want to support several variants.
* Make a decision on whether or not there is room for literal optimizations
  in this crate. My original intent was to not let this crate sink down into
  that very very very deep rabbit hole. But instead, we might want to provide
  some way for literal optimizations to hook into the match routines. The
  right path forward here is to probably build something outside of the crate
  and then see about integrating it. After all, users can implement their own
  match routines just as efficiently as what the crate provides.
* A key downside of DFAs is that they can take up a lot of memory and can be
  quite costly to build. Their worst case compilation time is O(2^n), where
  n is the number of NFA states. A paper by Yang and Prasanna (2011) actually
  seems to provide a way to characterize state blow up such that it is
  detectable. If we could know whether a regex will exhibit state explosion or
  not, then we could make an intelligent decision about whether to
  ahead-of-time compile a DFA.
  See: https://www.researchgate.net/profile/Xu-Shutu/publication/229032602_Characterization_of_a_global_germplasm_collection_and_its_potential_utilization_for_analysis_of_complex_quantitative_traits_in_maize/links/02bfe50f914d04c837000000/Characterization-of-a-global-germplasm-collection-and-its-potential-utilization-for-analysis-of-complex-quantitative-traits-in-maize.pdf