diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-04 12:47:55 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-04 12:47:55 +0000 |
commit | 2aadc03ef15cb5ca5cc2af8a7c08e070742f0ac4 (patch) | |
tree | 033cc839730fda84ff08db877037977be94e5e3a /vendor/regex-automata/README.md | |
parent | Initial commit. (diff) | |
download | cargo-2aadc03ef15cb5ca5cc2af8a7c08e070742f0ac4.tar.xz cargo-2aadc03ef15cb5ca5cc2af8a7c08e070742f0ac4.zip |
Adding upstream version 0.70.1+ds1.upstream/0.70.1+ds1upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'vendor/regex-automata/README.md')
-rw-r--r-- | vendor/regex-automata/README.md | 117 |
1 files changed, 117 insertions, 0 deletions
diff --git a/vendor/regex-automata/README.md b/vendor/regex-automata/README.md new file mode 100644 index 0000000..c12b070 --- /dev/null +++ b/vendor/regex-automata/README.md @@ -0,0 +1,117 @@ +regex-automata +============== +This crate exposes a variety of regex engines used by the `regex` crate. +It provides a vast, sprawling and "expert" level API to each regex engine. +The regex engines provided by this crate focus heavily on finite automata +implementations and specifically guarantee worst case `O(m * n)` time +complexity for all searches. (Where `m ~ len(regex)` and `n ~ len(haystack)`.) + +[![Build status](https://github.com/rust-lang/regex/workflows/ci/badge.svg)](https://github.com/rust-lang/regex/actions) +[![Crates.io](https://img.shields.io/crates/v/regex-automata.svg)](https://crates.io/crates/regex-automata) + + +### Documentation + +https://docs.rs/regex-automata + + +### Example + +This example shows how to search for matches of multiple regexes, where each +regex uses the same capture group names to parse different key-value formats. + +```rust +use regex_automata::{meta::Regex, PatternID}; + +let re = Regex::new_many(&[ + r#"(?m)^(?<key>[[:word:]]+)=(?<val>[[:word:]]+)$"#, + r#"(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$"#, + r#"(?m)^(?<key>[[:word:]]+)='(?<val>[^']+)'$"#, + r#"(?m)^(?<key>[[:word:]]+):\s*(?<val>[[:word:]]+)$"#, +]).unwrap(); +let hay = r#" +best_album="Blow Your Face Out" +best_quote='"then as it was, then again it will be"' +best_year=1973 +best_simpsons_episode: HOMR +"#; +let mut kvs = vec![]; +for caps in re.captures_iter(hay) { + // N.B. One could use capture indices '1' and '2' here + // as well. Capture indices are local to each pattern. + // (Just like names are.) + let key = &hay[caps.get_group_by_name("key").unwrap()]; + let val = &hay[caps.get_group_by_name("val").unwrap()]; + kvs.push((key, val)); +} +assert_eq!(kvs, vec![ + ("best_album", "Blow Your Face Out"), + ("best_quote", "\"then as it was, then again it will be\""), + ("best_year", "1973"), + ("best_simpsons_episode", "HOMR"), +]); +``` + + +### Safety + +**I welcome audits of `unsafe` code.** + +This crate tries to be extremely conservative in its use of `unsafe`, but does +use it in a few spots. In general, I am very open to removing uses of `unsafe` +if it doesn't result in measurable performance regressions and doesn't result +in significantly more complex code. + +Below is an outline of how `unsafe` is used in this crate. + +* `util::pool::Pool` makes use of `unsafe` to implement a fast path for +accessing an element of the pool. The fast path applies to the first thread +that uses the pool. In effect, the fast path is fast because it avoid a mutex +lock. `unsafe` is also used in the no-std version of `Pool` to implement a spin +lock for synchronization. +* `util::lazy::Lazy` uses `unsafe` to implement a variant of +`once_cell::sync::Lazy` that works in no-std environments. A no-std no-alloc +implementation is also provided that requires use of `unsafe`. +* The `dfa` module makes extensive use of `unsafe` to support zero-copy +deserialization of DFAs. The high level problem is that you need to get from +`&[u8]` to the internal representation of a DFA without doing any copies. +This is required for support in no-std no-alloc environments. It also makes +deserialization extremely cheap. +* The `dfa` and `hybrid` modules use `unsafe` to explicitly elide bounds checks +in the core search loops. This makes the codegen tighter and typically leads to +consistent 5-10% performance improvements on some workloads. + +In general, the above reflect the only uses of `unsafe` throughout the entire +`regex` crate. At present, there are no plans to meaningfully expand the use +of `unsafe`. With that said, one thing folks have been asking for is cheap +deserialization of a `regex::Regex`. My sense is that this feature will require +a lot more `unsafe` in places to support zero-copy deserialization. It is +unclear at this point whether this will be pursued. + + +### Motivation + +I started out building this crate because I wanted to re-work the `regex` +crate internals to make it more amenable to optimizations. It turns out that +there are a lot of different ways to build regex engines and even more ways to +compose them. Moreover, heuristic literal optimizations are often tricky to +get correct, but the fruit they bear is attractive. All of these things were +difficult to expand upon without risking the introduction of more bugs. So I +decided to tear things down and start fresh. + +In the course of doing so, I ended up designing strong boundaries between each +component so that each component could be reasoned and tested independently. +This also made it somewhat natural to expose the components as a library unto +itself. Namely, folks have been asking for more capabilities in the regex +crate for a long time, but these capabilities usually come with additional API +complexity that I didn't want to introduce in the `regex` crate proper. But +exposing them in an "expert" level crate like `regex-automata` seemed quite +fine. + +In the end, I do still somewhat consider this crate an experiment. It is +unclear whether the strong boundaries between components will be an impediment +to ongoing development or not. De-coupling tends to lead to slower development +in my experience, and when you mix in the added cost of not introducing +breaking changes all of the time, things can get quite complicated. But, I +don't think anyone has ever release the internals of a regex engine as a +library before. So it will be interesting to see how it plays out! |