regex-automata
==============
This crate exposes a variety of regex engines used by the `regex` crate.
It provides a vast, sprawling, and "expert" level API to each regex engine.
The regex engines provided by this crate focus heavily on finite automata
implementations and specifically guarantee worst case `O(m * n)` time
complexity for all searches. (Where `m ~ len(regex)` and `n ~ len(haystack)`.)

[Build status](https://github.com/rust-lang/regex/actions)
[Crates.io](https://crates.io/crates/regex-automata)

### Documentation

https://docs.rs/regex-automata

### Example

This example shows how to search for matches of multiple regexes, where each
regex uses the same capture group names to parse different key-value formats.
```rust
use regex_automata::{meta::Regex, PatternID};

let re = Regex::new_many(&[
    r#"(?m)^(?<key>[[:word:]]+)=(?<val>[[:word:]]+)$"#,
    r#"(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$"#,
    r#"(?m)^(?<key>[[:word:]]+)='(?<val>[^']+)'$"#,
    r#"(?m)^(?<key>[[:word:]]+):\s*(?<val>[[:word:]]+)$"#,
]).unwrap();
let hay = r#"
best_album="Blow Your Face Out"
best_quote='"then as it was, then again it will be"'
best_year=1973
best_simpsons_episode: HOMR
"#;
let mut kvs = vec![];
for caps in re.captures_iter(hay) {
    // N.B. One could use capture indices '1' and '2' here
    // as well. Capture indices are local to each pattern.
    // (Just like names are.)
    let key = &hay[caps.get_group_by_name("key").unwrap()];
    let val = &hay[caps.get_group_by_name("val").unwrap()];
    kvs.push((key, val));
}
assert_eq!(kvs, vec![
    ("best_album", "Blow Your Face Out"),
    ("best_quote", "\"then as it was, then again it will be\""),
    ("best_year", "1973"),
    ("best_simpsons_episode", "HOMR"),
]);
```

### Safety

**I welcome audits of `unsafe` code.**

This crate tries to be extremely conservative in its use of `unsafe`, but it
does use it in a few spots. In general, I am very open to removing uses of
`unsafe` if doing so doesn't result in measurable performance regressions and
doesn't result in significantly more complex code.

Below is an outline of how `unsafe` is used in this crate.

* `util::pool::Pool` makes use of `unsafe` to implement a fast path for
accessing an element of the pool. The fast path applies to the first thread
that uses the pool. In effect, the fast path is fast because it avoids a
mutex lock. `unsafe` is also used in the no-std version of `Pool` to
implement a spin lock for synchronization.
* `util::lazy::Lazy` uses `unsafe` to implement a variant of
`once_cell::sync::Lazy` that works in no-std environments. A no-std, no-alloc
implementation is also provided, and it too requires `unsafe`.
* The `dfa` module makes extensive use of `unsafe` to support zero-copy
deserialization of DFAs. The high level problem is that you need to get from
`&[u8]` to the internal representation of a DFA without doing any copies.
This is required to support no-std, no-alloc environments. It also makes
deserialization extremely cheap.
* The `dfa` and `hybrid` modules use `unsafe` to explicitly elide bounds
checks in the core search loops. This makes the codegen tighter and typically
leads to consistent 5-10% performance improvements on some workloads.
|

In general, the above reflects the only uses of `unsafe` throughout the
entire `regex` crate. At present, there are no plans to meaningfully expand
the use of `unsafe`. With that said, one thing folks have been asking for is
cheap deserialization of a `regex::Regex`. My sense is that this feature will
require a lot more `unsafe` in places to support zero-copy deserialization.
It is unclear at this point whether it will be pursued.

### Motivation

I started out building this crate because I wanted to re-work the `regex`
crate internals to make them more amenable to optimizations. It turns out
that there are a lot of different ways to build regex engines and even more
ways to compose them. Moreover, heuristic literal optimizations are often
tricky to get correct, but the fruit they bear is attractive. All of these
things were difficult to expand upon without risking the introduction of more
bugs. So I decided to tear things down and start fresh.

In the course of doing so, I ended up designing strong boundaries between
components so that each component could be reasoned about and tested
independently. This also made it somewhat natural to expose the components as
a library unto itself. Namely, folks have been asking for more capabilities
in the `regex` crate for a long time, but these capabilities usually come
with additional API complexity that I didn't want to introduce into the
`regex` crate proper. Exposing them in an "expert" level crate like
`regex-automata` seemed quite fine.

In the end, I do still somewhat consider this crate an experiment. It is
unclear whether the strong boundaries between components will be an
impediment to ongoing development or not. De-coupling tends to lead to slower
development in my experience, and when you mix in the added cost of not
introducing breaking changes all of the time, things can get quite
complicated. But I don't think anyone has ever released the internals of a
regex engine as a library before, so it will be interesting to see how it
plays out!