diff options
Diffstat (limited to 'third_party/rust/regex/README.md')
-rw-r--r-- | third_party/rust/regex/README.md | 246 |
1 files changed, 246 insertions, 0 deletions
diff --git a/third_party/rust/regex/README.md b/third_party/rust/regex/README.md new file mode 100644 index 0000000000..a9d6fcd373 --- /dev/null +++ b/third_party/rust/regex/README.md @@ -0,0 +1,246 @@ +regex +===== +This crate provides routines for searching strings for matches of a [regular +expression] (aka "regex"). The regex syntax supported by this crate is similar +to other regex engines, but it lacks several features that are not known how to +implement efficiently. This includes, but is not limited to, look-around and +backreferences. In exchange, all regex searches in this crate have worst case +`O(m * n)` time complexity, where `m` is proportional to the size of the regex +and `n` is proportional to the size of the string being searched. + +[regular expression]: https://en.wikipedia.org/wiki/Regular_expression + +[![Build status](https://github.com/rust-lang/regex/workflows/ci/badge.svg)](https://github.com/rust-lang/regex/actions) +[![Crates.io](https://img.shields.io/crates/v/regex.svg)](https://crates.io/crates/regex) + +### Documentation + +[Module documentation with examples](https://docs.rs/regex). +The module documentation also includes a comprehensive description of the +syntax supported. + +Documentation with examples for the various matching functions and iterators +can be found on the +[`Regex` type](https://docs.rs/regex/*/regex/struct.Regex.html). + +### Usage + +To bring this crate into your repository, either add `regex` to your +`Cargo.toml`, or run `cargo add regex`. + +Here's a simple example that matches a date in YYYY-MM-DD format and prints the +year, month and day: + +```rust +use regex::Regex; + +fn main() { + let re = Regex::new(r"(?x) +(?P<year>\d{4}) # the year +- +(?P<month>\d{2}) # the month +- +(?P<day>\d{2}) # the day +").unwrap(); + + let caps = re.captures("2010-03-14").unwrap(); + assert_eq!("2010", &caps["year"]); + assert_eq!("03", &caps["month"]); + assert_eq!("14", &caps["day"]); +} +``` + +If you have lots of dates in text that you'd like to iterate over, then it's +easy to adapt the above example with an iterator: + +```rust +use regex::Regex; + +fn main() { + let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap(); + let hay = "On 2010-03-14, foo happened. On 2014-10-14, bar happened."; + + let mut dates = vec![]; + for (_, [year, month, day]) in re.captures_iter(hay).map(|c| c.extract()) { + dates.push((year, month, day)); + } + assert_eq!(dates, vec![ + ("2010", "03", "14"), + ("2014", "10", "14"), + ]); +} +``` + +### Usage: Avoid compiling the same regex in a loop + +It is an anti-pattern to compile the same regular expression in a loop since +compilation is typically expensive. (It takes anywhere from a few microseconds +to a few **milliseconds** depending on the size of the regex.) Not only is +compilation itself expensive, but this also prevents optimizations that reuse +allocations internally to the matching engines. + +In Rust, it can sometimes be a pain to pass regular expressions around if +they're used from inside a helper function. Instead, we recommend using the +[`once_cell`](https://crates.io/crates/once_cell) crate to ensure that +regular expressions are compiled exactly once. For example: + +```rust +use { + once_cell::sync::Lazy, + regex::Regex, +}; + +fn some_helper_function(haystack: &str) -> bool { + static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"...").unwrap()); + RE.is_match(haystack) +} + +fn main() { + assert!(some_helper_function("abc")); + assert!(!some_helper_function("ac")); +} +``` + +Specifically, in this example, the regex will be compiled when it is used for +the first time. On subsequent uses, it will reuse the previous compilation. + +### Usage: match regular expressions on `&[u8]` + +The main API of this crate (`regex::Regex`) requires the caller to pass a +`&str` for searching. In Rust, an `&str` is required to be valid UTF-8, which +means the main API can't be used for searching arbitrary bytes. + +To match on arbitrary bytes, use the `regex::bytes::Regex` API. The API is +identical to the main API, except that it takes an `&[u8]` to search on instead +of an `&str`. The `&[u8]` APIs also permit disabling Unicode mode in the regex +even when the pattern would match invalid UTF-8. For example, `(?-u:.)` is +not allowed in `regex::Regex` but is allowed in `regex::bytes::Regex` since +`(?-u:.)` matches any byte except for `\n`. Conversely, `.` will match the +UTF-8 encoding of any Unicode scalar value except for `\n`. + +This example shows how to find all null-terminated strings in a slice of bytes: + +```rust +use regex::bytes::Regex; + +let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap(); +let text = b"foo\xFFbar\x00baz\x00"; + +// Extract all of the strings without the null terminator from each match. +// The unwrap is OK here since a match requires the `cstr` capture to match. +let cstrs: Vec<&[u8]> = + re.captures_iter(text) + .map(|c| c.name("cstr").unwrap().as_bytes()) + .collect(); +assert_eq!(vec![&b"foo\xFFbar"[..], &b"baz"[..]], cstrs); +``` + +Notice here that the `[^\x00]+` will match any *byte* except for `NUL`, +including bytes like `\xFF` which are not valid UTF-8. When using the main API, +`[^\x00]+` would instead match any valid UTF-8 sequence except for `NUL`. + +### Usage: match multiple regular expressions simultaneously + +This demonstrates how to use a `RegexSet` to match multiple (possibly +overlapping) regular expressions in a single scan of the search text: + +```rust +use regex::RegexSet; + +let set = RegexSet::new(&[ + r"\w+", + r"\d+", + r"\pL+", + r"foo", + r"bar", + r"barfoo", + r"foobar", +]).unwrap(); + +// Iterate over and collect all of the matches. +let matches: Vec<_> = set.matches("foobar").into_iter().collect(); +assert_eq!(matches, vec![0, 2, 3, 4, 6]); + +// You can also test whether a particular regex matched: +let matches = set.matches("foobar"); +assert!(!matches.matched(5)); +assert!(matches.matched(6)); +``` + + +### Usage: regex internals as a library + +The [`regex-automata` directory](./regex-automata/) contains a crate that +exposes all of the internal matching engines used by the `regex` crate. The +idea is that the `regex` crate exposes a simple API for 99% of use cases, but +`regex-automata` exposes oodles of customizable behaviors. + +[Documentation for `regex-automata`.](https://docs.rs/regex-automata) + + +### Usage: a regular expression parser + +This repository contains a crate that provides a well tested regular expression +parser, abstract syntax and a high-level intermediate representation for +convenient analysis. It provides no facilities for compilation or execution. +This may be useful if you're implementing your own regex engine or otherwise +need to do analysis on the syntax of a regular expression. It is otherwise not +recommended for general use. + +[Documentation for `regex-syntax`.](https://docs.rs/regex-syntax) + + +### Crate features + +This crate comes with several features that permit tweaking the trade off +between binary size, compilation time and runtime performance. Users of this +crate can selectively disable Unicode tables, or choose from a variety of +optimizations performed by this crate to disable. + +When all of these features are disabled, runtime match performance may be much +worse, but if you're matching on short strings, or if high performance isn't +necessary, then such a configuration is perfectly serviceable. To disable +all such features, use the following `Cargo.toml` dependency configuration: + +```toml +[dependencies.regex] +version = "1.3" +default-features = false +# Unless you have a specific reason not to, it's good sense to enable standard +# library support. It enables several optimizations and avoids spin locks. It +# also shouldn't meaningfully impact compile times or binary size. +features = ["std"] +``` + +This will reduce the dependency tree of `regex` down to two crates: +`regex-syntax` and `regex-automata`. + +The full set of features one can disable are +[in the "Crate features" section of the documentation](https://docs.rs/regex/1.*/#crate-features). + + +### Minimum Rust version policy + +This crate's minimum supported `rustc` version is `1.60.0`. + +The policy is that the minimum Rust version required to use this crate can be +increased in minor version updates. For example, if regex 1.0 requires Rust +1.20.0, then regex 1.0.z for all values of `z` will also require Rust 1.20.0 or +newer. However, regex 1.y for `y > 0` may require a newer minimum version of +Rust. + + +### License + +This project is licensed under either of + + * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or + https://www.apache.org/licenses/LICENSE-2.0) + * MIT license ([LICENSE-MIT](LICENSE-MIT) or + https://opensource.org/licenses/MIT) + +at your option. + +The data in `regex-syntax/src/unicode_tables/` is licensed under the Unicode +License Agreement +([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)). |