/*!
A DFA-backed `Regex`.

This module provides [`Regex`], which is defined generically over the
[`Automaton`] trait. A `Regex` implements convenience routines you might have
come to expect, such as finding the start/end of a match and iterating over
all non-overlapping matches. This `Regex` type is limited in its capabilities
to what a DFA can provide. Therefore, APIs involving capturing groups, for
example, are not provided.

Internally, a `Regex` is composed of two DFAs. One is a "forward" DFA that
finds the end offset of a match, whereas the other is a "reverse" DFA that
finds the start offset of a match.

See the [parent module](crate::dfa) for examples.
*/
#[cfg(feature = "alloc")]
use alloc::vec::Vec;
#[cfg(feature = "dfa-build")]
use crate::dfa::dense::BuildError;
use crate::{
dfa::{automaton::Automaton, dense},
util::{iter, search::Input},
Anchored, Match, MatchError,
};
#[cfg(feature = "alloc")]
use crate::{
dfa::{sparse, StartKind},
util::search::MatchKind,
};
// When the alloc feature is enabled, the regex type sets its A type parameter
// to default to an owned dense DFA. But without alloc, we set no default. This
// makes things a lot more convenient in the common case, since writing out the
// DFA types is pretty annoying.
//
// Since we have two different definitions but only want to write one doc
// string, we use a macro to capture the doc and other attributes once and then
// repeat them for each definition.
macro_rules! define_regex_type {
($(#[$doc:meta])*) => {
#[cfg(feature = "alloc")]
$(#[$doc])*
pub struct Regex<A = dense::OwnedDFA> {
forward: A,
reverse: A,
}
#[cfg(not(feature = "alloc"))]
$(#[$doc])*
pub struct Regex<A> {
forward: A,
reverse: A,
}
};
}
define_regex_type!(
/// A regular expression that uses deterministic finite automata for fast
/// searching.
///
/// A regular expression is composed of two DFAs, a "forward" DFA and a
/// "reverse" DFA. The forward DFA is responsible for detecting the end of
/// a match while the reverse DFA is responsible for detecting the start
/// of a match. Thus, in order to find the bounds of any given match, a
/// forward search must first be run followed by a reverse search. A match
/// found by the forward DFA guarantees that the reverse DFA will also find
/// a match.
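///
/// As a rough sketch of that two-step process (not the exact
/// implementation, which also deals with details such as empty matches),
/// the example below drives the two DFAs by hand through the
/// [`Automaton`] search routines. The pattern and haystack are made up
/// for illustration.
///
/// ```rust
/// use regex_automata::{
///     dfa::{regex::Regex, Automaton},
///     Anchored, Input,
/// };
///
/// let re = Regex::new("foo[0-9]+")?;
/// let input = Input::new("zzzfoo12345zzz");
///
/// // The forward DFA scans left to right and reports where the match ends.
/// let end = re.forward().try_search_fwd(&input)?.expect("forward match");
/// assert_eq!(11, end.offset());
///
/// // The reverse DFA scans right to left from that end offset, anchored
/// // there, and reports where the match starts.
/// let rev_input = input
///     .clone()
///     .span(0..end.offset())
///     .anchored(Anchored::Yes);
/// let start = re
///     .reverse()
///     .try_search_rev(&rev_input)?
///     .expect("reverse match");
/// assert_eq!(3, start.offset());
///
/// // Together, the two offsets agree with the high level search.
/// let m = re.find("zzzfoo12345zzz").expect("match");
/// assert_eq!((3, 11), (m.start(), m.end()));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```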
///
/// The type of the DFA used by a `Regex` corresponds to the `A` type
/// parameter, which must satisfy the [`Automaton`] trait. Typically,
/// `A` is either a [`dense::DFA`](crate::dfa::dense::DFA) or a
/// [`sparse::DFA`](crate::dfa::sparse::DFA), where dense DFAs use more
/// memory but search faster, while sparse DFAs use less memory but search
/// more slowly.
///
/// # Crate features
///
/// Note that despite what the documentation auto-generates, the _only_
/// crate feature needed to use this type is `dfa-search`. You do _not_
/// need to enable the `alloc` feature.
///
/// By default, a regex's automaton type parameter is set to
/// `dense::DFA<Vec<u32>>` when the `alloc` feature is enabled. For most
/// in-memory workloads, this is the most convenient type that gives the
/// best search performance. When the `alloc` feature is disabled, no
/// default type is used.
///
/// # When should I use this?
///
/// Generally speaking, if you can afford the overhead of building a full
/// DFA for your regex, and you don't need things like capturing groups,
/// then this is a good choice if you're looking to optimize for matching
/// speed. Note however that its speed may be worse than a general purpose
/// regex engine if you don't provide a [`dense::Config::prefilter`] to the
/// underlying DFA.
///
/// # Sparse DFAs
///
/// Since a `Regex` is generic over the [`Automaton`] trait, it can be
/// used with any kind of DFA. While this crate constructs dense DFAs by
/// default, it is easy enough to build corresponding sparse DFAs, and then
/// build a regex from them:
///
/// ```
/// use regex_automata::dfa::regex::Regex;
///
/// // First, build a regex that uses dense DFAs.
/// let dense_re = Regex::new("foo[0-9]+")?;
///
/// // Second, build sparse DFAs from the forward and reverse dense DFAs.
/// let fwd = dense_re.forward().to_sparse()?;
/// let rev = dense_re.reverse().to_sparse()?;
///
/// // Third, build a new regex from the constituent sparse DFAs.
/// let sparse_re = Regex::builder().build_from_dfas(fwd, rev);
///
/// // A regex that uses sparse DFAs can be used just like with dense DFAs.
/// assert_eq!(true, sparse_re.is_match(b"foo123"));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// Alternatively, one can use a [`Builder`] to construct a sparse DFA
/// more succinctly. (Note though that dense DFAs are still constructed
/// first internally, and then converted to sparse DFAs, as in the example
/// above.)
///
/// ```
/// use regex_automata::dfa::regex::Regex;
///
/// let sparse_re = Regex::builder().build_sparse(r"foo[0-9]+")?;
/// // A regex that uses sparse DFAs can be used just like with dense DFAs.
/// assert!(sparse_re.is_match(b"foo123"));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// # Fallibility
///
/// Most of the search routines defined on this type will _panic_ when the
/// underlying search fails. This might be because the DFA gave up because
/// it saw a quit byte, whether configured explicitly or via heuristic
/// Unicode word boundary support, although neither is enabled by default.
/// Or it might fail because an invalid `Input` configuration is given,
/// for example, with an unsupported [`Anchored`] mode.
///
/// If you need to handle these error cases instead of allowing them to
/// trigger a panic, then the lower level [`Regex::try_search`] provides
/// a fallible API that never panics.
///
/// # Example
///
/// This example shows how to cause a search to terminate if it sees a
/// `\n` byte, and handle the error returned. This could be useful if, for
/// example, you wanted to prevent a user supplied pattern from matching
/// across a line boundary.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{dfa::{self, regex::Regex}, Input, MatchError};
///
/// let re = Regex::builder()
/// .dense(dfa::dense::Config::new().quit(b'\n', true))
/// .build(r"foo\p{any}+bar")?;
///
/// let input = Input::new("foo\nbar");
/// // Normally this would produce a match, since \p{any} contains '\n'.
/// // But since we instructed the automaton to enter a quit state if a
/// // '\n' is observed, this produces a match error instead.
/// let expected = MatchError::quit(b'\n', 3);
/// let got = re.try_search(&input).unwrap_err();
/// assert_eq!(expected, got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[derive(Clone, Debug)]
);
#[cfg(all(feature = "syntax", feature = "dfa-build"))]
impl Regex {
/// Parse the given regular expression using the default configuration and
/// return the corresponding regex.
///
/// If you want a non-default configuration, then use the [`Builder`] to
/// set your own configuration.
///
/// # Example
///
/// ```
/// use regex_automata::{Match, dfa::regex::Regex};
///
/// let re = Regex::new("foo[0-9]+bar")?;
/// assert_eq!(
/// Some(Match::must(0, 3..14)),
/// re.find(b"zzzfoo12345barzzz"),
/// );
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn new(pattern: &str) -> Result<Regex, BuildError> {
Builder::new().build(pattern)
}
/// Like `new`, but parses multiple patterns into a single "regex set."
/// This similarly uses the default regex configuration.
///
/// # Example
///
/// ```
/// use regex_automata::{Match, dfa::regex::Regex};
///
/// let re = Regex::new_many(&["[a-z]+", "[0-9]+"])?;
///
/// let mut it = re.find_iter(b"abc 1 foo 4567 0 quux");
/// assert_eq!(Some(Match::must(0, 0..3)), it.next());
/// assert_eq!(Some(Match::must(1, 4..5)), it.next());
/// assert_eq!(Some(Match::must(0, 6..9)), it.next());
/// assert_eq!(Some(Match::must(1, 10..14)), it.next());
/// assert_eq!(Some(Match::must(1, 15..16)), it.next());
/// assert_eq!(Some(Match::must(0, 17..21)), it.next());
/// assert_eq!(None, it.next());
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn new_many<P: AsRef<str>>(
patterns: &[P],
) -> Result<Regex, BuildError> {
Builder::new().build_many(patterns)
}
}
#[cfg(all(feature = "syntax", feature = "dfa-build"))]
impl Regex<sparse::DFA<Vec<u8>>> {
/// Parse the given regular expression using the default configuration,
/// except using sparse DFAs, and return the corresponding regex.
///
/// If you want a non-default configuration, then use the [`Builder`] to
/// set your own configuration.
///
/// # Example
///
/// ```
/// use regex_automata::{Match, dfa::regex::Regex};
///
/// let re = Regex::new_sparse("foo[0-9]+bar")?;
/// assert_eq!(
/// Some(Match::must(0, 3..14)),
/// re.find(b"zzzfoo12345barzzz"),
/// );
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn new_sparse(
pattern: &str,
) -> Result<Regex<sparse::DFA<Vec<u8>>>, BuildError> {
Builder::new().build_sparse(pattern)
}
/// Like `new`, but parses multiple patterns into a single "regex set"
/// using sparse DFAs. This otherwise similarly uses the default regex
/// configuration.
///
/// # Example
///
/// ```
/// use regex_automata::{Match, dfa::regex::Regex};
///
/// let re = Regex::new_many_sparse(&["[a-z]+", "[0-9]+"])?;
///
/// let mut it = re.find_iter(b"abc 1 foo 4567 0 quux");
/// assert_eq!(Some(Match::must(0, 0..3)), it.next());
/// assert_eq!(Some(Match::must(1, 4..5)), it.next());
/// assert_eq!(Some(Match::must(0, 6..9)), it.next());
/// assert_eq!(Some(Match::must(1, 10..14)), it.next());
/// assert_eq!(Some(Match::must(1, 15..16)), it.next());
/// assert_eq!(Some(Match::must(0, 17..21)), it.next());
/// assert_eq!(None, it.next());
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn new_many_sparse<P: AsRef<str>>(
patterns: &[P],
) -> Result<Regex<sparse::DFA<Vec<u8>>>, BuildError> {
Builder::new().build_many_sparse(patterns)
}
}
/// Convenience routines for regex construction.
impl Regex<dense::DFA<&'static [u32]>> {
/// Return a builder for configuring the construction of a `Regex`.
///
/// This is a convenience routine to avoid needing to import the
/// [`Builder`] type in common cases.
///
/// # Example
///
/// This example shows how to use the builder to disable UTF-8 mode
/// everywhere.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{
/// dfa::regex::Regex, nfa::thompson, util::syntax, Match,
/// };
///
/// let re = Regex::builder()
/// .syntax(syntax::Config::new().utf8(false))
/// .thompson(thompson::Config::new().utf8(false))
/// .build(r"foo(?-u:[^b])ar.*")?;
/// let haystack = b"\xFEfoo\xFFarzz\xE2\x98\xFF\n";
/// let expected = Some(Match::must(0, 1..9));
/// let got = re.find(haystack);
/// assert_eq!(expected, got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn builder() -> Builder {
Builder::new()
}
}
/// Standard search routines for finding and iterating over matches.
impl<A: Automaton> Regex<A> {
/// Returns true if and only if this regex matches the given haystack.
///
/// This routine may short circuit if it knows that scanning future input
/// will never lead to a different result. In particular, if the underlying
/// DFA enters a match state or a dead state, then this routine will return
/// `true` or `false`, respectively, without inspecting any future input.
///
/// # Panics
///
/// This routine panics if the search could not complete. This can occur
/// in a number of circumstances:
///
/// * The configuration of the DFA may permit it to "quit" the search.
/// For example, setting quit bytes or enabling heuristic support for
/// Unicode word boundaries. The default configuration does not enable any
/// option that could result in the DFA quitting.
/// * When the provided `Input` configuration is not supported. For
/// example, by providing an unsupported anchor mode.
///
/// When a search panics, callers cannot know whether a match exists or
/// not.
///
/// Use [`Regex::try_search`] if you want to handle these error conditions.
///
/// # Example
///
/// ```
/// use regex_automata::dfa::regex::Regex;
///
/// let re = Regex::new("foo[0-9]+bar")?;
/// assert_eq!(true, re.is_match("foo12345bar"));
/// assert_eq!(false, re.is_match("foobar"));
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn is_match<'h, I: Into<Input<'h>>>(&self, input: I) -> bool {
// Not only can we do an "earliest" search, but we can avoid doing a
// reverse scan too.
let input = input.into().earliest(true);
self.forward().try_search_fwd(&input).map(|x| x.is_some()).unwrap()
}
/// Returns the start and end offset of the leftmost match. If no match
/// exists, then `None` is returned.
///
/// # Panics
///
/// This routine panics if the search could not complete. This can occur
/// in a number of circumstances:
///
/// * The configuration of the DFA may permit it to "quit" the search.
/// For example, setting quit bytes or enabling heuristic support for
/// Unicode word boundaries. The default configuration does not enable any
/// option that could result in the DFA quitting.
/// * When the provided `Input` configuration is not supported. For
/// example, by providing an unsupported anchor mode.
///
/// When a search panics, callers cannot know whether a match exists or
/// not.
///
/// Use [`Regex::try_search`] if you want to handle these error conditions.
///
/// # Example
///
/// ```
/// use regex_automata::{Match, dfa::regex::Regex};
///
/// // Greediness is applied appropriately.
/// let re = Regex::new("foo[0-9]+")?;
/// assert_eq!(Some(Match::must(0, 3..11)), re.find("zzzfoo12345zzz"));
///
/// // Even though a match is found after reading the first byte (`a`),
/// // the default leftmost-first match semantics demand that we find the
/// // earliest match that prefers earlier parts of the pattern over later
/// // parts.
/// let re = Regex::new("abc|a")?;
/// assert_eq!(Some(Match::must(0, 0..3)), re.find("abc"));
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn find<'h, I: Into<Input<'h>>>(&self, input: I) -> Option<Match> {
self.try_search(&input.into()).unwrap()
}
/// Returns an iterator over all non-overlapping leftmost matches in the
/// given bytes. If no match exists, then the iterator yields no elements.
///
/// This corresponds to the "standard" regex search iterator.
///
/// # Panics
///
/// If the search returns an error during iteration, then iteration
/// panics. See [`Regex::find`] for the panic conditions.
///
/// Use [`Regex::try_search`] with
/// [`util::iter::Searcher`](crate::util::iter::Searcher) if you want to
/// handle these error conditions.
///
/// # Example
///
/// ```
/// use regex_automata::{Match, dfa::regex::Regex};
///
/// let re = Regex::new("foo[0-9]+")?;
/// let text = "foo1 foo12 foo123";
/// let matches: Vec<Match> = re.find_iter(text).collect();
/// assert_eq!(matches, vec![
/// Match::must(0, 0..4),
/// Match::must(0, 5..10),
/// Match::must(0, 11..17),
/// ]);
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn find_iter<'r, 'h, I: Into<Input<'h>>>(
&'r self,
input: I,
) -> FindMatches<'r, 'h, A> {
let it = iter::Searcher::new(input.into());
FindMatches { re: self, it }
}
}
/// Lower level fallible search routines that permit controlling where the
/// search starts and ends in a particular sequence.
impl<A: Automaton> Regex<A> {
/// Returns the start and end offset of the leftmost match. If no match
/// exists, then `None` is returned.
///
/// This is like [`Regex::find`] but with two differences:
///
/// 1. It is not generic over `Into<Input>` and instead accepts a
/// `&Input`. This permits reusing the same `Input` for multiple searches
/// without needing to create a new one. This _may_ help with latency.
/// 2. It returns an error if the search could not complete whereas
/// [`Regex::find`] will panic.
///
/// # Errors
///
/// This routine errors if the search could not complete. This can occur
/// in the following circumstances:
///
/// * The configuration of the DFA may permit it to "quit" the search.
/// For example, setting quit bytes or enabling heuristic support for
/// Unicode word boundaries. The default configuration does not enable any
/// option that could result in the DFA quitting.
/// * When the provided `Input` configuration is not supported. For
/// example, by providing an unsupported anchor mode.
///
/// When a search returns an error, callers cannot know whether a match
/// exists or not.
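///
/// # Example
///
/// A minimal sketch of reusing one `Input` across searches, assuming the
/// default configuration. The haystack is made up for illustration.
///
/// ```rust
/// use regex_automata::{dfa::regex::Regex, Input, Match};
///
/// let re = Regex::new("foo[0-9]+")?;
/// let mut input = Input::new("foo1 foo12");
///
/// // The first search starts at the beginning of the haystack.
/// assert_eq!(Some(Match::must(0, 0..4)), re.try_search(&input)?);
///
/// // Move the start of the search past the first match and search again
/// // with the same `Input`.
/// input.set_start(4);
/// assert_eq!(Some(Match::must(0, 5..10)), re.try_search(&input)?);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```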
#[inline]
pub fn try_search(
&self,
input: &Input<'_>,
) -> Result