/*!
Types and routines that support the search APIs of most regex engines.

This sub-module isn't exposed directly, but rather, its contents are exported
at the crate root due to the universality of most of the types and routines in
this module.
*/

use core::ops::{Range, RangeBounds};

use crate::util::{escape::DebugByte, primitives::PatternID, utf8};

/// The parameters for a regex search including the haystack to search.
///
/// It turns out that regex searches have a few parameters, and in most cases,
/// those parameters have defaults that work in the vast majority of cases.
/// This `Input` type exists to make that common case seamless while also
/// providing an avenue for changing the parameters of a search. In
/// particular, this type enables doing so without a combinatorial explosion
/// of different methods and/or superfluous parameters in the common cases.
///
/// An `Input` permits configuring the following things:
///
/// * Search only a substring of a haystack, while taking the broader context
/// into account for resolving look-around assertions.
/// * Indicating whether to search for all patterns in a regex, or to
/// only search for one pattern in particular.
/// * Whether to perform an anchored or unanchored search.
/// * Whether to report a match as early as possible.
///
/// All of these parameters, except for the haystack, have sensible default
/// values. This means that the minimal search configuration is simply a call
/// to [`Input::new`] with your haystack. Setting any other parameter is
/// optional.
///
/// Moreover, for any `H` that implements `AsRef<[u8]>`, there exists a
/// `From<&H> for Input` implementation. This is useful because many of the
/// search APIs in this crate accept an `Into<Input>`. This means you can
/// provide string or byte strings to these routines directly, and they'll
/// automatically get converted into an `Input` for you.
///
/// The lifetime parameter `'h` refers to the lifetime of the haystack.
///
/// # Organization
///
/// The API of `Input` is split into a few different parts:
///
/// * A builder-like API that transforms an `Input` by value. Examples:
/// [`Input::span`] and [`Input::anchored`].
/// * A setter API that permits mutating parameters in place. Examples:
/// [`Input::set_span`] and [`Input::set_anchored`].
/// * A getter API that permits retrieving any of the search parameters.
/// Examples: [`Input::get_span`] and [`Input::get_anchored`].
/// * A few convenience getter routines that don't conform to the above naming
/// pattern due to how common they are. Examples: [`Input::haystack`],
/// [`Input::start`] and [`Input::end`].
/// * Miscellaneous predicates and other helper routines that are useful
/// in some contexts. Examples: [`Input::is_char_boundary`].
///
/// An `Input` exposes so much because it is meant to be used by both callers
/// of regex engines _and_ implementors of regex engines. A constraining
/// factor is that regex engines should accept a `&Input` as their lowest
/// level API, which means that implementors should only use the "getter"
/// APIs of an `Input`.
///
/// # Valid bounds and search termination
///
/// An `Input` permits setting the bounds of a search via either
/// [`Input::span`] or [`Input::range`]. The bounds set must be valid, or
/// else a panic will occur. Bounds are valid if and only if:
///
/// * The bounds represent a valid range into the input's haystack.
/// * **or** the end bound is a valid ending bound for the haystack *and*
/// the start bound is exactly one greater than the end bound.
/// /// In the latter case, [`Input::is_done`] will return true and indicates any /// search receiving such an input should immediately return with no match. /// /// Note that while `Input` is used for reverse searches in this crate, the /// `Input::is_done` predicate assumes a forward search. Because unsigned /// offsets are used internally, there is no way to tell from only the offsets /// whether a reverse search is done or not. /// /// # Regex engine support /// /// Any regex engine accepting an `Input` must support at least the following /// things: /// /// * Searching a `&[u8]` for matches. /// * Searching a substring of `&[u8]` for a match, such that any match /// reported must appear entirely within that substring. /// * For a forwards search, a match should never be reported when /// [`Input::is_done`] returns true. (For reverse searches, termination should /// be handled outside of `Input`.) /// /// Supporting other aspects of an `Input` are optional, but regex engines /// should handle aspects they don't support gracefully. How this is done is /// generally up to the regex engine. This crate generally treats unsupported /// anchored modes as an error to report for example, but for simplicity, in /// the meta regex engine, trying to search with an invalid pattern ID just /// results in no match being reported. #[derive(Clone)] pub struct Input<'h> { haystack: &'h [u8], span: Span, anchored: Anchored, earliest: bool, } impl<'h> Input<'h> { /// Create a new search configuration for the given haystack. #[inline] pub fn new>(haystack: &'h H) -> Input<'h> { Input { haystack: haystack.as_ref(), span: Span { start: 0, end: haystack.as_ref().len() }, anchored: Anchored::No, earliest: false, } } /// Set the span for this search. /// /// This routine does not panic if the span given is not a valid range for /// this search's haystack. If this search is run with an invalid range, /// then the most likely outcome is that the actual search execution will /// panic. /// /// This routine is generic over how a span is provided. While /// a [`Span`] may be given directly, one may also provide a /// `std::ops::Range`. To provide anything supported by range /// syntax, use the [`Input::range`] method. /// /// The default span is the entire haystack. /// /// Note that [`Input::range`] overrides this method and vice versa. /// /// # Panics /// /// This panics if the given span does not correspond to valid bounds in /// the haystack or the termination of a search. /// /// # Example /// /// This example shows how the span of the search can impact whether a /// match is reported or not. This is particularly relevant for look-around /// operators, which might take things outside of the span into account /// when determining whether they match. /// /// ``` /// # if cfg!(miri) { return Ok(()); } // miri takes too long /// use regex_automata::{ /// nfa::thompson::pikevm::PikeVM, /// Match, Input, /// }; /// /// // Look for 'at', but as a distinct word. /// let re = PikeVM::new(r"\bat\b")?; /// let mut cache = re.create_cache(); /// let mut caps = re.create_captures(); /// /// // Our haystack contains 'at', but not as a distinct word. /// let haystack = "batter"; /// /// // A standard search finds nothing, as expected. /// let input = Input::new(haystack); /// re.search(&mut cache, &input, &mut caps); /// assert_eq!(None, caps.get_match()); /// /// // But if we wanted to search starting at position '1', we might /// // slice the haystack. 
If we do this, it's impossible for the \b /// // anchors to take the surrounding context into account! And thus, /// // a match is produced. /// let input = Input::new(&haystack[1..3]); /// re.search(&mut cache, &input, &mut caps); /// assert_eq!(Some(Match::must(0, 0..2)), caps.get_match()); /// /// // But if we specify the span of the search instead of slicing the /// // haystack, then the regex engine can "see" outside of the span /// // and resolve the anchors correctly. /// let input = Input::new(haystack).span(1..3); /// re.search(&mut cache, &input, &mut caps); /// assert_eq!(None, caps.get_match()); /// /// # Ok::<(), Box>(()) /// ``` /// /// This may seem a little ham-fisted, but this scenario tends to come up /// if some other regex engine found the match span and now you need to /// re-process that span to look for capturing groups. (e.g., Run a faster /// DFA first, find a match, then run the PikeVM on just the match span to /// resolve capturing groups.) In order to implement that sort of logic /// correctly, you need to set the span on the search instead of slicing /// the haystack directly. /// /// The other advantage of using this routine to specify the bounds of the /// search is that the match offsets are still reported in terms of the /// original haystack. For example, the second search in the example above /// reported a match at position `0`, even though `at` starts at offset /// `1` because we sliced the haystack. #[inline] pub fn span>(mut self, span: S) -> Input<'h> { self.set_span(span); self } /// Like `Input::span`, but accepts any range instead. /// /// This routine does not panic if the range given is not a valid range for /// this search's haystack. If this search is run with an invalid range, /// then the most likely outcome is that the actual search execution will /// panic. /// /// The default range is the entire haystack. /// /// Note that [`Input::span`] overrides this method and vice versa. /// /// # Panics /// /// This routine will panic if the given range could not be converted /// to a valid [`Range`]. For example, this would panic when given /// `0..=usize::MAX` since it cannot be represented using a half-open /// interval in terms of `usize`. /// /// This also panics if the given range does not correspond to valid bounds /// in the haystack or the termination of a search. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let input = Input::new("foobar"); /// assert_eq!(0..6, input.get_range()); /// /// let input = Input::new("foobar").range(2..=4); /// assert_eq!(2..5, input.get_range()); /// ``` #[inline] pub fn range>(mut self, range: R) -> Input<'h> { self.set_range(range); self } /// Sets the anchor mode of a search. /// /// When a search is anchored (so that's [`Anchored::Yes`] or /// [`Anchored::Pattern`]), a match must begin at the start of a search. /// When a search is not anchored (that's [`Anchored::No`]), regex engines /// will behave as if the pattern started with a `(?s-u:.)*?`. This prefix /// permits a match to appear anywhere. /// /// By default, the anchored mode is [`Anchored::No`]. /// /// **WARNING:** this is subtly different than using a `^` at the start of /// your regex. A `^` forces a regex to match exclusively at the start of /// a haystack, regardless of where you begin your search. In contrast, /// anchoring a search will allow your regex to match anywhere in your /// haystack, but the match must start at the beginning of a search. 
/// /// For example, consider the haystack `aba` and the following searches: /// /// 1. The regex `^a` is compiled with `Anchored::No` and searches `aba` /// starting at position `2`. Since `^` requires the match to start at /// the beginning of the haystack and `2 > 0`, no match is found. /// 2. The regex `a` is compiled with `Anchored::Yes` and searches `aba` /// starting at position `2`. This reports a match at `[2, 3]` since /// the match starts where the search started. Since there is no `^`, /// there is no requirement for the match to start at the beginning of /// the haystack. /// 3. The regex `a` is compiled with `Anchored::Yes` and searches `aba` /// starting at position `1`. Since `b` corresponds to position `1` and /// since the search is anchored, it finds no match. While the regex /// matches at other positions, configuring the search to be anchored /// requires that it only report a match that begins at the same offset /// as the beginning of the search. /// 4. The regex `a` is compiled with `Anchored::No` and searches `aba` /// starting at position `1`. Since the search is not anchored and /// the regex does not start with `^`, the search executes as if there /// is a `(?s:.)*?` prefix that permits it to match anywhere. Thus, it /// reports a match at `[2, 3]`. /// /// Note that the [`Anchored::Pattern`] mode is like `Anchored::Yes`, /// except it only reports matches for a particular pattern. /// /// # Example /// /// This demonstrates the differences between an anchored search and /// a pattern that begins with `^` (as described in the above warning /// message). /// /// ``` /// use regex_automata::{ /// nfa::thompson::pikevm::PikeVM, /// Anchored, Match, Input, /// }; /// /// let haystack = "aba"; /// /// let re = PikeVM::new(r"^a")?; /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures()); /// let input = Input::new(haystack).span(2..3).anchored(Anchored::No); /// re.search(&mut cache, &input, &mut caps); /// // No match is found because 2 is not the beginning of the haystack, /// // which is what ^ requires. /// assert_eq!(None, caps.get_match()); /// /// let re = PikeVM::new(r"a")?; /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures()); /// let input = Input::new(haystack).span(2..3).anchored(Anchored::Yes); /// re.search(&mut cache, &input, &mut caps); /// // An anchored search can still match anywhere in the haystack, it just /// // must begin at the start of the search which is '2' in this case. /// assert_eq!(Some(Match::must(0, 2..3)), caps.get_match()); /// /// let re = PikeVM::new(r"a")?; /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures()); /// let input = Input::new(haystack).span(1..3).anchored(Anchored::Yes); /// re.search(&mut cache, &input, &mut caps); /// // No match is found since we start searching at offset 1 which /// // corresponds to 'b'. Since there is no '(?s:.)*?' prefix, no match /// // is found. /// assert_eq!(None, caps.get_match()); /// /// let re = PikeVM::new(r"a")?; /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures()); /// let input = Input::new(haystack).span(1..3).anchored(Anchored::No); /// re.search(&mut cache, &input, &mut caps); /// // Since anchored=no, an implicit '(?s:.)*?' prefix was added to the /// // pattern. Even though the search starts at 'b', the 'match anything' /// // prefix allows the search to match 'a'. 
/// let expected = Some(Match::must(0, 2..3)); /// assert_eq!(expected, caps.get_match()); /// /// # Ok::<(), Box>(()) /// ``` #[inline] pub fn anchored(mut self, mode: Anchored) -> Input<'h> { self.set_anchored(mode); self } /// Whether to execute an "earliest" search or not. /// /// When running a non-overlapping search, an "earliest" search will return /// the match location as early as possible. For example, given a pattern /// of `foo[0-9]+` and a haystack of `foo12345`, a normal leftmost search /// will return `foo12345` as a match. But an "earliest" search for regex /// engines that support "earliest" semantics will return `foo1` as a /// match, since as soon as the first digit following `foo` is seen, it is /// known to have found a match. /// /// Note that "earliest" semantics generally depend on the regex engine. /// Different regex engines may determine there is a match at different /// points. So there is no guarantee that "earliest" matches will always /// return the same offsets for all regex engines. The "earliest" notion /// is really about when the particular regex engine determines there is /// a match rather than a consistent semantic unto itself. This is often /// useful for implementing "did a match occur or not" predicates, but /// sometimes the offset is useful as well. /// /// This is disabled by default. /// /// # Example /// /// This example shows the difference between "earliest" searching and /// normal searching. /// /// ``` /// use regex_automata::{nfa::thompson::pikevm::PikeVM, Match, Input}; /// /// let re = PikeVM::new(r"foo[0-9]+")?; /// let mut cache = re.create_cache(); /// let mut caps = re.create_captures(); /// /// // A normal search implements greediness like you expect. /// let input = Input::new("foo12345"); /// re.search(&mut cache, &input, &mut caps); /// assert_eq!(Some(Match::must(0, 0..8)), caps.get_match()); /// /// // When 'earliest' is enabled and the regex engine supports /// // it, the search will bail once it knows a match has been /// // found. /// let input = Input::new("foo12345").earliest(true); /// re.search(&mut cache, &input, &mut caps); /// assert_eq!(Some(Match::must(0, 0..4)), caps.get_match()); /// # Ok::<(), Box>(()) /// ``` #[inline] pub fn earliest(mut self, yes: bool) -> Input<'h> { self.set_earliest(yes); self } /// Set the span for this search configuration. /// /// This is like the [`Input::span`] method, except this mutates the /// span in place. /// /// This routine is generic over how a span is provided. While /// a [`Span`] may be given directly, one may also provide a /// `std::ops::Range`. /// /// # Panics /// /// This panics if the given span does not correspond to valid bounds in /// the haystack or the termination of a search. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let mut input = Input::new("foobar"); /// assert_eq!(0..6, input.get_range()); /// input.set_span(2..4); /// assert_eq!(2..4, input.get_range()); /// ``` #[inline] pub fn set_span>(&mut self, span: S) { let span = span.into(); assert!( span.end <= self.haystack.len() && span.start <= span.end.wrapping_add(1), "invalid span {:?} for haystack of length {}", span, self.haystack.len(), ); self.span = span; } /// Set the span for this search configuration given any range. /// /// This is like the [`Input::range`] method, except this mutates the /// span in place. /// /// This routine does not panic if the range given is not a valid range for /// this search's haystack. 
If this search is run with an invalid range, /// then the most likely outcome is that the actual search execution will /// panic. /// /// # Panics /// /// This routine will panic if the given range could not be converted /// to a valid [`Range`]. For example, this would panic when given /// `0..=usize::MAX` since it cannot be represented using a half-open /// interval in terms of `usize`. /// /// This also panics if the given span does not correspond to valid bounds /// in the haystack or the termination of a search. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let mut input = Input::new("foobar"); /// assert_eq!(0..6, input.get_range()); /// input.set_range(2..=4); /// assert_eq!(2..5, input.get_range()); /// ``` #[inline] pub fn set_range>(&mut self, range: R) { use core::ops::Bound; // It's a little weird to convert ranges into spans, and then spans // back into ranges when we actually slice the haystack. Because // of that process, we always represent everything as a half-open // internal. Therefore, handling things like m..=n is a little awkward. let start = match range.start_bound() { Bound::Included(&i) => i, // Can this case ever happen? Range syntax doesn't support it... Bound::Excluded(&i) => i.checked_add(1).unwrap(), Bound::Unbounded => 0, }; let end = match range.end_bound() { Bound::Included(&i) => i.checked_add(1).unwrap(), Bound::Excluded(&i) => i, Bound::Unbounded => self.haystack().len(), }; self.set_span(Span { start, end }); } /// Set the starting offset for the span for this search configuration. /// /// This is a convenience routine for only mutating the start of a span /// without having to set the entire span. /// /// # Panics /// /// This panics if the span resulting from the new start position does not /// correspond to valid bounds in the haystack or the termination of a /// search. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let mut input = Input::new("foobar"); /// assert_eq!(0..6, input.get_range()); /// input.set_start(5); /// assert_eq!(5..6, input.get_range()); /// ``` #[inline] pub fn set_start(&mut self, start: usize) { self.set_span(Span { start, ..self.get_span() }); } /// Set the ending offset for the span for this search configuration. /// /// This is a convenience routine for only mutating the end of a span /// without having to set the entire span. /// /// # Panics /// /// This panics if the span resulting from the new end position does not /// correspond to valid bounds in the haystack or the termination of a /// search. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let mut input = Input::new("foobar"); /// assert_eq!(0..6, input.get_range()); /// input.set_end(5); /// assert_eq!(0..5, input.get_range()); /// ``` #[inline] pub fn set_end(&mut self, end: usize) { self.set_span(Span { end, ..self.get_span() }); } /// Set the anchor mode of a search. /// /// This is like [`Input::anchored`], except it mutates the search /// configuration in place. /// /// # Example /// /// ``` /// use regex_automata::{Anchored, Input, PatternID}; /// /// let mut input = Input::new("foobar"); /// assert_eq!(Anchored::No, input.get_anchored()); /// /// let pid = PatternID::must(5); /// input.set_anchored(Anchored::Pattern(pid)); /// assert_eq!(Anchored::Pattern(pid), input.get_anchored()); /// ``` #[inline] pub fn set_anchored(&mut self, mode: Anchored) { self.anchored = mode; } /// Set whether the search should execute in "earliest" mode or not. 
/// /// This is like [`Input::earliest`], except it mutates the search /// configuration in place. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let mut input = Input::new("foobar"); /// assert!(!input.get_earliest()); /// input.set_earliest(true); /// assert!(input.get_earliest()); /// ``` #[inline] pub fn set_earliest(&mut self, yes: bool) { self.earliest = yes; } /// Return a borrow of the underlying haystack as a slice of bytes. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let input = Input::new("foobar"); /// assert_eq!(b"foobar", input.haystack()); /// ``` #[inline] pub fn haystack(&self) -> &[u8] { self.haystack } /// Return the start position of this search. /// /// This is a convenience routine for `search.get_span().start()`. /// /// When [`Input::is_done`] is `false`, this is guaranteed to return /// an offset that is less than or equal to [`Input::end`]. Otherwise, /// the offset is one greater than [`Input::end`]. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let input = Input::new("foobar"); /// assert_eq!(0, input.start()); /// /// let input = Input::new("foobar").span(2..4); /// assert_eq!(2, input.start()); /// ``` #[inline] pub fn start(&self) -> usize { self.get_span().start } /// Return the end position of this search. /// /// This is a convenience routine for `search.get_span().end()`. /// /// This is guaranteed to return an offset that is a valid exclusive end /// bound for this input's haystack. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let input = Input::new("foobar"); /// assert_eq!(6, input.end()); /// /// let input = Input::new("foobar").span(2..4); /// assert_eq!(4, input.end()); /// ``` #[inline] pub fn end(&self) -> usize { self.get_span().end } /// Return the span for this search configuration. /// /// If one was not explicitly set, then the span corresponds to the entire /// range of the haystack. /// /// When [`Input::is_done`] is `false`, the span returned is guaranteed /// to correspond to valid bounds for this input's haystack. /// /// # Example /// /// ``` /// use regex_automata::{Input, Span}; /// /// let input = Input::new("foobar"); /// assert_eq!(Span { start: 0, end: 6 }, input.get_span()); /// ``` #[inline] pub fn get_span(&self) -> Span { self.span } /// Return the span as a range for this search configuration. /// /// If one was not explicitly set, then the span corresponds to the entire /// range of the haystack. /// /// When [`Input::is_done`] is `false`, the range returned is guaranteed /// to correspond to valid bounds for this input's haystack. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let input = Input::new("foobar"); /// assert_eq!(0..6, input.get_range()); /// ``` #[inline] pub fn get_range(&self) -> Range { self.get_span().range() } /// Return the anchored mode for this search configuration. /// /// If no anchored mode was set, then it defaults to [`Anchored::No`]. /// /// # Example /// /// ``` /// use regex_automata::{Anchored, Input, PatternID}; /// /// let mut input = Input::new("foobar"); /// assert_eq!(Anchored::No, input.get_anchored()); /// /// let pid = PatternID::must(5); /// input.set_anchored(Anchored::Pattern(pid)); /// assert_eq!(Anchored::Pattern(pid), input.get_anchored()); /// ``` #[inline] pub fn get_anchored(&self) -> Anchored { self.anchored } /// Return whether this search should execute in "earliest" mode. 
/// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let input = Input::new("foobar"); /// assert!(!input.get_earliest()); /// ``` #[inline] pub fn get_earliest(&self) -> bool { self.earliest } /// Return true if and only if this search can never return any other /// matches. /// /// This occurs when the start position of this search is greater than the /// end position of the search. /// /// # Example /// /// ``` /// use regex_automata::Input; /// /// let mut input = Input::new("foobar"); /// assert!(!input.is_done()); /// input.set_start(6); /// assert!(!input.is_done()); /// input.set_start(7); /// assert!(input.is_done()); /// ``` #[inline] pub fn is_done(&self) -> bool { self.get_span().start > self.get_span().end } /// Returns true if and only if the given offset in this search's haystack /// falls on a valid UTF-8 encoded codepoint boundary. /// /// If the haystack is not valid UTF-8, then the behavior of this routine /// is unspecified. /// /// # Example /// /// This shows where codepoint boundaries do and don't exist in valid /// UTF-8. /// /// ``` /// use regex_automata::Input; /// /// let input = Input::new("☃"); /// assert!(input.is_char_boundary(0)); /// assert!(!input.is_char_boundary(1)); /// assert!(!input.is_char_boundary(2)); /// assert!(input.is_char_boundary(3)); /// assert!(!input.is_char_boundary(4)); /// ``` #[inline] pub fn is_char_boundary(&self, offset: usize) -> bool { utf8::is_boundary(self.haystack(), offset) } } impl<'h> core::fmt::Debug for Input<'h> { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { use crate::util::escape::DebugHaystack; f.debug_struct("Input") .field("haystack", &DebugHaystack(self.haystack())) .field("span", &self.span) .field("anchored", &self.anchored) .field("earliest", &self.earliest) .finish() } } impl<'h, H: ?Sized + AsRef<[u8]>> From<&'h H> for Input<'h> { fn from(haystack: &'h H) -> Input<'h> { Input::new(haystack) } } /// A representation of a span reported by a regex engine. /// /// A span corresponds to the starting and ending _byte offsets_ of a /// contiguous region of bytes. The starting offset is inclusive while the /// ending offset is exclusive. That is, a span is a half-open interval. /// /// A span is used to report the offsets of a match, but it is also used to /// convey which region of a haystack should be searched via routines like /// [`Input::span`]. /// /// This is basically equivalent to a `std::ops::Range`, except this /// type implements `Copy` which makes it more ergonomic to use in the context /// of this crate. Like a range, this implements `Index` for `[u8]` and `str`, /// and `IndexMut` for `[u8]`. For convenience, this also impls `From`, /// which means things like `Span::from(5..10)` work. #[derive(Clone, Copy, Eq, Hash, PartialEq)] pub struct Span { /// The start offset of the span, inclusive. pub start: usize, /// The end offset of the span, exclusive. pub end: usize, } impl Span { /// Returns this span as a range. #[inline] pub fn range(&self) -> Range { Range::from(*self) } /// Returns true when this span is empty. That is, when `start >= end`. #[inline] pub fn is_empty(&self) -> bool { self.start >= self.end } /// Returns the length of this span. /// /// This returns `0` in precisely the cases that `is_empty` returns `true`. #[inline] pub fn len(&self) -> usize { self.end.saturating_sub(self.start) } /// Returns true when the given offset is contained within this span. /// /// Note that an empty span contains no offsets and will always return /// false. 
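///
/// # Example
///
/// A brief illustration, using only the `Span` APIs defined in this module,
/// of which offsets a span does and does not contain:
///
/// ```
/// use regex_automata::Span;
///
/// let span = Span::from(2..5);
/// assert!(!span.contains(1));
/// assert!(span.contains(2));
/// assert!(span.contains(4));
///
/// // An empty span contains no offsets.
/// assert!(!Span::from(3..3).contains(3));
/// ```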
#[inline] pub fn contains(&self, offset: usize) -> bool { !self.is_empty() && self.start <= offset && offset <= self.end } /// Returns a new span with `offset` added to this span's `start` and `end` /// values. #[inline] pub fn offset(&self, offset: usize) -> Span { Span { start: self.start + offset, end: self.end + offset } } } impl core::fmt::Debug for Span { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { write!(f, "{}..{}", self.start, self.end) } } impl core::ops::Index for [u8] { type Output = [u8]; #[inline] fn index(&self, index: Span) -> &[u8] { &self[index.range()] } } impl core::ops::IndexMut for [u8] { #[inline] fn index_mut(&mut self, index: Span) -> &mut [u8] { &mut self[index.range()] } } impl core::ops::Index for str { type Output = str; #[inline] fn index(&self, index: Span) -> &str { &self[index.range()] } } impl From> for Span { #[inline] fn from(range: Range) -> Span { Span { start: range.start, end: range.end } } } impl From for Range { #[inline] fn from(span: Span) -> Range { Range { start: span.start, end: span.end } } } impl PartialEq> for Span { #[inline] fn eq(&self, range: &Range) -> bool { self.start == range.start && self.end == range.end } } impl PartialEq for Range { #[inline] fn eq(&self, span: &Span) -> bool { self.start == span.start && self.end == span.end } } /// A representation of "half" of a match reported by a DFA. /// /// This is called a "half" match because it only includes the end location (or /// start location for a reverse search) of a match. This corresponds to the /// information that a single DFA scan can report. Getting the other half of /// the match requires a second scan with a reversed DFA. /// /// A half match also includes the pattern that matched. The pattern is /// identified by an ID, which corresponds to its position (starting from `0`) /// relative to other patterns used to construct the corresponding DFA. If only /// a single pattern is provided to the DFA, then all matches are guaranteed to /// have a pattern ID of `0`. #[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)] pub struct HalfMatch { /// The pattern ID. pattern: PatternID, /// The offset of the match. /// /// For forward searches, the offset is exclusive. For reverse searches, /// the offset is inclusive. offset: usize, } impl HalfMatch { /// Create a new half match from a pattern ID and a byte offset. #[inline] pub fn new(pattern: PatternID, offset: usize) -> HalfMatch { HalfMatch { pattern, offset } } /// Create a new half match from a pattern ID and a byte offset. /// /// This is like [`HalfMatch::new`], but accepts a `usize` instead of a /// [`PatternID`]. This panics if the given `usize` is not representable /// as a `PatternID`. #[inline] pub fn must(pattern: usize, offset: usize) -> HalfMatch { HalfMatch::new(PatternID::new(pattern).unwrap(), offset) } /// Returns the ID of the pattern that matched. /// /// The ID of a pattern is derived from the position in which it was /// originally inserted into the corresponding DFA. The first pattern has /// identifier `0`, and each subsequent pattern is `1`, `2` and so on. #[inline] pub fn pattern(&self) -> PatternID { self.pattern } /// The position of the match. /// /// If this match was produced by a forward search, then the offset is /// exclusive. If this match was produced by a reverse search, then the /// offset is inclusive. #[inline] pub fn offset(&self) -> usize { self.offset } } /// A representation of a match reported by a regex engine. 
/// /// A match has two essential pieces of information: the [`PatternID`] that /// matches, and the [`Span`] of the match in a haystack. /// /// The pattern is identified by an ID, which corresponds to its position /// (starting from `0`) relative to other patterns used to construct the /// corresponding regex engine. If only a single pattern is provided, then all /// matches are guaranteed to have a pattern ID of `0`. /// /// Every match reported by a regex engine guarantees that its span has its /// start offset as less than or equal to its end offset. #[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)] pub struct Match { /// The pattern ID. pattern: PatternID, /// The underlying match span. span: Span, } impl Match { /// Create a new match from a pattern ID and a span. /// /// This constructor is generic over how a span is provided. While /// a [`Span`] may be given directly, one may also provide a /// `std::ops::Range`. /// /// # Panics /// /// This panics if `end < start`. /// /// # Example /// /// This shows how to create a match for the first pattern in a regex /// object using convenient range syntax. /// /// ``` /// use regex_automata::{Match, PatternID}; /// /// let m = Match::new(PatternID::ZERO, 5..10); /// assert_eq!(0, m.pattern().as_usize()); /// assert_eq!(5, m.start()); /// assert_eq!(10, m.end()); /// ``` #[inline] pub fn new>(pattern: PatternID, span: S) -> Match { let span: Span = span.into(); assert!(span.start <= span.end, "invalid match span"); Match { pattern, span } } /// Create a new match from a pattern ID and a byte offset span. /// /// This constructor is generic over how a span is provided. While /// a [`Span`] may be given directly, one may also provide a /// `std::ops::Range`. /// /// This is like [`Match::new`], but accepts a `usize` instead of a /// [`PatternID`]. This panics if the given `usize` is not representable /// as a `PatternID`. /// /// # Panics /// /// This panics if `end < start` or if `pattern > PatternID::MAX`. /// /// # Example /// /// This shows how to create a match for the third pattern in a regex /// object using convenient range syntax. /// /// ``` /// use regex_automata::Match; /// /// let m = Match::must(3, 5..10); /// assert_eq!(3, m.pattern().as_usize()); /// assert_eq!(5, m.start()); /// assert_eq!(10, m.end()); /// ``` #[inline] pub fn must>(pattern: usize, span: S) -> Match { Match::new(PatternID::must(pattern), span) } /// Returns the ID of the pattern that matched. /// /// The ID of a pattern is derived from the position in which it was /// originally inserted into the corresponding regex engine. The first /// pattern has identifier `0`, and each subsequent pattern is `1`, `2` and /// so on. #[inline] pub fn pattern(&self) -> PatternID { self.pattern } /// The starting position of the match. /// /// This is a convenience routine for `Match::span().start`. #[inline] pub fn start(&self) -> usize { self.span().start } /// The ending position of the match. /// /// This is a convenience routine for `Match::span().end`. #[inline] pub fn end(&self) -> usize { self.span().end } /// Returns the match span as a range. /// /// This is a convenience routine for `Match::span().range()`. #[inline] pub fn range(&self) -> core::ops::Range { self.span().range() } /// Returns the span for this match. #[inline] pub fn span(&self) -> Span { self.span } /// Returns true when the span in this match is empty. /// /// An empty match can only be returned when the regex itself can match /// the empty string. 
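///
/// # Example
///
/// A brief sketch, using the constructors above, of an empty and a
/// non-empty match:
///
/// ```
/// use regex_automata::Match;
///
/// // A match whose span has the same start and end offset is empty.
/// let m = Match::must(0, 5..5);
/// assert!(m.is_empty());
///
/// let m = Match::must(0, 5..10);
/// assert!(!m.is_empty());
/// ```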
#[inline] pub fn is_empty(&self) -> bool { self.span().is_empty() } /// Returns the length of this match. /// /// This returns `0` in precisely the cases that `is_empty` returns `true`. #[inline] pub fn len(&self) -> usize { self.span().len() } } /// A set of `PatternID`s. /// /// A set of pattern identifiers is useful for recording which patterns have /// matched a particular haystack. A pattern set _only_ includes pattern /// identifiers. It does not include offset information. /// /// # Example /// /// This shows basic usage of a set. /// /// ``` /// use regex_automata::{PatternID, PatternSet}; /// /// let pid1 = PatternID::must(5); /// let pid2 = PatternID::must(8); /// // Create a new empty set. /// let mut set = PatternSet::new(10); /// // Insert pattern IDs. /// set.insert(pid1); /// set.insert(pid2); /// // Test membership. /// assert!(set.contains(pid1)); /// assert!(set.contains(pid2)); /// // Get all members. /// assert_eq!( /// vec![5, 8], /// set.iter().map(|p| p.as_usize()).collect::>(), /// ); /// // Clear the set. /// set.clear(); /// // Test that it is indeed empty. /// assert!(set.is_empty()); /// ``` #[cfg(feature = "alloc")] #[derive(Clone, Debug, Eq, PartialEq)] pub struct PatternSet { /// The number of patterns set to 'true' in this set. len: usize, /// A map from PatternID to boolean of whether a pattern matches or not. /// /// This should probably be a bitset, but it's probably unlikely to matter /// much in practice. /// /// The main downside of this representation (and similarly for a bitset) /// is that iteration scales with the capacity of the set instead of /// the length of the set. This doesn't seem likely to be a problem in /// practice. /// /// Another alternative is to just use a 'SparseSet' for this. It does use /// more memory (quite a bit more), but that seems fine I think compared /// to the memory being used by the regex engine. The real hiccup with /// it is that it yields pattern IDs in the order they were inserted. /// Which is actually kind of nice, but at the time of writing, pattern /// IDs are yielded in ascending order in the regex crate RegexSet API. /// If we did change to 'SparseSet', we could provide an additional /// 'iter_match_order' iterator, but keep the ascending order one for /// compatibility. which: alloc::boxed::Box<[bool]>, } #[cfg(feature = "alloc")] impl PatternSet { /// Create a new set of pattern identifiers with the given capacity. /// /// The given capacity typically corresponds to (at least) the number of /// patterns in a compiled regex object. /// /// # Panics /// /// This panics if the given capacity exceeds [`PatternID::LIMIT`]. This is /// impossible if you use the `pattern_len()` method as defined on any of /// the regex engines in this crate. Namely, a regex will fail to build by /// returning an error if the number of patterns given to it exceeds the /// limit. Therefore, the number of patterns in a valid regex is always /// a correct capacity to provide here. pub fn new(capacity: usize) -> PatternSet { assert!( capacity <= PatternID::LIMIT, "pattern set capacity exceeds limit of {}", PatternID::LIMIT, ); PatternSet { len: 0, which: alloc::vec![false; capacity].into_boxed_slice(), } } /// Clear this set such that it contains no pattern IDs. pub fn clear(&mut self) { self.len = 0; for matched in self.which.iter_mut() { *matched = false; } } /// Return true if and only if the given pattern identifier is in this set. 
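///
/// # Example
///
/// A short sketch of membership testing, mirroring the type-level example
/// above:
///
/// ```
/// use regex_automata::{PatternID, PatternSet};
///
/// let mut set = PatternSet::new(10);
/// set.insert(PatternID::must(3));
/// assert!(set.contains(PatternID::must(3)));
/// assert!(!set.contains(PatternID::must(4)));
/// ```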
pub fn contains(&self, pid: PatternID) -> bool { pid.as_usize() < self.capacity() && self.which[pid] } /// Insert the given pattern identifier into this set and return `true` if /// the given pattern ID was not previously in this set. /// /// If the pattern identifier is already in this set, then this is a no-op. /// /// Use [`PatternSet::try_insert`] for a fallible version of this routine. /// /// # Panics /// /// This panics if this pattern set has insufficient capacity to /// store the given pattern ID. pub fn insert(&mut self, pid: PatternID) -> bool { self.try_insert(pid) .expect("PatternSet should have sufficient capacity") } /// Insert the given pattern identifier into this set and return `true` if /// the given pattern ID was not previously in this set. /// /// If the pattern identifier is already in this set, then this is a no-op. /// /// # Errors /// /// This returns an error if this pattern set has insufficient capacity to /// store the given pattern ID. pub fn try_insert( &mut self, pid: PatternID, ) -> Result { if pid.as_usize() >= self.capacity() { return Err(PatternSetInsertError { attempted: pid, capacity: self.capacity(), }); } if self.which[pid] { return Ok(false); } self.len += 1; self.which[pid] = true; Ok(true) } /* // This is currently commented out because it is unused and it is unclear // whether it's useful or not. What's the harm in having it? When, if // we ever wanted to change our representation to a 'SparseSet', then // supporting this method would be a bit tricky. So in order to keep some // API evolution flexibility, we leave it out for now. /// Remove the given pattern identifier from this set. /// /// If the pattern identifier was not previously in this set, then this /// does not change the set and returns `false`. /// /// # Panics /// /// This panics if `pid` exceeds the capacity of this set. pub fn remove(&mut self, pid: PatternID) -> bool { if !self.which[pid] { return false; } self.len -= 1; self.which[pid] = false; true } */ /// Return true if and only if this set has no pattern identifiers in it. pub fn is_empty(&self) -> bool { self.len() == 0 } /// Return true if and only if this set has the maximum number of pattern /// identifiers in the set. This occurs precisely when `PatternSet::len() /// == PatternSet::capacity()`. /// /// This particular property is useful to test because it may allow one to /// stop a search earlier than you might otherwise. Namely, if a search is /// only reporting which patterns match a haystack and if you know all of /// the patterns match at a given point, then there's no new information /// that can be learned by continuing the search. (Because a pattern set /// does not keep track of offset information.) pub fn is_full(&self) -> bool { self.len() == self.capacity() } /// Returns the total number of pattern identifiers in this set. pub fn len(&self) -> usize { self.len } /// Returns the total number of pattern identifiers that may be stored /// in this set. /// /// This is guaranteed to be less than or equal to [`PatternID::LIMIT`]. /// /// Typically, the capacity of a pattern set matches the number of patterns /// in a regex object with which you are searching. pub fn capacity(&self) -> usize { self.which.len() } /// Returns an iterator over all pattern identifiers in this set. /// /// The iterator yields pattern identifiers in ascending order, starting /// at zero. 
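///
/// # Example
///
/// A small sketch showing that iteration order is ascending by pattern ID,
/// regardless of insertion order:
///
/// ```
/// use regex_automata::{PatternID, PatternSet};
///
/// let mut set = PatternSet::new(10);
/// set.insert(PatternID::must(8));
/// set.insert(PatternID::must(2));
/// let got: Vec<usize> = set.iter().map(|pid| pid.as_usize()).collect();
/// assert_eq!(vec![2, 8], got);
/// ```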
pub fn iter(&self) -> PatternSetIter<'_> { PatternSetIter { it: self.which.iter().enumerate() } } } /// An error that occurs when a `PatternID` failed to insert into a /// `PatternSet`. /// /// An insert fails when the given `PatternID` exceeds the configured capacity /// of the `PatternSet`. /// /// This error is created by the [`PatternSet::try_insert`] routine. #[cfg(feature = "alloc")] #[derive(Clone, Debug)] pub struct PatternSetInsertError { attempted: PatternID, capacity: usize, } #[cfg(feature = "std")] impl std::error::Error for PatternSetInsertError {} #[cfg(feature = "alloc")] impl core::fmt::Display for PatternSetInsertError { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { write!( f, "failed to insert pattern ID {} into pattern set \ with insufficiet capacity of {}", self.attempted.as_usize(), self.capacity, ) } } /// An iterator over all pattern identifiers in a [`PatternSet`]. /// /// The lifetime parameter `'a` refers to the lifetime of the pattern set being /// iterated over. /// /// This iterator is created by the [`PatternSet::iter`] method. #[cfg(feature = "alloc")] #[derive(Clone, Debug)] pub struct PatternSetIter<'a> { it: core::iter::Enumerate>, } #[cfg(feature = "alloc")] impl<'a> Iterator for PatternSetIter<'a> { type Item = PatternID; fn next(&mut self) -> Option { while let Some((index, &yes)) = self.it.next() { if yes { // Only valid 'PatternID' values can be inserted into the set // and construction of the set panics if the capacity would // permit storing invalid pattern IDs. Thus, 'yes' is only true // precisely when 'index' corresponds to a valid 'PatternID'. return Some(PatternID::new_unchecked(index)); } } None } fn size_hint(&self) -> (usize, Option) { self.it.size_hint() } } #[cfg(feature = "alloc")] impl<'a> DoubleEndedIterator for PatternSetIter<'a> { fn next_back(&mut self) -> Option { while let Some((index, &yes)) = self.it.next_back() { if yes { // Only valid 'PatternID' values can be inserted into the set // and construction of the set panics if the capacity would // permit storing invalid pattern IDs. Thus, 'yes' is only true // precisely when 'index' corresponds to a valid 'PatternID'. return Some(PatternID::new_unchecked(index)); } } None } } /// The type of anchored search to perform. /// /// This is *almost* a boolean option. That is, you can either do an unanchored /// search for any pattern in a regex, or you can do an anchored search for any /// pattern in a regex. /// /// A third option exists that, assuming the regex engine supports it, permits /// you to do an anchored search for a specific pattern. /// /// Note that there is no way to run an unanchored search for a specific /// pattern. If you need that, you'll need to build separate regexes for each /// pattern. /// /// # Errors /// /// If a regex engine does not support the anchored mode selected, then the /// regex engine will return an error. While any non-trivial regex engine /// should support at least one of the available anchored modes, there is no /// singular mode that is guaranteed to be universally supported. Some regex /// engines might only support unanchored searches (DFAs compiled without /// anchored starting states) and some regex engines might only support /// anchored searches (like the one-pass DFA). /// /// The specific error returned is a [`MatchError`] with a /// [`MatchErrorKind::UnsupportedAnchored`] kind. The kind includes the /// `Anchored` value given that is unsupported. 
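///
/// For example, such an error can be constructed and inspected as follows.
/// (This is a small sketch that uses only the `MatchError` constructors
/// defined in this module; it does not run any particular regex engine.)
///
/// ```
/// use regex_automata::{Anchored, MatchError, MatchErrorKind};
///
/// let err = MatchError::unsupported_anchored(Anchored::Yes);
/// match *err.kind() {
///     MatchErrorKind::UnsupportedAnchored { mode } => {
///         assert_eq!(Anchored::Yes, mode);
///     }
///     // `MatchErrorKind` is non-exhaustive, so a wildcard arm is required.
///     _ => unreachable!(),
/// }
/// ```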
/// /// Note that regex engines should report "no match" if, for example, an /// `Anchored::Pattern` is provided with an invalid pattern ID _but_ where /// anchored searches for a specific pattern are supported. This is smooths out /// behavior such that it's possible to guarantee that an error never occurs /// based on how the regex engine is configured. All regex engines in this /// crate report "no match" when searching for an invalid pattern ID, but where /// searching for a valid pattern ID is otherwise supported. /// /// # Example /// /// This example shows how to use the various `Anchored` modes to run a /// search. We use the [`PikeVM`](crate::nfa::thompson::pikevm::PikeVM) /// because it supports all modes unconditionally. Some regex engines, like /// the [`onepass::DFA`](crate::dfa::onepass::DFA) cannot support unanchored /// searches. /// /// ``` /// # if cfg!(miri) { return Ok(()); } // miri takes too long /// use regex_automata::{ /// nfa::thompson::pikevm::PikeVM, /// Anchored, Input, Match, PatternID, /// }; /// /// let re = PikeVM::new_many(&[ /// r"Mrs. \w+", /// r"Miss \w+", /// r"Mr. \w+", /// r"Ms. \w+", /// ])?; /// let mut cache = re.create_cache(); /// let hay = "Hello Mr. Springsteen!"; /// /// // The default is to do an unanchored search. /// assert_eq!(Some(Match::must(2, 6..21)), re.find(&mut cache, hay)); /// // Explicitly ask for an unanchored search. Same as above. /// let input = Input::new(hay).anchored(Anchored::No); /// assert_eq!(Some(Match::must(2, 6..21)), re.find(&mut cache, hay)); /// /// // Now try an anchored search. Since the match doesn't start at the /// // beginning of the haystack, no match is found! /// let input = Input::new(hay).anchored(Anchored::Yes); /// assert_eq!(None, re.find(&mut cache, input)); /// /// // We can try an anchored search again, but move the location of where /// // we start the search. Note that the offsets reported are still in /// // terms of the overall haystack and not relative to where we started /// // the search. /// let input = Input::new(hay).anchored(Anchored::Yes).range(6..); /// assert_eq!(Some(Match::must(2, 6..21)), re.find(&mut cache, input)); /// /// // Now try an anchored search for a specific pattern. We specifically /// // choose a pattern that we know doesn't match to prove that the search /// // only looks for the pattern we provide. /// let input = Input::new(hay) /// .anchored(Anchored::Pattern(PatternID::must(1))) /// .range(6..); /// assert_eq!(None, re.find(&mut cache, input)); /// /// // But if we switch it to the pattern that we know matches, then we find /// // the match. /// let input = Input::new(hay) /// .anchored(Anchored::Pattern(PatternID::must(2))) /// .range(6..); /// assert_eq!(Some(Match::must(2, 6..21)), re.find(&mut cache, input)); /// /// # Ok::<(), Box>(()) /// ``` #[derive(Clone, Copy, Debug, Eq, PartialEq)] pub enum Anchored { /// Run an unanchored search. This means a match may occur anywhere at or /// after the start position of the search. /// /// This search can return a match for any pattern in the regex. No, /// Run an anchored search. This means that a match must begin at the /// start position of the search. /// /// This search can return a match for any pattern in the regex. Yes, /// Run an anchored search for a specific pattern. This means that a match /// must be for the given pattern and must begin at the start position of /// the search. 
Pattern(PatternID), } impl Anchored { /// Returns true if and only if this anchor mode corresponds to any kind of /// anchored search. /// /// # Example /// /// This examples shows that both `Anchored::Yes` and `Anchored::Pattern` /// are considered anchored searches. /// /// ``` /// use regex_automata::{Anchored, PatternID}; /// /// assert!(!Anchored::No.is_anchored()); /// assert!(Anchored::Yes.is_anchored()); /// assert!(Anchored::Pattern(PatternID::ZERO).is_anchored()); /// ``` #[inline] pub fn is_anchored(&self) -> bool { matches!(*self, Anchored::Yes | Anchored::Pattern(_)) } /// Returns the pattern ID associated with this configuration if it is an /// anchored search for a specific pattern. Otherwise `None` is returned. /// /// # Example /// /// ``` /// use regex_automata::{Anchored, PatternID}; /// /// assert_eq!(None, Anchored::No.pattern()); /// assert_eq!(None, Anchored::Yes.pattern()); /// /// let pid = PatternID::must(5); /// assert_eq!(Some(pid), Anchored::Pattern(pid).pattern()); /// ``` #[inline] pub fn pattern(&self) -> Option { match *self { Anchored::Pattern(pid) => Some(pid), _ => None, } } } /// The kind of match semantics to use for a regex pattern. /// /// The default match kind is `LeftmostFirst`, and this corresponds to the /// match semantics used by most backtracking engines, such as Perl. /// /// # Leftmost first or "preference order" match semantics /// /// Leftmost-first semantics determine which match to report when there are /// multiple paths through a regex that match at the same position. The tie is /// essentially broken by how a backtracker would behave. For example, consider /// running the regex `foofoofoo|foofoo|foo` on the haystack `foofoo`. In this /// case, both the `foofoo` and `foo` branches match at position `0`. So should /// the end of the match be `3` or `6`? /// /// A backtracker will conceptually work by trying `foofoofoo` and failing. /// Then it will try `foofoo`, find the match and stop there. Thus, the /// leftmost-first match position is `6`. This is called "leftmost-first" or /// "preference order" because the order of the branches as written in the /// regex pattern is what determines how to break the tie. /// /// (Note that leftmost-longest match semantics, which break ties by always /// taking the longest matching string, are not currently supported by this /// crate. These match semantics tend to be found in POSIX regex engines.) /// /// This example shows how leftmost-first semantics work, and how it even /// applies to multi-pattern regexes: /// /// ``` /// use regex_automata::{ /// nfa::thompson::pikevm::PikeVM, /// Match, /// }; /// /// let re = PikeVM::new_many(&[ /// r"foofoofoo", /// r"foofoo", /// r"foo", /// ])?; /// let mut cache = re.create_cache(); /// let got: Vec = re.find_iter(&mut cache, "foofoo").collect(); /// let expected = vec![Match::must(1, 0..6)]; /// assert_eq!(expected, got); /// /// # Ok::<(), Box>(()) /// ``` /// /// # All matches /// /// The `All` match semantics report any and all matches, and generally will /// attempt to match as much as possible. It doesn't respect any sort of match /// priority at all, so things like non-greedy matching don't work in this /// mode. /// /// The fact that non-greedy matching doesn't work generally makes most forms /// of unanchored non-overlapping searches have unintuitive behavior. Namely, /// unanchored searches behave as if there is a `(?s-u:.)*?` prefix at the /// beginning of the pattern, which is specifically non-greedy. 
Since it will /// be treated as greedy in `All` match semantics, this generally means that /// it will first attempt to consume all of the haystack and is likely to wind /// up skipping matches. /// /// Generally speaking, `All` should only be used in two circumstances: /// /// * When running an anchored search and there is a desire to match as much as /// possible. For example, when building a reverse regex matcher to find the /// start of a match after finding the end. In this case, the reverse search /// is anchored to the end of the match found by the forward search. /// * When running overlapping searches. Since `All` encodes all possible /// matches, this is generally what you want for an overlapping search. If you /// try to use leftmost-first in an overlapping search, it is likely to produce /// counter-intuitive results since leftmost-first specifically excludes some /// matches from its underlying finite state machine. /// /// This example demonstrates the counter-intuitive behavior of `All` semantics /// when using a standard leftmost unanchored search: /// /// ``` /// use regex_automata::{ /// nfa::thompson::pikevm::PikeVM, /// Match, MatchKind, /// }; /// /// let re = PikeVM::builder() /// .configure(PikeVM::config().match_kind(MatchKind::All)) /// .build("foo")?; /// let hay = "first foo second foo wat"; /// let mut cache = re.create_cache(); /// let got: Vec = re.find_iter(&mut cache, hay).collect(); /// // Notice that it completely skips the first 'foo'! /// let expected = vec![Match::must(0, 17..20)]; /// assert_eq!(expected, got); /// /// # Ok::<(), Box>(()) /// ``` /// /// This second example shows how `All` semantics are useful for an overlapping /// search. Note that we use lower level lazy DFA APIs here since the NFA /// engines only currently support a very limited form of overlapping search. /// /// ``` /// use regex_automata::{ /// hybrid::dfa::{DFA, OverlappingState}, /// HalfMatch, Input, MatchKind, /// }; /// /// let re = DFA::builder() /// // If we didn't set 'All' semantics here, then the regex would only /// // match 'foo' at offset 3 and nothing else. Why? Because the state /// // machine implements preference order and knows that the 'foofoo' and /// // 'foofoofoo' branches can never match since 'foo' will always match /// // when they match and take priority. /// .configure(DFA::config().match_kind(MatchKind::All)) /// .build(r"foo|foofoo|foofoofoo")?; /// let mut cache = re.create_cache(); /// let mut state = OverlappingState::start(); /// let input = Input::new("foofoofoo"); /// let mut got = vec![]; /// loop { /// re.try_search_overlapping_fwd(&mut cache, &input, &mut state)?; /// let m = match state.get_match() { /// None => break, /// Some(m) => m, /// }; /// got.push(m); /// } /// let expected = vec![ /// HalfMatch::must(0, 3), /// HalfMatch::must(0, 6), /// HalfMatch::must(0, 9), /// ]; /// assert_eq!(expected, got); /// /// # Ok::<(), Box>(()) /// ``` #[non_exhaustive] #[derive(Clone, Copy, Debug, Eq, PartialEq)] pub enum MatchKind { /// Report all possible matches. All, /// Report only the leftmost matches. When multiple leftmost matches exist, /// report the match corresponding to the part of the regex that appears /// first in the syntax. LeftmostFirst, // There is prior art in RE2 that shows that we should be able to add // LeftmostLongest too. The tricky part of it is supporting ungreedy // repetitions. 
Instead of treating all NFA states as having equivalent // priority (as in 'All') or treating all NFA states as having distinct // priority based on order (as in 'LeftmostFirst'), we instead group NFA // states into sets, and treat members of each set as having equivalent // priority, but having greater priority than all following members // of different sets. // // However, it's not clear whether it's really worth adding this. After // all, leftmost-longest can be emulated when using literals by using // leftmost-first and sorting the literals by length in descending order. // However, this won't work for arbitrary regexes. e.g., `\w|\w\w` will // always match `a` in `ab` when using leftmost-first, but leftmost-longest // would match `ab`. } impl MatchKind { #[cfg(feature = "alloc")] pub(crate) fn continue_past_first_match(&self) -> bool { *self == MatchKind::All } } impl Default for MatchKind { fn default() -> MatchKind { MatchKind::LeftmostFirst } } /// An error indicating that a search stopped before reporting whether a /// match exists or not. /// /// To be very clear, this error type implies that one cannot assume that no /// matches occur, since the search stopped before completing. That is, if /// you're looking for information about where a search determined that no /// match can occur, then this error type does *not* give you that. (Indeed, at /// the time of writing, if you need such a thing, you have to write your own /// search routine.) /// /// Normally, when one searches for something, the response is either an /// affirmative "it was found at this location" or a negative "not found at /// all." However, in some cases, a regex engine can be configured to stop its /// search before concluding whether a match exists or not. When this happens, /// it may be important for the caller to know why the regex engine gave up and /// where in the input it gave up at. This error type exposes the 'why' and the /// 'where.' /// /// For example, the DFAs provided by this library generally cannot correctly /// implement Unicode word boundaries. Instead, they provide an option to /// eagerly support them on ASCII text (since Unicode word boundaries are /// equivalent to ASCII word boundaries when searching ASCII text), but will /// "give up" if a non-ASCII byte is seen. In such cases, one is usually /// required to either report the failure to the caller (unergonomic) or /// otherwise fall back to some other regex engine (ergonomic, but potentially /// costly). /// /// More generally, some regex engines offer the ability for callers to specify /// certain bytes that will trigger the regex engine to automatically quit if /// they are seen. /// /// Still yet, there may be other reasons for a failed match. For example, /// the hybrid DFA provided by this crate can be configured to give up if it /// believes that it is not efficient. This in turn permits callers to choose a /// different regex engine. /// /// (Note that DFAs are configured by default to never quit or give up in this /// fashion. For example, by default, a DFA will fail to build if the regex /// pattern contains a Unicode word boundary. One needs to opt into the "quit" /// behavior via options, like /// [`hybrid::dfa::Config::unicode_word_boundary`](crate::hybrid::dfa::Config::unicode_word_boundary).) /// /// There are a couple other ways a search /// can fail. 
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct MatchError(
    #[cfg(feature = "alloc")] alloc::boxed::Box<MatchErrorKind>,
    #[cfg(not(feature = "alloc"))] MatchErrorKind,
);

impl MatchError {
    /// Create a new error value with the given kind.
    ///
    /// This is a more verbose version of the kind-specific constructors,
    /// e.g., `MatchError::quit`.
    pub fn new(kind: MatchErrorKind) -> MatchError {
        #[cfg(feature = "alloc")]
        {
            MatchError(alloc::boxed::Box::new(kind))
        }
        #[cfg(not(feature = "alloc"))]
        {
            MatchError(kind)
        }
    }

    /// Returns a reference to the underlying error kind.
    pub fn kind(&self) -> &MatchErrorKind {
        &self.0
    }

    /// Create a new "quit" error. The given `byte` corresponds to the value
    /// that tripped a search's quit condition, and `offset` corresponds to
    /// the location in the haystack at which the search quit.
    ///
    /// This is the same as calling `MatchError::new` with a
    /// [`MatchErrorKind::Quit`] kind.
    pub fn quit(byte: u8, offset: usize) -> MatchError {
        MatchError::new(MatchErrorKind::Quit { byte, offset })
    }

    /// Create a new "gave up" error. The given `offset` corresponds to the
    /// location in the haystack at which the search gave up.
    ///
    /// This is the same as calling `MatchError::new` with a
    /// [`MatchErrorKind::GaveUp`] kind.
    pub fn gave_up(offset: usize) -> MatchError {
        MatchError::new(MatchErrorKind::GaveUp { offset })
    }

    /// Create a new "haystack too long" error. The given `len` corresponds
    /// to the length of the haystack that was problematic.
    ///
    /// This is the same as calling `MatchError::new` with a
    /// [`MatchErrorKind::HaystackTooLong`] kind.
    pub fn haystack_too_long(len: usize) -> MatchError {
        MatchError::new(MatchErrorKind::HaystackTooLong { len })
    }

    /// Create a new "unsupported anchored" error. This occurs when the
    /// caller requests a search with an anchor mode that is not supported by
    /// the regex engine.
    ///
    /// This is the same as calling `MatchError::new` with a
    /// [`MatchErrorKind::UnsupportedAnchored`] kind.
    pub fn unsupported_anchored(mode: Anchored) -> MatchError {
        MatchError::new(MatchErrorKind::UnsupportedAnchored { mode })
    }
}

/// The underlying kind of a [`MatchError`].
///
/// This is a **non-exhaustive** enum. That means new variants may be added
/// in a semver-compatible release.
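///
/// The following is a small sketch (the values are arbitrary) showing how
/// the kind-specific constructors on [`MatchError`] correspond to variants
/// of this enum:
///
/// ```
/// use regex_automata::{MatchError, MatchErrorKind};
///
/// // 'MatchError::quit' is just shorthand for wrapping the corresponding
/// // kind with 'MatchError::new'.
/// let err = MatchError::quit(0xFF, 5);
/// assert_eq!(
///     *err.kind(),
///     MatchErrorKind::Quit { byte: 0xFF, offset: 5 },
/// );
/// ```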
#[non_exhaustive]
#[derive(Clone, Debug, Eq, PartialEq)]
pub enum MatchErrorKind {
    /// The search saw a "quit" byte at which it was instructed to stop
    /// searching.
    Quit {
        /// The "quit" byte that was observed and that caused the search to
        /// stop.
        byte: u8,
        /// The offset at which the quit byte was observed.
        offset: usize,
    },
    /// The search, based on heuristics, determined that it would be better
    /// to stop, typically to provide the caller an opportunity to use an
    /// alternative regex engine.
    ///
    /// Currently, the only way for this to occur is via the lazy DFA and
    /// only when it is configured to do so (it will not return this error by
    /// default).
    GaveUp {
        /// The offset at which the search stopped. This corresponds to the
        /// position immediately following the last byte scanned.
        offset: usize,
    },
    /// This error occurs if the haystack given to the regex engine was too
    /// long to be searched. This occurs, for example, with regex engines
    /// like the bounded backtracker that have a configurable fixed amount of
    /// capacity that is tied to the length of the haystack. Anything beyond
    /// that configured limit will result in an error at search time.
    HaystackTooLong {
        /// The length of the haystack that exceeded the limit.
        len: usize,
    },
    /// An error indicating that a particular type of anchored search was
    /// requested, but that the regex engine does not support it.
    ///
    /// Note that this error should not be returned by a regex engine simply
    /// because the pattern ID is invalid (i.e., equal to or exceeds the
    /// number of patterns in the regex). In that case, the regex engine
    /// should report a non-match.
    UnsupportedAnchored {
        /// The anchored mode given that is unsupported.
        mode: Anchored,
    },
}

#[cfg(feature = "std")]
impl std::error::Error for MatchError {}

impl core::fmt::Display for MatchError {
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
        match *self.kind() {
            MatchErrorKind::Quit { byte, offset } => write!(
                f,
                "quit search after observing byte {:?} at offset {}",
                DebugByte(byte),
                offset,
            ),
            MatchErrorKind::GaveUp { offset } => {
                write!(f, "gave up searching at offset {}", offset)
            }
            MatchErrorKind::HaystackTooLong { len } => {
                write!(f, "haystack of length {} is too long", len)
            }
            MatchErrorKind::UnsupportedAnchored { mode: Anchored::Yes } => {
                write!(f, "anchored searches are not supported or enabled")
            }
            MatchErrorKind::UnsupportedAnchored { mode: Anchored::No } => {
                write!(f, "unanchored searches are not supported or enabled")
            }
            MatchErrorKind::UnsupportedAnchored {
                mode: Anchored::Pattern(pid),
            } => {
                write!(
                    f,
                    "anchored searches for a specific pattern ({}) are \
                     not supported or enabled",
                    pid.as_usize(),
                )
            }
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // We test that our 'MatchError' type is the size we expect. This isn't
    // an API guarantee, but if the size increases, we really want to make
    // sure we decide to do that intentionally. So this should be a speed
    // bump. And in general, we should not increase the size without a very
    // good reason.
    //
    // Why? Because low level search APIs return Result<_, MatchError>. When
    // MatchError gets bigger, so too does the Result type.
    //
    // Now, when 'alloc' is enabled, we do box the error, which de-emphasizes
    // the importance of keeping a small error type. But without 'alloc', we
    // still want things to be small.
    #[test]
    fn match_error_size() {
        let expected_size = if cfg!(feature = "alloc") {
            core::mem::size_of::<usize>()
        } else {
            2 * core::mem::size_of::<usize>()
        };
        assert_eq!(expected_size, core::mem::size_of::<MatchError>());
    }

    // Same as above, but for the underlying match error kind.
    #[cfg(target_pointer_width = "64")]
    #[test]
    fn match_error_kind_size() {
        let expected_size = 2 * core::mem::size_of::<usize>();
        assert_eq!(expected_size, core::mem::size_of::<MatchErrorKind>());
    }

    #[cfg(target_pointer_width = "32")]
    #[test]
    fn match_error_kind_size() {
        let expected_size = 3 * core::mem::size_of::<usize>();
        assert_eq!(expected_size, core::mem::size_of::<MatchErrorKind>());
    }
}
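
// A sketch of a couple of additional sanity checks, kept separate from the
// size tests above and not part of the original test suite: they exercise
// the `Default` impl for `MatchKind` and its internal
// `continue_past_first_match` helper. The module name is arbitrary.
#[cfg(test)]
mod match_kind_sanity_tests {
    use super::*;

    // The default match kind is leftmost-first.
    #[test]
    fn default_match_kind_is_leftmost_first() {
        assert_eq!(MatchKind::LeftmostFirst, MatchKind::default());
    }

    // Only 'All' semantics instruct a search to keep going after its first
    // match.
    #[cfg(feature = "alloc")]
    #[test]
    fn only_all_continues_past_first_match() {
        assert!(MatchKind::All.continue_past_first_match());
        assert!(!MatchKind::LeftmostFirst.continue_past_first_match());
    }
}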