diff options
Diffstat (limited to 'man7/regex.7')
-rw-r--r-- | man7/regex.7 | 293 |
1 files changed, 293 insertions, 0 deletions
diff --git a/man7/regex.7 b/man7/regex.7 new file mode 100644 index 0000000..7a5b2d8 --- /dev/null +++ b/man7/regex.7 @@ -0,0 +1,293 @@ +'\" t +.\" From Henry Spencer's regex package (as found in the apache +.\" distribution). The package carries the following copyright: +.\" +.\" Copyright 1992, 1993, 1994 Henry Spencer. All rights reserved. +.\" %%%LICENSE_START(MISC) +.\" This software is not subject to any license of the American Telephone +.\" and Telegraph Company or of the Regents of the University of California. +.\" +.\" Permission is granted to anyone to use this software for any purpose +.\" on any computer system, and to alter it and redistribute it, subject +.\" to the following restrictions: +.\" +.\" 1. The author is not responsible for the consequences of use of this +.\" software, no matter how awful, even if they arise from flaws in it. +.\" +.\" 2. The origin of this software must not be misrepresented, either by +.\" explicit claim or by omission. Since few users ever read sources, +.\" credits must appear in the documentation. +.\" +.\" 3. Altered versions must be plainly marked as such, and must not be +.\" misrepresented as being the original software. Since few users +.\" ever read sources, credits must appear in the documentation. +.\" +.\" 4. This notice may not be removed or altered. +.\" %%%LICENSE_END +.\" +.\" In order to comply with `credits must appear in the documentation' +.\" I added an AUTHOR paragraph below - aeb. +.\" +.\" In the default nroff environment there is no dagger \(dg. +.\" +.\" 2005-05-11 Removed discussion of `[[:<:]]' and `[[:>:]]', which +.\" appear not to be in the glibc implementation of regcomp +.\" +.ie t .ds dg \(dg +.el .ds dg (!) +.TH regex 7 2023-03-08 "Linux man-pages 6.05.01" +.SH NAME +regex \- POSIX.2 regular expressions +.SH DESCRIPTION +Regular expressions ("RE"s), +as defined in POSIX.2, come in two forms: +modern REs (roughly those of +.IR egrep ; +POSIX.2 calls these "extended" REs) +and obsolete REs (roughly those of +.BR ed (1); +POSIX.2 "basic" REs). +Obsolete REs mostly exist for backward compatibility in some old programs; +they will be discussed at the end. +POSIX.2 leaves some aspects of RE syntax and semantics open; +"\*(dg" marks decisions on these aspects that +may not be fully portable to other POSIX.2 implementations. +.PP +A (modern) RE is one\*(dg or more nonempty\*(dg \fIbranches\fR, +separated by \[aq]|\[aq]. +It matches anything that matches one of the branches. +.PP +A branch is one\*(dg or more \fIpieces\fR, concatenated. +It matches a match for the first, followed by a match for the second, +and so on. +.PP +A piece is an \fIatom\fR possibly followed +by a single\*(dg \[aq]*\[aq], \[aq]+\[aq], \[aq]?\[aq], or \fIbound\fR. +An atom followed by \[aq]*\[aq] +matches a sequence of 0 or more matches of the atom. +An atom followed by \[aq]+\[aq] +matches a sequence of 1 or more matches of the atom. +An atom followed by \[aq]?\[aq] +matches a sequence of 0 or 1 matches of the atom. +.PP +A \fIbound\fR is \[aq]{\[aq] followed by an unsigned decimal integer, +possibly followed by \[aq],\[aq] +possibly followed by another unsigned decimal integer, +always followed by \[aq]}\[aq]. +The integers must lie between 0 and +.B RE_DUP_MAX +(255\*(dg) inclusive, +and if there are two of them, the first may not exceed the second. +An atom followed by a bound containing one integer \fIi\fR +and no comma matches +a sequence of exactly \fIi\fR matches of the atom. +An atom followed by a bound +containing one integer \fIi\fR and a comma matches +a sequence of \fIi\fR or more matches of the atom. +An atom followed by a bound +containing two integers \fIi\fR and \fIj\fR matches +a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom. +.PP +An atom is a regular expression enclosed in "\fI()\fP" +(matching a match for the regular expression), +an empty set of "\fI()\fP" (matching the null string)\*(dg, +a \fIbracket expression\fR (see below), +\[aq].\[aq] (matching any single character), +\[aq]\[ha]\[aq] (matching the null string at the beginning of a line), +\[aq]$\[aq] (matching the null string at the end of a line), +a \[aq]\e\[aq] followed by one of the characters "\fI\[ha].[$()|*+?{\e\fP" +(matching that character taken as an ordinary character), +a \[aq]\e\[aq] followed by any other character\*(dg +(matching that character taken as an ordinary character, +as if the \[aq]\e\[aq] had not been present\*(dg), +or a single character with no other significance (matching that character). +A \[aq]{\[aq] followed by a character other than a digit +is an ordinary character, +not the beginning of a bound\*(dg. +It is illegal to end an RE with \[aq]\e\[aq]. +.PP +A \fIbracket expression\fR is a list of characters enclosed in "\fI[]\fP". +It normally matches any single character from the list (but see below). +If the list begins with \[aq]\[ha]\[aq], +it matches any single character +(but see below) \fInot\fR from the rest of the list. +If two characters in the list are separated by \[aq]\-\[aq], this is shorthand +for the full \fIrange\fR of characters between those two (inclusive) in the +collating sequence, +for example, "\fI[0\-9]\fP" in ASCII matches any decimal digit. +It is illegal\*(dg for two ranges to share an +endpoint, for example, "\fIa\-c\-e\fP". +Ranges are very collating-sequence-dependent, +and portable programs should avoid relying on them. +.PP +To include a literal \[aq]]\[aq] in the list, make it the first character +(following a possible \[aq]\[ha]\[aq]). +To include a literal \[aq]\-\[aq], make it the first or last character, +or the second endpoint of a range. +To use a literal \[aq]\-\[aq] as the first endpoint of a range, +enclose it in "\fI[.\fP" and "\fI.]\fP" +to make it a collating element (see below). +With the exception of these and some combinations using \[aq][\[aq] (see next +paragraphs), all other special characters, including \[aq]\e\[aq], lose their +special significance within a bracket expression. +.PP +Within a bracket expression, a collating element (a character, +a multicharacter sequence that collates as if it were a single character, +or a collating-sequence name for either) +enclosed in "\fI[.\fP" and "\fI.]\fP" stands for the +sequence of characters of that collating element. +The sequence is a single element of the bracket expression's list. +A bracket expression containing a multicharacter collating element +can thus match more than one character, +for example, if the collating sequence includes a "ch" collating element, +then the RE "\fI[[.ch.]]*c\fP" matches the first five characters +of "chchcc". +.PP +Within a bracket expression, a collating element enclosed in "\fI[=\fP" and +"\fI=]\fP" is an equivalence class, standing for the sequences of characters +of all collating elements equivalent to that one, including itself. +(If there are no other equivalent collating elements, +the treatment is as if the enclosing delimiters +were "\fI[.\fP" and "\fI.]\fP".) +For example, if o and \(^o are the members of an equivalence class, +then "\fI[[=o=]]\fP", "\fI[[=\(^o=]]\fP", +and "\fI[o\(^o]\fP" are all synonymous. +An equivalence class may not\*(dg be an endpoint +of a range. +.PP +Within a bracket expression, the name of a \fIcharacter class\fR enclosed +in "\fI[:\fP" and "\fI:]\fP" stands for the list +of all characters belonging to that +class. +Standard character class names are: +.PP +.RS +.TS +l l l. +alnum digit punct +alpha graph space +blank lower upper +cntrl print xdigit +.TE +.RE +.PP +These stand for the character classes defined in +.BR wctype (3). +A locale may provide others. +A character class may not be used as an endpoint of a range. +.\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666 +.\" The following does not seem to apply in the glibc implementation +.\" .PP +.\" There are two special cases\*(dg of bracket expressions: +.\" the bracket expressions "\fI[[:<:]]\fP" and "\fI[[:>:]]\fP" match +.\" the null string at the beginning and end of a word respectively. +.\" A word is defined as a sequence of +.\" word characters +.\" which is neither preceded nor followed by +.\" word characters. +.\" A word character is an +.\" .I alnum +.\" character (as defined by +.\" .BR wctype (3)) +.\" or an underscore. +.\" This is an extension, +.\" compatible with but not specified by POSIX.2, +.\" and should be used with +.\" caution in software intended to be portable to other systems. +.PP +In the event that an RE could match more than one substring of a given +string, +the RE matches the one starting earliest in the string. +If the RE could match more than one substring starting at that point, +it matches the longest. +Subexpressions also match the longest possible substrings, subject to +the constraint that the whole match be as long as possible, +with subexpressions starting earlier in the RE taking priority over +ones starting later. +Note that higher-level subexpressions thus take priority over +their lower-level component subexpressions. +.PP +Match lengths are measured in characters, not collating elements. +A null string is considered longer than no match at all. +For example, +"\fIbb*\fP" matches the three middle characters of "abbbc", +"\fI(wee|week)(knights|nights)\fP" +matches all ten characters of "weeknights", +when "\fI(.*).*\fP" is matched against "abc" the parenthesized subexpression +matches all three characters, and +when "\fI(a*)*\fP" is matched against "bc" +both the whole RE and the parenthesized +subexpression match the null string. +.PP +If case-independent matching is specified, +the effect is much as if all case distinctions had vanished from the +alphabet. +When an alphabetic that exists in multiple cases appears as an +ordinary character outside a bracket expression, it is effectively +transformed into a bracket expression containing both cases, +for example, \[aq]x\[aq] becomes "\fI[xX]\fP". +When it appears inside a bracket expression, all case counterparts +of it are added to the bracket expression, so that, for example, "\fI[x]\fP" +becomes "\fI[xX]\fP" and "\fI[\[ha]x]\fP" becomes "\fI[\[ha]xX]\fP". +.PP +No particular limit is imposed on the length of REs\*(dg. +Programs intended to be portable should not employ REs longer +than 256 bytes, +as an implementation can refuse to accept such REs and remain +POSIX-compliant. +.PP +Obsolete ("basic") regular expressions differ in several respects. +\[aq]|\[aq], \[aq]+\[aq], and \[aq]?\[aq] are +ordinary characters and there is no equivalent +for their functionality. +The delimiters for bounds are "\fI\e{\fP" and "\fI\e}\fP", +with \[aq]{\[aq] and \[aq]}\[aq] by themselves ordinary characters. +The parentheses for nested subexpressions are "\fI\e(\fP" and "\fI\e)\fP", +with \[aq](\[aq] and \[aq])\[aq] by themselves ordinary characters. +\[aq]\[ha]\[aq] is an ordinary character except at the beginning of the +RE or\*(dg the beginning of a parenthesized subexpression, +\[aq]$\[aq] is an ordinary character except at the end of the +RE or\*(dg the end of a parenthesized subexpression, +and \[aq]*\[aq] is an ordinary character if it appears at the beginning of the +RE or the beginning of a parenthesized subexpression +(after a possible leading \[aq]\[ha]\[aq]). +.PP +Finally, there is one new type of atom, a \fIback reference\fR: +\[aq]\e\[aq] followed by a nonzero decimal digit \fId\fR +matches the same sequence of characters +matched by the \fId\fRth parenthesized subexpression +(numbering subexpressions by the positions of their opening parentheses, +left to right), +so that, for example, "\fI\e([bc]\e)\e1\fP" matches "bb" or "cc" but not "bc". +.SH BUGS +Having two kinds of REs is a botch. +.PP +The current POSIX.2 spec says that \[aq])\[aq] is an ordinary character in +the absence of an unmatched \[aq](\[aq]; +this was an unintentional result of a wording error, +and change is likely. +Avoid relying on it. +.PP +Back references are a dreadful botch, +posing major problems for efficient implementations. +They are also somewhat vaguely defined +(does +"\fIa\e(\e(b\e)*\e2\e)*d\fP" match "abbbd"?). +Avoid using them. +.PP +POSIX.2's specification of case-independent matching is vague. +The "one case implies all cases" definition given above +is current consensus among implementors as to the right interpretation. +.\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666 +.\" The following does not seem to apply in the glibc implementation +.\" .PP +.\" The syntax for word boundaries is incredibly ugly. +.SH AUTHOR +.\" Sigh... The page license means we must have the author's name +.\" in the formatted output. +This page was taken from Henry Spencer's regex package. +.SH SEE ALSO +.BR grep (1), +.BR regex (3) +.PP +POSIX.2, section 2.8 (Regular Expression Notation). |