Diffstat (limited to 'man7/regex.7')
-rw-r--r-- | man7/regex.7 | 52 |
1 file changed, 26 insertions, 26 deletions
diff --git a/man7/regex.7 b/man7/regex.7
index 7a5b2d8..ad63573 100644
--- a/man7/regex.7
+++ b/man7/regex.7
@@ -35,14 +35,14 @@
 .\"
 .ie t .ds dg \(dg
 .el .ds dg (!)
-.TH regex 7 2023-03-08 "Linux man-pages 6.05.01"
+.TH regex 7 2023-11-01 "Linux man-pages 6.7"
 .SH NAME
 regex \- POSIX.2 regular expressions
 .SH DESCRIPTION
 Regular expressions ("RE"s),
 as defined in POSIX.2, come in two forms:
 modern REs (roughly those of
-.IR egrep ;
+.BR egrep (1);
 POSIX.2 calls these "extended" REs)
 and obsolete REs (roughly those of
 .BR ed (1);
@@ -52,15 +52,15 @@ they will be discussed at the end.
 POSIX.2 leaves some aspects of RE syntax and semantics open;
 "\*(dg" marks decisions on these aspects that
 may not be fully portable to other POSIX.2 implementations.
-.PP
+.P
 A (modern) RE is one\*(dg or more nonempty\*(dg \fIbranches\fR,
 separated by \[aq]|\[aq].
 It matches anything that matches one of the branches.
-.PP
+.P
 A branch is one\*(dg or more \fIpieces\fR, concatenated.
 It matches a match for the first, followed by a match for the second,
 and so on.
-.PP
+.P
 A piece is an \fIatom\fR possibly followed
 by a single\*(dg \[aq]*\[aq], \[aq]+\[aq], \[aq]?\[aq], or \fIbound\fR.
 An atom followed by \[aq]*\[aq]
@@ -69,7 +69,7 @@ An atom followed by \[aq]+\[aq]
 matches a sequence of 1 or more matches of the atom.
 An atom followed by \[aq]?\[aq]
 matches a sequence of 0 or 1 matches of the atom.
-.PP
+.P
 A \fIbound\fR is \[aq]{\[aq] followed by an unsigned decimal integer,
 possibly followed by \[aq],\[aq]
 possibly followed by another unsigned decimal integer,
@@ -87,7 +87,7 @@ a sequence of \fIi\fR or more matches of the atom.
 An atom followed by a bound
 containing two integers \fIi\fR and \fIj\fR matches
 a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
-.PP
+.P
 An atom is a regular expression enclosed in "\fI()\fP"
 (matching a match for the regular expression),
 an empty set of "\fI()\fP" (matching the null string)\*(dg,
@@ -105,7 +105,7 @@ A \[aq]{\[aq] followed by a character other than a digit is an ordinary
 character,
 not the beginning of a bound\*(dg.
 It is illegal to end an RE with \[aq]\e\[aq].
-.PP
+.P
 A \fIbracket expression\fR is a list of characters enclosed in "\fI[]\fP".
 It normally matches any single character from the list (but see below).
 If the list begins with \[aq]\[ha]\[aq],
@@ -119,7 +119,7 @@ It is illegal\*(dg for two ranges to share
 an endpoint, for example, "\fIa\-c\-e\fP".
 Ranges are very collating-sequence-dependent,
 and portable programs should avoid relying on them.
-.PP
+.P
 To include a literal \[aq]]\[aq] in the list, make it the first character
 (following a possible \[aq]\[ha]\[aq]).
 To include a literal \[aq]\-\[aq], make it the first or last character,
@@ -130,7 +130,7 @@ to make it a collating element (see below).
 With the exception of these and some combinations using \[aq][\[aq]
 (see next paragraphs), all other special characters, including \[aq]\e\[aq], lose their
 special significance within a bracket expression.
-.PP
+.P
 Within a bracket expression, a collating element (a character,
 a multicharacter sequence that collates as if it were a single character,
 or a collating-sequence name for either)
@@ -142,7 +142,7 @@ can thus match more than one character,
 for example, if the collating sequence includes a "ch" collating element,
 then the RE "\fI[[.ch.]]*c\fP" matches the first five characters
 of "chchcc".
-.PP
+.P
 Within a bracket expression, a collating element enclosed in "\fI[=\fP" and
 "\fI=]\fP" is an equivalence class, standing for the sequences of characters
 of all collating elements equivalent to that one, including itself.
@@ -154,13 +154,13 @@ then "\fI[[=o=]]\fP", "\fI[[=\(^o=]]\fP", and "\fI[o\(^o]\fP"
 are all synonymous.
 An equivalence class may not\*(dg be an endpoint
 of a range.
-.PP
+.P
 Within a bracket expression, the name of a \fIcharacter class\fR
 enclosed in "\fI[:\fP" and "\fI:]\fP"
 stands for the list of all characters belonging to that
 class.
 Standard character class names are:
-.PP
+.P
 .RS
 .TS
 l l l.
@@ -170,14 +170,14 @@ blank lower upper
 cntrl print xdigit
 .TE
 .RE
-.PP
+.P
 These stand for the character classes defined in
 .BR wctype (3).
 A locale may provide others.
 A character class may not be used as an endpoint of a range.
 .\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666
 .\" The following does not seem to apply in the glibc implementation
-.\" .PP
+.\" .P
 .\" There are two special cases\*(dg of bracket expressions:
 .\" the bracket expressions "\fI[[:<:]]\fP" and "\fI[[:>:]]\fP" match
 .\" the null string at the beginning and end of a word respectively.
@@ -194,7 +194,7 @@ A character class may not be used as an endpoint of a range.
 .\" compatible with but not specified by POSIX.2,
 .\" and should be used with
 .\" caution in software intended to be portable to other systems.
-.PP
+.P
 In the event that an RE could match more than one substring of a given
 string,
 the RE matches the one starting earliest in the string.
@@ -206,7 +206,7 @@ with subexpressions starting earlier in the RE taking priority over
 ones starting later.
 Note that higher-level subexpressions thus take priority over
 their lower-level component subexpressions.
-.PP
+.P
 Match lengths are measured in characters, not collating elements.
 A null string is considered longer than no match at all.
 For example,
@@ -218,7 +218,7 @@ matches all three characters, and
 when "\fI(a*)*\fP" is matched against "bc"
 both the whole RE and the parenthesized
 subexpression match the null string.
-.PP
+.P
 If case-independent matching is specified,
 the effect is much as if all case distinctions had vanished from the
 alphabet.
@@ -229,13 +229,13 @@ for example, \[aq]x\[aq] becomes "\fI[xX]\fP".
 When it appears inside a bracket expression, all case counterparts
 of it are added to the bracket expression, so that, for example, "\fI[x]\fP"
 becomes "\fI[xX]\fP" and "\fI[\[ha]x]\fP" becomes "\fI[\[ha]xX]\fP".
-.PP
+.P
 No particular limit is imposed on the length of REs\*(dg.
 Programs intended to be portable should not employ REs longer than 256
 bytes,
 as an implementation can refuse to accept such REs and remain
 POSIX-compliant.
-.PP
+.P
 Obsolete ("basic") regular expressions differ in several respects.
 \[aq]|\[aq], \[aq]+\[aq], and \[aq]?\[aq] are
 ordinary characters and there is no equivalent
@@ -251,7 +251,7 @@ RE or\*(dg the end of a parenthesized subexpression,
 and \[aq]*\[aq] is an ordinary character if it appears at the beginning of the
 RE or the beginning of a parenthesized subexpression
 (after a possible leading \[aq]\[ha]\[aq]).
-.PP
+.P
 Finally, there is one new type of atom, a \fIback reference\fR:
 \[aq]\e\[aq] followed by a nonzero decimal digit \fId\fR
 matches the same sequence of characters
@@ -261,26 +261,26 @@ left to right), so that, for example, "\fI\e([bc]\e)\e1\fP"
 matches "bb" or "cc" but not "bc".
 .SH BUGS
 Having two kinds of REs is a botch.
-.PP
+.P
 The current POSIX.2 spec says that \[aq])\[aq] is an ordinary character in
 the absence of an unmatched \[aq](\[aq];
 this was an unintentional result of a wording error,
 and change is likely.
 Avoid relying on it.
-.PP
+.P
 Back references are a dreadful botch,
 posing major problems for efficient implementations.
 They are also somewhat vaguely defined
 (does
 "\fIa\e(\e(b\e)*\e2\e)*d\fP" match "abbbd"?).
 Avoid using them.
-.PP
+.P
 POSIX.2's specification of case-independent matching is vague.
 The "one case implies all cases" definition given above
 is current consensus among implementors as to the right interpretation.
 .\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666
 .\" The following does not seem to apply in the glibc implementation
-.\" .PP
+.\" .P
 .\" The syntax for word boundaries is incredibly ugly.
 .SH AUTHOR
 .\" Sigh... The page license means we must have the author's name
@@ -289,5 +289,5 @@ This page was taken from Henry Spencer's regex package.
 .SH SEE ALSO
 .BR grep (1),
 .BR regex (3)
-.PP
+.P
 POSIX.2, section 2.8 (Regular Expression Notation).
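
Apart from the .TH line and the egrep cross-reference, every hunk above only swaps the .PP macro for .P, so the semantics the page describes are unchanged. For readers who want to check one of the rules quoted in the context lines, here is a minimal C sketch of the back-reference example ("\([bc]\)\1" matches "bb" or "cc" but not "bc") and of a bound, using regcomp(3) and regexec(3) from <regex.h>. The helper name re_matches is hypothetical and is not part of the page or of any library. The sketch assumes a POSIX system, where a cflags value of 0 selects obsolete ("basic") REs and REG_EXTENDED selects modern ones.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only (not part of regex(3)).
 * Returns 1 if some substring of `text` matches `pattern`,
 * 0 if none does, and -1 if the pattern fails to compile.
 * Pass 0 in `cflags` for basic REs or REG_EXTENDED for modern REs.
 */
int re_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;              /* compile error, nothing to free */

    /* nmatch == 0: only a yes/no answer is needed, no offsets. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    return rc == 0 ? 1 : 0;
}
```

With this helper, re_matches("\\([bc]\\)\\1", 0, "bb") and the same call with "cc" should report a match, while "bc" should not. Likewise re_matches("a{2,3}", REG_EXTENDED, "xaay") should match and "xay" should not, since the bound requires at least two consecutive matches of the atom.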