Diffstat (limited to 'upstream/mageia-cauldron/man1p/lex.1p')
-rw-r--r-- | upstream/mageia-cauldron/man1p/lex.1p | 1376 |
1 files changed, 1376 insertions, 0 deletions
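Note on the matching rule in the page added below: its EXTENDED DESCRIPTION states that lex searches for the single longest possible match and that, among rules matching the same number of characters, the rule given first is chosen. What follows is a minimal C sketch of that selection only, under the assumption that every pattern is a literal string rather than an ERE. The name `select_rule` is hypothetical; it is not part of lex, the lex library, or the generated lex.yy.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of lex): given literal patterns listed in
 * rule order, return the index of the rule that the longest-match policy
 * would select at the start of `input`, or -1 if no pattern matches.
 * The matched length (the value yyleng would take) is stored in *len.
 */
int select_rule(const char *const rules[], size_t nrules,
                const char *input, size_t *len)
{
    int best = -1;
    size_t best_len = 0;

    assert(rules != NULL && input != NULL && len != NULL);

    for (size_t i = 0; i < nrules; i++) {
        size_t n = strlen(rules[i]);

        /* A literal pattern matches when it is a non-empty prefix of input. */
        if (n == 0 || strncmp(input, rules[i], n) != 0)
            continue;

        /* Strictly longer wins; on equal length the earlier rule is kept. */
        if (n > best_len) {
            best = (int)i;
            best_len = n;
        }
    }

    *len = best_len;
    return best;
}
```

With rules listed as "xy", "xyz", "xyz", the input "xyzw" selects the second rule (the longer match, and the earlier of the two equal-length candidates), while the input "xyq" falls back to the first rule.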
diff --git a/upstream/mageia-cauldron/man1p/lex.1p b/upstream/mageia-cauldron/man1p/lex.1p new file mode 100644 index 00000000..f698edf0 --- /dev/null +++ b/upstream/mageia-cauldron/man1p/lex.1p @@ -0,0 +1,1376 @@ +'\" et +.TH LEX "1P" 2017 "IEEE/The Open Group" "POSIX Programmer's Manual" +.\" +.SH PROLOG +This manual page is part of the POSIX Programmer's Manual. +The Linux implementation of this interface may differ (consult +the corresponding Linux manual page for details of Linux behavior), +or the interface may not be implemented on Linux. +.\" +.SH NAME +lex +\(em generate programs for lexical tasks (\fBDEVELOPMENT\fP) +.SH SYNOPSIS +.LP +.nf +lex \fB[\fR-t\fB] [\fR-n|-v\fB] [\fIfile\fR...\fB]\fR +.fi +.SH DESCRIPTION +The +.IR lex +utility shall generate C programs to be used in lexical processing of +character input, and that can be used as an interface to +.IR yacc . +The C programs shall be generated from +.IR lex +source code and conform to the ISO\ C standard, without depending on any undefined, +unspecified, or implementation-defined behavior, except in cases where +the code is copied directly from the supplied source, or in cases that +are documented by the implementation. Usually, the +.IR lex +utility shall write the program it generates to the file +.BR lex.yy.c ; +the state of this file is unspecified if +.IR lex +exits with a non-zero exit status. See the EXTENDED DESCRIPTION +section for a complete description of the +.IR lex +input language. +.SH OPTIONS +The +.IR lex +utility shall conform to the Base Definitions volume of POSIX.1\(hy2017, +.IR "Section 12.2" ", " "Utility Syntax Guidelines", +except for Guideline 9. +.P +The following options shall be supported: +.IP "\fB\-n\fP" 10 +Suppress the summary of statistics usually written with the +.BR \-v +option. If no table sizes are specified in the +.IR lex +source code and the +.BR \-v +option is not specified, then +.BR \-n +is implied. 
+.IP "\fB\-t\fP" 10 +Write the resulting program to standard output instead of +.BR lex.yy.c . +.IP "\fB\-v\fP" 10 +Write a summary of +.IR lex +statistics to the standard output. (See the discussion of +.IR lex +table sizes in +.IR "Definitions in lex".) +If the +.BR \-t +option is specified and +.BR \-n +is not specified, this report shall be written to standard error. If +table sizes are specified in the +.IR lex +source code, and if the +.BR \-n +option is not specified, the +.BR \-v +option may be enabled. +.SH OPERANDS +The following operand shall be supported: +.IP "\fIfile\fR" 10 +A pathname of an input file. If more than one such +.IR file +is specified, all files shall be concatenated to produce a single +.IR lex +program. If no +.IR file +operands are specified, or if a +.IR file +operand is +.BR '\-' , +the standard input shall be used. +.SH STDIN +The standard input shall be used if no +.IR file +operands are specified, or if a +.IR file +operand is +.BR '\-' . +See INPUT FILES. +.SH "INPUT FILES" +The input files shall be text files containing +.IR lex +source code, as described in the EXTENDED DESCRIPTION section. +.SH "ENVIRONMENT VARIABLES" +The following environment variables shall affect the execution of +.IR lex : +.IP "\fILANG\fP" 10 +Provide a default value for the internationalization variables that are +unset or null. (See the Base Definitions volume of POSIX.1\(hy2017, +.IR "Section 8.2" ", " "Internationalization Variables" +for the precedence of internationalization variables used to determine +the values of locale categories.) +.IP "\fILC_ALL\fP" 10 +If set to a non-empty string value, override the values of all the +other internationalization variables. +.IP "\fILC_COLLATE\fP" 10 +.br +Determine the locale for the behavior of ranges, equivalence classes, +and multi-character collating elements within regular expressions. If +this variable is not set to the POSIX locale, the results are +unspecified. 
+.IP "\fILC_CTYPE\fP" 10 +Determine the locale for the interpretation of sequences of bytes of +text data as characters (for example, single-byte as opposed to +multi-byte characters in arguments and input files), and the behavior +of character classes within regular expressions. If this variable is +not set to the POSIX locale, the results are unspecified. +.IP "\fILC_MESSAGES\fP" 10 +.br +Determine the locale that should be used to affect the format and +contents of diagnostic messages written to standard error. +.IP "\fINLSPATH\fP" 10 +Determine the location of message catalogs for the processing of +.IR LC_MESSAGES . +.SH "ASYNCHRONOUS EVENTS" +Default. +.SH STDOUT +If the +.BR \-t +option is specified, the text file of C source code output of +.IR lex +shall be written to standard output. +.P +If the +.BR \-t +option is not specified: +.IP " *" 4 +Implementation-defined informational, error, and warning messages +concerning the contents of +.IR lex +source code input shall be written to either the standard output or +standard error. +.IP " *" 4 +If the +.BR \-v +option is specified and the +.BR \-n +option is not specified, +.IR lex +statistics shall also be written to either the standard output or +standard error, in an implementation-defined format. These +statistics may also be generated if table sizes are specified with a +.BR '%' +operator in the +.IR Definitions +section, as long as the +.BR \-n +option is not specified. +.SH STDERR +If the +.BR \-t +option is specified, implementation-defined informational, error, and +warning messages concerning the contents of +.IR lex +source code input shall be written to the standard error. +.P +If the +.BR \-t +option is not specified: +.IP " 1." 4 +Implementation-defined informational, error, and warning messages +concerning the contents of +.IR lex +source code input shall be written to either the standard output or +standard error. +.IP " 2." 
4 +If the +.BR \-v +option is specified and the +.BR \-n +option is not specified, +.IR lex +statistics shall also be written to either the standard output or +standard error, in an implementation-defined format. These +statistics may also be generated if table sizes are specified with a +.BR '%' +operator in the +.IR Definitions +section, as long as the +.BR \-n +option is not specified. +.SH "OUTPUT FILES" +A text file containing C source code shall be written to +.BR lex.yy.c , +or to the standard output if the +.BR \-t +option is present. +.SH "EXTENDED DESCRIPTION" +Each input file shall contain +.IR lex +source code, which is a table of regular expressions with corresponding +actions in the form of C program fragments. +.P +When +.BR lex.yy.c +is compiled and linked with the +.IR lex +library (using the +.BR "\-l\ l" +operand with +.IR c99 ), +the resulting program shall read character input from the standard +input and shall partition it into strings that match the given +expressions. +.br +.P +When an expression is matched, these actions shall occur: +.IP " *" 4 +The input string that was matched shall be left in +.IR yytext +as a null-terminated string; +.IR yytext +shall either be an external character array or a pointer to a +character string. As explained in +.IR "Definitions in lex", +the type can be explicitly selected using the +.BR %array +or +.BR %pointer +declarations, but the default is implementation-defined. +.IP " *" 4 +The external +.BR int +.IR yyleng +shall be set to the length of the matching string. +.IP " *" 4 +The expression's corresponding program fragment, or action, shall be +executed. +.P +During pattern matching, +.IR lex +shall search the set of patterns for the single longest possible +match. Among rules that match the same number of characters, the rule +given first shall be chosen. 
+.P +The general format of +.IR lex +source shall be: +.sp +.RS +.IR Definitions +.BR %% +.IR Rules +.BR %% +.IR User Subroutines +.RE +.P +The first +.BR \(dq%%\(dq +is required to mark the beginning of the rules (regular expressions and +actions); the second +.BR \(dq%%\(dq +is required only if user subroutines follow. +.P +Any line in the +.IR Definitions +section beginning with a +<blank> +shall be assumed to be a C program fragment and shall be copied to the +external definition area of the +.BR lex.yy.c +file. Similarly, anything in the +.IR Definitions +section included between delimiter lines containing only +.BR \(dq%{\(dq +and +.BR \(dq%}\(dq +shall also be copied unchanged to the external definition area of the +.BR lex.yy.c +file. +.P +Any such input (beginning with a +<blank> +or within +.BR \(dq%{\(dq +and +.BR \(dq%}\(dq +delimiter lines) appearing at the beginning of the +.IR Rules +section before any rules are specified shall be written to +.BR lex.yy.c +after the declarations of variables for the +\fIyylex\fR() +function and before the first line of code in +\fIyylex\fR(). +Thus, user variables local to +\fIyylex\fR() +can be declared here, as well as application code to execute upon entry +to +\fIyylex\fR(). +.P +The action taken by +.IR lex +when encountering any input beginning with a +<blank> +or within +.BR \(dq%{\(dq +and +.BR \(dq%}\(dq +delimiter lines appearing in the +.IR Rules +section but coming after one or more rules is undefined. The presence +of such input may result in an erroneous definition of the +\fIyylex\fR() +function. +.P +C-language code in the input shall not contain C-language trigraphs. +The C-language code within +.BR \(dq%{\(dq +and +.BR \(dq%}\(dq +delimiter lines shall not contain any lines consisting only of +.BR \(dq%}\(dq , +or only of +.BR \(dq%%\(dq . +.SS "Definitions in lex" +.P +.IR Definitions +appear before the first +.BR \(dq%%\(dq +delimiter. 
Any line in this section not contained between +.BR \(dq%{\(dq +and +.BR \(dq%}\(dq +lines and not beginning with a +<blank> +shall be assumed to define a +.IR lex +substitution string. The format of these lines shall be: +.sp +.RS 4 +.nf + +\fIname substitute\fR +.fi +.P +.RE +.P +If a +.IR name +does not meet the requirements for identifiers in the ISO\ C standard, the result +is undefined. The string +.IR substitute +shall replace the string {\c +.IR name } +when it is used in a rule. The +.IR name +string shall be recognized in this context only when the braces are +provided and when it does not appear within a bracket expression or +within double-quotes. +.P +In the +.IR Definitions +section, any line beginning with a +<percent-sign> +(\c +.BR '%' ) +character and followed by an alphanumeric word beginning with either +.BR 's' +or +.BR 'S' +shall define a set of start conditions. Any line beginning with a +.BR '%' +followed by a word beginning with either +.BR 'x' +or +.BR 'X' +shall define a set of exclusive start conditions. When the generated +scanner is in a +.BR %s +state, patterns with no state specified shall be also active; in a +.BR %x +state, such patterns shall not be active. The rest of the line, after +the first word, shall be considered to be one or more +<blank>-separated +names of start conditions. Start condition names shall be constructed +in the same way as definition names. Start conditions can be used to +restrict the matching of regular expressions to one or more states as +described in +.IR "Regular Expressions in lex". +.P +Implementations shall accept either of the following two +mutually-exclusive declarations in the +.IR Definitions +section: +.IP "\fB%array\fR" 10 +Declare the type of +.IR yytext +to be a null-terminated character array. +.IP "\fB%pointer\fR" 10 +Declare the type of +.IR yytext +to be a pointer to a null-terminated character string. +.P +The default type of +.IR yytext +is implementation-defined. 
If an application refers to +.IR yytext +outside of the scanner source file (that is, via an +.BR extern ), +the application shall include the appropriate +.BR %array +or +.BR %pointer +declaration in the scanner source file. +.P +Implementations shall accept declarations in the +.IR Definitions +section for setting certain internal table sizes. The declarations are +shown in the following table. +.sp +.ce 1 +\fBTable: Table Size Declarations in \fIlex\fP\fR +.TS +center tab(!) box; +cB | cB | cB +l | l | n. +Declaration!Description!Minimum Value +_ +%\fBp \fIn\fR!Number of positions!2\|500 +%\fBn \fIn\fR!Number of states!500 +%\fBa \fIn\fR!Number of transitions!2\|000 +%\fBe \fIn\fR!Number of parse tree nodes!1\|000 +%\fBk \fIn\fR!Number of packed character classes!1\|000 +%\fBo \fIn\fR!Size of the output array!3\|000 +.TE +.P +In the table, +.IR n +represents a positive decimal integer, preceded by one or more +<blank> +characters. The exact meaning of these table size numbers is +implementation-defined. The implementation shall document how these +numbers affect the +.IR lex +utility and how they are related to any output that may be generated by +the implementation should limitations be encountered during the +execution of +.IR lex . +It shall be possible to determine from this output which of the table +size values needs to be modified to permit +.IR lex +to successfully generate tables for the input language. The values in +the column Minimum Value represent the lowest values conforming +implementations shall provide. +.SS "Rules in lex" +.P +The rules in +.IR lex +source files are a table in which the left column contains regular +expressions and the right column contains actions (C program fragments) +to be executed when the expressions are recognized. +.sp +.RS 4 +.nf + +\fIERE action +ERE action\fP +\&... +.fi +.P +.RE +.P +The extended regular expression (ERE) portion of a row shall be +separated from +.IR action +by one or more +<blank> +characters. 
A regular expression containing +<blank> +characters shall be recognized under one of the following conditions: +.IP " *" 4 +The entire expression appears within double-quotes. +.IP " *" 4 +The +<blank> +characters appear within double-quotes or square brackets. +.IP " *" 4 +Each +<blank> +is preceded by a +<backslash> +character. +.SS "User Subroutines in lex" +.P +Anything in the user subroutines section shall be copied to +.BR lex.yy.c +following +\fIyylex\fR(). +.SS "Regular Expressions in lex" +.P +The +.IR lex +utility shall support the set of extended regular expressions (see the Base Definitions volume of POSIX.1\(hy2017, +.IR "Section 9.4" ", " "Extended Regular Expressions"), +with the following additions and exceptions to the syntax: +.IP "\fR\&\(dq...\&\(dq\fR" 10 +Any string enclosed in double-quotes shall represent the characters +within the double-quotes as themselves, except that +<backslash>-escapes +(which appear in the following table) shall be recognized. Any +<backslash>-escape +sequence shall be terminated by the closing quote. For example, +.BR \(dq\e01\(dq \c +.BR \(dq1\(dq +represents a single string: the octal value 1 followed by the +character +.BR '1' . +.IP "<\fIstate\fR>\fIr\fR,\ <\fIstate1,state2,\fR.\|.\|.>\fIr\fR" 10 +.br +The regular expression +.IR r +shall be matched only when the program is in one of the start +conditions indicated by +.IR state , +.IR state1 , +and so on; see +.IR "Actions in lex". +(As an exception to the typographical conventions of the rest of this volume of POSIX.1\(hy2017, +in this case <\fIstate\fP> does not represent a metavariable, but the +literal angle-bracket characters surrounding a symbol.) The start +condition shall be recognized as such only at the beginning of a +regular expression. 
+.IP "\fIr\fP/\fIx\fP" 10 +The regular expression +.IR r +shall be matched only if it is followed by an occurrence of regular +expression +.IR x +(\c +.IR x +is the instance of trailing context, further defined below). The token +returned in +.IR yytext +shall only match +.IR r . +If the trailing portion of +.IR r +matches the beginning of +.IR x , +the result is unspecified. The +.IR r +expression cannot include further trailing context or the +.BR '$' +(match-end-of-line) operator; +.IR x +cannot include the +.BR '\(ha' +(match-beginning-of-line) operator, nor trailing context, nor the +.BR '$' +operator. That is, only one occurrence of trailing context is allowed +in a +.IR lex +regular expression, and the +.BR '\(ha' +operator only can be used at the beginning of such an expression. +.IP "{\fIname\fR}" 10 +When +.IR name +is one of the substitution symbols from the +.IR Definitions +section, the string, including the enclosing braces, shall be replaced +by the +.IR substitute +value. The +.IR substitute +value shall be treated in the extended regular expression as if it were +enclosed in parentheses. No substitution shall occur if {\c +.IR name } +occurs within a bracket expression or within double-quotes. +.P +Within an ERE, a +<backslash> +character shall be considered to begin an escape sequence as specified +in the table in the Base Definitions volume of POSIX.1\(hy2017, +.IR "Chapter 5" ", " "File Format Notation" +(\c +.BR '\e\e' , +.BR '\ea' , +.BR '\eb' , +.BR '\ef' , +.BR '\en' , +.BR '\er' , +.BR '\et' , +.BR '\ev' ). +In addition, the escape sequences in the following table shall be +recognized. +.P +A literal +<newline> +cannot occur within an ERE; the escape sequence +.BR '\en' +can be used to represent a +<newline>. +A +<newline> +shall not be matched by a period operator. +.br +.sp +.ce 1 +\fBTable: Escape Sequences in \fIlex\fP\fR +.ad l +.TS +center tab(@) box; +cB | cB | cB +cB | cB | cB +lf5 | lw(2.4i) | lw(2.4i). 
+Escape +Sequence@Description@Meaning +_ +\e\fIdigits\fP@T{ +A +<backslash> +character followed by the longest sequence of one, two, or three +octal-digit characters (01234567). If all of the digits are 0 (that is, +representation of the NUL character), the behavior is undefined. +T}@T{ +The character whose encoding is represented by the one, two, or +three-digit octal integer. Multi-byte characters require +multiple, concatenated escape sequences of this type, including the +leading +<backslash> +for each byte. +T} +_ +\ex\fIdigits\fP@T{ +A +<backslash> +character followed by the longest sequence of hexadecimal-digit +characters (01234567abcdefABCDEF). If all of the digits are 0 (that is, +representation of the NUL character), the behavior is undefined. +T}@T{ +The character whose encoding is represented by the hexadecimal +integer. +T} +_ +\ec@T{ +A +<backslash> +character followed by any character not described in this +table or in the table in the Base Definitions volume of POSIX.1\(hy2017, +.IR "Chapter 5" ", " "File Format Notation" +(\c +.BR '\e\e' , +.BR '\ea' , +.BR '\eb' , +.BR '\ef' , +.BR '\en' , +.BR '\er' , +.BR '\et' , +.BR '\ev' ). +T}@T{ +The character +.BR 'c' , +unchanged. +T} +.TE +.ad b +.TP 10 +.BR Note: +If a +.BR '\ex' +sequence needs to be immediately followed by a hexadecimal digit +character, a sequence such as +.BR \(dq\ex1\(dq \c +.BR \(dq1\(dq +can be used, which represents a character containing the value 1, +followed by the character +.BR '1' . +.P +.P +The order of precedence given to extended regular expressions for +.IR lex +differs from that specified in the Base Definitions volume of POSIX.1\(hy2017, +.IR "Section 9.4" ", " "Extended Regular Expressions". +The order of precedence for +.IR lex +shall be as shown in the following table, from high to low. 
+.TP 10 +.BR Note: +The escaped characters entry is not meant to imply that these are +operators, but they are included in the table to show their +relationships to the true operators. The start condition, trailing +context, and anchoring notations have been omitted from the table +because of the placement restrictions described in this section; they +can only appear at the beginning or ending of an ERE. +.P +.br +.sp +.ce 1 +\fBTable: ERE Precedence in \fIlex\fP\fR +.TS +center tab(@) box; +cB | cB +lf2 | lf5. +Extended Regular Expression@Precedence +_ +collation-related bracket symbols@[= =] [: :] [. .] +escaped characters@\e<\fIspecial character\fP> +bracket expression@[ ] +quoting@"..." +grouping@( ) +definition@{\fIname\fP} +single-character RE duplication@* + ? +concatenation +interval expression@{m,n} +alternation@| +.TE +.P +The ERE anchoring operators +.BR '\(ha' +and +.BR '$' +do not appear in the table. With +.IR lex +regular expressions, these operators are restricted in their use: the +.BR '\(ha' +operator can only be used at the beginning of an entire regular +expression, and the +.BR '$' +operator only at the end. The operators apply to the entire regular +expression. Thus, for example, the pattern +.BR \(dq(\(haabc)|(def$)\(dq +is undefined; it can instead be written as two separate rules, one with +the regular expression +.BR \(dq\(haabc\(dq +and one with +.BR \(dqdef$\(dq , +which share a common action via the special +.BR '|' +action (see below). If the pattern were written +.BR \(dq\(haabc|def$\(dq , +it would match either +.BR \(dqabc\(dq +or +.BR \(dqdef\(dq +on a line by itself. +.P +Unlike the general ERE rules, embedded anchoring is not allowed by most +historical +.IR lex +implementations. An example of embedded anchoring would be for +patterns such as +.BR \(dq(\(ha|\ )foo(\ |$)\(dq +to match +.BR \(dqfoo\(dq +when it exists as a complete word. 
This functionality can be obtained +using existing +.IR lex +features: +.sp +.RS 4 +.nf + +\(hafoo/[ \en] | +" foo"/[ \en] /* Found foo as a separate word. */ +.fi +.P +.RE +.P +Note also that +.BR '$' +is a form of trailing context (it is equivalent to +.BR \(dq/\en\(dq ) +and as such cannot be used with regular expressions containing another +instance of the operator (see the preceding discussion of trailing +context). +.P +The additional regular expressions trailing-context operator +.BR '/' +can be used as an ordinary character if presented within double-quotes, +.BR \(dq/\(dq ; +preceded by a +<backslash>, +.BR \(dq\e/\(dq ; +or within a bracket expression, +.BR \(dq[/]\(dq . +The start-condition +.BR '<' +and +.BR '>' +operators shall be special only in a start condition at the beginning +of a regular expression; elsewhere in the regular expression they shall +be treated as ordinary characters. +.SS "Actions in lex" +.P +The action to be taken when an ERE is matched can be a C program +fragment or the special actions described below; the program fragment +can contain one or more C statements, and can also include special +actions. The empty C statement +.BR ';' +shall be a valid action; any string in the +.BR lex.yy.c +input that matches the pattern portion of such a rule is effectively +ignored or skipped. However, the absence of an action shall not be +valid, and the action +.IR lex +takes in such a condition is undefined. +.P +The specification for an action, including C statements and special +actions, can extend across several lines if enclosed in braces: +.sp +.RS 4 +.nf + +\fIERE\fP <\fIone or more blanks\fR> { \fIprogram statement + program statement\fP } +.fi +.P +.RE +.P +The program statements shall not contain unbalanced curly brace +preprocessing tokens. +.P +The default action when a string in the input to a +.BR lex.yy.c +program is not matched by any expression shall be to copy the string to +the output. 
Because the default behavior of a program generated by +.IR lex +is to read the input and copy it to the output, a minimal +.IR lex +source program that has just +.BR \(dq%%\(dq +shall generate a C program that simply copies the input to the output +unchanged. +.P +Four special actions shall be available: +.sp +.RS 4 +.nf + +| ECHO; REJECT; BEGIN +.fi +.P +.RE +.IP "\fR|\fR" 10 +The action +.BR '|' +means that the action for the next rule is the action for this rule. +Unlike the other three actions, +.BR '|' +cannot be enclosed in braces or be +<semicolon>-terminated; +the application shall ensure that it is specified alone, with no other +actions. +.IP "\fBECHO;\fR" 10 +Write the contents of the string +.IR yytext +on the output. +.IP "\fBREJECT;\fR" 10 +Usually only a single expression is matched by a given string in the +input. +.BR REJECT +means ``continue to the next expression that matches the current +input'', and shall cause whatever rule was the second choice after the +current rule to be executed for the same input. Thus, multiple rules +can be matched and executed for one input string or overlapping input +strings. For example, given the regular expressions +.BR \(dqxyz\(dq +and +.BR \(dqxy\(dq +and the input +.BR \(dqxyz\(dq , +usually only the regular expression +.BR \(dqxyz\(dq +would match. The next attempted match would start after +.BR z. +If the last action in the +.BR \(dqxyz\(dq +rule is +.BR REJECT , +both this rule and the +.BR \(dqxy\(dq +rule would be executed. The +.BR REJECT +action may be implemented in such a fashion that flow of control does +not continue after it, as if it were equivalent to a +.BR goto +to another part of +\fIyylex\fR(). +The use of +.BR REJECT +may result in somewhat larger and slower scanners. +.IP "\fBBEGIN\fR" 10 +The action: +.RS 10 +.sp +.RS 4 +.nf + +BEGIN \fInewstate\fP; +.fi +.P +.RE +.P +switches the state (start condition) to +.IR newstate . 
+If the string +.IR newstate +has not been declared previously as a start condition in the +.IR Definitions +section, the results are unspecified. The initial state is indicated +by the digit +.BR '0' +or the token +.BR INITIAL . +.RE +.P +The functions or macros described below are accessible to user code +included in the +.IR lex +input. It is unspecified whether they appear in the C code output of +.IR lex , +or are accessible only through the +.BR "\-l\ l" +operand to +.IR c99 +(the +.IR lex +library). +.IP "\fBint\ \fIyylex\fR(\fBvoid\fR)" 6 +.br +Performs lexical analysis on the input; this is the primary function +generated by the +.IR lex +utility. The function shall return zero when the end of input is +reached; otherwise, it shall return non-zero values (tokens) determined +by the actions that are selected. +.IP "\fBint\ \fIyymore\fR(\fBvoid\fR)" 6 +.br +When called, indicates that when the next input string is recognized, +it is to be appended to the current value of +.IR yytext +rather than replacing it; the value in +.IR yyleng +shall be adjusted accordingly. +.IP "\fBint\ \fIyyless\fR(\fBint\ \fIn\fR)" 6 +.br +Retains +.IR n +initial characters in +.IR yytext , +NUL-terminated, and treats the remaining characters as if they had not +been read; the value in +.IR yyleng +shall be adjusted accordingly. +.IP "\fBint\ \fIinput\fR(\fBvoid\fR)" 6 +.br +Returns the next character from the input, or zero on end-of-file. It +shall obtain input from the stream pointer +.IR yyin , +although possibly via an intermediate buffer. Thus, once scanning has +begun, the effect of altering the value of +.IR yyin +is undefined. The character read shall be removed from the input +stream of the scanner without any processing by the scanner. +.IP "\fBint\ \fIunput\fR(\fBint\ \fIc\fR)" 6 +.br +Returns the character +.BR 'c' +to the input; +.IR yytext +and +.IR yyleng +are undefined until the next expression is matched. 
The result of +using +\fIunput\fR() +for more characters than have been input is unspecified. +.P +The following functions shall appear only in the +.IR lex +library accessible through the +.BR "\-l\ l" +operand; they can therefore be redefined by a conforming application: +.IP "\fBint\ \fIyywrap\fR(\fBvoid\fR)" 6 +.br +Called by +\fIyylex\fR() +at end-of-file; the default +\fIyywrap\fR() +shall always return 1. If the application requires +\fIyylex\fR() +to continue processing with another source of input, then the +application can include a function +\fIyywrap\fR(), +which associates another file with the external variable +.BR "FILE *" +.IR yyin +and shall return a value of zero. +.IP "\fBint\ \fImain\fR(\fBint\ \fIargc\fR, \fBchar *\fIargv\fR[\|])" 6 +.br +Calls +\fIyylex\fR() +to perform lexical analysis, then exits. The user code can contain +\fImain\fR() +to perform application-specific operations, calling +\fIyylex\fR() +as applicable. +.P +Except for +\fIinput\fR(), +\fIunput\fR(), +and +\fImain\fR(), +all external and static names generated by +.IR lex +shall begin with the prefix +.BR yy +or +.BR YY . +.SH "EXIT STATUS" +The following exit values shall be returned: +.IP "\00" 6 +Successful completion. +.IP >0 6 +An error occurred. +.SH "CONSEQUENCES OF ERRORS" +Default. +.LP +.IR "The following sections are informative." +.SH "APPLICATION USAGE" +Conforming applications are warned that in the +.IR Rules +section, an ERE without an action is not acceptable, but need not be +detected as erroneous by +.IR lex . +This may result in compilation or runtime errors. +.P +The purpose of +\fIinput\fR() +is to take characters off the input stream and discard them as far as +the lexical analysis is concerned. A common use is to discard the body +of a comment once the beginning of a comment is recognized. +.P +The +.IR lex +utility is not fully internationalized in its treatment of regular +expressions in the +.IR lex +source code or generated lexical analyzer. 
It would seem desirable to +have the lexical analyzer interpret the regular expressions given in +the +.IR lex +source according to the environment specified when the lexical analyzer +is executed, but this is not possible with the current +.IR lex +technology. Furthermore, the very nature of the lexical analyzers +produced by +.IR lex +must be closely tied to the lexical requirements of the input language +being described, which is frequently locale-specific anyway. (For +example, writing an analyzer that is used for French text is not +automatically useful for processing other languages.) +.SH EXAMPLES +The following is an example of a +.IR lex +program that implements a rudimentary scanner for a Pascal-like +syntax: +.sp +.RS 4 +.nf + +%{ +/* Need this for the calls to atoi() and atof() below. */ +#include <stdlib.h> +/* Need this for printf(), fopen(), and stdin below. */ +#include <stdio.h> +%} +.P +DIGIT [0-9] +ID [a-z][a-z0-9]* +.P +%% +.P +{DIGIT}+ { + printf("An integer: %s (%d)\en", yytext, + atoi(yytext)); + } +.P +{DIGIT}+"."{DIGIT}* { + printf("A float: %s (%g)\en", yytext, + atof(yytext)); + } +.P +if|then|begin|end|procedure|function { + printf("A keyword: %s\en", yytext); + } +.P +{ID} printf("An identifier: %s\en", yytext); +.P +"+"|"-"|"*"|"/" printf("An operator: %s\en", yytext); +.P +"{"[\(ha}\en]*"}" /* Eat up one-line comments. */ +.P +[ \et\en]+ /* Eat up white space. */ +.P +\&. printf("Unrecognized character: %s\en", yytext); +.P +%% +.P +int main(int argc, char *argv[]) +{ + ++argv, --argc; /* Skip over program name. */ + if (argc > 0) + yyin = fopen(argv[0], "r"); + else + yyin = stdin; +.P + yylex(); +} +.fi +.P +.RE +.SH RATIONALE +Even though the +.BR \-c +option and references to the C language are retained in this +description, +.IR lex +may be generalized to other languages, as was done at one time for EFL, +the Extended FORTRAN Language. 
Since the +.IR lex +input specification is essentially language-independent, versions of +this utility could be written to produce Ada, Modula-2, or Pascal code, +and there are known historical implementations that do so. +.P +The current description of +.IR lex +bypasses the issue of dealing with internationalized EREs in the +.IR lex +source code or generated lexical analyzer. If it follows the model used +by +.IR awk +(the source code is assumed to be presented in the POSIX locale, but +input and output are in the locale specified by the environment +variables), then the tables in the lexical analyzer produced by +.IR lex +would interpret EREs specified in the +.IR lex +source in terms of the environment variables specified when +.IR lex +was executed. The desired effect would be to have the lexical analyzer +interpret the EREs given in the +.IR lex +source according to the environment specified when the lexical analyzer +is executed, but this is not possible with the current +.IR lex +technology. +.P +The description of octal and hexadecimal-digit escape sequences agrees +with the ISO\ C standard use of escape sequences. +.P +Earlier versions of this standard allowed for implementations with +bytes other than eight bits, but this has been modified in this +version. +.P +There is no detailed output format specification. The observed behavior +of +.IR lex +under four different historical implementations was that none of these +implementations consistently reported the line numbers for error and +warning messages. Furthermore, there was a desire that +.IR lex +be allowed to output additional diagnostic messages. Leaving message +formats unspecified avoids these formatting questions and problems with +internationalization. 
+.P +Although the +.BR %x +specifier for +.IR exclusive +start conditions is not historical practice, it is believed to be a +minor change to historical implementations and greatly enhances the +usability of +.IR lex +programs since it permits an application to obtain the expected +functionality with fewer statements. +.P +The +.BR %array +and +.BR %pointer +declarations were added as a compromise between historical systems. +The System V-based +.IR lex +copies the matched text to a +.IR yytext +array. The +.IR flex +program, supported in BSD and GNU systems, uses a pointer. In the +latter case, significant performance improvements are available for +some scanners. Most historical programs should require no change in +porting from one system to another because the string being referenced +is null-terminated in both cases. (The method used by +.IR flex +in its case is to null-terminate the token in place by remembering the +character that used to come right after the token and replacing it +before continuing on to the next scan.) Multi-file programs with +external references to +.IR yytext +outside the scanner source file should continue to operate on their +historical systems, but would require one of the new declarations to be +considered strictly portable. +.P +The description of EREs avoids unnecessary duplication of ERE details +because their meanings within a +.IR lex +ERE are the same as that for the ERE in this volume of POSIX.1\(hy2017. +.P +The reason for the undefined condition associated with text beginning +with a +<blank> +or within +.BR \(dq%{\(dq +and +.BR \(dq%}\(dq +delimiter lines appearing in the +.IR Rules +section is historical practice. Both the BSD and System V +.IR lex +copy the indented (or enclosed) input in the +.IR Rules +section (except at the beginning) to unreachable areas of the +\fIyylex\fR() +function (the code is written directly after a +.IR break +statement). 
In some cases, the System V +.IR lex +generates an error message or a syntax error, depending on the form of +indented input. +.P +The intention in breaking the list of functions into those that may +appear in +.BR lex.yy.c +\fIversus\fR those that only appear in +.BR libl.a +is that only those functions in +.BR libl.a +can be reliably redefined by a conforming application. +.P +The descriptions of standard output and standard error are somewhat +complicated because historical +.IR lex +implementations chose to issue diagnostic messages to standard output +(unless +.BR \-t +was given). POSIX.1\(hy2008 allows this behavior, but leaves an opening +for the more expected behavior of using standard error for diagnostics. +Also, the System V behavior of writing the statistics when any table +sizes are given is allowed, while BSD-derived systems can avoid it. The +programmer can always precisely obtain the desired results by using +either the +.BR \-t +or +.BR \-n +options. +.P +The OPERANDS section does not mention the use of +.BR \- +as a synonym for standard input; not all historical implementations +support such usage for any of the +.IR file +operands. +.P +A description of the +.IR "translation table" +was deleted from early proposals because of its relatively low usage in +historical applications. +.P +The change to the definition of the +\fIinput\fR() +function that allows buffering of input presents the opportunity for +major performance gains in some applications. +.P +The following examples clarify the differences between +.IR lex +regular expressions and regular expressions appearing elsewhere in +\&this volume of POSIX.1\(hy2017. For regular expressions of the form +.BR \(dqr/x\(dq , +the string matching +.IR r +is always returned; confusion may arise when the beginning of +.IR x +matches the trailing portion of +.IR r . 
+For example, given the regular expression +.BR \(dqa*b/cc\(dq +and the input +.BR \(dqaaabcc\(dq , +.IR yytext +would contain the string +.BR \(dqaaab\(dq +on this match. But given the regular expression +.BR \(dqx*/xy\(dq +and the input +.BR \(dqxxxy\(dq , +the token +.BR xxx , +not +.BR xx , +is returned by some implementations because +.BR xxx +matches +.BR \(dqx*\(dq . +.P +In the rule +.BR \(dqab*/bc\(dq , +the +.BR \(dqb*\(dq +at the end of +.IR r +extends +.IR r 's +match into the beginning of the trailing context, so the result is +unspecified. If this rule were +.BR \(dqab/bc\(dq , +however, the rule matches the text +.BR \(dqab\(dq +when it is followed by the text +.BR \(dqbc\(dq . +In this latter case, the matching of +.IR r +cannot extend into the beginning of +.IR x , +so the result is specified. +.SH "FUTURE DIRECTIONS" +None. +.SH "SEE ALSO" +.IR "\fIc99\fR\^", +.IR "\fIed\fR\^", +.IR "\fIyacc\fR\^" +.P +The Base Definitions volume of POSIX.1\(hy2017, +.IR "Chapter 5" ", " "File Format Notation", +.IR "Chapter 8" ", " "Environment Variables", +.IR "Chapter 9" ", " "Regular Expressions", +.IR "Section 12.2" ", " "Utility Syntax Guidelines" +.\" +.SH COPYRIGHT +Portions of this text are reprinted and reproduced in electronic form +from IEEE Std 1003.1-2017, Standard for Information Technology +-- Portable Operating System Interface (POSIX), The Open Group Base +Specifications Issue 7, 2018 Edition, +Copyright (C) 2018 by the Institute of +Electrical and Electronics Engineers, Inc and The Open Group. +In the event of any discrepancy between this version and the original IEEE and +The Open Group Standard, the original IEEE and The Open Group Standard +is the referee document. The original Standard can be obtained online at +http://www.opengroup.org/unix/online.html . +.PP +Any typographical or formatting errors that appear +in this page are most likely +to have been introduced during the conversion of the source files to +man page format. 
To report such errors, see +https://www.kernel.org/doc/man-pages/reporting_bugs.html .