diff options
Diffstat (limited to 'third_party/rust/jsparagus/js-quirks.md')
-rw-r--r-- | third_party/rust/jsparagus/js-quirks.md | 1036 |
1 files changed, 1036 insertions, 0 deletions
diff --git a/third_party/rust/jsparagus/js-quirks.md b/third_party/rust/jsparagus/js-quirks.md new file mode 100644 index 0000000000..20c621c92a --- /dev/null +++ b/third_party/rust/jsparagus/js-quirks.md @@ -0,0 +1,1036 @@ +## JS syntactic quirks + +> *To make a labyrinth, it takes* +> *Some good intentions, some mistakes.* +> —A. E. Stallings, “Daedal” + +JavaScript is rather hard to parse. Here is an in-depth accounting of +its syntactic quirks, with an eye toward actually implementing a parser +from scratch. + +With apologies to the generous people who work on the standard. Thanks +for doing that—better you than me. + +Thanks to [@bakkot](https://github.com/bakkot) and +[@mathiasbynens](https://github.com/mathiasbynens) for pointing out +several additional quirks. + +Problems are rated in terms of difficulty, from `(*)` = easy to `(***)` += hard. We’ll start with the easiest problems. + + +### Dangling else (*) + +If you know what this is, you may be excused. + +Statements like `if (EXPR) STMT if (EXPR) STMT else STMT` +are straight-up ambiguous in the JS formal grammar. +The ambiguity is resolved with +[a line of specification text](https://tc39.es/ecma262/#sec-if-statement): + +> Each `else` for which the choice of associated `if` is ambiguous shall +> be associated with the nearest possible `if` that would otherwise have +> no corresponding `else`. + +I love this sentence. Something about it cracks me up, I dunno... + +In a recursive descent parser, just doing the dumbest possible thing +correctly implements this rule. + +A parser generator has to decide what to do. In Yacc, you can use +operator precedence for this. + +Yacc aside: This should seem a little outrageous at first, as `else` is +hardly an operator. It helps if you understand what Yacc is doing. In LR +parsers, this kind of ambiguity in the grammar manifests as a +shift-reduce conflict. In this case, when we’ve already parsed `if ( +Expression ) Statement if ( Expression ) Statement` +and are looking at `else`, it’s unclear to Yacc +whether to reduce the if-statement or shift `else`. Yacc does not offer +a feature that lets us just say "always shift `else` here"; but there +*is* a Yacc feature that lets us resolve shift-reduce conflicts in a +rather odd, indirect way: operator precedence. We can resolve this +conflict by making `else` higher-precedence than the preceding symbol +`)`. + +Alternatively, I believe it’s equivalent to add "[lookahead ≠ `else`]" +at the end of the IfStatement production that doesn’t have an `else`. + + +### Other ambiguities and informal parts of the spec (*) + +Not all of the spec is as formal as it seems at first. Most of the stuff +in this section is easy to deal with, but #4 is special. + +1. The lexical grammar is ambiguous: when looking at the characters `<<=`, + there is the question of whether to parse that as one token `<<=`, two + tokens (`< <=` or `<< =`), or three (`< < =`). + + Of course every programming language has this, and the fix is one + sentence of prose in the spec: + + > The source text is scanned from left to right, repeatedly taking the + > longest possible sequence of code points as the next input element. + + This is easy enough for hand-coded lexers, and for systems that are + designed to use separate lexical and syntactic grammars. (Other + parser generators may need help to avoid parsing `functionf(){}` as + a function.) + +2. The above line of prose does not apply *within* input elements, in + components of the lexical grammar. In those cases, the same basic + idea ("maximum munch") is specified using lookahead restrictions at + the end of productions: + + > *LineTerminatorSequence* :: + > <LF> + > <CR>[lookahead ≠ <LF>] + > <LS> + > <PS> + > <CR><LF> + + The lookahead restriction prevents a CR LF sequence from being + parsed as two adjacent *LineTerminatorSequence*s. + + This technique is used in several places, particularly in + [*NotEscapeSequences*](https://tc39.es/ecma262/#prod-NotEscapeSequence). + +3. Annex B.1.4 extends the syntax for regular expressions, making the + grammar ambiguous. Again, a line of prose explains how to cope: + + > These changes introduce ambiguities that are broken by the + > ordering of grammar productions and by contextual + > information. When parsing using the following grammar, each + > alternative is considered only if previous production alternatives + > do not match. + +4. Annex B.1.2 extends the syntax of string literals to allow legacy + octal escape sequences, like `\033`. It says: + + > The syntax and semantics of 11.8.4 is extended as follows except + > that this extension is not allowed for strict mode code: + + ...followed by a new definition of *EscapeSequence*. + + So there are two sets of productions for *EscapeSequence*, and an + implementation is required to implement both and dynamically switch + between them. + + This means that `function f() { "\033"; "use strict"; }` is a + SyntaxError, even though the octal escape is scanned before we know + we're in strict mode. + +For another ambiguity, see "Slashes" below. + + +### Unicode quirks + +JavaScript source is Unicode and usually follows Unicode rules for thing +like identifiers and whitespace, but it has a few special cases: `$`, +`_`, `U+200C ZERO WIDTH NON-JOINER`, and `U+200D ZERO WIDTH JOINER` are +legal in identifiers (the latter two only after the first character), and +`U+FEFF ZERO WIDTH NO-BREAK SPACE` (also known as the byte-order mark) is +treated as whitespace. + +It also allows any code point, including surrogate halves, even though the +Unicode standard says that unpaired surrogate halves should be treated as +encoding errors. + + +### Legacy octal literals and escape sequences (*) + +This is more funny than difficult. + +In a browser, in non-strict code, every sequence of decimal digits (not +followed by an identifier character) is a *NumericLiteral* token. + +If it starts with `0`, with more digits after, then it's a legacy Annex +B.1.1 literal. If the token contains an `8` or a `9`, it's a decimal +number. Otherwise, hilariously, it's octal. + +``` +js> [067, 068, 069, 070] +[55, 68, 69, 56] +``` + +There are also legacy octal escape sequences in strings, and these have +their own quirks. `'\07' === '\u{7}'`, but `'\08' !== '\u{8}'` since 8 +is not an octal digit. Instead `'\08' === '\0' + '8'`, because `\0` +followed by `8` or `9` is a legacy octal escape sequence representing +the null character. (Not to be confused with `\0` in strict code, not +followed by a digit, which still represents the null character, but +doesn't count as octal.) + +None of this is hard to implement, but figuring out what the spec says +is hard. + + +### Strict mode (*) + +*(entangled with: lazy compilation)* + +A script or function can start with this: + +```js +"use strict"; +``` + +This enables ["strict mode"](https://tc39.es/ecma262/#sec-strict-mode-of-ecmascript). +Additionally, all classes and modules are strict mode code. + +Strict mode has both parse-time and run-time effects. Parse-time effects +include: + +* Strict mode affects the lexical grammar: octal integer literals are + SyntaxErrors, octal character escapes are SyntaxErrors, and a + handful of words like `private` and `interface` are reserved (and + thus usually SyntaxErrors) in strict mode. + + Like the situation with slashes, this means it is not possible to + implement a complete lexer for JS without also parsing—at least + enough to detect class boundaries, "use strict" directives in + functions, and function boundaries. + +* It’s a SyntaxError to have bindings named `eval` or `arguments` in + strict mode code, or to assign to `eval` or `arguments`. + +* It’s a SyntaxError to have two argument bindings with the same name + in a strict function. + + Interestingly, you don’t always know if you’re in strict mode or not + when parsing arguments. + + ```js + function foo(a, a) { + "use strict"; + } + ``` + + When the implementation reaches the Use Strict Directive, it must + either know that `foo` has two arguments both named `a`, or switch + to strict mode, go back, and reparse the function from the + beginning. + + Fortunately an Early Error rule prohibits mixing `"use strict"` with + more complex parameter lists, like `function foo(x = eval('')) {`. + +* The expression syntax “`delete` *Identifier*” and the abominable + *WithStatement* are banned in strict mode. + + +### Conditional keywords (**) + +In some programming languages, you could write a lexer that has rules +like + +* When you see `if`, return `Token::If`. + +* When you see something like `apple` or `arrow` or `target`, + return `Token::Identifier`. + +Not so in JavaScript. The input `if` matches both the terminal `if` and +the nonterminal *IdentifierName*, both of which appear in the high-level +grammar. The same goes for `target`. + +This poses a deceptively difficult problem for table-driven parsers. +Such parsers run on a stream of token-ids, but the question of which +token-id to use for a word like `if` or `target` is ambiguous. The +current parser state can't fully resolve the ambiguity: there are cases +like `class C { get` where the token `get` might match either as a +keyword (the start of a getter) or as an *IdentifierName* (a method or +property named `get`) in different grammatical productions. + +All keywords are conditional, but some are more conditional than others. +The rules are inconsistent to a tragicomic extent. Keywords like `if` +that date back to JavaScript 1.0 are always keywords except when used as +property names or method names. They can't be variable names. Two +conditional keywords (`await` and `yield`) are in the *Keyword* list; +the rest are not. New syntax that happened to be introduced around the +same time as strict mode was awarded keyword status in strict mode. The +rules are scattered through the spec. All this interacts with `\u0065` +Unicode escape sequences somehow. It’s just unbelievably confusing. + +(After writing this section, I +[proposed revisions to the specification](https://github.com/tc39/ecma262/pull/1694) +to make it a little less confusing.) + +* Thirty-six words are always reserved: + + > `break` `case` `catch` `class` `const` `continue` `debugger` + > `default` `delete` `do` `else` `enum` `export` `extends` `false` + > `finally` `for` `function` `if` `import` `in` `instanceof` `new` + > `null` `return` `super` `switch` `this` `throw` `true` `try` + > `typeof` `var` `void` `while` `with` + + These tokens can't be used as names of variables or arguments. + They're always considered special *except* when used as property + names, method names, or import/export names in modules. + + ```js + // property names + let obj = {if: 3, function: 4}; + assert(obj.if == 3); + + // method names + class C { + if() {} + function() {} + } + + // imports and exports + import {if as my_if} from "modulename"; + export {my_if as if}; + ``` + +* Two more words, `yield` and `await`, are in the *Keyword* list but + do not always act like keywords in practice. + + * `yield` is a *Keyword*; but it can be used as an identifier, + except in generators and strict mode code. + + This means that `yield - 1` is valid both inside and outside + generators, with different meanings. Outside a generator, it’s + subtraction. Inside, it yields the value **-1**. + + That reminds me of the Groucho Marx line: Outside of a dog, a + book is a man’s best friend. Inside of a dog it’s too dark to + read. + + * `await` is like that, but in async functions. Also it’s not a + valid identifier in modules. + + Conditional keywords are entangled with slashes: `yield /a/g` is two + tokens in a generator but five tokens elsewhere. + +* In strict mode code, `implements`, `interface`, `package`, + `private`, `protected`, and `public` are reserved (via Early Errors + rules). + + This is reflected in the message and location information for + certain syntax errors: + + ``` + SyntaxError: implements is a reserved identifier: + class implements {} + ......^ + + SyntaxError: implements is a reserved identifier: + function implements() { "use strict"; } + ....................................^ + ``` + +* `let` is not a *Keyword* or *ReservedWord*. Usually it can be an + identifier. It is special at the beginning of a statement or after + `for (` or `for await (`. + + ```js + var let = [new Date]; // ok: let as identifier + let v = let; // ok: let as keyword, then identifier + let let; // SyntaxError: banned by special early error rule + let.length; // ok: `let .` -> ExpressionStatement + let[0].getYear(); // SyntaxError: `let [` -> LexicalDeclaration + ``` + + In strict mode code, `let` is reserved. + +* `static` is similar. It’s a valid identifier, except in strict + mode. It’s only special at the beginning of a *ClassElement*. + + In strict mode code, `static` is reserved. + +* `async` is similar, but trickier. It’s an identifier. It is special + only if it’s marking an `async` function, method, or arrow function + (the tough case, since you won’t know it’s an arrow function until + you see the `=>`, possibly much later). + + ```js + function async() {} // normal function named "async" + + async(); // ok, `async` is an Identifier; function call + async() => {}; // ok, `async` is not an Identifier; async arrow function + ``` + +* `of` is special only in one specific place in `for-of` loop syntax. + + ```js + var of = [1, 2, 3]; + for (of of of) console.log(of); // logs 1, 2, 3 + ``` + + Amazingly, both of the following are valid JS code: + + ```js + for (async of => {};;) {} + for (async of []) {} + ``` + + In the first line, `async` is a keyword and `of` is an identifier; + in the second line it's the other way round. + + Even a simplified JS grammar can't be LR(1) as long as it includes + the features used here! + +* `get` and `set` are special only in a class or an object literal, + and then only if followed by a PropertyName: + + ```js + var obj1 = {get: f}; // `get` is an identifier + var obj2 = {get x() {}}; // `get` means getter + + class C1 { get = 3; } // `get` is an identifier + class C2 { get x() {} } // `get` means getter + ``` + +* `target` is special only in `new.target`. + +* `arguments` and `eval` can't be binding names, and can't be assigned + to, in strict mode code. + +To complicate matters, there are a few grammatical contexts where both +*IdentifierName* and *Identifier* match. For example, after `var {` +there are two possibilities: + +```js +// Longhand properties: BindingProperty -> PropertyName -> IdentifierName +var { xy: v } = obj; // ok +var { if: v } = obj; // ok, `if` is an IdentifierName + +// Shorthand properties: BindingProperty -> SingleNameBinding -> BindingIdentifier -> Identifier +var { xy } = obj; // ok +var { if } = obj; // SyntaxError: `if` is not an Identifier +``` + + +### Escape sequences in keywords + +*(entangled with: conditional keywords, ASI)* + +You can use escape sequences to write variable and property names, but +not keywords (including contextual keywords in contexts where they act +as keywords). + +So `if (foo) {}` and `{ i\u0066: 0 }` are legal but `i\u0066 (foo)` is not. + +And you don't necessarily know if you're lexing a contextual keyword +until the next token: `({ g\u0065t: 0 })` is legal, but +`({ g\u0065t x(){} })` is not. + +And for `let` it's even worse: `l\u0065t` by itself is a legal way to +reference a variable named `let`, which means that + +```js +let +x +``` +declares a variable named `x`, while, thanks to ASI, + +```js +l\u0065t +x +``` +is a reference to a variable named `let` followed by a reference to a +variable named `x`. + + +### Early errors (**) + +*(entangled with: lazy parsing, conditional keywords, ASI)* + +Some early errors are basically syntactic. Others are not. + +This is entangled with lazy compilation: "early errors" often involve a +retrospective look at an arbitrarily large glob of code we just parsed, +but in Beast Mode we’re not building an AST. In fact we would like to be +doing as little bookkeeping as possible. + +Even setting that aside, every early error is a special case, and it’s +just a ton of rules that all have to be implemented by hand. + +Here are some examples of Early Error rules—setting aside restrictions +that are covered adequately elsewhere: + +* Rules about names: + + * Rules that affect the set of keywords (character sequences that + match *IdentifierName* but are not allowed as binding names) based + on whether or not we’re in strict mode code, or in a + *Module*. Affected identifiers include `arguments`, `eval`, `yield`, + `await`, `let`, `implements`, `interface`, `package`, `private`, + `protected`, `public`, `static`. + + * One of these is a strangely worded rule which prohibits using + `yield` as a *BindingIdentifier*. At first blush, this seems + like it could be enforced in the grammar, but that approach + would make this a valid program, due to ASI: + + ```js + let + yield 0; + ``` + + Enforcing the same rule using an Early Error prohibits ASI here. + It works by exploiting the detailed inner workings of ASI case + 1, and arranging for `0` to be "the offending token" rather than + `yield`. + + * Lexical variable names have to be unique within a scope: + + * Lexical variables (`let` and `const`) can’t be declared more + than once in a block, or both lexically declared and + declared with `var`. + + * Lexically declared variables in a function body can’t have the same + name as argument bindings. + + * A lexical variable can’t be named `let`. + + * Common-sense rules dealing with unicode escape sequences in + identifiers. + +* Common-sense rules about regular expression literals. (They have to + actually be valid regular expressions, and unrecognized flags are + errors.) + +* The number of string parts that a template string can have is + limited to 2<sup>32</sup> − 1. + +* Invalid Unicode escape sequences, like `\7` or `\09` or `\u{3bjq`, are + banned in non-tagged templates (in tagged templates, they are allowed). + +* The *SuperCall* syntax is allowed only in derived class + constructors. + +* `const x;` without an initializer is a Syntax Error. + +* A direct substatement of an `if` statement, loop statement, or + `with` statement can’t be a labelled `function`. + +* Early errors are used to hook up cover grammars. + + * Early errors are also used in one case to avoid having to + specify a very large refinement grammar when *ObjectLiteral* + almost covers *ObjectAssignmentPattern*: + [sorry, too complicated to explain](https://tc39.es/ecma262/#sec-object-initializer-static-semantics-early-errors). + +* Early errors are sometimes used to prevent parsers from needing to + backtrack too much. + + * When parsing `async ( x = await/a/g )`, you don't know until the + next token if this is an async arrow or a call to a function named + `async`. This means you can't even tokenize properly, because in + the former case the thing following `x =` is two divisions and in + the latter case it's an *AwaitExpression* of a regular expression. + So an Early Error forbids having `await` in parameters at all, + allowing parsers to immediately throw an error if they find + themselves in this case. + +Many strict mode rules are enforced using Early Errors, but others +affect runtime semantics. + +<!-- +* I think the rules about assignment targets are related to cover + grammars. Not sure. + + * Expressions used in assignment context (i.e. as the operands of `++` + and `--`, the left operand of assignment, the loop variable of a + `for-in/for-of` loop, or sub-targets in destructuring) must be valid + assignment targets. + + * In destructuring assignment, `[...EXPR] = x` is an error if `EXPR` + is an array or object destructuring target. + +--> + + +### Boolean parameters (**) + +Some nonterminals are parameterized. (Search for “parameterized +production” in [this spec +section](https://tc39.es/ecma262/#sec-grammar-notation).) + +Implemented naively (e.g. by macro expansion) in a parser generator, +each parameter could nearly double the size of the parser. Instead, the +parameters must be tracked at run time somehow. + + +### Lookahead restrictions (**) + +*(entangled with: restricted productions)* + +TODO (I implemented this by hacking the entire LR algorithm. Most every +part of it is touched, although in ways that seem almost obvious once +you understand LR inside and out.) + +(Note: It may seem like all of the lookahead restrictions in the spec +are really just a way of saying “this production takes precedence over +that one”—for example, that the lookahead restriction on +*ExpressionStatement* just means that other productions for statements +and declarations take precedence over it. But that isn't accurate; you +can't have an *ExpressionStatement* that starts with `{`, even if it +doesn't parse as a *Block* or any other kind of statement.) + + +### Automatic Semicolon Insertion (**) + +*(entangled with: restricted productions, slashes)* + +Most semicolons at the end of JS statements and declarations “may be +omitted from the source text in certain situations”. This is called +[Automatic Semicolon +Insertion](https://tc39.es/ecma262/#sec-automatic-semicolon-insertion), +or ASI for short. + +The specification for this feature is both very-high-level and weirdly +procedural (“When, as the source text is parsed from left to right, a +token is encountered...”, as if the specification is telling a story +about a browser. As far as I know, this is the only place in the spec +where anything is assumed or implied about the internal implementation +details of parsing.) But it would be hard to specify ASI any other way. + +Wrinkles: + +1. Whitespace is significant (including whitespace inside comments). + Most semicolons in the grammar are optional only at the end of a + line (or before `}`, or at the end of the program). + +2. The ending semicolon of a `do`-`while` statement is extra optional. + You can always omit it. + +3. A few semicolons are never optional, like the semicolons in `for (;;)`. + + This means there’s a semicolon in the grammar that is optionally + optional! This one: + + > *LexicalDeclaration* : *LetOrConst* *BindingList* `;` + + It’s usually optional, but not if this is the *LexicalDeclaration* + in `for (let i = 0; i < 9; i++)`! + +4. Semicolons are not inserted only as a last resort to avoid + SyntaxErrors. That turned out to be too error-prone, so there are + also *restricted productions* (see below), where semicolons are more + aggressively inferred. + +5. In implementations, ASI interacts with the ambiguity of *slashes* + (see below). + +A recursive descent parser implements ASI by calling a special method +every time it needs to parse a semicolon that might be optional. The +special method has to peek at the next token and consume it only if it’s +a semicolon. This would not be so bad if it weren’t for slashes. + +In a parser generator, ASI can be implemented using an error recovery +mechanism. + +I think the [error recovery mechanism in +yacc/Bison](https://www.gnu.org/software/bison/manual/bison.html#Error-Recovery) +is too imprecise—when an error happens, it discards states from the +stack searching for a matching error-handling rule. The manual says +“Error recovery strategies are necessarily guesses.” + +But here’s a slightly more careful error recovery mechanism that could +do the job: + +1. For each production in the ES spec grammar where ASI could happen, e.g. + + ``` + ImportDeclaration ::= `import` ModuleSpecifier `;` + { import_declaration($2); } + ``` + + add an ASI production, like this: + + ``` + ImportDeclaration ::= `import` ModuleSpecifier [ERROR] + { check_asi(); import_declaration($2); } + ``` + + What does this mean? This production can be matched, like any other + production, but it's a fallback. All other productions take + precedence. + +2. While generating the parser, treat `[ERROR]` as a terminal + symbol. It can be included in start sets and follow sets, lookahead, + and so forth. + +3. At run time, when an error happens, synthesize an `[ERROR]` token. + Let that bounce through the state machine. It will cause zero or + more reductions. Then, it might actually match a production that + contains `[ERROR]`, like the ASI production above. + + Otherwise, we’ll get another error—the entry in the parser table for + an `[ERROR]` token at this state will be an `error` entry. Then we + really have a syntax error. + +This solves most of the ASI issues: + +* [x] Whitespace sensitivity: That's what `check_asi()` is for. It + should signal an error if we're not at the end of a line. + +* [x] Special treatment of `do`-`while` loops: Make an error production, + but don't `check_asi()`. + +* [x] Rule banning ASI in *EmptyStatement* or `for(;;)`: + Easy, don't create error productions for those. + + * [x] Banning ASI in `for (let x=1 \n x<9; x++)`: Manually adjust + the grammar, copying *LexicalDeclaration* so that there's a + *LexicalDeclarationNoASI* production used only by `for` + statements. Not a big deal, as it turns out. + +* [x] Slashes: Perhaps have `check_asi` reset the lexer to rescan the + next token, if it starts with `/`. + +* [ ] Restricted productions: Not solved. Read on. + + +### Restricted productions (**) + +*(entangled with: ASI, slashes)* + +Line breaks aren’t allowed in certain places. For example, the following +is not a valid program: + + throw // SyntaxError + new Error(); + +For another example, this function contains two statements, not one: + + function f(g) { + return // ASI + g(); + } + +The indentation is misleading; actually ASI inserts a semicolon at the +end of the first line: `return; g();`. (This function always returns +undefined. The second statement is never reached.) + +These restrictions apply even to multiline comments, so the function + +```js +function f(g) { + return /* + */ g(); +} +``` +contains two statements, just as the previous example did. + +I’m not sure why these rules exist, but it’s probably because (back in +the Netscape days) users complained about the bizarre behavior of +automatic semicolon insertion, and so some special do-what-I-mean hacks +were put in. + +This is specified with a weird special thing in the grammar: + +> *ReturnStatement* : `return` [no *LineTerminator* here] *Expression* `;` + +This is called a *restricted production*, and it’s unfortunately +necessary to go through them one by one, because there are several +kinds. Note that the particular hack required to parse them in a +recursive descent parser is a little bit different each time. + +* After `continue`, `break`, or `return`, a line break triggers ASI. + + The relevant productions are all statements, and in each case + there’s an alternative production that ends immediately with a + semicolon: `continue ;` `break ;` and `return ;`. + + Note that the alternative production is *not* restricted: e.g. a + *LineTerminator* can appear between `return` and `;`: + + ```js + if (x) + return // ok + ; + else + f(); + ``` + +* After `throw`, a line break is a SyntaxError. + +* After `yield`, a line break terminates the *YieldExpression*. + + Here the alternative production is simply `yield`, not `yield ;`. + +* In a post-increment or post-decrement expression, there can’t be a + line break before `++` or `--`. + + The purpose of this rule is subtle. It triggers ASI and thus prevents + syntax errors: + + ```js + var x = y // ok: semicolon inserted here + ++z; + ``` + + Without the restricted production, `var x = y ++` would parse + successfully, and the “offending token” would be `z`. It would be + too late for ASI. + + However, the restriction can of course also *cause* a SyntaxError: + + ```js + var x = (y + ++); // SyntaxError + ``` + +As we said, recursive descent parsers can implement these rules with hax. + +In a generated parser, there are a few possible ways to implement +them. Here are three. If you are not interested in ridiculous +approaches, you can skip the first two. + +* Treat every token as actually a different token when it appears + after a line break: `TokenType::LeftParen` and + `TokenType::LeftParenAfterLineBreak`. Of course the parser + generator can treat these exactly the same in normal cases, and + automatically generate identical table entries (or whatever) except + in states where there’s a relevant restricted production. + +* Add a special LineTerminator token. Normally, the lexer skips + newlines and never emits this token. However, if the current state + has a relevant restricted production, the lexer knows this and emits + a LineTerminator for the first line break it sees; and the parser + uses that token to trigger an error or transition to another state, + as appropriate. + +* When in a state that has a relevant restricted production, change + states if there’s a line break before the next token. That is, + split each such state into two: the one we stay in when there’s not + a line break, and the one we jump to if there is a line break. + +In all cases it’ll be hard to have confidence that the resulting parser +generator is really sound. (That is, it might not properly reject all +ambiguous grammars.) I don’t know exactly what property of the few +special uses in the ES grammar makes them seem benign. + + +### Slashes (**) + +*(entangled with: ASI, restricted productions)* + +When you see `/` in a JS program, you don’t know if that’s a +division operator or the start of a regular expression unless you’ve +been paying attention up to that point. + +[The spec:](https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar) + +> There are several situations where the identification of lexical input +> elements is sensitive to the syntactic grammar context that is +> consuming the input elements. This requires multiple goal symbols for +> the lexical grammar. + +You might think the lexer could treat `/` as an operator only if the +previous token is one that can be the last token of an expression (a set +that includes literals, identifiers, `this`, `)`, `]`, and `}`). To see +that this does not work, consider: + +```js +{} /x/ // `/` after `}` is regexp +({} / 2) // `/` after `}` is division + +for (g of /(a)(b)/) {} // `/` after `of` is regexp +var of = 6; of / 2 // `/` after `of` is division + +throw /x/; // `/` after `throw` is regexp +Math.throw / 2; // `/` after `throw` is division + +++/x/.lastIndex; // `/` after `++` is regexp +n++ / 2; // `/` after `++` is division +``` + +So how can the spec be implemented? + +In a recursive descent parser, you have to tell the lexer which goal +symbol to use every time you ask for a token. And you have to make sure, +if you look ahead at a token, but *don’t* consume it, and fall back on +another path that can accept a *RegularExpressionLiteral* or +*DivPunctuator*, that you did not initially lex it incorrectly. We have +assertions for this and it is a bit of a nightmare when we have to touch +it (which is thankfully rare). Part of the problem is that the token +you’re peeking ahead at might not be part of the same production at all. +Thanks to ASI, it might be the start of the next statement, which will +be parsed in a faraway part of the Parser. + +A table-driven parser has it easy here! The lexer can consult the state +table and see which kind of token can be accepted in the current +state. This is closer to what the spec actually says. + +Two minor things to watch out for: + +* The nonterminal *ClassTail* is used both at the end of + *ClassExpression*, which may be followed by `/`; and at the end of + *ClassDeclaration*, which may be followed by a + *RegularExpressionLiteral* at the start of the next + statement. Canonical LR creates separate states for these two uses + of *ClassTail*, but the LALR algorithm will unify them, creating + some states that have both `/` and *RegularExpressionLiteral* in the + follow set. In these states, determining which terminal is actually + allowed requires looking not only at the current state, but at the + current stack of states (to see one level of grammatical context). + +* Since this decision depends on the parser state, and automatic + semicolon insertion adjusts the parser state, a parser may need to + re-scan a token after ASI. + +In other kinds of generated parsers, at least the lexical goal symbol +can be determined automatically. + + +### Lazy compilation and scoping (**) + +*(entangled with: arrow functions)* + +JS engines *lazily compile* function bodies. During parsing, when the +engine sees a `function`, it switches to a high-speed parsing mode +(which I will call “Beast Mode”) that just skims the function and checks +for syntax errors. Beast Mode does not compile the code. Beast Mode +doesn’t even create AST nodes. All that will be done later, on demand, +the first time the function is called. + +The point is to get through parsing *fast*, so that the script can start +running. In browsers, `<script>` compilation usually must finish before +we can show the user any meaningful content. Any part of it that can be +deferred must be deferred. (Bonus: many functions are never called, so +the work is not just deferred but avoided altogether.) + +Later, when a function is called, we can still defer compilation of +nested functions inside it. + +So what? Seems easy enough, right? Well... + +Local variables in JS are optimized. To generate reasonable code for the +script or function we are compiling, we need to know which of its local +variables are used by nested functions, which we are *not* currently +compiling. That is, we must know which variables *occur free in* each +nested function. So scoping, which otherwise could be done as a separate +phase of compilation, must be done during parsing in Beast Mode. + +Getting the scoping of arrow functions right while parsing is tricky, +because it’s often not possible to know when you’re entering a scope +until later. Consider parsing `(a`. This could be the beginning of an +arrow function, or not; we might not know until after we reach the +matching `)`, which could be a long way away. + +Annex B.3.3 adds extremely complex rules for scoping for function +declarations which makes this especially difficult. In + +```js +function f() { + let x = 1; + return function g() { + { + function x(){} + } + { + let x; + } + return x; + }; +} +``` + +the function `g` does not use the initial `let x`, but in + +```js +function f() { + let x = 1; + return function g() { + { + { + function x(){} + } + let x; + } + return x; + }; +} +``` +it does. + +[Here](https://dev.to/rkirsling/tales-from-ecma-s-crypt-annex-b-3-3-56go) +is a good writeup of what's going on in these examples. + + +### Arrow functions, assignment, destructuring, and cover grammars (\*\*\*) + +*(entangled with: lazy compilation)* + +Suppose the parser is scanning text from start to end, when it sees a +statement that starts with something like this: + +```js +let f = (a +``` + +The parser doesn’t know yet if `(a ...` is *ArrowParameters* or just a +*ParenthesizedExpression*. Either is possible. We’ll know when either +(1) we see something that rules out the other case: + +```js +let f = (a + b // can't be ArrowParameters + +let f = (a, b, ...args // can't be ParenthesizedExpression +``` + +or (2) we reach the matching `)` and see if the next token is `=>`. + +Probably (1) is a pain to implement. + +To keep the language nominally LR(1), the standard specifies a “cover +grammar”, a nonterminal named +*CoverParenthesizedExpressionAndArrowParameterList* that is a superset +of both *ArrowParameters* and *ParenthesizedExpression*. The spec +grammar uses the *Cover* nonterminal in both places, so technically `let +f = (a + b) => 7;` and `let f = (a, b, ...args);` are both syntactically +valid *Script*s, according to the formal grammar. But there are Early +Error rules (that is, plain English text, not part of the formal +grammar) that say arrow functions’ parameters must match +*ArrowParameters*, and parenthesized expressions must match +*ParenthesizedExpression*. + +So after the initial parse, the implementation must somehow check that +the *CoverParenthesizedExpressionAndArrowParameterList* really is valid +in context. This complicates lazy compilation because in Beast Mode we +are not even building an AST. It’s not easy to go back and check. + +Something similar happens in a few other cases: the spec is written as +though syntactic rules are applied after the fact: + +* *CoverCallExpressionAndAsyncArrowHead* covers the syntax `async(x)` + which looks like a function call and also looks like the beginning + of an async arrow function. + +* *ObjectLiteral* is a cover grammar; it covers both actual object + literals and the syntax `propertyName = expr`, which is not valid in + object literals but allowed in destructuring assignment: + + ```js + var obj = {x = 1}; // SyntaxError, by an early error rule + + ({x = 1} = {}); // ok, assigns 1 to x + ``` + +* *LeftHandSideExpression* is used to the left of `=` in assignment, + and as the operand of postfix `++` and `--`. But this is way too lax; + most expressions shouldn’t be assigned to: + + ```js + 1 = 0; // matches the formal grammar, SyntaxError by an early error rule + + null++; // same + + ["a", "b", "c"]++; // same + ``` + + +## Conclusion + +What have we learned today? + +* Do not write a JS parser. + +* JavaScript has some syntactic horrors in it. But hey, you don't + make the world's most widely used programming language by avoiding + all mistakes. You do it by shipping a serviceable tool, in the right + circumstances, for the right users. |