summaryrefslogtreecommitdiffstats
path: root/third_party/rust/jsparagus/js-quirks.md
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-28 14:29:10 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-28 14:29:10 +0000
commit2aa4a82499d4becd2284cdb482213d541b8804dd (patch)
treeb80bf8bf13c3766139fbacc530efd0dd9d54394c /third_party/rust/jsparagus/js-quirks.md
parentInitial commit. (diff)
downloadfirefox-upstream.tar.xz
firefox-upstream.zip
Adding upstream version 86.0.1.upstream/86.0.1upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'third_party/rust/jsparagus/js-quirks.md')
-rw-r--r--third_party/rust/jsparagus/js-quirks.md1036
1 files changed, 1036 insertions, 0 deletions
diff --git a/third_party/rust/jsparagus/js-quirks.md b/third_party/rust/jsparagus/js-quirks.md
new file mode 100644
index 0000000000..20c621c92a
--- /dev/null
+++ b/third_party/rust/jsparagus/js-quirks.md
@@ -0,0 +1,1036 @@
+## JS syntactic quirks
+
+> *To make a labyrinth, it takes*
+> *Some good intentions, some mistakes.*
+> —A. E. Stallings, “Daedal”
+
+JavaScript is rather hard to parse. Here is an in-depth accounting of
+its syntactic quirks, with an eye toward actually implementing a parser
+from scratch.
+
+With apologies to the generous people who work on the standard. Thanks
+for doing that—better you than me.
+
+Thanks to [@bakkot](https://github.com/bakkot) and
+[@mathiasbynens](https://github.com/mathiasbynens) for pointing out
+several additional quirks.
+
+Problems are rated in terms of difficulty, from `(*)` = easy to `(***)`
+= hard. We’ll start with the easiest problems.
+
+
+### Dangling else (*)
+
+If you know what this is, you may be excused.
+
+Statements like `if (EXPR) STMT if (EXPR) STMT else STMT`
+are straight-up ambiguous in the JS formal grammar.
+The ambiguity is resolved with
+[a line of specification text](https://tc39.es/ecma262/#sec-if-statement):
+
+> Each `else` for which the choice of associated `if` is ambiguous shall
+> be associated with the nearest possible `if` that would otherwise have
+> no corresponding `else`.
+
+I love this sentence. Something about it cracks me up, I dunno...
+
+In a recursive descent parser, just doing the dumbest possible thing
+correctly implements this rule.
+
+A parser generator has to decide what to do. In Yacc, you can use
+operator precedence for this.
+
+Yacc aside: This should seem a little outrageous at first, as `else` is
+hardly an operator. It helps if you understand what Yacc is doing. In LR
+parsers, this kind of ambiguity in the grammar manifests as a
+shift-reduce conflict. In this case, when we’ve already parsed `if (
+Expression ) Statement if ( Expression ) Statement`
+and are looking at `else`, it’s unclear to Yacc
+whether to reduce the if-statement or shift `else`. Yacc does not offer
+a feature that lets us just say "always shift `else` here"; but there
+*is* a Yacc feature that lets us resolve shift-reduce conflicts in a
+rather odd, indirect way: operator precedence. We can resolve this
+conflict by making `else` higher-precedence than the preceding symbol
+`)`.
+
+Alternatively, I believe it’s equivalent to add "[lookahead ≠ `else`]"
+at the end of the IfStatement production that doesn’t have an `else`.
+
+
+### Other ambiguities and informal parts of the spec (*)
+
+Not all of the spec is as formal as it seems at first. Most of the stuff
+in this section is easy to deal with, but #4 is special.
+
+1. The lexical grammar is ambiguous: when looking at the characters `<<=`,
+ there is the question of whether to parse that as one token `<<=`, two
+ tokens (`< <=` or `<< =`), or three (`< < =`).
+
+ Of course every programming language has this, and the fix is one
+ sentence of prose in the spec:
+
+ > The source text is scanned from left to right, repeatedly taking the
+ > longest possible sequence of code points as the next input element.
+
+ This is easy enough for hand-coded lexers, and for systems that are
+ designed to use separate lexical and syntactic grammars. (Other
+ parser generators may need help to avoid parsing `functionf(){}` as
+ a function.)
+
+2. The above line of prose does not apply *within* input elements, in
+ components of the lexical grammar. In those cases, the same basic
+ idea ("maximum munch") is specified using lookahead restrictions at
+ the end of productions:
+
+ > *LineTerminatorSequence* ::
+ > &nbsp; &nbsp; &lt;LF&gt;
+ > &nbsp; &nbsp; &lt;CR&gt;[lookahead &ne; &lt;LF&gt;]
+ > &nbsp; &nbsp; &lt;LS&gt;
+ > &nbsp; &nbsp; &lt;PS&gt;
+ > &nbsp; &nbsp; &lt;CR&gt;&lt;LF&gt;
+
+ The lookahead restriction prevents a CR LF sequence from being
+ parsed as two adjacent *LineTerminatorSequence*s.
+
+ This technique is used in several places, particularly in
+ [*NotEscapeSequences*](https://tc39.es/ecma262/#prod-NotEscapeSequence).
+
+3. Annex B.1.4 extends the syntax for regular expressions, making the
+ grammar ambiguous. Again, a line of prose explains how to cope:
+
+ > These changes introduce ambiguities that are broken by the
+ > ordering of grammar productions and by contextual
+ > information. When parsing using the following grammar, each
+ > alternative is considered only if previous production alternatives
+ > do not match.
+
+4. Annex B.1.2 extends the syntax of string literals to allow legacy
+ octal escape sequences, like `\033`. It says:
+
+ > The syntax and semantics of 11.8.4 is extended as follows except
+ > that this extension is not allowed for strict mode code:
+
+ ...followed by a new definition of *EscapeSequence*.
+
+ So there are two sets of productions for *EscapeSequence*, and an
+ implementation is required to implement both and dynamically switch
+ between them.
+
+ This means that `function f() { "\033"; "use strict"; }` is a
+ SyntaxError, even though the octal escape is scanned before we know
+ we're in strict mode.
+
+For another ambiguity, see "Slashes" below.
+
+
+### Unicode quirks
+
+JavaScript source is Unicode and usually follows Unicode rules for thing
+like identifiers and whitespace, but it has a few special cases: `$`,
+`_`, `U+200C ZERO WIDTH NON-JOINER`, and `U+200D ZERO WIDTH JOINER` are
+legal in identifiers (the latter two only after the first character), and
+`U+FEFF ZERO WIDTH NO-BREAK SPACE` (also known as the byte-order mark) is
+treated as whitespace.
+
+It also allows any code point, including surrogate halves, even though the
+Unicode standard says that unpaired surrogate halves should be treated as
+encoding errors.
+
+
+### Legacy octal literals and escape sequences (*)
+
+This is more funny than difficult.
+
+In a browser, in non-strict code, every sequence of decimal digits (not
+followed by an identifier character) is a *NumericLiteral* token.
+
+If it starts with `0`, with more digits after, then it's a legacy Annex
+B.1.1 literal. If the token contains an `8` or a `9`, it's a decimal
+number. Otherwise, hilariously, it's octal.
+
+```
+js> [067, 068, 069, 070]
+[55, 68, 69, 56]
+```
+
+There are also legacy octal escape sequences in strings, and these have
+their own quirks. `'\07' === '\u{7}'`, but `'\08' !== '\u{8}'` since 8
+is not an octal digit. Instead `'\08' === '\0' + '8'`, because `\0`
+followed by `8` or `9` is a legacy octal escape sequence representing
+the null character. (Not to be confused with `\0` in strict code, not
+followed by a digit, which still represents the null character, but
+doesn't count as octal.)
+
+None of this is hard to implement, but figuring out what the spec says
+is hard.
+
+
+### Strict mode (*)
+
+*(entangled with: lazy compilation)*
+
+A script or function can start with this:
+
+```js
+"use strict";
+```
+
+This enables ["strict mode"](https://tc39.es/ecma262/#sec-strict-mode-of-ecmascript).
+Additionally, all classes and modules are strict mode code.
+
+Strict mode has both parse-time and run-time effects. Parse-time effects
+include:
+
+* Strict mode affects the lexical grammar: octal integer literals are
+ SyntaxErrors, octal character escapes are SyntaxErrors, and a
+ handful of words like `private` and `interface` are reserved (and
+ thus usually SyntaxErrors) in strict mode.
+
+ Like the situation with slashes, this means it is not possible to
+ implement a complete lexer for JS without also parsing—at least
+ enough to detect class boundaries, "use strict" directives in
+ functions, and function boundaries.
+
+* It’s a SyntaxError to have bindings named `eval` or `arguments` in
+ strict mode code, or to assign to `eval` or `arguments`.
+
+* It’s a SyntaxError to have two argument bindings with the same name
+ in a strict function.
+
+ Interestingly, you don’t always know if you’re in strict mode or not
+ when parsing arguments.
+
+ ```js
+ function foo(a, a) {
+ "use strict";
+ }
+ ```
+
+ When the implementation reaches the Use Strict Directive, it must
+ either know that `foo` has two arguments both named `a`, or switch
+ to strict mode, go back, and reparse the function from the
+ beginning.
+
+ Fortunately an Early Error rule prohibits mixing `"use strict"` with
+ more complex parameter lists, like `function foo(x = eval('')) {`.
+
+* The expression syntax “`delete` *Identifier*” and the abominable
+ *WithStatement* are banned in strict mode.
+
+
+### Conditional keywords (**)
+
+In some programming languages, you could write a lexer that has rules
+like
+
+* When you see `if`, return `Token::If`.
+
+* When you see something like `apple` or `arrow` or `target`,
+ return `Token::Identifier`.
+
+Not so in JavaScript. The input `if` matches both the terminal `if` and
+the nonterminal *IdentifierName*, both of which appear in the high-level
+grammar. The same goes for `target`.
+
+This poses a deceptively difficult problem for table-driven parsers.
+Such parsers run on a stream of token-ids, but the question of which
+token-id to use for a word like `if` or `target` is ambiguous. The
+current parser state can't fully resolve the ambiguity: there are cases
+like `class C { get` where the token `get` might match either as a
+keyword (the start of a getter) or as an *IdentifierName* (a method or
+property named `get`) in different grammatical productions.
+
+All keywords are conditional, but some are more conditional than others.
+The rules are inconsistent to a tragicomic extent. Keywords like `if`
+that date back to JavaScript 1.0 are always keywords except when used as
+property names or method names. They can't be variable names. Two
+conditional keywords (`await` and `yield`) are in the *Keyword* list;
+the rest are not. New syntax that happened to be introduced around the
+same time as strict mode was awarded keyword status in strict mode. The
+rules are scattered through the spec. All this interacts with `\u0065`
+Unicode escape sequences somehow. It’s just unbelievably confusing.
+
+(After writing this section, I
+[proposed revisions to the specification](https://github.com/tc39/ecma262/pull/1694)
+to make it a little less confusing.)
+
+* Thirty-six words are always reserved:
+
+ > `break` `case` `catch` `class` `const` `continue` `debugger`
+ > `default` `delete` `do` `else` `enum` `export` `extends` `false`
+ > `finally` `for` `function` `if` `import` `in` `instanceof` `new`
+ > `null` `return` `super` `switch` `this` `throw` `true` `try`
+ > `typeof` `var` `void` `while` `with`
+
+ These tokens can't be used as names of variables or arguments.
+ They're always considered special *except* when used as property
+ names, method names, or import/export names in modules.
+
+ ```js
+ // property names
+ let obj = {if: 3, function: 4};
+ assert(obj.if == 3);
+
+ // method names
+ class C {
+ if() {}
+ function() {}
+ }
+
+ // imports and exports
+ import {if as my_if} from "modulename";
+ export {my_if as if};
+ ```
+
+* Two more words, `yield` and `await`, are in the *Keyword* list but
+ do not always act like keywords in practice.
+
+ * `yield` is a *Keyword*; but it can be used as an identifier,
+ except in generators and strict mode code.
+
+ This means that `yield - 1` is valid both inside and outside
+ generators, with different meanings. Outside a generator, it’s
+ subtraction. Inside, it yields the value **-1**.
+
+ That reminds me of the Groucho Marx line: Outside of a dog, a
+ book is a man’s best friend. Inside of a dog it’s too dark to
+ read.
+
+ * `await` is like that, but in async functions. Also it’s not a
+ valid identifier in modules.
+
+ Conditional keywords are entangled with slashes: `yield /a/g` is two
+ tokens in a generator but five tokens elsewhere.
+
+* In strict mode code, `implements`, `interface`, `package`,
+ `private`, `protected`, and `public` are reserved (via Early Errors
+ rules).
+
+ This is reflected in the message and location information for
+ certain syntax errors:
+
+ ```
+ SyntaxError: implements is a reserved identifier:
+ class implements {}
+ ......^
+
+ SyntaxError: implements is a reserved identifier:
+ function implements() { "use strict"; }
+ ....................................^
+ ```
+
+* `let` is not a *Keyword* or *ReservedWord*. Usually it can be an
+ identifier. It is special at the beginning of a statement or after
+ `for (` or `for await (`.
+
+ ```js
+ var let = [new Date]; // ok: let as identifier
+ let v = let; // ok: let as keyword, then identifier
+ let let; // SyntaxError: banned by special early error rule
+ let.length; // ok: `let .` -> ExpressionStatement
+ let[0].getYear(); // SyntaxError: `let [` -> LexicalDeclaration
+ ```
+
+ In strict mode code, `let` is reserved.
+
+* `static` is similar. It’s a valid identifier, except in strict
+ mode. It’s only special at the beginning of a *ClassElement*.
+
+ In strict mode code, `static` is reserved.
+
+* `async` is similar, but trickier. It’s an identifier. It is special
+ only if it’s marking an `async` function, method, or arrow function
+ (the tough case, since you won’t know it’s an arrow function until
+ you see the `=>`, possibly much later).
+
+ ```js
+ function async() {} // normal function named "async"
+
+ async(); // ok, `async` is an Identifier; function call
+ async() => {}; // ok, `async` is not an Identifier; async arrow function
+ ```
+
+* `of` is special only in one specific place in `for-of` loop syntax.
+
+ ```js
+ var of = [1, 2, 3];
+ for (of of of) console.log(of); // logs 1, 2, 3
+ ```
+
+ Amazingly, both of the following are valid JS code:
+
+ ```js
+ for (async of => {};;) {}
+ for (async of []) {}
+ ```
+
+ In the first line, `async` is a keyword and `of` is an identifier;
+ in the second line it's the other way round.
+
+ Even a simplified JS grammar can't be LR(1) as long as it includes
+ the features used here!
+
+* `get` and `set` are special only in a class or an object literal,
+ and then only if followed by a PropertyName:
+
+ ```js
+ var obj1 = {get: f}; // `get` is an identifier
+ var obj2 = {get x() {}}; // `get` means getter
+
+ class C1 { get = 3; } // `get` is an identifier
+ class C2 { get x() {} } // `get` means getter
+ ```
+
+* `target` is special only in `new.target`.
+
+* `arguments` and `eval` can't be binding names, and can't be assigned
+ to, in strict mode code.
+
+To complicate matters, there are a few grammatical contexts where both
+*IdentifierName* and *Identifier* match. For example, after `var {`
+there are two possibilities:
+
+```js
+// Longhand properties: BindingProperty -> PropertyName -> IdentifierName
+var { xy: v } = obj; // ok
+var { if: v } = obj; // ok, `if` is an IdentifierName
+
+// Shorthand properties: BindingProperty -> SingleNameBinding -> BindingIdentifier -> Identifier
+var { xy } = obj; // ok
+var { if } = obj; // SyntaxError: `if` is not an Identifier
+```
+
+
+### Escape sequences in keywords
+
+*(entangled with: conditional keywords, ASI)*
+
+You can use escape sequences to write variable and property names, but
+not keywords (including contextual keywords in contexts where they act
+as keywords).
+
+So `if (foo) {}` and `{ i\u0066: 0 }` are legal but `i\u0066 (foo)` is not.
+
+And you don't necessarily know if you're lexing a contextual keyword
+until the next token: `({ g\u0065t: 0 })` is legal, but
+`({ g\u0065t x(){} })` is not.
+
+And for `let` it's even worse: `l\u0065t` by itself is a legal way to
+reference a variable named `let`, which means that
+
+```js
+let
+x
+```
+declares a variable named `x`, while, thanks to ASI,
+
+```js
+l\u0065t
+x
+```
+is a reference to a variable named `let` followed by a reference to a
+variable named `x`.
+
+
+### Early errors (**)
+
+*(entangled with: lazy parsing, conditional keywords, ASI)*
+
+Some early errors are basically syntactic. Others are not.
+
+This is entangled with lazy compilation: "early errors" often involve a
+retrospective look at an arbitrarily large glob of code we just parsed,
+but in Beast Mode we’re not building an AST. In fact we would like to be
+doing as little bookkeeping as possible.
+
+Even setting that aside, every early error is a special case, and it’s
+just a ton of rules that all have to be implemented by hand.
+
+Here are some examples of Early Error rules—setting aside restrictions
+that are covered adequately elsewhere:
+
+* Rules about names:
+
+ * Rules that affect the set of keywords (character sequences that
+ match *IdentifierName* but are not allowed as binding names) based
+ on whether or not we’re in strict mode code, or in a
+ *Module*. Affected identifiers include `arguments`, `eval`, `yield`,
+ `await`, `let`, `implements`, `interface`, `package`, `private`,
+ `protected`, `public`, `static`.
+
+ * One of these is a strangely worded rule which prohibits using
+ `yield` as a *BindingIdentifier*. At first blush, this seems
+ like it could be enforced in the grammar, but that approach
+ would make this a valid program, due to ASI:
+
+ ```js
+ let
+ yield 0;
+ ```
+
+ Enforcing the same rule using an Early Error prohibits ASI here.
+ It works by exploiting the detailed inner workings of ASI case
+ 1, and arranging for `0` to be "the offending token" rather than
+ `yield`.
+
+ * Lexical variable names have to be unique within a scope:
+
+ * Lexical variables (`let` and `const`) can’t be declared more
+ than once in a block, or both lexically declared and
+ declared with `var`.
+
+ * Lexically declared variables in a function body can’t have the same
+ name as argument bindings.
+
+ * A lexical variable can’t be named `let`.
+
+ * Common-sense rules dealing with unicode escape sequences in
+ identifiers.
+
+* Common-sense rules about regular expression literals. (They have to
+ actually be valid regular expressions, and unrecognized flags are
+ errors.)
+
+* The number of string parts that a template string can have is
+ limited to 2<sup>32</sup> &minus; 1.
+
+* Invalid Unicode escape sequences, like `\7` or `\09` or `\u{3bjq`, are
+ banned in non-tagged templates (in tagged templates, they are allowed).
+
+* The *SuperCall* syntax is allowed only in derived class
+ constructors.
+
+* `const x;` without an initializer is a Syntax Error.
+
+* A direct substatement of an `if` statement, loop statement, or
+ `with` statement can’t be a labelled `function`.
+
+* Early errors are used to hook up cover grammars.
+
+ * Early errors are also used in one case to avoid having to
+ specify a very large refinement grammar when *ObjectLiteral*
+ almost covers *ObjectAssignmentPattern*:
+ [sorry, too complicated to explain](https://tc39.es/ecma262/#sec-object-initializer-static-semantics-early-errors).
+
+* Early errors are sometimes used to prevent parsers from needing to
+ backtrack too much.
+
+ * When parsing `async ( x = await/a/g )`, you don't know until the
+ next token if this is an async arrow or a call to a function named
+ `async`. This means you can't even tokenize properly, because in
+ the former case the thing following `x =` is two divisions and in
+ the latter case it's an *AwaitExpression* of a regular expression.
+ So an Early Error forbids having `await` in parameters at all,
+ allowing parsers to immediately throw an error if they find
+ themselves in this case.
+
+Many strict mode rules are enforced using Early Errors, but others
+affect runtime semantics.
+
+<!--
+* I think the rules about assignment targets are related to cover
+ grammars. Not sure.
+
+ * Expressions used in assignment context (i.e. as the operands of `++`
+ and `--`, the left operand of assignment, the loop variable of a
+ `for-in/for-of` loop, or sub-targets in destructuring) must be valid
+ assignment targets.
+
+ * In destructuring assignment, `[...EXPR] = x` is an error if `EXPR`
+ is an array or object destructuring target.
+
+-->
+
+
+### Boolean parameters (**)
+
+Some nonterminals are parameterized. (Search for “parameterized
+production” in [this spec
+section](https://tc39.es/ecma262/#sec-grammar-notation).)
+
+Implemented naively (e.g. by macro expansion) in a parser generator,
+each parameter could nearly double the size of the parser. Instead, the
+parameters must be tracked at run time somehow.
+
+
+### Lookahead restrictions (**)
+
+*(entangled with: restricted productions)*
+
+TODO (I implemented this by hacking the entire LR algorithm. Most every
+part of it is touched, although in ways that seem almost obvious once
+you understand LR inside and out.)
+
+(Note: It may seem like all of the lookahead restrictions in the spec
+are really just a way of saying “this production takes precedence over
+that one”—for example, that the lookahead restriction on
+*ExpressionStatement* just means that other productions for statements
+and declarations take precedence over it. But that isn't accurate; you
+can't have an *ExpressionStatement* that starts with `{`, even if it
+doesn't parse as a *Block* or any other kind of statement.)
+
+
+### Automatic Semicolon Insertion (**)
+
+*(entangled with: restricted productions, slashes)*
+
+Most semicolons at the end of JS statements and declarations “may be
+omitted from the source text in certain situations”. This is called
+[Automatic Semicolon
+Insertion](https://tc39.es/ecma262/#sec-automatic-semicolon-insertion),
+or ASI for short.
+
+The specification for this feature is both very-high-level and weirdly
+procedural (“When, as the source text is parsed from left to right, a
+token is encountered...”, as if the specification is telling a story
+about a browser. As far as I know, this is the only place in the spec
+where anything is assumed or implied about the internal implementation
+details of parsing.) But it would be hard to specify ASI any other way.
+
+Wrinkles:
+
+1. Whitespace is significant (including whitespace inside comments).
+ Most semicolons in the grammar are optional only at the end of a
+ line (or before `}`, or at the end of the program).
+
+2. The ending semicolon of a `do`-`while` statement is extra optional.
+ You can always omit it.
+
+3. A few semicolons are never optional, like the semicolons in `for (;;)`.
+
+ This means there’s a semicolon in the grammar that is optionally
+ optional! This one:
+
+ > *LexicalDeclaration* : *LetOrConst* *BindingList* `;`
+
+ It’s usually optional, but not if this is the *LexicalDeclaration*
+ in `for (let i = 0; i < 9; i++)`!
+
+4. Semicolons are not inserted only as a last resort to avoid
+ SyntaxErrors. That turned out to be too error-prone, so there are
+ also *restricted productions* (see below), where semicolons are more
+ aggressively inferred.
+
+5. In implementations, ASI interacts with the ambiguity of *slashes*
+ (see below).
+
+A recursive descent parser implements ASI by calling a special method
+every time it needs to parse a semicolon that might be optional. The
+special method has to peek at the next token and consume it only if it’s
+a semicolon. This would not be so bad if it weren’t for slashes.
+
+In a parser generator, ASI can be implemented using an error recovery
+mechanism.
+
+I think the [error recovery mechanism in
+yacc/Bison](https://www.gnu.org/software/bison/manual/bison.html#Error-Recovery)
+is too imprecise—when an error happens, it discards states from the
+stack searching for a matching error-handling rule. The manual says
+“Error recovery strategies are necessarily guesses.”
+
+But here’s a slightly more careful error recovery mechanism that could
+do the job:
+
+1. For each production in the ES spec grammar where ASI could happen, e.g.
+
+ ```
+ ImportDeclaration ::= `import` ModuleSpecifier `;`
+ { import_declaration($2); }
+ ```
+
+ add an ASI production, like this:
+
+ ```
+ ImportDeclaration ::= `import` ModuleSpecifier [ERROR]
+ { check_asi(); import_declaration($2); }
+ ```
+
+ What does this mean? This production can be matched, like any other
+ production, but it's a fallback. All other productions take
+ precedence.
+
+2. While generating the parser, treat `[ERROR]` as a terminal
+ symbol. It can be included in start sets and follow sets, lookahead,
+ and so forth.
+
+3. At run time, when an error happens, synthesize an `[ERROR]` token.
+ Let that bounce through the state machine. It will cause zero or
+ more reductions. Then, it might actually match a production that
+ contains `[ERROR]`, like the ASI production above.
+
+ Otherwise, we’ll get another error—the entry in the parser table for
+ an `[ERROR]` token at this state will be an `error` entry. Then we
+ really have a syntax error.
+
+This solves most of the ASI issues:
+
+* [x] Whitespace sensitivity: That's what `check_asi()` is for. It
+ should signal an error if we're not at the end of a line.
+
+* [x] Special treatment of `do`-`while` loops: Make an error production,
+ but don't `check_asi()`.
+
+* [x] Rule banning ASI in *EmptyStatement* or `for(;;)`:
+ Easy, don't create error productions for those.
+
+ * [x] Banning ASI in `for (let x=1 \n x<9; x++)`: Manually adjust
+ the grammar, copying *LexicalDeclaration* so that there's a
+ *LexicalDeclarationNoASI* production used only by `for`
+ statements. Not a big deal, as it turns out.
+
+* [x] Slashes: Perhaps have `check_asi` reset the lexer to rescan the
+ next token, if it starts with `/`.
+
+* [ ] Restricted productions: Not solved. Read on.
+
+
+### Restricted productions (**)
+
+*(entangled with: ASI, slashes)*
+
+Line breaks aren’t allowed in certain places. For example, the following
+is not a valid program:
+
+ throw // SyntaxError
+ new Error();
+
+For another example, this function contains two statements, not one:
+
+ function f(g) {
+ return // ASI
+ g();
+ }
+
+The indentation is misleading; actually ASI inserts a semicolon at the
+end of the first line: `return; g();`. (This function always returns
+undefined. The second statement is never reached.)
+
+These restrictions apply even to multiline comments, so the function
+
+```js
+function f(g) {
+ return /*
+ */ g();
+}
+```
+contains two statements, just as the previous example did.
+
+I’m not sure why these rules exist, but it’s probably because (back in
+the Netscape days) users complained about the bizarre behavior of
+automatic semicolon insertion, and so some special do-what-I-mean hacks
+were put in.
+
+This is specified with a weird special thing in the grammar:
+
+> *ReturnStatement* : `return` [no *LineTerminator* here] *Expression* `;`
+
+This is called a *restricted production*, and it’s unfortunately
+necessary to go through them one by one, because there are several
+kinds. Note that the particular hack required to parse them in a
+recursive descent parser is a little bit different each time.
+
+* After `continue`, `break`, or `return`, a line break triggers ASI.
+
+ The relevant productions are all statements, and in each case
+ there’s an alternative production that ends immediately with a
+ semicolon: `continue ;` `break ;` and `return ;`.
+
+ Note that the alternative production is *not* restricted: e.g. a
+ *LineTerminator* can appear between `return` and `;`:
+
+ ```js
+ if (x)
+ return // ok
+ ;
+ else
+ f();
+ ```
+
+* After `throw`, a line break is a SyntaxError.
+
+* After `yield`, a line break terminates the *YieldExpression*.
+
+ Here the alternative production is simply `yield`, not `yield ;`.
+
+* In a post-increment or post-decrement expression, there can’t be a
+ line break before `++` or `--`.
+
+ The purpose of this rule is subtle. It triggers ASI and thus prevents
+ syntax errors:
+
+ ```js
+ var x = y // ok: semicolon inserted here
+ ++z;
+ ```
+
+ Without the restricted production, `var x = y ++` would parse
+ successfully, and the “offending token” would be `z`. It would be
+ too late for ASI.
+
+ However, the restriction can of course also *cause* a SyntaxError:
+
+ ```js
+ var x = (y
+ ++); // SyntaxError
+ ```
+
+As we said, recursive descent parsers can implement these rules with hax.
+
+In a generated parser, there are a few possible ways to implement
+them. Here are three. If you are not interested in ridiculous
+approaches, you can skip the first two.
+
+* Treat every token as actually a different token when it appears
+ after a line break: `TokenType::LeftParen` and
+ `TokenType::LeftParenAfterLineBreak`. Of course the parser
+ generator can treat these exactly the same in normal cases, and
+ automatically generate identical table entries (or whatever) except
+ in states where there’s a relevant restricted production.
+
+* Add a special LineTerminator token. Normally, the lexer skips
+ newlines and never emits this token. However, if the current state
+ has a relevant restricted production, the lexer knows this and emits
+ a LineTerminator for the first line break it sees; and the parser
+ uses that token to trigger an error or transition to another state,
+ as appropriate.
+
+* When in a state that has a relevant restricted production, change
+ states if there’s a line break before the next token. That is,
+ split each such state into two: the one we stay in when there’s not
+ a line break, and the one we jump to if there is a line break.
+
+In all cases it’ll be hard to have confidence that the resulting parser
+generator is really sound. (That is, it might not properly reject all
+ambiguous grammars.) I don’t know exactly what property of the few
+special uses in the ES grammar makes them seem benign.
+
+
+### Slashes (**)
+
+*(entangled with: ASI, restricted productions)*
+
+When you see `/` in a JS program, you don’t know if that’s a
+division operator or the start of a regular expression unless you’ve
+been paying attention up to that point.
+
+[The spec:](https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar)
+
+> There are several situations where the identification of lexical input
+> elements is sensitive to the syntactic grammar context that is
+> consuming the input elements. This requires multiple goal symbols for
+> the lexical grammar.
+
+You might think the lexer could treat `/` as an operator only if the
+previous token is one that can be the last token of an expression (a set
+that includes literals, identifiers, `this`, `)`, `]`, and `}`). To see
+that this does not work, consider:
+
+```js
+{} /x/ // `/` after `}` is regexp
+({} / 2) // `/` after `}` is division
+
+for (g of /(a)(b)/) {} // `/` after `of` is regexp
+var of = 6; of / 2 // `/` after `of` is division
+
+throw /x/; // `/` after `throw` is regexp
+Math.throw / 2; // `/` after `throw` is division
+
+++/x/.lastIndex; // `/` after `++` is regexp
+n++ / 2; // `/` after `++` is division
+```
+
+So how can the spec be implemented?
+
+In a recursive descent parser, you have to tell the lexer which goal
+symbol to use every time you ask for a token. And you have to make sure,
+if you look ahead at a token, but *don’t* consume it, and fall back on
+another path that can accept a *RegularExpressionLiteral* or
+*DivPunctuator*, that you did not initially lex it incorrectly. We have
+assertions for this and it is a bit of a nightmare when we have to touch
+it (which is thankfully rare). Part of the problem is that the token
+you’re peeking ahead at might not be part of the same production at all.
+Thanks to ASI, it might be the start of the next statement, which will
+be parsed in a faraway part of the Parser.
+
+A table-driven parser has it easy here! The lexer can consult the state
+table and see which kind of token can be accepted in the current
+state. This is closer to what the spec actually says.
+
+Two minor things to watch out for:
+
+* The nonterminal *ClassTail* is used both at the end of
+ *ClassExpression*, which may be followed by `/`; and at the end of
+ *ClassDeclaration*, which may be followed by a
+ *RegularExpressionLiteral* at the start of the next
+ statement. Canonical LR creates separate states for these two uses
+ of *ClassTail*, but the LALR algorithm will unify them, creating
+ some states that have both `/` and *RegularExpressionLiteral* in the
+ follow set. In these states, determining which terminal is actually
+ allowed requires looking not only at the current state, but at the
+ current stack of states (to see one level of grammatical context).
+
+* Since this decision depends on the parser state, and automatic
+ semicolon insertion adjusts the parser state, a parser may need to
+ re-scan a token after ASI.
+
+In other kinds of generated parsers, at least the lexical goal symbol
+can be determined automatically.
+
+
+### Lazy compilation and scoping (**)
+
+*(entangled with: arrow functions)*
+
+JS engines *lazily compile* function bodies. During parsing, when the
+engine sees a `function`, it switches to a high-speed parsing mode
+(which I will call “Beast Mode”) that just skims the function and checks
+for syntax errors. Beast Mode does not compile the code. Beast Mode
+doesn’t even create AST nodes. All that will be done later, on demand,
+the first time the function is called.
+
+The point is to get through parsing *fast*, so that the script can start
+running. In browsers, `<script>` compilation usually must finish before
+we can show the user any meaningful content. Any part of it that can be
+deferred must be deferred. (Bonus: many functions are never called, so
+the work is not just deferred but avoided altogether.)
+
+Later, when a function is called, we can still defer compilation of
+nested functions inside it.
+
+So what? Seems easy enough, right? Well...
+
+Local variables in JS are optimized. To generate reasonable code for the
+script or function we are compiling, we need to know which of its local
+variables are used by nested functions, which we are *not* currently
+compiling. That is, we must know which variables *occur free in* each
+nested function. So scoping, which otherwise could be done as a separate
+phase of compilation, must be done during parsing in Beast Mode.
+
+Getting the scoping of arrow functions right while parsing is tricky,
+because it’s often not possible to know when you’re entering a scope
+until later. Consider parsing `(a`. This could be the beginning of an
+arrow function, or not; we might not know until after we reach the
+matching `)`, which could be a long way away.
+
+Annex B.3.3 adds extremely complex rules for scoping for function
+declarations which makes this especially difficult. In
+
+```js
+function f() {
+ let x = 1;
+ return function g() {
+ {
+ function x(){}
+ }
+ {
+ let x;
+ }
+ return x;
+ };
+}
+```
+
+the function `g` does not use the initial `let x`, but in
+
+```js
+function f() {
+ let x = 1;
+ return function g() {
+ {
+ {
+ function x(){}
+ }
+ let x;
+ }
+ return x;
+ };
+}
+```
+it does.
+
+[Here](https://dev.to/rkirsling/tales-from-ecma-s-crypt-annex-b-3-3-56go)
+is a good writeup of what's going on in these examples.
+
+
+### Arrow functions, assignment, destructuring, and cover grammars (\*\*\*)
+
+*(entangled with: lazy compilation)*
+
+Suppose the parser is scanning text from start to end, when it sees a
+statement that starts with something like this:
+
+```js
+let f = (a
+```
+
+The parser doesn’t know yet if `(a ...` is *ArrowParameters* or just a
+*ParenthesizedExpression*. Either is possible. We’ll know when either
+(1) we see something that rules out the other case:
+
+```js
+let f = (a + b // can't be ArrowParameters
+
+let f = (a, b, ...args // can't be ParenthesizedExpression
+```
+
+or (2) we reach the matching `)` and see if the next token is `=>`.
+
+Probably (1) is a pain to implement.
+
+To keep the language nominally LR(1), the standard specifies a “cover
+grammar”, a nonterminal named
+*CoverParenthesizedExpressionAndArrowParameterList* that is a superset
+of both *ArrowParameters* and *ParenthesizedExpression*. The spec
+grammar uses the *Cover* nonterminal in both places, so technically `let
+f = (a + b) => 7;` and `let f = (a, b, ...args);` are both syntactically
+valid *Script*s, according to the formal grammar. But there are Early
+Error rules (that is, plain English text, not part of the formal
+grammar) that say arrow functions’ parameters must match
+*ArrowParameters*, and parenthesized expressions must match
+*ParenthesizedExpression*.
+
+So after the initial parse, the implementation must somehow check that
+the *CoverParenthesizedExpressionAndArrowParameterList* really is valid
+in context. This complicates lazy compilation because in Beast Mode we
+are not even building an AST. It’s not easy to go back and check.
+
+Something similar happens in a few other cases: the spec is written as
+though syntactic rules are applied after the fact:
+
+* *CoverCallExpressionAndAsyncArrowHead* covers the syntax `async(x)`
+ which looks like a function call and also looks like the beginning
+ of an async arrow function.
+
+* *ObjectLiteral* is a cover grammar; it covers both actual object
+ literals and the syntax `propertyName = expr`, which is not valid in
+ object literals but allowed in destructuring assignment:
+
+ ```js
+ var obj = {x = 1}; // SyntaxError, by an early error rule
+
+ ({x = 1} = {}); // ok, assigns 1 to x
+ ```
+
+* *LeftHandSideExpression* is used to the left of `=` in assignment,
+ and as the operand of postfix `++` and `--`. But this is way too lax;
+ most expressions shouldn’t be assigned to:
+
+ ```js
+ 1 = 0; // matches the formal grammar, SyntaxError by an early error rule
+
+ null++; // same
+
+ ["a", "b", "c"]++; // same
+ ```
+
+
+## Conclusion
+
+What have we learned today?
+
+* Do not write a JS parser.
+
+* JavaScript has some syntactic horrors in it. But hey, you don't
+ make the world's most widely used programming language by avoiding
+ all mistakes. You do it by shipping a serviceable tool, in the right
+ circumstances, for the right users.