authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-07 19:33:14 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-07 19:33:14 +0000
commit36d22d82aa202bb199967e9512281e9a53db42c9 (patch)
tree105e8c98ddea1c1e4784a60a5a6410fa416be2de /third_party/rust/jsparagus/js-quirks.md
parentInitial commit. (diff)
downloadfirefox-esr-36d22d82aa202bb199967e9512281e9a53db42c9.tar.xz
firefox-esr-36d22d82aa202bb199967e9512281e9a53db42c9.zip
Adding upstream version 115.7.0esr.upstream/115.7.0esr
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'third_party/rust/jsparagus/js-quirks.md')
-rw-r--r-- third_party/rust/jsparagus/js-quirks.md | 1036
1 file changed, 1036 insertions(+), 0 deletions(-)
diff --git a/third_party/rust/jsparagus/js-quirks.md b/third_party/rust/jsparagus/js-quirks.md
new file mode 100644
index 0000000000..20c621c92a
--- /dev/null
+++ b/third_party/rust/jsparagus/js-quirks.md
@@ -0,0 +1,1036 @@
+## JS syntactic quirks
+
+> *To make a labyrinth, it takes*
+> *Some good intentions, some mistakes.*
+> —A. E. Stallings, “Daedal”
+
+JavaScript is rather hard to parse. Here is an in-depth accounting of
+its syntactic quirks, with an eye toward actually implementing a parser
+from scratch.
+
+With apologies to the generous people who work on the standard. Thanks
+for doing that—better you than me.
+
+Thanks to [@bakkot](https://github.com/bakkot) and
+[@mathiasbynens](https://github.com/mathiasbynens) for pointing out
+several additional quirks.
+
+Problems are rated in terms of difficulty, from `(*)` = easy to `(***)`
+= hard. We’ll start with the easiest problems.
+
+
+### Dangling else (*)
+
+If you know what this is, you may be excused.
+
+Statements like `if (EXPR) STMT if (EXPR) STMT else STMT`
+are straight-up ambiguous in the JS formal grammar.
+The ambiguity is resolved with
+[a line of specification text](https://tc39.es/ecma262/#sec-if-statement):
+
+> Each `else` for which the choice of associated `if` is ambiguous shall
+> be associated with the nearest possible `if` that would otherwise have
+> no corresponding `else`.
+
+I love this sentence. Something about it cracks me up, I dunno...
+
+In a recursive descent parser, just doing the dumbest possible thing
+correctly implements this rule.
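For instance, with nested `if`s and one `else`, the rule says the `else` attaches to the inner `if`:

```javascript
// The dumbest possible thing: `else` pairs with the nearest `if`
// that doesn't already have an `else`.
function classify(x, y) {
  if (x)
    if (y) return "both";
    else return "x only";   // this `else` belongs to `if (y)`, not `if (x)`
  return "other";
}
// classify(1, 1) === "both"; classify(1, 0) === "x only"; classify(0, 1) === "other"
```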
+
+A parser generator has to decide what to do. In Yacc, you can use
+operator precedence for this.
+
+Yacc aside: This should seem a little outrageous at first, as `else` is
+hardly an operator. It helps if you understand what Yacc is doing. In LR
+parsers, this kind of ambiguity in the grammar manifests as a
+shift-reduce conflict. In this case, when we’ve already parsed `if (
+Expression ) Statement if ( Expression ) Statement`
+and are looking at `else`, it’s unclear to Yacc
+whether to reduce the if-statement or shift `else`. Yacc does not offer
+a feature that lets us just say "always shift `else` here"; but there
+*is* a Yacc feature that lets us resolve shift-reduce conflicts in a
+rather odd, indirect way: operator precedence. We can resolve this
+conflict by making `else` higher-precedence than the preceding symbol
+`)`.
+
+Alternatively, I believe it’s equivalent to add "[lookahead ≠ `else`]"
+at the end of the IfStatement production that doesn’t have an `else`.
+
+
+### Other ambiguities and informal parts of the spec (*)
+
+Not all of the spec is as formal as it seems at first. Most of the stuff
+in this section is easy to deal with, but #4 is special.
+
+1. The lexical grammar is ambiguous: when looking at the characters `<<=`,
+ there is the question of whether to parse that as one token `<<=`, two
+ tokens (`< <=` or `<< =`), or three (`< < =`).
+
+ Of course every programming language has this, and the fix is one
+ sentence of prose in the spec:
+
+ > The source text is scanned from left to right, repeatedly taking the
+ > longest possible sequence of code points as the next input element.
+
+ This is easy enough for hand-coded lexers, and for systems that are
+ designed to use separate lexical and syntactic grammars. (Other
+ parser generators may need help to avoid parsing `functionf(){}` as
+ a function.)
+
+2. The above line of prose does not apply *within* input elements, in
+ components of the lexical grammar. In those cases, the same basic
+ idea ("maximum munch") is specified using lookahead restrictions at
+ the end of productions:
+
+ > *LineTerminatorSequence* ::
+ > &nbsp; &nbsp; &lt;LF&gt;
+ > &nbsp; &nbsp; &lt;CR&gt;[lookahead &ne; &lt;LF&gt;]
+ > &nbsp; &nbsp; &lt;LS&gt;
+ > &nbsp; &nbsp; &lt;PS&gt;
+ > &nbsp; &nbsp; &lt;CR&gt;&lt;LF&gt;
+
+ The lookahead restriction prevents a CR LF sequence from being
+ parsed as two adjacent *LineTerminatorSequence*s.
+
+ This technique is used in several places, particularly in
+ [*NotEscapeSequences*](https://tc39.es/ecma262/#prod-NotEscapeSequence).
+
+3. Annex B.1.4 extends the syntax for regular expressions, making the
+ grammar ambiguous. Again, a line of prose explains how to cope:
+
+ > These changes introduce ambiguities that are broken by the
+ > ordering of grammar productions and by contextual
+ > information. When parsing using the following grammar, each
+ > alternative is considered only if previous production alternatives
+ > do not match.
+
+4. Annex B.1.2 extends the syntax of string literals to allow legacy
+ octal escape sequences, like `\033`. It says:
+
+ > The syntax and semantics of 11.8.4 is extended as follows except
+ > that this extension is not allowed for strict mode code:
+
+ ...followed by a new definition of *EscapeSequence*.
+
+ So there are two sets of productions for *EscapeSequence*, and an
+ implementation is required to implement both and dynamically switch
+ between them.
+
+ This means that `function f() { "\033"; "use strict"; }` is a
+ SyntaxError, even though the octal escape is scanned before we know
+ we're in strict mode.
+
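The maximal-munch rule from item 1 shows up in classic puzzlers like `x+++y`, which lexes as `x ++ + y` (the longest first token wins), not `x + ++y`:

```javascript
let x = 3, y = 4;
// Maximal munch: the lexer takes `++` as one token, so this is `(x++) + y`,
// not `x + (++y)`.
let r = x+++y;
// r === 7, x === 4, y === 4
```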
+For another ambiguity, see "Slashes" below.
+
+
+### Unicode quirks (*)
+
+JavaScript source is Unicode and usually follows Unicode rules for things
+like identifiers and whitespace, but it has a few special cases: `$`,
+`_`, `U+200C ZERO WIDTH NON-JOINER`, and `U+200D ZERO WIDTH JOINER` are
+legal in identifiers (the latter two only after the first character), and
+`U+FEFF ZERO WIDTH NO-BREAK SPACE` (also known as the byte-order mark) is
+treated as whitespace.
+
+It also allows any code point, including surrogate halves, even though the
+Unicode standard says that unpaired surrogate halves should be treated as
+encoding errors.
+
+
+### Legacy octal literals and escape sequences (*)
+
+This is more funny than difficult.
+
+In a browser, in non-strict code, every sequence of decimal digits (not
+followed by an identifier character) is a *NumericLiteral* token.
+
+If it starts with `0`, with more digits after, then it's a legacy Annex
+B.1.1 literal. If the token contains an `8` or a `9`, it's a decimal
+number. Otherwise, hilariously, it's octal.
+
+```
+js> [067, 068, 069, 070]
+[55, 68, 69, 56]
+```
+
+There are also legacy octal escape sequences in strings, and these have
+their own quirks. `'\07' === '\u{7}'`, but `'\08' !== '\u{8}'` since 8
+is not an octal digit. Instead `'\08' === '\0' + '8'`, because `\0`
+followed by `8` or `9` is a legacy octal escape sequence representing
+the null character. (Not to be confused with `\0` in strict code, not
+followed by a digit, which still represents the null character, but
+doesn't count as octal.)
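The equivalences above can be verified directly; indirect `eval` is used so the escapes are parsed as non-strict code even if the surrounding script is strict:

```javascript
const sloppyEval = eval;  // indirect eval: code is parsed as non-strict
const seven = sloppyEval("'\\07'");      // legacy octal escape for U+0007
const zeroEight = sloppyEval("'\\08'");  // `\0` (NUL) followed by a plain `8`
// seven === "\u{7}", zeroEight === "\u{0}" + "8", zeroEight.length === 2
```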
+
+None of this is hard to implement, but figuring out what the spec says
+is hard.
+
+
+### Strict mode (*)
+
+*(entangled with: lazy compilation)*
+
+A script or function can start with this:
+
+```js
+"use strict";
+```
+
+This enables ["strict mode"](https://tc39.es/ecma262/#sec-strict-mode-of-ecmascript).
+Additionally, all classes and modules are strict mode code.
+
+Strict mode has both parse-time and run-time effects. Parse-time effects
+include:
+
+* Strict mode affects the lexical grammar: octal integer literals are
+ SyntaxErrors, octal character escapes are SyntaxErrors, and a
+ handful of words like `private` and `interface` are reserved (and
+ thus usually SyntaxErrors) in strict mode.
+
+ Like the situation with slashes, this means it is not possible to
+ implement a complete lexer for JS without also parsing—at least
+ enough to detect class boundaries, "use strict" directives in
+ functions, and function boundaries.
+
+* It’s a SyntaxError to have bindings named `eval` or `arguments` in
+ strict mode code, or to assign to `eval` or `arguments`.
+
+* It’s a SyntaxError to have two argument bindings with the same name
+ in a strict function.
+
+ Interestingly, you don’t always know if you’re in strict mode or not
+ when parsing arguments.
+
+ ```js
+ function foo(a, a) {
+ "use strict";
+ }
+ ```
+
+ When the implementation reaches the Use Strict Directive, it must
+ either know that `foo` has two arguments both named `a`, or switch
+ to strict mode, go back, and reparse the function from the
+ beginning.
+
+ Fortunately an Early Error rule prohibits mixing `"use strict"` with
+ more complex parameter lists, like `function foo(x = eval('')) {`.
+
+* The expression syntax “`delete` *Identifier*” and the abominable
+ *WithStatement* are banned in strict mode.
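The duplicate-parameter rule can be observed with indirect `eval` (indirect so that the code is parsed as non-strict regardless of the caller):

```javascript
const sloppyEval = eval;
// Without the directive, duplicate parameter names are fine in sloppy code:
sloppyEval("function bar(a, a) {}");
// With it, the same parameter list is an early SyntaxError:
let error = null;
try {
  sloppyEval('function foo(a, a) { "use strict"; }');
} catch (e) {
  error = e;
}
// error instanceof SyntaxError
```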
+
+
+### Conditional keywords (**)
+
+In some programming languages, you could write a lexer that has rules
+like
+
+* When you see `if`, return `Token::If`.
+
+* When you see something like `apple` or `arrow` or `target`,
+ return `Token::Identifier`.
+
+Not so in JavaScript. The input `if` matches both the terminal `if` and
+the nonterminal *IdentifierName*, both of which appear in the high-level
+grammar. The same goes for `target`.
+
+This poses a deceptively difficult problem for table-driven parsers.
+Such parsers run on a stream of token-ids, but the question of which
+token-id to use for a word like `if` or `target` is ambiguous. The
+current parser state can't fully resolve the ambiguity: there are cases
+like `class C { get` where the token `get` might match either as a
+keyword (the start of a getter) or as an *IdentifierName* (a method or
+property named `get`) in different grammatical productions.
+
+All keywords are conditional, but some are more conditional than others.
+The rules are inconsistent to a tragicomic extent. Keywords like `if`
+that date back to JavaScript 1.0 are always keywords except when used as
+property names or method names. They can't be variable names. Two
+conditional keywords (`await` and `yield`) are in the *Keyword* list;
+the rest are not. New syntax that happened to be introduced around the
+same time as strict mode was awarded keyword status in strict mode. The
+rules are scattered through the spec. All this interacts with `\u0065`
+Unicode escape sequences somehow. It’s just unbelievably confusing.
+
+(After writing this section, I
+[proposed revisions to the specification](https://github.com/tc39/ecma262/pull/1694)
+to make it a little less confusing.)
+
+* Thirty-six words are always reserved:
+
+ > `break` `case` `catch` `class` `const` `continue` `debugger`
+ > `default` `delete` `do` `else` `enum` `export` `extends` `false`
+ > `finally` `for` `function` `if` `import` `in` `instanceof` `new`
+ > `null` `return` `super` `switch` `this` `throw` `true` `try`
+ > `typeof` `var` `void` `while` `with`
+
+ These tokens can't be used as names of variables or arguments.
+ They're always considered special *except* when used as property
+ names, method names, or import/export names in modules.
+
+ ```js
+ // property names
+ let obj = {if: 3, function: 4};
+ assert(obj.if == 3);
+
+ // method names
+ class C {
+ if() {}
+ function() {}
+ }
+
+ // imports and exports
+ import {if as my_if} from "modulename";
+ export {my_if as if};
+ ```
+
+* Two more words, `yield` and `await`, are in the *Keyword* list but
+ do not always act like keywords in practice.
+
+ * `yield` is a *Keyword*; but it can be used as an identifier,
+ except in generators and strict mode code.
+
+ This means that `yield - 1` is valid both inside and outside
+ generators, with different meanings. Outside a generator, it’s
+ subtraction. Inside, it yields the value **-1**.
+
+ That reminds me of the Groucho Marx line: Outside of a dog, a
+ book is a man’s best friend. Inside of a dog it’s too dark to
+ read.
+
+ * `await` is like that, but in async functions. Also it’s not a
+ valid identifier in modules.
+
+ Conditional keywords are entangled with slashes: `yield /a/g` is two
+ tokens in a generator but five tokens elsewhere.
+
+* In strict mode code, `implements`, `interface`, `package`,
+ `private`, `protected`, and `public` are reserved (via Early Errors
+ rules).
+
+ This is reflected in the message and location information for
+ certain syntax errors:
+
+ ```
+ SyntaxError: implements is a reserved identifier:
+ class implements {}
+ ......^
+
+ SyntaxError: implements is a reserved identifier:
+ function implements() { "use strict"; }
+ ....................................^
+ ```
+
+* `let` is not a *Keyword* or *ReservedWord*. Usually it can be an
+ identifier. It is special at the beginning of a statement or after
+ `for (` or `for await (`.
+
+ ```js
+ var let = [new Date]; // ok: let as identifier
+ let v = let; // ok: let as keyword, then identifier
+ let let; // SyntaxError: banned by special early error rule
+ let.length; // ok: `let .` -> ExpressionStatement
+ let[0].getYear(); // SyntaxError: `let [` -> LexicalDeclaration
+ ```
+
+ In strict mode code, `let` is reserved.
+
+* `static` is similar. It’s a valid identifier, except in strict
+ mode. It’s only special at the beginning of a *ClassElement*.
+
+ In strict mode code, `static` is reserved.
+
+* `async` is similar, but trickier. It’s an identifier. It is special
+ only if it’s marking an `async` function, method, or arrow function
+ (the tough case, since you won’t know it’s an arrow function until
+ you see the `=>`, possibly much later).
+
+ ```js
+ function async() {} // normal function named "async"
+
+ async(); // ok, `async` is an Identifier; function call
+ async() => {}; // ok, `async` is not an Identifier; async arrow function
+ ```
+
+* `of` is special only in one specific place in `for-of` loop syntax.
+
+ ```js
+ var of = [1, 2, 3];
+ for (of of of) console.log(of); // logs 1, 2, 3
+ ```
+
+ Amazingly, both of the following are valid JS code:
+
+ ```js
+ for (async of => {};;) {}
+ for (async of []) {}
+ ```
+
+ In the first line, `async` is a keyword and `of` is an identifier;
+ in the second line it's the other way round.
+
+ Even a simplified JS grammar can't be LR(1) as long as it includes
+ the features used here!
+
+* `get` and `set` are special only in a class or an object literal,
+ and then only if followed by a PropertyName:
+
+ ```js
+ var obj1 = {get: f}; // `get` is an identifier
+ var obj2 = {get x() {}}; // `get` means getter
+
+ class C1 { get = 3; } // `get` is an identifier
+ class C2 { get x() {} } // `get` means getter
+ ```
+
+* `target` is special only in `new.target`.
+
+* `arguments` and `eval` can't be binding names, and can't be assigned
+ to, in strict mode code.
+
+To complicate matters, there are a few grammatical contexts where both
+*IdentifierName* and *Identifier* match. For example, after `var {`
+there are two possibilities:
+
+```js
+// Longhand properties: BindingProperty -> PropertyName -> IdentifierName
+var { xy: v } = obj; // ok
+var { if: v } = obj; // ok, `if` is an IdentifierName
+
+// Shorthand properties: BindingProperty -> SingleNameBinding -> BindingIdentifier -> Identifier
+var { xy } = obj; // ok
+var { if } = obj; // SyntaxError: `if` is not an Identifier
+```
+
+
+### Escape sequences in keywords
+
+*(entangled with: conditional keywords, ASI)*
+
+You can use escape sequences to write variable and property names, but
+not keywords (including contextual keywords in contexts where they act
+as keywords).
+
+So `if (foo) {}` and `{ i\u0066: 0 }` are legal but `i\u0066 (foo)` is not.
+
+And you don't necessarily know if you're lexing a contextual keyword
+until the next token: `({ g\u0065t: 0 })` is legal, but
+`({ g\u0065t x(){} })` is not.
+
+And for `let` it's even worse: `l\u0065t` by itself is a legal way to
+reference a variable named `let`, which means that
+
+```js
+let
+x
+```
+declares a variable named `x`, while, thanks to ASI,
+
+```js
+l\u0065t
+x
+```
+is a reference to a variable named `let` followed by a reference to a
+variable named `x`.
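The `g\u0065t` cases behave like this (the doubled backslashes pass the escape through to the eval'd source):

```javascript
// As a plain property name, the escaped spelling is fine:
const obj = eval("({ g\\u0065t: 0 })");
// Where `get` would act as a keyword, it is a SyntaxError:
let error = null;
try {
  eval("({ g\\u0065t x(){} })");
} catch (e) {
  error = e;
}
// obj.get === 0, error instanceof SyntaxError
```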
+
+
+### Early errors (**)
+
+*(entangled with: lazy parsing, conditional keywords, ASI)*
+
+Some early errors are basically syntactic. Others are not.
+
+This is entangled with lazy compilation: "early errors" often involve a
+retrospective look at an arbitrarily large glob of code we just parsed,
+but in Beast Mode we’re not building an AST. In fact we would like to be
+doing as little bookkeeping as possible.
+
+Even setting that aside, every early error is a special case, and it’s
+just a ton of rules that all have to be implemented by hand.
+
+Here are some examples of Early Error rules—setting aside restrictions
+that are covered adequately elsewhere:
+
+* Rules about names:
+
+ * Rules that affect the set of keywords (character sequences that
+ match *IdentifierName* but are not allowed as binding names) based
+ on whether or not we’re in strict mode code, or in a
+ *Module*. Affected identifiers include `arguments`, `eval`, `yield`,
+ `await`, `let`, `implements`, `interface`, `package`, `private`,
+ `protected`, `public`, `static`.
+
+ * One of these is a strangely worded rule which prohibits using
+ `yield` as a *BindingIdentifier*. At first blush, this seems
+ like it could be enforced in the grammar, but that approach
+ would make this a valid program, due to ASI:
+
+ ```js
+ let
+ yield 0;
+ ```
+
+ Enforcing the same rule using an Early Error prohibits ASI here.
+ It works by exploiting the detailed inner workings of ASI case
+ 1, and arranging for `0` to be "the offending token" rather than
+ `yield`.
+
+ * Lexical variable names have to be unique within a scope:
+
+ * Lexical variables (`let` and `const`) can’t be declared more
+ than once in a block, or both lexically declared and
+ declared with `var`.
+
+ * Lexically declared variables in a function body can’t have the same
+ name as argument bindings.
+
+ * A lexical variable can’t be named `let`.
+
+ * Common-sense rules dealing with unicode escape sequences in
+ identifiers.
+
+* Common-sense rules about regular expression literals. (They have to
+ actually be valid regular expressions, and unrecognized flags are
+ errors.)
+
+* The number of string parts that a template string can have is
+ limited to 2<sup>32</sup> &minus; 1.
+
+* Invalid escape sequences, like `\7` or `\09` or `\u{3bjq`, are
+  banned in non-tagged templates (in tagged templates, they are allowed).
+
+* The *SuperCall* syntax is allowed only in derived class
+ constructors.
+
+* `const x;` without an initializer is a Syntax Error.
+
+* A direct substatement of an `if` statement, loop statement, or
+ `with` statement can’t be a labelled `function`.
+
+* Early errors are used to hook up cover grammars.
+
+ * Early errors are also used in one case to avoid having to
+ specify a very large refinement grammar when *ObjectLiteral*
+ almost covers *ObjectAssignmentPattern*:
+ [sorry, too complicated to explain](https://tc39.es/ecma262/#sec-object-initializer-static-semantics-early-errors).
+
+* Early errors are sometimes used to prevent parsers from needing to
+ backtrack too much.
+
+ * When parsing `async ( x = await/a/g )`, you don't know until the
+ next token if this is an async arrow or a call to a function named
+ `async`. This means you can't even tokenize properly, because in
+ the former case the thing following `x =` is two divisions and in
+ the latter case it's an *AwaitExpression* of a regular expression.
+ So an Early Error forbids having `await` in parameters at all,
+ allowing parsers to immediately throw an error if they find
+ themselves in this case.
+
+Many strict mode rules are enforced using Early Errors, but others
+affect runtime semantics.
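A few of the rules above, observed via indirect `eval` (non-strict):

```javascript
const sloppyEval = eval;
function isSyntaxError(src) {
  try { sloppyEval(src); } catch (e) { return e instanceof SyntaxError; }
  return false;
}
const missingInitializer = isSyntaxError("const x;");        // needs an initializer
const duplicateLexical = isSyntaxError("{ let a; let a; }"); // redeclaration in a block
const letNamedLet = isSyntaxError("let let;");               // `let` can't be a lexical name
// all three are true
```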
+
+<!--
+* I think the rules about assignment targets are related to cover
+ grammars. Not sure.
+
+ * Expressions used in assignment context (i.e. as the operands of `++`
+ and `--`, the left operand of assignment, the loop variable of a
+ `for-in/for-of` loop, or sub-targets in destructuring) must be valid
+ assignment targets.
+
+ * In destructuring assignment, `[...EXPR] = x` is an error if `EXPR`
+ is an array or object destructuring target.
+
+-->
+
+
+### Boolean parameters (**)
+
+Some nonterminals are parameterized. (Search for “parameterized
+production” in [this spec
+section](https://tc39.es/ecma262/#sec-grammar-notation).)
+
+Implemented naively (e.g. by macro expansion) in a parser generator,
+each parameter could nearly double the size of the parser. Instead, the
+parameters must be tracked at run time somehow.
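A hedged sketch of the run-time approach (hypothetical parser internals, not jsparagus code): instead of expanding each production for every combination of [Yield, Await], a recursive descent parser can thread the flags through as arguments.

```javascript
// Hypothetical helper: decide whether a word may be used as a binding name
// given the current [Yield, Await] parameters, rather than baking four
// copies of the production into the grammar.
// (Strict mode and modules add further restrictions, omitted here.)
function isBindingNameAllowed(name, { yieldFlag = false, awaitFlag = false } = {}) {
  if (name === "yield") return !yieldFlag;  // reserved only inside generators
  if (name === "await") return !awaitFlag;  // reserved only inside async functions
  return true;
}
```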
+
+
+### Lookahead restrictions (**)
+
+*(entangled with: restricted productions)*
+
+TODO (I implemented this by hacking the entire LR algorithm. Most every
+part of it is touched, although in ways that seem almost obvious once
+you understand LR inside and out.)
+
+(Note: It may seem like all of the lookahead restrictions in the spec
+are really just a way of saying “this production takes precedence over
+that one”—for example, that the lookahead restriction on
+*ExpressionStatement* just means that other productions for statements
+and declarations take precedence over it. But that isn't accurate; you
+can't have an *ExpressionStatement* that starts with `{`, even if it
+doesn't parse as a *Block* or any other kind of statement.)
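That *ExpressionStatement* restriction is observable: `{a: 1}` at the start of a statement is a block containing a labelled statement, never an object literal.

```javascript
// As a statement, `{a: 1}` is a block with label `a`; its completion value is 1.
const asStatement = eval("{a: 1}");
// Parenthesized, the same characters are an object literal.
const asExpression = eval("({a: 1})");
// asStatement === 1, asExpression.a === 1
```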
+
+
+### Automatic Semicolon Insertion (**)
+
+*(entangled with: restricted productions, slashes)*
+
+Most semicolons at the end of JS statements and declarations “may be
+omitted from the source text in certain situations”. This is called
+[Automatic Semicolon
+Insertion](https://tc39.es/ecma262/#sec-automatic-semicolon-insertion),
+or ASI for short.
+
+The specification for this feature is both very high-level and weirdly
+procedural (“When, as the source text is parsed from left to right, a
+token is encountered...”, as if the specification is telling a story
+about a browser. As far as I know, this is the only place in the spec
+where anything is assumed or implied about the internal implementation
+details of parsing.) But it would be hard to specify ASI any other way.
+
+Wrinkles:
+
+1. Whitespace is significant (including whitespace inside comments).
+ Most semicolons in the grammar are optional only at the end of a
+ line (or before `}`, or at the end of the program).
+
+2. The ending semicolon of a `do`-`while` statement is extra optional.
+ You can always omit it.
+
+3. A few semicolons are never optional, like the semicolons in `for (;;)`.
+
+ This means there’s a semicolon in the grammar that is optionally
+ optional! This one:
+
+ > *LexicalDeclaration* : *LetOrConst* *BindingList* `;`
+
+ It’s usually optional, but not if this is the *LexicalDeclaration*
+ in `for (let i = 0; i < 9; i++)`!
+
+4. Semicolons are not inserted only as a last resort to avoid
+ SyntaxErrors. That turned out to be too error-prone, so there are
+ also *restricted productions* (see below), where semicolons are more
+ aggressively inferred.
+
+5. In implementations, ASI interacts with the ambiguity of *slashes*
+ (see below).
+
+A recursive descent parser implements ASI by calling a special method
+every time it needs to parse a semicolon that might be optional. The
+special method has to peek at the next token and consume it only if it’s
+a semicolon. This would not be so bad if it weren’t for slashes.
+
+In a parser generator, ASI can be implemented using an error recovery
+mechanism.
+
+I think the [error recovery mechanism in
+yacc/Bison](https://www.gnu.org/software/bison/manual/bison.html#Error-Recovery)
+is too imprecise—when an error happens, it discards states from the
+stack searching for a matching error-handling rule. The manual says
+“Error recovery strategies are necessarily guesses.”
+
+But here’s a slightly more careful error recovery mechanism that could
+do the job:
+
+1. For each production in the ES spec grammar where ASI could happen, e.g.
+
+ ```
+ ImportDeclaration ::= `import` ModuleSpecifier `;`
+ { import_declaration($2); }
+ ```
+
+ add an ASI production, like this:
+
+ ```
+ ImportDeclaration ::= `import` ModuleSpecifier [ERROR]
+ { check_asi(); import_declaration($2); }
+ ```
+
+ What does this mean? This production can be matched, like any other
+ production, but it's a fallback. All other productions take
+ precedence.
+
+2. While generating the parser, treat `[ERROR]` as a terminal
+ symbol. It can be included in start sets and follow sets, lookahead,
+ and so forth.
+
+3. At run time, when an error happens, synthesize an `[ERROR]` token.
+ Let that bounce through the state machine. It will cause zero or
+ more reductions. Then, it might actually match a production that
+ contains `[ERROR]`, like the ASI production above.
+
+ Otherwise, we’ll get another error—the entry in the parser table for
+ an `[ERROR]` token at this state will be an `error` entry. Then we
+ really have a syntax error.
+
+This solves most of the ASI issues:
+
+* [x] Whitespace sensitivity: That's what `check_asi()` is for. It
+ should signal an error if we're not at the end of a line.
+
+* [x] Special treatment of `do`-`while` loops: Make an error production,
+ but don't `check_asi()`.
+
+* [x] Rule banning ASI in *EmptyStatement* or `for(;;)`:
+ Easy, don't create error productions for those.
+
+ * [x] Banning ASI in `for (let x=1 \n x<9; x++)`: Manually adjust
+ the grammar, copying *LexicalDeclaration* so that there's a
+ *LexicalDeclarationNoASI* production used only by `for`
+ statements. Not a big deal, as it turns out.
+
+* [x] Slashes: Perhaps have `check_asi` reset the lexer to rescan the
+ next token, if it starts with `/`.
+
+* [ ] Restricted productions: Not solved. Read on.
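A minimal sketch of what `check_asi()` might look like (a hypothetical helper; real implementations track token positions and rescan slashes, as noted above):

```javascript
// Semicolon insertion is only permitted when the offending token is the
// first token on its line, is `}`, or is end-of-input.
function checkAsi(nextToken) {
  if (nextToken === null) return true;            // end of input
  if (nextToken.value === "}") return true;
  return nextToken.precededByLineBreak === true;  // hypothetical token field
}
```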
+
+
+### Restricted productions (**)
+
+*(entangled with: ASI, slashes)*
+
+Line breaks aren’t allowed in certain places. For example, the following
+is not a valid program:
+
+ throw // SyntaxError
+ new Error();
+
+For another example, this function contains two statements, not one:
+
+ function f(g) {
+ return // ASI
+ g();
+ }
+
+The indentation is misleading; actually ASI inserts a semicolon at the
+end of the first line: `return; g();`. (This function always returns
+undefined. The second statement is never reached.)
+
+These restrictions apply even to multiline comments, so the function
+
+```js
+function f(g) {
+ return /*
+ */ g();
+}
+```
+contains two statements, just as the previous example did.
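Running such a function confirms the inserted semicolon:

```javascript
function f(g) {
  return      // ASI inserts a semicolon here
  g();        // never reached
}
let called = false;
const result = f(() => { called = true; return 1; });
// result === undefined, called === false
```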
+
+I’m not sure why these rules exist, but it’s probably because (back in
+the Netscape days) users complained about the bizarre behavior of
+automatic semicolon insertion, and so some special do-what-I-mean hacks
+were put in.
+
+This is specified with a weird special thing in the grammar:
+
+> *ReturnStatement* : `return` [no *LineTerminator* here] *Expression* `;`
+
+This is called a *restricted production*, and it’s unfortunately
+necessary to go through them one by one, because there are several
+kinds. Note that the particular hack required to parse them in a
+recursive descent parser is a little bit different each time.
+
+* After `continue`, `break`, or `return`, a line break triggers ASI.
+
+ The relevant productions are all statements, and in each case
+ there’s an alternative production that ends immediately with a
+ semicolon: `continue ;` `break ;` and `return ;`.
+
+ Note that the alternative production is *not* restricted: e.g. a
+ *LineTerminator* can appear between `return` and `;`:
+
+ ```js
+ if (x)
+ return // ok
+ ;
+ else
+ f();
+ ```
+
+* After `throw`, a line break is a SyntaxError.
+
+* After `yield`, a line break terminates the *YieldExpression*.
+
+ Here the alternative production is simply `yield`, not `yield ;`.
+
+* In a post-increment or post-decrement expression, there can’t be a
+ line break before `++` or `--`.
+
+ The purpose of this rule is subtle. It triggers ASI and thus prevents
+ syntax errors:
+
+ ```js
+ var x = y // ok: semicolon inserted here
+ ++z;
+ ```
+
+ Without the restricted production, `var x = y ++` would parse
+ successfully, and the “offending token” would be `z`. It would be
+ too late for ASI.
+
+ However, the restriction can of course also *cause* a SyntaxError:
+
+ ```js
+ var x = (y
+ ++); // SyntaxError
+ ```
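The ASI-triggering half of this rule, executed:

```javascript
let y = 0, z = 0;
let x = y   // no line break allowed before `++`, so ASI ends the statement here
++z;        // parsed as a separate `++z` statement
// x === 0, y === 0, z === 1
```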
+
+As we said, recursive descent parsers can implement these rules with hax.
+
+In a generated parser, there are a few possible ways to implement
+them. Here are three. If you are not interested in ridiculous
+approaches, you can skip the first two.
+
+* Treat every token as actually a different token when it appears
+ after a line break: `TokenType::LeftParen` and
+ `TokenType::LeftParenAfterLineBreak`. Of course the parser
+ generator can treat these exactly the same in normal cases, and
+ automatically generate identical table entries (or whatever) except
+ in states where there’s a relevant restricted production.
+
+* Add a special LineTerminator token. Normally, the lexer skips
+ newlines and never emits this token. However, if the current state
+ has a relevant restricted production, the lexer knows this and emits
+ a LineTerminator for the first line break it sees; and the parser
+ uses that token to trigger an error or transition to another state,
+ as appropriate.
+
+* When in a state that has a relevant restricted production, change
+ states if there’s a line break before the next token. That is,
+ split each such state into two: the one we stay in when there’s not
+ a line break, and the one we jump to if there is a line break.
+
+In all cases it’ll be hard to have confidence that the resulting parser
+generator is really sound. (That is, it might not properly reject all
+ambiguous grammars.) I don’t know exactly what property of the few
+special uses in the ES grammar makes them seem benign.
+
+
+### Slashes (**)
+
+*(entangled with: ASI, restricted productions)*
+
+When you see `/` in a JS program, you don’t know if that’s a
+division operator or the start of a regular expression unless you’ve
+been paying attention up to that point.
+
+[The spec:](https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar)
+
+> There are several situations where the identification of lexical input
+> elements is sensitive to the syntactic grammar context that is
+> consuming the input elements. This requires multiple goal symbols for
+> the lexical grammar.
+
+You might think the lexer could treat `/` as an operator only if the
+previous token is one that can be the last token of an expression (a set
+that includes literals, identifiers, `this`, `)`, `]`, and `}`). To see
+that this does not work, consider:
+
+```js
+{} /x/ // `/` after `}` is regexp
+({} / 2) // `/` after `}` is division
+
+for (g of /(a)(b)/) {} // `/` after `of` is regexp
+var of = 6; of / 2 // `/` after `of` is division
+
+throw /x/; // `/` after `throw` is regexp
+Math.throw / 2; // `/` after `throw` is division
+
+++/x/.lastIndex; // `/` after `++` is regexp
+n++ / 2; // `/` after `++` is division
+```
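The first pair of lines above, executed:

```javascript
// At the start of a statement, `{}` is a block, so `/x/` is a regexp literal.
const re = eval("{} /x/");
// Inside parentheses, `{}` is an object literal, so `/` is division.
const quotient = eval("({} / 2)");
// re instanceof RegExp, quotient is NaN
```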
+
+So how can the spec be implemented?
+
+In a recursive descent parser, you have to tell the lexer which goal
+symbol to use every time you ask for a token. And if you look ahead at
+a token but *don’t* consume it, then fall back on another path that can
+accept a *RegularExpressionLiteral* or *DivPunctuator*, you have to
+make sure you did not initially lex that token incorrectly. We have
+assertions for this and it is a bit of a nightmare when we have to touch
+it (which is thankfully rare). Part of the problem is that the token
+you’re peeking ahead at might not be part of the same production at all.
+Thanks to ASI, it might be the start of the next statement, which will
+be parsed in a faraway part of the parser.
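+
+Concretely, a recursive descent parser ends up threading a goal-symbol
+argument through every lexer call. Here is a minimal sketch — all names
+are invented for illustration, and a real lexer must also handle
+escapes, character classes, and regexp flags:
+
+```js
+// The two lexical goal symbols that matter for `/`.
+const Goal = {
+  Div: "InputElementDiv",       // `/` is a DivPunctuator
+  RegExp: "InputElementRegExp", // `/` starts a RegularExpressionLiteral
+};
+
+function scanSlash(src, pos, goal) {
+  if (goal === Goal.RegExp) {
+    // Naive scan of /.../: stop at the next `/` (ignoring escapes).
+    const end = src.indexOf("/", pos + 1);
+    if (end < 0) throw new SyntaxError("unterminated regexp");
+    return { type: "RegExpLiteral", value: src.slice(pos, end + 1) };
+  }
+  // Goal.Div: the slash is an operator, `/` or `/=`.
+  const value = src[pos + 1] === "=" ? "/=" : "/";
+  return { type: "DivPunctuator", value };
+}
+```
+
+The caller has to supply the goal from grammatical context — and that
+context is exactly what’s hard to know when peeking ahead.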
+
+A table-driven parser has it easy here! The lexer can consult the state
+table and see which kind of token can be accepted in the current
+state. This is closer to what the spec actually says.
+
+Two minor things to watch out for:
+
+* The nonterminal *ClassTail* is used both at the end of
+ *ClassExpression*, which may be followed by `/`; and at the end of
+ *ClassDeclaration*, which may be followed by a
+ *RegularExpressionLiteral* at the start of the next
+ statement. Canonical LR creates separate states for these two uses
+ of *ClassTail*, but the LALR algorithm will unify them, creating
+ some states that have both `/` and *RegularExpressionLiteral* in the
+ follow set. In these states, determining which terminal is actually
+ allowed requires looking not only at the current state, but at the
+ current stack of states (to see one level of grammatical context).
+
+* Since this decision depends on the parser state, and automatic
+ semicolon insertion adjusts the parser state, a parser may need to
+ re-scan a token after ASI.
+
+In other kinds of generated parsers, at least the lexical goal symbol
+can be determined automatically.
+
+
+### Lazy compilation and scoping (**)
+
+*(entangled with: arrow functions)*
+
+JS engines *lazily compile* function bodies. During parsing, when the
+engine sees a `function`, it switches to a high-speed parsing mode
+(which I will call “Beast Mode”) that just skims the function and checks
+for syntax errors. Beast Mode does not compile the code. Beast Mode
+doesn’t even create AST nodes. All that will be done later, on demand,
+the first time the function is called.
+
+The point is to get through parsing *fast*, so that the script can start
+running. In browsers, `<script>` compilation usually must finish before
+we can show the user any meaningful content. Any part of it that can be
+deferred must be deferred. (Bonus: many functions are never called, so
+the work is not just deferred but avoided altogether.)
+
+Later, when a function is called, we can still defer compilation of
+nested functions inside it.
+
+So what? Seems easy enough, right? Well...
+
+Local variables in JS are optimized. To generate reasonable code for the
+script or function we are compiling, we need to know which of its local
+variables are used by nested functions, which we are *not* currently
+compiling. That is, we must know which variables *occur free in* each
+nested function. So scoping, which otherwise could be done as a separate
+phase of compilation, must be done during parsing in Beast Mode.
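+
+The flavor of that analysis can be sketched like this — the AST shape
+here is invented, and real engines do this over a token stream with
+dedicated scope data structures rather than a toy tree:
+
+```js
+// Even a syntax-only pass must track bindings well enough to report
+// which names occur free in each nested function.
+function freeVars(ast) {
+  const free = new Set();
+  function walk(node, bound) {
+    switch (node.type) {
+      case "Ident":
+        if (!bound.has(node.name)) free.add(node.name);
+        break;
+      case "Function":
+        // Parameters bind names throughout the nested function's body.
+        walk(node.body, new Set([...bound, ...node.params]));
+        break;
+      case "Block":
+        node.stmts.forEach(s => walk(s, bound));
+        break;
+    }
+  }
+  walk(ast, new Set());
+  return free;
+}
+```
+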
+
+Getting the scoping of arrow functions right while parsing is tricky,
+because it’s often not possible to know when you’re entering a scope
+until later. Consider parsing `(a`. This could be the beginning of an
+arrow function, or not; we might not know until after we reach the
+matching `)`, which could be a long way away.
+
+Annex B.3.3 adds extremely complex scoping rules for function
+declarations, which make this especially difficult. In
+
+```js
+function f() {
+ let x = 1;
+ return function g() {
+ {
+ function x(){}
+ }
+ {
+ let x;
+ }
+ return x;
+ };
+}
+```
+
+the function `g` does not use the initial `let x`, but in
+
+```js
+function f() {
+ let x = 1;
+ return function g() {
+ {
+ {
+ function x(){}
+ }
+ let x;
+ }
+ return x;
+ };
+}
+```
+it does.
+
+[Here](https://dev.to/rkirsling/tales-from-ecma-s-crypt-annex-b-3-3-56go)
+is a good writeup of what's going on in these examples.
+
+
+### Arrow functions, assignment, destructuring, and cover grammars (\*\*\*)
+
+*(entangled with: lazy compilation)*
+
+Suppose the parser is scanning text from start to end, when it sees a
+statement that starts with something like this:
+
+```js
+let f = (a
+```
+
+The parser doesn’t know yet if `(a ...` is *ArrowParameters* or just a
+*ParenthesizedExpression*. Either is possible. We’ll know when either
+(1) we see something that rules out one case or the other:
+
+```js
+let f = (a + b // can't be ArrowParameters
+
+let f = (a, b, ...args // can't be ParenthesizedExpression
+```
+
+or (2) we reach the matching `)` and see if the next token is `=>`.
+
+Probably (1) is a pain to implement.
+
+To keep the language nominally LR(1), the standard specifies a “cover
+grammar”, a nonterminal named
+*CoverParenthesizedExpressionAndArrowParameterList* that is a superset
+of both *ArrowParameters* and *ParenthesizedExpression*. The spec
+grammar uses the *Cover* nonterminal in both places, so technically `let
+f = (a + b) => 7;` and `let f = (a, b, ...args);` are both syntactically
+valid *Script*s, according to the formal grammar. But there are Early
+Error rules (that is, plain English text, not part of the formal
+grammar) that say arrow functions’ parameters must match
+*ArrowParameters*, and parenthesized expressions must match
+*ParenthesizedExpression*.
+
+So after the initial parse, the implementation must somehow check that
+the *CoverParenthesizedExpressionAndArrowParameterList* really is valid
+in context. This complicates lazy compilation because in Beast Mode we
+are not even building an AST. It’s not easy to go back and check.
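+
+One way to picture the refinement step, once the matching `)` is
+reached — the node shapes here are invented, and a real implementation
+must also cope with Beast Mode, where there may be no AST to check:
+
+```js
+// After the `)`, the next token decides how the cover parse is refined.
+function refineCover(items, hasRest, nextTokenIsArrow) {
+  if (nextTokenIsArrow) {
+    // Must be valid ArrowParameters: each item a binding form.
+    if (!items.every(n => n.type === "Ident" || n.type === "Pattern")) {
+      throw new SyntaxError("invalid arrow function parameters");
+    }
+    return { type: "ArrowParameters", params: items };
+  }
+  // Must be a ParenthesizedExpression: no rest element, not empty.
+  if (hasRest || items.length === 0) {
+    throw new SyntaxError("unexpected token");
+  }
+  return { type: "ParenthesizedExpression", exprs: items };
+}
+```
+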
+
+Something similar happens in a few other cases: the spec is written as
+though syntactic rules are applied after the fact:
+
+* *CoverCallExpressionAndAsyncArrowHead* covers the syntax `async(x)`
+ which looks like a function call and also looks like the beginning
+ of an async arrow function.
+
+* *ObjectLiteral* is a cover grammar; it covers both actual object
+ literals and the syntax `propertyName = expr`, which is not valid in
+ object literals but allowed in destructuring assignment:
+
+ ```js
+ var obj = {x = 1}; // SyntaxError, by an early error rule
+
+ ({x = 1} = {}); // ok, assigns 1 to x
+ ```
+
+* *LeftHandSideExpression* is used to the left of `=` in assignment,
+ and as the operand of postfix `++` and `--`. But this is way too lax;
+ most expressions shouldn’t be assigned to:
+
+ ```js
+ 1 = 0; // matches the formal grammar, SyntaxError by an early error rule
+
+ null++; // same
+
+ ["a", "b", "c"]++; // same
+ ```
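+
+These checks all amount to a predicate the parser runs after the formal
+grammar has already matched. A rough sketch for the assignment-target
+case, with invented node names:
+
+```js
+// Early-error check for assignment targets; the formal grammar accepts
+// any LeftHandSideExpression, so this runs after the fact.
+function isValidAssignmentTarget(node) {
+  switch (node.type) {
+    case "Ident":          // x = ...
+    case "Member":         // obj.prop = ..., obj[key] = ...
+      return true;
+    case "ArrayLiteral":   // refined to destructuring patterns,
+    case "ObjectLiteral":  // subject to further per-element checks
+      return true;
+    default:               // e.g. `1 = 0`, `null++`
+      return false;
+  }
+}
+```
+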
+
+
+## Conclusion
+
+What have we learned today?
+
+* Do not write a JS parser.
+
+* JavaScript has some syntactic horrors in it. But hey, you don't
+ make the world's most widely used programming language by avoiding
+ all mistakes. You do it by shipping a serviceable tool, in the right
+ circumstances, for the right users.