1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
|
## JS syntactic quirks
> *To make a labyrinth, it takes*
> *Some good intentions, some mistakes.*
> —A. E. Stallings, “Daedal”
JavaScript is rather hard to parse. Here is an in-depth accounting of
its syntactic quirks, with an eye toward actually implementing a parser
from scratch.
With apologies to the generous people who work on the standard. Thanks
for doing that—better you than me.
Thanks to [@bakkot](https://github.com/bakkot) and
[@mathiasbynens](https://github.com/mathiasbynens) for pointing out
several additional quirks.
Problems are rated in terms of difficulty, from `(*)` = easy to `(***)`
= hard. We’ll start with the easiest problems.
### Dangling else (*)
If you know what this is, you may be excused.
Statements like `if (EXPR) STMT if (EXPR) STMT else STMT`
are straight-up ambiguous in the JS formal grammar.
The ambiguity is resolved with
[a line of specification text](https://tc39.es/ecma262/#sec-if-statement):
> Each `else` for which the choice of associated `if` is ambiguous shall
> be associated with the nearest possible `if` that would otherwise have
> no corresponding `else`.
I love this sentence. Something about it cracks me up, I dunno...
In a recursive descent parser, just doing the dumbest possible thing
correctly implements this rule.
A parser generator has to decide what to do. In Yacc, you can use
operator precedence for this.
Yacc aside: This should seem a little outrageous at first, as `else` is
hardly an operator. It helps if you understand what Yacc is doing. In LR
parsers, this kind of ambiguity in the grammar manifests as a
shift-reduce conflict. In this case, when we’ve already parsed `if (
Expression ) Statement if ( Expression ) Statement`
and are looking at `else`, it’s unclear to Yacc
whether to reduce the if-statement or shift `else`. Yacc does not offer
a feature that lets us just say "always shift `else` here"; but there
*is* a Yacc feature that lets us resolve shift-reduce conflicts in a
rather odd, indirect way: operator precedence. We can resolve this
conflict by making `else` higher-precedence than the preceding symbol
`)`.
Alternatively, I believe it’s equivalent to add "[lookahead ≠ `else`]"
at the end of the IfStatement production that doesn’t have an `else`.
### Other ambiguities and informal parts of the spec (*)
Not all of the spec is as formal as it seems at first. Most of the stuff
in this section is easy to deal with, but #4 is special.
1. The lexical grammar is ambiguous: when looking at the characters `<<=`,
there is the question of whether to parse that as one token `<<=`, two
tokens (`< <=` or `<< =`), or three (`< < =`).
Of course every programming language has this, and the fix is one
sentence of prose in the spec:
> The source text is scanned from left to right, repeatedly taking the
> longest possible sequence of code points as the next input element.
This is easy enough for hand-coded lexers, and for systems that are
designed to use separate lexical and syntactic grammars. (Other
parser generators may need help to avoid parsing `functionf(){}` as
a function.)
2. The above line of prose does not apply *within* input elements, in
components of the lexical grammar. In those cases, the same basic
idea ("maximum munch") is specified using lookahead restrictions at
the end of productions:
> *LineTerminatorSequence* ::
> <LF>
> <CR>[lookahead ≠ <LF>]
> <LS>
> <PS>
> <CR><LF>
The lookahead restriction prevents a CR LF sequence from being
parsed as two adjacent *LineTerminatorSequence*s.
This technique is used in several places, particularly in
[*NotEscapeSequences*](https://tc39.es/ecma262/#prod-NotEscapeSequence).
3. Annex B.1.4 extends the syntax for regular expressions, making the
grammar ambiguous. Again, a line of prose explains how to cope:
> These changes introduce ambiguities that are broken by the
> ordering of grammar productions and by contextual
> information. When parsing using the following grammar, each
> alternative is considered only if previous production alternatives
> do not match.
4. Annex B.1.2 extends the syntax of string literals to allow legacy
octal escape sequences, like `\033`. It says:
> The syntax and semantics of 11.8.4 is extended as follows except
> that this extension is not allowed for strict mode code:
...followed by a new definition of *EscapeSequence*.
So there are two sets of productions for *EscapeSequence*, and an
implementation is required to implement both and dynamically switch
between them.
This means that `function f() { "\033"; "use strict"; }` is a
SyntaxError, even though the octal escape is scanned before we know
we're in strict mode.
For another ambiguity, see "Slashes" below.
### Unicode quirks
JavaScript source is Unicode and usually follows Unicode rules for thing
like identifiers and whitespace, but it has a few special cases: `$`,
`_`, `U+200C ZERO WIDTH NON-JOINER`, and `U+200D ZERO WIDTH JOINER` are
legal in identifiers (the latter two only after the first character), and
`U+FEFF ZERO WIDTH NO-BREAK SPACE` (also known as the byte-order mark) is
treated as whitespace.
It also allows any code point, including surrogate halves, even though the
Unicode standard says that unpaired surrogate halves should be treated as
encoding errors.
### Legacy octal literals and escape sequences (*)
This is more funny than difficult.
In a browser, in non-strict code, every sequence of decimal digits (not
followed by an identifier character) is a *NumericLiteral* token.
If it starts with `0`, with more digits after, then it's a legacy Annex
B.1.1 literal. If the token contains an `8` or a `9`, it's a decimal
number. Otherwise, hilariously, it's octal.
```
js> [067, 068, 069, 070]
[55, 68, 69, 56]
```
There are also legacy octal escape sequences in strings, and these have
their own quirks. `'\07' === '\u{7}'`, but `'\08' !== '\u{8}'` since 8
is not an octal digit. Instead `'\08' === '\0' + '8'`, because `\0`
followed by `8` or `9` is a legacy octal escape sequence representing
the null character. (Not to be confused with `\0` in strict code, not
followed by a digit, which still represents the null character, but
doesn't count as octal.)
None of this is hard to implement, but figuring out what the spec says
is hard.
### Strict mode (*)
*(entangled with: lazy compilation)*
A script or function can start with this:
```js
"use strict";
```
This enables ["strict mode"](https://tc39.es/ecma262/#sec-strict-mode-of-ecmascript).
Additionally, all classes and modules are strict mode code.
Strict mode has both parse-time and run-time effects. Parse-time effects
include:
* Strict mode affects the lexical grammar: octal integer literals are
SyntaxErrors, octal character escapes are SyntaxErrors, and a
handful of words like `private` and `interface` are reserved (and
thus usually SyntaxErrors) in strict mode.
Like the situation with slashes, this means it is not possible to
implement a complete lexer for JS without also parsing—at least
enough to detect class boundaries, "use strict" directives in
functions, and function boundaries.
* It’s a SyntaxError to have bindings named `eval` or `arguments` in
strict mode code, or to assign to `eval` or `arguments`.
* It’s a SyntaxError to have two argument bindings with the same name
in a strict function.
Interestingly, you don’t always know if you’re in strict mode or not
when parsing arguments.
```js
function foo(a, a) {
"use strict";
}
```
When the implementation reaches the Use Strict Directive, it must
either know that `foo` has two arguments both named `a`, or switch
to strict mode, go back, and reparse the function from the
beginning.
Fortunately an Early Error rule prohibits mixing `"use strict"` with
more complex parameter lists, like `function foo(x = eval('')) {`.
* The expression syntax “`delete` *Identifier*” and the abominable
*WithStatement* are banned in strict mode.
### Conditional keywords (**)
In some programming languages, you could write a lexer that has rules
like
* When you see `if`, return `Token::If`.
* When you see something like `apple` or `arrow` or `target`,
return `Token::Identifier`.
Not so in JavaScript. The input `if` matches both the terminal `if` and
the nonterminal *IdentifierName*, both of which appear in the high-level
grammar. The same goes for `target`.
This poses a deceptively difficult problem for table-driven parsers.
Such parsers run on a stream of token-ids, but the question of which
token-id to use for a word like `if` or `target` is ambiguous. The
current parser state can't fully resolve the ambiguity: there are cases
like `class C { get` where the token `get` might match either as a
keyword (the start of a getter) or as an *IdentifierName* (a method or
property named `get`) in different grammatical productions.
All keywords are conditional, but some are more conditional than others.
The rules are inconsistent to a tragicomic extent. Keywords like `if`
that date back to JavaScript 1.0 are always keywords except when used as
property names or method names. They can't be variable names. Two
conditional keywords (`await` and `yield`) are in the *Keyword* list;
the rest are not. New syntax that happened to be introduced around the
same time as strict mode was awarded keyword status in strict mode. The
rules are scattered through the spec. All this interacts with `\u0065`
Unicode escape sequences somehow. It’s just unbelievably confusing.
(After writing this section, I
[proposed revisions to the specification](https://github.com/tc39/ecma262/pull/1694)
to make it a little less confusing.)
* Thirty-six words are always reserved:
> `break` `case` `catch` `class` `const` `continue` `debugger`
> `default` `delete` `do` `else` `enum` `export` `extends` `false`
> `finally` `for` `function` `if` `import` `in` `instanceof` `new`
> `null` `return` `super` `switch` `this` `throw` `true` `try`
> `typeof` `var` `void` `while` `with`
These tokens can't be used as names of variables or arguments.
They're always considered special *except* when used as property
names, method names, or import/export names in modules.
```js
// property names
let obj = {if: 3, function: 4};
assert(obj.if == 3);
// method names
class C {
if() {}
function() {}
}
// imports and exports
import {if as my_if} from "modulename";
export {my_if as if};
```
* Two more words, `yield` and `await`, are in the *Keyword* list but
do not always act like keywords in practice.
* `yield` is a *Keyword*; but it can be used as an identifier,
except in generators and strict mode code.
This means that `yield - 1` is valid both inside and outside
generators, with different meanings. Outside a generator, it’s
subtraction. Inside, it yields the value **-1**.
That reminds me of the Groucho Marx line: Outside of a dog, a
book is a man’s best friend. Inside of a dog it’s too dark to
read.
* `await` is like that, but in async functions. Also it’s not a
valid identifier in modules.
Conditional keywords are entangled with slashes: `yield /a/g` is two
tokens in a generator but five tokens elsewhere.
* In strict mode code, `implements`, `interface`, `package`,
`private`, `protected`, and `public` are reserved (via Early Errors
rules).
This is reflected in the message and location information for
certain syntax errors:
```
SyntaxError: implements is a reserved identifier:
class implements {}
......^
SyntaxError: implements is a reserved identifier:
function implements() { "use strict"; }
....................................^
```
* `let` is not a *Keyword* or *ReservedWord*. Usually it can be an
identifier. It is special at the beginning of a statement or after
`for (` or `for await (`.
```js
var let = [new Date]; // ok: let as identifier
let v = let; // ok: let as keyword, then identifier
let let; // SyntaxError: banned by special early error rule
let.length; // ok: `let .` -> ExpressionStatement
let[0].getYear(); // SyntaxError: `let [` -> LexicalDeclaration
```
In strict mode code, `let` is reserved.
* `static` is similar. It’s a valid identifier, except in strict
mode. It’s only special at the beginning of a *ClassElement*.
In strict mode code, `static` is reserved.
* `async` is similar, but trickier. It’s an identifier. It is special
only if it’s marking an `async` function, method, or arrow function
(the tough case, since you won’t know it’s an arrow function until
you see the `=>`, possibly much later).
```js
function async() {} // normal function named "async"
async(); // ok, `async` is an Identifier; function call
async() => {}; // ok, `async` is not an Identifier; async arrow function
```
* `of` is special only in one specific place in `for-of` loop syntax.
```js
var of = [1, 2, 3];
for (of of of) console.log(of); // logs 1, 2, 3
```
Amazingly, both of the following are valid JS code:
```js
for (async of => {};;) {}
for (async of []) {}
```
In the first line, `async` is a keyword and `of` is an identifier;
in the second line it's the other way round.
Even a simplified JS grammar can't be LR(1) as long as it includes
the features used here!
* `get` and `set` are special only in a class or an object literal,
and then only if followed by a PropertyName:
```js
var obj1 = {get: f}; // `get` is an identifier
var obj2 = {get x() {}}; // `get` means getter
class C1 { get = 3; } // `get` is an identifier
class C2 { get x() {} } // `get` means getter
```
* `target` is special only in `new.target`.
* `arguments` and `eval` can't be binding names, and can't be assigned
to, in strict mode code.
To complicate matters, there are a few grammatical contexts where both
*IdentifierName* and *Identifier* match. For example, after `var {`
there are two possibilities:
```js
// Longhand properties: BindingProperty -> PropertyName -> IdentifierName
var { xy: v } = obj; // ok
var { if: v } = obj; // ok, `if` is an IdentifierName
// Shorthand properties: BindingProperty -> SingleNameBinding -> BindingIdentifier -> Identifier
var { xy } = obj; // ok
var { if } = obj; // SyntaxError: `if` is not an Identifier
```
### Escape sequences in keywords
*(entangled with: conditional keywords, ASI)*
You can use escape sequences to write variable and property names, but
not keywords (including contextual keywords in contexts where they act
as keywords).
So `if (foo) {}` and `{ i\u0066: 0 }` are legal but `i\u0066 (foo)` is not.
And you don't necessarily know if you're lexing a contextual keyword
until the next token: `({ g\u0065t: 0 })` is legal, but
`({ g\u0065t x(){} })` is not.
And for `let` it's even worse: `l\u0065t` by itself is a legal way to
reference a variable named `let`, which means that
```js
let
x
```
declares a variable named `x`, while, thanks to ASI,
```js
l\u0065t
x
```
is a reference to a variable named `let` followed by a reference to a
variable named `x`.
### Early errors (**)
*(entangled with: lazy parsing, conditional keywords, ASI)*
Some early errors are basically syntactic. Others are not.
This is entangled with lazy compilation: "early errors" often involve a
retrospective look at an arbitrarily large glob of code we just parsed,
but in Beast Mode we’re not building an AST. In fact we would like to be
doing as little bookkeeping as possible.
Even setting that aside, every early error is a special case, and it’s
just a ton of rules that all have to be implemented by hand.
Here are some examples of Early Error rules—setting aside restrictions
that are covered adequately elsewhere:
* Rules about names:
* Rules that affect the set of keywords (character sequences that
match *IdentifierName* but are not allowed as binding names) based
on whether or not we’re in strict mode code, or in a
*Module*. Affected identifiers include `arguments`, `eval`, `yield`,
`await`, `let`, `implements`, `interface`, `package`, `private`,
`protected`, `public`, `static`.
* One of these is a strangely worded rule which prohibits using
`yield` as a *BindingIdentifier*. At first blush, this seems
like it could be enforced in the grammar, but that approach
would make this a valid program, due to ASI:
```js
let
yield 0;
```
Enforcing the same rule using an Early Error prohibits ASI here.
It works by exploiting the detailed inner workings of ASI case
1, and arranging for `0` to be "the offending token" rather than
`yield`.
* Lexical variable names have to be unique within a scope:
* Lexical variables (`let` and `const`) can’t be declared more
than once in a block, or both lexically declared and
declared with `var`.
* Lexically declared variables in a function body can’t have the same
name as argument bindings.
* A lexical variable can’t be named `let`.
* Common-sense rules dealing with unicode escape sequences in
identifiers.
* Common-sense rules about regular expression literals. (They have to
actually be valid regular expressions, and unrecognized flags are
errors.)
* The number of string parts that a template string can have is
limited to 2<sup>32</sup> − 1.
* Invalid Unicode escape sequences, like `\7` or `\09` or `\u{3bjq`, are
banned in non-tagged templates (in tagged templates, they are allowed).
* The *SuperCall* syntax is allowed only in derived class
constructors.
* `const x;` without an initializer is a Syntax Error.
* A direct substatement of an `if` statement, loop statement, or
`with` statement can’t be a labelled `function`.
* Early errors are used to hook up cover grammars.
* Early errors are also used in one case to avoid having to
specify a very large refinement grammar when *ObjectLiteral*
almost covers *ObjectAssignmentPattern*:
[sorry, too complicated to explain](https://tc39.es/ecma262/#sec-object-initializer-static-semantics-early-errors).
* Early errors are sometimes used to prevent parsers from needing to
backtrack too much.
* When parsing `async ( x = await/a/g )`, you don't know until the
next token if this is an async arrow or a call to a function named
`async`. This means you can't even tokenize properly, because in
the former case the thing following `x =` is two divisions and in
the latter case it's an *AwaitExpression* of a regular expression.
So an Early Error forbids having `await` in parameters at all,
allowing parsers to immediately throw an error if they find
themselves in this case.
Many strict mode rules are enforced using Early Errors, but others
affect runtime semantics.
<!--
* I think the rules about assignment targets are related to cover
grammars. Not sure.
* Expressions used in assignment context (i.e. as the operands of `++`
and `--`, the left operand of assignment, the loop variable of a
`for-in/for-of` loop, or sub-targets in destructuring) must be valid
assignment targets.
* In destructuring assignment, `[...EXPR] = x` is an error if `EXPR`
is an array or object destructuring target.
-->
### Boolean parameters (**)
Some nonterminals are parameterized. (Search for “parameterized
production” in [this spec
section](https://tc39.es/ecma262/#sec-grammar-notation).)
Implemented naively (e.g. by macro expansion) in a parser generator,
each parameter could nearly double the size of the parser. Instead, the
parameters must be tracked at run time somehow.
### Lookahead restrictions (**)
*(entangled with: restricted productions)*
TODO (I implemented this by hacking the entire LR algorithm. Most every
part of it is touched, although in ways that seem almost obvious once
you understand LR inside and out.)
(Note: It may seem like all of the lookahead restrictions in the spec
are really just a way of saying “this production takes precedence over
that one”—for example, that the lookahead restriction on
*ExpressionStatement* just means that other productions for statements
and declarations take precedence over it. But that isn't accurate; you
can't have an *ExpressionStatement* that starts with `{`, even if it
doesn't parse as a *Block* or any other kind of statement.)
### Automatic Semicolon Insertion (**)
*(entangled with: restricted productions, slashes)*
Most semicolons at the end of JS statements and declarations “may be
omitted from the source text in certain situations”. This is called
[Automatic Semicolon
Insertion](https://tc39.es/ecma262/#sec-automatic-semicolon-insertion),
or ASI for short.
The specification for this feature is both very-high-level and weirdly
procedural (“When, as the source text is parsed from left to right, a
token is encountered...”, as if the specification is telling a story
about a browser. As far as I know, this is the only place in the spec
where anything is assumed or implied about the internal implementation
details of parsing.) But it would be hard to specify ASI any other way.
Wrinkles:
1. Whitespace is significant (including whitespace inside comments).
Most semicolons in the grammar are optional only at the end of a
line (or before `}`, or at the end of the program).
2. The ending semicolon of a `do`-`while` statement is extra optional.
You can always omit it.
3. A few semicolons are never optional, like the semicolons in `for (;;)`.
This means there’s a semicolon in the grammar that is optionally
optional! This one:
> *LexicalDeclaration* : *LetOrConst* *BindingList* `;`
It’s usually optional, but not if this is the *LexicalDeclaration*
in `for (let i = 0; i < 9; i++)`!
4. Semicolons are not inserted only as a last resort to avoid
SyntaxErrors. That turned out to be too error-prone, so there are
also *restricted productions* (see below), where semicolons are more
aggressively inferred.
5. In implementations, ASI interacts with the ambiguity of *slashes*
(see below).
A recursive descent parser implements ASI by calling a special method
every time it needs to parse a semicolon that might be optional. The
special method has to peek at the next token and consume it only if it’s
a semicolon. This would not be so bad if it weren’t for slashes.
In a parser generator, ASI can be implemented using an error recovery
mechanism.
I think the [error recovery mechanism in
yacc/Bison](https://www.gnu.org/software/bison/manual/bison.html#Error-Recovery)
is too imprecise—when an error happens, it discards states from the
stack searching for a matching error-handling rule. The manual says
“Error recovery strategies are necessarily guesses.”
But here’s a slightly more careful error recovery mechanism that could
do the job:
1. For each production in the ES spec grammar where ASI could happen, e.g.
```
ImportDeclaration ::= `import` ModuleSpecifier `;`
{ import_declaration($2); }
```
add an ASI production, like this:
```
ImportDeclaration ::= `import` ModuleSpecifier [ERROR]
{ check_asi(); import_declaration($2); }
```
What does this mean? This production can be matched, like any other
production, but it's a fallback. All other productions take
precedence.
2. While generating the parser, treat `[ERROR]` as a terminal
symbol. It can be included in start sets and follow sets, lookahead,
and so forth.
3. At run time, when an error happens, synthesize an `[ERROR]` token.
Let that bounce through the state machine. It will cause zero or
more reductions. Then, it might actually match a production that
contains `[ERROR]`, like the ASI production above.
Otherwise, we’ll get another error—the entry in the parser table for
an `[ERROR]` token at this state will be an `error` entry. Then we
really have a syntax error.
This solves most of the ASI issues:
* [x] Whitespace sensitivity: That's what `check_asi()` is for. It
should signal an error if we're not at the end of a line.
* [x] Special treatment of `do`-`while` loops: Make an error production,
but don't `check_asi()`.
* [x] Rule banning ASI in *EmptyStatement* or `for(;;)`:
Easy, don't create error productions for those.
* [x] Banning ASI in `for (let x=1 \n x<9; x++)`: Manually adjust
the grammar, copying *LexicalDeclaration* so that there's a
*LexicalDeclarationNoASI* production used only by `for`
statements. Not a big deal, as it turns out.
* [x] Slashes: Perhaps have `check_asi` reset the lexer to rescan the
next token, if it starts with `/`.
* [ ] Restricted productions: Not solved. Read on.
### Restricted productions (**)
*(entangled with: ASI, slashes)*
Line breaks aren’t allowed in certain places. For example, the following
is not a valid program:
throw // SyntaxError
new Error();
For another example, this function contains two statements, not one:
function f(g) {
return // ASI
g();
}
The indentation is misleading; actually ASI inserts a semicolon at the
end of the first line: `return; g();`. (This function always returns
undefined. The second statement is never reached.)
These restrictions apply even to multiline comments, so the function
```js
function f(g) {
return /*
*/ g();
}
```
contains two statements, just as the previous example did.
I’m not sure why these rules exist, but it’s probably because (back in
the Netscape days) users complained about the bizarre behavior of
automatic semicolon insertion, and so some special do-what-I-mean hacks
were put in.
This is specified with a weird special thing in the grammar:
> *ReturnStatement* : `return` [no *LineTerminator* here] *Expression* `;`
This is called a *restricted production*, and it’s unfortunately
necessary to go through them one by one, because there are several
kinds. Note that the particular hack required to parse them in a
recursive descent parser is a little bit different each time.
* After `continue`, `break`, or `return`, a line break triggers ASI.
The relevant productions are all statements, and in each case
there’s an alternative production that ends immediately with a
semicolon: `continue ;` `break ;` and `return ;`.
Note that the alternative production is *not* restricted: e.g. a
*LineTerminator* can appear between `return` and `;`:
```js
if (x)
return // ok
;
else
f();
```
* After `throw`, a line break is a SyntaxError.
* After `yield`, a line break terminates the *YieldExpression*.
Here the alternative production is simply `yield`, not `yield ;`.
* In a post-increment or post-decrement expression, there can’t be a
line break before `++` or `--`.
The purpose of this rule is subtle. It triggers ASI and thus prevents
syntax errors:
```js
var x = y // ok: semicolon inserted here
++z;
```
Without the restricted production, `var x = y ++` would parse
successfully, and the “offending token” would be `z`. It would be
too late for ASI.
However, the restriction can of course also *cause* a SyntaxError:
```js
var x = (y
++); // SyntaxError
```
As we said, recursive descent parsers can implement these rules with hax.
In a generated parser, there are a few possible ways to implement
them. Here are three. If you are not interested in ridiculous
approaches, you can skip the first two.
* Treat every token as actually a different token when it appears
after a line break: `TokenType::LeftParen` and
`TokenType::LeftParenAfterLineBreak`. Of course the parser
generator can treat these exactly the same in normal cases, and
automatically generate identical table entries (or whatever) except
in states where there’s a relevant restricted production.
* Add a special LineTerminator token. Normally, the lexer skips
newlines and never emits this token. However, if the current state
has a relevant restricted production, the lexer knows this and emits
a LineTerminator for the first line break it sees; and the parser
uses that token to trigger an error or transition to another state,
as appropriate.
* When in a state that has a relevant restricted production, change
states if there’s a line break before the next token. That is,
split each such state into two: the one we stay in when there’s not
a line break, and the one we jump to if there is a line break.
In all cases it’ll be hard to have confidence that the resulting parser
generator is really sound. (That is, it might not properly reject all
ambiguous grammars.) I don’t know exactly what property of the few
special uses in the ES grammar makes them seem benign.
### Slashes (**)
*(entangled with: ASI, restricted productions)*
When you see `/` in a JS program, you don’t know if that’s a
division operator or the start of a regular expression unless you’ve
been paying attention up to that point.
[The spec:](https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar)
> There are several situations where the identification of lexical input
> elements is sensitive to the syntactic grammar context that is
> consuming the input elements. This requires multiple goal symbols for
> the lexical grammar.
You might think the lexer could treat `/` as an operator only if the
previous token is one that can be the last token of an expression (a set
that includes literals, identifiers, `this`, `)`, `]`, and `}`). To see
that this does not work, consider:
```js
{} /x/ // `/` after `}` is regexp
({} / 2) // `/` after `}` is division
for (g of /(a)(b)/) {} // `/` after `of` is regexp
var of = 6; of / 2 // `/` after `of` is division
throw /x/; // `/` after `throw` is regexp
Math.throw / 2; // `/` after `throw` is division
++/x/.lastIndex; // `/` after `++` is regexp
n++ / 2; // `/` after `++` is division
```
So how can the spec be implemented?
In a recursive descent parser, you have to tell the lexer which goal
symbol to use every time you ask for a token. And you have to make sure,
if you look ahead at a token, but *don’t* consume it, and fall back on
another path that can accept a *RegularExpressionLiteral* or
*DivPunctuator*, that you did not initially lex it incorrectly. We have
assertions for this and it is a bit of a nightmare when we have to touch
it (which is thankfully rare). Part of the problem is that the token
you’re peeking ahead at might not be part of the same production at all.
Thanks to ASI, it might be the start of the next statement, which will
be parsed in a faraway part of the Parser.
A table-driven parser has it easy here! The lexer can consult the state
table and see which kind of token can be accepted in the current
state. This is closer to what the spec actually says.
Two minor things to watch out for:
* The nonterminal *ClassTail* is used both at the end of
*ClassExpression*, which may be followed by `/`; and at the end of
*ClassDeclaration*, which may be followed by a
*RegularExpressionLiteral* at the start of the next
statement. Canonical LR creates separate states for these two uses
of *ClassTail*, but the LALR algorithm will unify them, creating
some states that have both `/` and *RegularExpressionLiteral* in the
follow set. In these states, determining which terminal is actually
allowed requires looking not only at the current state, but at the
current stack of states (to see one level of grammatical context).
* Since this decision depends on the parser state, and automatic
semicolon insertion adjusts the parser state, a parser may need to
re-scan a token after ASI.
In other kinds of generated parsers, at least the lexical goal symbol
can be determined automatically.
### Lazy compilation and scoping (**)
*(entangled with: arrow functions)*
JS engines *lazily compile* function bodies. During parsing, when the
engine sees a `function`, it switches to a high-speed parsing mode
(which I will call “Beast Mode”) that just skims the function and checks
for syntax errors. Beast Mode does not compile the code. Beast Mode
doesn’t even create AST nodes. All that will be done later, on demand,
the first time the function is called.
The point is to get through parsing *fast*, so that the script can start
running. In browsers, `<script>` compilation usually must finish before
we can show the user any meaningful content. Any part of it that can be
deferred must be deferred. (Bonus: many functions are never called, so
the work is not just deferred but avoided altogether.)
Later, when a function is called, we can still defer compilation of
nested functions inside it.
So what? Seems easy enough, right? Well...
Local variables in JS are optimized. To generate reasonable code for the
script or function we are compiling, we need to know which of its local
variables are used by nested functions, which we are *not* currently
compiling. That is, we must know which variables *occur free in* each
nested function. So scoping, which otherwise could be done as a separate
phase of compilation, must be done during parsing in Beast Mode.
Getting the scoping of arrow functions right while parsing is tricky,
because it’s often not possible to know when you’re entering a scope
until later. Consider parsing `(a`. This could be the beginning of an
arrow function, or not; we might not know until after we reach the
matching `)`, which could be a long way away.
Annex B.3.3 adds extremely complex rules for scoping for function
declarations which makes this especially difficult. In
```js
function f() {
let x = 1;
return function g() {
{
function x(){}
}
{
let x;
}
return x;
};
}
```
the function `g` does not use the initial `let x`, but in
```js
function f() {
let x = 1;
return function g() {
{
{
function x(){}
}
let x;
}
return x;
};
}
```
it does.
[Here](https://dev.to/rkirsling/tales-from-ecma-s-crypt-annex-b-3-3-56go)
is a good writeup of what's going on in these examples.
### Arrow functions, assignment, destructuring, and cover grammars (\*\*\*)
*(entangled with: lazy compilation)*
Suppose the parser is scanning text from start to end, when it sees a
statement that starts with something like this:
```js
let f = (a
```
The parser doesn’t know yet if `(a ...` is *ArrowParameters* or just a
*ParenthesizedExpression*. Either is possible. We’ll know when either
(1) we see something that rules out the other case:
```js
let f = (a + b // can't be ArrowParameters
let f = (a, b, ...args // can't be ParenthesizedExpression
```
or (2) we reach the matching `)` and see if the next token is `=>`.
Probably (1) is a pain to implement.
To keep the language nominally LR(1), the standard specifies a “cover
grammar”, a nonterminal named
*CoverParenthesizedExpressionAndArrowParameterList* that is a superset
of both *ArrowParameters* and *ParenthesizedExpression*. The spec
grammar uses the *Cover* nonterminal in both places, so technically `let
f = (a + b) => 7;` and `let f = (a, b, ...args);` are both syntactically
valid *Script*s, according to the formal grammar. But there are Early
Error rules (that is, plain English text, not part of the formal
grammar) that say arrow functions’ parameters must match
*ArrowParameters*, and parenthesized expressions must match
*ParenthesizedExpression*.
So after the initial parse, the implementation must somehow check that
the *CoverParenthesizedExpressionAndArrowParameterList* really is valid
in context. This complicates lazy compilation because in Beast Mode we
are not even building an AST. It’s not easy to go back and check.
Something similar happens in a few other cases: the spec is written as
though syntactic rules are applied after the fact:
* *CoverCallExpressionAndAsyncArrowHead* covers the syntax `async(x)`
which looks like a function call and also looks like the beginning
of an async arrow function.
* *ObjectLiteral* is a cover grammar; it covers both actual object
literals and the syntax `propertyName = expr`, which is not valid in
object literals but allowed in destructuring assignment:
```js
var obj = {x = 1}; // SyntaxError, by an early error rule
({x = 1} = {}); // ok, assigns 1 to x
```
* *LeftHandSideExpression* is used to the left of `=` in assignment,
and as the operand of postfix `++` and `--`. But this is way too lax;
most expressions shouldn’t be assigned to:
```js
1 = 0; // matches the formal grammar, SyntaxError by an early error rule
null++; // same
["a", "b", "c"]++; // same
```
## Conclusion
What have we learned today?
* Do not write a JS parser.
* JavaScript has some syntactic horrors in it. But hey, you don't
make the world's most widely used programming language by avoiding
all mistakes. You do it by shipping a serviceable tool, in the right
circumstances, for the right users.
|