diff options
Diffstat (limited to 'third_party/rust/jsparagus/journal.md')
-rw-r--r-- | third_party/rust/jsparagus/journal.md | 272 |
1 files changed, 272 insertions, 0 deletions
diff --git a/third_party/rust/jsparagus/journal.md b/third_party/rust/jsparagus/journal.md new file mode 100644 index 0000000000..3536812ef2 --- /dev/null +++ b/third_party/rust/jsparagus/journal.md @@ -0,0 +1,272 @@ +## What I learned, what I wonder + + +### Stab 5 (simple LR, LR(1), then LALR(1)) + +Well. I learned enough to implement this, although there is still much I +don't understand. + +I learned a bit about what kind of phenomenon can render a grammar +outside XLL(1) (that is, LL(1) as extended by automated left-factoring +and left-recursion elimination); see `testFirstFirstConflict` in +`test.py` for a contrived example, and `testLeftHandSideExpression` for +a realistic one. + +I learned that the shift-reduce operator precedence parser I wrote for +SpiderMonkey is even less like a typical LR parser than I imagined. + +I was stunned to find that the SLR parser I wrote first, including the +table generator, was *less* code than the predictive LL parser of stab +4. However, full LR(1) took rather a lot of code. + +I learned that I will apparently hand-code the computation of transitive +closures of sets under relations ten times before even considering +writing a general algorithm. The patterns I have written over and over +are: 1. `while not done:` visit every element already in the set, +iterating to a fixed point, which is this ludicrous O(*n*<sup>2</sup>) +in the number of pairs in the relation; 2. depth-first graph walking +with cycle detection, which can overflow the stack. + +I learned three ways to hack features into an LR parser generator (cf. how +easy it is to hack stuff into a recursive descent parser). The tricks I +know are: + +1. Add custom items. To add lookahead assertions, I just added a + lookahead element to the LRItem tuple. The trick then is to make + sure you are normalizing states that are actually identical, to + avoid combinatorial explosion—and eventually, I expect, table + compression. + +2. Add custom actions. I think I can support automatic semicolon + insertion by replacing the usual error action of some states with a + special ASI actions. + +3. Desugaring. The + [ECMAScript standard](https://tc39.es/ecma262/#sec-grammar-notation) + describes optional elements and parameterized nonterminals this way, + and for now at least, that's how we actually implement them. + +There's a lot still to learn here. + +* OMG, what does it all mean? I'm getting more comfortable with the + control flow ("calls" and "returns") of this system, but I wouldn't + say I understand it! + +* Why is lookahead, past the end of the current half-parsed + production, part of an LR item? What other kinds of item + embellishment could be done instead? + +* In what sense is an LR parser a DFA? I implemented it, but there's + more to it that I haven't grokked yet. + +* Is there just one DFA or many? What exactly is the "derived" grammar + that the DFA parses? How on earth does it magically turn out to be + regular? (This depends on it not extending past the first handle, + but I still don't quite see.) + +* If I faithfully implement the algorithms in the book, will it be + less of a dumpster fire? Smaller, more factored? + +* How can I tell if a transformation on grammars preserves the + property of being LR(k)? Factoring out a nonterminal, for example, + may not preserve LR(k)ness. Inlining probably always does. + +* Is there some variant of this that treats nonterminals more like + terminals? It's easy to imagine computing start sets and follow sets + that contain both kinds of symbols. Does that buy us anything? + + +Things I noticed: + +* I think Yacc allows bits of code in the middle of productions: + + nt1: T1 T2 nt2 { code1(); } T3 nt3 T4 { code2(); } + + That could be implemented by introducing a synthetic production + that contains everything up to the first code block: + + nt1_aux: T1 T2 nt2 { code1(); } + nt1: nt1_aux T3 nt3 T4 { code2(); } + + There is a principle that says code should happen only at the end of + a production: because LR states are superpositions of items. We + don't know which production we are really parsing until we reduce, + so we don't know which code to execute. + +* Each state is reachable from an initial state by a finite sequence + of "pushes", each of which pushes either a terminal (a shift action) + or a nonterminal (a summary of a bunch of parsing actions, ending + with a reduce). + + States can sometimes be reached multiple ways (it's a state + transition graph). But regardless of which path you take, the symbols + pushed by the last few steps always match the symbols appearing to + the left of point in each of the state's LR items. (This implies + that those items have to agree on what has happened. Might make a + nice assertion.) + + + +### Stab 4 (nonrecursive table-driven predictive LL parser) + +I learned that testing that a Python program can do something deeply +recursive is kind of nontrivial. :-\ + +I learned that the predictive parser still takes two stacks (one +representing the future and one representing the past). It's not magic! +This makes me want to hop back to stab 3, optimize away the operand +stack, and see what kind of code I can get. + +It seems like recursive descent would be faster, but the table-driven +parser could be made to support incremental parsing (the state of the +algorithm is "just data", a pair of stacks, neither of which is the +parser program's native call stack). + + +### Stab 3 (recursive descent with principled left-recursion-elimination and left-factoring) + +I learned how to eliminate left recursion in a grammar (Algorithm 4.1 +from the book). I learned how to check that a grammar is LL(1) using +the start and follow sets, although I didn't really learn what LL(1) +means in any depth. (I'm just using it as a means to prove that the +grammar is unambiguous.) + +I learned from the book how to do a table-driven "nonrecursive +predictive parser". Something to try later. + +I came up with the "reduction symbol" thing. It seems to work as +expected! This allows me to transform the grammar, but still generate +parse trees reflecting the source grammar. However, the resulting code +is inefficient. Further optimization would improve it, but the +predictive parser will fare better even without optimization. + +I wonder what differences there are between LL(1) and LR(1) grammars. +(The book repeatedly says they are different, but the distinctions it +draws suggest differences like: left-recursive grammars can be LR but +never LL. That particular difference doesn't matter much to me, because +there's an algorithm for eliminating left recursion.) + + +### Stab 2 (recursive descent with ad hoc immediate-left-recursion-elimination) + +I learned it's easy for code to race ahead of understanding. +I learned that a little feature can mean a lot of complexity. + +I learned that it's probably hard to support indirect left-recursion using this approach. +We're able to twist left-recursion into a `while` loop because what we're doing is local to a single nonterminal's productions, +and they're all parsed by a single function. +Making this work across function boundaries would be annoying, +even ignoring the possibility that a nonterminal can be involved in multiple left-call cycles. + +I wonder if the JS spec uses any indirect left-recursion. + +I wonder if there's a nice formalization of a "grammar with actions" that abstracts away "implementation details", +so that we could prove two grammars equivalent, +not just in that they describe the same language, +but equivalent in output. +This could help me explore "grammar rewrites", +which could lead to usable optimizations. + +I noticed that the ES spec contains this: + +> ### 13.6 The if Statement +> #### Syntax +> ``` +> IfStatement[Yield, Await, Return]: +> if ( Expression[+In, ?Yield, ?Await] ) Statement[?Yield, ?Await, ?Return] else Statement[?Yield, ?Await, ?Return] +> if ( Expression[+In, ?Yield, ?Await] ) Statement[?Yield, ?Await, ?Return] +> ``` +> +> Each `else` for which the choice of associated `if` is ambiguous shall +> be associated with the nearest possible `if` that would otherwise have +> no corresponding `else`. + +I wonder if this prose is effectively the same as adding a negative lookahead assertion +"[lookahead ≠ `else`]" at the end of the shorter production. + +(I asked bterlson and he thinks so.) + +I wonder if follow sets can be usefully considered as context-dependent. +What do I mean by this? +For example, `function` is certainly in the follow set of *Statement* in JS, +but there are plenty of contexts, like the rule `do Statement while ( Expression ) ;`, +where the nested *Statement* is never followed by `function`. +But does it matter? +I think it only matters if you're interested in better error messages. +Follow sets only matter to detect ambiguity in a grammar, +and *Statement* is ambiguous if it's ambiguous in *any* context. + + +### Stab 1 (very naive recursive descent) + +I learned that if you simply define a grammar as a set of rules, +there are all sorts of anomalies that can come up: + +* Vacant nonterminals (that do not match any input strings); + +* Nonterminals that match only infinite strings, like `a ::= X a`. + +* Cycles ("busy loops"), like `a ::= a`. + These always introduce ambiguity. + (You can also have cycles through multiple nonterminals: + `a ::= b; b ::= a`.) + +These in particular are easy to test for, with no false positives. +I wonder if there are other anomalies, +and if the "easiness" generalizes to all of them, and why. + +I know what it means for a grammar to be *ambiguous*: +it means there's at least one input with multiple valid parses. +I understand that parser generators can check for ambiguity. +But it's easiest to do so by imposing draconian restrictions. +I learned the "dangling `else` problem" is an ambiguity in exactly this sense. +I wonder if there's a principled way to deal with it. + +I know that a parse is a constructive proof that a string matches a grammar. + +I learned that start sets are important even in minimal parser generators. +This is interesting because they'll be a bit more interesting to compute +once we start considering empty productions. +I wonder if it turns out to still be pretty easy. +Does the start set of a possibly-empty production include its follow set? +(According to the dragon book, you add epsilon to the start set in this case.) + + +### Nice grammars + +I learned that the definition of a "grammar" +as a formal description of a language (= a set of strings) +is incomplete. + +Consider the Lisp syntax we're using: + +``` +sexpr ::= SYMBOL +sexpr ::= "(" tail + +tail ::= ")" +tail ::= sexpr tail +``` + +Nobody wants to parse Lisp like that. +There are two problems. + +One is expressive. +The `"("` and `")"` tokens should appear in the same production. +That way, the grammar says declaratively: these marks always appear in properly nesting pairs. + +``` +sexpr ::= SYMBOL +sexpr ::= "(" list ")" + +list ::= [empty] +list ::= sexpr list +``` + +The other problem has to do with *what you've got* when you get an automatically generated parse. +A grammar is more than just a description of a language, +to the extent we care about the form of the parse trees we get out of the parser. + +A grammar is a particular way of writing a parser, +and since we care about the parser's output, +we care about details of the grammar that would be mere "implementation details" otherwise. |