From 97144fb2271b9bd749deac2eb3537103d9513182 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 24 Dec 2023 08:49:56 +0100 Subject: Adding upstream version 20.4.0. Signed-off-by: Daniel Baumann --- posts/ast_primer.md | 405 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 405 insertions(+) create mode 100644 posts/ast_primer.md (limited to 'posts') diff --git a/posts/ast_primer.md b/posts/ast_primer.md new file mode 100644 index 0000000..a02c11f --- /dev/null +++ b/posts/ast_primer.md @@ -0,0 +1,405 @@ +# A Primer on SQLGlot's Abstract Syntax Tree + +SQLGlot is a powerful tool for analyzing and transforming SQL, but the learning curve can be intimidating. + +This post is intended to familiarize newbies with SQLGlot's abstract syntax trees, how to traverse them, and how to mutate them. + +## The tree + +SQLGlot parses SQL into an abstract syntax tree (AST). +```python +from sqlglot import parse_one + +ast = parse_one("SELECT a FROM (SELECT a FROM x) AS x") +``` + +An AST is a data structure that represents a SQL statement. The best way to glean the structure of a particular AST is python's builtin `repr` function: +```python +repr(ast) + +# (SELECT expressions: +# (COLUMN this: +# (IDENTIFIER this: a, quoted: False)), from: +# (FROM this: +# (SUBQUERY this: +# (SELECT expressions: +# (COLUMN this: +# (IDENTIFIER this: a, quoted: False)), from: +# (FROM this: +# (TABLE this: +# (IDENTIFIER this: x, quoted: False)))), alias: +# (TABLEALIAS this: +# (IDENTIFIER this: x, quoted: False))))) +``` + +This is a textual representation of the internal data structure. Here's a breakdown of some of its components: +``` +Expression type child key + | / +(SELECT expressions: + (COLUMN this: ---------------------------------- COLUMN is a child node of SELECT + (IDENTIFIER this: a, quoted: False)), from: -- "from:" is another child key of SELECT + (FROM this: ------------------------------------- FROM is also a child node of SELECT + ... +``` + +## Nodes of the tree + +The nodes in this tree are instances of `sqlglot.Expression`. Nodes reference their children in `args` and their parent in `parent`: +```python +ast.args +# { +# "expressions": [(COLUMN this: ...)], +# "from": (FROM this: ...), +# ... +# } + +ast.args["expressions"][0] +# (COLUMN this: ...) + +ast.args["expressions"][0].args["this"] +# (IDENTIFIER this: ...) + +ast.args["from"] +# (FROM this: ...) + +assert ast.args["expressions"][0].args["this"].parent.parent is ast +``` + +Children can either be: +1. An Expression instance +2. A list of Expression instances +3. Another Python object, such as str or bool. This will always be a leaf node in the tree. + +Navigating this tree requires an understanding of the different Expression types. The best way to browse Expression types is directly in the code at [expressions.py](../sqlglot/expressions.py). Let's look at a simplified version of one Expression type: +```python +class Column(Expression): + arg_types = { + "this": True, + "table": False, + ... + } +``` + +`Column` subclasses `Expression`. + +`arg_types` is a class attribute that specifies the possible children. The `args` keys of an Expression instance correspond to the `arg_types` keys of its class. The values of the `arg_types` dict are `True` if the key is required. + +There are some common `arg_types` keys: +- "this": This is typically used for the primary child. In `Column`, "this" is the identifier for the column's name. +- "expression": This is typically used for the secondary child +- "expressions": This is typically used for a primary list of children + +There aren't strict rules for when these keys are used, but they help with some of the convenience methods available on all Expression types: +- `Expression.this`: shorthand for `self.args.get("this")` +- `Expression.expression`: similarly, shorthand for the expression arg +- `Expression.expressions`: similarly, shorthand for the expressions list arg +- `Expression.name`: text name for whatever `this` is + +`arg_types` don't specify the possible Expression types of children. This can be a challenge when you are writing code to traverse a particular AST and you don't know what to expect. A common trick is to parse an example query and print out the `repr`. + +You can traverse an AST using just args, but there are some higher-order functions for programmatic traversal. + +> [!NOTE] +> SQLGlot can parse and generate SQL for many different dialects. However, there is only a single set of Expression types for all dialects. We like to say that the AST can represent the _superset_ of all dialects. +> +> Sometimes, SQLGlot will parse SQL from a dialect into Expression types you didn't expect: +> +> ```python +> ast = parse_one("SELECT NOW()", dialect="postgres") +> +> repr(ast) +> # (SELECT expressions: +> # (CURRENTTIMESTAMP )) -- CURRENTTIMESTAMP, not NOW() +> ``` +> +> This is because SQLGlot tries to converge dialects on a standard AST. This means you can often write one piece of code that handles multiple dialects. + +## Traversing the AST + +Analyzing a SQL statement requires traversing this data structure. There are a few ways to do this: + +### Args + +If you know the structure of an AST, you can use `Expression.args` just like above. However, this can be very limited if you're dealing with arbitrary SQL. + +### Walk methods + +The walk methods of `Expression` (`find`, `find_all`, and `walk`) are the simplest way to analyze an AST. + +`find` and `find_all` search an AST for specific Expression types: +```python +from sqlglot import exp + +ast.find(exp.Select) +# (SELECT expressions: +# (COLUMN this: +# (IDENTIFIER this: a, quoted: False)), from: +# ... + +list(ast.find_all(exp.Select)) +# [(SELECT expressions: +# (COLUMN this: +# (IDENTIFIER this: a, quoted: False)), from: +# ... +``` + +Both `find` and `find_all` are built on `walk`, which gives finer grained control: +```python +for ( + node, # the current AST node + parent, # parent of the current AST node (this will be None for the root node) + key # The 'key' of this node in its parent's args +) in ast.walk(): + ... +``` + +> [!WARNING] +> Here's a common pitfall of the walk methods: +> ```python +> ast.find_all(exp.Table) +> ``` +> At first glance, this seems like a great way to find all tables in a query. However, `Table` instances are not always tables in your database. Here's an example where this fails: +> ```python +> ast = parse_one(""" +> WITH x AS ( +> SELECT a FROM y +> ) +> SELECT a FROM x +> """) +> +> # This is NOT a good way to find all tables in the query! +> for table in ast.find_all(exp.Table): +> print(table) +> +> # x -- this is a common table expression, NOT an actual table +> # y +> ``` +> +> For programmatic traversal of ASTs that requires deeper semantic understanding of a query, you need "scope". + +### Scope + +Scope is a traversal module that handles more semantic context of SQL queries. It's harder to use than the `walk` methods but is more powerful: +```python +from sqlglot.optimizer.scope import build_scope + +ast = parse_one(""" +WITH x AS ( + SELECT a FROM y +) +SELECT a FROM x +""") + +root = build_scope(ast) +for scope in root.traverse(): + print(scope) + +# Scope +# y.c => Scope +# y.b => Scope