# Primer on SQLGlot's Abstract Syntax Tree SQLGlot is a powerful tool for analyzing and transforming SQL, but the learning curve can be intimidating. This post is intended to familiarize newbies with SQLGlot's abstract syntax trees, how to traverse them, and how to mutate them. ## The tree SQLGlot parses SQL into an abstract syntax tree (AST). ```python from sqlglot import parse_one ast = parse_one("SELECT a FROM (SELECT a FROM x) AS x") ``` An AST is a data structure that represents a SQL statement. The best way to glean the structure of a particular AST is python's builtin `repr` function: ```python repr(ast) # Select( # expressions=[ # Column( # this=Identifier(this=a, quoted=False))], # from=From( # this=Subquery( # this=Select( # expressions=[ # Column( # this=Identifier(this=a, quoted=False))], # from=From( # this=Table( # this=Identifier(this=x, quoted=False)))), # alias=TableAlias( # this=Identifier(this=x, quoted=False))))) ``` This is a textual representation of the internal data structure. Here's a breakdown of some of its components: ``` `Select` is the expression type | Select( expressions=[ ------------------------------- `expressions` is a child key of `Select` Column( ----------------------------------- `Column` is the expression type of the child this=Identifier(this=a, quoted=False))], from=From( ---------------------------------- `from` is another child key of `Select` ... ``` ## Nodes of the tree The nodes in this tree are instances of `sqlglot.Expression`. Nodes reference their children in `args` and their parent in `parent`: ```python ast.args # { # "expressions": [Column(this=...)], # "from": From(this=...), # ... # } ast.args["expressions"][0] # Column(this=...) ast.args["expressions"][0].args["this"] # Identifier(this=...) ast.args["from"] # From(this=...) assert ast.args["expressions"][0].args["this"].parent.parent is ast ``` Children can either be: 1. An Expression instance 2. A list of Expression instances 3. Another Python object, such as str or bool. This will always be a leaf node in the tree. Navigating this tree requires an understanding of the different Expression types. The best way to browse Expression types is directly in the code at [expressions.py](../sqlglot/expressions.py). Let's look at a simplified version of one Expression type: ```python class Column(Expression): arg_types = { "this": True, "table": False, ... } ``` `Column` subclasses `Expression`. `arg_types` is a class attribute that specifies the possible children. The `args` keys of an Expression instance correspond to the `arg_types` keys of its class. The values of the `arg_types` dict are `True` if the key is required. There are some common `arg_types` keys: - "this": This is typically used for the primary child. In `Column`, "this" is the identifier for the column's name. - "expression": This is typically used for the secondary child - "expressions": This is typically used for a primary list of children There aren't strict rules for when these keys are used, but they help with some of the convenience methods available on all Expression types: - `Expression.this`: shorthand for `self.args.get("this")` - `Expression.expression`: similarly, shorthand for the expression arg - `Expression.expressions`: similarly, shorthand for the expressions list arg - `Expression.name`: text name for whatever `this` is `arg_types` don't specify the possible Expression types of children. This can be a challenge when you are writing code to traverse a particular AST and you don't know what to expect. A common trick is to parse an example query and print out the `repr`. You can traverse an AST using just args, but there are some higher-order functions for programmatic traversal. > [!NOTE] > SQLGlot can parse and generate SQL for many different dialects. However, there is only a single set of Expression types for all dialects. We like to say that the AST can represent the _superset_ of all dialects. > > Sometimes, SQLGlot will parse SQL from a dialect into Expression types you didn't expect: > > ```python > ast = parse_one("SELECT NOW()", dialect="postgres") > > repr(ast) > # Select( > # expressions=[ > # CurrentTimestamp()]) > ``` > > This is because SQLGlot tries to converge dialects on a standard AST. This means you can often write one piece of code that handles multiple dialects. ## Traversing the AST Analyzing a SQL statement requires traversing this data structure. There are a few ways to do this: ### Args If you know the structure of an AST, you can use `Expression.args` just like above. However, this can be very limited if you're dealing with arbitrary SQL. ### Walk methods The walk methods of `Expression` (`find`, `find_all`, and `walk`) are the simplest way to analyze an AST. `find` and `find_all` search an AST for specific Expression types: ```python from sqlglot import exp ast.find(exp.Select) # Select( # expressions=[ # Column( # this=Identifier(this=a, quoted=False))], # ... list(ast.find_all(exp.Select)) # [Select( # expressions=[ # Column( # this=Identifier(this=a, quoted=False))], # ... ``` Both `find` and `find_all` are built on `walk`, which gives finer grained control: ```python for node in ast.walk(): ... ``` > [!WARNING] > Here's a common pitfall of the walk methods: > ```python > ast.find_all(exp.Table) > ``` > At first glance, this seems like a great way to find all tables in a query. However, `Table` instances are not always tables in your database. Here's an example where this fails: > ```python > ast = parse_one(""" > WITH x AS ( > SELECT a FROM y > ) > SELECT a FROM x > """) > > # This is NOT a good way to find all tables in the query! > for table in ast.find_all(exp.Table): > print(table) > > # x -- this is a common table expression, NOT an actual table > # y > ``` > > For programmatic traversal of ASTs that requires deeper semantic understanding of a query, you need "scope". ### Scope Scope is a traversal module that handles more semantic context of SQL queries. It's harder to use than the `walk` methods but is more powerful: ```python from sqlglot.optimizer.scope import build_scope ast = parse_one(""" WITH x AS ( SELECT a FROM y ) SELECT a FROM x """) root = build_scope(ast) for scope in root.traverse(): print(scope) # Scope # y.c => Scope # y.b => Scope