From 62a67b10ff9f9eea6a4695649fb8252d2a4bc74d Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Mon, 3 Jun 2024 07:16:44 +0200 Subject: Merging upstream version 3.46.0. Signed-off-by: Daniel Baumann --- www/whybytecode.html | 435 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 435 insertions(+) create mode 100644 www/whybytecode.html (limited to 'www/whybytecode.html') diff --git a/www/whybytecode.html b/www/whybytecode.html new file mode 100644 index 0000000..98da8b3 --- /dev/null +++ b/www/whybytecode.html @@ -0,0 +1,435 @@ + + + + + +Why SQLite Uses Bytecode + + + +
+ + + +
+
+Small. Fast. Reliable.
Choose any three. +
+ + +
+
+ + + +
+
+
+ +
+
+
+Why SQLite Uses Bytecode +
+ + +
+ + + + + +

1. Introduction

+ +

Every SQL database engine works in roughly the same way: It first +translates the input SQL text into a "prepared statement". Then it "executes" +the prepared statement to generate a result. + +

A prepared statement is an object that represents the steps needed +to accomplish the input SQL. Or, to think of it in another way, +the prepared statement is the SQL statement translated into a form that is +more easily understood by the computer. + +

In SQLite, a prepared statement is an instance of the +sqlite3_stmt object. In other systems, the prepared +statement is usually an internal data structure that is not directly visible to +the application programmer. Developers of other SQL database engines +do not necessarily call these objects "prepared statements". +But such objects exists, whatever they might be called. +This paper will use the term "prepared statement". + +

There are countless ways of implementing a prepared statement. This +paper will look at the two most common methods: + +

    +
  1. +Bytecode → The input SQL is translated into a virtual machine language +that is then run by a virtual machine interpreter. This is the technique +used by SQLite. + +

  2. +Tree-Of-Objects → The input SQL is translated in a tree of objects +that represent the processing to be done. The SQL is executed by walking this +tree. This is the technique used by MySQL and PostgreSQL. +

+ +

+There are advantages and disadvantages to each of these representations of +a prepared statement. The purpose of this paper is to articulate some of those +advantages and disadvantages. + +

1.1. How To Provide Feedback

+ +

+This document is written from the perspective of the original author of SQLite. +If you disagree with any of the opinions offered in this document, you are +welcomed to offer corrections and/or contrary views on the +SQLite Forum. Or you can email the author +directly. + +

1.2. Definition Of "Bytecode"

+ +

The bytecode generated by SQLite might be a little different from what +many readers think of as bytecode. The bytecode used (for example) by the +Java virtual machine or +by WebAssembly consists almost +entirely of low-level operations, similar to what physical CPUs implement: +basic math operators, comparisons, conditional jumps, and +instructions to move content between different memory locations. SQLite bytecode +has these kinds of low-level instructions, too. But SQLite bytecode also contains +some high-level operations that are specific to the needs of a database engine. +Here are just a few examples: + +

+

    +
  • OP_Column → Extract the value from the N-th column of the +database row that a particular cursor is currently pointing at. +

  • OP_CreateBtree → Allocate space for a new B-Tree in the +database file. +

  • OP_ParseSchema → Reread and reparse all or part of the +sqlite_schema table and refresh internal symbol tables accordingly. +

  • OP_SeekGE → Move a cursor on a particular B-Tree to the first +entry that is greater than or equal to a given key. +

  • OP_Next → Advance a cursor on a particular B-Tree to the next +entry in the B-Tree and jump, or fall through if there are no more entries in that +B-Tree. +

+ +

In other words, the "bytecode" used by SQLite is not so much a set of CPU +instructions as it is a list of database primitives that are to be run +in a particular order. + +

1.3. Definition Of "Abstract Syntax Tree" or "AST"

+ +

+An "Abstract Syntax Tree" or AST +is a data structure that describes a program or statement in some kind of formal +language. In our context, the formal language is SQL. An AST is typically +implemented as a tree of objects where each object represents one small part of +the overall SQL statement. ASTs emerge naturally from parsers for formal languages. +The usual technique is to use an +LALR(1) parser. +With such a parser, each terminal +symbol holds metadata that will become a leaf of the AST, and each non-terminal +symbol holds metadata that will become a sub-branch of the overall AST. +As rules of the grammar are "reduced" by the parser, new nodes of the AST are +allocated and connected to subnodes. +After the parse completes, the start-symbol of the grammar is left holding the +root of the AST. + +

An AST is a tree of objects. But an AST is not a suitable form for +a prepared statement. After being generated, an AST first needs to be +transformed in various ways before it can executed. Symbols need to be +resolved. Semantic rules need to be checked. Optimizations need to be +applied that transform input SQL statement into different forms that +execute more quickly. Finally, the AST needs to be translated into an +alternative representation that is more amenable to execution. + +

Some people refer to the tree of objects that is used as the +executable form for MySQL and PostgreSQL as an AST. This is probably a +misuse of the term "AST", because by the time the tree of objects is +ready to be executed, it has been changed so much that it has little +resemblance to the original SQL text. The confusion arises in part because +both the final prepared statement object and the original AST are both +trees of objects. The usual technique is for the original AST that comes +directly out of the parser to be transformed little by little, in multiple +passes, until at the end it is fully converted into an tree of objects +that is no longer strictly an AST but that can be evaluated to +generate a result. There is not necessarily a clear point during this +process when the tree-of-objects ceases to be an AST and becomes a +prepared statement instead. And because there is no clear boundary between an +AST and a prepared statement, people often refer to a prepared statement +that is represented as a tree of objects as an "AST", even though that +description is not precise. + +

1.4. Dataflow Programming

+ +

Dataflow programming +is a style of programming in which individual nodes specialize in doing +one small part of the overall computation. Each node receives inputs from +other nodes and sends its output to other nodes. Thus the nodes form a +directed graph that carry inputs into outputs. + +

+A "dataflow program" is perhaps a better description than "AST" for +the tree of objects that an SQL database engine uses as a prepared statement. + +

2. Advantages To Compiling Into Bytecode

+ +

SQLite compiles to bytecode, and the SQLite developers are very happy +with this approach. Here is why: + +

2.1. Bytecode Is Easier To Understand

+ +

A flat list of opcodes can be easily printed to see exactly +how an SQL statement is being implemented. This is what happens in SQLite +when you preface an SQL statement with the "EXPLAIN" keyword: Instead of +actually running the SQL, the result is a listing of the bytecode +that would have been used to implement that SQL. + +

Bytecode lends itself to this because a bytecode program is easily +represented as a table. In SQLite bytecode, each instruction +has one opcode and five operands. Thus a prepared statement can be +rendered as if it were a query against a six-column table. + +

A tree-of-objects representation is more difficult to publish in +a human-readable form. The objects that comprise the tree tend to +all be very different, and thus it is tricky to come up with a +consistent and simple table representation with which to display +the objects. Any such table representation that you do come up +with would almost certainly have more than six columns, probably many more. +The problem of rendering a tree-of-objects as a table is sufficiently +difficult that nobody does it, as far as I know. Hence, no +tree-of-objects database engine provides the level +of detail in their "EXPLAIN" output that SQLite provides. + +

2.2. Bytecode Is Easier To Debug

+ +

Bytecode provides a clear separation between front-end parsing and +analysis and back-end evaluation of an SQL statement. When problems arise +(incorrect answers and/or poor performance), the developers can examine +the bytecode to quickly determine if the source of the trouble is either the +front-end analysis or the back-end data storage section of the product. + +

In debugging builds of SQLite, the PRAGMA vdbe_trace=ON; command will +cause a trace of the bytecode execution to appear on the console. + +

2.3. Bytecode Can Be Run Incrementally

+ +

+SQL statements written in bytecode can be evaluated incrementally. +For example, a statement can be run until it generates just its first +row of output. The statement then pauses until it is stepped again. +It is not necessary to run the statement to completion before examining +the first row of output. + +

+This is more difficult to achieve in a tree-of-objects design. When +the prepared statement is a tree-of-objects, execution is normally +accomplished by walking the tree. To pause the statement in the middle +of a computation means unwinding the stack back up to the caller, all +the while saving enough state to resume evaluation where it last left +off. This is not impossible to do, but is sufficiently difficult that +I have never seen it actually done. + +

+Most SQL database engines do not really need to do incremental +execution of prepared statements because most SQL database engines +are client/server. In client/server engines, a single SQL statement is sent +to the server, then the complete reply comes back over the wire +all at once. Thus each statement runs to completion in a single go. +But SQLite is not client/server. SQLite is a library +that runs in the same address space and using the same stack as the +application. Being able to easily and reliably perform incremental +execution of an SQL statement is important to SQLite. + +

2.4. Bytecode Is Smaller

+ +

+The bytecode generated by SQLite is usually smaller than the corresponding +AST coming out of the parser. During initial processing of SQL text +(during the call to sqlite3_prepare() and similar) both the AST and the +bytecode exist in memory at the same time, so more memory is used then. +But that is a transient state. The AST +is quickly discarded and its memory recycled, even before the call to +sqlite3_prepare() returns, so the resulting prepared statement ends +up consuming less memory in its bytecode representation than it did as an AST. +This is important because calls to sqlite3_prepare() are transient, but +prepared statements are often cached for possible reuse and persist in memory +for a long time. + +

2.5. Bytecode Is Faster

+ +

+I believe that a bytecode representation of a prepared statement +runs faster, because fewer decisions need to be made for each step of +the computation. Emphasis on "believe" in the previous sentence +→ it is difficult to verify +this claim experimentally since nobody has ever put in the multiple years +of effort necessary to generate equivalent bytecode and tree-of-object +representations of a prepared statement to see which one actually runs faster. +We do know that SQLite is very fast, but we +do not have good side-by-side comparisons with other SQL databases since +the other databases spend a lot of time doing client/server message processing, +and it is difficult to untangle the message round-trip overhead from the +actual processing time. + +

3. Advantages Of Compiling Into A Tree Of Objects

+ +

The SQLite developers think that the bytecode approach is +best, at least for the use cases the SQLite tries to fill, but the +tree-of-objects approach to processing SQL does have some advantages over +bytecode. There are always tradeoffs. + +

3.1. Query Planning Decisions Can Be Deferred Until Runtime

+ +

+When a prepared statement is bytecode, once the bytecode has been generated, +the algorithm is fixed and cannot be subsequently changed without completely +rewriting the bytecode. +This is not the case with a tree-of-objects prepared +statement. A tree-of-objects is easier to modify on-the-fly. The query +plan is mutable and can be tweaked as it is running, based on the progress +of the query. Thus a query can be dynamically self-tuning. + +

3.2. Dataflow Programs Are Easy To Parallelize

+ +

In a dataflow program, each processing node can be assigned to a +different thread. There needs to be some kind of threadsafe queuing +mechanism for transferring intermediate results from one node to the +next. But no synchronization primitives are typically needed within +each node of the program. Node schedule is trivial: A node becomes +eligible to run when it has data available and there is space in its +output queue. + +

This is an important consideration for database engines that are +designed to run large analytic queries +(OLAP) +on large multi-core servers. +The primary focus of SQLite is transaction processing +(OLTP) +on the internet-of-things, so there is less need to +represent prepared statements as dataflow programs in SQLite. +

This page last modified on 2024-05-09 17:38:03 UTC

+ -- cgit v1.2.3