diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-13 14:07:11 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-13 14:07:11 +0000 |
commit | 63847496f14c813a5d80efd5b7de0f1294ffe1e3 (patch) | |
tree | 01c7571c7c762ceee70638549a99834fdd7c411b /doc/jsonb.md | |
parent | Initial commit. (diff) | |
download | sqlite3-63847496f14c813a5d80efd5b7de0f1294ffe1e3.tar.xz sqlite3-63847496f14c813a5d80efd5b7de0f1294ffe1e3.zip |
Adding upstream version 3.45.1.upstream/3.45.1
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/jsonb.md')
-rw-r--r-- | doc/jsonb.md | 290 |
1 files changed, 290 insertions, 0 deletions
diff --git a/doc/jsonb.md b/doc/jsonb.md new file mode 100644 index 0000000..5beed16 --- /dev/null +++ b/doc/jsonb.md @@ -0,0 +1,290 @@ +# The JSONB Format + +This document describes SQLite's JSONB binary encoding of +JSON. + +## 1.0 What Is JSONB? + +Beginning with version 3.45.0 (circa 2024-01-01), SQLite supports an +alternative binary encoding of JSON which we call "JSONB". JSONB is +a binary format that stored as a BLOB. + +The advantage of JSONB over ordinary text RFC 8259 JSON is that JSONB +is both slightly smaller (by between 5% and 10% in most cases) and +can be processed in less than half the number of CPU cycles. The built-in +[JSON SQL functions] of SQLite can accept either ordinary text JSON +or the binary JSONB encoding for any of their JSON inputs. + +The "JSONB" name is inspired by [PostgreSQL](https://postgresql.org), but the +on-disk format for SQLite's JSONB is not the same as PostgreSQL's. +The two formats have the same name, but they have wildly different internal +representations and are not in any way binary compatible. + +The central idea behind this JSONB specification is that each element +begins with a header that includes the size and type of that element. +The header takes the place of punctuation such as double-quotes, +curly-brackes, square-brackets, commas, and colons. Since the size +and type of each element is contained in its header, the element can +be read faster since it is no longer necessary to carefully scan forward +looking for the closing delimiter. The payload of JSONB is the same +as for corresponding text JSON. The same payload bytes occur in the +same order. The only real difference between JSONB and ordinary text +JSON is that JSONB includes a binary header on +each element and omits delimiter and separator punctuation. + +### 1.1 Internal Use Only + +The details of the JSONB are not intended to be visible to application +developers. Application developers should look at JSONB as an opaque BLOB +used internally by SQLite. Nevertheless, we want the format to be backwards +compatible across all future versions of SQLite. To that end, the format +is documented by this file in the source tree. But this file should be +used only by SQLite core developers, not by developers of applications +that only use SQLite. + +## 2.0 The Purpose Of This Document + +JSONB is not intended as an external format to be used by +applications. JSONB is designed for internal use by SQLite only. +Programmers do not need to understand the JSONB format in order to +use it effectively. +Applications should access JSONB only through the [JSON SQL functions], +not by looking at individual bytes of the BLOB. + +However, JSONB is intended to be portable and backwards compatible +for all future versions of SQLite. In other words, you should not have +to export and reimport your SQLite database files when you upgrade to +a newer SQLite version. For that reason, the JSONB format needs to +be well-defined. + +This document is therefore similar in purpose to the +[SQLite database file format] document that describes the on-disk +format of an SQLite database file. Applications are not expected +to directly read and write the bits and bytes of SQLite database files. +The SQLite database file format is carefully documented so that it +can be stable and enduring. In the same way, the JSONB representation +of JSON is documented here so that it too can be stable and enduring, +not so that applications can read or writes individual bytes. + +## 3.0 Encoding + +JSONB is a direct translation of the underlying text JSON. The difference +is that JSONB uses a binary encoding that is faster to parse compared to +the detailed syntax of text JSON. + +Each JSON element is encoded as a header and a payload. The header +determines type of element (string, numeric, boolean, null, object, or +array) and the size of the payload. The header can be between 1 and +9 bytes in size. The payload can be any size from zero bytes up to the +maximum allowed BLOB size. + +### 3.1 Payload Size + +The upper four bits of the first byte of the header determine size of the +header and possibly also the size of the payload. +If the upper four bits have a value between 0 and 11, then the header is +exactly one byte in size and the payload size is determined by those +upper four bits. If the upper four bits have a value between 12 and 15, +that means that the total header size is 2, 3, 5, or 9 bytes and the +payload size is unsigned big-endian integer that is contained in the +subsequent bytes. The size integer is the one byte that following the +initial header byte if the upper four bits +are 12, two bytes if the upper bits are 13, four bytes if the upper bits +are 14, and eight bytes if the upper bits are 15. The current design +of SQLite does not support BLOB values larger than 2GiB, so the eight-byte +variant of the payload size integer will never be used by the current code. +The eight-byte payload size integer is included in the specification +to allow for future expansion. + +The header for an element does *not* need to be in its simplest +form. For example, consider the JSON numeric value "`1`". +That element can be encode in five different ways: + + * `0x13 0x31` + * `0xc3 0x01 0x31` + * `0xd3 0x00 0x01 0x31` + * `0xe3 0x00 0x00 0x00 0x01 0x31` + * `0xf3 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x31` + +The shortest encoding is preferred, of course, and usually happens with +primitive elements such as numbers. However the total size of an array +or object might not be known exactly when the header of the element is +first generated. It is convenient to reserve space for the largest +possible header and then go back and fill in the correct payload size +at the end. This technique can result in array or object headers that +are larger than absolutely necessary. + +### 3.2 Element Type + +The least-significant four bits of the first byte of the header (the first +byte masked against 0x0f) determine element type. The following codes are +used: + +<ol> +<li type="0"><p><b>NULL</b> → +The element is a JSON "null". The payload size for a true JSON NULL must +must be zero. Future versions of SQLite might extend the JSONB format +with elements that have a zero element type but a non-zero size. In that +way, legacy versions of SQLite will interpret the element as a NULL +for backwards compatibility while newer versions will interpret the +element in some other way. + +<li value="1"><p><b>TRUE</b> → +The element is a JSON "true". The payload size must be zero for a actual +"true" value. Elements with type 1 and a non-zero payload size are +reserved for future expansion. Legacy implementations that see an element +type of 1 with a non-zero payload size should continue to interpret that +element as "true" for compatibility. + +<li value="2"><p><b>FALSE</b> → +The element is a JSON "false". The payload size must be zero for a actual +"false" value. Elements with type 2 and a non-zero payload size are +reserved for future expansion. Legacy implementations that see an element +type of 2 with a non-zero payload size should continue to interpret that +element as "false" for compatibility. + +<li value="3"><p><b>INT</b> → +The element is a JSON integer value in the canonical +RFC 8259 format, without extensions. The payload is the ASCII +text representation of that numeric value. + +<li value="4"><p><b>INT5</b> → +The element is a JSON integer value that is not in the +canonical format. The payload is the ASCII +text representation of that numeric value. Because the payload is in a +non-standard format, it will need to be translated when the JSONB is +converted into RFC 8259 text JSON. + +<li value="5"><p><b>FLOAT</b> → +The element is a JSON floating-point value in the canonical +RFC 8259 format, without extensions. The payload is the ASCII +text representation of that numeric value. + +<li value="6"><p><b>FLOAT5</b> → +The element is a JSON floating-point value that is not in the +canonical format. The payload is the ASCII +text representation of that numeric value. Because the payload is in a +non-standard format, it will need to be translated when the JSONB is +converted into RFC 8259 text JSON. + +<li value="7"><p><b>TEXT</b> → +The element is a JSON string value that does not contain +any escapes nor any characters that need to be escaped for either SQL or +JSON. The payload is the UTF8 text representation of the string value. +The payload does <i>not</i> include string delimiters. + +<li value="8"><p><b>TEXTJ</b> → +The element is a JSON string value that contains +RFC 8259 character escapes (such as "<tt>\n</tt>" or "<tt>\u0020</tt>"). +Those escapes will need to be translated into actual UTF8 if this element +is [json_extract|extracted] into SQL. +The payload is the UTF8 text representation of the escaped string value. +The payload does <i>not</i> include string delimiters. + +<li value="9"><p><b>TEXT5</b> → +The element is a JSON string value that contains +character escapes, including some character escapes that part of JSON5 +and which are not found in the canonical RFC 8259 spec. +Those escapes will need to be translated into standard JSON prior to +rendering the JSON as text, or into their actual UTF8 characters if this +element is [json_extract|extracted] into SQL. +The payload is the UTF8 text representation of the escaped string value. +The payload does <i>not</i> include string delimiters. + +<li value="10"><p><b>TEXTRAW</b> → +The element is a JSON string value that contains +UTF8 characters that need to be escaped if this string is rendered into +standard JSON text. +The payload does <i>not</i> include string delimiters. + +<li value="11"><p><b>ARRAY</b> → +The element is a JSON array. The payload contains +JSONB elements that comprise values contained within the array. + +<li value="12"><p><b>OBJECT</b> → +The element is a JSON object. The payload contains +pairs of JSONB elements that comprise entries for the JSON object. +The first element in each pair must be a string (types 7 through 10). +The second element of each pair may be any types, including nested +arrays or objects. + +<li value="13"><p><b>RESERVED-13</b> → +Reserved for future expansion. Legacy implements that encounter this +element type should raise an error. + +<li value="14"><p><b>RESERVED-14</b> → +Reserved for future expansion. Legacy implements that encounter this +element type should raise an error. + +<li value="15"><p><b>RESERVED-15</b> → +Reserved for future expansion. Legacy implements that encounter this +element type should raise an error. +</ol> + +Element types outside the range of 0 to 12 are reserved for future +expansion. The current implement raises an error if see an element type +other than those listed above. However, future versions of SQLite might +use of the three remaining element types to implement indexing or similar +optimizations, to speed up lookup against large JSON arrays and/or objects. + +### 3.3 Design Rationale For Element Types + +A key goal of JSONB is that it should be quick to translate +to and from text JSON and/or be constructed from SQL values. +When converting from text into JSONB, we do not want the +converter subroutine to burn CPU cycles converting elements +values into some standard format which might never be used. +Format conversion is "lazy" - it is deferred until actually +needed. This has implications for the JSONB format design: + + 1. Numeric values are stored as text, not a numbers. The values are + a direct copy of the text JSON values from which they are derived. + + 2. There are multiple element types depending on the details of value + formats. For example, INT is used for pure RFC-8259 integer + literals and INT5 exists for JSON5 extensions such as hexadecimal + notation. FLOAT is used for pure RFC-8259 floating point literals + and FLOAT5 is used for JSON5 extensions. There are four different + representations of strings, depending on where the string came from + and how special characters within the string are escaped. + +A second goal of JSONB is that it should be capable of serving as the +"parse tree" for JSON when a JSON value is being processed by the +various [JSON SQL functions] built into SQLite. Before JSONB was +developed, operations such [json_replace()] and [json_patch()] +and similar worked in three stages: + + + 1. Translate the text JSON into a internal format that is + easier to scan and edit. + 2. Perform the requested operation on the JSON. + 3. Translate the internal format back into text. + +JSONB seeks to serve as the internal format directly - bypassing +the first and third stages of that process. Since most of the CPU +cycles are spent on the first and third stages, that suggests that +JSONB processing will be much faster than text JSON processing. + +So when processing JSONB, only the second stage of the three-stage +process is required. But when processing text JSON, it is still necessary +to do stages one and three. If JSONB is to be used as the internal +binary representation, this is yet another reason to store numeric +values as text. Storing numbers as text minimizes the amount of +conversion work needed for stages one and three. This is also why +there are four different representations of text in JSONB. Different +text representations are used for text coming from different sources +(RFC-8259 JSON, JSON5, or SQL string values) and conversions only +happen if and when they are actually needed. + +### 3.4 Valid JSONB BLOBs + +A valid JSONB BLOB consists of a single JSON element. The element must +exactly fill the BLOB. This one element is often a JSON object or array +and those usually contain additional elements as its payload, but the +element can be a primite value such a string, number, boolean, or null. + +When the built-in JSON functions are attempting to determine if a BLOB +argument is a JSONB or just a random BLOB, they look at the header of +the outer element to see that it is well-formed and that the element +completely fills the BLOB. If these conditions are met, then the BLOB +is accepted as a JSONB value. |