Diffstat (limited to 'www/fts5.html')
-rw-r--r-- www/fts5.html 244
1 files changed, 214 insertions, 30 deletions
diff --git a/www/fts5.html b/www/fts5.html
index c80bc26..1793f2c 100644
--- a/www/fts5.html
+++ b/www/fts5.html
@@ -177,9 +177,8 @@ Table Of Contents
<div class="fancy-toc2"><a href="#custom_tokenizers">7.1. Custom Tokenizers</a></div>
<div class="fancy-toc3"><a href="#synonym_support">7.1.1. Synonym Support</a></div>
<div class="fancy-toc2"><a href="#custom_auxiliary_functions">7.2. Custom Auxiliary Functions</a></div>
-<div class="fancy-toc3"><a href="#_custom_auxiliary_functions_api_reference_">7.2.1.
-Custom Auxiliary Functions API Reference
-</a></div>
+<div class="fancy-toc3"><a href="#custom_auxiliary_functions_api_overview">7.2.1. Custom Auxiliary Functions API Overview</a></div>
+<div class="fancy-toc3"><a href="#custom_auxiliary_functions_api_reference">7.2.2. Custom Auxiliary Functions API Reference</a></div>
<div class="fancy-toc1"><a href="#the_fts5vocab_virtual_table_module">8. The fts5vocab Virtual Table Module</a></div>
<div class="fancy-toc1"><a href="#fts5_data_structures">9. FTS5 Data Structures</a></div>
<div class="fancy-toc2"><a href="#varint_format">9.1. Varint Format</a></div>
@@ -775,7 +774,7 @@ CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter' 'ascii');
<p>
-FTS5 features three built-in tokenizer modules, described in subsequent
+FTS5 features four built-in tokenizer modules, described in subsequent
sections:
</p><ul>
@@ -787,6 +786,9 @@ sections:
</li><li> The <b>porter</b> tokenizer, which implements the
<a href="https://tartarus.org/martin/PorterStemmer" /="1">porter stemming algorithm</a>.
+
+ </li><li> The <b>trigram</b> tokenizer, which treats each contiguous sequence of
+ three characters as a token, allowing FTS5 to support more general substring matching.
</li></ul>
<p> It is also possible to create custom tokenizers for FTS5. The API for doing so is <a href="fts5.html#custom_tokenizers">described here</a>.
@@ -987,6 +989,9 @@ Notes:
but still works. If the index is to be used only for LIKE and/or GLOB
pattern matching, these options are worth experimenting with to reduce
the index size.
+
+ </li><li> The index cannot be used to optimize LIKE patterns if the LIKE operator
+ has an ESCAPE clause.
</li></ul>
<a name="external_content_and_contentless_tables"></a>
@@ -1400,12 +1405,12 @@ on secondary or tertiary markings in the document or query terms.
<p> Auxiliary functions are similar to <a href="lang_corefunc.html">SQL scalar functions</a>,
except that they may only be used within full-text queries (those that use
-the MATCH operator) on an FTS5 table. Their results are calculated based not
-only on the arguments passed to them, but also on the current match and
-matched row. For example, an auxiliary function may return a numeric value
-indicating the accuracy of the match (see the <a href="fts5.html#the_bm25_function">bm25()</a> function),
-or a fragment of text from the matched row that contains one or more
-instances of the search terms (see the <a href="fts5.html#the_snippet_function">snippet()</a> function).
+the MATCH operator, or LIKE/GLOB with the trigram tokenizer) on an FTS5 table.
+Their results are calculated based not only on the arguments passed to them,
+but also on the current match and matched row. For example, an auxiliary
+function may return a numeric value indicating the accuracy of the match (see
+the <a href="fts5.html#the_bm25_function">bm25()</a> function), or a fragment of text from the matched row
+that contains one or more instances of the search terms (see the <a href="fts5.html#the_snippet_function">snippet()</a> function).
</p><p>To invoke an auxiliary function, the name of the FTS5 table should be
specified as the first argument. Other arguments may follow the first,
@@ -1437,7 +1442,10 @@ the following section. Applications may also implement
fragment of text from one of the columns of the matched row and returns
it with each instance of a queried term surrounded by markup in
the same manner as the highlight() function. The fragment of text is
- selected so as to maximize the number of queried terms it contains.
+ selected so as to maximize the number of distinct queried terms it
+ contains. Higher weight is given to snippets that occur at the start
+ of a column value, or that immediately follow "." or ":" characters
+ in the text.
</li></ul>
<a name="the_bm25_function"></a>
@@ -2349,18 +2357,19 @@ replaced.
returns an SQLite error code. In this case the xDestroy function is <b>not</b>
invoked.
-</p><p> The final three arguments passed to the auxiliary function callback are
-similar to the three arguments passed to the implementation of a scalar SQL
-function. All arguments except the first passed to the auxiliary function are
-available to the implementation in the apVal&#91;&#93; array. The
+</p><p> The final three arguments passed to the auxiliary function callback
+(pCtx, nVal and apVal above) are similar to the three arguments passed to the
+implementation of a scalar SQL function. The apVal&#91;&#93; array contains all
+SQL arguments except the first passed to the auxiliary function. The
implementation should return a result or error via the context handle pCtx.
</p><p> The first argument passed to an auxiliary function callback is a pointer
-to a structure containing methods that may be invoked in order to obtain
-information regarding the current query or row. The second argument is an
-opaque handle that should be passed as the first argument to any such method
-invocation. For example, the following auxiliary function definition returns
-the total number of tokens in all columns of the current row:
+to a structure (pApi above) containing methods that may be invoked
+in order to obtain information regarding the current query or row. The second
+argument is an opaque handle (pFts above) that should be passed as the
+first argument to any such method invocation. For example, the following
+auxiliary function returns the total number of tokens in all columns of the
+current row:
</p><div class="codeblock"><pre><i>/*
** Implementation of an auxiliary function that returns the number
@@ -2388,10 +2397,185 @@ static void column_size_imp(
implementations in detail. Further examples may be found in the "fts5_aux.c"
file of the source code.
-</p><a name="_custom_auxiliary_functions_api_reference_"></a>
-<h3 tags="custom auxiliary functions" id="_custom_auxiliary_functions_api_reference_"><span>7.2.1. </span>
- Custom Auxiliary Functions API Reference
-</h3>
+</p><a name="custom_auxiliary_functions_api_overview"></a>
+<h3 tags="custom auxiliary overview" id="custom_auxiliary_functions_api_overview"><span>7.2.1. </span>Custom Auxiliary Functions API Overview</h3>
+
+<p>This section provides an overview of the capabilities of the auxiliary
+function API. It does not describe every function. Refer to the <a href="#custom_auxiliary_functions_api_reference">reference text</a> below for a
+complete description.
+
+</p><p>When invoked, an auxiliary function implementation has access to APIs that
+allow it to query FTS5 for various information. Some of these APIs return
+information relating to the current row of the FTS5 table being visited,
+some relating to the entire set of rows that will be visited by the FTS5
+query, and some relating to the FTS5 table. Given an FTS5 table populated as
+follows:
+
+</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE ft USING fts5(a, b);
+INSERT INTO ft(rowid, a, b) VALUES
+ (1, 'ab cd', 'cd de one'),
+ (2, 'de fg', 'fg gh'),
+ (3, 'gh ij', 'ij ab three four');
+</pre></div>
+
+<p>and the query:
+
+</p><div class="codeblock"><pre>SELECT my_aux_function(ft) FROM ft('ab')
+</pre></div>
+
+<p>then the custom auxiliary function will be invoked for rows 1 and 3 (all
+rows that contain the token "ab" and therefore match the query).
+
+</p><p><b>Number of rows/columns in table: xRowCount, xColumnCount
+
+</b></p><p>The system may be queried for the total number of rows in the FTS5 table
+using the <a href="#xRowCount">xRowCount</a> API. This provides the total number
+of rows in the table, not the number that match the current query.
+
+</p><p>Table columns are numbered from left to right starting from 0. The
+"rowid" column does not count - only user declared columns - so in the example
+above column "a" is column 0 and column "b" is column 1. From within an
+auxiliary function implementation, the <a href="#xColumnCount">xColumnCount</a>
+API may be used to determine how many columns the table being queried has. If
+the xColumnCount() API is invoked from within the implementation of the
+auxiliary function my_aux_function in the example above, it returns 2.
+
+</p><p><b>Data From Current Row: xColumnText, xRowid
+
+</b></p><p>The <a href="#xRowid">xRowid</a> API may be used to find the rowid value
+for the current row. The <a href="#xColumnText">xColumnText</a> API may be used
+to obtain the text stored in a specified column of the current row.
+
+</p><p><b>Token Counts: xColumnSize, xColumnTotalSize
+
+</b></p><p>FTS5 divides documents inserted into an FTS5 table into tokens. These are
+usually just words, perhaps folded to either upper or lower case and with any
+punctuation removed. For example, the default
+<a href="#unicode61_tokenizer">unicode61 tokenizer</a> tokenizes the text "The
+tokenizer is case-insensitive" to a list of 5 tokens - "the", "tokenizer", "is",
+"case" and "insensitive". Exactly how tokens are extracted from text is
+determined by the <a href="#tokenizers">tokenizer</a>.
+
+</p><p>The auxiliary functions API provides functions to query for either the
+number of tokens in a specified column of the current row (the
+<a href="#xColumnSize">xColumnSize</a> API), or the number of tokens in a
+specified column of all rows of the table (the <a href="#xColumnTotalSize">xColumnTotalSize</a> API). For the example at the
+top of this section, when visiting row 1, xColumnSize returns 2 for column 0
+and 3 for column 1. xColumnTotalSize returns 6 for column 0 and 9 for column 1
+regardless of the current row.
+
+</p><p><b>The Current Full-Text Query: xPhraseCount, xPhraseSize, xQueryToken
+
+</b></p><p>An FTS5 query contains one or more <a href="#fts5_phrases">phrases</a>. The
+<a href="#xPhraseCount">xPhraseCount</a>, <a href="#xPhraseSize">xPhraseSize</a>
+and <a href="#xQueryToken">xQueryToken</a> APIs allow an auxiliary function
+implementation to query the system for details of the current query. The
+xPhraseCount API returns the number of phrases in the current query. For
+example, if an FTS5 table is queried as follows:
+
+</p><div class="codeblock"><pre>SELECT my_aux_function(ft) FROM ft('ab AND "cd ef gh" OR ij + kl')
+</pre></div>
+
+<p>and the xPhraseCount() API is invoked from within the implementation of the
+auxiliary function, it returns 3 (the three phrases being "ab", "cd ef gh" and
+"ij kl").
+
+</p><p>Phrases are numbered in order of appearance within a query starting from 0.
+The xPhraseSize() API may be used to query for the number of tokens in a
+specified phrase of the query. In the example above, phrase 0 contains 1 token,
+phrase 1 contains 3 tokens, and phrase 2 contains 2.
+
+</p><p>The xQueryToken API may be used to access the text of a specified token
+within a specified phrase of the query. Tokens are numbered within their
+phrases from left to right starting from 0. For example, if the xQueryToken
+API is used to request token 1 of phrase 2 in the example above, it returns
+the text "kl". Token 0 of phrase 0 is "ab".
+
+</p><p><b>Phrase Hits in the Current Row: xPhraseFirst, xPhraseNext
+
+</b></p><p>These two API functions may be used to iterate through the matches for
+a specified phrase of the query within the current row. Phrase matches are
+identified by the column and token offset within the current row. For
+example, suppose the following table:
+
+</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE ft2 USING fts5(x, y);
+INSERT INTO ft2(rowid, x, y) VALUES
+ (1, 'xxx one two xxx five xxx six', 'seven four'),
+ (2, 'five four four xxx six', 'three four five six four five six');
+</pre></div>
+
+<p>is queried with:
+
+</p><div class="codeblock"><pre>SELECT my_aux_function(ft2) FROM ft2(
+ '("one two" OR "three") AND y:four NEAR(five six, 2)'
+);
+</pre></div>
+
+<p>The query above contains 5 phrases - "one two", "three", "four",
+"five" and "six". It matches all rows of the table, so the auxiliary
+function is invoked for each row.
+
+</p><p>In row 1, for phrase 0, "one two", there is exactly one match to iterate
+through - at column 0 token offset 1. The column number is 0 because the
+match appears in the left most column. The token offset is 1 because there
+is exactly one token ("xxx") before the phrase match in the column value.
+For phrase 1, "three", there are no matches. Phrase 2, "four", has one
+match, at column 1, token offset 1. Phrase 3, "five", has one match at
+column 0, token offset 4, and phrase 4, "six", has one match at column 0
+token offset 6.
+
+</p><p>The set of matches for each phrase in each row of the example is presented
+in the table below. Each match is notated as (column-number, token-offset):
+
+</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
+ <tr style="text-align:left"><th>Row</th><th>Phrase 0</th><th>Phrase 1</th><th>Phrase 2</th><th>Phrase 3</th><th>Phrase 4
+ </th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>1</td><td>(0, 1) </td><td></td><td>(1, 1)</td><td>(0, 4)</td><td>(0, 6)
+ </td></tr><tr style="text-align:left"><td>2</td><td></td><td>(1, 0)</td><td>(1, 1), (1, 4)</td><td>(1, 2), (1, 5)</td><td>(1, 3), (1, 6)
+</td></tr></table>
+
+<p>The second row is slightly more complicated. There were no occurrences of
+phrase 0. Phrase 1 ("three") appears once, at column 1 token offset 0. Although
+there are instances of phrase 2 ("four") in column 0, none of them are reported
+by the API, as phrase 2 has a <a href="#fts5_column_filters">column filter</a> -
+"y:". Matches that are filtered out by column filters do not count. Similarly,
+although phrases 3 and 4 do occur in column "x" of row 2, they are filtered
+out by the <a href="#fts5_near_queries">NEAR filter</a>. Matches that are
+filtered out by NEAR filters do not count either.
+
+</p><p><b>Phrase Hits in the Current Row (2): xInstCount, xInst
+
+</b></p><p>The <a href="#xInstCount">xInstCount</a> and <a href="#xInst">xInst</a> APIs
+provide access to the same information as the xPhraseFirst and xPhraseNext
+APIs described above. The difference is that instead of iterating through the
+matches for a single, specified phrase, the xInstCount/xInst APIs collate
+all matches into a single flat array, sorted in order of occurrence within
+the current row. Elements of this array may then be accessed randomly.
+
+</p><p>Each array element consists of three values:
+
+</p><ul>
+ <li> A phrase number,
+ </li><li> A column number, and
+ </li><li> A token offset
+</li></ul>
+
+<p>Using the same example data and query as for xPhraseFirst/xPhraseNext
+above, the array accessible via xInstCount/xInst consists of the following
+entries for each row:
+
+</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
+ <tr style="text-align:left"><th>Row</th><th>xInstCount/xInst array
+ </th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>1</td><td>(0, 0, 1), (3, 0, 4), (4, 0, 6), (2, 1, 1)
+ </td></tr><tr style="text-align:left"><td>2</td><td>(1, 1, 0), (2, 1, 1), (3, 1, 2), (4, 1, 3), (2, 1, 4), (3, 1, 5), (4, 1, 6)
+</td></tr></table>
+
+<p>Each entry of the array is called a phrase match. Phrase matches are
+numbered in order, starting from 0. So, in the example above, in row 2, phrase
+match 3 is (4, 1, 3) - phrase 4 of the query matches at column 1, token offset
+3.
+
+</p><a name="custom_auxiliary_functions_api_reference"></a>
+<h3 tags="custom auxiliary functions" id="custom_auxiliary_functions_api_reference"><span>7.2.2. </span>Custom Auxiliary Functions API Reference</h3>
<div class="codeblock"><pre>struct Fts5ExtensionApi {
int iVersion; <i>/* Currently always set to 3 */</i>
@@ -2443,8 +2627,8 @@ file of the source code.
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xUserData">
<b>void *(*xUserData)(Fts5Context*)</b></dt><dd>
<p style="margin-top:0.1em">
-Return a copy of the context pointer the extension function was
- registered with.
+Return a copy of the pUserData pointer passed to the xCreateFunction()
+ API when the extension function was registered.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xColumnTotalSize">
@@ -3090,9 +3274,9 @@ the subset of matches with rowids in the required range.
</p><ul>
<li> The special <a href="#structure_record_format">structure record</a>,
- stored with id=1.
- </li><li> The special <a href="#averages_record_format">averages record</a>,
stored with id=10.
+ </li><li> The special <a href="#averages_record_format">averages record</a>,
+ stored with id=1.
</li><li> A record to store each <a href="#segment_b_tree_format">segment b-tree</a>
leaf and <a href="#doclist_index_format">doclist index</a> leaf and
internal node. See below for how id values are calculated for these
@@ -3612,7 +3796,7 @@ CREATE VIRTUAL TABLE t1 USING fts5(
<p> FTS5 has no matchinfo() or offsets() function, and the snippet() function
is not as fully-featured as in FTS3/4. However, since FTS5 does provide
-an API allowing applications to create <a href="fts5.html#_custom_auxiliary_functions_api_reference_">custom auxiliary functions</a>, any
+an API allowing applications to create <a href="fts5.html#custom_auxiliary_functions_api_reference">custom auxiliary functions</a>, any
required functionality may be implemented within the application code.
</p><p> The set of built-in auxiliary functions provided by FTS5 may be
@@ -3714,5 +3898,5 @@ using FTS5.
the amount of processing that may take place within any given
INSERT, UPDATE or DELETE operation.
</p></li></ul>
-<p align="center"><small><i>This page last modified on <a href="https://sqlite.org/docsrc/honeypot" id="mtimelink" data-href="https://sqlite.org/docsrc/finfo/pages/fts5.in?m=386eeb4c00">2024-03-27 20:33:18</a> UTC </small></i></p>
+<p align="center"><small><i>This page last modified on <a href="https://sqlite.org/docsrc/honeypot" id="mtimelink" data-href="https://sqlite.org/docsrc/finfo/pages/fts5.in?m=39d9a3b727">2024-05-22 18:42:01</a> UTC </small></i></p>
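
Note on the xInstCount/xInst overview added by this diff: the flat array it describes can be derived from the per-phrase match lists by sorting on (column, token offset). The following is a minimal Python sketch of that relationship only, not FTS5's implementation. The function name `flatten_phrase_hits` and the `ROW1`/`ROW2` constants are hypothetical; the data is copied from the per-phrase table in the diff. Breaking ties by phrase number is an assumption, since the example contains no two matches at the same position.

```python
def flatten_phrase_hits(per_phrase):
    """Collate per-phrase hits into one flat, position-ordered list.

    per_phrase[i] holds the (column, token_offset) hits for phrase i,
    as an xPhraseFirst/xPhraseNext loop would visit them. The result is
    a list of (phrase, column, token_offset) triples, mirroring the
    three values the documentation lists for each xInst array element.
    """
    inst = [
        (phrase, col, off)
        for phrase, hits in enumerate(per_phrase)
        for col, off in hits
    ]
    # Order of occurrence within the row: column first, then token offset.
    # Tie-break on phrase number is an assumption made for this sketch.
    inst.sort(key=lambda m: (m[1], m[2], m[0]))
    return inst


# Per-phrase hits for rows 1 and 2 of the ft2 example (phrases 0..4).
ROW1 = [[(0, 1)], [], [(1, 1)], [(0, 4)], [(0, 6)]]
ROW2 = [[], [(1, 0)], [(1, 1), (1, 4)], [(1, 2), (1, 5)], [(1, 3), (1, 6)]]

if __name__ == "__main__":
    for rowid, hits in ((1, ROW1), (2, ROW2)):
        print(rowid, flatten_phrase_hits(hits))
```

Indexing the result for row 2 at position 3 gives (4, 1, 3), the same "phrase match 3" that the added text singles out.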