Diffstat (limited to 'www/fts5.html')
-rw-r--r-- | www/fts5.html | 244 |
1 files changed, 214 insertions, 30 deletions
diff --git a/www/fts5.html b/www/fts5.html index c80bc26..1793f2c 100644 --- a/www/fts5.html +++ b/www/fts5.html @@ -177,9 +177,8 @@ Table Of Contents <div class="fancy-toc2"><a href="#custom_tokenizers">7.1. Custom Tokenizers</a></div> <div class="fancy-toc3"><a href="#synonym_support">7.1.1. Synonym Support</a></div> <div class="fancy-toc2"><a href="#custom_auxiliary_functions">7.2. Custom Auxiliary Functions</a></div> -<div class="fancy-toc3"><a href="#_custom_auxiliary_functions_api_reference_">7.2.1. -Custom Auxiliary Functions API Reference -</a></div> +<div class="fancy-toc3"><a href="#custom_auxiliary_functions_api_overview">7.2.1. Custom Auxiliary Functions API Overview</a></div> +<div class="fancy-toc3"><a href="#custom_auxiliary_functions_api_reference">7.2.2. Custom Auxiliary Functions API Reference</a></div> <div class="fancy-toc1"><a href="#the_fts5vocab_virtual_table_module">8. The fts5vocab Virtual Table Module</a></div> <div class="fancy-toc1"><a href="#fts5_data_structures">9. FTS5 Data Structures</a></div> <div class="fancy-toc2"><a href="#varint_format">9.1. Varint Format</a></div> @@ -775,7 +774,7 @@ CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter' 'ascii'); <p> -FTS5 features three built-in tokenizer modules, described in subsequent +FTS5 features four built-in tokenizer modules, described in subsequent sections: </p><ul> @@ -787,6 +786,9 @@ sections: </li><li> The <b>porter</b> tokenizer, which implements the <a href="https://tartarus.org/martin/PorterStemmer" /="1">porter stemming algorithm</a>. + + </li><li> The <b>trigram</b> tokenizer, which treats each contiguous sequence of + three characters as a token, allowing FTS5 to support more general substring matching. </li></ul> <p> It is also possible to create custom tokenizers for FTS5. The API for doing so is <a href="fts5.html#custom_tokenizers">described here</a>. @@ -987,6 +989,9 @@ Notes: but still works. 
If the index is to be used only for LIKE and/or GLOB pattern matching, these options are worth experimenting with to reduce the index size. + + </li><li> The index cannot be used to optimize LIKE patterns if the LIKE operator + has an ESCAPE clause. </li></ul> <a name="external_content_and_contentless_tables"></a> @@ -1400,12 +1405,12 @@ on secondary or tertiary markings in the document or query terms. <p> Auxiliary functions are similar to <a href="lang_corefunc.html">SQL scalar functions</a>, except that they may only be used within full-text queries (those that use -the MATCH operator) on an FTS5 table. Their results are calculated based not -only on the arguments passed to them, but also on the current match and -matched row. For example, an auxiliary function may return a numeric value -indicating the accuracy of the match (see the <a href="fts5.html#the_bm25_function">bm25()</a> function), -or a fragment of text from the matched row that contains one or more -instances of the search terms (see the <a href="fts5.html#the_snippet_function">snippet()</a> function). +the MATCH operator, or LIKE/GLOB with the trigram tokenizer) on an FTS5 table. +Their results are calculated based not only on the arguments passed to them, +but also on the current match and matched row. For example, an auxiliary +function may return a numeric value indicating the accuracy of the match (see +the <a href="fts5.html#the_bm25_function">bm25()</a> function), or a fragment of text from the matched row +that contains one or more instances of the search terms (see the <a href="fts5.html#the_snippet_function">snippet()</a> function). </p><p>To invoke an auxiliary function, the name of the FTS5 table should be specified as the first argument. Other arguments may follow the first, @@ -1437,7 +1442,10 @@ the following section. 
Applications may also implement fragment of text from one of the columns of the matched row and returns it with each instance of a queried term surrounded by markup in the same manner as the highlight() function. The fragment of text is - selected so as to maximize the number of queried terms it contains. + selected so as to maximize the number of distinct queried terms it + contains. Higher weight is given to snippets that occur at the start + of a column value, or that immediately follow "." or ":" characters + in the text. </li></ul> <a name="the_bm25_function"></a> @@ -2349,18 +2357,19 @@ replaced. returns an SQLite error code. In this case the xDestroy function is <b>not</b> invoked. -</p><p> The final three arguments passed to the auxiliary function callback are -similar to the three arguments passed to the implementation of a scalar SQL -function. All arguments except the first passed to the auxiliary function are -available to the implementation in the apVal[] array. The +</p><p> The final three arguments passed to the auxiliary function callback +(pCtx, nVal and apVal above) are similar to the three arguments passed to the +implementation of a scalar SQL function. The apVal[] array contains all +SQL arguments except the first passed to the auxiliary function. The implementation should return a result or error via the content handle pCtx. </p><p> The first argument passed to an auxiliary function callback is a pointer -to a structure containing methods that may be invoked in order to obtain -information regarding the current query or row. The second argument is an -opaque handle that should be passed as the first argument to any such method -invocation. For example, the following auxiliary function definition returns -the total number of tokens in all columns of the current row: +to a structure (pApi above) containing methods that may be invoked +in order to obtain information regarding the current query or row. 
The second +argument is an opaque handle (pFts above) that should be passed as the +first argument to any such method invocation. For example, the following +auxiliary function returns the total number of tokens in all columns of the +current row: </p><div class="codeblock"><pre><i>/* ** Implementation of an auxiliary function that returns the number @@ -2388,10 +2397,185 @@ static void column_size_imp( implementations in detail. Further examples may be found in the "fts5_aux.c" file of the source code. -</p><a name="_custom_auxiliary_functions_api_reference_"></a> -<h3 tags="custom auxiliary functions" id="_custom_auxiliary_functions_api_reference_"><span>7.2.1. </span> - Custom Auxiliary Functions API Reference -</h3> +</p><a name="custom_auxiliary_functions_api_overview"></a> +<h3 tags="custom auxiliary overview" id="custom_auxiliary_functions_api_overview"><span>7.2.1. </span>Custom Auxiliary Functions API Overview</h3> + +<p>This section provides an overview of the capabilities of the auxiliary +function API. It does not describe every function. Refer to the <a href="#custom_auxiliary_functions_api_reference">reference text</a> below for a +complete description. + +</p><p>When invoked, an auxiliary function implementation has access to APIs that +allow it to query FTS5 for various information. Some of these APIs return +information relating to the current row of the FTS5 table being visited, +some relating to the entire set of rows that will be visited by the FTS5 +query, and some relating to the FTS5 table. 
Given an FTS5 table populated as +follows: + +</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE ft USING fts5(a, b); +INSERT INTO ft(rowid, a, b) VALUES + (1, 'ab cd', 'cd de one'), + (2, 'de fg', 'fg gh'), + (3, 'gh ij', 'ij ab three four'); +</pre></div> + +<p>and the query: + +</p><div class="codeblock"><pre>SELECT my_aux_function(ft) FROM ft('ab') +</pre></div> + +<p>then the custom auxiliary function will be invoked for rows 1 and 3 (all +rows that contain the token "ab" and therefore match the query). + +</p><p><b>Number of rows/columns in table: xRowCount, xColumnCount + +</b></p><p>The system may be queried for the total number of rows in the FTS5 table +using the <a href="#xRowCount">xRowCount</a> API. This provides the total number +of rows in the table, not the number that match the current query. + +</p><p>Table columns are numbered from left to right starting from 0. The +"rowid" column does not count - only user declared columns - so in the example +above column "a" is column 0 and column "b" is column 1. From within an +auxiliary function implementation, the <a href="#xColumnCount">xColumnCount</a> +API may be used to determine how many columns the table being queried has. If +the xColumnCount() API is invoked from within the implementation of the +auxiliary function my_aux_function in the example above, it returns 2. + +</p><p><b>Data From Current Row: xColumnText, xRowid + +</b></p><p>The <a href="#xRowid">xRowid</a> API may be used to find the rowid value +for the current row. The <a href="#xColumnText">xColumnText</a> API may be used +to obtain the text stored in a specified column of the current row. + +</p><p><b>Token Counts: xColumnSize, xColumnTotalSize + +</b></p><p>FTS5 divides documents inserted into an FTS5 table into tokens. These are +usually just words, perhaps folded to either upper or lower case and with any +punctuation removed. 
For example, the default +<a href="#unicode61_tokenizer">unicode61 tokenizer</a> tokenizes the text "The +tokenizer is case-insensitive" to a list of 5 tokens - "the", "tokenizer", "is", +"case" and "insensitive". Exactly how tokens are extracted from text is +determined by the <a href="#tokenizers">tokenizer</a>. + +</p><p>The auxiliary functions API provides functions to query for both the number +of tokens in a specified column of the current row (the +<a href="#xColumnSize">xColumnSize</a> API), or for the number of tokens in a +specified column of all rows of the table (the <a href="#xColumnTotalSize">xColumnTotalSize</a> API). For the example at the +top of this section, when visiting row 1, xColumnSize returns 2 for column 0 +and 3 for column 1. xColumnTotalSize returns 6 for column 0 and 9 for column 1 +regardless of the current row. + +</p><p><b>The Current Full-Text Query: xPhraseCount, xPhraseSize, xQueryToken + +</b></p><p>An FTS5 query contains one or more <a href="#fts5_phrases">phrases</a>. The +<a href="#xPhraseCount">xPhraseCount</a>, <a href="#xPhraseSize">xPhraseSize</a> +and <a href="#xQueryToken">xQueryToken</a> APIs allow an auxiliary function +implementation to query the system for details of the current query. The +xPhraseCount API returns the number of phrases in the current query. For +example, if an FTS5 table is queried as follows: + +</p><div class="codeblock"><pre>SELECT my_aux_function(ft) FROM ft('ab AND "cd ef gh" OR ij + kl') +</pre></div> + +<p>and the xPhraseCount() API is invoked from within the implementation of the +auxiliary function, it returns 3 (the three phrases being "ab", "cd ef gh" and +"ij kl"). + +</p><p>Phrases are numbered in order of appearance within a query starting from 0. +The xPhraseSize() API may be used to query for the number of tokens in a +specified phrase of the query. In the example above, phrase 0 contains 1 token, +phrase 1 contains 3 tokens, and phrase 2 contains 2. 
+ +</p><p>The xQueryToken API may be used to access the text of a specified token +within a specified phrase of the query. Tokens are numbered within their +phrases from left to right starting from 0. For example, if the xQueryToken +API is used to request token 1 of phrase 2 in the example above, it returns +the text "kl". Token 0 of phrase 0 is "ab". + +</p><p><b>Phrase Hits in the Current Row: xPhraseFirst, xPhraseNext + +</b></p><p>These two API functions may be used to iterate through the matches for +a specified phrase of the query within the current row. Phrase matches are +identified by the column and token offset within the current row. For +example, suppose the following example table: + +</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE ft2 USING fts5(x, y); +INSERT INTO ft2(rowid, x, y) VALUES + (1, 'xxx one two xxx five xxx six', 'seven four'), + (2, 'five four four xxx six', 'three four five six four five six'); +</pre></div> + +<p>is queried with: + +</p><div class="codeblock"><pre>SELECT my_aux_function(ft2) FROM ft2( + '("one two" OR "three") AND y:four NEAR(five six, 2)' +); +</pre></div> + +<p>The query above contains 5 phrases - "one two", "three", "four", +"five" and "six". It matches all rows of the table, so the auxiliary +function is invoked for each row. + +</p><p>In row 1, for phrase 0, "one two", there is exactly one match to iterate +through - at column 0 token offset 1. The column number is 0 because the +match appears in the leftmost column. The token offset is 1 because there +is exactly one token ("xxx") before the phrase match in the column value. +For phrase 1, "three", there are no matches. Phrase 2, "four", has one +match, at column 1, token offset 1. Phrase 3, "five", has one match at +column 0, token offset 4, and phrase 4, "six", has one match at column 0 +token offset 6. + +</p><p>The set of matches for each phrase in each row of the example is presented +in the table below. 
Each match is notated as (column-number, token-offset): + +</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0"> + <tr style="text-align:left"><th>Row</th><th>Phrase 0</th><th>Phrase 1</th><th>Phrase 2</th><th>Phrase 3</th><th>Phrase 4 + </th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>1</td><td>(0, 1) </td><td></td><td>(1, 1)</td><td>(0, 4)</td><td>(0, 6) + </td></tr><tr style="text-align:left"><td>2</td><td></td><td>(1,0)</td><td>(1, 1), (1,4)</td><td>(1, 2), (1, 5)</td><td>(1, 3), (1, 6) +</td></tr></table> + +<p>The second row is slightly more complicated. There are no occurrences of +phrase 0. Phrase 1 ("three") appears once, at column 1 token offset 0. Although +there are instances of phrase 2 ("four") in column 0, none of them are reported +by the API, as phrase 2 has a <a href="#fts5_column_filters">column filter</a> - +"y:". Matches that are filtered out by column filters do not count. Similarly, +although phrases 3 and 4 do occur in column "x" of row 2, they are filtered +out by the <a href="#fts5_near_queries">NEAR filter</a>. Matches that are +filtered out by NEAR filters do not count either. + +</p><p><b>Phrase Hits in the Current Row (2): xInstCount, xInst + +</b></p><p>The <a href="#xInstCount">xInstCount</a> and <a href="#xInst">xInst</a> APIs +provide access to the same information as the xPhraseFirst and xPhraseNext +APIs described above. The difference is that instead of iterating through the +matches for a single, specified phrase, the xInstCount/xInst APIs collate +all matches into a single flat array, sorted in order of occurrence within +the current row. Elements of this array may then be accessed randomly. 
+ +</p><p>Each array element consists of three values: + +</p><ul> + <li> A phrase number, + </li><li> A column number, and + </li><li> A token offset +</li></ul> + +<p>Using the same example data and query as for xPhraseFirst/xPhraseNext +above, the array accessible via xInstCount/xInst consists of the following +entries for each row: + +</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0"> + <tr style="text-align:left"><th>Row</th><th>xInstCount/xInst array + </th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>1</td><td>(0, 0, 1), (3, 0, 4), (4, 0, 6), (2, 1, 1) + </td></tr><tr style="text-align:left"><td>2</td><td>(1, 1, 0), (2, 1, 1), (3, 1, 2), (4, 1, 3), (2, 1, 4), (3, 1, 5), (4, 1, 6) +</td></tr></table> + +<p>Each entry of the array is called a phrase match. Phrase matches are +numbered in order, starting from 0. So, in the example above, in row 2, phrase +match 3 is (4, 1, 3) - phrase 4 of the query matches at column 1, token offset +3. + +</p><a name="custom_auxiliary_functions_api_reference"></a> +<h3 tags="custom auxiliary functions" id="custom_auxiliary_functions_api_reference"><span>7.2.2. </span>Custom Auxiliary Functions API Reference</h3> <div class="codeblock"><pre>struct Fts5ExtensionApi { int iVersion; <i>/* Currently always set to 3 */</i> @@ -2443,8 +2627,8 @@ file of the source code. <dt style="white-space:pre;font-family:monospace;font-size:120%" id="xUserData"> <b>void *(*xUserData)(Fts5Context*)</b></dt><dd> <p style="margin-top:0.1em"> -Return a copy of the context pointer the extension function was - registered with. +Return a copy of the pUserData pointer passed to the xCreateFunction() + API when the extension function was registered. </p> </dd> <dt style="white-space:pre;font-family:monospace;font-size:120%" id="xColumnTotalSize"> @@ -3090,9 +3274,9 @@ the subset of matches with rowids in the required range. 
</p><ul> <li> The special <a href="#structure_record_format">structure record</a>, - stored with id=1. - </li><li> The special <a href="#averages_record_format">averages record</a>, stored with id=10. + </li><li> The special <a href="#averages_record_format">averages record</a>, + stored with id=1. </li><li> A record to store each <a href="#segment_b_tree_format">segment b-tree</a> leaf and <a href="#doclist_index_format">doclist index</a> leaf and internal node. See below for how id values are calculated for these @@ -3612,7 +3796,7 @@ CREATE VIRTUAL TABLE t1 USING fts5( <p> FTS5 has no matchinfo() or offsets() function, and the snippet() function is not as fully-featured as in FTS3/4. However, since FTS5 does provide -an API allowing applications to create <a href="fts5.html#_custom_auxiliary_functions_api_reference_">custom auxiliary functions</a>, any +an API allowing applications to create <a href="fts5.html#custom_auxiliary_functions_api_reference">custom auxiliary functions</a>, any required functionality may be implemented within the application code. </p><p> The set of built-in auxiliary functions provided by FTS5 may be @@ -3714,5 +3898,5 @@ using FTS5. the amount of processing that may take place within any given INSERT, UPDATE or DELETE operation. </p></li></ul> -<p align="center"><small><i>This page last modified on <a href="https://sqlite.org/docsrc/honeypot" id="mtimelink" data-href="https://sqlite.org/docsrc/finfo/pages/fts5.in?m=386eeb4c00">2024-03-27 20:33:18</a> UTC </small></i></p> +<p align="center"><small><i>This page last modified on <a href="https://sqlite.org/docsrc/honeypot" id="mtimelink" data-href="https://sqlite.org/docsrc/finfo/pages/fts5.in?m=39d9a3b727">2024-05-22 18:42:01</a> UTC </small></i></p> |
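The overview added in this change describes how xInstCount/xInst collate the per-phrase hits reported by xPhraseFirst/xPhraseNext into one flat array ordered by position in the row. The relationship between the two tables can be sketched as a small model. This is an illustration only: it does not call SQLite, the names `collate_instances`, `ROW_1` and `ROW_2` are hypothetical, the hit data is copied by hand from the (column-number, token-offset) table in the diff, and breaking ties by phrase number is an assumption rather than documented FTS5 behaviour.

```python
# Illustrative model only: this does not call SQLite or the FTS5 C API.
# The per-phrase hits are copied from the overview's example table.
from typing import Dict, List, Tuple

Hit = Tuple[int, int]        # (column number, token offset)
Inst = Tuple[int, int, int]  # (phrase number, column number, token offset)


def collate_instances(phrase_hits: Dict[int, List[Hit]]) -> List[Inst]:
    """Flatten per-phrase hits into one list ordered by position in the row."""
    flat = [
        (phrase, col, off)
        for phrase, hits in phrase_hits.items()
        for col, off in hits
    ]
    # Order of occurrence within the row: by column, then by token offset.
    # Assumption: ties (same column and offset) are broken by phrase number.
    return sorted(flat, key=lambda inst: (inst[1], inst[2], inst[0]))


# Hits for the query
#   ("one two" OR "three") AND y:four NEAR(five six, 2)
# against the two rows of the example table ft2, keyed by phrase number.
ROW_1 = {0: [(0, 1)], 1: [], 2: [(1, 1)], 3: [(0, 4)], 4: [(0, 6)]}
ROW_2 = {
    0: [],
    1: [(1, 0)],
    2: [(1, 1), (1, 4)],
    3: [(1, 2), (1, 5)],
    4: [(1, 3), (1, 6)],
}

if __name__ == "__main__":
    for rowid, hits in ((1, ROW_1), (2, ROW_2)):
        print(rowid, collate_instances(hits))
```

In this model, indexing the list returned for row 2 at position 3 yields (4, 1, 3), which agrees with the "phrase match 3" example in the overview text.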