Adding upstream version 3.40.1. (upstream/3.40.1, upstream)

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-05 17:28:19 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-05 17:28:19 +0000
commit: 18657a960e125336f704ea058e25c27bd3900dcb (patch)
tree: 17b438b680ed45a996d7b59951e6aa34023783f2 /ext/fts2/README.tokenizers
parent: Initial commit. (diff)
1 files changed, 133 insertions, 0 deletions
diff --git a/ext/fts2/README.tokenizers b/ext/fts2/README.tokenizers
new file mode 100644
index 0000000..98d2021
--- /dev/null
+++ b/ext/fts2/README.tokenizers
@@ -0,0 +1,133 @@
+
+1. FTS2 Tokenizers
+
+  When creating a new full-text table, FTS2 allows the user to select
+  the text tokenizer implementation to be used when indexing text
+  by specifying a "tokenizer" clause as part of the CREATE VIRTUAL TABLE
+  statement:
+
+    CREATE VIRTUAL TABLE <table-name> USING fts2(
+      <columns ...> [, tokenizer <tokenizer-name> [<tokenizer-args>]]
+    );
+
+  The built-in tokenizers (valid values to pass as <tokenizer-name>) are
+  "simple" and "porter".
+
+  <tokenizer-args> should consist of zero or more white-space separated
+  arguments to pass to the selected tokenizer implementation. The 
+  interpretation of the arguments, if any, depends on the individual 
+  tokenizer.
+
+2. Custom Tokenizers
+
+  FTS2 allows users to provide custom tokenizer implementations. The 
+  interface used to create a new tokenizer is defined and described in 
+  the fts2_tokenizer.h source file.
+
+  Registering a new FTS2 tokenizer is similar to registering a new 
+  virtual table module with SQLite. The user passes a pointer to a
+  structure containing pointers to various callback functions that
+  make up the implementation of the new tokenizer type. For tokenizers,
+  the structure (defined in fts2_tokenizer.h) is called
+  "sqlite3_tokenizer_module".
+
+  FTS2 does not expose a C-function that users call to register new
+  tokenizer types with a database handle. Instead, the pointer must
+  be encoded as an SQL blob value and passed to FTS2 through the SQL
+  engine by evaluating a special scalar function, "fts2_tokenizer()".
+  The fts2_tokenizer() function may be called with one or two arguments,
+  as follows:
+
+    SELECT fts2_tokenizer(<tokenizer-name>);
+    SELECT fts2_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);
+  
+  Where <tokenizer-name> is a string identifying the tokenizer and
+  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
+  structure encoded as an SQL blob. If the second argument is present,
+  it is registered as tokenizer <tokenizer-name> and a copy of it is
+  returned. If only one argument is passed, a pointer to the tokenizer
+  implementation currently registered as <tokenizer-name> is returned,
+  encoded as a blob. If no such tokenizer exists, an SQL exception
+  (error) is raised.
+
+  SECURITY: If the fts2 extension is used in an environment where potentially
+    malicious users may execute arbitrary SQL (e.g. Gears), they should be
+    prevented from invoking the fts2_tokenizer() function, possibly using the
+    authorizer callback (see sqlite3_set_authorizer()).
+
+  See "Sample code" below for an example of calling the fts2_tokenizer()
+  function from C code.
+
+3. ICU Library Tokenizers
+
+  If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor 
+  symbol defined, then there exists a built-in tokenizer named "icu" 
+  implemented using the ICU library. The first argument passed to the
+  xCreate() method (see fts2_tokenizer.h) of this tokenizer may be
+  an ICU locale identifier, such as "tr_TR" for Turkish as used in
+  Turkey or "en_AU" for English as used in Australia. For example:
+
+    "CREATE VIRTUAL TABLE thai_text USING fts2(text, tokenizer icu th_TH)"
+
+  The ICU tokenizer implementation is very simple. It splits the input
+  text according to the ICU rules for finding word boundaries and discards
+  any tokens that consist entirely of white-space. This may be suitable
+  for some applications in some locales, but not all. If more complex
+  processing is required, for example to implement stemming or 
+  discard punctuation, this can be done by creating a tokenizer 
+  implementation that uses the ICU tokenizer as part of its implementation.
+
+  When using the ICU tokenizer this way, it is safe to overwrite the
+  contents of the strings returned by the xNext() method (see
+  fts2_tokenizer.h).
+
+4. Sample Code
+
+  The following two functions show how C code can invoke fts2_tokenizer()
+  to register a tokenizer module and to look one up by name:
+
+      int registerTokenizer(
+        sqlite3 *db, 
+        const char *zName,
+        const sqlite3_tokenizer_module *p
+      ){
+        int rc;
+        sqlite3_stmt *pStmt;
+        const char zSql[] = "SELECT fts2_tokenizer(?, ?)";
+      
+        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
+        if( rc!=SQLITE_OK ){
+          return rc;
+        }
+      
+        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
+        sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
+        sqlite3_step(pStmt);
+      
+        return sqlite3_finalize(pStmt);
+      }
+      
+      int queryTokenizer(
+        sqlite3 *db, 
+        const char *zName,
+        const sqlite3_tokenizer_module **pp
+      ){
+        int rc;
+        sqlite3_stmt *pStmt;
+        const char zSql[] = "SELECT fts2_tokenizer(?)";
+      
+        *pp = 0;
+        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
+        if( rc!=SQLITE_OK ){
+          return rc;
+        }
+      
+        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
+        if( SQLITE_ROW==sqlite3_step(pStmt) ){
+          if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){
+            memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
+          }
+        }
+      
+        return sqlite3_finalize(pStmt);
+      }
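
Note on the blob encoding used in sections 2 and 4 of the README above: the blob passed to fts2_tokenizer() holds the bytes of the pointer itself, not a copy of the sqlite3_tokenizer_module structure. That is why the samples bind &p with sizeof(p) and read the result back with memcpy(). The sketch below shows the same round trip in plain C, without SQLite. The names demo_module, encodeModulePtr() and decodeModulePtr() are hypothetical and exist only for this illustration; demo_module stands in for the real structure declared in fts2_tokenizer.h. Unlike the README sample, the decoder also checks the blob length, which is an added precaution and not something the README prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for sqlite3_tokenizer_module. The real structure
** is declared in fts2_tokenizer.h and holds the tokenizer callbacks. */
typedef struct demo_module demo_module;
struct demo_module {
  int iVersion;
};

/* Copy the pointer value p (not the structure it points to) into aBlob.
** This mirrors sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), ...) in the
** README sample. Returns the number of bytes written, or 0 if the
** buffer is too small to hold a pointer. */
size_t encodeModulePtr(
  const demo_module *p,
  unsigned char *aBlob,
  size_t nBlob
){
  assert( aBlob!=0 );
  if( nBlob<sizeof(p) ) return 0;
  memcpy(aBlob, &p, sizeof(p));
  return sizeof(p);
}

/* Recover a pointer from a blob produced by encodeModulePtr(). This
** mirrors the memcpy() call in queryTokenizer(). Returns 0 on success,
** or -1 (leaving *pp set to 0) if the blob is not exactly pointer-sized. */
int decodeModulePtr(
  const unsigned char *aBlob,
  size_t nBlob,
  const demo_module **pp
){
  assert( aBlob!=0 && pp!=0 );
  *pp = 0;
  if( nBlob!=sizeof(*pp) ) return -1;
  memcpy(pp, aBlob, sizeof(*pp));
  return 0;
}
```

Because the blob is a raw address, it is only meaningful inside the process that produced it, and a caller who can supply an arbitrary blob can supply an arbitrary address. This is presumably the concern behind the SECURITY note in section 2.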