Diffstat (limited to 'doc/src/sgml/html/textsearch-dictionaries.html')
-rw-r--r--  doc/src/sgml/html/textsearch-dictionaries.html | 661
1 file changed, 661 insertions, 0 deletions
diff --git a/doc/src/sgml/html/textsearch-dictionaries.html b/doc/src/sgml/html/textsearch-dictionaries.html new file mode 100644 index 0000000..1ba4782 --- /dev/null +++ b/doc/src/sgml/html/textsearch-dictionaries.html @@ -0,0 +1,661 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.6. Dictionaries</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch-parsers.html" title="12.5. Parsers" /><link rel="next" href="textsearch-configuration.html" title="12.7. Configuration Example" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.6. Dictionaries</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr></table><hr /></div><div class="sect1" id="TEXTSEARCH-DICTIONARIES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.6. Dictionaries</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS">12.6.1. Stop Words</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY">12.6.2. Simple Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SYNONYM-DICTIONARY">12.6.3. Synonym Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS">12.6.4. Thesaurus Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY">12.6.5. <span class="application">Ispell</span> Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SNOWBALL-DICTIONARY">12.6.6. <span class="application">Snowball</span> Dictionary</a></span></dt></dl></div><p> + Dictionaries are used to eliminate words that should not be considered in a + search (<em class="firstterm">stop words</em>), and to <em class="firstterm">normalize</em> words so + that different derived forms of the same word will match. A successfully + normalized word is called a <em class="firstterm">lexeme</em>. Aside from + improving search quality, normalization and removal of stop words reduce the + size of the <code class="type">tsvector</code> representation of a document, thereby + improving performance. Normalization does not always have linguistic meaning + and usually depends on application semantics. 
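+ For example, the built-in <code class="literal">english</code> configuration folds case, stems
+ words, and drops stop words when building a <code class="type">tsvector</code>. A quick
+ illustration (the exact lexemes can vary slightly across
+ <span class="productname">PostgreSQL</span> and <span class="application">Snowball</span> versions):
+
+ </p><pre class="screen">
+SELECT to_tsvector('english', 'The quick brown foxes jumped');
+              to_tsvector
+---------------------------------------
+ 'brown':3 'fox':4 'jump':5 'quick':2
+</pre><p>
+
+ Position 1, which belonged to the stop word <code class="literal">The</code>, is simply
+ omitted, and the remaining words are stored in their normalized form.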
+ </p><p> + Some examples of normalization: + + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + Linguistic — Ispell dictionaries try to reduce input words to a + normalized form; stemmer dictionaries remove word endings + </p></li><li class="listitem" style="list-style-type: disc"><p> + <acronym class="acronym">URL</acronym> locations can be canonicalized to make + equivalent URLs match: + + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + http://www.pgsql.ru/db/mw/index.html + </p></li><li class="listitem" style="list-style-type: disc"><p> + http://www.pgsql.ru/db/mw/ + </p></li><li class="listitem" style="list-style-type: disc"><p> + http://www.pgsql.ru/db/../db/mw/index.html + </p></li></ul></div><p> + </p></li><li class="listitem" style="list-style-type: disc"><p> + Color names can be replaced by their hexadecimal values, e.g., + <code class="literal">red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</code> + </p></li><li class="listitem" style="list-style-type: disc"><p> + If indexing numbers, we can + remove some fractional digits to reduce the range of possible + numbers, so for example <span class="emphasis"><em>3.14</em></span>159265359, + <span class="emphasis"><em>3.14</em></span>15926, <span class="emphasis"><em>3.14</em></span> will be the same + after normalization if only two digits are kept after the decimal point. + </p></li></ul></div><p> + + </p><p> + A dictionary is a program that accepts a token as + input and returns: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + an array of lexemes if the input token is known to the dictionary + (notice that one token can produce more than one lexeme) + </p></li><li class="listitem" style="list-style-type: disc"><p> + a single lexeme with the <code class="literal">TSL_FILTER</code> flag set, to replace + the original token with a new token to be passed to subsequent + dictionaries (a dictionary that does this is called a + <em class="firstterm">filtering dictionary</em>) + </p></li><li class="listitem" style="list-style-type: disc"><p> + an empty array if the dictionary knows the token, but it is a stop word + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">NULL</code> if the dictionary does not recognize the input token + </p></li></ul></div><p> + </p><p> + <span class="productname">PostgreSQL</span> provides predefined dictionaries for + many languages. There are also several predefined templates that can be + used to create new dictionaries with custom parameters. Each predefined + dictionary template is described below. If no existing + template is suitable, it is possible to create new ones; see the + <code class="filename">contrib/</code> area of the <span class="productname">PostgreSQL</span> distribution + for examples. + </p><p> + A text search configuration binds a parser together with a set of + dictionaries to process the parser's output tokens. For each token + type that the parser can return, a separate list of dictionaries is + specified by the configuration. When a token of that type is found + by the parser, each dictionary in the list is consulted in turn, + until some dictionary recognizes it as a known word. 
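+ The <code class="function">ts_debug</code> function shows this consultation for each
+ token; for instance, with the built-in <code class="literal">english</code> configuration
+ (output abbreviated here, and dependent on the installed configuration):
+
+ </p><pre class="screen">
+SELECT alias, token, dictionaries, dictionary, lexemes
+FROM ts_debug('english', 'a cat') WHERE alias = 'asciiword';
+   alias   | token |  dictionaries  |  dictionary  | lexemes
+-----------+-------+----------------+--------------+---------
+ asciiword | a     | {english_stem} | english_stem | {}
+ asciiword | cat   | {english_stem} | english_stem | {cat}
+</pre><p>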
If it is identified + as a stop word, or if no dictionary recognizes the token, it will be + discarded and not indexed or searched for. + Normally, the first dictionary that returns a non-<code class="literal">NULL</code> + output determines the result, and any remaining dictionaries are not + consulted; but a filtering dictionary can replace the given word + with a modified word, which is then passed to subsequent dictionaries. + </p><p> + The general rule for configuring a list of dictionaries + is to place first the most narrow, most specific dictionary, then the more + general dictionaries, finishing with a very general dictionary, like + a <span class="application">Snowball</span> stemmer or <code class="literal">simple</code>, which + recognizes everything. For example, for an astronomy-specific search + (<code class="literal">astro_en</code> configuration) one could bind token type + <code class="type">asciiword</code> (ASCII word) to a synonym dictionary of astronomical + terms, a general English dictionary and a <span class="application">Snowball</span> English + stemmer: + +</p><pre class="programlisting"> +ALTER TEXT SEARCH CONFIGURATION astro_en + ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem; +</pre><p> + </p><p> + A filtering dictionary can be placed anywhere in the list, except at the + end where it'd be useless. Filtering dictionaries are useful to partially + normalize words to simplify the task of later dictionaries. For example, + a filtering dictionary could be used to remove accents from accented + letters, as is done by the <a class="xref" href="unaccent.html" title="F.48. unaccent">unaccent</a> module. + </p><div class="sect2" id="TEXTSEARCH-STOPWORDS"><div class="titlepage"><div><div><h3 class="title">12.6.1. Stop Words</h3></div></div></div><p> + Stop words are words that are very common, appear in almost every + document, and have no discrimination value. Therefore, they can be ignored + in the context of full text searching. For example, every English text + contains words like <code class="literal">a</code> and <code class="literal">the</code>, so it is + useless to store them in an index. However, stop words do affect the + positions in <code class="type">tsvector</code>, which in turn affect ranking: + +</p><pre class="screen"> +SELECT to_tsvector('english', 'in the list of stop words'); + to_tsvector +---------------------------- + 'list':3 'stop':5 'word':6 +</pre><p> + + The missing positions 1,2,4 are because of stop words. Ranks + calculated for documents with and without stop words are quite different: + +</p><pre class="screen"> +SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop')); + ts_rank_cd +------------ + 0.05 + +SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop')); + ts_rank_cd +------------ + 0.1 +</pre><p> + + </p><p> + It is up to the specific dictionary how it treats stop words. For example, + <code class="literal">ispell</code> dictionaries first normalize words and then + look at the list of stop words, while <code class="literal">Snowball</code> stemmers + first check the list of stop words. The reason for the different + behavior is an attempt to decrease noise. + </p></div><div class="sect2" id="TEXTSEARCH-SIMPLE-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.2. 
Simple Dictionary</h3></div></div></div><p> + The <code class="literal">simple</code> dictionary template operates by converting the + input token to lower case and checking it against a file of stop words. + If it is found in the file then an empty array is returned, causing + the token to be discarded. If not, the lower-cased form of the word + is returned as the normalized lexeme. Alternatively, the dictionary + can be configured to report non-stop-words as unrecognized, allowing + them to be passed on to the next dictionary in the list. + </p><p> + Here is an example of a dictionary definition using the <code class="literal">simple</code> + template: + +</p><pre class="programlisting"> +CREATE TEXT SEARCH DICTIONARY public.simple_dict ( + TEMPLATE = pg_catalog.simple, + STOPWORDS = english +); +</pre><p> + + Here, <code class="literal">english</code> is the base name of a file of stop words. + The file's full name will be + <code class="filename">$SHAREDIR/tsearch_data/english.stop</code>, + where <code class="literal">$SHAREDIR</code> means the + <span class="productname">PostgreSQL</span> installation's shared-data directory, + often <code class="filename">/usr/local/share/postgresql</code> (use <code class="command">pg_config + --sharedir</code> to determine it if you're not sure). + The file format is simply a list + of words, one per line. Blank lines and trailing spaces are ignored, + and upper case is folded to lower case, but no other processing is done + on the file contents. + </p><p> + Now we can test our dictionary: + +</p><pre class="screen"> +SELECT ts_lexize('public.simple_dict', 'YeS'); + ts_lexize +----------- + {yes} + +SELECT ts_lexize('public.simple_dict', 'The'); + ts_lexize +----------- + {} +</pre><p> + </p><p> + We can also choose to return <code class="literal">NULL</code>, instead of the lower-cased + word, if it is not found in the stop words file. This behavior is + selected by setting the dictionary's <code class="literal">Accept</code> parameter to + <code class="literal">false</code>. Continuing the example: + +</p><pre class="screen"> +ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false ); + +SELECT ts_lexize('public.simple_dict', 'YeS'); + ts_lexize +----------- + + +SELECT ts_lexize('public.simple_dict', 'The'); + ts_lexize +----------- + {} +</pre><p> + </p><p> + With the default setting of <code class="literal">Accept</code> = <code class="literal">true</code>, + it is only useful to place a <code class="literal">simple</code> dictionary at the end + of a list of dictionaries, since it will never pass on any token to + a following dictionary. Conversely, <code class="literal">Accept</code> = <code class="literal">false</code> + is only useful when there is at least one following dictionary. + </p><div class="caution"><h3 class="title">Caution</h3><p> + Most types of dictionaries rely on configuration files, such as files of + stop words. These files <span class="emphasis"><em>must</em></span> be stored in UTF-8 encoding. + They will be translated to the actual database encoding, if that is + different, when they are read into the server. + </p></div><div class="caution"><h3 class="title">Caution</h3><p> + Normally, a database session will read a dictionary configuration file + only once, when it is first used within the session. If you modify a + configuration file and want to force existing sessions to pick up the + new contents, issue an <code class="command">ALTER TEXT SEARCH DICTIONARY</code> command + on the dictionary. 
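+ For example, simply re-issuing one of its current settings is enough
+ (a sketch reusing <code class="literal">public.simple_dict</code> from the example above):
+
+ </p><pre class="programlisting">
+ALTER TEXT SEARCH DICTIONARY public.simple_dict ( StopWords = english );
+</pre><p>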
This can be a <span class="quote">“<span class="quote">dummy</span>”</span> update that doesn't + actually change any parameter values. + </p></div></div><div class="sect2" id="TEXTSEARCH-SYNONYM-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.3. Synonym Dictionary</h3></div></div></div><p> + This dictionary template is used to create dictionaries that replace a + word with a synonym. Phrases are not supported (use the thesaurus + template (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS" title="12.6.4. Thesaurus Dictionary">Section 12.6.4</a>) for that). A synonym + dictionary can be used to overcome linguistic problems, for example, to + prevent an English stemmer dictionary from reducing the word <span class="quote">“<span class="quote">Paris</span>”</span> to + <span class="quote">“<span class="quote">pari</span>”</span>. It is enough to have a <code class="literal">Paris paris</code> line in the + synonym dictionary and put it before the <code class="literal">english_stem</code> + dictionary. For example: + +</p><pre class="screen"> +SELECT * FROM ts_debug('english', 'Paris'); + alias | description | token | dictionaries | dictionary | lexemes +-----------+-----------------+-------+----------------+--------------+--------- + asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari} + +CREATE TEXT SEARCH DICTIONARY my_synonym ( + TEMPLATE = synonym, + SYNONYMS = my_synonyms +); + +ALTER TEXT SEARCH CONFIGURATION english + ALTER MAPPING FOR asciiword + WITH my_synonym, english_stem; + +SELECT * FROM ts_debug('english', 'Paris'); + alias | description | token | dictionaries | dictionary | lexemes +-----------+-----------------+-------+---------------------------+------------+--------- + asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris} +</pre><p> + </p><p> + The only parameter required by the <code class="literal">synonym</code> template is + <code class="literal">SYNONYMS</code>, which is the base name of its configuration file + — <code class="literal">my_synonyms</code> in the above example. + The file's full name will be + <code class="filename">$SHAREDIR/tsearch_data/my_synonyms.syn</code> + (where <code class="literal">$SHAREDIR</code> means the + <span class="productname">PostgreSQL</span> installation's shared-data directory). + The file format is just one line + per word to be substituted, with the word followed by its synonym, + separated by white space. Blank lines and trailing spaces are ignored. + </p><p> + The <code class="literal">synonym</code> template also has an optional parameter + <code class="literal">CaseSensitive</code>, which defaults to <code class="literal">false</code>. When + <code class="literal">CaseSensitive</code> is <code class="literal">false</code>, words in the synonym file + are folded to lower case, as are input tokens. When it is + <code class="literal">true</code>, words and tokens are not folded to lower case, + but are compared as-is. + </p><p> + An asterisk (<code class="literal">*</code>) can be placed at the end of a synonym + in the configuration file. This indicates that the synonym is a prefix. + The asterisk is ignored when the entry is used in + <code class="function">to_tsvector()</code>, but when it is used in + <code class="function">to_tsquery()</code>, the result will be a query item with + the prefix match marker (see + <a class="xref" href="textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES" title="12.3.2. 
Parsing Queries">Section 12.3.2</a>). + For example, suppose we have these entries in + <code class="filename">$SHAREDIR/tsearch_data/synonym_sample.syn</code>: +</p><pre class="programlisting"> +postgres pgsql +postgresql pgsql +postgre pgsql +gogle googl +indices index* +</pre><p> + Then we will get these results: +</p><pre class="screen"> +mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample'); +mydb=# SELECT ts_lexize('syn', 'indices'); + ts_lexize +----------- + {index} +(1 row) + +mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple); +mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn; +mydb=# SELECT to_tsvector('tst', 'indices'); + to_tsvector +------------- + 'index':1 +(1 row) + +mydb=# SELECT to_tsquery('tst', 'indices'); + to_tsquery +------------ + 'index':* +(1 row) + +mydb=# SELECT 'indexes are very useful'::tsvector; + tsvector +--------------------------------- + 'are' 'indexes' 'useful' 'very' +(1 row) + +mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices'); + ?column? +---------- + t +(1 row) +</pre><p> + </p></div><div class="sect2" id="TEXTSEARCH-THESAURUS"><div class="titlepage"><div><div><h3 class="title">12.6.4. Thesaurus Dictionary</h3></div></div></div><p> + A thesaurus dictionary (sometimes abbreviated as <acronym class="acronym">TZ</acronym>) is + a collection of words that includes information about the relationships + of words and phrases, i.e., broader terms (<acronym class="acronym">BT</acronym>), narrower + terms (<acronym class="acronym">NT</acronym>), preferred terms, non-preferred terms, related + terms, etc. + </p><p> + Basically a thesaurus dictionary replaces all non-preferred terms by one + preferred term and, optionally, preserves the original terms for indexing + as well. <span class="productname">PostgreSQL</span>'s current implementation of the + thesaurus dictionary is an extension of the synonym dictionary with added + <em class="firstterm">phrase</em> support. A thesaurus dictionary requires + a configuration file of the following format: + +</p><pre class="programlisting"> +# this is a comment +sample word(s) : indexed word(s) +more sample word(s) : more indexed word(s) +... +</pre><p> + + where the colon (<code class="symbol">:</code>) symbol acts as a delimiter between a + phrase and its replacement. + </p><p> + A thesaurus dictionary uses a <em class="firstterm">subdictionary</em> (which + is specified in the dictionary's configuration) to normalize the input + text before checking for phrase matches. It is only possible to select one + subdictionary. An error is reported if the subdictionary fails to + recognize a word. In that case, you should remove the use of the word or + teach the subdictionary about it. You can place an asterisk + (<code class="symbol">*</code>) at the beginning of an indexed word to skip applying + the subdictionary to it, but all sample words <span class="emphasis"><em>must</em></span> be known + to the subdictionary. + </p><p> + The thesaurus dictionary chooses the longest match if there are multiple + phrases matching the input, and ties are broken by using the last + definition. + </p><p> + Specific stop words recognized by the subdictionary cannot be + specified; instead use <code class="literal">?</code> to mark the location where any + stop word can appear. 
For example, assuming that <code class="literal">a</code> and + <code class="literal">the</code> are stop words according to the subdictionary: + +</p><pre class="programlisting"> +? one ? two : swsw +</pre><p> + + matches <code class="literal">a one the two</code> and <code class="literal">the one a two</code>; + both would be replaced by <code class="literal">swsw</code>. + </p><p> + Since a thesaurus dictionary has the capability to recognize phrases it + must remember its state and interact with the parser. A thesaurus dictionary + uses these assignments to check if it should handle the next word or stop + accumulation. The thesaurus dictionary must be configured + carefully. For example, if the thesaurus dictionary is assigned to handle + only the <code class="literal">asciiword</code> token, then a thesaurus dictionary + definition like <code class="literal">one 7</code> will not work since token type + <code class="literal">uint</code> is not assigned to the thesaurus dictionary. + </p><div class="caution"><h3 class="title">Caution</h3><p> + Thesauruses are used during indexing so any change in the thesaurus + dictionary's parameters <span class="emphasis"><em>requires</em></span> reindexing. + For most other dictionary types, small changes such as adding or + removing stopwords does not force reindexing. + </p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-CONFIG"><div class="titlepage"><div><div><h4 class="title">12.6.4.1. Thesaurus Configuration</h4></div></div></div><p> + To define a new thesaurus dictionary, use the <code class="literal">thesaurus</code> + template. For example: + +</p><pre class="programlisting"> +CREATE TEXT SEARCH DICTIONARY thesaurus_simple ( + TEMPLATE = thesaurus, + DictFile = mythesaurus, + Dictionary = pg_catalog.english_stem +); +</pre><p> + + Here: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">thesaurus_simple</code> is the new dictionary's name + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">mythesaurus</code> is the base name of the thesaurus + configuration file. + (Its full name will be <code class="filename">$SHAREDIR/tsearch_data/mythesaurus.ths</code>, + where <code class="literal">$SHAREDIR</code> means the installation shared-data + directory.) + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">pg_catalog.english_stem</code> is the subdictionary (here, + a Snowball English stemmer) to use for thesaurus normalization. + Notice that the subdictionary will have its own + configuration (for example, stop words), which is not shown here. + </p></li></ul></div><p> + + Now it is possible to bind the thesaurus dictionary <code class="literal">thesaurus_simple</code> + to the desired token types in a configuration, for example: + +</p><pre class="programlisting"> +ALTER TEXT SEARCH CONFIGURATION russian + ALTER MAPPING FOR asciiword, asciihword, hword_asciipart + WITH thesaurus_simple; +</pre><p> + </p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-EXAMPLES"><div class="titlepage"><div><div><h4 class="title">12.6.4.2. 
Thesaurus Example</h4></div></div></div><p> + Consider a simple astronomical thesaurus <code class="literal">thesaurus_astro</code>, + which contains some astronomical word combinations: + +</p><pre class="programlisting"> +supernovae stars : sn +crab nebulae : crab +</pre><p> + + Below we create a dictionary and bind some token types to + an astronomical thesaurus and English stemmer: + +</p><pre class="programlisting"> +CREATE TEXT SEARCH DICTIONARY thesaurus_astro ( + TEMPLATE = thesaurus, + DictFile = thesaurus_astro, + Dictionary = english_stem +); + +ALTER TEXT SEARCH CONFIGURATION russian + ALTER MAPPING FOR asciiword, asciihword, hword_asciipart + WITH thesaurus_astro, english_stem; +</pre><p> + + Now we can see how it works. + <code class="function">ts_lexize</code> is not very useful for testing a thesaurus, + because it treats its input as a single token. Instead we can use + <code class="function">plainto_tsquery</code> and <code class="function">to_tsvector</code> + which will break their input strings into multiple tokens: + +</p><pre class="screen"> +SELECT plainto_tsquery('supernova star'); + plainto_tsquery +----------------- + 'sn' + +SELECT to_tsvector('supernova star'); + to_tsvector +------------- + 'sn':1 +</pre><p> + + In principle, one can use <code class="function">to_tsquery</code> if you quote + the argument: + +</p><pre class="screen"> +SELECT to_tsquery('''supernova star'''); + to_tsquery +------------ + 'sn' +</pre><p> + + Notice that <code class="literal">supernova star</code> matches <code class="literal">supernovae + stars</code> in <code class="literal">thesaurus_astro</code> because we specified + the <code class="literal">english_stem</code> stemmer in the thesaurus definition. + The stemmer removed the <code class="literal">e</code> and <code class="literal">s</code>. + </p><p> + To index the original phrase as well as the substitute, just include it + in the right-hand part of the definition: + +</p><pre class="screen"> +supernovae stars : sn supernovae stars + +SELECT plainto_tsquery('supernova star'); + plainto_tsquery +----------------------------- + 'sn' & 'supernova' & 'star' +</pre><p> + </p></div></div><div class="sect2" id="TEXTSEARCH-ISPELL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.5. <span class="application">Ispell</span> Dictionary</h3></div></div></div><p> + The <span class="application">Ispell</span> dictionary template supports + <em class="firstterm">morphological dictionaries</em>, which can normalize many + different linguistic forms of a word into the same lexeme. For example, + an English <span class="application">Ispell</span> dictionary can match all declensions and + conjugations of the search term <code class="literal">bank</code>, e.g., + <code class="literal">banking</code>, <code class="literal">banked</code>, <code class="literal">banks</code>, + <code class="literal">banks'</code>, and <code class="literal">bank's</code>. + </p><p> + The standard <span class="productname">PostgreSQL</span> distribution does + not include any <span class="application">Ispell</span> configuration files. + Dictionaries for a large number of languages are available from <a class="ulink" href="https://www.cs.hmc.edu/~geoff/ispell.html" target="_top">Ispell</a>. 
+ Also, some more modern dictionary file formats are supported — <a class="ulink" href="https://en.wikipedia.org/wiki/MySpell" target="_top">MySpell</a> (OO < 2.0.1) + and <a class="ulink" href="https://hunspell.github.io/" target="_top">Hunspell</a> + (OO >= 2.0.2). A large list of dictionaries is available on the <a class="ulink" href="https://wiki.openoffice.org/wiki/Dictionaries" target="_top">OpenOffice + Wiki</a>. + </p><p> + To create an <span class="application">Ispell</span> dictionary perform these steps: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + download dictionary configuration files. <span class="productname">OpenOffice</span> + extension files have the <code class="filename">.oxt</code> extension. It is necessary + to extract <code class="filename">.aff</code> and <code class="filename">.dic</code> files, change + extensions to <code class="filename">.affix</code> and <code class="filename">.dict</code>. For some + dictionary files it is also needed to convert characters to the UTF-8 + encoding with commands (for example, for a Norwegian language dictionary): +</p><pre class="programlisting"> +iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff +iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic +</pre><p> + </p></li><li class="listitem" style="list-style-type: disc"><p> + copy files to the <code class="filename">$SHAREDIR/tsearch_data</code> directory + </p></li><li class="listitem" style="list-style-type: disc"><p> + load files into PostgreSQL with the following command: +</p><pre class="programlisting"> +CREATE TEXT SEARCH DICTIONARY english_hunspell ( + TEMPLATE = ispell, + DictFile = en_us, + AffFile = en_us, + Stopwords = english); +</pre><p> + </p></li></ul></div><p> + Here, <code class="literal">DictFile</code>, <code class="literal">AffFile</code>, and <code class="literal">StopWords</code> + specify the base names of the dictionary, affixes, and stop-words files. + The stop-words file has the same format explained above for the + <code class="literal">simple</code> dictionary type. The format of the other files is + not specified here but is available from the above-mentioned web sites. + </p><p> + Ispell dictionaries usually recognize a limited set of words, so they + should be followed by another broader dictionary; for + example, a Snowball dictionary, which recognizes everything. + </p><p> + The <code class="filename">.affix</code> file of <span class="application">Ispell</span> has the following + structure: +</p><pre class="programlisting"> +prefixes +flag *A: + . > RE # As in enter > reenter +suffixes +flag T: + E > ST # As in late > latest + [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest + [AEIOU]Y > EST # As in gray > grayest + [^EY] > EST # As in small > smallest +</pre><p> + </p><p> + And the <code class="filename">.dict</code> file has the following structure: +</p><pre class="programlisting"> +lapse/ADGRS +lard/DGRS +large/PRTY +lark/MRS +</pre><p> + </p><p> + Format of the <code class="filename">.dict</code> file is: +</p><pre class="programlisting"> +basic_form/affix_class_name +</pre><p> + </p><p> + In the <code class="filename">.affix</code> file every affix flag is described in the + following format: +</p><pre class="programlisting"> +condition > [-stripping_letters,] adding_affix +</pre><p> + </p><p> + Here, condition has a format similar to the format of regular expressions. 
+ It can use groupings <code class="literal">[...]</code> and <code class="literal">[^...]</code>. + For example, <code class="literal">[AEIOU]Y</code> means that the last letter of the word + is <code class="literal">"y"</code> and the penultimate letter is <code class="literal">"a"</code>, + <code class="literal">"e"</code>, <code class="literal">"i"</code>, <code class="literal">"o"</code> or <code class="literal">"u"</code>. + <code class="literal">[^EY]</code> means that the last letter is neither <code class="literal">"e"</code> + nor <code class="literal">"y"</code>. + </p><p> + Ispell dictionaries support splitting compound words; + a useful feature. + Notice that the affix file should specify a special flag using the + <code class="literal">compoundwords controlled</code> statement that marks dictionary + words that can participate in compound formation: + +</p><pre class="programlisting"> +compoundwords controlled z +</pre><p> + + Here are some examples for the Norwegian language: + +</p><pre class="programlisting"> +SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent'); + {over,buljong,terning,pakk,mester,assistent} +SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); + {sjokoladefabrikk,sjokolade,fabrikk} +</pre><p> + </p><p> + <span class="application">MySpell</span> format is a subset of <span class="application">Hunspell</span>. + The <code class="filename">.affix</code> file of <span class="application">Hunspell</span> has the following + structure: +</p><pre class="programlisting"> +PFX A Y 1 +PFX A 0 re . +SFX T N 4 +SFX T 0 st e +SFX T y iest [^aeiou]y +SFX T 0 est [aeiou]y +SFX T 0 est [^ey] +</pre><p> + </p><p> + The first line of an affix class is the header. Fields of an affix rules are + listed after the header: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + parameter name (PFX or SFX) + </p></li><li class="listitem" style="list-style-type: disc"><p> + flag (name of the affix class) + </p></li><li class="listitem" style="list-style-type: disc"><p> + stripping characters from beginning (at prefix) or end (at suffix) of the + word + </p></li><li class="listitem" style="list-style-type: disc"><p> + adding affix + </p></li><li class="listitem" style="list-style-type: disc"><p> + condition that has a format similar to the format of regular expressions. + </p></li></ul></div><p> + The <code class="filename">.dict</code> file looks like the <code class="filename">.dict</code> file of + <span class="application">Ispell</span>: +</p><pre class="programlisting"> +larder/M +lardy/RT +large/RSPMYT +largehearted +</pre><p> + </p><div class="note"><h3 class="title">Note</h3><p> + <span class="application">MySpell</span> does not support compound words. + <span class="application">Hunspell</span> has sophisticated support for compound words. At + present, <span class="productname">PostgreSQL</span> implements only the basic + compound word operations of Hunspell. + </p></div></div><div class="sect2" id="TEXTSEARCH-SNOWBALL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.6. <span class="application">Snowball</span> Dictionary</h3></div></div></div><p> + The <span class="application">Snowball</span> dictionary template is based on a project + by Martin Porter, inventor of the popular Porter's stemming algorithm + for the English language. 
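+ For example, the built-in <code class="literal">english_stem</code> dictionary (its
+ equivalent definition appears below) both stems words and discards stop words;
+ a quick check with <code class="function">ts_lexize</code>, whose results may vary
+ slightly between Snowball versions:
+
+ </p><pre class="screen">
+SELECT ts_lexize('english_stem', 'stars');
+ ts_lexize
+-----------
+ {star}
+
+SELECT ts_lexize('english_stem', 'a');
+ ts_lexize
+-----------
+ {}
+</pre><p>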
Snowball now provides stemming algorithms for + many languages (see the <a class="ulink" href="https://snowballstem.org/" target="_top">Snowball + site</a> for more information). Each algorithm understands how to + reduce common variant forms of words to a base, or stem, spelling within + its language. A Snowball dictionary requires a <code class="literal">language</code> + parameter to identify which stemmer to use, and optionally can specify a + <code class="literal">stopword</code> file name that gives a list of words to eliminate. + (<span class="productname">PostgreSQL</span>'s standard stopword lists are also + provided by the Snowball project.) + For example, there is a built-in definition equivalent to + +</p><pre class="programlisting"> +CREATE TEXT SEARCH DICTIONARY english_stem ( + TEMPLATE = snowball, + Language = english, + StopWords = english +); +</pre><p> + + The stopword file format is the same as already explained. + </p><p> + A <span class="application">Snowball</span> dictionary recognizes everything, whether + or not it is able to simplify the word, so it should be placed + at the end of the dictionary list. It is useless to have it + before any other dictionary because a token will never pass through it to + the next dictionary. + </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr><tr><td width="40%" align="left" valign="top">12.5. Parsers </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.7. Configuration Example</td></tr></table></div></body></html>
\ No newline at end of file