author Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-04 12:19:15 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-04 12:19:15 +0000
commit 6eb9c5a5657d1fe77b55cc261450f3538d35a94d
tree 657d8194422a5daccecfd42d654b8a245ef7b4c8 /doc/src/sgml/html/textsearch-dictionaries.html
parent Initial commit.
Adding upstream version 13.4.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/src/sgml/html/textsearch-dictionaries.html')
-rw-r--r-- doc/src/sgml/html/textsearch-dictionaries.html 661
1 files changed, 661 insertions, 0 deletions
diff --git a/doc/src/sgml/html/textsearch-dictionaries.html b/doc/src/sgml/html/textsearch-dictionaries.html
new file mode 100644
index 0000000..02d02dd
--- /dev/null
+++ b/doc/src/sgml/html/textsearch-dictionaries.html
@@ -0,0 +1,661 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.6. Dictionaries</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="textsearch-parsers.html" title="12.5. Parsers" /><link rel="next" href="textsearch-configuration.html" title="12.7. Configuration Example" /></head><body id="docContent" class="container-fluid col-10"><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.6. Dictionaries</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 13.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="TEXTSEARCH-DICTIONARIES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.6. Dictionaries</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS">12.6.1. Stop Words</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY">12.6.2. Simple Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SYNONYM-DICTIONARY">12.6.3. 
Synonym Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS">12.6.4. Thesaurus Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY">12.6.5. <span class="application">Ispell</span> Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SNOWBALL-DICTIONARY">12.6.6. <span class="application">Snowball</span> Dictionary</a></span></dt></dl></div><p>
+ Dictionaries are used to eliminate words that should not be considered in a
+ search (<em class="firstterm">stop words</em>), and to <em class="firstterm">normalize</em> words so
+ that different derived forms of the same word will match. A successfully
+ normalized word is called a <em class="firstterm">lexeme</em>. Aside from
+ improving search quality, normalization and removal of stop words reduce the
+ size of the <code class="type">tsvector</code> representation of a document, thereby
+ improving performance. Normalization does not always have linguistic meaning
+ and usually depends on application semantics.
+ </p><p>
+ Some examples of normalization:
+
+ </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+ Linguistic — Ispell dictionaries try to reduce input words to a
+ normalized form; stemmer dictionaries remove word endings
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ <acronym class="acronym">URL</acronym> locations can be canonicalized to make
+ equivalent URLs match:
+
+ </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+ http://www.pgsql.ru/db/mw/index.html
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ http://www.pgsql.ru/db/mw/
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ http://www.pgsql.ru/db/../db/mw/index.html
+ </p></li></ul></div><p>
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ Color names can be replaced by their hexadecimal values, e.g.,
+ <code class="literal">red, green, blue, magenta -&gt; FF0000, 00FF00, 0000FF, FF00FF</code>
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ If indexing numbers, we can
+ remove some fractional digits to reduce the range of possible
+ numbers, so for example <span class="emphasis"><em>3.14</em></span>159265359,
+ <span class="emphasis"><em>3.14</em></span>15926, <span class="emphasis"><em>3.14</em></span> will be the same
+ after normalization if only two digits are kept after the decimal point.
+ </p></li></ul></div><p>
+
+ </p><p>
+ A dictionary is a program that accepts a token as
+ input and returns:
+ </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+ an array of lexemes if the input token is known to the dictionary
+ (notice that one token can produce more than one lexeme)
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ a single lexeme with the <code class="literal">TSL_FILTER</code> flag set, to replace
+ the original token with a new token to be passed to subsequent
+ dictionaries (a dictionary that does this is called a
+ <em class="firstterm">filtering dictionary</em>)
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ an empty array if the dictionary knows the token, but it is a stop word
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ <code class="literal">NULL</code> if the dictionary does not recognize the input token
+ </p></li></ul></div><p>
+ </p><p>
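+ As a minimal illustration, assuming the built-in <code class="literal">english_stem</code>
+ dictionary with its default stop-word list, <code class="function">ts_lexize</code> shows two
+ of these outcomes (a lexeme array for a known word, and an empty array
+ for a stop word):
+
+</p><pre class="screen">
+SELECT ts_lexize('english_stem', 'stars');
+ ts_lexize
+-----------
+ {star}
+
+SELECT ts_lexize('english_stem', 'a');
+ ts_lexize
+-----------
+ {}
+</pre><p>
+ </p><p>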
+ <span class="productname">PostgreSQL</span> provides predefined dictionaries for
+ many languages. There are also several predefined templates that can be
+ used to create new dictionaries with custom parameters. Each predefined
+ dictionary template is described below. If no existing
+ template is suitable, it is possible to create new ones; see the
+ <code class="filename">contrib/</code> area of the <span class="productname">PostgreSQL</span> distribution
+ for examples.
+ </p><p>
+ A text search configuration binds a parser together with a set of
+ dictionaries to process the parser's output tokens. For each token
+ type that the parser can return, a separate list of dictionaries is
+ specified by the configuration. When a token of that type is found
+ by the parser, each dictionary in the list is consulted in turn,
+ until some dictionary recognizes it as a known word. If it is identified
+ as a stop word, or if no dictionary recognizes the token, it will be
+ discarded and not indexed or searched for.
+ Normally, the first dictionary that returns a non-<code class="literal">NULL</code>
+ output determines the result, and any remaining dictionaries are not
+ consulted; but a filtering dictionary can replace the given word
+ with a modified word, which is then passed to subsequent dictionaries.
+ </p><p>
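+ One way to see which dictionary in the list handled each token is
+ <code class="function">ts_debug</code>. The following sketch assumes the built-in
+ <code class="literal">english</code> configuration; the <code class="literal">dictionaries</code>
+ column lists the dictionaries consulted for the token type, and
+ <code class="literal">dictionary</code> names the one that recognized the token:
+
+</p><pre class="programlisting">
+SELECT alias, token, dictionaries, dictionary, lexemes
+ FROM ts_debug('english', 'The stars');
+</pre><p>
+ </p><p>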
+ The general rule for configuring a list of dictionaries
+ is to place first the most narrow, most specific dictionary, then the more
+ general dictionaries, finishing with a very general dictionary, like
+ a <span class="application">Snowball</span> stemmer or <code class="literal">simple</code>, which
+ recognizes everything. For example, for an astronomy-specific search
+ (<code class="literal">astro_en</code> configuration) one could bind token type
+ <code class="type">asciiword</code> (ASCII word) to a synonym dictionary of astronomical
+ terms, a general English dictionary and a <span class="application">Snowball</span> English
+ stemmer:
+
+</p><pre class="programlisting">
+ALTER TEXT SEARCH CONFIGURATION astro_en
+ ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
+</pre><p>
+ </p><p>
+ A filtering dictionary can be placed anywhere in the list, except at the
+ end, where it would be useless. Filtering dictionaries are useful to partially
+ normalize words to simplify the task of later dictionaries. For example,
+ a filtering dictionary could be used to remove accents from accented
+ letters, as is done by the <a class="xref" href="unaccent.html" title="F.43. unaccent">unaccent</a> module.
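+ A sketch of such a setup, assuming the <code class="literal">unaccent</code> extension is
+ available for installation and using a hypothetical configuration named
+ <code class="literal">fr</code>:
+
+</p><pre class="programlisting">
+CREATE EXTENSION unaccent;
+CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
+ALTER TEXT SEARCH CONFIGURATION fr
+ ALTER MAPPING FOR hword, hword_part, word
+ WITH unaccent, french_stem;
+</pre><p>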
+ </p><div class="sect2" id="TEXTSEARCH-STOPWORDS"><div class="titlepage"><div><div><h3 class="title">12.6.1. Stop Words</h3></div></div></div><p>
+ Stop words are words that are very common, appear in almost every
+ document, and have no discrimination value. Therefore, they can be ignored
+ in the context of full text searching. For example, every English text
+ contains words like <code class="literal">a</code> and <code class="literal">the</code>, so it is
+ useless to store them in an index. However, stop words do affect the
+ positions in <code class="type">tsvector</code>, which in turn affect ranking:
+
+</p><pre class="screen">
+SELECT to_tsvector('english', 'in the list of stop words');
+ to_tsvector
+----------------------------
+ 'list':3 'stop':5 'word':6
+</pre><p>
+
+ Positions 1, 2, and 4 are missing because the words at those positions are stop words. Ranks
+ calculated for documents with and without stop words are quite different:
+
+</p><pre class="screen">
+SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list &amp; stop'));
+ ts_rank_cd
+------------
+ 0.05
+
+SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list &amp; stop'));
+ ts_rank_cd
+------------
+ 0.1
+</pre><p>
+
+ </p><p>
+ It is up to the specific dictionary how it treats stop words. For example,
+ <code class="literal">ispell</code> dictionaries first normalize words and then
+ look at the list of stop words, while <code class="literal">Snowball</code> stemmers
+ first check the list of stop words. The reason for the different
+ behavior is an attempt to decrease noise.
+ </p></div><div class="sect2" id="TEXTSEARCH-SIMPLE-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.2. Simple Dictionary</h3></div></div></div><p>
+ The <code class="literal">simple</code> dictionary template operates by converting the
+ input token to lower case and checking it against a file of stop words.
+ If it is found in the file then an empty array is returned, causing
+ the token to be discarded. If not, the lower-cased form of the word
+ is returned as the normalized lexeme. Alternatively, the dictionary
+ can be configured to report non-stop-words as unrecognized, allowing
+ them to be passed on to the next dictionary in the list.
+ </p><p>
+ Here is an example of a dictionary definition using the <code class="literal">simple</code>
+ template:
+
+</p><pre class="programlisting">
+CREATE TEXT SEARCH DICTIONARY public.simple_dict (
+ TEMPLATE = pg_catalog.simple,
+ STOPWORDS = english
+);
+</pre><p>
+
+ Here, <code class="literal">english</code> is the base name of a file of stop words.
+ The file's full name will be
+ <code class="filename">$SHAREDIR/tsearch_data/english.stop</code>,
+ where <code class="literal">$SHAREDIR</code> means the
+ <span class="productname">PostgreSQL</span> installation's shared-data directory,
+ often <code class="filename">/usr/local/share/postgresql</code> (use <code class="command">pg_config
+ --sharedir</code> to determine it if you're not sure).
+ The file format is simply a list
+ of words, one per line. Blank lines and trailing spaces are ignored,
+ and upper case is folded to lower case, but no other processing is done
+ on the file contents.
+ </p><p>
+ Now we can test our dictionary:
+
+</p><pre class="screen">
+SELECT ts_lexize('public.simple_dict', 'YeS');
+ ts_lexize
+-----------
+ {yes}
+
+SELECT ts_lexize('public.simple_dict', 'The');
+ ts_lexize
+-----------
+ {}
+</pre><p>
+ </p><p>
+ We can also choose to return <code class="literal">NULL</code>, instead of the lower-cased
+ word, if it is not found in the stop words file. This behavior is
+ selected by setting the dictionary's <code class="literal">Accept</code> parameter to
+ <code class="literal">false</code>. Continuing the example:
+
+</p><pre class="screen">
+ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
+
+SELECT ts_lexize('public.simple_dict', 'YeS');
+ ts_lexize
+-----------
+
+
+SELECT ts_lexize('public.simple_dict', 'The');
+ ts_lexize
+-----------
+ {}
+</pre><p>
+ </p><p>
+ With the default setting of <code class="literal">Accept</code> = <code class="literal">true</code>,
+ it is only useful to place a <code class="literal">simple</code> dictionary at the end
+ of a list of dictionaries, since it will never pass on any token to
+ a following dictionary. Conversely, <code class="literal">Accept</code> = <code class="literal">false</code>
+ is only useful when there is at least one following dictionary.
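+ For instance, assuming <code class="literal">public.simple_dict</code> from above has
+ <code class="literal">Accept</code> = <code class="literal">false</code>, it could be placed in front of a
+ stemmer so that it only filters out stop words; the configuration name
+ <code class="literal">public.my_english</code> is hypothetical:
+
+</p><pre class="programlisting">
+CREATE TEXT SEARCH CONFIGURATION public.my_english ( COPY = pg_catalog.english );
+ALTER TEXT SEARCH CONFIGURATION public.my_english
+ ALTER MAPPING FOR asciiword
+ WITH public.simple_dict, english_stem;
+</pre><p>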
+ </p><div class="caution"><h3 class="title">Caution</h3><p>
+ Most types of dictionaries rely on configuration files, such as files of
+ stop words. These files <span class="emphasis"><em>must</em></span> be stored in UTF-8 encoding.
+ They will be translated to the actual database encoding, if that is
+ different, when they are read into the server.
+ </p></div><div class="caution"><h3 class="title">Caution</h3><p>
+ Normally, a database session will read a dictionary configuration file
+ only once, when it is first used within the session. If you modify a
+ configuration file and want to force existing sessions to pick up the
+ new contents, issue an <code class="command">ALTER TEXT SEARCH DICTIONARY</code> command
+ on the dictionary. This can be a <span class="quote">“<span class="quote">dummy</span>”</span> update that doesn't
+ actually change any parameter values.
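+ A sketch of such a dummy update, assuming the
+ <code class="literal">public.simple_dict</code> dictionary defined above: it names an option
+ (here the arbitrary word <code class="literal">dummy</code>) without a value, which asks for
+ that option to be removed and should be harmless when no such option is set:
+
+</p><pre class="programlisting">
+ALTER TEXT SEARCH DICTIONARY public.simple_dict ( dummy );
+</pre><p>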
+ </p></div></div><div class="sect2" id="TEXTSEARCH-SYNONYM-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.3. Synonym Dictionary</h3></div></div></div><p>
+ This dictionary template is used to create dictionaries that replace a
+ word with a synonym. Phrases are not supported (use the thesaurus
+ template (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS" title="12.6.4. Thesaurus Dictionary">Section 12.6.4</a>) for that). A synonym
+ dictionary can be used to overcome linguistic problems, for example, to
+ prevent an English stemmer dictionary from reducing the word <span class="quote">“<span class="quote">Paris</span>”</span> to
+ <span class="quote">“<span class="quote">pari</span>”</span>. It is enough to have a <code class="literal">Paris paris</code> line in the
+ synonym dictionary and put it before the <code class="literal">english_stem</code>
+ dictionary. For example:
+
+</p><pre class="screen">
+SELECT * FROM ts_debug('english', 'Paris');
+ alias | description | token | dictionaries | dictionary | lexemes
+-----------+-----------------+-------+----------------+--------------+---------
+ asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
+
+CREATE TEXT SEARCH DICTIONARY my_synonym (
+ TEMPLATE = synonym,
+ SYNONYMS = my_synonyms
+);
+
+ALTER TEXT SEARCH CONFIGURATION english
+ ALTER MAPPING FOR asciiword
+ WITH my_synonym, english_stem;
+
+SELECT * FROM ts_debug('english', 'Paris');
+ alias | description | token | dictionaries | dictionary | lexemes
+-----------+-----------------+-------+---------------------------+------------+---------
+ asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
+</pre><p>
+ </p><p>
+ The only parameter required by the <code class="literal">synonym</code> template is
+ <code class="literal">SYNONYMS</code>, which is the base name of its configuration file
+ — <code class="literal">my_synonyms</code> in the above example.
+ The file's full name will be
+ <code class="filename">$SHAREDIR/tsearch_data/my_synonyms.syn</code>
+ (where <code class="literal">$SHAREDIR</code> means the
+ <span class="productname">PostgreSQL</span> installation's shared-data directory).
+ The file format is just one line
+ per word to be substituted, with the word followed by its synonym,
+ separated by white space. Blank lines and trailing spaces are ignored.
+ </p><p>
+ The <code class="literal">synonym</code> template also has an optional parameter
+ <code class="literal">CaseSensitive</code>, which defaults to <code class="literal">false</code>. When
+ <code class="literal">CaseSensitive</code> is <code class="literal">false</code>, words in the synonym file
+ are folded to lower case, as are input tokens. When it is
+ <code class="literal">true</code>, words and tokens are not folded to lower case,
+ but are compared as-is.
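+ For example, a case-sensitive variant of the dictionary above could be
+ declared like this (the name <code class="literal">my_synonym_cs</code> is hypothetical):
+
+</p><pre class="programlisting">
+CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
+ TEMPLATE = synonym,
+ SYNONYMS = my_synonyms,
+ CaseSensitive = true
+);
+</pre><p>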
+ </p><p>
+ An asterisk (<code class="literal">*</code>) can be placed at the end of a synonym
+ in the configuration file. This indicates that the synonym is a prefix.
+ The asterisk is ignored when the entry is used in
+ <code class="function">to_tsvector()</code>, but when it is used in
+ <code class="function">to_tsquery()</code>, the result will be a query item with
+ the prefix match marker (see
+ <a class="xref" href="textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES" title="12.3.2. Parsing Queries">Section 12.3.2</a>).
+ For example, suppose we have these entries in
+ <code class="filename">$SHAREDIR/tsearch_data/synonym_sample.syn</code>:
+</p><pre class="programlisting">
+postgres pgsql
+postgresql pgsql
+postgre pgsql
+gogle googl
+indices index*
+</pre><p>
+ Then we will get these results:
+</p><pre class="screen">
+mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
+mydb=# SELECT ts_lexize('syn', 'indices');
+ ts_lexize
+-----------
+ {index}
+(1 row)
+
+mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
+mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
+mydb=# SELECT to_tsvector('tst', 'indices');
+ to_tsvector
+-------------
+ 'index':1
+(1 row)
+
+mydb=# SELECT to_tsquery('tst', 'indices');
+ to_tsquery
+------------
+ 'index':*
+(1 row)
+
+mydb=# SELECT 'indexes are very useful'::tsvector;
+ tsvector
+---------------------------------
+ 'are' 'indexes' 'useful' 'very'
+(1 row)
+
+mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices');
+ ?column?
+----------
+ t
+(1 row)
+</pre><p>
+ </p></div><div class="sect2" id="TEXTSEARCH-THESAURUS"><div class="titlepage"><div><div><h3 class="title">12.6.4. Thesaurus Dictionary</h3></div></div></div><p>
+ A thesaurus dictionary (sometimes abbreviated as <acronym class="acronym">TZ</acronym>) is
+ a collection of words that includes information about the relationships
+ of words and phrases, i.e., broader terms (<acronym class="acronym">BT</acronym>), narrower
+ terms (<acronym class="acronym">NT</acronym>), preferred terms, non-preferred terms, related
+ terms, etc.
+ </p><p>
+ Basically, a thesaurus dictionary replaces all non-preferred terms with one
+ preferred term and, optionally, preserves the original terms for indexing
+ as well. <span class="productname">PostgreSQL</span>'s current implementation of the
+ thesaurus dictionary is an extension of the synonym dictionary with added
+ <em class="firstterm">phrase</em> support. A thesaurus dictionary requires
+ a configuration file of the following format:
+
+</p><pre class="programlisting">
+# this is a comment
+sample word(s) : indexed word(s)
+more sample word(s) : more indexed word(s)
+...
+</pre><p>
+
+ where the colon (<code class="symbol">:</code>) symbol acts as a delimiter between a
+ phrase and its replacement.
+ </p><p>
+ A thesaurus dictionary uses a <em class="firstterm">subdictionary</em> (which
+ is specified in the dictionary's configuration) to normalize the input
+ text before checking for phrase matches. It is only possible to select one
+ subdictionary. An error is reported if the subdictionary fails to
+ recognize a word. In that case, you should remove the use of the word or
+ teach the subdictionary about it. You can place an asterisk
+ (<code class="symbol">*</code>) at the beginning of an indexed word to skip applying
+ the subdictionary to it, but all sample words <span class="emphasis"><em>must</em></span> be known
+ to the subdictionary.
+ </p><p>
+ The thesaurus dictionary chooses the longest match if there are multiple
+ phrases matching the input, and ties are broken by using the last
+ definition.
+ </p><p>
+ Specific stop words recognized by the subdictionary cannot be
+ specified; instead use <code class="literal">?</code> to mark the location where any
+ stop word can appear. For example, assuming that <code class="literal">a</code> and
+ <code class="literal">the</code> are stop words according to the subdictionary:
+
+</p><pre class="programlisting">
+? one ? two : swsw
+</pre><p>
+
+ matches <code class="literal">a one the two</code> and <code class="literal">the one a two</code>;
+ both would be replaced by <code class="literal">swsw</code>.
+ </p><p>
+ Since a thesaurus dictionary has the capability to recognize phrases, it
+ must remember its state and interact with the parser. A thesaurus dictionary
+ uses the token-type assignments in the configuration to check whether it should
+ handle the next word or stop accumulating. The thesaurus dictionary must be configured
+ carefully. For example, if the thesaurus dictionary is assigned to handle
+ only the <code class="literal">asciiword</code> token, then a thesaurus dictionary
+ definition like <code class="literal">one 7</code> will not work since token type
+ <code class="literal">uint</code> is not assigned to the thesaurus dictionary.
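+ To check which token types a phrase produces, and therefore which mappings
+ the thesaurus needs, one can inspect the parser output; this sketch assumes
+ the built-in <code class="literal">english</code> configuration:
+
+</p><pre class="programlisting">
+SELECT alias, token FROM ts_debug('english', 'one 7');
+</pre><p>
+ With the default parser, <code class="literal">one</code> should be reported as
+ <code class="literal">asciiword</code> and <code class="literal">7</code> as <code class="literal">uint</code>.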
+ </p><div class="caution"><h3 class="title">Caution</h3><p>
+ Thesauruses are used during indexing, so any change in the thesaurus
+ dictionary's parameters <span class="emphasis"><em>requires</em></span> reindexing.
+ For most other dictionary types, small changes such as adding or
+ removing stop words do not force reindexing.
+ </p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-CONFIG"><div class="titlepage"><div><div><h4 class="title">12.6.4.1. Thesaurus Configuration</h4></div></div></div><p>
+ To define a new thesaurus dictionary, use the <code class="literal">thesaurus</code>
+ template. For example:
+
+</p><pre class="programlisting">
+CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
+ TEMPLATE = thesaurus,
+ DictFile = mythesaurus,
+ Dictionary = pg_catalog.english_stem
+);
+</pre><p>
+
+ Here:
+ </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+ <code class="literal">thesaurus_simple</code> is the new dictionary's name
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ <code class="literal">mythesaurus</code> is the base name of the thesaurus
+ configuration file.
+ (Its full name will be <code class="filename">$SHAREDIR/tsearch_data/mythesaurus.ths</code>,
+ where <code class="literal">$SHAREDIR</code> means the installation shared-data
+ directory.)
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ <code class="literal">pg_catalog.english_stem</code> is the subdictionary (here,
+ a Snowball English stemmer) to use for thesaurus normalization.
+ Notice that the subdictionary will have its own
+ configuration (for example, stop words), which is not shown here.
+ </p></li></ul></div><p>
+
+ Now it is possible to bind the thesaurus dictionary <code class="literal">thesaurus_simple</code>
+ to the desired token types in a configuration, for example:
+
+</p><pre class="programlisting">
+ALTER TEXT SEARCH CONFIGURATION russian
+ ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
+ WITH thesaurus_simple;
+</pre><p>
+ </p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-EXAMPLES"><div class="titlepage"><div><div><h4 class="title">12.6.4.2. Thesaurus Example</h4></div></div></div><p>
+ Consider a simple astronomical thesaurus <code class="literal">thesaurus_astro</code>,
+ which contains some astronomical word combinations:
+
+</p><pre class="programlisting">
+supernovae stars : sn
+crab nebulae : crab
+</pre><p>
+
+ Below we create a dictionary and bind some token types to
+ an astronomical thesaurus and English stemmer:
+
+</p><pre class="programlisting">
+CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
+ TEMPLATE = thesaurus,
+ DictFile = thesaurus_astro,
+ Dictionary = english_stem
+);
+
+ALTER TEXT SEARCH CONFIGURATION russian
+ ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
+ WITH thesaurus_astro, english_stem;
+</pre><p>
+
+ Now we can see how it works.
+ <code class="function">ts_lexize</code> is not very useful for testing a thesaurus,
+ because it treats its input as a single token. Instead we can use
+ <code class="function">plainto_tsquery</code> and <code class="function">to_tsvector</code>
+ which will break their input strings into multiple tokens:
+
+</p><pre class="screen">
+SELECT plainto_tsquery('supernova star');
+ plainto_tsquery
+-----------------
+ 'sn'
+
+SELECT to_tsvector('supernova star');
+ to_tsvector
+-------------
+ 'sn':1
+</pre><p>
+
+ In principle, you can use <code class="function">to_tsquery</code> if you quote
+ the argument:
+
+</p><pre class="screen">
+SELECT to_tsquery('''supernova star''');
+ to_tsquery
+------------
+ 'sn'
+</pre><p>
+
+ Notice that <code class="literal">supernova star</code> matches <code class="literal">supernovae
+ stars</code> in <code class="literal">thesaurus_astro</code> because we specified
+ the <code class="literal">english_stem</code> stemmer in the thesaurus definition.
+ The stemmer removed the <code class="literal">e</code> and <code class="literal">s</code>.
+ </p><p>
+ To index the original phrase as well as the substitute, just include it
+ in the right-hand part of the definition:
+
+</p><pre class="screen">
+supernovae stars : sn supernovae stars
+
+SELECT plainto_tsquery('supernova star');
+ plainto_tsquery
+-----------------------------
+ 'sn' &amp; 'supernova' &amp; 'star'
+</pre><p>
+ </p></div></div><div class="sect2" id="TEXTSEARCH-ISPELL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.5. <span class="application">Ispell</span> Dictionary</h3></div></div></div><p>
+ The <span class="application">Ispell</span> dictionary template supports
+ <em class="firstterm">morphological dictionaries</em>, which can normalize many
+ different linguistic forms of a word into the same lexeme. For example,
+ an English <span class="application">Ispell</span> dictionary can match all declensions and
+ conjugations of the search term <code class="literal">bank</code>, e.g.,
+ <code class="literal">banking</code>, <code class="literal">banked</code>, <code class="literal">banks</code>,
+ <code class="literal">banks'</code>, and <code class="literal">bank's</code>.
+ </p><p>
+ The standard <span class="productname">PostgreSQL</span> distribution does
+ not include any <span class="application">Ispell</span> configuration files.
+ Dictionaries for a large number of languages are available from <a class="ulink" href="https://www.cs.hmc.edu/~geoff/ispell.html" target="_top">Ispell</a>.
+ Also, some more modern dictionary file formats are supported — <a class="ulink" href="https://en.wikipedia.org/wiki/MySpell" target="_top">MySpell</a> (OO &lt; 2.0.1)
+ and <a class="ulink" href="https://sourceforge.net/projects/hunspell/" target="_top">Hunspell</a>
+ (OO &gt;= 2.0.2). A large list of dictionaries is available on the <a class="ulink" href="https://wiki.openoffice.org/wiki/Dictionaries" target="_top">OpenOffice
+ Wiki</a>.
+ </p><p>
+ To create an <span class="application">Ispell</span> dictionary perform these steps:
+ </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+ download dictionary configuration files. <span class="productname">OpenOffice</span>
+ extension files have the <code class="filename">.oxt</code> extension. Extract the
+ <code class="filename">.aff</code> and <code class="filename">.dic</code> files and change their
+ extensions to <code class="filename">.affix</code> and <code class="filename">.dict</code>. Some
+ dictionary files must also be converted to the UTF-8
+ encoding, using commands such as these (for example, for a Norwegian language dictionary):
+</p><pre class="programlisting">
+iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
+iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
+</pre><p>
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ copy files to the <code class="filename">$SHAREDIR/tsearch_data</code> directory
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ load files into PostgreSQL with the following command:
+</p><pre class="programlisting">
+CREATE TEXT SEARCH DICTIONARY english_hunspell (
+ TEMPLATE = ispell,
+ DictFile = en_us,
+ AffFile = en_us,
+ Stopwords = english);
+</pre><p>
+ </p></li></ul></div><p>
+ Here, <code class="literal">DictFile</code>, <code class="literal">AffFile</code>, and <code class="literal">StopWords</code>
+ specify the base names of the dictionary, affixes, and stop-words files.
+ The stop-words file has the same format explained above for the
+ <code class="literal">simple</code> dictionary type. The format of the other files is
+ not specified here but is available from the above-mentioned web sites.
+ </p><p>
+ Ispell dictionaries usually recognize a limited set of words, so they
+ should be followed by another broader dictionary; for
+ example, a Snowball dictionary, which recognizes everything.
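+ A sketch of such a chain, assuming the <code class="literal">english_hunspell</code>
+ dictionary created above and a hypothetical configuration named
+ <code class="literal">public.english_hun</code>:
+
+</p><pre class="programlisting">
+CREATE TEXT SEARCH CONFIGURATION public.english_hun ( COPY = pg_catalog.english );
+ALTER TEXT SEARCH CONFIGURATION public.english_hun
+ ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
+ word, hword, hword_part
+ WITH english_hunspell, english_stem;
+</pre><p>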
+ </p><p>
+ The <code class="filename">.affix</code> file of <span class="application">Ispell</span> has the following
+ structure:
+</p><pre class="programlisting">
+prefixes
+flag *A:
+ . &gt; RE # As in enter &gt; reenter
+suffixes
+flag T:
+ E &gt; ST # As in late &gt; latest
+ [^AEIOU]Y &gt; -Y,IEST # As in dirty &gt; dirtiest
+ [AEIOU]Y &gt; EST # As in gray &gt; grayest
+ [^EY] &gt; EST # As in small &gt; smallest
+</pre><p>
+ </p><p>
+ And the <code class="filename">.dict</code> file has the following structure:
+</p><pre class="programlisting">
+lapse/ADGRS
+lard/DGRS
+large/PRTY
+lark/MRS
+</pre><p>
+ </p><p>
+ The format of the <code class="filename">.dict</code> file is:
+</p><pre class="programlisting">
+basic_form/affix_class_name
+</pre><p>
+ </p><p>
+ In the <code class="filename">.affix</code> file every affix flag is described in the
+ following format:
+</p><pre class="programlisting">
+condition &gt; [-stripping_letters,] adding_affix
+</pre><p>
+ </p><p>
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings <code class="literal">[...]</code> and <code class="literal">[^...]</code>.
+ For example, <code class="literal">[AEIOU]Y</code> means that the last letter of the word
+ is <code class="literal">"y"</code> and the penultimate letter is <code class="literal">"a"</code>,
+ <code class="literal">"e"</code>, <code class="literal">"i"</code>, <code class="literal">"o"</code> or <code class="literal">"u"</code>.
+ <code class="literal">[^EY]</code> means that the last letter is neither <code class="literal">"e"</code>
+ nor <code class="literal">"y"</code>.
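+ </p><p>
+ One way to check how such rules behave is to call
+ <code class="function">ts_lexize</code> on a derived form. The following is a
+ sketch only: it assumes an Ispell dictionary with the hypothetical name
+ <code class="literal">english_ispell</code>, built from files containing the
+ sample rules and entries shown above. The entry
+ <code class="literal">large/PRTY</code> carries flag <code class="literal">T</code>
+ and ends in <code class="literal">"e"</code>, so the rule
+ <code class="literal">E &gt; ST</code> relates <code class="literal">largest</code>
+ to <code class="literal">large</code>:
+</p><pre class="programlisting">
+-- english_ispell is an assumed dictionary name; it is not built in
+SELECT ts_lexize('english_ispell', 'largest');
+</pre><p>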
+ </p><p>
+ Ispell dictionaries support splitting compound words, which is
+ a useful feature.
+ Notice that the affix file should specify a special flag using the
+ <code class="literal">compoundwords controlled</code> statement that marks dictionary
+ words that can participate in compound formation:
+
+</p><pre class="programlisting">
+compoundwords controlled z
+</pre><p>
+
+ Here are some examples for the Norwegian language:
+
+</p><pre class="programlisting">
+SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
+ {over,buljong,terning,pakk,mester,assistent}
+SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
+ {sjokoladefabrikk,sjokolade,fabrikk}
+</pre><p>
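+ The <code class="literal">norwegian_ispell</code> dictionary used above is not
+ built in. A minimal sketch of how it might be defined, assuming Norwegian
+ dictionary and affix files have been installed under the hypothetical base
+ name <code class="literal">nn_no</code>:
+</p><pre class="programlisting">
+-- nn_no is an assumed base name for the installed .dict and .affix files
+CREATE TEXT SEARCH DICTIONARY norwegian_ispell (
+    TEMPLATE = ispell,
+    DictFile = nn_no,
+    AffFile = nn_no,
+    StopWords = norwegian
+);
+</pre><p>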
+ </p><p>
+ The <span class="application">MySpell</span> format is a subset of the <span class="application">Hunspell</span> format.
+ The <code class="filename">.affix</code> file of <span class="application">Hunspell</span> has the following
+ structure:
+</p><pre class="programlisting">
+PFX A Y 1
+PFX A 0 re .
+SFX T N 4
+SFX T 0 st e
+SFX T y iest [^aeiou]y
+SFX T 0 est [aeiou]y
+SFX T 0 est [^ey]
+</pre><p>
+ </p><p>
+ The first line of an affix class is the header. The fields of an affix rule are
+ listed after the header:
+ </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+ parameter name (PFX or SFX)
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ flag (name of the affix class)
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ characters to strip from the beginning (for a prefix) or end (for a suffix)
+ of the word
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ affix to add
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ condition that has a format similar to the format of regular expressions.
+ </p></li></ul></div><p>
+ The <code class="filename">.dict</code> file looks like the <code class="filename">.dict</code> file of
+ <span class="application">Ispell</span>:
+</p><pre class="programlisting">
+larder/M
+lardy/RT
+large/RSPMYT
+largehearted
+</pre><p>
+ </p><div class="note"><h3 class="title">Note</h3><p>
+ <span class="application">MySpell</span> does not support compound words.
+ <span class="application">Hunspell</span> has sophisticated support for compound words. At
+ present, <span class="productname">PostgreSQL</span> implements only the basic
+ compound word operations of Hunspell.
+ </p></div></div><div class="sect2" id="TEXTSEARCH-SNOWBALL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.6. <span class="application">Snowball</span> Dictionary</h3></div></div></div><p>
+ The <span class="application">Snowball</span> dictionary template is based on a project
+ by Martin Porter, inventor of the popular Porter stemming algorithm
+ for the English language. Snowball now provides stemming algorithms for
+ many languages (see the <a class="ulink" href="https://snowballstem.org/" target="_top">Snowball
+ site</a> for more information). Each algorithm understands how to
+ reduce common variant forms of words to a base, or stem, spelling within
+ its language. A Snowball dictionary requires a <code class="literal">language</code>
+ parameter to identify which stemmer to use, and can optionally specify a
+ <code class="literal">StopWords</code> file name that gives a list of words to eliminate.
+ (<span class="productname">PostgreSQL</span>'s standard stopword lists are also
+ provided by the Snowball project.)
+ For example, there is a built-in definition equivalent to
+
+</p><pre class="programlisting">
+CREATE TEXT SEARCH DICTIONARY english_stem (
+ TEMPLATE = snowball,
+ Language = english,
+ StopWords = english
+);
+</pre><p>
+
+ The stopword file format is the same as already explained.
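+ </p><p>
+ Because the stop-word list is optional, a stemmer that keeps every word can
+ be defined by leaving it out. A minimal sketch, using the hypothetical
+ dictionary name <code class="literal">english_stem_nostop</code>:
+</p><pre class="programlisting">
+-- english_stem_nostop is an illustrative name; no stop words are removed
+CREATE TEXT SEARCH DICTIONARY english_stem_nostop (
+    TEMPLATE = snowball,
+    Language = english
+);
+</pre><p>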
+ </p><p>
+ A <span class="application">Snowball</span> dictionary recognizes everything, whether
+ or not it is able to simplify the word, so it should be placed
+ at the end of the dictionary list. It is useless to have it
+ before any other dictionary because a token will never pass through it to
+ the next dictionary.
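+ </p><p>
+ The behavior of the built-in <code class="literal">english_stem</code>
+ dictionary can be inspected with <code class="function">ts_lexize</code>.
+ In this short illustration, the first call reduces a word to its stem and the
+ second returns an empty array because the word is a stop word:
+</p><pre class="programlisting">
+SELECT ts_lexize('english_stem', 'stars');
+ {star}
+SELECT ts_lexize('english_stem', 'a');
+ {}
+</pre><p>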
+ </p></div></div><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navfooter"><hr></hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr><tr><td width="40%" align="left" valign="top">12.5. Parsers </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 13.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.7. Configuration Example</td></tr></table></div></body></html> \ No newline at end of file