Diffstat (limited to 'doc/src/sgml/html/textsearch-dictionaries.html')
-rw-r--r--  doc/src/sgml/html/textsearch-dictionaries.html | 661
1 file changed, 661 insertions, 0 deletions
diff --git a/doc/src/sgml/html/textsearch-dictionaries.html b/doc/src/sgml/html/textsearch-dictionaries.html new file mode 100644 index 0000000..1ba4782 --- /dev/null +++ b/doc/src/sgml/html/textsearch-dictionaries.html @@ -0,0 +1,661 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.6. Dictionaries</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch-parsers.html" title="12.5. Parsers" /><link rel="next" href="textsearch-configuration.html" title="12.7. Configuration Example" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.6. Dictionaries</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr></table><hr /></div><div class="sect1" id="TEXTSEARCH-DICTIONARIES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.6. Dictionaries</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS">12.6.1. Stop Words</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY">12.6.2. Simple Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SYNONYM-DICTIONARY">12.6.3. Synonym Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS">12.6.4. Thesaurus Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY">12.6.5. <span class="application">Ispell</span> Dictionary</a></span></dt><dt><span class="sect2"><a href="textsearch-dictionaries.html#TEXTSEARCH-SNOWBALL-DICTIONARY">12.6.6. <span class="application">Snowball</span> Dictionary</a></span></dt></dl></div><p> + Dictionaries are used to eliminate words that should not be considered in a + search (<em class="firstterm">stop words</em>), and to <em class="firstterm">normalize</em> words so + that different derived forms of the same word will match. A successfully + normalized word is called a <em class="firstterm">lexeme</em>. Aside from + improving search quality, normalization and removal of stop words reduce the + size of the <code class="type">tsvector</code> representation of a document, thereby + improving performance. Normalization does not always have linguistic meaning + and usually depends on application semantics. 
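+ For example, the built-in <code class="literal">english</code> configuration folds case, stems
+ words, and drops stop words when building a <code class="type">tsvector</code>. A quick
+ illustration (the exact lexemes can vary slightly across
+ <span class="productname">PostgreSQL</span> and <span class="application">Snowball</span> versions):
+
+ </p><pre class="screen">
+SELECT to_tsvector('english', 'The quick brown foxes jumped');
+              to_tsvector
+---------------------------------------
+ 'brown':3 'fox':4 'jump':5 'quick':2
+</pre><p>
+
+ Position 1, which belonged to the stop word <code class="literal">The</code>, is simply
+ omitted, and the remaining words are stored in their normalized form.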
+ </p><p> + Some examples of normalization: + + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + Linguistic — Ispell dictionaries try to reduce input words to a + normalized form; stemmer dictionaries remove word endings + </p></li><li class="listitem" style="list-style-type: disc"><p> + <acronym class="acronym">URL</acronym> locations can be canonicalized to make + equivalent URLs match: + + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + http://www.pgsql.ru/db/mw/index.html + </p></li><li class="listitem" style="list-style-type: disc"><p> + http://www.pgsql.ru/db/mw/ + </p></li><li class="listitem" style="list-style-type: disc"><p> + http://www.pgsql.ru/db/../db/mw/index.html + </p></li></ul></div><p> + </p></li><li class="listitem" style="list-style-type: disc"><p> + Color names can be replaced by their hexadecimal values, e.g., + <code class="literal">red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</code> + </p></li><li class="listitem" style="list-style-type: disc"><p> + If indexing numbers, we can + remove some fractional digits to reduce the range of possible + numbers, so for example <span class="emphasis"><em>3.14</em></span>159265359, + <span class="emphasis"><em>3.14</em></span>15926, <span class="emphasis"><em>3.14</em></span> will be the same + after normalization if only two digits are kept after the decimal point. + </p></li></ul></div><p> + + </p><p> + A dictionary is a program that accepts a token as + input and returns: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + an array of lexemes if the input token is known to the dictionary + (notice that one token can produce more than one lexeme) + </p></li><li class="listitem" style="list-style-type: disc"><p> + a single lexeme with the <code class="literal">TSL_FILTER</code> flag set, to replace + the original token with a new token to be passed to subsequent + dictionaries (a dictionary that does this is called a + <em class="firstterm">filtering dictionary</em>) + </p></li><li class="listitem" style="list-style-type: disc"><p> + an empty array if the dictionary knows the token, but it is a stop word + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">NULL</code> if the dictionary does not recognize the input token + </p></li></ul></div><p> + </p><p> + <span class="productname">PostgreSQL</span> provides predefined dictionaries for + many languages. There are also several predefined templates that can be + used to create new dictionaries with custom parameters. Each predefined + dictionary template is described below. If no existing + template is suitable, it is possible to create new ones; see the + <code class="filename">contrib/</code> area of the <span class="productname">PostgreSQL</span> distribution + for examples. + </p><p> + A text search configuration binds a parser together with a set of + dictionaries to process the parser's output tokens. For each token + type that the parser can return, a separate list of dictionaries is + specified by the configuration. When a token of that type is found + by the parser, each dictionary in the list is consulted in turn, + until some dictionary recognizes it as a known word. 
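+ The <code class="function">ts_debug</code> function shows this consultation for each
+ token; for instance, with the built-in <code class="literal">english</code> configuration
+ (output abbreviated here, and dependent on the installed configuration):
+
+ </p><pre class="screen">
+SELECT alias, token, dictionaries, dictionary, lexemes
+FROM ts_debug('english', 'a cat') WHERE alias = 'asciiword';
+   alias   | token |  dictionaries  |  dictionary  | lexemes
+-----------+-------+----------------+--------------+---------
+ asciiword | a     | {english_stem} | english_stem | {}
+ asciiword | cat   | {english_stem} | english_stem | {cat}
+</pre><p>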
If it is identified + as a stop word, or if no dictionary recognizes the token, it will be + discarded and not indexed or searched for. + Normally, the first dictionary that returns a non-<code class="literal">NULL</code> + output determines the result, and any remaining dictionaries are not + consulted; but a filtering dictionary can replace the given word + with a modified word, which is then passed to subsequent dictionaries. + </p><p> + The general rule for configuring a list of dictionaries + is to place first the most narrow, most specific dictionary, then the more + general dictionaries, finishing with a very general dictionary, like + a <span class="application">Snowball</span> stemmer or <code class="literal">simple</code>, which + recognizes everything. For example, for an astronomy-specific search + (<code class="literal">astro_en</code> configuration) one could bind token type + <code class="type">asciiword</code> (ASCII word) to a synonym dictionary of astronomical + terms, a general English dictionary and a <span class="application">Snowball</span> English + stemmer: + +</p><pre class="programlisting"> +ALTER TEXT SEARCH CONFIGURATION astro_en + ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem; +</pre><p> + </p><p> + A filtering dictionary can be placed anywhere in the list, except at the + end where it'd be useless. Filtering dictionaries are useful to partially + normalize words to simplify the task of later dictionaries. For example, + a filtering dictionary could be used to remove accents from accented + letters, as is done by the <a class="xref" href="unaccent.html" title="F.48. unaccent">unaccent</a> module. + </p><div class="sect2" id="TEXTSEARCH-STOPWORDS"><div class="titlepage"><div><div><h3 class="title">12.6.1. Stop Words</h3></div></div></div><p> + Stop words are words that are very common, appear in almost every + document, and have no discrimination value. Therefore, they can be ignored + in the context of full text searching. For example, every English text + contains words like <code class="literal">a</code> and <code class="literal">the</code>, so it is + useless to store them in an index. However, stop words do affect the + positions in <code class="type">tsvector</code>, which in turn affect ranking: + +</p><pre class="screen"> +SELECT to_tsvector('english', 'in the list of stop words'); + to_tsvector +---------------------------- + 'list':3 'stop':5 'word':6 +</pre><p> + + The missing positions 1,2,4 are because of stop words. Ranks + calculated for documents with and without stop words are quite different: + +</p><pre class="screen"> +SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop')); + ts_rank_cd +------------ + 0.05 + +SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop')); + ts_rank_cd +------------ + 0.1 +</pre><p> + + </p><p> + It is up to the specific dictionary how it treats stop words. For example, + <code class="literal">ispell</code> dictionaries first normalize words and then + look at the list of stop words, while <code class="literal">Snowball</code> stemmers + first check the list of stop words. The reason for the different + behavior is an attempt to decrease noise. + </p></div><div class="sect2" id="TEXTSEARCH-SIMPLE-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.2. 
Simple Dictionary</h3></div></div></div><p> + The <code class="literal">simple</code> dictionary template operates by converting the + input token to lower case and checking it against a file of stop words. + If it is found in the file then an empty array is returned, causing + the token to be discarded. If not, the lower-cased form of the word + is returned as the normalized lexeme. Alternatively, the dictionary + can be configured to report non-stop-words as unrecognized, allowing + them to be passed on to the next dictionary in the list. + </p><p> + Here is an example of a dictionary definition using the <code class="literal">simple</code> + template: + +</p><pre class="programlisting"> +CREATE TEXT SEARCH DICTIONARY public.simple_dict ( + TEMPLATE = pg_catalog.simple, + STOPWORDS = english +); +</pre><p> + + Here, <code class="literal">english</code> is the base name of a file of stop words. + The file's full name will be + <code class="filename">$SHAREDIR/tsearch_data/english.stop</code>, + where <code class="literal">$SHAREDIR</code> means the + <span class="productname">PostgreSQL</span> installation's shared-data directory, + often <code class="filename">/usr/local/share/postgresql</code> (use <code class="command">pg_config + --sharedir</code> to determine it if you're not sure). + The file format is simply a list + of words, one per line. Blank lines and trailing spaces are ignored, + and upper case is folded to lower case, but no other processing is done + on the file contents. + </p><p> + Now we can test our dictionary: + +</p><pre class="screen"> +SELECT ts_lexize('public.simple_dict', 'YeS'); + ts_lexize +----------- + {yes} + +SELECT ts_lexize('public.simple_dict', 'The'); + ts_lexize +----------- + {} +</pre><p> + </p><p> + We can also choose to return <code class="literal">NULL</code>, instead of the lower-cased + word, if it is not found in the stop words file. This behavior is + selected by setting the dictionary's <code class="literal">Accept</code> parameter to + <code class="literal">false</code>. Continuing the example: + +</p><pre class="screen"> +ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false ); + +SELECT ts_lexize('public.simple_dict', 'YeS'); + ts_lexize +----------- + + +SELECT ts_lexize('public.simple_dict', 'The'); + ts_lexize +----------- + {} +</pre><p> + </p><p> + With the default setting of <code class="literal">Accept</code> = <code class="literal">true</code>, + it is only useful to place a <code class="literal">simple</code> dictionary at the end + of a list of dictionaries, since it will never pass on any token to + a following dictionary. Conversely, <code class="literal">Accept</code> = <code class="literal">false</code> + is only useful when there is at least one following dictionary. + </p><div class="caution"><h3 class="title">Caution</h3><p> + Most types of dictionaries rely on configuration files, such as files of + stop words. These files <span class="emphasis"><em>must</em></span> be stored in UTF-8 encoding. + They will be translated to the actual database encoding, if that is + different, when they are read into the server. + </p></div><div class="caution"><h3 class="title">Caution</h3><p> + Normally, a database session will read a dictionary configuration file + only once, when it is first used within the session. If you modify a + configuration file and want to force existing sessions to pick up the + new contents, issue an <code class="command">ALTER TEXT SEARCH DICTIONARY</code> command + on the dictionary. 
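+ For example, simply re-issuing one of its current settings is enough
+ (a sketch reusing <code class="literal">public.simple_dict</code> from the example above):
+
+ </p><pre class="programlisting">
+ALTER TEXT SEARCH DICTIONARY public.simple_dict ( StopWords = english );
+</pre><p>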
This can be a <span class="quote">“<span class="quote">dummy</span>”</span> update that doesn't + actually change any parameter values. + </p></div></div><div class="sect2" id="TEXTSEARCH-SYNONYM-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.3. Synonym Dictionary</h3></div></div></div><p> + This dictionary template is used to create dictionaries that replace a + word with a synonym. Phrases are not supported (use the thesaurus + template (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-THESAURUS" title="12.6.4. Thesaurus Dictionary">Section 12.6.4</a>) for that). A synonym + dictionary can be used to overcome linguistic problems, for example, to + prevent an English stemmer dictionary from reducing the word <span class="quote">“<span class="quote">Paris</span>”</span> to + <span class="quote">“<span class="quote">pari</span>”</span>. It is enough to have a <code class="literal">Paris paris</code> line in the + synonym dictionary and put it before the <code class="literal">english_stem</code> + dictionary. For example: + +</p><pre class="screen"> +SELECT * FROM ts_debug('english', 'Paris'); + alias | description | token | dictionaries | dictionary | lexemes +-----------+-----------------+-------+----------------+--------------+--------- + asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari} + +CREATE TEXT SEARCH DICTIONARY my_synonym ( + TEMPLATE = synonym, + SYNONYMS = my_synonyms +); + +ALTER TEXT SEARCH CONFIGURATION english + ALTER MAPPING FOR asciiword + WITH my_synonym, english_stem; + +SELECT * FROM ts_debug('english', 'Paris'); + alias | description | token | dictionaries | dictionary | lexemes +-----------+-----------------+-------+---------------------------+------------+--------- + asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris} +</pre><p> + </p><p> + The only parameter required by the <code class="literal">synonym</code> template is + <code class="literal">SYNONYMS</code>, which is the base name of its configuration file + — <code class="literal">my_synonyms</code> in the above example. + The file's full name will be + <code class="filename">$SHAREDIR/tsearch_data/my_synonyms.syn</code> + (where <code class="literal">$SHAREDIR</code> means the + <span class="productname">PostgreSQL</span> installation's shared-data directory). + The file format is just one line + per word to be substituted, with the word followed by its synonym, + separated by white space. Blank lines and trailing spaces are ignored. + </p><p> + The <code class="literal">synonym</code> template also has an optional parameter + <code class="literal">CaseSensitive</code>, which defaults to <code class="literal">false</code>. When + <code class="literal">CaseSensitive</code> is <code class="literal">false</code>, words in the synonym file + are folded to lower case, as are input tokens. When it is + <code class="literal">true</code>, words and tokens are not folded to lower case, + but are compared as-is. + </p><p> + An asterisk (<code class="literal">*</code>) can be placed at the end of a synonym + in the configuration file. This indicates that the synonym is a prefix. + The asterisk is ignored when the entry is used in + <code class="function">to_tsvector()</code>, but when it is used in + <code class="function">to_tsquery()</code>, the result will be a query item with + the prefix match marker (see + <a class="xref" href="textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES" title="12.3.2. 
Parsing Queries">Section 12.3.2</a>). + For example, suppose we have these entries in + <code class="filename">$SHAREDIR/tsearch_data/synonym_sample.syn</code>: +</p><pre class="programlisting"> +postgres pgsql +postgresql pgsql +postgre pgsql +gogle googl +indices index* +</pre><p> + Then we will get these results: +</p><pre class="screen"> +mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample'); +mydb=# SELECT ts_lexize('syn', 'indices'); + ts_lexize +----------- + {index} +(1 row) + +mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple); +mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn; +mydb=# SELECT to_tsvector('tst', 'indices'); + to_tsvector +------------- + 'index':1 +(1 row) + +mydb=# SELECT to_tsquery('tst', 'indices'); + to_tsquery +------------ + 'index':* +(1 row) + +mydb=# SELECT 'indexes are very useful'::tsvector; + tsvector +--------------------------------- + 'are' 'indexes' 'useful' 'very' +(1 row) + +mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices'); + ?column? +---------- + t +(1 row) +</pre><p> + </p></div><div class="sect2" id="TEXTSEARCH-THESAURUS"><div class="titlepage"><div><div><h3 class="title">12.6.4. Thesaurus Dictionary</h3></div></div></div><p> + A thesaurus dictionary (sometimes abbreviated as <acronym class="acronym">TZ</acronym>) is + a collection of words that includes information about the relationships + of words and phrases, i.e., broader terms (<acronym class="acronym">BT</acronym>), narrower + terms (<acronym class="acronym">NT</acronym>), preferred terms, non-preferred terms, related + terms, etc. + </p><p> + Basically a thesaurus dictionary replaces all non-preferred terms by one + preferred term and, optionally, preserves the original terms for indexing + as well. <span class="productname">PostgreSQL</span>'s current implementation of the + thesaurus dictionary is an extension of the synonym dictionary with added + <em class="firstterm">phrase</em> support. A thesaurus dictionary requires + a configuration file of the following format: + +</p><pre class="programlisting"> +# this is a comment +sample word(s) : indexed word(s) +more sample word(s) : more indexed word(s) +... +</pre><p> + + where the colon (<code class="symbol">:</code>) symbol acts as a delimiter between a + phrase and its replacement. + </p><p> + A thesaurus dictionary uses a <em class="firstterm">subdictionary</em> (which + is specified in the dictionary's configuration) to normalize the input + text before checking for phrase matches. It is only possible to select one + subdictionary. An error is reported if the subdictionary fails to + recognize a word. In that case, you should remove the use of the word or + teach the subdictionary about it. You can place an asterisk + (<code class="symbol">*</code>) at the beginning of an indexed word to skip applying + the subdictionary to it, but all sample words <span class="emphasis"><em>must</em></span> be known + to the subdictionary. + </p><p> + The thesaurus dictionary chooses the longest match if there are multiple + phrases matching the input, and ties are broken by using the last + definition. + </p><p> + Specific stop words recognized by the subdictionary cannot be + specified; instead use <code class="literal">?</code> to mark the location where any + stop word can appear. 
For example, assuming that <code class="literal">a</code> and + <code class="literal">the</code> are stop words according to the subdictionary: + +</p><pre class="programlisting"> +? one ? two : swsw +</pre><p> + + matches <code class="literal">a one the two</code> and <code class="literal">the one a two</code>; + both would be replaced by <code class="literal">swsw</code>. + </p><p> + Since a thesaurus dictionary has the capability to recognize phrases it + must remember its state and interact with the parser. A thesaurus dictionary + uses these assignments to check if it should handle the next word or stop + accumulation. The thesaurus dictionary must be configured + carefully. For example, if the thesaurus dictionary is assigned to handle + only the <code class="literal">asciiword</code> token, then a thesaurus dictionary + definition like <code class="literal">one 7</code> will not work since token type + <code class="literal">uint</code> is not assigned to the thesaurus dictionary. + </p><div class="caution"><h3 class="title">Caution</h3><p> + Thesauruses are used during indexing so any change in the thesaurus + dictionary's parameters <span class="emphasis"><em>requires</em></span> reindexing. + For most other dictionary types, small changes such as adding or + removing stopwords does not force reindexing. + </p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-CONFIG"><div class="titlepage"><div><div><h4 class="title">12.6.4.1. Thesaurus Configuration</h4></div></div></div><p> + To define a new thesaurus dictionary, use the <code class="literal">thesaurus</code> + template. For example: + +</p><pre class="programlisting"> +CREATE TEXT SEARCH DICTIONARY thesaurus_simple ( + TEMPLATE = thesaurus, + DictFile = mythesaurus, + Dictionary = pg_catalog.english_stem +); +</pre><p> + + Here: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">thesaurus_simple</code> is the new dictionary's name + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">mythesaurus</code> is the base name of the thesaurus + configuration file. + (Its full name will be <code class="filename">$SHAREDIR/tsearch_data/mythesaurus.ths</code>, + where <code class="literal">$SHAREDIR</code> means the installation shared-data + directory.) + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">pg_catalog.english_stem</code> is the subdictionary (here, + a Snowball English stemmer) to use for thesaurus normalization. + Notice that the subdictionary will have its own + configuration (for example, stop words), which is not shown here. + </p></li></ul></div><p> + + Now it is possible to bind the thesaurus dictionary <code class="literal">thesaurus_simple</code> + to the desired token types in a configuration, for example: + +</p><pre class="programlisting"> +ALTER TEXT SEARCH CONFIGURATION russian + ALTER MAPPING FOR asciiword, asciihword, hword_asciipart + WITH thesaurus_simple; +</pre><p> + </p></div><div class="sect3" id="TEXTSEARCH-THESAURUS-EXAMPLES"><div class="titlepage"><div><div><h4 class="title">12.6.4.2. 
Thesaurus Example</h4></div></div></div><p> + Consider a simple astronomical thesaurus <code class="literal">thesaurus_astro</code>, + which contains some astronomical word combinations: + +</p><pre class="programlisting"> +supernovae stars : sn +crab nebulae : crab +</pre><p> + + Below we create a dictionary and bind some token types to + an astronomical thesaurus and English stemmer: + +</p><pre class="programlisting"> +CREATE TEXT SEARCH DICTIONARY thesaurus_astro ( + TEMPLATE = thesaurus, + DictFile = thesaurus_astro, + Dictionary = english_stem +); + +ALTER TEXT SEARCH CONFIGURATION russian + ALTER MAPPING FOR asciiword, asciihword, hword_asciipart + WITH thesaurus_astro, english_stem; +</pre><p> + + Now we can see how it works. + <code class="function">ts_lexize</code> is not very useful for testing a thesaurus, + because it treats its input as a single token. Instead we can use + <code class="function">plainto_tsquery</code> and <code class="function">to_tsvector</code> + which will break their input strings into multiple tokens: + +</p><pre class="screen"> +SELECT plainto_tsquery('supernova star'); + plainto_tsquery +----------------- + 'sn' + +SELECT to_tsvector('supernova star'); + to_tsvector +------------- + 'sn':1 +</pre><p> + + In principle, one can use <code class="function">to_tsquery</code> if you quote + the argument: + +</p><pre class="screen"> +SELECT to_tsquery('''supernova star'''); + to_tsquery +------------ + 'sn' +</pre><p> + + Notice that <code class="literal">supernova star</code> matches <code class="literal">supernovae + stars</code> in <code class="literal">thesaurus_astro</code> because we specified + the <code class="literal">english_stem</code> stemmer in the thesaurus definition. + The stemmer removed the <code class="literal">e</code> and <code class="literal">s</code>. + </p><p> + To index the original phrase as well as the substitute, just include it + in the right-hand part of the definition: + +</p><pre class="screen"> +supernovae stars : sn supernovae stars + +SELECT plainto_tsquery('supernova star'); + plainto_tsquery +----------------------------- + 'sn' & 'supernova' & 'star' +</pre><p> + </p></div></div><div class="sect2" id="TEXTSEARCH-ISPELL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.5. <span class="application">Ispell</span> Dictionary</h3></div></div></div><p> + The <span class="application">Ispell</span> dictionary template supports + <em class="firstterm">morphological dictionaries</em>, which can normalize many + different linguistic forms of a word into the same lexeme. For example, + an English <span class="application">Ispell</span> dictionary can match all declensions and + conjugations of the search term <code class="literal">bank</code>, e.g., + <code class="literal">banking</code>, <code class="literal">banked</code>, <code class="literal">banks</code>, + <code class="literal">banks'</code>, and <code class="literal">bank's</code>. + </p><p> + The standard <span class="productname">PostgreSQL</span> distribution does + not include any <span class="application">Ispell</span> configuration files. + Dictionaries for a large number of languages are available from <a class="ulink" href="https://www.cs.hmc.edu/~geoff/ispell.html" target="_top">Ispell</a>. 
+ Also, some more modern dictionary file formats are supported — <a class="ulink" href="https://en.wikipedia.org/wiki/MySpell" target="_top">MySpell</a> (OO < 2.0.1) + and <a class="ulink" href="https://hunspell.github.io/" target="_top">Hunspell</a> + (OO >= 2.0.2). A large list of dictionaries is available on the <a class="ulink" href="https://wiki.openoffice.org/wiki/Dictionaries" target="_top">OpenOffice + Wiki</a>. + </p><p> + To create an <span class="application">Ispell</span> dictionary perform these steps: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + download dictionary configuration files. <span class="productname">OpenOffice</span> + extension files have the <code class="filename">.oxt</code> extension. It is necessary + to extract <code class="filename">.aff</code> and <code class="filename">.dic</code> files, change + extensions to <code class="filename">.affix</code> and <code class="filename">.dict</code>. For some + dictionary files it is also needed to convert characters to the UTF-8 + encoding with commands (for example, for a Norwegian language dictionary): +</p><pre class="programlisting"> +iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff +iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic +</pre><p> + </p></li><li class="listitem" style="list-style-type: disc"><p> + copy files to the <code class="filename">$SHAREDIR/tsearch_data</code> directory + </p></li><li class="listitem" style="list-style-type: disc"><p> + load files into PostgreSQL with the following command: +</p><pre class="programlisting"> +CREATE TEXT SEARCH DICTIONARY english_hunspell ( + TEMPLATE = ispell, + DictFile = en_us, + AffFile = en_us, + Stopwords = english); +</pre><p> + </p></li></ul></div><p> + Here, <code class="literal">DictFile</code>, <code class="literal">AffFile</code>, and <code class="literal">StopWords</code> + specify the base names of the dictionary, affixes, and stop-words files. + The stop-words file has the same format explained above for the + <code class="literal">simple</code> dictionary type. The format of the other files is + not specified here but is available from the above-mentioned web sites. + </p><p> + Ispell dictionaries usually recognize a limited set of words, so they + should be followed by another broader dictionary; for + example, a Snowball dictionary, which recognizes everything. + </p><p> + The <code class="filename">.affix</code> file of <span class="application">Ispell</span> has the following + structure: +</p><pre class="programlisting"> +prefixes +flag *A: + . > RE # As in enter > reenter +suffixes +flag T: + E > ST # As in late > latest + [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest + [AEIOU]Y > EST # As in gray > grayest + [^EY] > EST # As in small > smallest +</pre><p> + </p><p> + And the <code class="filename">.dict</code> file has the following structure: +</p><pre class="programlisting"> +lapse/ADGRS +lard/DGRS +large/PRTY +lark/MRS +</pre><p> + </p><p> + Format of the <code class="filename">.dict</code> file is: +</p><pre class="programlisting"> +basic_form/affix_class_name +</pre><p> + </p><p> + In the <code class="filename">.affix</code> file every affix flag is described in the + following format: +</p><pre class="programlisting"> +condition > [-stripping_letters,] adding_affix +</pre><p> + </p><p> + Here, condition has a format similar to the format of regular expressions. 
+ It can use groupings <code class="literal">[...]</code> and <code class="literal">[^...]</code>. + For example, <code class="literal">[AEIOU]Y</code> means that the last letter of the word + is <code class="literal">"y"</code> and the penultimate letter is <code class="literal">"a"</code>, + <code class="literal">"e"</code>, <code class="literal">"i"</code>, <code class="literal">"o"</code> or <code class="literal">"u"</code>. + <code class="literal">[^EY]</code> means that the last letter is neither <code class="literal">"e"</code> + nor <code class="literal">"y"</code>. + </p><p> + Ispell dictionaries support splitting compound words; + a useful feature. + Notice that the affix file should specify a special flag using the + <code class="literal">compoundwords controlled</code> statement that marks dictionary + words that can participate in compound formation: + +</p><pre class="programlisting"> +compoundwords controlled z +</pre><p> + + Here are some examples for the Norwegian language: + +</p><pre class="programlisting"> +SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent'); + {over,buljong,terning,pakk,mester,assistent} +SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); + {sjokoladefabrikk,sjokolade,fabrikk} +</pre><p> + </p><p> + <span class="application">MySpell</span> format is a subset of <span class="application">Hunspell</span>. + The <code class="filename">.affix</code> file of <span class="application">Hunspell</span> has the following + structure: +</p><pre class="programlisting"> +PFX A Y 1 +PFX A 0 re . +SFX T N 4 +SFX T 0 st e +SFX T y iest [^aeiou]y +SFX T 0 est [aeiou]y +SFX T 0 est [^ey] +</pre><p> + </p><p> + The first line of an affix class is the header. Fields of an affix rules are + listed after the header: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + parameter name (PFX or SFX) + </p></li><li class="listitem" style="list-style-type: disc"><p> + flag (name of the affix class) + </p></li><li class="listitem" style="list-style-type: disc"><p> + stripping characters from beginning (at prefix) or end (at suffix) of the + word + </p></li><li class="listitem" style="list-style-type: disc"><p> + adding affix + </p></li><li class="listitem" style="list-style-type: disc"><p> + condition that has a format similar to the format of regular expressions. + </p></li></ul></div><p> + The <code class="filename">.dict</code> file looks like the <code class="filename">.dict</code> file of + <span class="application">Ispell</span>: +</p><pre class="programlisting"> +larder/M +lardy/RT +large/RSPMYT +largehearted +</pre><p> + </p><div class="note"><h3 class="title">Note</h3><p> + <span class="application">MySpell</span> does not support compound words. + <span class="application">Hunspell</span> has sophisticated support for compound words. At + present, <span class="productname">PostgreSQL</span> implements only the basic + compound word operations of Hunspell. + </p></div></div><div class="sect2" id="TEXTSEARCH-SNOWBALL-DICTIONARY"><div class="titlepage"><div><div><h3 class="title">12.6.6. <span class="application">Snowball</span> Dictionary</h3></div></div></div><p> + The <span class="application">Snowball</span> dictionary template is based on a project + by Martin Porter, inventor of the popular Porter's stemming algorithm + for the English language. 
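+ For example, the built-in <code class="literal">english_stem</code> dictionary (its
+ equivalent definition appears below) both stems words and discards stop words;
+ a quick check with <code class="function">ts_lexize</code>, whose results may vary
+ slightly between Snowball versions:
+
+ </p><pre class="screen">
+SELECT ts_lexize('english_stem', 'stars');
+ ts_lexize
+-----------
+ {star}
+
+SELECT ts_lexize('english_stem', 'a');
+ ts_lexize
+-----------
+ {}
+</pre><p>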
Snowball now provides stemming algorithms for + many languages (see the <a class="ulink" href="https://snowballstem.org/" target="_top">Snowball + site</a> for more information). Each algorithm understands how to + reduce common variant forms of words to a base, or stem, spelling within + its language. A Snowball dictionary requires a <code class="literal">language</code> + parameter to identify which stemmer to use, and optionally can specify a + <code class="literal">stopword</code> file name that gives a list of words to eliminate. + (<span class="productname">PostgreSQL</span>'s standard stopword lists are also + provided by the Snowball project.) + For example, there is a built-in definition equivalent to + +</p><pre class="programlisting"> +CREATE TEXT SEARCH DICTIONARY english_stem ( + TEMPLATE = snowball, + Language = english, + StopWords = english +); +</pre><p> + + The stopword file format is the same as already explained. + </p><p> + A <span class="application">Snowball</span> dictionary recognizes everything, whether + or not it is able to simplify the word, so it should be placed + at the end of the dictionary list. It is useless to have it + before any other dictionary because a token will never pass through it to + the next dictionary. + </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch-parsers.html" title="12.5. Parsers">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-configuration.html" title="12.7. Configuration Example">Next</a></td></tr><tr><td width="40%" align="left" valign="top">12.5. Parsers </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.7. Configuration Example</td></tr></table></div></body></html>
\ No newline at end of file