<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.8. Testing and Debugging Text Search</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch-configuration.html" title="12.7. Configuration Example" /><link rel="next" href="textsearch-indexes.html" title="12.9. Preferred Index Types for Text Search" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.8. Testing and Debugging Text Search</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-configuration.html" title="12.7. Configuration Example">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 16.3 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-indexes.html" title="12.9. Preferred Index Types for Text Search">Next</a></td></tr></table><hr /></div><div class="sect1" id="TEXTSEARCH-DEBUGGING"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.8. Testing and Debugging Text Search <a href="#TEXTSEARCH-DEBUGGING" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-debugging.html#TEXTSEARCH-CONFIGURATION-TESTING">12.8.1. Configuration Testing</a></span></dt><dt><span class="sect2"><a href="textsearch-debugging.html#TEXTSEARCH-PARSER-TESTING">12.8.2. 
Parser Testing</a></span></dt><dt><span class="sect2"><a href="textsearch-debugging.html#TEXTSEARCH-DICTIONARY-TESTING">12.8.3. Dictionary Testing</a></span></dt></dl></div><p>
   The behavior of a custom text search configuration can easily become
   confusing.  The functions described
   in this section are useful for testing text search objects.  You can
   test a complete configuration, or test parsers and dictionaries separately.
  </p><div class="sect2" id="TEXTSEARCH-CONFIGURATION-TESTING"><div class="titlepage"><div><div><h3 class="title">12.8.1. Configuration Testing <a href="#TEXTSEARCH-CONFIGURATION-TESTING" class="id_link">#</a></h3></div></div></div><p>
   The function <code class="function">ts_debug</code> allows easy testing of a
   text search configuration.
  </p><a id="id-1.5.11.11.3.3" class="indexterm"></a><pre class="synopsis">
ts_debug([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>document</code></em> <code class="type">text</code>,
         OUT <em class="replaceable"><code>alias</code></em> <code class="type">text</code>,
         OUT <em class="replaceable"><code>description</code></em> <code class="type">text</code>,
         OUT <em class="replaceable"><code>token</code></em> <code class="type">text</code>,
         OUT <em class="replaceable"><code>dictionaries</code></em> <code class="type">regdictionary[]</code>,
         OUT <em class="replaceable"><code>dictionary</code></em> <code class="type">regdictionary</code>,
         OUT <em class="replaceable"><code>lexemes</code></em> <code class="type">text[]</code>)
         returns setof record
</pre><p>
   <code class="function">ts_debug</code> displays information about every token of
   <em class="replaceable"><code>document</code></em> as produced by the
   parser and processed by the configured dictionaries.  It uses the
   configuration specified by <em class="replaceable"><code>config</code></em>,
   or <code class="varname">default_text_search_config</code> if that argument is
   omitted.
  </p><p>
   <code class="function">ts_debug</code> returns one row for each token identified in the text
   by the parser.  The columns returned are

    </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
       <em class="replaceable"><code>alias</code></em> <code class="type">text</code> — short name of the token type
      </p></li><li class="listitem" style="list-style-type: disc"><p>
       <em class="replaceable"><code>description</code></em> <code class="type">text</code> — description of the
       token type
      </p></li><li class="listitem" style="list-style-type: disc"><p>
       <em class="replaceable"><code>token</code></em> <code class="type">text</code> — text of the token
      </p></li><li class="listitem" style="list-style-type: disc"><p>
       <em class="replaceable"><code>dictionaries</code></em> <code class="type">regdictionary[]</code> — the
       dictionaries selected by the configuration for this token type
      </p></li><li class="listitem" style="list-style-type: disc"><p>
       <em class="replaceable"><code>dictionary</code></em> <code class="type">regdictionary</code> — the dictionary
       that recognized the token, or <code class="literal">NULL</code> if none did
      </p></li><li class="listitem" style="list-style-type: disc"><p>
       <em class="replaceable"><code>lexemes</code></em> <code class="type">text[]</code> — the lexeme(s) produced
       by the dictionary that recognized the token, or <code class="literal">NULL</code> if
       none did; an empty array (<code class="literal">{}</code>) means it was recognized as a
       stop word
      </p></li></ul></div><p>
  </p><p>
   Here is a simple example:

</p><pre class="screen">
SELECT * FROM ts_debug('english', 'a fat  cat sat on a mat - it ate a fat rats');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
 blank     | Space symbols   |       | {}             |              |
 blank     | Space symbols   | -     | {}             |              |
 asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}
</pre><p>
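</p><p>
   When only the stop words are of interest, the output of
   <code class="function">ts_debug</code> can be filtered on the
   <em class="replaceable"><code>lexemes</code></em> column. The following is a
   minimal sketch, assuming the built-in <code class="literal">english</code>
   configuration used above; it relies on the fact, described in the column
   list, that a stop word yields an empty array while an unrecognized token
   yields <code class="literal">NULL</code>:
</p><pre class="programlisting">
-- Sketch: list only the tokens that a dictionary recognized as stop words.
-- Rows whose lexemes value is NULL (for example, blank tokens) do not
-- satisfy the comparison and are therefore left out.
SELECT token
FROM ts_debug('english', 'a fat  cat sat on a mat - it ate a fat rats')
WHERE lexemes = '{}';
</pre><p>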
  </p><p>
   For a more extensive demonstration, we
   first create a <code class="literal">public.english</code> configuration and
   Ispell dictionary for the English language:
  </p><pre class="programlisting">
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION public.english
   ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
</pre><pre class="screen">
SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');
   alias   |   description   |    token    |         dictionaries          |   dictionary   |   lexemes
-----------+-----------------+-------------+-------------------------------+----------------+-------------
 asciiword | Word, all ASCII | The         | {english_ispell,english_stem} | english_ispell | {}
 blank     | Space symbols   |             | {}                            |                |
 asciiword | Word, all ASCII | Brightest   | {english_ispell,english_stem} | english_ispell | {bright}
 blank     | Space symbols   |             | {}                            |                |
 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem   | {supernova}
</pre><p>
   In this example, the word <code class="literal">Brightest</code> was recognized by the
   parser as an <code class="literal">ASCII word</code> (alias <code class="literal">asciiword</code>).
   For this token type the dictionary list is
   <code class="literal">english_ispell</code> and
   <code class="literal">english_stem</code>. The word was recognized by
   <code class="literal">english_ispell</code>, which reduced it to its base form
   <code class="literal">bright</code>. The word <code class="literal">supernovaes</code> is
   unknown to the <code class="literal">english_ispell</code> dictionary, so it
   was passed to the next dictionary, <code class="literal">english_stem</code>,
   which recognized it. This is expected: <code class="literal">english_stem</code>
   is a Snowball dictionary, which recognizes everything, and that is why it
   was placed at the end of the dictionary list.
  </p><p>
   The word <code class="literal">The</code> was recognized by the
   <code class="literal">english_ispell</code> dictionary as a stop word (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS" title="12.6.1. Stop Words">Section 12.6.1</a>) and will not be indexed.
   The spaces are discarded too, since the configuration provides no
   dictionaries at all for them.
  </p><p>
   You can reduce the width of the output by explicitly specifying which columns
   you want to see:

</p><pre class="screen">
SELECT alias, token, dictionary, lexemes
FROM ts_debug('public.english', 'The Brightest supernovaes');
   alias   |    token    |   dictionary   |   lexemes
-----------+-------------+----------------+-------------
 asciiword | The         | english_ispell | {}
 blank     |             |                |
 asciiword | Brightest   | english_ispell | {bright}
 blank     |             |                |
 asciiword | supernovaes | english_stem   | {supernova}
</pre><p>
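</p><p>
   When a configuration chains several dictionaries, it can also help to
   summarize which dictionary ends up handling how many tokens. The query
   below is only a sketch: it assumes that the
   <code class="literal">public.english</code> configuration and the
   <code class="literal">english_ispell</code> dictionary created above exist
   in the current database, and the column aliases
   <code class="literal">dict_name</code> and <code class="literal">tokens</code>
   are chosen purely for illustration:
</p><pre class="programlisting">
-- Sketch: count the tokens recognized by each dictionary.
-- Tokens that no dictionary recognized (dictionary IS NULL) are skipped.
SELECT dictionary::text AS dict_name,
       count(*) AS tokens
FROM ts_debug('public.english', 'The Brightest supernovaes')
WHERE dictionary IS NOT NULL
GROUP BY dict_name
ORDER BY dict_name;
</pre><p>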
  </p></div><div class="sect2" id="TEXTSEARCH-PARSER-TESTING"><div class="titlepage"><div><div><h3 class="title">12.8.2. Parser Testing <a href="#TEXTSEARCH-PARSER-TESTING" class="id_link">#</a></h3></div></div></div><p>
   The following functions allow direct testing of a text search parser.
  </p><a id="id-1.5.11.11.4.3" class="indexterm"></a><pre class="synopsis">
ts_parse(<em class="replaceable"><code>parser_name</code></em> <code class="type">text</code>, <em class="replaceable"><code>document</code></em> <code class="type">text</code>,
         OUT <em class="replaceable"><code>tokid</code></em> <code class="type">integer</code>, OUT <em class="replaceable"><code>token</code></em> <code class="type">text</code>) returns <code class="type">setof record</code>
ts_parse(<em class="replaceable"><code>parser_oid</code></em> <code class="type">oid</code>, <em class="replaceable"><code>document</code></em> <code class="type">text</code>,
         OUT <em class="replaceable"><code>tokid</code></em> <code class="type">integer</code>, OUT <em class="replaceable"><code>token</code></em> <code class="type">text</code>) returns <code class="type">setof record</code>
</pre><p>
   <code class="function">ts_parse</code> parses the given <em class="replaceable"><code>document</code></em>
   and returns a series of records, one for each token produced by
   parsing. Each record includes a <code class="varname">tokid</code> showing the
   assigned token type and a <code class="varname">token</code> which is the text of the
   token.  For example:

</p><pre class="screen">
SELECT * FROM ts_parse('default', '123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number
</pre><p>
  </p><a id="id-1.5.11.11.4.6" class="indexterm"></a><pre class="synopsis">
ts_token_type(<em class="replaceable"><code>parser_name</code></em> <code class="type">text</code>, OUT <em class="replaceable"><code>tokid</code></em> <code class="type">integer</code>,
              OUT <em class="replaceable"><code>alias</code></em> <code class="type">text</code>, OUT <em class="replaceable"><code>description</code></em> <code class="type">text</code>) returns <code class="type">setof record</code>
ts_token_type(<em class="replaceable"><code>parser_oid</code></em> <code class="type">oid</code>, OUT <em class="replaceable"><code>tokid</code></em> <code class="type">integer</code>,
              OUT <em class="replaceable"><code>alias</code></em> <code class="type">text</code>, OUT <em class="replaceable"><code>description</code></em> <code class="type">text</code>) returns <code class="type">setof record</code>
</pre><p>
   <code class="function">ts_token_type</code> returns a table which describes each type of
   token the specified parser can recognize.  For each token type, the table
   gives the integer <code class="varname">tokid</code> that the parser uses to label a
   token of that type, the <code class="varname">alias</code> that names the token type
   in configuration commands, and a short <code class="varname">description</code>.  For
   example:

</p><pre class="screen">
SELECT * FROM ts_token_type('default');
 tokid |      alias      |               description
-------+-----------------+------------------------------------------
     1 | asciiword       | Word, all ASCII
     2 | word            | Word, all letters
     3 | numword         | Word, letters and digits
     4 | email           | Email address
     5 | url             | URL
     6 | host            | Host
     7 | sfloat          | Scientific notation
     8 | version         | Version number
     9 | hword_numpart   | Hyphenated word part, letters and digits
    10 | hword_part      | Hyphenated word part, all letters
    11 | hword_asciipart | Hyphenated word part, all ASCII
    12 | blank           | Space symbols
    13 | tag             | XML tag
    14 | protocol        | Protocol head
    15 | numhword        | Hyphenated word, letters and digits
    16 | asciihword      | Hyphenated word, all ASCII
    17 | hword           | Hyphenated word, all letters
    18 | url_path        | URL path
    19 | file            | File or path name
    20 | float           | Decimal notation
    21 | int             | Signed integer
    22 | uint            | Unsigned integer
    23 | entity          | XML entity
</pre><p>
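</p><p>
   The two functions can be combined to label each parsed token with the
   alias of its token type instead of the bare
   <code class="varname">tokid</code>. This is a minimal sketch that uses only
   the <code class="literal">default</code> parser and the sample document from
   the <code class="function">ts_parse</code> example above; the table aliases
   <code class="literal">p</code> and <code class="literal">t</code> are
   illustrative:
</p><pre class="programlisting">
-- Sketch: show each token together with the alias of its token type.
-- The join matches the tokid reported by ts_parse against the tokid
-- listed by ts_token_type for the same parser.
SELECT p.tokid, t.alias, p.token
FROM ts_parse('default', '123 - a number') AS p
JOIN ts_token_type('default') AS t USING (tokid);
</pre><p>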
   </p></div><div class="sect2" id="TEXTSEARCH-DICTIONARY-TESTING"><div class="titlepage"><div><div><h3 class="title">12.8.3. Dictionary Testing <a href="#TEXTSEARCH-DICTIONARY-TESTING" class="id_link">#</a></h3></div></div></div><p>
    The <code class="function">ts_lexize</code> function facilitates dictionary testing.
   </p><a id="id-1.5.11.11.5.3" class="indexterm"></a><pre class="synopsis">
ts_lexize(<em class="replaceable"><code>dict</code></em> <code class="type">regdictionary</code>, <em class="replaceable"><code>token</code></em> <code class="type">text</code>) returns <code class="type">text[]</code>
</pre><p>
    <code class="function">ts_lexize</code> returns an array of lexemes if the input
    <em class="replaceable"><code>token</code></em> is known to the dictionary,
    an empty array if the token is known to the dictionary but is a stop word,
    or <code class="literal">NULL</code> if the token is unknown to the dictionary.
   </p><p>
    Examples:

</p><pre class="screen">
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
</pre><p>
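</p><p>
   The three possible results can be told apart in a single query. The
   following is a sketch under stated assumptions: it presumes that the
   <code class="literal">english_ispell</code> dictionary from
   <a class="xref" href="textsearch-debugging.html#TEXTSEARCH-CONFIGURATION-TESTING" title="12.8.1. Configuration Testing">Section 12.8.1</a>
   has been created, and the column names <code class="literal">word</code> and
   <code class="literal">status</code> are illustrative only:
</p><pre class="programlisting">
-- Sketch: classify each word by what ts_lexize returns for it.
--   NULL          : the dictionary does not know the word
--   empty array   : the dictionary knows the word as a stop word
--   anything else : the dictionary produced one or more lexemes
SELECT word,
       CASE
           WHEN ts_lexize('english_ispell', word) IS NULL THEN 'unknown'
           WHEN ts_lexize('english_ispell', word) = '{}' THEN 'stop word'
           ELSE 'recognized'
       END AS status
FROM unnest(ARRAY['The', 'Brightest', 'supernovaes']) AS word;
</pre><p>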
   </p><div class="note"><h3 class="title">Note</h3><p>
     The <code class="function">ts_lexize</code> function expects a single
     <span class="emphasis"><em>token</em></span>, not text. Here is a case
     where this can be confusing:

</p><pre class="screen">
SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null;
 ?column?
----------
 t
</pre><p>

     The thesaurus dictionary <code class="literal">thesaurus_astro</code> does know the
     phrase <code class="literal">supernovae stars</code>, but <code class="function">ts_lexize</code>
     returns <code class="literal">NULL</code> because it does not parse the input text; it treats
     the whole string as a single token. Use <code class="function">plainto_tsquery</code> or <code class="function">to_tsvector</code> to
     test thesaurus dictionaries, for example:

</p><pre class="screen">
SELECT plainto_tsquery('supernovae stars');
 plainto_tsquery
-----------------
 'sn'
</pre><p>
    </p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch-configuration.html" title="12.7. Configuration Example">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-indexes.html" title="12.9. Preferred Index Types for Text Search">Next</a></td></tr><tr><td width="40%" align="left" valign="top">12.7. Configuration Example </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 16.3 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.9. Preferred Index Types for Text Search</td></tr></table></div></body></html>