<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.8. Testing and Debugging Text Search</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch-configuration.html" title="12.7. Configuration Example" /><link rel="next" href="textsearch-indexes.html" title="12.9. Preferred Index Types for Text Search" /></head><body id="docContent" class="container-fluid col-10"><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.8. Testing and Debugging Text Search</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-configuration.html" title="12.7. Configuration Example">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 14.5 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-indexes.html" title="12.9. Preferred Index Types for Text Search">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="TEXTSEARCH-DEBUGGING"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.8. Testing and Debugging Text Search</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-debugging.html#TEXTSEARCH-CONFIGURATION-TESTING">12.8.1. Configuration Testing</a></span></dt><dt><span class="sect2"><a href="textsearch-debugging.html#TEXTSEARCH-PARSER-TESTING">12.8.2. 
Parser Testing</a></span></dt><dt><span class="sect2"><a href="textsearch-debugging.html#TEXTSEARCH-DICTIONARY-TESTING">12.8.3. Dictionary Testing</a></span></dt></dl></div><p>
The behavior of a custom text search configuration can easily become
confusing. The functions described
in this section are useful for testing text search objects. You can
test a complete configuration, or test parsers and dictionaries separately.
</p><div class="sect2" id="TEXTSEARCH-CONFIGURATION-TESTING"><div class="titlepage"><div><div><h3 class="title">12.8.1. Configuration Testing</h3></div></div></div><p>
The function <code class="function">ts_debug</code> allows easy testing of a
text search configuration.
</p><a id="id-1.5.11.11.3.3" class="indexterm"></a><pre class="synopsis">
ts_debug([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>document</code></em> <code class="type">text</code>,
OUT <em class="replaceable"><code>alias</code></em> <code class="type">text</code>,
OUT <em class="replaceable"><code>description</code></em> <code class="type">text</code>,
OUT <em class="replaceable"><code>token</code></em> <code class="type">text</code>,
OUT <em class="replaceable"><code>dictionaries</code></em> <code class="type">regdictionary[]</code>,
OUT <em class="replaceable"><code>dictionary</code></em> <code class="type">regdictionary</code>,
OUT <em class="replaceable"><code>lexemes</code></em> <code class="type">text[]</code>)
returns setof record
</pre><p>
<code class="function">ts_debug</code> displays information about every token of
<em class="replaceable"><code>document</code></em> as produced by the
parser and processed by the configured dictionaries. It uses the
configuration specified by <em class="replaceable"><code>config</code></em>,
or <code class="varname">default_text_search_config</code> if that argument is
omitted.
</p><p>
<code class="function">ts_debug</code> returns one row for each token identified in the text
by the parser. The columns returned are
</p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
<em class="replaceable"><code>alias</code></em> <code class="type">text</code> — short name of the token type
</p></li><li class="listitem" style="list-style-type: disc"><p>
<em class="replaceable"><code>description</code></em> <code class="type">text</code> — description of the
token type
</p></li><li class="listitem" style="list-style-type: disc"><p>
<em class="replaceable"><code>token</code></em> <code class="type">text</code> — text of the token
</p></li><li class="listitem" style="list-style-type: disc"><p>
<em class="replaceable"><code>dictionaries</code></em> <code class="type">regdictionary[]</code> — the
dictionaries selected by the configuration for this token type
</p></li><li class="listitem" style="list-style-type: disc"><p>
<em class="replaceable"><code>dictionary</code></em> <code class="type">regdictionary</code> — the dictionary
that recognized the token, or <code class="literal">NULL</code> if none did
</p></li><li class="listitem" style="list-style-type: disc"><p>
<em class="replaceable"><code>lexemes</code></em> <code class="type">text[]</code> — the lexeme(s) produced
by the dictionary that recognized the token, or <code class="literal">NULL</code> if
none did; an empty array (<code class="literal">{}</code>) means it was recognized as a
stop word
</p></li></ul></div><p>
</p><p>
Here is a simple example:
</p><pre class="screen">
SELECT * FROM ts_debug('english', 'a fat cat sat on a mat - it ate a fat rats');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | cat   | {english_stem} | english_stem | {cat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | sat   | {english_stem} | english_stem | {sat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | on    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | mat   | {english_stem} | english_stem | {mat}
 blank     | Space symbols   |       | {}             |              |
 blank     | Space symbols   | -     | {}             |              |
 asciiword | Word, all ASCII | it    | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ate   | {english_stem} | english_stem | {ate}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | a     | {english_stem} | english_stem | {}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | fat   | {english_stem} | english_stem | {fat}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | rats  | {english_stem} | english_stem | {rat}
</pre><p>
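The three possible states of the <code class="literal">lexemes</code> column can also be told apart in a query rather than by eye. The following is a minimal sketch that reuses the sample sentence; the <code class="literal">outcome</code> column and its labels are names chosen here purely for illustration.
</p><pre class="programlisting">
-- Classify each non-blank token by what the dictionaries did with it.
SELECT token,
       CASE
           WHEN lexemes IS NULL THEN 'not recognized'      -- no dictionary accepted the token
           WHEN cardinality(lexemes) = 0 THEN 'stop word'  -- recognized, but discarded
           ELSE 'indexed'                                  -- at least one lexeme was produced
       END AS outcome,
       lexemes
FROM ts_debug('english', 'a fat cat sat on a mat - it ate a fat rats')
WHERE alias != 'blank';
</pre><p>
Given the output shown above, this should label <code class="literal">a</code>, <code class="literal">on</code>, and <code class="literal">it</code> as stop words and every other word as indexed.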
</p><p>
For a more extensive demonstration, we
first create a <code class="literal">public.english</code> configuration and
Ispell dictionary for the English language:
</p><pre class="programlisting">
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION public.english
    ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
</pre><pre class="screen">
SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');
   alias   |   description   |    token    |         dictionaries          |   dictionary   |   lexemes
-----------+-----------------+-------------+-------------------------------+----------------+-------------
 asciiword | Word, all ASCII | The         | {english_ispell,english_stem} | english_ispell | {}
 blank     | Space symbols   |             | {}                            |                |
 asciiword | Word, all ASCII | Brightest   | {english_ispell,english_stem} | english_ispell | {bright}
 blank     | Space symbols   |             | {}                            |                |
 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem   | {supernova}
</pre><p>
In this example, the word <code class="literal">Brightest</code> was recognized by the
parser as an <code class="literal">ASCII word</code> (alias <code class="literal">asciiword</code>).
For this token type the dictionary list is
<code class="literal">english_ispell</code> and
<code class="literal">english_stem</code>. The word was recognized by
<code class="literal">english_ispell</code>, which reduced it to its base form
<code class="literal">bright</code>. The word <code class="literal">supernovaes</code> is
unknown to the <code class="literal">english_ispell</code> dictionary, so it
was passed to the next dictionary in the list,
<code class="literal">english_stem</code>, which recognized it. Because
<code class="literal">english_stem</code> is a Snowball dictionary that
recognizes everything, it is placed at the end of the dictionary list.
</p><p>
The word <code class="literal">The</code> was recognized by the
<code class="literal">english_ispell</code> dictionary as a stop word (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS" title="12.6.1. Stop Words">Section 12.6.1</a>) and will not be indexed.
The spaces are discarded too, since the configuration provides no
dictionaries at all for them.
</p><p>
You can reduce the width of the output by explicitly specifying which columns
you want to see:
</p><pre class="screen">
SELECT alias, token, dictionary, lexemes
FROM ts_debug('public.english', 'The Brightest supernovaes');
   alias   |    token    |   dictionary   |   lexemes
-----------+-------------+----------------+-------------
 asciiword | The         | english_ispell | {}
 blank     |             |                |
 asciiword | Brightest   | english_ispell | {bright}
 blank     |             |                |
 asciiword | supernovaes | english_stem   | {supernova}
</pre><p>
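To check which lexemes actually reach the index, it can help to pass the same text to <code class="function">to_tsvector</code> and compare the result with the <code class="function">ts_debug</code> rows. This is a sketch that assumes the <code class="literal">public.english</code> configuration created above; the stop word and the blank tokens should not appear in the result.
</p><pre class="programlisting">
-- Build the tsvector with the same configuration that ts_debug was given.
SELECT to_tsvector('public.english', 'The Brightest supernovaes');
</pre><p>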
</p></div><div class="sect2" id="TEXTSEARCH-PARSER-TESTING"><div class="titlepage"><div><div><h3 class="title">12.8.2. Parser Testing</h3></div></div></div><p>
The following functions allow direct testing of a text search parser.
</p><a id="id-1.5.11.11.4.3" class="indexterm"></a><pre class="synopsis">
ts_parse(<em class="replaceable"><code>parser_name</code></em> <code class="type">text</code>, <em class="replaceable"><code>document</code></em> <code class="type">text</code>,
OUT <em class="replaceable"><code>tokid</code></em> <code class="type">integer</code>, OUT <em class="replaceable"><code>token</code></em> <code class="type">text</code>) returns <code class="type">setof record</code>
ts_parse(<em class="replaceable"><code>parser_oid</code></em> <code class="type">oid</code>, <em class="replaceable"><code>document</code></em> <code class="type">text</code>,
OUT <em class="replaceable"><code>tokid</code></em> <code class="type">integer</code>, OUT <em class="replaceable"><code>token</code></em> <code class="type">text</code>) returns <code class="type">setof record</code>
</pre><p>
<code class="function">ts_parse</code> parses the given <em class="replaceable"><code>document</code></em>
and returns a series of records, one for each token produced by
parsing. Each record includes a <code class="varname">tokid</code> showing the
assigned token type and a <code class="varname">token</code> which is the text of the
token. For example:
</p><pre class="screen">
SELECT * FROM ts_parse('default', '123 - a number');
 tokid | token
-------+--------
    22 | 123
    12 |
    12 | -
     1 | a
    12 |
     1 | number
</pre><p>
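The second form takes the parser's OID instead of its name. A minimal sketch, assuming the built-in parser is the <code class="literal">pg_ts_parser</code> catalog row named <code class="literal">default</code> in the <code class="literal">pg_catalog</code> schema; it should return the same rows as the name-based call:
</p><pre class="programlisting">
-- Look up the parser OID in the system catalog, then parse by OID.
SELECT *
FROM ts_parse(
    (SELECT oid
     FROM pg_ts_parser
     WHERE prsname = 'default'
       AND prsnamespace = 'pg_catalog'::regnamespace),
    '123 - a number');
</pre><p>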
</p><a id="id-1.5.11.11.4.6" class="indexterm"></a><pre class="synopsis">
ts_token_type(<em class="replaceable"><code>parser_name</code></em> <code class="type">text</code>, OUT <em class="replaceable"><code>tokid</code></em> <code class="type">integer</code>,
OUT <em class="replaceable"><code>alias</code></em> <code class="type">text</code>, OUT <em class="replaceable"><code>description</code></em> <code class="type">text</code>) returns <code class="type">setof record</code>
ts_token_type(<em class="replaceable"><code>parser_oid</code></em> <code class="type">oid</code>, OUT <em class="replaceable"><code>tokid</code></em> <code class="type">integer</code>,
OUT <em class="replaceable"><code>alias</code></em> <code class="type">text</code>, OUT <em class="replaceable"><code>description</code></em> <code class="type">text</code>) returns <code class="type">setof record</code>
</pre><p>
<code class="function">ts_token_type</code> returns a table which describes each type of
token the specified parser can recognize. For each token type, the table
gives the integer <code class="varname">tokid</code> that the parser uses to label a
token of that type, the <code class="varname">alias</code> that names the token type
in configuration commands, and a short <code class="varname">description</code>. For
example:
</p><pre class="screen">
SELECT * FROM ts_token_type('default');
 tokid |      alias      |               description
-------+-----------------+------------------------------------------
     1 | asciiword       | Word, all ASCII
     2 | word            | Word, all letters
     3 | numword         | Word, letters and digits
     4 | email           | Email address
     5 | url             | URL
     6 | host            | Host
     7 | sfloat          | Scientific notation
     8 | version         | Version number
     9 | hword_numpart   | Hyphenated word part, letters and digits
    10 | hword_part      | Hyphenated word part, all letters
    11 | hword_asciipart | Hyphenated word part, all ASCII
    12 | blank           | Space symbols
    13 | tag             | XML tag
    14 | protocol        | Protocol head
    15 | numhword        | Hyphenated word, letters and digits
    16 | asciihword      | Hyphenated word, all ASCII
    17 | hword           | Hyphenated word, all letters
    18 | url_path        | URL path
    19 | file            | File or path name
    20 | float           | Decimal notation
    21 | int             | Signed integer
    22 | uint            | Unsigned integer
    23 | entity          | XML entity
</pre><p>
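Because <code class="function">ts_parse</code> reports only the numeric <code class="varname">tokid</code>, joining its output to <code class="function">ts_token_type</code> makes the token types readable. A minimal sketch; the aliases <code class="literal">p</code> and <code class="literal">t</code> and the <code class="literal">pos</code> ordinality column are names chosen here for illustration:
</p><pre class="programlisting">
-- Attach the token type alias and description to every parsed token.
-- WITH ORDINALITY numbers the tokens so that document order survives the join.
SELECT p.tokid, t.alias, t.description, p.token
FROM ts_parse('default', '123 - a number') WITH ORDINALITY AS p(tokid, token, pos)
JOIN ts_token_type('default') AS t USING (tokid)
ORDER BY p.pos;
</pre><p>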
</p></div><div class="sect2" id="TEXTSEARCH-DICTIONARY-TESTING"><div class="titlepage"><div><div><h3 class="title">12.8.3. Dictionary Testing</h3></div></div></div><p>
The <code class="function">ts_lexize</code> function facilitates dictionary testing.
</p><a id="id-1.5.11.11.5.3" class="indexterm"></a><pre class="synopsis">
ts_lexize(<em class="replaceable"><code>dict</code></em> <code class="type">regdictionary</code>, <em class="replaceable"><code>token</code></em> <code class="type">text</code>) returns <code class="type">text[]</code>
</pre><p>
<code class="function">ts_lexize</code> returns an array of lexemes if the input
<em class="replaceable"><code>token</code></em> is known to the dictionary,
an empty array if the token is known to the dictionary but is a stop word, or
<code class="literal">NULL</code> if the token is unknown to the dictionary.
</p><p>
Examples:
</p><pre class="screen">
SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}
</pre><p>
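A dictionary that recognizes everything, such as <code class="literal">english_stem</code>, never returns <code class="literal">NULL</code>, so the third outcome is easier to see with a second dictionary. The following sketch assumes the <code class="literal">english_ispell</code> dictionary created in the configuration testing example exists; the column names are illustrative.
</p><pre class="programlisting">
-- Compare what each dictionary returns for the same tokens.
SELECT word,
       ts_lexize('english_ispell', word) AS ispell_result,
       ts_lexize('english_stem', word)   AS stem_result
FROM unnest(ARRAY['Brightest', 'The', 'supernovaes']) AS word;
</pre><p>
Based on the <code class="function">ts_debug</code> output shown earlier, <code class="literal">english_ispell</code> should return <code class="literal">NULL</code> for <code class="literal">supernovaes</code> and an empty array for <code class="literal">The</code>.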
</p><div class="note"><h3 class="title">Note</h3><p>
The <code class="function">ts_lexize</code> function expects a single
<span class="emphasis"><em>token</em></span>, not text. Here is a case
where this can be confusing:
</p><pre class="screen">
SELECT ts_lexize('thesaurus_astro', 'supernovae stars') IS NULL;
 ?column?
----------
 t
</pre><p>
The thesaurus dictionary <code class="literal">thesaurus_astro</code> does know the
phrase <code class="literal">supernovae stars</code>, but <code class="function">ts_lexize</code>
returns <code class="literal">NULL</code> because it does not parse the input text but treats it as a single
token. Use <code class="function">plainto_tsquery</code> or <code class="function">to_tsvector</code> to
test thesaurus dictionaries, for example:
</p><pre class="screen">
SELECT plainto_tsquery('supernovae stars');
 plainto_tsquery
-----------------
 'sn'
</pre><p>
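The same check can be made with <code class="function">to_tsvector</code>. This sketch assumes, as the example above does, that the default text search configuration maps word tokens to <code class="literal">thesaurus_astro</code>:
</p><pre class="programlisting">
-- The text is parsed first, so the thesaurus sees the phrase as separate tokens.
SELECT to_tsvector('supernovae stars');
</pre><p>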
</p></div></div></div></body></html>