doc/src/sgml/html/datatype-textsearch.html


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>8.11. Text Search Types</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="datatype-bit.html" title="8.10. Bit String Types" /><link rel="next" href="datatype-uuid.html" title="8.12. UUID Type" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">8.11. Text Search Types</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="datatype-bit.html" title="8.10. Bit String Types">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="datatype.html" title="Chapter 8. Data Types">Up</a></td><th width="60%" align="center">Chapter 8. Data Types</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.6 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="datatype-uuid.html" title="8.12. UUID Type">Next</a></td></tr></table><hr /></div><div class="sect1" id="DATATYPE-TEXTSEARCH"><div class="titlepage"><div><div><h2 class="title" style="clear: both">8.11. Text Search Types</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="datatype-textsearch.html#DATATYPE-TSVECTOR">8.11.1. <code class="type">tsvector</code></a></span></dt><dt><span class="sect2"><a href="datatype-textsearch.html#DATATYPE-TSQUERY">8.11.2. <code class="type">tsquery</code></a></span></dt></dl></div><a id="id-1.5.7.19.2" class="indexterm"></a><a id="id-1.5.7.19.3" class="indexterm"></a><p>
    <span class="productname">PostgreSQL</span> provides two data types that
    are designed to support full text search, which is the activity of
    searching through a collection of natural-language <em class="firstterm">documents</em>
    to locate those that best match a <em class="firstterm">query</em>.
    The <code class="type">tsvector</code> type represents a document in a form optimized
    for text search; the <code class="type">tsquery</code> type similarly represents
    a text query.
    <a class="xref" href="textsearch.html" title="Chapter 12. Full Text Search">Chapter 12</a> provides a detailed explanation of this
    facility, and <a class="xref" href="functions-textsearch.html" title="9.13. Text Search Functions and Operators">Section 9.13</a> summarizes the
    related functions and operators.
   </p><div class="sect2" id="DATATYPE-TSVECTOR"><div class="titlepage"><div><div><h3 class="title">8.11.1. <code class="type">tsvector</code></h3></div></div></div><a id="id-1.5.7.19.5.2" class="indexterm"></a><p>
     A <code class="type">tsvector</code> value is a sorted list of distinct
     <em class="firstterm">lexemes</em>, which are words that have been
     <em class="firstterm">normalized</em> to merge different variants of the same word
     (see <a class="xref" href="textsearch.html" title="Chapter 12. Full Text Search">Chapter 12</a> for details).  Sorting and
     duplicate-elimination are done automatically during input, as shown in
     this example:

</p><pre class="programlisting">
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
</pre><p>

     To represent
     lexemes containing whitespace or punctuation, surround them with quotes:

</p><pre class="programlisting">
SELECT $$the lexeme '    ' contains spaces$$::tsvector;
                 tsvector
-------------------------------------------
 '    ' 'contains' 'lexeme' 'spaces' 'the'
</pre><p>

     (We use dollar-quoted string literals in this example and the next one
     to avoid the confusion of having to double quote marks within the
     literals.)  Embedded quotes and backslashes must be doubled:

</p><pre class="programlisting">
SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
                    tsvector
------------------------------------------------
 'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
</pre><p>

     Optionally, integer <em class="firstterm">positions</em>
     can be attached to lexemes:

</p><pre class="programlisting">
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
                                  tsvector
-------------------------------------------------------------------------------
 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
</pre><p>

     A position normally indicates the source word's location in the
     document.  Positional information can be used for
     <em class="firstterm">proximity ranking</em>.  Position values can
     range from 1 to 16383; larger numbers are silently set to 16383.
     Duplicate positions for the same lexeme are discarded.
    </p><p>
     Lexemes that have positions can further be labeled with a
     <em class="firstterm">weight</em>, which can be <code class="literal">A</code>,
     <code class="literal">B</code>, <code class="literal">C</code>, or <code class="literal">D</code>.
     <code class="literal">D</code> is the default and hence is not shown on output:

</p><pre class="programlisting">
SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
          tsvector
----------------------------
 'a':1A 'cat':5 'fat':2B,4C
</pre><p>

     Weights are typically used to reflect document structure, for example
     by marking title words differently from body words.  Text search
     ranking functions can assign different priorities to the different
     weight markers.
    </p><p>
     It is important to understand that the
     <code class="type">tsvector</code> type itself does not perform any word
     normalization; it assumes the words it is given are normalized
     appropriately for the application.  For example,

</p><pre class="programlisting">
SELECT 'The Fat Rats'::tsvector;
      tsvector
--------------------
 'Fat' 'Rats' 'The'
</pre><p>

     For most English-text-searching applications the above words would
     be considered non-normalized, but <code class="type">tsvector</code> doesn't care.
     Raw document text should usually be passed through
     <code class="function">to_tsvector</code> to normalize the words appropriately
     for searching:

</p><pre class="programlisting">
SELECT to_tsvector('english', 'The Fat Rats');
   to_tsvector
-----------------
 'fat':2 'rat':3
</pre><p>

     Again, see <a class="xref" href="textsearch.html" title="Chapter 12. Full Text Search">Chapter 12</a> for more detail.
    </p></div><div class="sect2" id="DATATYPE-TSQUERY"><div class="titlepage"><div><div><h3 class="title">8.11.2. <code class="type">tsquery</code></h3></div></div></div><a id="id-1.5.7.19.6.2" class="indexterm"></a><p>
     A <code class="type">tsquery</code> value stores lexemes that are to be
     searched for, and can combine them using the Boolean operators
     <code class="literal">&amp;</code> (AND), <code class="literal">|</code> (OR), and
     <code class="literal">!</code> (NOT), as well as the phrase search operator
     <code class="literal">&lt;-&gt;</code> (FOLLOWED BY).  There is also a variant
     <code class="literal">&lt;<em class="replaceable"><code>N</code></em>&gt;</code> of the FOLLOWED BY
     operator, where <em class="replaceable"><code>N</code></em> is an integer constant that
     specifies the distance between the two lexemes being searched
     for.  <code class="literal">&lt;-&gt;</code> is equivalent to <code class="literal">&lt;1&gt;</code>.
    </p><p>
     Parentheses can be used to enforce grouping of these operators.
     In the absence of parentheses, <code class="literal">!</code> (NOT) binds most tightly,
     <code class="literal">&lt;-&gt;</code> (FOLLOWED BY) next most tightly, then
     <code class="literal">&amp;</code> (AND), with <code class="literal">|</code> (OR) binding
     the least tightly.
    </p><p>
     Here are some examples:

</p><pre class="programlisting">
SELECT 'fat &amp; rat'::tsquery;
    tsquery
---------------
 'fat' &amp; 'rat'

SELECT 'fat &amp; (rat | cat)'::tsquery;
          tsquery
---------------------------
 'fat' &amp; ( 'rat' | 'cat' )

SELECT 'fat &amp; rat &amp; ! cat'::tsquery;
        tsquery
------------------------
 'fat' &amp; 'rat' &amp; !'cat'
</pre><p>
    </p><p>
     Optionally, lexemes in a <code class="type">tsquery</code> can be labeled with
     one or more weight letters, which restricts them to match only
     <code class="type">tsvector</code> lexemes with one of those weights:

</p><pre class="programlisting">
SELECT 'fat:ab &amp; cat'::tsquery;
    tsquery
------------------
 'fat':AB &amp; 'cat'
</pre><p>
    </p><p>
     Also, lexemes in a <code class="type">tsquery</code> can be labeled with <code class="literal">*</code>
     to specify prefix matching:
</p><pre class="programlisting">
SELECT 'super:*'::tsquery;
  tsquery
-----------
 'super':*
</pre><p>
     This query will match any word in a <code class="type">tsvector</code> that begins
     with <span class="quote">“<span class="quote">super</span>”</span>.
    </p><p>
     Quoting rules for lexemes are the same as described previously for
     lexemes in <code class="type">tsvector</code>; and, as with <code class="type">tsvector</code>,
     any required normalization of words must be done before converting
     to the <code class="type">tsquery</code> type.  The <code class="function">to_tsquery</code>
     function is convenient for performing such normalization:

</p><pre class="programlisting">
SELECT to_tsquery('Fat:ab &amp; Cats');
    to_tsquery
------------------
 'fat':AB &amp; 'cat'
</pre><p>

     Note that <code class="function">to_tsquery</code> will process prefixes in the same way
     as other words, which means this comparison returns true:

</p><pre class="programlisting">
SELECT to_tsvector( 'postgraduate' ) @@ to_tsquery( 'postgres:*' );
 ?column?
----------
 t
</pre><p>
     because <code class="literal">postgres</code> gets stemmed to <code class="literal">postgr</code>:
</p><pre class="programlisting">
SELECT to_tsvector( 'postgraduate' ), to_tsquery( 'postgres:*' );
  to_tsvector  | to_tsquery
---------------+------------
 'postgradu':1 | 'postgr':*
</pre><p>
     which will match the stemmed form of <code class="literal">postgraduate</code>.
    </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="datatype-bit.html" title="8.10. Bit String Types">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="datatype.html" title="Chapter 8. Data Types">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="datatype-uuid.html" title="8.12. UUID Type">Next</a></td></tr><tr><td width="40%" align="left" valign="top">8.10. Bit String Types </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.6 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 8.12. <acronym class="acronym">UUID</acronym> Type</td></tr></table></div></body></html>