Diffstat (limited to 'doc/src/sgml/html/textsearch-intro.html')
-rw-r--r--   doc/src/sgml/html/textsearch-intro.html   339
1 file changed, 339 insertions(+), 0 deletions(-)
diff --git a/doc/src/sgml/html/textsearch-intro.html b/doc/src/sgml/html/textsearch-intro.html
new file mode 100644
index 0000000..7c5adaa
--- /dev/null
+++ b/doc/src/sgml/html/textsearch-intro.html
@@ -0,0 +1,339 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.1. Introduction</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="textsearch.html" title="Chapter 12. Full Text Search" /><link rel="next" href="textsearch-tables.html" title="12.2. Tables and Indexes" /></head><body id="docContent" class="container-fluid col-10"><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.1. Introduction</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch.html" title="Chapter 12. Full Text Search">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 13.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-tables.html" title="12.2. Tables and Indexes">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="TEXTSEARCH-INTRO"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.1. Introduction</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-DOCUMENT">12.1.1. What Is a Document?</a></span></dt><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-MATCHING">12.1.2. Basic Text Matching</a></span></dt><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-INTRO-CONFIGURATIONS">12.1.3. Configurations</a></span></dt></dl></div><p>
+ Full Text Searching (or just <em class="firstterm">text search</em>) provides
+ the capability to identify natural-language <em class="firstterm">documents</em> that
+ satisfy a <em class="firstterm">query</em>, and optionally to sort them by
+ relevance to the query. The most common type of search
+ is to find all documents containing given <em class="firstterm">query terms</em>
+ and return them in order of their <em class="firstterm">similarity</em> to the
+ query. Notions of <code class="varname">query</code> and
+ <code class="varname">similarity</code> are very flexible and depend on the specific
+ application. The simplest search considers <code class="varname">query</code> as a
+ set of words and <code class="varname">similarity</code> as the frequency of query
+ words in the document.
+ </p><p>
+ Textual search operators have existed in databases for years.
+ <span class="productname">PostgreSQL</span> has
+ <code class="literal">~</code>, <code class="literal">~*</code>, <code class="literal">LIKE</code>, and
+ <code class="literal">ILIKE</code> operators for textual data types, but they lack
+ many essential properties required by modern information systems:
+ </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+ There is no linguistic support, even for English. Regular expressions
+ are not sufficient because they cannot easily handle derived words, e.g.,
+ <code class="literal">satisfies</code> and <code class="literal">satisfy</code>. You might
+ miss documents that contain <code class="literal">satisfies</code>, although you
+ probably would like to find them when searching for
+ <code class="literal">satisfy</code>. It is possible to use <code class="literal">OR</code>
+ to search for multiple derived forms, but this is tedious and error-prone
+ (some words can have several thousand derivatives).
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ They provide no ordering (ranking) of search results, which makes them
+ ineffective when thousands of matching documents are found.
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ They tend to be slow because there is no index support, so they must
+ process all documents for every search.
+ </p></li></ul></div><p>
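+ For instance, the first limitation above is easy to see in a small
+ illustration (assuming the default <code class="literal">english</code>
+ configuration, and using the <code class="function">to_tsvector</code> and
+ <code class="function">to_tsquery</code> functions introduced below):
+
+</p><pre class="programlisting">
+SELECT 'This fully satisfies the requirement' LIKE '%satisfy%';
+ ?column?
+----------
+ f
+
+SELECT to_tsvector('This fully satisfies the requirement') @@ to_tsquery('satisfy');
+ ?column?
+----------
+ t
+</pre><p>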
+ Full text indexing allows documents to be <span class="emphasis"><em>preprocessed</em></span>
+ and an index saved for later rapid searching. Preprocessing includes:
+ </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: none; "><li class="listitem" style="list-style-type: none"><p>
+ <span class="emphasis"><em>Parsing documents into <em class="firstterm">tokens</em></em></span>. It is
+ useful to identify various classes of tokens, e.g., numbers, words,
+ complex words, email addresses, so that they can be processed
+ differently. In principle token classes depend on the specific
+ application, but for most purposes it is adequate to use a predefined
+ set of classes.
+ <span class="productname">PostgreSQL</span> uses a <em class="firstterm">parser</em> to
+ perform this step. A standard parser is provided, and custom parsers
+ can be created for specific needs.
+ </p></li><li class="listitem" style="list-style-type: none"><p>
+ <span class="emphasis"><em>Converting tokens into <em class="firstterm">lexemes</em></em></span>.
+ A lexeme is a string, just like a token, but it has been
+ <em class="firstterm">normalized</em> so that different forms of the same word
+ are made alike. For example, normalization almost always includes
+ folding upper-case letters to lower-case, and often involves removal
+ of suffixes (such as <code class="literal">s</code> or <code class="literal">es</code> in English).
+ This allows searches to find variant forms of the
+ same word, without tediously entering all the possible variants.
+ Also, this step typically eliminates <em class="firstterm">stop words</em>, which
+ are words that are so common that they are useless for searching.
+ (In short, then, tokens are raw fragments of the document text, while
+ lexemes are words that are believed useful for indexing and searching.)
+ <span class="productname">PostgreSQL</span> uses <em class="firstterm">dictionaries</em> to
+ perform this step. Various standard dictionaries are provided, and
+ custom ones can be created for specific needs.
+ </p></li><li class="listitem" style="list-style-type: none"><p>
+ <span class="emphasis"><em>Storing preprocessed documents optimized for
+ searching</em></span>. For example, each document can be represented
+ as a sorted array of normalized lexemes. Along with the lexemes it is
+ often desirable to store positional information to use for
+ <em class="firstterm">proximity ranking</em>, so that a document that
+ contains a more <span class="quote">“<span class="quote">dense</span>”</span> region of query words is
+ assigned a higher rank than one with scattered query words.
+ </p></li></ul></div><p>
+ Dictionaries allow fine-grained control over how tokens are normalized.
+ With appropriate dictionaries, you can:
+ </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+ Define stop words that should not be indexed.
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ Map synonyms to a single word using <span class="application">Ispell</span>.
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ Map phrases to a single word using a thesaurus.
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ Map different variations of a word to a canonical form using
+ an <span class="application">Ispell</span> dictionary.
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ Map different variations of a word to a canonical form using
+ <span class="application">Snowball</span> stemmer rules.
+ </p></li></ul></div><p>
+ A data type <code class="type">tsvector</code> is provided for storing preprocessed
+ documents, along with a type <code class="type">tsquery</code> for representing processed
+ queries (<a class="xref" href="datatype-textsearch.html" title="8.11. Text Search Types">Section 8.11</a>). There are many
+ functions and operators available for these data types
+ (<a class="xref" href="functions-textsearch.html" title="9.13. Text Search Functions and Operators">Section 9.13</a>), the most important of which is
+ the match operator <code class="literal">@@</code>, which we introduce in
+ <a class="xref" href="textsearch-intro.html#TEXTSEARCH-MATCHING" title="12.1.2. Basic Text Matching">Section 12.1.2</a>. Full text searches can be accelerated
+ using indexes (<a class="xref" href="textsearch-indexes.html" title="12.9. GIN and GiST Index Types">Section 12.9</a>).
+ </p><div class="sect2" id="TEXTSEARCH-DOCUMENT"><div class="titlepage"><div><div><h3 class="title">12.1.1. What Is a Document?</h3></div></div></div><a id="id-1.5.11.4.10.2" class="indexterm"></a><p>
+ A <em class="firstterm">document</em> is the unit of searching in a full text search
+ system; for example, a magazine article or email message. The text search
+ engine must be able to parse documents and store associations of lexemes
+ (key words) with their parent document. Later, these associations are
+ used to search for documents that contain query words.
+ </p><p>
+ For searches within <span class="productname">PostgreSQL</span>,
+ a document is normally a textual field within a row of a database table,
+ or possibly a combination (concatenation) of such fields, perhaps stored
+ in several tables or obtained dynamically. In other words, a document can
+ be constructed from different parts for indexing and it might not be
+ stored anywhere as a whole. For example:
+
+</p><pre class="programlisting">
+SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
+FROM messages
+WHERE mid = 12;
+
+SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
+FROM messages m, docs d
+WHERE m.mid = d.did AND m.mid = 12;
+</pre><p>
+ </p><div class="note"><h3 class="title">Note</h3><p>
+ Actually, in these example queries, <code class="function">coalesce</code>
+ should be used to prevent a single <code class="literal">NULL</code> attribute from
+ causing a <code class="literal">NULL</code> result for the whole document.
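+ For example, the first query above could be rewritten as:
+
+</p><pre class="programlisting">
+SELECT coalesce(title, '') || ' ' || coalesce(author, '') || ' ' ||
+       coalesce(abstract, '') || ' ' || coalesce(body, '') AS document
+FROM messages
+WHERE mid = 12;
+</pre><p>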
+ </p></div><p>
+ Another possibility is to store the documents as simple text files in the
+ file system. In this case, the database can be used to store the full text
+ index and to execute searches, and some unique identifier can be used to
+ retrieve the document from the file system. However, retrieving files
+ from outside the database requires superuser permissions or special
+ function support, so this is usually less convenient than keeping all
+ the data inside <span class="productname">PostgreSQL</span>. Also, keeping
+ everything inside the database allows easy access
+ to document metadata to assist in indexing and display.
+ </p><p>
+ For text search purposes, each document must be reduced to the
+ preprocessed <code class="type">tsvector</code> format. Searching and ranking
+ are performed entirely on the <code class="type">tsvector</code> representation
+ of a document — the original text need only be retrieved
+ when the document has been selected for display to a user.
+ We therefore often speak of the <code class="type">tsvector</code> as being the
+ document, but of course it is only a compact representation of
+ the full document.
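+ As a preview of that representation (the details of
+ <code class="function">to_tsvector</code> are covered in later sections;
+ the output shown assumes the <code class="literal">english</code>
+ configuration):
+
+</p><pre class="programlisting">
+SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
+                  to_tsvector
+-----------------------------------------------------
+ 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
+</pre><p>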
+ </p></div><div class="sect2" id="TEXTSEARCH-MATCHING"><div class="titlepage"><div><div><h3 class="title">12.1.2. Basic Text Matching</h3></div></div></div><p>
+ Full text searching in <span class="productname">PostgreSQL</span> is based on
+ the match operator <code class="literal">@@</code>, which returns
+ <code class="literal">true</code> if a <code class="type">tsvector</code>
+ (document) matches a <code class="type">tsquery</code> (query).
+ It doesn't matter which data type is written first:
+
+</p><pre class="programlisting">
+SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
+ ?column?
+----------
+ t
+
+SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
+ ?column?
+----------
+ f
+</pre><p>
+ </p><p>
+ As the above example suggests, a <code class="type">tsquery</code> is not just raw
+ text, any more than a <code class="type">tsvector</code> is. A <code class="type">tsquery</code>
+ contains search terms, which must be already-normalized lexemes, and
+ may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
+ (For syntax details see <a class="xref" href="datatype-textsearch.html#DATATYPE-TSQUERY" title="8.11.2. tsquery">Section 8.11.2</a>.) There are
+ functions <code class="function">to_tsquery</code>, <code class="function">plainto_tsquery</code>,
+ and <code class="function">phraseto_tsquery</code>
+ that are helpful in converting user-written text into a proper
+ <code class="type">tsquery</code>, primarily by normalizing words appearing in
+ the text. Similarly, <code class="function">to_tsvector</code> is used to parse and
+ normalize a document string. So in practice a text search match would
+ look more like this:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat &amp; rat');
+ ?column?
+----------
+ t
+</pre><p>
+
+ Observe that this match would not succeed if written as
+
+</p><pre class="programlisting">
+SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat &amp; rat');
+ ?column?
+----------
+ f
+</pre><p>
+
+ since here no normalization of the word <code class="literal">rats</code> will occur.
+ The elements of a <code class="type">tsvector</code> are lexemes, which are assumed
+ already normalized, so <code class="literal">rats</code> does not match <code class="literal">rat</code>.
+ </p><p>
+ The <code class="literal">@@</code> operator also
+ supports <code class="type">text</code> input, allowing explicit conversion of a text
+ string to <code class="type">tsvector</code> or <code class="type">tsquery</code> to be skipped
+ in simple cases. The variants available are:
+
+</p><pre class="programlisting">
+tsvector @@ tsquery
+tsquery @@ tsvector
+text @@ tsquery
+text @@ text
+</pre><p>
+ </p><p>
+ The first two of these we saw already.
+ The form <code class="type">text</code> <code class="literal">@@</code> <code class="type">tsquery</code>
+ is equivalent to <code class="literal">to_tsvector(x) @@ y</code>.
+ The form <code class="type">text</code> <code class="literal">@@</code> <code class="type">text</code>
+ is equivalent to <code class="literal">to_tsvector(x) @@ plainto_tsquery(y)</code>.
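+ For example (assuming <code class="varname">default_text_search_config</code>
+ is set to <code class="literal">english</code>):
+
+</p><pre class="programlisting">
+SELECT 'fat cats ate fat rats' @@ to_tsquery('fat &amp; rat');
+ ?column?
+----------
+ t
+
+SELECT 'fat cats ate fat rats' @@ 'fat rat';
+ ?column?
+----------
+ t
+</pre><p>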
+ </p><p>
+ Within a <code class="type">tsquery</code>, the <code class="literal">&amp;</code> (AND) operator
+ specifies that both its arguments must appear in the document to have a
+ match. Similarly, the <code class="literal">|</code> (OR) operator specifies that
+ at least one of its arguments must appear, while the <code class="literal">!</code> (NOT)
+ operator specifies that its argument must <span class="emphasis"><em>not</em></span> appear in
+ order to have a match.
+ For example, the query <code class="literal">fat &amp; ! rat</code> matches documents that
+ contain <code class="literal">fat</code> but not <code class="literal">rat</code>.
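+ For example:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat &amp; ! rat');
+ ?column?
+----------
+ f
+
+SELECT to_tsvector('fat cats chased a dog') @@ to_tsquery('fat &amp; ! rat');
+ ?column?
+----------
+ t
+</pre><p>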
+ </p><p>
+ Searching for phrases is possible with the help of
+ the <code class="literal">&lt;-&gt;</code> (FOLLOWED BY) <code class="type">tsquery</code> operator, which
+ matches only if its arguments have matches that are adjacent and in the
+ given order. For example:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('fatal error') @@ to_tsquery('fatal &lt;-&gt; error');
+ ?column?
+----------
+ t
+
+SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal &lt;-&gt; error');
+ ?column?
+----------
+ f
+</pre><p>
+
+ There is a more general version of the FOLLOWED BY operator having the
+ form <code class="literal">&lt;<em class="replaceable"><code>N</code></em>&gt;</code>,
+ where <em class="replaceable"><code>N</code></em> is an integer standing for the difference between
+ the positions of the matching lexemes. <code class="literal">&lt;1&gt;</code> is
+ the same as <code class="literal">&lt;-&gt;</code>, while <code class="literal">&lt;2&gt;</code>
+ allows exactly one other lexeme to appear between the matches, and so
+ on. The <code class="literal">phraseto_tsquery</code> function makes use of this
+ operator to construct a <code class="literal">tsquery</code> that can match a multi-word
+ phrase when some of the words are stop words. For example:
+
+</p><pre class="programlisting">
+SELECT phraseto_tsquery('cats ate rats');
+ phraseto_tsquery
+-------------------------------
+ 'cat' &lt;-&gt; 'ate' &lt;-&gt; 'rat'
+
+SELECT phraseto_tsquery('the cats ate the rats');
+ phraseto_tsquery
+-------------------------------
+ 'cat' &lt;-&gt; 'ate' &lt;2&gt; 'rat'
+</pre><p>
+ </p><p>
+ A special case that's sometimes useful is that <code class="literal">&lt;0&gt;</code>
+ can be used to require that two patterns match the same word.
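+ For example, using the prefix-match syntax (<code class="literal">:*</code>)
+ described later in this chapter, the first query below matches because both
+ patterns match the word <code class="literal">fatal</code> at the same
+ position, while the second does not, since its patterns match different
+ words (an illustrative sketch):
+
+</p><pre class="programlisting">
+SELECT to_tsvector('fatal error') @@ to_tsquery('fat:* &lt;0&gt; fatal:*');
+ ?column?
+----------
+ t
+
+SELECT to_tsvector('fatal error') @@ to_tsquery('fat:* &lt;0&gt; err:*');
+ ?column?
+----------
+ f
+</pre><p>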
+ </p><p>
+ Parentheses can be used to control nesting of the <code class="type">tsquery</code>
+ operators. Without parentheses, <code class="literal">|</code> binds least tightly,
+ then <code class="literal">&amp;</code>, then <code class="literal">&lt;-&gt;</code>,
+ and <code class="literal">!</code> most tightly.
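+ For example, the first query below is parsed as
+ <code class="literal">( fat &amp; cat ) | rat</code>, so the lone
+ <code class="literal">rat</code> is enough to match, while the explicitly
+ parenthesized form also requires <code class="literal">fat</code>:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('just a rat') @@ to_tsquery('fat &amp; cat | rat');
+ ?column?
+----------
+ t
+
+SELECT to_tsvector('just a rat') @@ to_tsquery('fat &amp; (cat | rat)');
+ ?column?
+----------
+ f
+</pre><p>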
+ </p><p>
+ It's worth noticing that the AND/OR/NOT operators mean something subtly
+ different when they are within the arguments of a FOLLOWED BY operator
+ than when they are not, because within FOLLOWED BY the exact position of
+ the match is significant. For example, normally <code class="literal">!x</code> matches
+ only documents that do not contain <code class="literal">x</code> anywhere.
+ But <code class="literal">!x &lt;-&gt; y</code> matches <code class="literal">y</code> if it is not
+ immediately after an <code class="literal">x</code>; an occurrence of <code class="literal">x</code>
+ elsewhere in the document does not prevent a match. Another example is
+ that <code class="literal">x &amp; y</code> normally only requires that <code class="literal">x</code>
+ and <code class="literal">y</code> both appear somewhere in the document, but
+ <code class="literal">(x &amp; y) &lt;-&gt; z</code> requires <code class="literal">x</code>
+ and <code class="literal">y</code> to match at the same place, immediately before
+ a <code class="literal">z</code>. Thus this query behaves differently from
+ <code class="literal">x &lt;-&gt; z &amp; y &lt;-&gt; z</code>, which will match a
+ document containing two separate sequences <code class="literal">x z</code> and
+ <code class="literal">y z</code>. (This specific query is useless as written,
+ since <code class="literal">x</code> and <code class="literal">y</code> could not match at the same place;
+ but with more complex situations such as prefix-match patterns, a query
+ of this form could be useful.)
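+ The first point can be illustrated like this (a small sketch; the second
+ query matches because the later <code class="literal">error</code> is not
+ immediately preceded by <code class="literal">fatal</code>):
+
+</p><pre class="programlisting">
+SELECT to_tsvector('fatal error') @@ to_tsquery('!fatal &lt;-&gt; error');
+ ?column?
+----------
+ f
+
+SELECT to_tsvector('fatal error and minor error') @@ to_tsquery('!fatal &lt;-&gt; error');
+ ?column?
+----------
+ t
+</pre><p>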
+ </p></div><div class="sect2" id="TEXTSEARCH-INTRO-CONFIGURATIONS"><div class="titlepage"><div><div><h3 class="title">12.1.3. Configurations</h3></div></div></div><p>
+ The above are all simple text search examples. As mentioned before, full
+ text search functionality includes the ability to do many more things:
+ skip indexing certain words (stop words), process synonyms, and use
+ sophisticated parsing, e.g., parse based on more than just white space.
+ This functionality is controlled by <em class="firstterm">text search
+ configurations</em>. <span class="productname">PostgreSQL</span> comes with predefined
+ configurations for many languages, and you can easily create your own
+ configurations. (<span class="application">psql</span>'s <code class="command">\dF</code> command
+ shows all available configurations.)
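+ For example (output abbreviated; the exact set of configurations and the
+ display details vary between installations):
+
+</p><pre class="programlisting">
+=&gt; \dF english
+            List of text search configurations
+   Schema   |  Name   |            Description
+------------+---------+------------------------------------
+ pg_catalog | english | configuration for english language
+</pre><p>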
+ </p><p>
+ During installation an appropriate configuration is selected and
+ <a class="xref" href="runtime-config-client.html#GUC-DEFAULT-TEXT-SEARCH-CONFIG">default_text_search_config</a> is set accordingly
+ in <code class="filename">postgresql.conf</code>. If you are using the same text search
+ configuration for the entire cluster you can use the value in
+ <code class="filename">postgresql.conf</code>. To use different configurations
+ throughout the cluster but the same configuration within any one database,
+ use <code class="command">ALTER DATABASE ... SET</code>. Otherwise, you can set
+ <code class="varname">default_text_search_config</code> in each session.
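+ For example (sketches only; <code class="literal">mydb</code> is a
+ hypothetical database name):
+
+</p><pre class="programlisting">
+-- cluster-wide default, normally set in postgresql.conf
+SHOW default_text_search_config;
+
+-- per-database setting
+ALTER DATABASE mydb SET default_text_search_config = 'pg_catalog.english';
+
+-- per-session setting
+SET default_text_search_config = 'pg_catalog.english';
+</pre><p>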
+ </p><p>
+ Each text search function that depends on a configuration has an optional
+ <code class="type">regconfig</code> argument, so that the configuration to use can be
+ specified explicitly. <code class="varname">default_text_search_config</code>
+ is used only when this argument is omitted.
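+ For example, these two calls are equivalent when
+ <code class="varname">default_text_search_config</code> is set to
+ <code class="literal">english</code>:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('english', 'The quick brown foxes');  -- configuration given explicitly
+SELECT to_tsvector('The quick brown foxes');             -- uses default_text_search_config
+</pre><p>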
+ </p><p>
+ To make it easier to build custom text search configurations, a
+ configuration is built up from simpler database objects.
+ <span class="productname">PostgreSQL</span>'s text search facility provides
+ four types of configuration-related database objects:
+ </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+ <em class="firstterm">Text search parsers</em> break documents into tokens
+ and classify each token (for example, as words or numbers).
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ <em class="firstterm">Text search dictionaries</em> convert tokens to normalized
+ form and reject stop words.
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ <em class="firstterm">Text search templates</em> provide the functions underlying
+ dictionaries. (A dictionary simply specifies a template and a set
+ of parameters for the template.)
+ </p></li><li class="listitem" style="list-style-type: disc"><p>
+ <em class="firstterm">Text search configurations</em> select a parser and a set
+ of dictionaries to use to normalize the tokens produced by the parser.
+ </p></li></ul></div><p>
+ Text search parsers and templates are built from low-level C functions;
+ therefore it requires C programming ability to develop new ones, and
+ superuser privileges to install one into a database. (There are examples
+ of add-on parsers and templates in the <code class="filename">contrib/</code> area of the
+ <span class="productname">PostgreSQL</span> distribution.) Since dictionaries and
+ configurations just parameterize and connect together some underlying
+ parsers and templates, no special privilege is needed to create a new
+ dictionary or configuration. Examples of creating custom dictionaries and
+ configurations appear later in this chapter.
+ </p></div></div><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navfooter"><hr></hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch.html" title="Chapter 12. Full Text Search">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-tables.html" title="12.2. Tables and Indexes">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 12. Full Text Search </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 13.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.2. Tables and Indexes</td></tr></table></div></body></html> \ No newline at end of file