Adding upstream version 14.5. (upstream/14.5, upstream)

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-04 12:15:05 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-05-04 12:15:05 +0000
commit: 46651ce6fe013220ed397add242004d764fc0153 (patch)
tree: 6e5299f990f88e60174a1d3ae6e48eedd2688b2b /doc/src/sgml/html/textsearch-intro.html
parent: Initial commit. (diff)
download: postgresql-14-46651ce6fe013220ed397add242004d764fc0153.tar.xz
postgresql-14-46651ce6fe013220ed397add242004d764fc0153.zip
1 files changed, 339 insertions, 0 deletions
diff --git a/doc/src/sgml/html/textsearch-intro.html b/doc/src/sgml/html/textsearch-intro.html
new file mode 100644
index 0000000..8b19891
--- /dev/null
+++ b/doc/src/sgml/html/textsearch-intro.html
@@ -0,0 +1,339 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.1. Introduction</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch.html" title="Chapter 12. Full Text Search" /><link rel="next" href="textsearch-tables.html" title="12.2. Tables and Indexes" /></head><body id="docContent" class="container-fluid col-10"><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.1. Introduction</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch.html" title="Chapter 12. Full Text Search">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 14.5 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-tables.html" title="12.2. Tables and Indexes">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="TEXTSEARCH-INTRO"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.1. Introduction</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-DOCUMENT">12.1.1. What Is a Document?</a></span></dt><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-MATCHING">12.1.2. Basic Text Matching</a></span></dt><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-INTRO-CONFIGURATIONS">12.1.3. 
Configurations</a></span></dt></dl></div><p>
+   Full Text Searching (or just <em class="firstterm">text search</em>) provides
+   the capability to identify natural-language <em class="firstterm">documents</em> that
+   satisfy a <em class="firstterm">query</em>, and optionally to sort them by
+   relevance to the query.  The most common type of search
+   is to find all documents containing given <em class="firstterm">query terms</em>
+   and return them in order of their <em class="firstterm">similarity</em> to the
+   query.  Notions of <code class="varname">query</code> and
+   <code class="varname">similarity</code> are very flexible and depend on the specific
+   application. The simplest search considers <code class="varname">query</code> as a
+   set of words and <code class="varname">similarity</code> as the frequency of query
+   words in the document.
+  </p><p>
+   Textual search operators have existed in databases for years.
+   <span class="productname">PostgreSQL</span> has
+   <code class="literal">~</code>, <code class="literal">~*</code>, <code class="literal">LIKE</code>, and
+   <code class="literal">ILIKE</code> operators for textual data types, but they lack
+   many essential properties required by modern information systems:
+  </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+     There is no linguistic support, even for English.  Regular expressions
+     are not sufficient because they cannot easily handle derived words, e.g.,
+     <code class="literal">satisfies</code> and <code class="literal">satisfy</code>. You might
+     miss documents that contain <code class="literal">satisfies</code>, although you
+     probably would like to find them when searching for
+     <code class="literal">satisfy</code>. It is possible to use <code class="literal">OR</code>
+     to search for multiple derived forms, but this is tedious and error-prone
+     (some words can have several thousand derivatives).
+    </p></li><li class="listitem" style="list-style-type: disc"><p>
+     They provide no ordering (ranking) of search results, which makes them
+     ineffective when thousands of matching documents are found.
+    </p></li><li class="listitem" style="list-style-type: disc"><p>
+     They tend to be slow because there is no index support, so they must
+     process all documents for every search.
+    </p></li></ul></div><p>
+   Full text indexing allows documents to be <span class="emphasis"><em>preprocessed</em></span>
+   and an index saved for later rapid searching. Preprocessing includes:
+  </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: none; "><li class="listitem" style="list-style-type: none"><p>
+     <span class="emphasis"><em>Parsing documents into <em class="firstterm">tokens</em></em></span>. It is
+     useful to identify various classes of tokens, e.g., numbers, words,
+     complex words, email addresses, so that they can be processed
+     differently.  In principle, token classes depend on the specific
+     application, but for most purposes it is adequate to use a predefined
+     set of classes.
+     <span class="productname">PostgreSQL</span> uses a <em class="firstterm">parser</em> to
+     perform this step.  A standard parser is provided, and custom parsers
+     can be created for specific needs.
+    </p></li><li class="listitem" style="list-style-type: none"><p>
+     <span class="emphasis"><em>Converting tokens into <em class="firstterm">lexemes</em></em></span>.
+     A lexeme is a string, just like a token, but it has been
+     <em class="firstterm">normalized</em> so that different forms of the same word
+     are made alike.  For example, normalization almost always includes
+     folding upper-case letters to lower-case, and often involves removal
+     of suffixes (such as <code class="literal">s</code> or <code class="literal">es</code> in English).
+     This allows searches to find variant forms of the
+     same word, without tediously entering all the possible variants.
+     Also, this step typically eliminates <em class="firstterm">stop words</em>, which
+     are words that are so common that they are useless for searching.
+     (In short, then, tokens are raw fragments of the document text, while
+     lexemes are words that are believed useful for indexing and searching.)
+     <span class="productname">PostgreSQL</span> uses <em class="firstterm">dictionaries</em> to
+     perform this step.  Various standard dictionaries are provided, and
+     custom ones can be created for specific needs.
+    </p></li><li class="listitem" style="list-style-type: none"><p>
+     <span class="emphasis"><em>Storing preprocessed documents optimized for
+     searching</em></span>.  For example, each document can be represented
+     as a sorted array of normalized lexemes. Along with the lexemes it is
+     often desirable to store positional information to use for
+     <em class="firstterm">proximity ranking</em>, so that a document that
+     contains a more <span class="quote">“<span class="quote">dense</span>”</span> region of query words is
+     assigned a higher rank than one with scattered query words.
+    </p></li></ul></div><p>
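+   As a rough illustration of these three steps (a sketch that assumes the
+   built-in <code class="literal">english</code> configuration is available),
+   <code class="function">to_tsvector</code> parses a string into tokens,
+   normalizes them, drops stop words, and returns the sorted lexemes together
+   with their positions:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
+                  to_tsvector
+-----------------------------------------------------
+ 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
+</pre><p>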
+   Dictionaries allow fine-grained control over how tokens are normalized.
+   With appropriate dictionaries, you can:
+  </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+     Define stop words that should not be indexed.
+    </p></li><li class="listitem" style="list-style-type: disc"><p>
+     Map synonyms to a single word using <span class="application">Ispell</span>.
+    </p></li><li class="listitem" style="list-style-type: disc"><p>
+     Map phrases to a single word using a thesaurus.
+    </p></li><li class="listitem" style="list-style-type: disc"><p>
+     Map different variations of a word to a canonical form using
+     an <span class="application">Ispell</span> dictionary.
+    </p></li><li class="listitem" style="list-style-type: disc"><p>
+     Map different variations of a word to a canonical form using
+     <span class="application">Snowball</span> stemmer rules.
+    </p></li></ul></div><p>
+   A data type <code class="type">tsvector</code> is provided for storing preprocessed
+   documents, along with a type <code class="type">tsquery</code> for representing processed
+   queries (<a class="xref" href="datatype-textsearch.html" title="8.11. Text Search Types">Section 8.11</a>).  There are many
+   functions and operators available for these data types
+   (<a class="xref" href="functions-textsearch.html" title="9.13. Text Search Functions and Operators">Section 9.13</a>), the most important of which is
+   the match operator <code class="literal">@@</code>, which we introduce in
+   <a class="xref" href="textsearch-intro.html#TEXTSEARCH-MATCHING" title="12.1.2. Basic Text Matching">Section 12.1.2</a>.  Full text searches can be accelerated
+   using indexes (<a class="xref" href="textsearch-indexes.html" title="12.9. Preferred Index Types for Text Search">Section 12.9</a>).
+  </p><div class="sect2" id="TEXTSEARCH-DOCUMENT"><div class="titlepage"><div><div><h3 class="title">12.1.1. What Is a Document?</h3></div></div></div><a id="id-1.5.11.4.10.2" class="indexterm"></a><p>
+    A <em class="firstterm">document</em> is the unit of searching in a full text search
+    system; for example, a magazine article or email message.  The text search
+    engine must be able to parse documents and store associations of lexemes
+    (key words) with their parent document. Later, these associations are
+    used to search for documents that contain query words.
+   </p><p>
+    For searches within <span class="productname">PostgreSQL</span>,
+    a document is normally a textual field within a row of a database table,
+    or possibly a combination (concatenation) of such fields, perhaps stored
+    in several tables or obtained dynamically. In other words, a document can
+    be constructed from different parts for indexing and it might not be
+    stored anywhere as a whole. For example:
+
+</p><pre class="programlisting">
+SELECT title || ' ' ||  author || ' ' ||  abstract || ' ' || body AS document
+FROM messages
+WHERE mid = 12;
+
+SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
+FROM messages m, docs d
+WHERE m.mid = d.did AND m.mid = 12;
+</pre><p>
+   </p><div class="note"><h3 class="title">Note</h3><p>
+     Actually, in these example queries, <code class="function">coalesce</code>
+     should be used to prevent a single <code class="literal">NULL</code> attribute from
+     causing a <code class="literal">NULL</code> result for the whole document.
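+     A minimal sketch of the first query rewritten this way, assuming the same
+     <code class="literal">messages</code> table and columns as in the example:
+
+</p><pre class="programlisting">
+-- Replace each possibly-NULL column with an empty string before concatenating
+SELECT coalesce(title, '') || ' ' || coalesce(author, '') || ' ' ||
+       coalesce(abstract, '') || ' ' || coalesce(body, '') AS document
+FROM messages
+WHERE mid = 12;
+</pre><p>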
+    </p></div><p>
+    Another possibility is to store the documents as simple text files in the
+    file system. In this case, the database can be used to store the full text
+    index and to execute searches, and some unique identifier can be used to
+    retrieve the document from the file system.  However, retrieving files
+    from outside the database requires superuser permissions or special
+    function support, so this is usually less convenient than keeping all
+    the data inside <span class="productname">PostgreSQL</span>.  Also, keeping
+    everything inside the database allows easy access
+    to document metadata to assist in indexing and display.
+   </p><p>
+    For text search purposes, each document must be reduced to the
+    preprocessed <code class="type">tsvector</code> format.  Searching and ranking
+    are performed entirely on the <code class="type">tsvector</code> representation
+    of a document — the original text need only be retrieved
+    when the document has been selected for display to a user.
+    We therefore often speak of the <code class="type">tsvector</code> as being the
+    document, but of course it is only a compact representation of
+    the full document.
+   </p></div><div class="sect2" id="TEXTSEARCH-MATCHING"><div class="titlepage"><div><div><h3 class="title">12.1.2. Basic Text Matching</h3></div></div></div><p>
+    Full text searching in <span class="productname">PostgreSQL</span> is based on
+    the match operator <code class="literal">@@</code>, which returns
+    <code class="literal">true</code> if a <code class="type">tsvector</code>
+    (document) matches a <code class="type">tsquery</code> (query).
+    It doesn't matter which data type is written first:
+
+</p><pre class="programlisting">
+SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
+ ?column?
+----------
+ t
+
+SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
+ ?column?
+----------
+ f
+</pre><p>
+   </p><p>
+    As the above example suggests, a <code class="type">tsquery</code> is not just raw
+    text, any more than a <code class="type">tsvector</code> is.  A <code class="type">tsquery</code>
+    contains search terms, which must be already-normalized lexemes, and
+    may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
+    (For syntax details see <a class="xref" href="datatype-textsearch.html#DATATYPE-TSQUERY" title="8.11.2. tsquery">Section 8.11.2</a>.)  There are
+    functions <code class="function">to_tsquery</code>, <code class="function">plainto_tsquery</code>,
+    and <code class="function">phraseto_tsquery</code>
+    that are helpful in converting user-written text into a proper
+    <code class="type">tsquery</code>, primarily by normalizing words appearing in
+    the text.  Similarly, <code class="function">to_tsvector</code> is used to parse and
+    normalize a document string.  So in practice a text search match would
+    look more like this:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat &amp; rat');
+ ?column? 
+----------
+ t
+</pre><p>
+
+    Observe that this match would not succeed if written as
+
+</p><pre class="programlisting">
+SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat &amp; rat');
+ ?column? 
+----------
+ f
+</pre><p>
+
+    since here no normalization of the word <code class="literal">rats</code> will occur.
+    The elements of a <code class="type">tsvector</code> are lexemes, which are assumed
+    already normalized, so <code class="literal">rats</code> does not match <code class="literal">rat</code>.
+   </p><p>
+    The <code class="literal">@@</code> operator also
+    supports <code class="type">text</code> input, allowing explicit conversion of a text
+    string to <code class="type">tsvector</code> or <code class="type">tsquery</code> to be skipped
+    in simple cases.  The variants available are:
+
+</p><pre class="programlisting">
+tsvector @@ tsquery
+tsquery  @@ tsvector
+text @@ tsquery
+text @@ text
+</pre><p>
+   </p><p>
+    We have already seen the first two of these.
+    The form <code class="type">text</code> <code class="literal">@@</code> <code class="type">tsquery</code>
+    is equivalent to <code class="literal">to_tsvector(x) @@ y</code>.
+    The form <code class="type">text</code> <code class="literal">@@</code> <code class="type">text</code>
+    is equivalent to <code class="literal">to_tsvector(x) @@ plainto_tsquery(y)</code>.
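+    As a sketch of these equivalences, the two queries in each pair below
+    should return the same result under any given
+    <code class="varname">default_text_search_config</code> (the sample strings
+    are illustrative only):
+
+</p><pre class="programlisting">
+-- text @@ tsquery
+SELECT 'fat cats ate fat rats'::text @@ to_tsquery('fat &amp; rat');
+SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat &amp; rat');
+
+-- text @@ text
+SELECT 'fat cats ate fat rats'::text @@ 'fat rat'::text;
+SELECT to_tsvector('fat cats ate fat rats') @@ plainto_tsquery('fat rat');
+</pre><p>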
+   </p><p>
+    Within a <code class="type">tsquery</code>, the <code class="literal">&amp;</code> (AND) operator
+    specifies that both its arguments must appear in the document to have a
+    match.  Similarly, the <code class="literal">|</code> (OR) operator specifies that
+    at least one of its arguments must appear, while the <code class="literal">!</code> (NOT)
+    operator specifies that its argument must <span class="emphasis"><em>not</em></span> appear in
+    order to have a match.
+    For example, the query <code class="literal">fat &amp; ! rat</code> matches documents that
+    contain <code class="literal">fat</code> but not <code class="literal">rat</code>.
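+    The following sketch exercises each operator against one small document.
+    It assumes the built-in <code class="literal">english</code> configuration,
+    passed explicitly so that the results do not depend on session settings:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('english', 'fat cats ate fat rats') @@ to_tsquery('english', 'fat &amp; cow');
+ ?column?
+----------
+ f
+
+SELECT to_tsvector('english', 'fat cats ate fat rats') @@ to_tsquery('english', 'fat | cow');
+ ?column?
+----------
+ t
+
+SELECT to_tsvector('english', 'fat cats ate fat rats') @@ to_tsquery('english', 'fat &amp; ! rat');
+ ?column?
+----------
+ f
+
+SELECT to_tsvector('english', 'fat cats ate fat rats') @@ to_tsquery('english', 'fat &amp; ! cow');
+ ?column?
+----------
+ t
+</pre><p>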
+   </p><p>
+    Searching for phrases is possible with the help of
+    the <code class="literal">&lt;-&gt;</code> (FOLLOWED BY) <code class="type">tsquery</code> operator, which
+    matches only if its arguments have matches that are adjacent and in the
+    given order.  For example:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('fatal error') @@ to_tsquery('fatal &lt;-&gt; error');
+ ?column? 
+----------
+ t
+
+SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal &lt;-&gt; error');
+ ?column? 
+----------
+ f
+</pre><p>
+
+    There is a more general version of the FOLLOWED BY operator having the
+    form <code class="literal">&lt;<em class="replaceable"><code>N</code></em>&gt;</code>,
+    where <em class="replaceable"><code>N</code></em> is an integer standing for the difference between
+    the positions of the matching lexemes.  <code class="literal">&lt;1&gt;</code> is
+    the same as <code class="literal">&lt;-&gt;</code>, while <code class="literal">&lt;2&gt;</code>
+    allows exactly one other lexeme to appear between the matches, and so
+    on.  The <code class="literal">phraseto_tsquery</code> function makes use of this
+    operator to construct a <code class="literal">tsquery</code> that can match a multi-word
+    phrase when some of the words are stop words.  For example:
+
+</p><pre class="programlisting">
+SELECT phraseto_tsquery('cats ate rats');
+       phraseto_tsquery        
+-------------------------------
+ 'cat' &lt;-&gt; 'ate' &lt;-&gt; 'rat'
+
+SELECT phraseto_tsquery('the cats ate the rats');
+       phraseto_tsquery        
+-------------------------------
+ 'cat' &lt;-&gt; 'ate' &lt;2&gt; 'rat'
+</pre><p>
+   </p><p>
+    A special case that's sometimes useful is that <code class="literal">&lt;0&gt;</code>
+    can be used to require that two patterns match the same word.
+   </p><p>
+    Parentheses can be used to control nesting of the <code class="type">tsquery</code>
+    operators.  Without parentheses, <code class="literal">|</code> binds least tightly,
+    then <code class="literal">&amp;</code>, then <code class="literal">&lt;-&gt;</code>,
+    and <code class="literal">!</code> most tightly.
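+    As a sketch of these precedence rules, the two casts below should parse
+    to the same query, because the parentheses in the second only spell out
+    the default grouping:
+
+</p><pre class="programlisting">
+SELECT 'fat | cat &amp; ! rat'::tsquery;
+SELECT 'fat | (cat &amp; (! rat))'::tsquery;
+</pre><p>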
+   </p><p>
+    It's worth noticing that the AND/OR/NOT operators mean something subtly
+    different when they are within the arguments of a FOLLOWED BY operator
+    than when they are not, because within FOLLOWED BY the exact position of
+    the match is significant.  For example, normally <code class="literal">!x</code> matches
+    only documents that do not contain <code class="literal">x</code> anywhere.
+    But <code class="literal">!x &lt;-&gt; y</code> matches <code class="literal">y</code> if it is not
+    immediately after an <code class="literal">x</code>; an occurrence of <code class="literal">x</code>
+    elsewhere in the document does not prevent a match.  Another example is
+    that <code class="literal">x &amp; y</code> normally only requires that <code class="literal">x</code>
+    and <code class="literal">y</code> both appear somewhere in the document, but
+    <code class="literal">(x &amp; y) &lt;-&gt; z</code> requires <code class="literal">x</code>
+    and <code class="literal">y</code> to match at the same place, immediately before
+    a <code class="literal">z</code>.  Thus this query behaves differently from
+    <code class="literal">x &lt;-&gt; z &amp; y &lt;-&gt; z</code>, which will match a
+    document containing two separate sequences <code class="literal">x z</code> and
+    <code class="literal">y z</code>.  (This specific query is useless as written,
+    since <code class="literal">x</code> and <code class="literal">y</code> could not match at the same place;
+    but with more complex situations such as prefix-match patterns, a query
+    of this form could be useful.)
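+    A small sketch of the first case, assuming the built-in
+    <code class="literal">simple</code> configuration so that the one-letter
+    words are kept as lexemes.  The first query is expected to match, because
+    its <code class="literal">y</code> follows <code class="literal">z</code>
+    even though <code class="literal">x</code> appears elsewhere; the second is
+    not, because its only <code class="literal">y</code> comes immediately
+    after an <code class="literal">x</code>:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('simple', 'x z y') @@ to_tsquery('simple', '!x &lt;-&gt; y');
+SELECT to_tsvector('simple', 'x y') @@ to_tsquery('simple', '!x &lt;-&gt; y');
+</pre><p>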
+   </p></div><div class="sect2" id="TEXTSEARCH-INTRO-CONFIGURATIONS"><div class="titlepage"><div><div><h3 class="title">12.1.3. Configurations</h3></div></div></div><p>
+    The above are all simple text search examples.  As mentioned before, full
+    text search functionality includes the ability to do many more things:
+    skip indexing certain words (stop words), process synonyms, and use
+    sophisticated parsing, e.g., parse based on more than just white space.
+    This functionality is controlled by <em class="firstterm">text search
+    configurations</em>.  <span class="productname">PostgreSQL</span> comes with predefined
+    configurations for many languages, and you can easily create your own
+    configurations.  (<span class="application">psql</span>'s <code class="command">\dF</code> command
+    shows all available configurations.)
+   </p><p>
+    During installation an appropriate configuration is selected and
+    <a class="xref" href="runtime-config-client.html#GUC-DEFAULT-TEXT-SEARCH-CONFIG">default_text_search_config</a> is set accordingly
+    in <code class="filename">postgresql.conf</code>.  If you are using the same text search
+    configuration for the entire cluster you can use the value in
+    <code class="filename">postgresql.conf</code>.  To use different configurations
+    throughout the cluster but the same configuration within any one database,
+    use <code class="command">ALTER DATABASE ... SET</code>.  Otherwise, you can set
+    <code class="varname">default_text_search_config</code> in each session.
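+    A sketch of the per-database and per-session settings; the database name
+    <code class="literal">mydb</code> is hypothetical, and
+    <code class="literal">pg_catalog.english</code> stands in for whichever
+    configuration you want to use:
+
+</p><pre class="programlisting">
+-- Default for every new session in one database
+ALTER DATABASE mydb SET default_text_search_config = 'pg_catalog.english';
+
+-- Override for the current session only
+SET default_text_search_config = 'pg_catalog.english';
+
+-- Check the value currently in effect
+SHOW default_text_search_config;
+</pre><p>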
+   </p><p>
+    Each text search function that depends on a configuration has an optional
+    <code class="type">regconfig</code> argument, so that the configuration to use can be
+    specified explicitly.  <code class="varname">default_text_search_config</code>
+    is used only when this argument is omitted.
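+    For instance (a sketch assuming the built-in
+    <code class="literal">english</code> and <code class="literal">simple</code>
+    configurations), the same strings match or fail to match depending on the
+    configuration named in the call, since <code class="literal">simple</code>
+    does not reduce <code class="literal">rats</code> to
+    <code class="literal">rat</code>:
+
+</p><pre class="programlisting">
+SELECT to_tsvector('english', 'fat cats ate fat rats') @@ to_tsquery('english', 'fat &amp; rat');
+ ?column?
+----------
+ t
+
+SELECT to_tsvector('simple', 'fat cats ate fat rats') @@ to_tsquery('simple', 'fat &amp; rat');
+ ?column?
+----------
+ f
+</pre><p>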
+   </p><p>
+    To make it easier to build custom text search configurations, a
+    configuration is built up from simpler database objects.
+    <span class="productname">PostgreSQL</span>'s text search facility provides
+    four types of configuration-related database objects:
+   </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p>
+     <em class="firstterm">Text search parsers</em> break documents into tokens
+     and classify each token (for example, as words or numbers).
+    </p></li><li class="listitem" style="list-style-type: disc"><p>
+     <em class="firstterm">Text search dictionaries</em> convert tokens to normalized
+     form and reject stop words.
+    </p></li><li class="listitem" style="list-style-type: disc"><p>
+     <em class="firstterm">Text search templates</em> provide the functions underlying
+     dictionaries.  (A dictionary simply specifies a template and a set
+     of parameters for the template.)
+    </p></li><li class="listitem" style="list-style-type: disc"><p>
+     <em class="firstterm">Text search configurations</em> select a parser and a set
+     of dictionaries to use to normalize the tokens produced by the parser.
+    </p></li></ul></div><p>
+    Text search parsers and templates are built from low-level C functions;
+    developing new ones therefore requires C programming ability, and
+    installing one into a database requires superuser privileges.  (There are examples
+    of add-on parsers and templates in the <code class="filename">contrib/</code> area of the
+    <span class="productname">PostgreSQL</span> distribution.)  Since dictionaries and
+    configurations just parameterize and connect together some underlying
+    parsers and templates, no special privilege is needed to create a new
+    dictionary or configuration.  Examples of creating custom dictionaries and
+    configurations appear later in this chapter.
+   </p></div></div><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navfooter"><hr></hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch.html" title="Chapter 12. Full Text Search">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-tables.html" title="12.2. Tables and Indexes">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 12. Full Text Search </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 14.5 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.2. Tables and Indexes</td></tr></table></div></body></html>
+\ No newline at end of file