Diffstat (limited to 'doc/src/sgml/html/textsearch-intro.html')
-rw-r--r-- | doc/src/sgml/html/textsearch-intro.html | 339
1 file changed, 339 insertions, 0 deletions
diff --git a/doc/src/sgml/html/textsearch-intro.html b/doc/src/sgml/html/textsearch-intro.html new file mode 100644 index 0000000..7c5adaa --- /dev/null +++ b/doc/src/sgml/html/textsearch-intro.html @@ -0,0 +1,339 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.1. Introduction</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets V1.79.1" /><link rel="prev" href="textsearch.html" title="Chapter 12. Full Text Search" /><link rel="next" href="textsearch-tables.html" title="12.2. Tables and Indexes" /></head><body id="docContent" class="container-fluid col-10"><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.1. Introduction</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch.html" title="Chapter 12. Full Text Search">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 13.4 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-tables.html" title="12.2. Tables and Indexes">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="TEXTSEARCH-INTRO"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.1. Introduction</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-DOCUMENT">12.1.1. What Is a Document?</a></span></dt><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-MATCHING">12.1.2. Basic Text Matching</a></span></dt><dt><span class="sect2"><a href="textsearch-intro.html#TEXTSEARCH-INTRO-CONFIGURATIONS">12.1.3. Configurations</a></span></dt></dl></div><p> + Full Text Searching (or just <em class="firstterm">text search</em>) provides + the capability to identify natural-language <em class="firstterm">documents</em> that + satisfy a <em class="firstterm">query</em>, and optionally to sort them by + relevance to the query. The most common type of search + is to find all documents containing given <em class="firstterm">query terms</em> + and return them in order of their <em class="firstterm">similarity</em> to the + query. Notions of <code class="varname">query</code> and + <code class="varname">similarity</code> are very flexible and depend on the specific + application. The simplest search considers <code class="varname">query</code> as a + set of words and <code class="varname">similarity</code> as the frequency of query + words in the document. + </p><p> + Textual search operators have existed in databases for years. 
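+ For instance, a traditional substring search against the hypothetical
+ <code class="literal">messages</code> table used in the examples below could be
+ written with a pattern-matching operator (illustrative only):
+
+</p><pre class="programlisting">
+-- finds rows containing the literal substring 'satisfy';
+-- rows that only contain 'satisfies' are missed
+SELECT mid, title
+FROM messages
+WHERE body ILIKE '%satisfy%';
+</pre><p>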
+ <span class="productname">PostgreSQL</span> has + <code class="literal">~</code>, <code class="literal">~*</code>, <code class="literal">LIKE</code>, and + <code class="literal">ILIKE</code> operators for textual data types, but they lack + many essential properties required by modern information systems: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + There is no linguistic support, even for English. Regular expressions + are not sufficient because they cannot easily handle derived words, e.g., + <code class="literal">satisfies</code> and <code class="literal">satisfy</code>. You might + miss documents that contain <code class="literal">satisfies</code>, although you + probably would like to find them when searching for + <code class="literal">satisfy</code>. It is possible to use <code class="literal">OR</code> + to search for multiple derived forms, but this is tedious and error-prone + (some words can have several thousand derivatives). + </p></li><li class="listitem" style="list-style-type: disc"><p> + They provide no ordering (ranking) of search results, which makes them + ineffective when thousands of matching documents are found. + </p></li><li class="listitem" style="list-style-type: disc"><p> + They tend to be slow because there is no index support, so they must + process all documents for every search. + </p></li></ul></div><p> + Full text indexing allows documents to be <span class="emphasis"><em>preprocessed</em></span> + and an index saved for later rapid searching. Preprocessing includes: + </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: none; "><li class="listitem" style="list-style-type: none"><p> + <span class="emphasis"><em>Parsing documents into <em class="firstterm">tokens</em></em></span>. It is + useful to identify various classes of tokens, e.g., numbers, words, + complex words, email addresses, so that they can be processed + differently. In principle token classes depend on the specific + application, but for most purposes it is adequate to use a predefined + set of classes. + <span class="productname">PostgreSQL</span> uses a <em class="firstterm">parser</em> to + perform this step. A standard parser is provided, and custom parsers + can be created for specific needs. + </p></li><li class="listitem" style="list-style-type: none"><p> + <span class="emphasis"><em>Converting tokens into <em class="firstterm">lexemes</em></em></span>. + A lexeme is a string, just like a token, but it has been + <em class="firstterm">normalized</em> so that different forms of the same word + are made alike. For example, normalization almost always includes + folding upper-case letters to lower-case, and often involves removal + of suffixes (such as <code class="literal">s</code> or <code class="literal">es</code> in English). + This allows searches to find variant forms of the + same word, without tediously entering all the possible variants. + Also, this step typically eliminates <em class="firstterm">stop words</em>, which + are words that are so common that they are useless for searching. + (In short, then, tokens are raw fragments of the document text, while + lexemes are words that are believed useful for indexing and searching.) + <span class="productname">PostgreSQL</span> uses <em class="firstterm">dictionaries</em> to + perform this step. Various standard dictionaries are provided, and + custom ones can be created for specific needs. 
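+ For instance, with the built-in <code class="literal">english</code> configuration,
+ the <code class="function">to_tsvector</code> function introduced below converts a
+ short string into lexemes like this (illustrative output):
+
+</p><pre class="programlisting">
+SELECT to_tsvector('english', 'The Fat Rats');
+ to_tsvector
+-----------------
+ 'fat':2 'rat':3
+</pre><p>
+ Here <code class="literal">The</code> is dropped as a stop word, upper case is
+ folded, and the plural suffix is removed.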
+ </p></li><li class="listitem" style="list-style-type: none"><p> + <span class="emphasis"><em>Storing preprocessed documents optimized for + searching</em></span>. For example, each document can be represented + as a sorted array of normalized lexemes. Along with the lexemes it is + often desirable to store positional information to use for + <em class="firstterm">proximity ranking</em>, so that a document that + contains a more <span class="quote">“<span class="quote">dense</span>”</span> region of query words is + assigned a higher rank than one with scattered query words. + </p></li></ul></div><p> + Dictionaries allow fine-grained control over how tokens are normalized. + With appropriate dictionaries, you can: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + Define stop words that should not be indexed. + </p></li><li class="listitem" style="list-style-type: disc"><p> + Map synonyms to a single word using <span class="application">Ispell</span>. + </p></li><li class="listitem" style="list-style-type: disc"><p> + Map phrases to a single word using a thesaurus. + </p></li><li class="listitem" style="list-style-type: disc"><p> + Map different variations of a word to a canonical form using + an <span class="application">Ispell</span> dictionary. + </p></li><li class="listitem" style="list-style-type: disc"><p> + Map different variations of a word to a canonical form using + <span class="application">Snowball</span> stemmer rules. + </p></li></ul></div><p> + A data type <code class="type">tsvector</code> is provided for storing preprocessed + documents, along with a type <code class="type">tsquery</code> for representing processed + queries (<a class="xref" href="datatype-textsearch.html" title="8.11. Text Search Types">Section 8.11</a>). There are many + functions and operators available for these data types + (<a class="xref" href="functions-textsearch.html" title="9.13. Text Search Functions and Operators">Section 9.13</a>), the most important of which is + the match operator <code class="literal">@@</code>, which we introduce in + <a class="xref" href="textsearch-intro.html#TEXTSEARCH-MATCHING" title="12.1.2. Basic Text Matching">Section 12.1.2</a>. Full text searches can be accelerated + using indexes (<a class="xref" href="textsearch-indexes.html" title="12.9. GIN and GiST Index Types">Section 12.9</a>). + </p><div class="sect2" id="TEXTSEARCH-DOCUMENT"><div class="titlepage"><div><div><h3 class="title">12.1.1. What Is a Document?</h3></div></div></div><a id="id-1.5.11.4.10.2" class="indexterm"></a><p> + A <em class="firstterm">document</em> is the unit of searching in a full text search + system; for example, a magazine article or email message. The text search + engine must be able to parse documents and store associations of lexemes + (key words) with their parent document. Later, these associations are + used to search for documents that contain query words. + </p><p> + For searches within <span class="productname">PostgreSQL</span>, + a document is normally a textual field within a row of a database table, + or possibly a combination (concatenation) of such fields, perhaps stored + in several tables or obtained dynamically. In other words, a document can + be constructed from different parts for indexing and it might not be + stored anywhere as a whole. 
For example: + +</p><pre class="programlisting"> +SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document +FROM messages +WHERE mid = 12; + +SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document +FROM messages m, docs d +WHERE m.mid = d.did AND m.mid = 12; +</pre><p> + </p><div class="note"><h3 class="title">Note</h3><p> + Actually, in these example queries, <code class="function">coalesce</code> + should be used to prevent a single <code class="literal">NULL</code> attribute from + causing a <code class="literal">NULL</code> result for the whole document. + </p></div><p> + Another possibility is to store the documents as simple text files in the + file system. In this case, the database can be used to store the full text + index and to execute searches, and some unique identifier can be used to + retrieve the document from the file system. However, retrieving files + from outside the database requires superuser permissions or special + function support, so this is usually less convenient than keeping all + the data inside <span class="productname">PostgreSQL</span>. Also, keeping + everything inside the database allows easy access + to document metadata to assist in indexing and display. + </p><p> + For text search purposes, each document must be reduced to the + preprocessed <code class="type">tsvector</code> format. Searching and ranking + are performed entirely on the <code class="type">tsvector</code> representation + of a document — the original text need only be retrieved + when the document has been selected for display to a user. + We therefore often speak of the <code class="type">tsvector</code> as being the + document, but of course it is only a compact representation of + the full document. + </p></div><div class="sect2" id="TEXTSEARCH-MATCHING"><div class="titlepage"><div><div><h3 class="title">12.1.2. Basic Text Matching</h3></div></div></div><p> + Full text searching in <span class="productname">PostgreSQL</span> is based on + the match operator <code class="literal">@@</code>, which returns + <code class="literal">true</code> if a <code class="type">tsvector</code> + (document) matches a <code class="type">tsquery</code> (query). + It doesn't matter which data type is written first: + +</p><pre class="programlisting"> +SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery; + ?column? +---------- + t + +SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; + ?column? +---------- + f +</pre><p> + </p><p> + As the above example suggests, a <code class="type">tsquery</code> is not just raw + text, any more than a <code class="type">tsvector</code> is. A <code class="type">tsquery</code> + contains search terms, which must be already-normalized lexemes, and + may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. + (For syntax details see <a class="xref" href="datatype-textsearch.html#DATATYPE-TSQUERY" title="8.11.2. tsquery">Section 8.11.2</a>.) There are + functions <code class="function">to_tsquery</code>, <code class="function">plainto_tsquery</code>, + and <code class="function">phraseto_tsquery</code> + that are helpful in converting user-written text into a proper + <code class="type">tsquery</code>, primarily by normalizing words appearing in + the text. Similarly, <code class="function">to_tsvector</code> is used to parse and + normalize a document string. 
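+ For instance, these functions normalize user-written text as follows (output
+ shown for the built-in <code class="literal">english</code> configuration; the
+ exact result depends on the configuration in use):
+
+</p><pre class="programlisting">
+SELECT plainto_tsquery('english', 'The Fat Rats');
+ plainto_tsquery
+-----------------
+ 'fat' & 'rat'
+
+SELECT phraseto_tsquery('english', 'The Fat Rats');
+ phraseto_tsquery
+------------------
+ 'fat' <-> 'rat'
+</pre><p>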
So in practice a text search match would + look more like this: + +</p><pre class="programlisting"> +SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat'); + ?column? +---------- + t +</pre><p> + + Observe that this match would not succeed if written as + +</p><pre class="programlisting"> +SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat'); + ?column? +---------- + f +</pre><p> + + since here no normalization of the word <code class="literal">rats</code> will occur. + The elements of a <code class="type">tsvector</code> are lexemes, which are assumed + already normalized, so <code class="literal">rats</code> does not match <code class="literal">rat</code>. + </p><p> + The <code class="literal">@@</code> operator also + supports <code class="type">text</code> input, allowing explicit conversion of a text + string to <code class="type">tsvector</code> or <code class="type">tsquery</code> to be skipped + in simple cases. The variants available are: + +</p><pre class="programlisting"> +tsvector @@ tsquery +tsquery @@ tsvector +text @@ tsquery +text @@ text +</pre><p> + </p><p> + The first two of these we saw already. + The form <code class="type">text</code> <code class="literal">@@</code> <code class="type">tsquery</code> + is equivalent to <code class="literal">to_tsvector(x) @@ y</code>. + The form <code class="type">text</code> <code class="literal">@@</code> <code class="type">text</code> + is equivalent to <code class="literal">to_tsvector(x) @@ plainto_tsquery(y)</code>. + </p><p> + Within a <code class="type">tsquery</code>, the <code class="literal">&</code> (AND) operator + specifies that both its arguments must appear in the document to have a + match. Similarly, the <code class="literal">|</code> (OR) operator specifies that + at least one of its arguments must appear, while the <code class="literal">!</code> (NOT) + operator specifies that its argument must <span class="emphasis"><em>not</em></span> appear in + order to have a match. + For example, the query <code class="literal">fat & ! rat</code> matches documents that + contain <code class="literal">fat</code> but not <code class="literal">rat</code>. + </p><p> + Searching for phrases is possible with the help of + the <code class="literal"><-></code> (FOLLOWED BY) <code class="type">tsquery</code> operator, which + matches only if its arguments have matches that are adjacent and in the + given order. For example: + +</p><pre class="programlisting"> +SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error'); + ?column? +---------- + t + +SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal <-> error'); + ?column? +---------- + f +</pre><p> + + There is a more general version of the FOLLOWED BY operator having the + form <code class="literal"><<em class="replaceable"><code>N</code></em>></code>, + where <em class="replaceable"><code>N</code></em> is an integer standing for the difference between + the positions of the matching lexemes. <code class="literal"><1></code> is + the same as <code class="literal"><-></code>, while <code class="literal"><2></code> + allows exactly one other lexeme to appear between the matches, and so + on. The <code class="literal">phraseto_tsquery</code> function makes use of this + operator to construct a <code class="literal">tsquery</code> that can match a multi-word + phrase when some of the words are stop words. 
For example: + +</p><pre class="programlisting"> +SELECT phraseto_tsquery('cats ate rats'); + phraseto_tsquery +------------------------------- + 'cat' <-> 'ate' <-> 'rat' + +SELECT phraseto_tsquery('the cats ate the rats'); + phraseto_tsquery +------------------------------- + 'cat' <-> 'ate' <2> 'rat' +</pre><p> + </p><p> + A special case that's sometimes useful is that <code class="literal"><0></code> + can be used to require that two patterns match the same word. + </p><p> + Parentheses can be used to control nesting of the <code class="type">tsquery</code> + operators. Without parentheses, <code class="literal">|</code> binds least tightly, + then <code class="literal">&</code>, then <code class="literal"><-></code>, + and <code class="literal">!</code> most tightly. + </p><p> + It's worth noticing that the AND/OR/NOT operators mean something subtly + different when they are within the arguments of a FOLLOWED BY operator + than when they are not, because within FOLLOWED BY the exact position of + the match is significant. For example, normally <code class="literal">!x</code> matches + only documents that do not contain <code class="literal">x</code> anywhere. + But <code class="literal">!x <-> y</code> matches <code class="literal">y</code> if it is not + immediately after an <code class="literal">x</code>; an occurrence of <code class="literal">x</code> + elsewhere in the document does not prevent a match. Another example is + that <code class="literal">x & y</code> normally only requires that <code class="literal">x</code> + and <code class="literal">y</code> both appear somewhere in the document, but + <code class="literal">(x & y) <-> z</code> requires <code class="literal">x</code> + and <code class="literal">y</code> to match at the same place, immediately before + a <code class="literal">z</code>. Thus this query behaves differently from + <code class="literal">x <-> z & y <-> z</code>, which will match a + document containing two separate sequences <code class="literal">x z</code> and + <code class="literal">y z</code>. (This specific query is useless as written, + since <code class="literal">x</code> and <code class="literal">y</code> could not match at the same place; + but with more complex situations such as prefix-match patterns, a query + of this form could be useful.) + </p></div><div class="sect2" id="TEXTSEARCH-INTRO-CONFIGURATIONS"><div class="titlepage"><div><div><h3 class="title">12.1.3. Configurations</h3></div></div></div><p> + The above are all simple text search examples. As mentioned before, full + text search functionality includes the ability to do many more things: + skip indexing certain words (stop words), process synonyms, and use + sophisticated parsing, e.g., parse based on more than just white space. + This functionality is controlled by <em class="firstterm">text search + configurations</em>. <span class="productname">PostgreSQL</span> comes with predefined + configurations for many languages, and you can easily create your own + configurations. (<span class="application">psql</span>'s <code class="command">\dF</code> command + shows all available configurations.) + </p><p> + During installation an appropriate configuration is selected and + <a class="xref" href="runtime-config-client.html#GUC-DEFAULT-TEXT-SEARCH-CONFIG">default_text_search_config</a> is set accordingly + in <code class="filename">postgresql.conf</code>. 
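+ For instance, you can inspect the value currently in effect, or override it
+ for the current session (<code class="literal">pg_catalog.english</code> is just
+ an example value; your installation may differ):
+
+</p><pre class="programlisting">
+SHOW default_text_search_config;
+ default_text_search_config
+-----------------------------
+ pg_catalog.english
+
+SET default_text_search_config = 'pg_catalog.english';
+</pre><p>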
If you are using the same text search + configuration for the entire cluster you can use the value in + <code class="filename">postgresql.conf</code>. To use different configurations + throughout the cluster but the same configuration within any one database, + use <code class="command">ALTER DATABASE ... SET</code>. Otherwise, you can set + <code class="varname">default_text_search_config</code> in each session. + </p><p> + Each text search function that depends on a configuration has an optional + <code class="type">regconfig</code> argument, so that the configuration to use can be + specified explicitly. <code class="varname">default_text_search_config</code> + is used only when this argument is omitted. + </p><p> + To make it easier to build custom text search configurations, a + configuration is built up from simpler database objects. + <span class="productname">PostgreSQL</span>'s text search facility provides + four types of configuration-related database objects: + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + <em class="firstterm">Text search parsers</em> break documents into tokens + and classify each token (for example, as words or numbers). + </p></li><li class="listitem" style="list-style-type: disc"><p> + <em class="firstterm">Text search dictionaries</em> convert tokens to normalized + form and reject stop words. + </p></li><li class="listitem" style="list-style-type: disc"><p> + <em class="firstterm">Text search templates</em> provide the functions underlying + dictionaries. (A dictionary simply specifies a template and a set + of parameters for the template.) + </p></li><li class="listitem" style="list-style-type: disc"><p> + <em class="firstterm">Text search configurations</em> select a parser and a set + of dictionaries to use to normalize the tokens produced by the parser. + </p></li></ul></div><p> + Text search parsers and templates are built from low-level C functions; + therefore it requires C programming ability to develop new ones, and + superuser privileges to install one into a database. (There are examples + of add-on parsers and templates in the <code class="filename">contrib/</code> area of the + <span class="productname">PostgreSQL</span> distribution.) Since dictionaries and + configurations just parameterize and connect together some underlying + parsers and templates, no special privilege is needed to create a new + dictionary or configuration. Examples of creating custom dictionaries and + configurations appear later in this chapter. + </p></div></div><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navfooter"><hr></hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch.html" title="Chapter 12. Full Text Search">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-tables.html" title="12.2. Tables and Indexes">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 12. Full Text Search </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 13.4 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.2. Tables and Indexes</td></tr></table></div></body></html>
\ No newline at end of file