Diffstat (limited to 'doc/src/sgml/html/textsearch-controls.html')
-rw-r--r-- | doc/src/sgml/html/textsearch-controls.html | 550 |
1 files changed, 550 insertions, 0 deletions
diff --git a/doc/src/sgml/html/textsearch-controls.html b/doc/src/sgml/html/textsearch-controls.html new file mode 100644 index 0000000..face378 --- /dev/null +++ b/doc/src/sgml/html/textsearch-controls.html @@ -0,0 +1,550 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.3. Controlling Text Search</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch-tables.html" title="12.2. Tables and Indexes" /><link rel="next" href="textsearch-features.html" title="12.4. Additional Features" /></head><body id="docContent" class="container-fluid col-10"><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.3. Controlling Text Search</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-tables.html" title="12.2. Tables and Indexes">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 14.5 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-features.html" title="12.4. Additional Features">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="TEXTSEARCH-CONTROLS"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.3. 
Controlling Text Search</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-PARSING-DOCUMENTS">12.3.1. Parsing Documents</a></span></dt><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES">12.3.2. Parsing Queries</a></span></dt><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-RANKING">12.3.3. Ranking Search Results</a></span></dt><dt><span class="sect2"><a href="textsearch-controls.html#TEXTSEARCH-HEADLINE">12.3.4. Highlighting Results</a></span></dt></dl></div><p> + To implement full text searching there must be a function to create a + <code class="type">tsvector</code> from a document and a <code class="type">tsquery</code> from a + user query. Also, we need to return results in a useful order, so we need + a function that compares documents with respect to their relevance to + the query. It's also important to be able to display the results nicely. + <span class="productname">PostgreSQL</span> provides support for all of these + functions. + </p><div class="sect2" id="TEXTSEARCH-PARSING-DOCUMENTS"><div class="titlepage"><div><div><h3 class="title">12.3.1. Parsing Documents</h3></div></div></div><p> + <span class="productname">PostgreSQL</span> provides the + function <code class="function">to_tsvector</code> for converting a document to + the <code class="type">tsvector</code> data type. 
+ </p><a id="id-1.5.11.6.3.3" class="indexterm"></a><pre class="synopsis"> +to_tsvector([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>document</code></em> <code class="type">text</code>) returns <code class="type">tsvector</code> +</pre><p> + <code class="function">to_tsvector</code> parses a textual document into tokens, + reduces the tokens to lexemes, and returns a <code class="type">tsvector</code> which + lists the lexemes together with their positions in the document. + The document is processed according to the specified or default + text search configuration. + Here is a simple example: + +</p><pre class="screen"> +SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats'); + to_tsvector +----------------------------------------------------- + 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4 +</pre><p> + </p><p> + In the example above we see that the resulting <code class="type">tsvector</code> does not + contain the words <code class="literal">a</code>, <code class="literal">on</code>, or + <code class="literal">it</code>, the word <code class="literal">rats</code> became + <code class="literal">rat</code>, and the punctuation sign <code class="literal">-</code> was + ignored. + </p><p> + The <code class="function">to_tsvector</code> function internally calls a parser + which breaks the document text into tokens and assigns a type to + each token. For each token, a list of + dictionaries (<a class="xref" href="textsearch-dictionaries.html" title="12.6. Dictionaries">Section 12.6</a>) is consulted, + where the list can vary depending on the token type. The first dictionary + that <em class="firstterm">recognizes</em> the token emits one or more normalized + <em class="firstterm">lexemes</em> to represent the token. 
For example, + <code class="literal">rats</code> became <code class="literal">rat</code> because one of the + dictionaries recognized that the word <code class="literal">rats</code> is a plural + form of <code class="literal">rat</code>. Some words are recognized as + <em class="firstterm">stop words</em> (<a class="xref" href="textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS" title="12.6.1. Stop Words">Section 12.6.1</a>), which + causes them to be ignored since they occur too frequently to be useful in + searching. In our example these are + <code class="literal">a</code>, <code class="literal">on</code>, and <code class="literal">it</code>. + If no dictionary in the list recognizes the token then it is also ignored. + In this example that happened to the punctuation sign <code class="literal">-</code> + because there are in fact no dictionaries assigned for its token type + (<code class="literal">Space symbols</code>), meaning space tokens will never be + indexed. The choices of parser, dictionaries and which types of tokens to + index are determined by the selected text search configuration (<a class="xref" href="textsearch-configuration.html" title="12.7. Configuration Example">Section 12.7</a>). It is possible to have + many different configurations in the same database, and predefined + configurations are available for various languages. In our example + we used the default configuration <code class="literal">english</code> for the + English language. + </p><p> + The function <code class="function">setweight</code> can be used to label the + entries of a <code class="type">tsvector</code> with a given <em class="firstterm">weight</em>, + where a weight is one of the letters <code class="literal">A</code>, <code class="literal">B</code>, + <code class="literal">C</code>, or <code class="literal">D</code>. + This is typically used to mark entries coming from + different parts of a document, such as title versus body. 
Later, this + information can be used for ranking of search results. + </p><p> + Because <code class="function">to_tsvector</code>(<code class="literal">NULL</code>) will + return <code class="literal">NULL</code>, it is recommended to use + <code class="function">coalesce</code> whenever a field might be null. + Here is the recommended method for creating + a <code class="type">tsvector</code> from a structured document: + +</p><pre class="programlisting"> +UPDATE tt SET ti = + setweight(to_tsvector(coalesce(title,'')), 'A') || + setweight(to_tsvector(coalesce(keyword,'')), 'B') || + setweight(to_tsvector(coalesce(abstract,'')), 'C') || + setweight(to_tsvector(coalesce(body,'')), 'D'); +</pre><p> + + Here we have used <code class="function">setweight</code> to label the source + of each lexeme in the finished <code class="type">tsvector</code>, and then merged + the labeled <code class="type">tsvector</code> values using the <code class="type">tsvector</code> + concatenation operator <code class="literal">||</code>. (<a class="xref" href="textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR" title="12.4.1. Manipulating Documents">Section 12.4.1</a> gives details about these + operations.) + </p></div><div class="sect2" id="TEXTSEARCH-PARSING-QUERIES"><div class="titlepage"><div><div><h3 class="title">12.3.2. Parsing Queries</h3></div></div></div><p> + <span class="productname">PostgreSQL</span> provides the + functions <code class="function">to_tsquery</code>, + <code class="function">plainto_tsquery</code>, + <code class="function">phraseto_tsquery</code> and + <code class="function">websearch_to_tsquery</code> + for converting a query to the <code class="type">tsquery</code> data type. + <code class="function">to_tsquery</code> offers access to more features + than either <code class="function">plainto_tsquery</code> or + <code class="function">phraseto_tsquery</code>, but it is less forgiving about its + input. 
<code class="function">websearch_to_tsquery</code> is a simplified version + of <code class="function">to_tsquery</code> with an alternative syntax, similar + to the one used by web search engines. + </p><a id="id-1.5.11.6.4.3" class="indexterm"></a><pre class="synopsis"> +to_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code> +</pre><p> + <code class="function">to_tsquery</code> creates a <code class="type">tsquery</code> value from + <em class="replaceable"><code>querytext</code></em>, which must consist of single tokens + separated by the <code class="type">tsquery</code> operators <code class="literal">&</code> (AND), + <code class="literal">|</code> (OR), <code class="literal">!</code> (NOT), and + <code class="literal"><-></code> (FOLLOWED BY), possibly grouped + using parentheses. In other words, the input to + <code class="function">to_tsquery</code> must already follow the general rules for + <code class="type">tsquery</code> input, as described in <a class="xref" href="datatype-textsearch.html#DATATYPE-TSQUERY" title="8.11.2. tsquery">Section 8.11.2</a>. The difference is that while basic + <code class="type">tsquery</code> input takes the tokens at face value, + <code class="function">to_tsquery</code> normalizes each token into a lexeme using + the specified or default configuration, and discards any tokens that are + stop words according to the configuration. For example: + +</p><pre class="screen"> +SELECT to_tsquery('english', 'The & Fat & Rats'); + to_tsquery +--------------- + 'fat' & 'rat' +</pre><p> + + As in basic <code class="type">tsquery</code> input, weight(s) can be attached to each + lexeme to restrict it to match only <code class="type">tsvector</code> lexemes of those + weight(s). 
For example: + +</p><pre class="screen"> +SELECT to_tsquery('english', 'Fat | Rats:AB'); + to_tsquery +------------------ + 'fat' | 'rat':AB +</pre><p> + + Also, <code class="literal">*</code> can be attached to a lexeme to specify prefix matching: + +</p><pre class="screen"> +SELECT to_tsquery('supern:*A & star:A*B'); + to_tsquery +-------------------------- + 'supern':*A & 'star':*AB +</pre><p> + + Such a lexeme will match any word in a <code class="type">tsvector</code> that begins + with the given string. + </p><p> + <code class="function">to_tsquery</code> can also accept single-quoted + phrases. This is primarily useful when the configuration includes a + thesaurus dictionary that may trigger on such phrases. + In the example below, a thesaurus contains the rule <code class="literal">supernovae + stars : sn</code>: + +</p><pre class="screen"> +SELECT to_tsquery('''supernovae stars'' & !crab'); + to_tsquery +--------------- + 'sn' & !'crab' +</pre><p> + + Without quotes, <code class="function">to_tsquery</code> will generate a syntax + error for tokens that are not separated by an AND, OR, or FOLLOWED BY + operator. + </p><a id="id-1.5.11.6.4.7" class="indexterm"></a><pre class="synopsis"> +plainto_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code> +</pre><p> + <code class="function">plainto_tsquery</code> transforms the unformatted text + <em class="replaceable"><code>querytext</code></em> to a <code class="type">tsquery</code> value. + The text is parsed and normalized much as for <code class="function">to_tsvector</code>, + then the <code class="literal">&</code> (AND) <code class="type">tsquery</code> operator is + inserted between surviving words. 
+ </p><p> + Example: + +</p><pre class="screen"> +SELECT plainto_tsquery('english', 'The Fat Rats'); + plainto_tsquery +----------------- + 'fat' & 'rat' +</pre><p> + + Note that <code class="function">plainto_tsquery</code> will not + recognize <code class="type">tsquery</code> operators, weight labels, + or prefix-match labels in its input: + +</p><pre class="screen"> +SELECT plainto_tsquery('english', 'The Fat & Rats:C'); + plainto_tsquery +--------------------- + 'fat' & 'rat' & 'c' +</pre><p> + + Here, all the input punctuation was discarded. + </p><a id="id-1.5.11.6.4.11" class="indexterm"></a><pre class="synopsis"> +phraseto_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code> +</pre><p> + <code class="function">phraseto_tsquery</code> behaves much like + <code class="function">plainto_tsquery</code>, except that it inserts + the <code class="literal"><-></code> (FOLLOWED BY) operator between + surviving words instead of the <code class="literal">&</code> (AND) operator. + Also, stop words are not simply discarded, but are accounted for by + inserting <code class="literal"><<em class="replaceable"><code>N</code></em>></code> operators rather + than <code class="literal"><-></code> operators. This function is useful + when searching for exact lexeme sequences, since the FOLLOWED BY + operators check lexeme order not just the presence of all the lexemes. 
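+   </p><p>
+    As a minimal sketch of the stop-word handling described above (the input
+    sentence is an illustrative assumption, not taken from this section), the
+    following query contains the stop word <code class="literal">the</code>
+    between <code class="literal">ate</code> and <code class="literal">rats</code>,
+    so the FOLLOWED BY operator joining those two lexemes should carry a
+    distance of 2 rather than 1. The leading stop word has no preceding
+    lexeme and is expected to be dropped:
+
+</p><pre class="programlisting">
+-- Illustrative input; compare the operator between 'ate' and 'rat'
+-- with the one between 'cat' and 'ate'.
+SELECT phraseto_tsquery('english', 'the cats ate the rats');
+</pre><p>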
+ </p><p> + Example: + +</p><pre class="screen"> +SELECT phraseto_tsquery('english', 'The Fat Rats'); + phraseto_tsquery +------------------ + 'fat' <-> 'rat' +</pre><p> + + Like <code class="function">plainto_tsquery</code>, the + <code class="function">phraseto_tsquery</code> function will not + recognize <code class="type">tsquery</code> operators, weight labels, + or prefix-match labels in its input: + +</p><pre class="screen"> +SELECT phraseto_tsquery('english', 'The Fat & Rats:C'); + phraseto_tsquery +----------------------------- + 'fat' <-> 'rat' <-> 'c' +</pre><p> + </p><pre class="synopsis"> +websearch_to_tsquery([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>querytext</code></em> <code class="type">text</code>) returns <code class="type">tsquery</code> +</pre><p> + <code class="function">websearch_to_tsquery</code> creates a <code class="type">tsquery</code> + value from <em class="replaceable"><code>querytext</code></em> using an alternative + syntax in which simple unformatted text is a valid query. + Unlike <code class="function">plainto_tsquery</code> + and <code class="function">phraseto_tsquery</code>, it also recognizes certain + operators. Moreover, this function will never raise syntax errors, + which makes it possible to use raw user-supplied input for search. + The following syntax is supported: + + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">unquoted text</code>: text not inside quote marks will be + converted to terms separated by <code class="literal">&</code> operators, as + if processed by <code class="function">plainto_tsquery</code>. 
+ </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">"quoted text"</code>: text inside quote marks will be + converted to terms separated by <code class="literal"><-></code> + operators, as if processed by <code class="function">phraseto_tsquery</code>. + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">OR</code>: the word <span class="quote">“<span class="quote">or</span>”</span> will be converted to + the <code class="literal">|</code> operator. + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">-</code>: a dash will be converted to + the <code class="literal">!</code> operator. + </p></li></ul></div><p> + + Other punctuation is ignored. So + like <code class="function">plainto_tsquery</code> + and <code class="function">phraseto_tsquery</code>, + the <code class="function">websearch_to_tsquery</code> function will not + recognize <code class="type">tsquery</code> operators, weight labels, or prefix-match + labels in its input. 
+ </p><p> + Examples: +</p><pre class="screen"> +SELECT websearch_to_tsquery('english', 'The fat rats'); + websearch_to_tsquery +---------------------- + 'fat' & 'rat' +(1 row) + +SELECT websearch_to_tsquery('english', '"supernovae stars" -crab'); + websearch_to_tsquery +---------------------------------- + 'supernova' <-> 'star' & !'crab' +(1 row) + +SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"'); + websearch_to_tsquery +----------------------------------- + 'sad' <-> 'cat' | 'fat' <-> 'rat' +(1 row) + +SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"'); + websearch_to_tsquery +--------------------------------------- + 'signal' & !( 'segment' <-> 'fault' ) +(1 row) + +SELECT websearch_to_tsquery('english', '""" )( dummy \\ query <->'); + websearch_to_tsquery +---------------------- + 'dummi' & 'queri' +(1 row) +</pre><p> + </p></div><div class="sect2" id="TEXTSEARCH-RANKING"><div class="titlepage"><div><div><h3 class="title">12.3.3. Ranking Search Results</h3></div></div></div><p> + Ranking attempts to measure how relevant documents are to a particular + query, so that when there are many matches the most relevant ones can be + shown first. <span class="productname">PostgreSQL</span> provides two + predefined ranking functions, which take into account lexical, proximity, + and structural information; that is, they consider how often the query + terms appear in the document, how close together the terms are in the + document, and how important is the part of the document where they occur. + However, the concept of relevancy is vague and very application-specific. + Different applications might require additional information for ranking, + e.g., document modification time. The built-in ranking functions are only + examples. You can write your own ranking functions and/or combine their + results with additional factors to fit your specific needs. 
+ </p><p> + The two ranking functions currently available are: + + </p><div class="variablelist"><dl class="variablelist"><dt><span class="term"> + <a id="id-1.5.11.6.5.3.1.1.1.1" class="indexterm"></a> + + <code class="literal">ts_rank([<span class="optional"> <em class="replaceable"><code>weights</code></em> <code class="type">float4[]</code>, </span>] <em class="replaceable"><code>vector</code></em> <code class="type">tsvector</code>, <em class="replaceable"><code>query</code></em> <code class="type">tsquery</code> [<span class="optional">, <em class="replaceable"><code>normalization</code></em> <code class="type">integer</code> </span>]) returns <code class="type">float4</code></code> + </span></dt><dd><p> + Ranks vectors based on the frequency of their matching lexemes. + </p></dd><dt><span class="term"> + <a id="id-1.5.11.6.5.3.1.2.1.1" class="indexterm"></a> + + <code class="literal">ts_rank_cd([<span class="optional"> <em class="replaceable"><code>weights</code></em> <code class="type">float4[]</code>, </span>] <em class="replaceable"><code>vector</code></em> <code class="type">tsvector</code>, <em class="replaceable"><code>query</code></em> <code class="type">tsquery</code> [<span class="optional">, <em class="replaceable"><code>normalization</code></em> <code class="type">integer</code> </span>]) returns <code class="type">float4</code></code> + </span></dt><dd><p> + This function computes the <em class="firstterm">cover density</em> + ranking for the given document vector and query, as described in + Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three + Term Queries" in the journal "Information Processing and Management", + 1999. Cover density is similar to <code class="function">ts_rank</code> ranking + except that the proximity of matching lexemes to each other is + taken into consideration. + </p><p> + This function requires lexeme positional information to perform + its calculation. 
Therefore, it ignores any <span class="quote">“<span class="quote">stripped</span>”</span> + lexemes in the <code class="type">tsvector</code>. If there are no unstripped + lexemes in the input, the result will be zero. (See <a class="xref" href="textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR" title="12.4.1. Manipulating Documents">Section 12.4.1</a> for more information + about the <code class="function">strip</code> function and positional information + in <code class="type">tsvector</code>s.) + </p></dd></dl></div><p> + + </p><p> + For both these functions, + the optional <em class="replaceable"><code>weights</code></em> + argument offers the ability to weigh word instances more or less + heavily depending on how they are labeled. The weight arrays specify + how heavily to weigh each category of word, in the order: + +</p><pre class="synopsis"> +{D-weight, C-weight, B-weight, A-weight} +</pre><p> + + If no <em class="replaceable"><code>weights</code></em> are provided, + then these defaults are used: + +</p><pre class="programlisting"> +{0.1, 0.2, 0.4, 1.0} +</pre><p> + + Typically weights are used to mark words from special areas of the + document, like the title or an initial abstract, so they can be + treated with more or less importance than words in the document body. + </p><p> + Since a longer document has a greater chance of containing a query term + it is reasonable to take into account document size, e.g., a hundred-word + document with five instances of a search word is probably more relevant + than a thousand-word document with five instances. Both ranking functions + take an integer <em class="replaceable"><code>normalization</code></em> option that + specifies whether and how a document's length should impact its rank. + The integer option controls several behaviors, so it is a bit mask: + you can specify one or more behaviors using + <code class="literal">|</code> (for example, <code class="literal">2|4</code>). 
+ + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + 0 (the default) ignores the document length + </p></li><li class="listitem" style="list-style-type: disc"><p> + 1 divides the rank by 1 + the logarithm of the document length + </p></li><li class="listitem" style="list-style-type: disc"><p> + 2 divides the rank by the document length + </p></li><li class="listitem" style="list-style-type: disc"><p> + 4 divides the rank by the mean harmonic distance between extents + (this is implemented only by <code class="function">ts_rank_cd</code>) + </p></li><li class="listitem" style="list-style-type: disc"><p> + 8 divides the rank by the number of unique words in document + </p></li><li class="listitem" style="list-style-type: disc"><p> + 16 divides the rank by 1 + the logarithm of the number + of unique words in document + </p></li><li class="listitem" style="list-style-type: disc"><p> + 32 divides the rank by itself + 1 + </p></li></ul></div><p> + + If more than one flag bit is specified, the transformations are + applied in the order listed. + </p><p> + It is important to note that the ranking functions do not use any global + information, so it is impossible to produce a fair normalization to 1% or + 100% as sometimes desired. Normalization option 32 + (<code class="literal">rank/(rank+1)</code>) can be applied to scale all ranks + into the range zero to one, but of course this is just a cosmetic change; + it will not affect the ordering of the search results. 
+ </p><p> + Here is an example that selects only the ten highest-ranked matches: + +</p><pre class="screen"> +SELECT title, ts_rank_cd(textsearch, query) AS rank +FROM apod, to_tsquery('neutrino|(dark & matter)') query +WHERE query @@ textsearch +ORDER BY rank DESC +LIMIT 10; + title | rank +-----------------------------------------------+---------- + Neutrinos in the Sun | 3.1 + The Sudbury Neutrino Detector | 2.4 + A MACHO View of Galactic Dark Matter | 2.01317 + Hot Gas and Dark Matter | 1.91171 + The Virgo Cluster: Hot Plasma and Dark Matter | 1.90953 + Rafting for Solar Neutrinos | 1.9 + NGC 4650A: Strange Galaxy and Dark Matter | 1.85774 + Hot Gas and Dark Matter | 1.6123 + Ice Fishing for Cosmic Neutrinos | 1.6 + Weak Lensing Distorts the Universe | 0.818218 +</pre><p> + + This is the same example using normalized ranking: + +</p><pre class="screen"> +SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank +FROM apod, to_tsquery('neutrino|(dark & matter)') query +WHERE query @@ textsearch +ORDER BY rank DESC +LIMIT 10; + title | rank +-----------------------------------------------+------------------- + Neutrinos in the Sun | 0.756097569485493 + The Sudbury Neutrino Detector | 0.705882361190954 + A MACHO View of Galactic Dark Matter | 0.668123210574724 + Hot Gas and Dark Matter | 0.65655958650282 + The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973 + Rafting for Solar Neutrinos | 0.655172410958162 + NGC 4650A: Strange Galaxy and Dark Matter | 0.650072921219637 + Hot Gas and Dark Matter | 0.617195790024749 + Ice Fishing for Cosmic Neutrinos | 0.615384618911517 + Weak Lensing Distorts the Universe | 0.450010798361481 +</pre><p> + </p><p> + Ranking can be expensive since it requires consulting the + <code class="type">tsvector</code> of each matching document, which can be I/O bound and + therefore slow. Unfortunately, it is almost impossible to avoid since + practical queries often result in large numbers of matches. 
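+   </p><p>
+    The <em class="replaceable"><code>weights</code></em> and
+    <em class="replaceable"><code>normalization</code></em> arguments described
+    above can be combined in one call. The following is a sketch only: it reuses
+    the <code class="literal">apod</code> table from the earlier examples and
+    spells out the default weights so that they can be adjusted. The choice of
+    flags <code class="literal">2|32</code> is an illustrative assumption; it
+    divides the rank by the document length and then rescales the result with
+    <code class="literal">rank/(rank+1)</code>, in the order listed:
+
+</p><pre class="programlisting">
+SELECT title,
+       ts_rank_cd('{0.1, 0.2, 0.4, 1.0}'::float4[],  -- {D, C, B, A} weights
+                  textsearch, query,
+                  2|32) AS rank                       -- length, then rank/(rank+1)
+FROM apod, to_tsquery('neutrino|(dark & matter)') query
+WHERE query @@ textsearch
+ORDER BY rank DESC
+LIMIT 10;
+</pre><p>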
+ </p></div><div class="sect2" id="TEXTSEARCH-HEADLINE"><div class="titlepage"><div><div><h3 class="title">12.3.4. Highlighting Results</h3></div></div></div><p> + To present search results it is ideal to show a part of each document and + how it is related to the query. Usually, search engines show fragments of + the document with marked search terms. <span class="productname">PostgreSQL</span> + provides a function <code class="function">ts_headline</code> that + implements this functionality. + </p><a id="id-1.5.11.6.6.3" class="indexterm"></a><pre class="synopsis"> +ts_headline([<span class="optional"> <em class="replaceable"><code>config</code></em> <code class="type">regconfig</code>, </span>] <em class="replaceable"><code>document</code></em> <code class="type">text</code>, <em class="replaceable"><code>query</code></em> <code class="type">tsquery</code> [<span class="optional">, <em class="replaceable"><code>options</code></em> <code class="type">text</code> </span>]) returns <code class="type">text</code> +</pre><p> + <code class="function">ts_headline</code> accepts a document along + with a query, and returns an excerpt from + the document in which terms from the query are highlighted. The + configuration to be used to parse the document can be specified by + <em class="replaceable"><code>config</code></em>; if <em class="replaceable"><code>config</code></em> + is omitted, the + <code class="varname">default_text_search_config</code> configuration is used. + </p><p> + If an <em class="replaceable"><code>options</code></em> string is specified it must + consist of a comma-separated list of one or more + <em class="replaceable"><code>option</code></em><code class="literal">=</code><em class="replaceable"><code>value</code></em> pairs. 
+ The available options are: + + </p><div class="itemizedlist"><ul class="itemizedlist compact" style="list-style-type: bullet; "><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">MaxWords</code>, <code class="literal">MinWords</code> (integers): + these numbers determine the longest and shortest headlines to output. + The default values are 35 and 15. + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">ShortWord</code> (integer): words of this length or less + will be dropped at the start and end of a headline, unless they are + query terms. The default value of three eliminates common English + articles. + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">HighlightAll</code> (boolean): if + <code class="literal">true</code> the whole document will be used as the + headline, ignoring the preceding three parameters. The default + is <code class="literal">false</code>. + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">MaxFragments</code> (integer): maximum number of text + fragments to display. The default value of zero selects a + non-fragment-based headline generation method. A value greater + than zero selects fragment-based headline generation (see below). + </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">StartSel</code>, <code class="literal">StopSel</code> (strings): + the strings with which to delimit query words appearing in the + document, to distinguish them from other excerpted words. The + default values are <span class="quote">“<span class="quote"><code class="literal"><b></code></span>”</span> and + <span class="quote">“<span class="quote"><code class="literal"></b></code></span>”</span>, which can be suitable + for HTML output. 
+ </p></li><li class="listitem" style="list-style-type: disc"><p> + <code class="literal">FragmentDelimiter</code> (string): When more than one + fragment is displayed, the fragments will be separated by this string. + The default is <span class="quote">“<span class="quote"><code class="literal"> ... </code></span>”</span>. + </p></li></ul></div><p> + + These option names are recognized case-insensitively. + You must double-quote string values if they contain spaces or commas. + </p><p> + In non-fragment-based headline + generation, <code class="function">ts_headline</code> locates matches for the + given <em class="replaceable"><code>query</code></em> and chooses a + single one to display, preferring matches that have more query words + within the allowed headline length. + In fragment-based headline generation, <code class="function">ts_headline</code> + locates the query matches and splits each match + into <span class="quote">“<span class="quote">fragments</span>”</span> of no more than <code class="literal">MaxWords</code> + words each, preferring fragments with more query words, and when + possible <span class="quote">“<span class="quote">stretching</span>”</span> fragments to include surrounding + words. The fragment-based mode is thus more useful when the query + matches span large sections of the document, or when it's desirable to + display multiple matches. + In either mode, if no query matches can be identified, then a single + fragment of the first <code class="literal">MinWords</code> words in the document + will be displayed. 
+ </p><p> + For example: + +</p><pre class="screen"> +SELECT ts_headline('english', + 'The most common type of search +is to find all documents containing given query terms +and return them in order of their similarity to the +query.', + to_tsquery('english', 'query & similarity')); + ts_headline +------------------------------------------------------------ + containing given <b>query</b> terms + + and return them in order of their <b>similarity</b> to the+ + <b>query</b>. + +SELECT ts_headline('english', + 'Search terms may occur +many times in a document, +requiring ranking of the search matches to decide which +occurrences to display in the result.', + to_tsquery('english', 'search & term'), + 'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=<<, StopSel=>>'); + ts_headline +------------------------------------------------------------ + <<Search>> <<terms>> may occur + + many times ... ranking of the <<search>> matches to decide +</pre><p> + </p><p> + <code class="function">ts_headline</code> uses the original document, not a + <code class="type">tsvector</code> summary, so it can be slow and should be used with + care. + </p></div></div><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navfooter"><hr></hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch-tables.html" title="12.2. Tables and Indexes">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-features.html" title="12.4. Additional Features">Next</a></td></tr><tr><td width="40%" align="left" valign="top">12.2. Tables and Indexes </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 14.5 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.4. Additional Features</td></tr></table></div></body></html>
\ No newline at end of file