summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/html/textsearch-parsers.html
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 12:17:33 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 12:17:33 +0000
commit5e45211a64149b3c659b90ff2de6fa982a5a93ed (patch)
tree739caf8c461053357daa9f162bef34516c7bf452 /doc/src/sgml/html/textsearch-parsers.html
parentInitial commit. (diff)
downloadpostgresql-15-5e45211a64149b3c659b90ff2de6fa982a5a93ed.tar.xz
postgresql-15-5e45211a64149b3c659b90ff2de6fa982a5a93ed.zip
Adding upstream version 15.5.upstream/15.5
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/src/sgml/html/textsearch-parsers.html')
-rw-r--r--doc/src/sgml/html/textsearch-parsers.html59
1 files changed, 59 insertions, 0 deletions
diff --git a/doc/src/sgml/html/textsearch-parsers.html b/doc/src/sgml/html/textsearch-parsers.html
new file mode 100644
index 0000000..233e61b
--- /dev/null
+++ b/doc/src/sgml/html/textsearch-parsers.html
@@ -0,0 +1,59 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>12.5. Parsers</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="textsearch-features.html" title="12.4. Additional Features" /><link rel="next" href="textsearch-dictionaries.html" title="12.6. Dictionaries" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">12.5. Parsers</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="textsearch-features.html" title="12.4. Additional Features">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><th width="60%" align="center">Chapter 12. Full Text Search</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.5 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="textsearch-dictionaries.html" title="12.6. Dictionaries">Next</a></td></tr></table><hr /></div><div class="sect1" id="TEXTSEARCH-PARSERS"><div class="titlepage"><div><div><h2 class="title" style="clear: both">12.5. Parsers</h2></div></div></div><p>
+ Text search parsers are responsible for splitting raw document text
+ into <em class="firstterm">tokens</em> and identifying each token's type, where
+ the set of possible types is defined by the parser itself.
+ Note that a parser does not modify the text at all — it simply
+ identifies plausible word boundaries. Because of this limited scope,
+ there is less need for application-specific custom parsers than there is
+ for custom dictionaries. At present <span class="productname">PostgreSQL</span>
+ provides just one built-in parser, which has been found to be useful for a
+ wide range of applications.
+ </p><p>
+ The built-in parser is named <code class="literal">pg_catalog.default</code>.
+ It recognizes 23 token types, shown in <a class="xref" href="textsearch-parsers.html#TEXTSEARCH-DEFAULT-PARSER" title="Table 12.1. Default Parser's Token Types">Table 12.1</a>.
+ </p><div class="table" id="TEXTSEARCH-DEFAULT-PARSER"><p class="title"><strong>Table 12.1. Default Parser's Token Types</strong></p><div class="table-contents"><table class="table" summary="Default Parser's Token Types" border="1"><colgroup><col class="col1" /><col class="col2" /><col class="col3" /></colgroup><thead><tr><th>Alias</th><th>Description</th><th>Example</th></tr></thead><tbody><tr><td><code class="literal">asciiword</code></td><td>Word, all ASCII letters</td><td><code class="literal">elephant</code></td></tr><tr><td><code class="literal">word</code></td><td>Word, all letters</td><td><code class="literal">mañana</code></td></tr><tr><td><code class="literal">numword</code></td><td>Word, letters and digits</td><td><code class="literal">beta1</code></td></tr><tr><td><code class="literal">asciihword</code></td><td>Hyphenated word, all ASCII</td><td><code class="literal">up-to-date</code></td></tr><tr><td><code class="literal">hword</code></td><td>Hyphenated word, all letters</td><td><code class="literal">lógico-matemática</code></td></tr><tr><td><code class="literal">numhword</code></td><td>Hyphenated word, letters and digits</td><td><code class="literal">postgresql-beta1</code></td></tr><tr><td><code class="literal">hword_asciipart</code></td><td>Hyphenated word part, all ASCII</td><td><code class="literal">postgresql</code> in the context <code class="literal">postgresql-beta1</code></td></tr><tr><td><code class="literal">hword_part</code></td><td>Hyphenated word part, all letters</td><td><code class="literal">lógico</code> or <code class="literal">matemática</code>
+ in the context <code class="literal">lógico-matemática</code></td></tr><tr><td><code class="literal">hword_numpart</code></td><td>Hyphenated word part, letters and digits</td><td><code class="literal">beta1</code> in the context
+ <code class="literal">postgresql-beta1</code></td></tr><tr><td><code class="literal">email</code></td><td>Email address</td><td><code class="literal">foo@example.com</code></td></tr><tr><td><code class="literal">protocol</code></td><td>Protocol head</td><td><code class="literal">http://</code></td></tr><tr><td><code class="literal">url</code></td><td>URL</td><td><code class="literal">example.com/stuff/index.html</code></td></tr><tr><td><code class="literal">host</code></td><td>Host</td><td><code class="literal">example.com</code></td></tr><tr><td><code class="literal">url_path</code></td><td>URL path</td><td><code class="literal">/stuff/index.html</code>, in the context of a URL</td></tr><tr><td><code class="literal">file</code></td><td>File or path name</td><td><code class="literal">/usr/local/foo.txt</code>, if not within a URL</td></tr><tr><td><code class="literal">sfloat</code></td><td>Scientific notation</td><td><code class="literal">-1.234e56</code></td></tr><tr><td><code class="literal">float</code></td><td>Decimal notation</td><td><code class="literal">-1.234</code></td></tr><tr><td><code class="literal">int</code></td><td>Signed integer</td><td><code class="literal">-1234</code></td></tr><tr><td><code class="literal">uint</code></td><td>Unsigned integer</td><td><code class="literal">1234</code></td></tr><tr><td><code class="literal">version</code></td><td>Version number</td><td><code class="literal">8.3.0</code></td></tr><tr><td><code class="literal">tag</code></td><td>XML tag</td><td><code class="literal">&lt;a href="dictionaries.html"&gt;</code></td></tr><tr><td><code class="literal">entity</code></td><td>XML entity</td><td><code class="literal">&amp;amp;</code></td></tr><tr><td><code class="literal">blank</code></td><td>Space symbols</td><td>(any whitespace or punctuation not otherwise recognized)</td></tr></tbody></table></div></div><br class="table-break" /><div class="note"><h3 class="title">Note</h3><p>
+ The parser's notion of a <span class="quote">“<span class="quote">letter</span>”</span> is determined by the database's
+ locale setting, specifically <code class="varname">lc_ctype</code>. Words containing
+ only the basic ASCII letters are reported as a separate token type,
+ since it is sometimes useful to distinguish them. In most European
+ languages, token types <code class="literal">word</code> and <code class="literal">asciiword</code>
+ should be treated alike.
+ </p><p>
+ <code class="literal">email</code> does not support all valid email characters as
+ defined by <a class="ulink" href="https://tools.ietf.org/html/rfc5322" target="_top">RFC 5322</a>.
+ Specifically, the only non-alphanumeric characters supported for
+ email user names are period, dash, and underscore.
+ </p></div><p>
+ It is possible for the parser to produce overlapping tokens from the same
+ piece of text. As an example, a hyphenated word will be reported both
+ as the entire word and as each component:
+
+</p><pre class="screen">
+SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
+ alias | description | token
+-----------------+------------------------------------------+---------------
+ numhword | Hyphenated word, letters and digits | foo-bar-beta1
+ hword_asciipart | Hyphenated word part, all ASCII | foo
+ blank | Space symbols | -
+ hword_asciipart | Hyphenated word part, all ASCII | bar
+ blank | Space symbols | -
+ hword_numpart | Hyphenated word part, letters and digits | beta1
+</pre><p>
+
+ This behavior is desirable since it allows searches to work for both
+ the whole compound word and for components. Here is another
+ instructive example:
+
+</p><pre class="screen">
+SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
+ alias | description | token
+----------+---------------+------------------------------
+ protocol | Protocol head | http://
+ url | URL | example.com/stuff/index.html
+ host | Host | example.com
+ url_path | URL path | /stuff/index.html
+</pre><p>
+ </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="textsearch-features.html" title="12.4. Additional Features">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="textsearch.html" title="Chapter 12. Full Text Search">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="textsearch-dictionaries.html" title="12.6. Dictionaries">Next</a></td></tr><tr><td width="40%" align="left" valign="top">12.4. Additional Features </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.5 Documentation">Home</a></td><td width="40%" align="right" valign="top"> 12.6. Dictionaries</td></tr></table></div></body></html> \ No newline at end of file