From 5e45211a64149b3c659b90ff2de6fa982a5a93ed Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sat, 4 May 2024 14:17:33 +0200
Subject: Adding upstream version 15.5.

Signed-off-by: Daniel Baumann
---
 doc/src/sgml/html/textsearch-parsers.html | 59 +++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)
 create mode 100644 doc/src/sgml/html/textsearch-parsers.html

(limited to 'doc/src/sgml/html/textsearch-parsers.html')

diff --git a/doc/src/sgml/html/textsearch-parsers.html b/doc/src/sgml/html/textsearch-parsers.html
new file mode 100644
index 0000000..233e61b
--- /dev/null
+++ b/doc/src/sgml/html/textsearch-parsers.html
@@ -0,0 +1,59 @@
+
+12.5. Parsers

12.5. Parsers

+ Text search parsers are responsible for splitting raw document text
+ into tokens and identifying each token's type, where
+ the set of possible types is defined by the parser itself.
+ Note that a parser does not modify the text at all; it simply
+ identifies plausible word boundaries. Because of this limited scope,
+ there is less need for application-specific custom parsers than there is
+ for custom dictionaries. At present PostgreSQL
+ provides just one built-in parser, which has been found to be useful for a
+ wide range of applications.
+

+ The built-in parser is named pg_catalog.default.
+ It recognizes 23 token types, shown in Table 12.1.
+

Table 12.1. Default Parser's Token Types

 Alias           | Description                              | Example
-----------------+------------------------------------------+----------------------------------------------------------
 asciiword       | Word, all ASCII letters                  | elephant
 word            | Word, all letters                        | mañana
 numword         | Word, letters and digits                 | beta1
 asciihword      | Hyphenated word, all ASCII               | up-to-date
 hword           | Hyphenated word, all letters             | lógico-matemática
 numhword        | Hyphenated word, letters and digits      | postgresql-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | postgresql in the context postgresql-beta1
 hword_part      | Hyphenated word part, all letters        | lógico or matemática in the context lógico-matemática
 hword_numpart   | Hyphenated word part, letters and digits | beta1 in the context postgresql-beta1
 email           | Email address                            | foo@example.com
 protocol        | Protocol head                            | http://
 url             | URL                                      | example.com/stuff/index.html
 host            | Host                                     | example.com
 url_path        | URL path                                 | /stuff/index.html, in the context of a URL
 file            | File or path name                        | /usr/local/foo.txt, if not within a URL
 sfloat          | Scientific notation                      | -1.234e56
 float           | Decimal notation                         | -1.234
 int             | Signed integer                           | -1234
 uint            | Unsigned integer                         | 1234
 version         | Version number                           | 8.3.0
 tag             | XML tag                                  | <a href="dictionaries.html">
 entity          | XML entity                               | &amp;
 blank           | Space symbols                            | (any whitespace or punctuation not otherwise recognized)
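
The token types in Table 12.1 can also be read from a running server. The following is a minimal sketch, assuming a PostgreSQL session in which the built-in parser is reachable under the name default; the column aliases parser_schema and parser_name are chosen here for illustration only.

```sql
-- List the installed text search parsers and the schema each lives in.
SELECT n.nspname AS parser_schema, p.prsname AS parser_name
FROM pg_ts_parser AS p
JOIN pg_namespace AS n ON n.oid = p.prsnamespace;

-- List every token type the default parser can emit.
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;

-- Count the token types; this should agree with the number stated above.
SELECT count(*) FROM ts_token_type('default');
```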

Note

+ The parser's notion of a letter is determined by the database's
+ locale setting, specifically lc_ctype. Words containing
+ only the basic ASCII letters are reported as a separate token type,
+ since it is sometimes useful to distinguish them. In most European
+ languages, token types word and asciiword
+ should be treated alike.
+
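
One way to observe the locale dependence is to check the database's character-classification locale and then run ts_debug over one ASCII-only word and one word with a non-ASCII letter. This is only a sketch: the result depends on the database's locale and encoding, so no output is shown, and the english configuration in the last query is just an example choice.

```sql
-- Show the character-classification locale of the current database.
SELECT datctype FROM pg_database WHERE datname = current_database();

-- Classify one ASCII-only word and one word containing a non-ASCII letter.
SELECT alias, token FROM ts_debug('elephant mañana');

-- Show which dictionaries a configuration applies to the two token types
-- ("english" is used here only as an example configuration).
SELECT alias, token, dictionaries
FROM ts_debug('english', 'elephant mañana')
WHERE alias IN ('asciiword', 'word');
```

If both token types list the same dictionaries, the configuration treats them alike, as recommended above for most European languages.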

+ email does not support all valid email characters as
+ defined by RFC 5322.
+ Specifically, the only non-alphanumeric characters supported for
+ email user names are period, dash, and underscore.
+
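
The restriction on user names can be checked by comparing two addresses with ts_debug. Both addresses below are made-up examples. Output is omitted; the point is to compare how each string is tokenized.

```sql
-- User name built only from letters, digits, period, dash, and underscore.
SELECT alias, token FROM ts_debug('first.last_name-1@example.com');

-- User name containing "+", which RFC 5322 allows but which is not among
-- the characters the parser supports in email user names.
SELECT alias, token FROM ts_debug('first+tag@example.com');
```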

+ It is possible for the parser to produce overlapping tokens from the same
+ piece of text. As an example, a hyphenated word will be reported both
+ as the entire word and as each component:
+
+

+SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
+      alias      |               description                |     token
+-----------------+------------------------------------------+---------------
+ numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
+ hword_asciipart | Hyphenated word part, all ASCII          | foo
+ blank           | Space symbols                            | -
+ hword_asciipart | Hyphenated word part, all ASCII          | bar
+ blank           | Space symbols                            | -
+ hword_numpart   | Hyphenated word part, letters and digits | beta1
+

+
+ This behavior is desirable since it allows searches to work for both
+ the whole compound word and for components. Here is another
+ instructive example:
+
+

+SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
+  alias   |  description  |            token
+----------+---------------+------------------------------
+ protocol | Protocol head | http://
+ url      | URL           | example.com/stuff/index.html
+ host     | Host          | example.com
+ url_path | URL path      | /stuff/index.html
+
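
ts_debug adds dictionary information on top of the parser's output. To look at the parser's output alone, the lower-level functions ts_parse and ts_token_type can be joined on the token type ID. A minimal sketch, in which the alias names p, t, and ord are illustrative:

```sql
-- ts_parse returns (tokid, token); WITH ORDINALITY keeps the parser's order.
-- ts_token_type supplies the human-readable alias for each tokid.
SELECT p.ord, t.alias, p.token
FROM ts_parse('default', 'http://example.com/stuff/index.html')
     WITH ORDINALITY AS p(tokid, token, ord)
JOIN ts_token_type('default') AS t USING (tokid)
ORDER BY p.ord;
```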

+
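
The effect of overlapping tokens on searching can be tried directly. The sketch below assumes the built-in simple configuration, chosen so that no stemming or stop-word removal interferes. Each query is expected to find a match because the whole string and its parts are indexed as separate lexemes; results are not shown here.

```sql
-- Inspect the lexemes produced for the hyphenated word.
SELECT to_tsvector('simple', 'foo-bar-beta1');

-- Search for a single component of the compound.
SELECT to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'bar');

-- Search for the host alone in text that contains the full URL.
SELECT to_tsvector('simple', 'http://example.com/stuff/index.html')
       @@ to_tsquery('simple', 'example.com');
```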

\ No newline at end of file
-- 
cgit v1.2.3