From 311bcfc6b3acdd6fd152798c7f287ddf74fa2a98 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Tue, 16 Apr 2024 21:46:48 +0200
Subject: Adding upstream version 15.4.

Signed-off-by: Daniel Baumann
---
 doc/src/sgml/html/unaccent.html | 131 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)
 create mode 100644 doc/src/sgml/html/unaccent.html

diff --git a/doc/src/sgml/html/unaccent.html b/doc/src/sgml/html/unaccent.html
new file mode 100644
index 0000000..72554f8
--- /dev/null
+++ b/doc/src/sgml/html/unaccent.html
@@ -0,0 +1,131 @@

F.48. unaccent

+ unaccent is a text search dictionary that removes accents
+ (diacritic signs) from lexemes.
+ It's a filtering dictionary, which means its output is
+ always passed to the next dictionary (if any), unlike the normal
+ behavior of dictionaries. This allows accent-insensitive processing
+ for full text search.
+

+ The current implementation of unaccent cannot be used as a
+ normalizing dictionary for the thesaurus dictionary.
+

+ This module is considered trusted, that is, it can be
+ installed by non-superusers who have CREATE privilege
+ on the current database.
+
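
As a minimal sketch of that installation step, a user with CREATE privilege on the current database could run something like the following. The IF NOT EXISTS clause is added here only so the statement can be re-run safely, and the catalog query is just one possible way to confirm the result.

```sql
-- Install the extension into the current database. Because the module
-- is trusted, this does not require superuser rights.
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Confirm that the extension is now registered in this database.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'unaccent';
```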

F.48.1. Configuration

+ An unaccent dictionary accepts the following options:
+

  • + RULES is the base name of the file containing the list of
    + translation rules. This file must be stored in
    + $SHAREDIR/tsearch_data/ (where $SHAREDIR means
    + the PostgreSQL installation's shared-data directory).
    + Its name must end in .rules (which is not to be included in
    + the RULES parameter).
    +

+ The rules file has the following format:
+

  • + Each line represents one translation rule, consisting of a character with
    + accent followed by a character without accent. The first is translated
    + into the second. For example,
    +

    +À        A
    +Á        A
    +Â        A
    +Ã        A
    +Ä        A
    +Å        A
    +Æ        AE
    +

    + The two characters must be separated by whitespace, and any leading or
    + trailing whitespace on a line is ignored.
    +

  • + Alternatively, if only one character is given on a line, instances of
    + that character are deleted; this is useful in languages where accents
    + are represented by separate characters.
    +

  • + Actually, each character can be any string not containing
    + whitespace, so unaccent dictionaries could be used for
    + other sorts of substring substitutions besides diacritic removal.
    +

  • + As with other PostgreSQL text search configuration files,
    + the rules file must be stored in UTF-8 encoding. The data is
    + automatically translated into the current database's encoding when
    + loaded. Any lines containing untranslatable characters are silently
    + ignored, so that rules files can contain rules that are not applicable in
    + the current encoding.
    +

+ A more complete example, which is directly useful for most European
+ languages, can be found in unaccent.rules, which is installed
+ in $SHAREDIR/tsearch_data/ when the unaccent
+ module is installed. This rules file translates characters with accents
+ to the same characters without accents, and it also expands ligatures
+ into the equivalent series of simple characters (for example, Æ to
+ AE).
+

F.48.2. Usage

+ Installing the unaccent extension creates a text
+ search template unaccent and a dictionary unaccent
+ based on it. The unaccent dictionary has the default
+ parameter setting RULES='unaccent', which makes it immediately
+ usable with the standard unaccent.rules file.
+ If you wish, you can alter the parameter, for example
+
+

+mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
+

+
+ or create new dictionaries based on the template.
+
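
Creating a separate dictionary from the template might look like the sketch below. The dictionary name my_unaccent is hypothetical, and the example assumes that a file my_rules.rules already exists in $SHAREDIR/tsearch_data/.

```sql
-- Hypothetical dictionary name. Assumes my_rules.rules has been placed
-- in $SHAREDIR/tsearch_data/ beforehand.
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'
);
```

Keeping a custom rules file in its own dictionary leaves the default unaccent dictionary pointing at the standard unaccent.rules file.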

+ To test the dictionary, you can try:
+

+mydb=# select ts_lexize('unaccent','Hôtel');
+ ts_lexize
+-----------
+ {Hotel}
+(1 row)
+

+
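
Two further checks can be sketched in the same way, assuming the default dictionary still uses the standard unaccent.rules file. The first exercises the ligature rule (Æ to AE) described earlier. The second passes a token with no accented characters, which the dictionary is expected to leave alone by returning NULL rather than a lexeme array.

```sql
-- Ligature expansion: the standard rules map Æ to AE.
SELECT ts_lexize('unaccent', 'Æther');

-- No rule applies to this token, so the dictionary should report
-- "no change" (a NULL result) and leave it for the next dictionary.
SELECT ts_lexize('unaccent', 'Hotel') IS NULL AS unchanged;
```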

+ Here is an example showing how to insert the
+ unaccent dictionary into a text search configuration:
+

+mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
+mydb=# ALTER TEXT SEARCH CONFIGURATION fr
+        ALTER MAPPING FOR hword, hword_part, word
+        WITH unaccent, french_stem;
+mydb=# select to_tsvector('fr','Hôtels de la Mer');
+    to_tsvector
+-------------------
+ 'hotel':1 'mer':4
+(1 row)
+
+mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
+ ?column?
+----------
+ t
+(1 row)
+
+mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
+      ts_headline
+------------------------
+ <b>Hôtel</b> de la Mer
+(1 row)
+

+

F.48.3. Functions

+ The unaccent() function removes accents (diacritic signs) from
+ a given string. Basically, it's a wrapper around
+ unaccent-type dictionaries, but it can be used outside normal
+ text search contexts.
+

+unaccent([dictionary regdictionary, ] string text) returns text
+

+ If the dictionary argument is
+ omitted, the text search dictionary named unaccent and
+ appearing in the same schema as the unaccent()
+ function itself is used.
+

+ For example:
+

+SELECT unaccent('unaccent', 'Hôtel');
+SELECT unaccent('Hôtel');
+

+
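
As an illustration of use outside text search, the function can make an ordinary string comparison accent-insensitive. The sketch below is self-contained: the places row set and its name column are made up for the example, and the standard unaccent.rules file is assumed.

```sql
-- Illustrative inline data; "places" and "name" are hypothetical.
SELECT name
FROM (VALUES ('Hôtel de la Mer'),
             ('Hotel du Lac'),
             ('Auberge du Port')) AS places(name)
-- Strip accents, then lower-case, before the prefix comparison.
WHERE lower(unaccent(name)) LIKE 'hotel%';
```

Under those assumptions the query should match both the accented and the unaccented spelling of "Hotel".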

\ No newline at end of file