<!-- doc/src/sgml/unaccent.sgml -->
<sect1 id="unaccent" xreflabel="unaccent">
<title>unaccent</title>
<indexterm zone="unaccent">
<primary>unaccent</primary>
</indexterm>
<para>
<filename>unaccent</filename> is a text search dictionary that removes accents
(diacritic signs) from lexemes.
It is a filtering dictionary: its output is always passed on to the
next dictionary (if any), whereas a normal dictionary ends processing
of a token once it recognizes it. This allows accent-insensitive
processing for full text search.
</para>
<para>
The current implementation of <filename>unaccent</filename> cannot be used as a
normalizing dictionary for the <filename>thesaurus</filename> dictionary.
</para>
<para>
This module is considered <quote>trusted</quote>, that is, it can be
installed by non-superusers who have <literal>CREATE</literal> privilege
on the current database.
</para>
<sect2>
<title>Configuration</title>
<para>
An <literal>unaccent</literal> dictionary accepts the following options:
</para>
<itemizedlist>
<listitem>
<para>
<literal>RULES</literal> is the base name of the file containing the list of
translation rules. This file must be stored in
<filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
the <productname>PostgreSQL</productname> installation's shared-data directory).
Its name must end in <literal>.rules</literal> (which is not to be included in
the <literal>RULES</literal> parameter).
</para>
</listitem>
</itemizedlist>
<para>
The rules file has the following format:
</para>
<itemizedlist>
<listitem>
<para>
Each line represents one translation rule, consisting of a character with
accent followed by a character without accent. The first is translated
into the second. For example,
<programlisting>
À A
Á A
Â A
Ã A
Ä A
Å A
Æ AE
</programlisting>
The two characters must be separated by whitespace, and any leading or
trailing whitespace on a line is ignored.
</para>
</listitem>
<listitem>
<para>
Alternatively, if only one character is given on a line, instances of
that character are deleted; this is useful in languages where accents
are represented by separate characters.
</para>
</listitem>
<listitem>
<para>
More generally, each <quote>character</quote> can be any string not containing
whitespace, so <filename>unaccent</filename> dictionaries can also be used for
other kinds of substring substitution besides diacritic removal.
</para>
</listitem>
<listitem>
<para>
As with other <productname>PostgreSQL</productname> text search configuration files,
the rules file must be stored in UTF-8 encoding. The data is
automatically translated into the current database's encoding when
loaded. Any lines containing untranslatable characters are silently
ignored, so that rules files can contain rules that are not applicable in
the current encoding.
</para>
</listitem>
</itemizedlist>
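<para>
To illustrate the format described above, here is a sketch of a small
custom rules file. The file name <filename>my_rules.rules</filename> and
the particular rules are hypothetical, chosen only for illustration:
<programlisting>
é e
ß ss
·
</programlisting>
The first line translates one character into another, the second replaces
one character with a two-character string, and the third, which gives only
one character, causes every middle dot (·) to be deleted.
</para>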
<para>
A more complete example, which is directly useful for most European
languages, can be found in <filename>unaccent.rules</filename>, which is installed
in <filename>$SHAREDIR/tsearch_data/</filename> when the <filename>unaccent</filename>
module is installed. This rules file translates characters with accents
to the same characters without accents, and it also expands ligatures
into the equivalent series of simple characters (for example, Æ to
AE).
</para>
</sect2>
<sect2>
<title>Usage</title>
<para>
Installing the <literal>unaccent</literal> extension creates a text
search template <literal>unaccent</literal> and a dictionary <literal>unaccent</literal>
based on it. The <literal>unaccent</literal> dictionary has the default
parameter setting <literal>RULES='unaccent'</literal>, which makes it immediately
usable with the standard <filename>unaccent.rules</filename> file.
If you wish, you can alter the parameter, for example
<programlisting>
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</programlisting>
or create new dictionaries based on the template.
</para>
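<para>
For instance, a separate dictionary could be created from the template as
sketched here. The names <literal>my_unaccent</literal> and
<literal>my_rules</literal> are hypothetical, and the command assumes that a
file <filename>my_rules.rules</filename> exists in
<filename>$SHAREDIR/tsearch_data/</filename>:
<programlisting>
mydb=# CREATE TEXT SEARCH DICTIONARY my_unaccent (TEMPLATE = unaccent, RULES = 'my_rules');
</programlisting>
Creating a separate dictionary leaves the default
<literal>unaccent</literal> dictionary and its rules unchanged.
</para>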
<para>
To test the dictionary, you can try:
<programlisting>
mydb=# SELECT ts_lexize('unaccent','Hôtel');
ts_lexize
-----------
{Hotel}
(1 row)
</programlisting>
</para>
<para>
Here is an example showing how to insert the
<filename>unaccent</filename> dictionary into a text search configuration:
<programlisting>
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# SELECT to_tsvector('fr','Hôtels de la Mer');
to_tsvector
-------------------
'hotel':1 'mer':4
(1 row)
mydb=# SELECT to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
?column?
----------
t
(1 row)
mydb=# SELECT ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
ts_headline
------------------------
&lt;b&gt;Hôtel&lt;/b&gt; de la Mer
(1 row)
</programlisting>
</para>
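<para>
To check how a configuration processes a token, the built-in
<function>ts_debug</function> function can be used. This sketch assumes
the <literal>fr</literal> configuration created in the example above:
<programlisting>
mydb=# SELECT alias, token, dictionaries, lexemes FROM ts_debug('fr', 'Hôtel');
</programlisting>
The <literal>dictionaries</literal> column is expected to list
<literal>unaccent</literal> ahead of <literal>french_stem</literal>,
reflecting the order given in <literal>ALTER MAPPING</literal>.
</para>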
</sect2>
<sect2>
<title>Functions</title>
<para>
The <function>unaccent()</function> function removes accents (diacritic signs) from
a given string. It is essentially a wrapper around
<filename>unaccent</filename>-type dictionaries, but it can be used outside normal
text search contexts.
</para>
<indexterm>
<primary>unaccent</primary>
</indexterm>
<synopsis>
unaccent(<optional><replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, </optional> <replaceable class="parameter">string</replaceable> <type>text</type>) returns <type>text</type>
</synopsis>
<para>
If the <replaceable class="parameter">dictionary</replaceable> argument is
omitted, the text search dictionary named <literal>unaccent</literal> and
appearing in the same schema as the <function>unaccent()</function>
function itself is used.
</para>
<para>
For example:
<programlisting>
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
</programlisting>
</para>
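<para>
Because <function>unaccent()</function> operates on ordinary
<type>text</type> values, it can also be used for accent-insensitive
comparisons outside of text search. The following is only a sketch; the
table <literal>places</literal> and its column <literal>name</literal>
are hypothetical:
<programlisting>
-- Compare a single value after removing accents
SELECT unaccent('Hôtel') = 'Hotel' AS same_word;

-- Accent- and case-insensitive lookup in a hypothetical table
SELECT name
FROM places
WHERE lower(unaccent(name)) = lower(unaccent('hotel de la mer'));
</programlisting>
With the standard <filename>unaccent.rules</filename> file, the first
query is expected to return true.
</para>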
</sect2>
</sect1>