summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/html/unaccent.html
blob: 0ebfc8a5973013749f73ee448ef784b9e9baf054 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>F.48. unaccent — a text search dictionary which removes diacritics</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="tsm-system-time.html" title="F.47. tsm_system_time — the SYSTEM_TIME sampling method for TABLESAMPLE" /><link rel="next" href="uuid-ossp.html" title="F.49. uuid-ossp — a UUID generator" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">F.48. unaccent — a text search dictionary which removes diacritics</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="tsm-system-time.html" title="F.47. tsm_system_time —&#10;   the SYSTEM_TIME sampling method for TABLESAMPLE">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="contrib.html" title="Appendix F. Additional Supplied Modules and Extensions">Up</a></td><th width="60%" align="center">Appendix F. Additional Supplied Modules and Extensions</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 16.2 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="uuid-ossp.html" title="F.49. uuid-ossp — a UUID generator">Next</a></td></tr></table><hr /></div><div class="sect1" id="UNACCENT"><div class="titlepage"><div><div><h2 class="title" style="clear: both">F.48. unaccent — a text search dictionary which removes diacritics <a href="#UNACCENT" class="id_link">#</a></h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="unaccent.html#UNACCENT-CONFIGURATION">F.48.1. Configuration</a></span></dt><dt><span class="sect2"><a href="unaccent.html#UNACCENT-USAGE">F.48.2. Usage</a></span></dt><dt><span class="sect2"><a href="unaccent.html#UNACCENT-FUNCTIONS">F.48.3. Functions</a></span></dt></dl></div><a id="id-1.11.7.58.2" class="indexterm"></a><p>
  <code class="filename">unaccent</code> is a text search dictionary that removes accents
  (diacritic signs) from lexemes.
  It's a filtering dictionary, which means its output is
  always passed to the next dictionary (if any), unlike the normal
  behavior of dictionaries.  This allows accent-insensitive processing
  for full text search.
 </p><p>
  The current implementation of <code class="filename">unaccent</code> cannot be used as a
  normalizing dictionary for the <code class="filename">thesaurus</code> dictionary.
 </p><p>
  This module is considered <span class="quote"><span class="quote">trusted</span></span>, that is, it can be
  installed by non-superusers who have <code class="literal">CREATE</code> privilege
  on the current database.
 </p><div class="sect2" id="UNACCENT-CONFIGURATION"><div class="titlepage"><div><div><h3 class="title">F.48.1. Configuration <a href="#UNACCENT-CONFIGURATION" class="id_link">#</a></h3></div></div></div><p>
   An <code class="literal">unaccent</code> dictionary accepts the following options:
  </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
     <code class="literal">RULES</code> is the base name of the file containing the list of
     translation rules.  This file must be stored in
     <code class="filename">$SHAREDIR/tsearch_data/</code> (where <code class="literal">$SHAREDIR</code> means
     the <span class="productname">PostgreSQL</span> installation's shared-data directory).
     Its name must end in <code class="literal">.rules</code> (which is not to be included in
     the <code class="literal">RULES</code> parameter).
    </p></li></ul></div><p>
   The rules file has the following format:
  </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
     Each line represents one translation rule, consisting of a character with
     accent followed by a character without accent.  The first is translated
     into the second.  For example,
</p><pre class="programlisting">
À        A
Á        A
        A
à       A
Ä        A
Å        A
Æ        AE
</pre><p>
     The two characters must be separated by whitespace, and any leading or
     trailing whitespace on a line is ignored.
    </p></li><li class="listitem"><p>
     Alternatively, if only one character is given on a line, instances of
     that character are deleted; this is useful in languages where accents
     are represented by separate characters.
    </p></li><li class="listitem"><p>
     Actually, each <span class="quote"><span class="quote">character</span></span> can be any string not containing
     whitespace, so <code class="filename">unaccent</code> dictionaries could be used for
     other sorts of substring substitutions besides diacritic removal.
    </p></li><li class="listitem"><p>
     As with other <span class="productname">PostgreSQL</span> text search configuration files,
     the rules file must be stored in UTF-8 encoding.  The data is
     automatically translated into the current database's encoding when
     loaded.  Any lines containing untranslatable characters are silently
     ignored, so that rules files can contain rules that are not applicable in
     the current encoding.
    </p></li></ul></div><p>
   A more complete example, which is directly useful for most European
   languages, can be found in <code class="filename">unaccent.rules</code>, which is installed
   in <code class="filename">$SHAREDIR/tsearch_data/</code> when the <code class="filename">unaccent</code>
   module is installed.  This rules file translates characters with accents
   to the same characters without accents, and it also expands ligatures
   into the equivalent series of simple characters (for example, Æ to
   AE).
  </p></div><div class="sect2" id="UNACCENT-USAGE"><div class="titlepage"><div><div><h3 class="title">F.48.2. Usage <a href="#UNACCENT-USAGE" class="id_link">#</a></h3></div></div></div><p>
   Installing the <code class="literal">unaccent</code> extension creates a text
   search template <code class="literal">unaccent</code> and a dictionary <code class="literal">unaccent</code>
   based on it.  The <code class="literal">unaccent</code> dictionary has the default
   parameter setting <code class="literal">RULES='unaccent'</code>, which makes it immediately
   usable with the standard <code class="filename">unaccent.rules</code> file.
   If you wish, you can alter the parameter, for example

</p><pre class="programlisting">
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</pre><p>

   or create new dictionaries based on the template.
  </p><p>
   To test the dictionary, you can try:
</p><pre class="programlisting">
mydb=# select ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)
</pre><p>
  </p><p>
   Here is an example showing how to insert the
   <code class="filename">unaccent</code> dictionary into a text search configuration:
</p><pre class="programlisting">
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# select to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 &lt;b&gt;Hôtel&lt;/b&gt; de la Mer
(1 row)
</pre><p>
  </p></div><div class="sect2" id="UNACCENT-FUNCTIONS"><div class="titlepage"><div><div><h3 class="title">F.48.3. Functions <a href="#UNACCENT-FUNCTIONS" class="id_link">#</a></h3></div></div></div><p>
  The <code class="function">unaccent()</code> function removes accents (diacritic signs) from
  a given string.  Basically, it's a wrapper around
  <code class="filename">unaccent</code>-type dictionaries, but it can be used outside normal
  text search contexts.
 </p><a id="id-1.11.7.58.8.3" class="indexterm"></a><pre class="synopsis">
unaccent([<span class="optional"><em class="replaceable"><code>dictionary</code></em> <code class="type">regdictionary</code>, </span>] <em class="replaceable"><code>string</code></em> <code class="type">text</code>) returns <code class="type">text</code>
</pre><p>
  If the <em class="replaceable"><code>dictionary</code></em> argument is
  omitted, the text search dictionary named <code class="literal">unaccent</code> and
  appearing in the same schema as the <code class="function">unaccent()</code>
  function itself is used.
 </p><p>
  For example:
</p><pre class="programlisting">
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
</pre><p>
 </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="tsm-system-time.html" title="F.47. tsm_system_time —&#10;   the SYSTEM_TIME sampling method for TABLESAMPLE">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="contrib.html" title="Appendix F. Additional Supplied Modules and Extensions">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="uuid-ossp.html" title="F.49. uuid-ossp — a UUID generator">Next</a></td></tr><tr><td width="40%" align="left" valign="top">F.47. tsm_system_time —
   the <code class="literal">SYSTEM_TIME</code> sampling method for <code class="literal">TABLESAMPLE</code> </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 16.2 Documentation">Home</a></td><td width="40%" align="right" valign="top"> F.49. uuid-ossp — a UUID generator</td></tr></table></div></body></html>