# About Hunspell
Hunspell is a free spell checker and morphological analyzer library
and command-line tool, licensed under the LGPL/GPL/MPL tri-license.

Hunspell is used by the LibreOffice office suite, free browsers such as
Mozilla Firefox and Google Chrome, and other tools and OSes, such as
Linux distributions and macOS. It is also a command-line tool for
Linux, Unix-like and other OSes.

It is designed for fast, high-quality spell checking and correction
for languages with word-level writing systems, including languages with
rich morphology, complex word compounding and character encoding.

Hunspell interfaces: an Ispell-like terminal interface using the Curses
library, an Ispell pipe interface, C++/C APIs and a shared library, plus
existing language bindings for other programming languages.

Hunspell's code base comes from OpenOffice.org's MySpell library,
developed by Kevin Hendricks (originally a from-scratch C++
reimplementation of the spell checking and affixation of Geoff Kuenning's
International Ispell, later extended with, e.g., n-gram suggestions).
See http://lingucomponent.openoffice.org/MySpell-3.zip and
its README, CONTRIBUTORS and license.readme (here: license.myspell) files.

Main features of the Hunspell library, developed by László Németh:

- Unicode support
- Highly customizable suggestions: word-part replacement tables and
  stem-level phonetic and other alternative transcriptions to recognize
  and fix all typical misspellings, avoid suggesting offensive words, etc.
- Complex morphology: dictionary and affix homonyms; twofold affix
  stripping to handle inflectional and derivational morpheme groups for
  agglutinative languages, like Azeri, Basque, Estonian, Finnish, Hungarian,
  Turkish; 64 thousand affix classes with an arbitrary number of affixes;
  conditional affixes, circumfixes, fogemorphemes, zero morphemes,
  virtual dictionary stems, forbidden words to avoid overgeneration, etc.
- Handling complex compounds (for example, for Finno-Ugric, German and
  Indo-Aryan languages): recognizing compounds made of an arbitrary
  number of words, handling affixation within compounds, etc.
- Custom dictionaries with affixation
- Stemming
- Morphological analysis (in custom item and arrangement style)
- Morphological generation
- SPELLML XML API over the plain spell() API function for easier integration
  of stemming, morphological generation and custom dictionaries with affixation
- Language-specific algorithms, like special casing of Azeri or Turkish
  dotted i and German sharp s, and special compound rules of Hungarian.

Main features of the Hunspell command-line tool, developed by László Németh:

- Reimplementation of the quick interactive interface of Geoff Kuenning's Ispell
- Parsing formats: text, OpenDocument, TeX/LaTeX, HTML/SGML/XML, nroff/troff
- Custom dictionaries with optional affixation, specified by a model word
- Multiple dictionary usage (for example `hunspell -d en_US,de_DE,de_medical`)
- Various filtering options (bad or good words/lines)
- Morphological analysis (option `-m`)
- Stemming (option `-s`)

See `man hunspell`, `man 3 hunspell`, `man 5 hunspell` for the complete manual.

# Dependencies
Build-only dependencies:

    g++ make autoconf automake autopoint libtool

Runtime dependencies:

| | Mandatory | Optional |
|---------------|------------------|------------------|
|libhunspell | | |
|hunspell tool | libiconv gettext | ncurses readline |
# Compiling on GNU/Linux and Unixes
We first need to install the dependencies. On Linux, `gettext` and
`libiconv` are part of the standard library. On other Unixes we
need to install them manually.

For Ubuntu:

    sudo apt install autoconf automake autopoint libtool

Then run the following commands:

    autoreconf -vfi
    ./configure
    make
    sudo make install
    sudo ldconfig

For dictionary development, use the `--with-warnings` option of
configure.

For the interactive user interface of the Hunspell executable, use the
`--with-ui` option.

Optional developer packages:

- ncurses (needed for `--with-ui`), e.g. libncursesw5 for UTF-8
- readline (for fancy input line editing, configure parameter:
  `--with-readline`)

In Ubuntu, the packages are:

    libncurses5-dev libreadline-dev

# Compiling on OSX and macOS
On macOS, always use `clang` rather than `g++` as the compiler, because
Homebrew dependencies are built with it.

    brew install autoconf automake libtool gettext
    brew link gettext --force

Then run `autoreconf`, `configure` and `make` as described above.

# Compiling on Windows
## Compiling with Mingw64 and MSYS2
Download MSYS2, update everything and install the following
packages:

    pacman -S base-devel mingw-w64-x86_64-toolchain mingw-w64-x86_64-libtool

Open the Mingw-w64 Win64 prompt and compile the same way as on Linux, see
above.

## Compiling in Cygwin environment
Download and install the Cygwin environment for Windows with the following
extra packages:

- make
- automake
- autoconf
- libtool
- gcc-g++ development package
- ncurses, readline (for user interface)
- iconv (character conversion)

Then compile the same way as on Linux. Cygwin builds depend on
Cygwin1.dll.

# Debugging
It is recommended to install a debug build of the standard library:

    libstdc++6-6-dbg

For debugging we need to create a debug build and then start `gdb`:

    ./configure CXXFLAGS='-g -O0 -Wall -Wextra'
    make
    ./libtool --mode=execute gdb src/tools/hunspell

You can also pass the `CXXFLAGS` directly to `make` without calling
`./configure`, but we don't recommend this during long development
sessions.

If you would like to develop and debug with an IDE, see the documentation at
https://github.com/hunspell/hunspell/wiki/IDE-Setup

# Testing
Testing Hunspell (see tests in the tests/ subdirectory):

    make check

or with the Valgrind debugger:

    make check
    VALGRIND=[Valgrind_tool] make check

For example:

    make check
    VALGRIND=memcheck make check

# Documentation
Features and dictionary format:

    man 5 hunspell
    man hunspell
    hunspell -h

http://hunspell.github.io/

# Usage
After compiling and installing (see INSTALL) you can run the Hunspell
spell checker (compiled with user interface) with a Hunspell or MySpell
dictionary:

    hunspell -d en_US text.txt

or without interface:

    hunspell
    hunspell -d en_GB -l <text.txt

Dictionaries consist of an affix (.aff) file and a dictionary (.dic) file. For
example, download the American English dictionary files of LibreOffice
(older version, but with stemming and morphological generation) with

    wget -O en_US.aff 'https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff?id=a4473e06b56bfe35187e302754f6baaa8d75e54f'
    wget -O en_US.dic 'https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic?id=a4473e06b56bfe35187e302754f6baaa8d75e54f'

With command-line input and output, it is possible to check quickly how it works,
for example with the input words "example", "examples", "teached" and
"verybaaaaaaaaaaaaaaaaaaaaaad":

    $ hunspell -d en_US
    Hunspell 1.7.0
    example
    *
    examples
    + example
    teached
    & teached 9 0: taught, teased, reached, teaches, teacher, leached, beached
    verybaaaaaaaaaaaaaaaaaaaaaad
    # verybaaaaaaaaaaaaaaaaaaaaaad 0

In the output, `*` and `+` mark correct (accepted) words (`*` = dictionary stem,
`+` = affixed form of the dictionary stem that follows), and
`&` and `#` mark bad (rejected) words (`&` = with suggestions, `#` = without suggestions);
see `man hunspell`.

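
A program that reads this output has to tell the four markers apart. The following is a minimal C++ sketch of such a parser, not part of Hunspell itself: the names `ResultKind`, `ResultLine` and `parse_result_line` are hypothetical, and the sketch assumes that the checked word contains no colon. It covers only the `*`, `+`, `&` and `#` lines shown above; any other line is reported as `Unknown`.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Result kinds for the four markers described in this README.
enum class ResultKind { Stem, Affixed, Suggestions, NoSuggestions, Unknown };

// Hypothetical holder for one parsed result line (not a Hunspell type).
struct ResultLine {
    ResultKind kind = ResultKind::Unknown;
    std::string word;                      // stem after '+', rejected word after '&' or '#'
    std::vector<std::string> suggestions;  // filled only for '&' lines
};

// Parse one line such as "*", "+ example" or "& teached 9 0: taught, teased".
inline ResultLine parse_result_line(const std::string& line) {
    ResultLine result;
    if (line.empty()) return result;

    switch (line[0]) {
        case '*': result.kind = ResultKind::Stem; return result;
        case '+': result.kind = ResultKind::Affixed; break;
        case '&': result.kind = ResultKind::Suggestions; break;
        case '#': result.kind = ResultKind::NoSuggestions; break;
        default:  return result;  // not one of the four markers
    }

    // The first token after the marker is the stem or the rejected word.
    std::istringstream rest(line.substr(1));
    rest >> result.word;

    if (result.kind == ResultKind::Suggestions) {
        // Assumption: suggestions follow the first colon, separated by commas.
        const std::string::size_type colon = line.find(':');
        if (colon != std::string::npos) {
            std::istringstream list(line.substr(colon + 1));
            std::string item;
            while (std::getline(list, item, ',')) {
                // Trim surrounding spaces but keep spaces inside a suggestion.
                const std::string::size_type first = item.find_first_not_of(' ');
                const std::string::size_type last = item.find_last_not_of(' ');
                if (first != std::string::npos)
                    result.suggestions.push_back(item.substr(first, last - first + 1));
            }
        }
    }
    return result;
}
```

For instance, `parse_result_line("+ example")` returns kind `Affixed` with `word` set to `example`, and a `&` line returns the listed suggestions in order. The numeric fields after the word are skipped on purpose, since this sketch does not interpret them.
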
Example for stemming:

    $ hunspell -d en_US -s
    mice
    mice mouse

Example for morphological analysis (very limited with this English dictionary):

    $ hunspell -d en_US -m
    mice
    mice st:mouse ts:Ns
    cats
    cats st:cat ts:0 is:Ns
    cats st:cat ts:0 is:Vs

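
Each analysis line above consists of the input word followed by `key:value` fields. As an illustration, the following C++ sketch splits such a line into its fields. It is not part of Hunspell: `Analysis`, `parse_analysis_line` and `first_field` are hypothetical names, and the sketch assumes that fields are separated by whitespace and that neither keys nor values contain spaces. The meaning of the individual keys is documented in `man 5 hunspell`.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical holder for one line of `hunspell -m` output (not a Hunspell type).
struct Analysis {
    std::string word;                                          // the analyzed input word
    std::vector<std::pair<std::string, std::string>> fields;   // key/value pairs in output order
};

// Split a line such as "cats st:cat ts:0 is:Ns" into the word and its fields.
inline Analysis parse_analysis_line(const std::string& line) {
    Analysis analysis;
    std::istringstream in(line);
    in >> analysis.word;

    std::string token;
    while (in >> token) {
        const std::string::size_type colon = token.find(':');
        if (colon == std::string::npos) continue;  // skip tokens that are not key:value
        analysis.fields.emplace_back(token.substr(0, colon), token.substr(colon + 1));
    }
    return analysis;
}

// Return the value of the first field with the given key, or "" if it is absent.
inline std::string first_field(const Analysis& analysis, const std::string& key) {
    for (const auto& field : analysis.fields) {
        if (field.first == key) return field.second;
    }
    return std::string();
}
```

With the second `cats` line above, `first_field(parse_analysis_line("cats st:cat ts:0 is:Vs"), "st")` returns `cat`. Fields are kept as an ordered list rather than a map so that a repeated key would not be lost.
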
# Other executables
The src/tools directory contains the following executables after compiling.

- The main executable:
    - hunspell: main program for spell checking and others (see
      manual)
- Example tools:
    - analyze: example of spell checking, stemming and morphological
      analysis
    - chmorph: example of automatic morphological generation and
      conversion
    - example: example of spell checking and suggestion
- Tools for dictionary development:
    - affixcompress: dictionary generation from large (millions of
      words) vocabularies
    - makealias: alias compression (Hunspell only, not backward compatible
      with MySpell)
    - wordforms: word generation (Hunspell version of unmunch)
    - hunzip: decompressor of hzip format
    - hzip: compressor of hzip format
    - munch (DEPRECATED, use affixcompress): dictionary generation
      from vocabularies (it needs an affix file, too).
    - unmunch (DEPRECATED, use wordforms): list all recognized words
      of a MySpell dictionary

Example for morphological generation:

    $ ~/hunspell/src/tools/analyze en_US.aff en_US.dic /dev/stdin
    cat mice
    generate(cat, mice) = cats
    mouse cats
    generate(mouse, cats) = mice
    generate(mouse, cats) = mouses

# Using Hunspell library with GCC
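Including in your program:

    #include <hunspell.hxx>

Linking with the Hunspell static library (list the library after the source
file, because the linker resolves symbols from left to right):

    g++ example.cxx -lhunspell-1.7
    # or better, use pkg-config
    g++ example.cxx $(pkg-config --cflags --libs hunspell)
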
## Dictionaries
Hunspell (MySpell) dictionaries:
- https://wiki.documentfoundation.org/Language_support_of_LibreOffice
- http://cgit.freedesktop.org/libreoffice/dictionaries
- http://extensions.libreoffice.org
- http://extensions.openoffice.org
- http://wiki.services.openoffice.org/wiki/Dictionaries
Aspell dictionaries (conversion: man 5 hunspell):
- ftp://ftp.gnu.org/gnu/aspell/dict
László Németh, nemeth at numbertext org