Diffstat (limited to 'extensions/spellcheck/docs/index.rst')
-rw-r--r-- | extensions/spellcheck/docs/index.rst | 199 |
1 files changed, 199 insertions, 0 deletions
diff --git a/extensions/spellcheck/docs/index.rst b/extensions/spellcheck/docs/index.rst
new file mode 100644
index 0000000000..5ec2a224d9
--- /dev/null
+++ b/extensions/spellcheck/docs/index.rst
@@ -0,0 +1,199 @@
+======================================
+Managing the built-in en-US dictionary
+======================================
+
+The en-US build of Firefox includes a built-in Hunspell dictionary based on the
+`SCOWL`_ dataset. This document describes the process to add new words to the
+dictionary, or update it to the current upstream version.
+
+For more information about Hunspell or the affix file format, you can check
+`the Ubuntu man page for hunspell
+<https://manpages.ubuntu.com/manpages/bionic/man5/hunspell.5.html>`_.
+
+Requesting to add new words to the en-US dictionary
+===================================================
+
+If you’d like to add new words to the dictionary, you can add your request to
+`this bug <https://bugzilla.mozilla.org/show_bug.cgi?id=enus-dictionary>`_:
+
+* Include all possible forms, e.g. plural and genitive forms for nouns,
+  different tenses for verbs.
+* Try to provide information on the terms you want to add, in particular
+  references to external sources that confirm the usage of the term (e.g.
+  Merriam-Webster or Oxford online dictionaries).
+
+.. note::
+
+   If you’re fixing the existing bug with pending requests, make sure to `file a
+   new bug`_ and move the alias ``enus-dictionary`` (in the *Details* section)
+   from the old bug to the new one.
+
+Adding new words to the en-US dictionary
+========================================
+
+This section describes the process for adding new words to the dictionary:
+
+#. Get a clone of mozilla-central (see :ref:`Firefox Contributors' Quick
+   Reference`), if you don’t already have one, and make sure you can build it
+   successfully.
+#. Move into the dictionary sources directory using this command:
+   ``cd extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.
+#. Identify the current version of SCOWL by checking the file
+   ``README_en_US.txt`` (at the beginning of the file there is a line similar to
+   ``Generated from SCOWL Version 2020.12.07``, where ``2020.12.07`` is the
+   SCOWL version).
+#. Download the same version of the dictionary from the `SCOWL`_ homepage or
+   `SourceForge`_ as a tarball (tar.gz) and unpack it in the working directory.
+   Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
+#. Use the dedicated script ``edit-dictionary.sh`` to edit the dictionary. The
+   script only works if the environment variable ``EDITOR`` is set to the
+   executable of an editor program; if you don’t have it set, you can use
+   ``EDITOR=vim sh edit-dictionary.sh`` to edit using ``vim`` (or you can
+   substitute it with another editor), or you can just type
+   ``sh edit-dictionary.sh`` if you have an ``EDITOR`` already specified.
+
+   Copy and paste the full list of words, then save and quit the editor. It’s
+   not necessary to put the words in alphabetical order, as the script will
+   sort them.
+#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
+   sure it runs without errors. For more details on this script, see the
+   `make-new-dict.sh`_ section.
+#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
+   example, make sure that the size is about the same as the original dictionary
+   (or slightly larger).
+#. If everything looks correct, use ``sh install-new-dict.sh`` to copy the
+   generated file to the right location.
+#. Build Firefox and test your updated dictionary. Once you’re
+   satisfied, use the process described in :ref:`write_a_patch` to create a
+   patch.
+
+Note that the update script will modify two versions of the dictionary, and both
+need to be committed:
+
+* ``en-US.dic``: the dictionary actually shipped in the build; it uses
+  ISO-8859-1 encoding.
+* ``utf8/en-US.dic``: a version of the same dictionary with UTF-8 encoding. This
+  is used to work around issues with Phabricator, and it allows the diff to
+  display actual changes.
+
+Exclude words from suggestions
+==============================
+
+It’s possible to completely exclude words from suggested alternatives by adding
+an affix rule ``!`` at the end of the definition in the ``.dic`` file. For
+example:
+
+* ``bum`` would be changed to ``bum/!`` (note the additional forward slash).
+* ``bum/MS`` would be changed to ``bum/MS!``.
+
+In order to exclude a word from suggestions, follow the instructions available
+in `Adding new words to the en-US dictionary`_. Instead of running the
+``edit-dictionary.sh`` script (step 5), use a text editor to edit the file
+``en-US.dic`` directly, then proceed with the remaining instructions.
+
+.. warning::
+
+   Make sure to open ``en-US.dic`` with the correct encoding. For example, Visual
+   Studio Code will try to open it as ``UTF-8``, and it needs to be reopened with
+   encoding ``Western (ISO 8859-1)``.
+
+Upgrading dictionary to a new upstream version of SCOWL
+=======================================================
+
+The English dictionary available in mozilla-central is based on the
+`SCOWL`_ dictionary. Some scripts distributed with the SCOWL package are
+used to generate the files for the en-US dictionary.
+
+The working directory for this process is
+``extensions/spellcheck/locales/en-US/hunspell/dictionary-sources``.
+
+#. Download the latest version of the dictionary from the `SCOWL`_ homepage or
+   `SourceForge`_ as a tarball (tar.gz) and unpack it in the working directory.
+   Rename the resulting folder from ``scowl-YYYY.MM.DD`` to ``scowl``.
+#. Run the script ``sh make-new-dict.sh`` to generate a new dictionary and make
+   sure it runs without errors. For more details on this script, see the
+   `make-new-dict.sh`_ section.
+#. Do a sanity check on the resulting dictionary file ``en_US-mozilla.dic``. For
+   example, make sure that the size is about the same as the original dictionary
+   (or slightly larger).
+#. If everything looks correct, use ``sh install-new-dict.sh`` to copy the
+   generated file to the right location and use the process described in
+   :ref:`write_a_patch` to create a patch.
+
+Info about the file structure
+=============================
+
+mozilla-specific.txt
+--------------------
+
+This file contains Mozilla-specific words that should not be submitted
+upstream. For example, ``Firefox`` should go in this file (see `bug 237921`_).
+
+Note that the file ``5-mozilla-specific.txt`` is generated by expanding
+``mozilla-specific.txt`` and should not be edited directly.
+
+utf8 folder
+-----------
+
+``dictionary-sources/utf8`` is used to store a copy with UTF-8 encoding of the
+dictionary files. This is used to work around limitations in Phabricator, which
+treats ISO-8859-1 files as binary and won’t display a diff when updating them.
+
+Info about the included scripts
+===============================
+
+make-new-dict.sh
+----------------
+
+The dictionary upgrade script ``make-new-dict.sh`` works by expanding (i.e.
+“unmunching”) the affix compression dictionaries to create wordlists and
+use those to generate a new dictionary.
+
+The upgrade script expects the current upstream version to be kept in the
+directory ``orig``.
+
+The script will create a few files in ``dictionary-sources/support_files`` in the
+following order:
+
+* ``0-special.txt`` contains numbers and ordinals expanded from SCOWL
+  ``en.dic.supp``.
+* ``1-base.txt`` contains words expanded from ``en_US-custom.dic`` in the
+  **previous** version of SCOWL (from the ``orig`` folder).
+* ``2-mozilla.txt`` contains words expanded from the current Mozilla dictionary.
+* ``3-upstream.txt`` contains words expanded from ``en_US-custom.dic`` in the
+  **new** version of SCOWL (from the ``scowl/speller`` folder).
+* ``2-mozilla-removed.txt`` contains words that are only available in the SCOWL
+  dictionary, i.e. removed by Mozilla.
+* ``2-mozilla-added.txt`` contains words that are only available in the current
+  Mozilla dictionary, i.e. added by Mozilla.
+* ``4-patched.txt`` contains words from the new SCOWL dictionary
+  (``3-upstream.txt``), with the words in ``2-mozilla-removed.txt`` removed and
+  the words in ``2-mozilla-added.txt`` added.
+* ``5-mozilla-specific.txt`` is expanded from ``mozilla-specific.txt`` using the
+  current affix rules from the Mozilla dictionary.
+* ``5-mozilla-removed.txt`` and ``5-mozilla-added.txt`` contain words that are
+  respectively removed and added by Mozilla compared to the **new** SCOWL
+  version. These files could be used to submit upstream changes, but words
+  included in ``5-mozilla-specific.txt`` should be removed from this list.
+
+The new dictionary is available as ``en_US-mozilla.dic`` and should be copied
+over using the ``install-new-dict.sh`` script.
+
+install-new-dict.sh
+-------------------
+
+The script:
+
+* Creates a copy of ``orig`` as ``support_files/orig-bk`` and copies the new
+  upstream version to ``orig``.
+* Copies the existing Mozilla dictionary to ``support_files/mozilla-bk``.
+* Converts the dictionary (.dic) generated by ``make-new-dict.sh`` from UTF-8 to
+  ISO-8859-1 and moves it to the parent folder.
+* Sets the affix file (.aff) to use ``ISO8859-1`` as ``SET`` instead of the
+  original ``UTF-8``, removes ``ICONV`` patterns (input conversion tables).
+
+
+.. _SCOWL: http://wordlist.aspell.net
+.. _file a new bug: https://bugzilla.mozilla.org/show_bug.cgi?id=enus-dictionary
+.. _SourceForge: https://sourceforge.net/projects/wordlist/files/SCOWL/
+.. _bug 237921: https://bugzilla.mozilla.org/show_bug.cgi?id=237921
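
Reviewer sketch (not part of the patch): the wordlist arithmetic described in the make-new-dict.sh section can be illustrated with plain set operations. This is only a minimal illustration under stated assumptions, not the script's actual implementation: the function names ``mozilla_diff`` and ``patch_wordlist`` and the toy word lists are hypothetical, and the real script works on expanded ("unmunched") files and may sort its output differently.

```python
def mozilla_diff(base, mozilla):
    """Compare the previous SCOWL wordlist with the current Mozilla wordlist.

    Returns (removed, added), mirroring the roles of 2-mozilla-removed.txt
    and 2-mozilla-added.txt in the documentation.
    """
    base_set, mozilla_set = set(base), set(mozilla)
    removed = sorted(base_set - mozilla_set)  # only in SCOWL: removed by Mozilla
    added = sorted(mozilla_set - base_set)    # only in Mozilla: added by Mozilla
    return removed, added


def patch_wordlist(upstream, removed, added):
    """Build the patched list: new upstream words, minus removed, plus added.

    Mirrors the role of 4-patched.txt. Sorting uses Python's default
    (code point) order, which is an assumption made for this sketch.
    """
    return sorted((set(upstream) - set(removed)) | set(added))


# Hypothetical toy data standing in for the expanded wordlists.
base = ["apple", "bum", "cat"]             # role of 1-base.txt
mozilla = ["apple", "cat", "Firefox"]      # role of 2-mozilla.txt
upstream = ["apple", "bum", "cat", "dog"]  # role of 3-upstream.txt

removed, added = mozilla_diff(base, mozilla)
patched = patch_wordlist(upstream, removed, added)
print(removed, added, patched)
```

In this toy run, ``bum`` was dropped by Mozilla and ``Firefox`` was added, so the patched list keeps the new upstream word ``dog`` while preserving both Mozilla changes.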