summaryrefslogtreecommitdiffstats
path: root/doc/docs/unicode.rst
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 11:33:32 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-04 11:33:32 +0000
commit1f403ad2197fc7442409f434ee574f3e6b46fb73 (patch)
tree0299c6dd11d5edfa918a29b6456bc1875f1d288c /doc/docs/unicode.rst
parentInitial commit. (diff)
downloadpygments-1f403ad2197fc7442409f434ee574f3e6b46fb73.tar.xz
pygments-1f403ad2197fc7442409f434ee574f3e6b46fb73.zip
Adding upstream version 2.14.0+dfsg.upstream/2.14.0+dfsgupstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/docs/unicode.rst')
-rw-r--r--doc/docs/unicode.rst58
1 files changed, 58 insertions, 0 deletions
diff --git a/doc/docs/unicode.rst b/doc/docs/unicode.rst
new file mode 100644
index 0000000..dca9111
--- /dev/null
+++ b/doc/docs/unicode.rst
@@ -0,0 +1,58 @@
+=====================
+Unicode and Encodings
+=====================
+
+Since Pygments 0.6, all lexers use unicode strings internally. Because of that
+you might encounter the occasional :exc:`UnicodeDecodeError` if you pass strings
+with the wrong encoding.
+
+Per default all lexers have their input encoding set to `guess`. This means
+that the following encodings are tried:
+
+* UTF-8 (including BOM handling)
+* The locale encoding (i.e. the result of `locale.getpreferredencoding()`)
+* As a last resort, `latin1`
+
+If you pass a lexer a byte string object (not unicode), it tries to decode the
+data using this encoding.
+
+You can override the encoding using the `encoding` or `inencoding` lexer
+options. If you have the `chardet`_ library installed and set the encoding to
+``chardet`` if will analyse the text and use the encoding it thinks is the
+right one automatically:
+
+.. sourcecode:: python
+
+ from pygments.lexers import PythonLexer
+ lexer = PythonLexer(encoding='chardet')
+
+The best way is to pass Pygments unicode objects. In that case you can't get
+unexpected output.
+
+The formatters now send Unicode objects to the stream if you don't set the
+output encoding. You can do so by passing the formatters an `encoding` option:
+
+.. sourcecode:: python
+
+ from pygments.formatters import HtmlFormatter
+ f = HtmlFormatter(encoding='utf-8')
+
+**You will have to set this option if you have non-ASCII characters in the
+source and the output stream does not accept Unicode written to it!**
+This is the case for all regular files and for terminals.
+
+Note: The Terminal formatter tries to be smart: if its output stream has an
+`encoding` attribute, and you haven't set the option, it will encode any
+Unicode string with this encoding before writing it. This is the case for
+`sys.stdout`, for example. The other formatters don't have that behavior.
+
+Another note: If you call Pygments via the command line (`pygmentize`),
+encoding is handled differently, see :doc:`the command line docs <cmdline>`.
+
+.. versionadded:: 0.7
+ The formatters now also accept an `outencoding` option which will override
+ the `encoding` option if given. This makes it possible to use a single
+ options dict with lexers and formatters, and still have different input and
+ output encodings.
+
+.. _chardet: https://chardet.github.io/