Diffstat (limited to 'Contributing.md')
-rw-r--r--  Contributing.md  167
1 files changed, 167 insertions, 0 deletions
diff --git a/Contributing.md b/Contributing.md
new file mode 100644
index 0000000..93da428
--- /dev/null
+++ b/Contributing.md
@@ -0,0 +1,167 @@
+Licensing
+=========
+
+The code is distributed under the BSD 2-clause license. Contributors making pull
+requests must agree that they are able and willing to put their contributions
+under that license.
+
+Goals & non-goals of Pygments
+=============================
+
+Python support
+--------------
+
+Pygments supports all Python versions that are currently supported as per the [Python Developer's Guide](https://devguide.python.org/versions/). In addition, the default Python versions of the latest stable releases of RHEL, Ubuntu LTS, and Debian are supported, even if they are officially EOL. Supporting other end-of-life versions is a non-goal of Pygments.
+
+Validation
+----------
+
+Pygments does not attempt to validate its input. Accepting code that is not legal for a given language is fine if it simplifies the codebase and does not result in surprising behavior. For instance, a C89 lexer accepting `//`-based comments would be fine, because de facto all compilers supported them, and maintaining a separate lexer just to reject them would not be worth it.
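+
+As a purely illustrative sketch (not an excerpt from the actual C lexer), such a rule would be a one-liner among the lexer's token definitions:
+
+```python
+from pygments.token import Comment
+
+# Hypothetical extra rule for a C89 lexer: accept '//' line comments anyway,
+# since de facto every compiler supported them.
+_LINE_COMMENT = (r'//[^\n]*', Comment.Single)
+```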
+
+Contribution checklist
+======================
+
+* Check the documentation for how to write
+ [a new lexer](https://pygments.org/docs/lexerdevelopment/),
+ [a new formatter](https://pygments.org/docs/formatterdevelopment/) or
+ [a new filter](https://pygments.org/docs/filterdevelopment/)
+
+* Make sure to add a test for your new functionality, and where applicable,
+ write documentation.
+
+* When writing rules, try to merge simple rules. For instance, combine:
+
+ ```python
+ _PUNCTUATION = [
+ (r"\(", token.Punctuation),
+ (r"\)", token.Punctuation),
+ (r"\[", token.Punctuation),
+ (r"\]", token.Punctuation),
+ ("{", token.Punctuation),
+ ("}", token.Punctuation),
+ ]
+ ```
+
+ into:
+
+ ```python
+ (r"[\(\)\[\]{}]", token.Punctuation)
+ ```
+
+* Be careful with ``.*``. This matches greedily as much as it can. For instance,
+ a rule like ``@.*@`` will match the whole string ``@first@ second @third@``,
+  instead of matching ``@first@`` and ``@third@``. You can use ``@.*?@`` in
+ this case to stop early. The ``?`` tries to match _as few times_ as possible.
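+
+  A quick check with Python's ``re`` module directly shows the difference:
+
+  ```python
+  >>> import re
+  >>> re.findall(r'@.*@', '@first@ second @third@')
+  ['@first@ second @third@']
+  >>> re.findall(r'@.*?@', '@first@ second @third@')
+  ['@first@', '@third@']
+  ```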
+
+* Beware of so-called "catastrophic backtracking". As a first example, consider
+  the regular expression ``(A+)*B``. This is equivalent to ``A*B`` regarding
+ what it matches, but *non*-matches will take very long. This is because
+ of the way the regular expression engine works. Suppose you feed it 50
+ 'A's, and a 'C' at the end. It first matches the 'A's greedily in ``A+``,
+ but finds that it cannot match the end since 'B' is not the same as 'C'.
+ Then it backtracks, removing one 'A' from the first ``A+`` and trying to
+ match the rest as another ``(A+)*``. This fails again, so it backtracks
+ further left in the input string, etc. In effect, it tries all combinations
+
+ ```
+ (AAAAAAAAAAAAAAAAA)
+ (AAAAAAAAAAAAAAAA)(A)
+ (AAAAAAAAAAAAAAA)(AA)
+ (AAAAAAAAAAAAAAA)(A)(A)
+ (AAAAAAAAAAAAAA)(AAA)
+ (AAAAAAAAAAAAAA)(AA)(A)
+ ...
+ ```
+
+ Thus, the matching has exponential complexity. In a lexer, the
+ effect is that Pygments will seemingly hang when parsing invalid
+ input.
+
+ ```python
+ >>> import re
+ >>> re.match('(A+)*B', 'A'*50 + 'C') # hangs
+ ```
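+
+  For contrast, the equivalent ``A*B`` fails immediately on the same input,
+  since without the nested quantifier the engine only backtracks linearly:
+
+  ```python
+  >>> re.match('A*B', 'A'*50 + 'C')  # returns None right away
+  ```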
+
+ As a more subtle and real-life example, here is a badly written
+ regular expression to match strings:
+
+ ```python
+ r'"(\\?.)*?"'
+ ```
+
+  If the ending quote is missing, the regular expression engine will
+  find that it cannot match at the end, and try to backtrack with fewer
+  matches in the ``*?``. When it finds a backslash, as it has already
+  tried the possibility ``\\.``, it tries ``.`` (recognizing it as a
+  simple character without meaning), which leads to the same
+  exponential backtracking problem if there are lots of backslashes in
+  the (invalid) input string. A good way to write this would be
+  ``r'"([^\\]|\\.)*?"'``, where the inner group can only match in one
+  way. Better yet is to use a dedicated state, which not only
+  sidesteps the issue without headaches, but also allows you to
+  highlight string escapes:
+
+ ```python
+ 'root': [
+ ...,
+ (r'"', String, 'string'),
+ ...
+ ],
+ 'string': [
+ (r'\\.', String.Escape),
+ (r'"', String, '#pop'),
+ (r'[^\\"]+', String),
+ ]
+ ```
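+
+  The difference is easy to reproduce interactively; the exact threshold
+  depends on the regex engine and the input length, but with a long enough
+  run of backslashes the badly written pattern effectively never returns:
+
+  ```python
+  >>> import re
+  >>> bad = re.compile(r'"(\\?.)*?"')
+  >>> good = re.compile(r'"([^\\]|\\.)*?"')
+  >>> s = '"' + '\\' * 64      # unterminated string full of backslashes
+  >>> good.match(s)            # fails quickly, returning None
+  >>> bad.match(s)             # effectively hangs
+  ```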
+
+* When writing rules for patterns such as comments or strings, match as many
+ characters as possible in each token. This is an example of what not to
+ do:
+
+ ```python
+ 'comment': [
+ (r'\*/', Comment.Multiline, '#pop'),
+ (r'.', Comment.Multiline),
+ ]
+ ```
+
+ This generates one token per character in the comment, which slows
+ down the lexing process, and also makes the raw token output (and in
+ particular the test output) hard to read. Do this instead:
+
+ ```python
+ 'comment': [
+ (r'\*/', Comment.Multiline, '#pop'),
+ (r'[^*]+', Comment.Multiline),
+ (r'\*', Comment.Multiline),
+ ]
+ ```
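+
+  For instance, plugging the better ``'comment'`` state into a minimal,
+  purely hypothetical lexer (not one shipped with Pygments) yields only a
+  handful of tokens for the whole comment:
+
+  ```python
+  from pygments.lexer import RegexLexer
+  from pygments.token import Comment, Text
+
+  class DemoLexer(RegexLexer):
+      """Toy lexer used only to illustrate the comment rules above."""
+      tokens = {
+          'root': [
+              (r'/\*', Comment.Multiline, 'comment'),
+              (r'[^/]+', Text),
+              (r'/', Text),
+          ],
+          'comment': [
+              (r'\*/', Comment.Multiline, '#pop'),
+              (r'[^*]+', Comment.Multiline),
+              (r'\*', Comment.Multiline),
+          ],
+      }
+
+  # The comment body comes out as a few coarse tokens, not one per character.
+  print(list(DemoLexer().get_tokens('/* a longer comment */')))
+  ```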
+
+* Don't add imports of your lexer anywhere in the codebase. (In case you're
+ curious about ``compiled.py`` -- this file exists for backwards compatibility
+ reasons.)
+
+* Use the standard importing convention: ``from pygments.token import Punctuation``
+
+* For test cases that assert on the tokens produced by a lexer, use the following tools:
+
+ * You can use the ``testcase`` formatter to produce a piece of code that
+ can be pasted into a unittest file:
+ ``python -m pygments -l lua -f testcase <<< "local a = 5"``
+
+ * Most snippets should instead be put as a sample file under
+ ``tests/snippets/<lexer_alias>/*.txt``. These files are automatically
+ picked up as individual tests, asserting that the input produces the
+ expected tokens.
+
+ To add a new test, create a file with just your code snippet under a
+ subdirectory based on your lexer's main alias. Then run
+ ``pytest --update-goldens <filename.txt>`` to auto-populate the currently
+ expected tokens. Check that they look good and check in the file.
+
+ Also run the same command whenever you need to update the test if the
+ actual produced tokens change (assuming the change is expected).
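+
+    For orientation, a snippet file ends up looking roughly like this after
+    the golden run (the layout here is approximate, and the token column is
+    filled in by ``--update-goldens``, so the types shown are illustrative
+    only):
+
+    ```
+    ---input---
+    local a = 5
+
+    ---tokens---
+    'local'       Keyword.Declaration
+    ' '           Text.Whitespace
+    'a'           Name
+    ...
+    ```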
+
+  * Large test files should go in ``tests/examplefiles``. This works
+    similarly to ``snippets``, but the token output is stored in a separate
+    file. Output can also be regenerated with ``--update-goldens``.