From 12e8343068b906f8b2afddc5569968a8a91fa5b0 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Mon, 29 Apr 2024 06:24:24 +0200
Subject: Adding upstream version 2.1.0.

Signed-off-by: Daniel Baumann
---
 docs/architecture.md | 176 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 docs/architecture.md

(limited to 'docs/architecture.md')

diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..bebcf9d
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,176 @@
+(md/architecture)=
+
+# markdown-it design principles
+
+(md/data-flow)=
+## Data flow
+
+Input data is parsed via nested chains of rules. There are 3 nested chains -
+`core`, `block` & `inline`:
+
+```
+core
+    core.rule1 (normalize)
+    ...
+    core.ruleX
+
+    block
+        block.rule1 (blockquote)
+        ...
+        block.ruleX
+
+    core.ruleX1 (intermediate rule that applies on block tokens, nothing yet)
+    ...
+    core.ruleXX
+
+    inline (applied to each block token with "inline" type)
+        inline.rule1 (text)
+        ...
+        inline.ruleX
+
+    core.ruleYY (applies to all tokens)
+    ... (abbreviation, footnote, typographer, linkifier)
+
+```
+
+The result of parsing is a *list of tokens*, which is passed to the `renderer` to generate the HTML content.
+
+These tokens can themselves be parsed again to generate more tokens (e.g. a `list token` can be divided into multiple `inline tokens`).
+
+An `env` sandbox can be used alongside tokens to inject external variables for your parsers and renderers.
+
+Each chain (core / block / inline) uses an independent `state` object when parsing data, so that each parsing operation is independent and can be disabled on the fly.
+
+
+## Token stream
+
+Instead of a traditional AST, we use a lower-level data representation: *tokens*.
+The difference is simple:
+
+- Tokens are a simple sequence (Array).
+- Opening and closing tags are separate.
+- There are special token objects, "inline containers", which hold nested token
+  sequences with inline markup (bold, italic, text, ...).
+
+See [token class](https://github.com/executablebooks/markdown-it-py/tree/master/markdown_it/token.py)
+for details about the content of each token.
+
+In total, a token stream is:
+
+- On the top level - an array of paired or single "block" tokens:
+  - open/close for headers, lists, blockquotes, paragraphs, ...
+  - code blocks, fenced blocks, horizontal rules, HTML blocks, inline containers
+- Each inline container token has a `.children` property with a nested token stream for inline content:
+  - open/close for strong, em, link, code, ...
+  - text, line breaks
+
+Why not an AST? Because it's not needed for our tasks. We follow the KISS principle.
+If you wish, you can call a parser without a renderer and convert the token stream
+to an AST.
+
+More details about tokens:
+
+- [Renderer source](https://github.com/executablebooks/markdown-it-py/tree/master/markdown_it/renderer.py)
+- [Token source](https://github.com/executablebooks/markdown-it-py/tree/master/markdown_it/token.py)
+- [Live demo](https://markdown-it.github.io/) - type your text and click the `debug` tab.
+
+
+## Rules
+
+Rules are functions that do "magic" with parser `state` objects. A rule is associated with one or more *chains* and is unique. For instance, the `blockquote` rule is associated with the `blockquote`, `paragraph`, `heading` and `list` chains.
+
+Rules are managed by name via [Ruler](https://markdown-it.github.io/markdown-it/#Ruler) instances and can be `enabled` / `disabled` from the [MarkdownIt](https://markdown-it.github.io/markdown-it/#MarkdownIt) methods.
+
+Note that some rules have a `validation mode`: in this mode, rules do not
+modify the token stream, and only look ahead for the end of a token. This is an
+important design principle: a token stream is "write only" during the block & inline parse stages.
+
+Parsers are designed to keep rules independent of each other. You can safely enable/disable them, or
+add new ones. There are no universal recipes for how to create new rules - the design of
+distributed state machines with good data isolation is a tricky business. But you
+can investigate existing rules & plugins to see possible approaches.
+
+In complex cases, you can also ask for help in the issue tracker. The condition is
+simple: it should be clear from your ticket that you studied the docs and sources,
+and tried to do something yourself. We never refuse help to real developers.
+
+
+## Renderer
+
+After the token stream is generated, it's passed to a [renderer](https://github.com/executablebooks/markdown-it-py/tree/master/markdown_it/renderer.py).
+It then walks through all the tokens, passing each to a rule with the same name as the token type.
+
+Renderer rules are located in `md.renderer.rules[name]` and are simple functions
+with the same signature:
+
+```python
+def function(renderer, tokens, idx, options, env):
+    return htmlResult
+```
+
+In many cases, this makes it easy to change the output without touching the parser.
+For example, let's replace images that link to Vimeo with the player's iframe:
+
+```python
+import re
+from markdown_it import MarkdownIt
+
+vimeoRE = re.compile(r'^https?:\/\/(www\.)?vimeo\.com\/(\d+)($|\/)')
+
+def render_vimeo(self, tokens, idx, options, env):
+    token = tokens[idx]
+
+    if vimeoRE.match(token.attrs["src"]):
+
+        ident = vimeoRE.match(token.attrs["src"])[2]
+
+        return ('<div class="embed-responsive embed-responsive-16by9">\n' +
+                '  <iframe class="embed-responsive-item" src="//player.vimeo.com/video/' +
+                ident + '"></iframe>\n' +
+                '</div>\n')
+    return self.image(tokens, idx, options, env)
+
+md = MarkdownIt("commonmark")
+md.add_render_rule("image", render_vimeo)
+print(md.render("![](https://www.vimeo.com/123)"))
+```
+
+Here is another example, showing how to add `target="_blank"` to all links:
+
+```python
+from markdown_it import MarkdownIt
+
+def render_blank_link(self, tokens, idx, options, env):
+    tokens[idx].attrSet("target", "_blank")
+
+    # pass token to default renderer.
+    return self.renderToken(tokens, idx, options, env)
+
+md = MarkdownIt("commonmark")
+md.add_render_rule("link_open", render_blank_link)
+print(md.render("[a]\n\n[a]: b"))
+```
+
+Note that if you only need to add attributes, you can do so without overriding the renderer.
+For example, you can update tokens in the `core` chain. That is slower than a direct
+renderer override, but can be simpler.
+
+You can also write your own renderer to generate formats other than HTML, such as
+JSON/XML... You can even use it to generate an AST.
+
+## Summary
+
+This was mentioned in [Data flow](md/data-flow), but let's repeat the sequence again:
+
+1. Blocks are parsed, and the top level of the token stream is filled with block tokens.
+2. Content of inline containers is parsed, filling `.children` properties.
+3. Rendering happens.
+
+And somewhere in between you can apply additional transformations :). The full content
+of each chain can be seen at the top of the
+[parser_core.py](https://github.com/executablebooks/markdown-it-py/tree/master/markdown_it/parser_core.py),
+[parser_block.py](https://github.com/executablebooks/markdown-it-py/tree/master/markdown_it/parser_block.py) and
+[parser_inline.py](https://github.com/executablebooks/markdown-it-py/tree/master/markdown_it/parser_inline.py)
+files.
+
+You can also change the output directly in the [renderer](https://github.com/executablebooks/markdown-it-py/tree/master/markdown_it/renderer.py) for many simple cases.
--
cgit v1.2.3