Diffstat (limited to 'vendor/tendril/README.md')
-rw-r--r-- vendor/tendril/README.md 96
1 files changed, 96 insertions, 0 deletions
diff --git a/vendor/tendril/README.md b/vendor/tendril/README.md
new file mode 100644
index 000000000..fced4b70d
--- /dev/null
+++ b/vendor/tendril/README.md
@@ -0,0 +1,96 @@
+# tendril
+
+**Warning**: This library is at a very early stage of development, and it
+contains a substantial amount of `unsafe` code. Use at your own risk!
+
+[![Build Status](https://github.com/servo/tendril/workflows/CI/badge.svg)](https://github.com/servo/tendril/actions)
+
+[API Documentation](https://doc.servo.org/tendril/index.html)
+
+## Introduction
+
+`Tendril` is a compact string/buffer type, optimized for zero-copy parsing.
+Tendrils have the semantics of owned strings, but are sometimes views into
+shared buffers. When you mutate a tendril, an owned copy is made if necessary.
+Further mutations occur in-place until the string becomes shared, e.g. with
+`clone()` or `subtendril()`.
+
+Buffer sharing is accomplished through thread-local (non-atomic) reference
+counting, which has very low overhead. The Rust type system will prevent
+you at compile time from sending a tendril between threads. (See below
+for thoughts on relaxing this restriction.)
+
+Whereas `String` allocates in the heap for any non-empty string, `Tendril` can
+store small strings (up to 8 bytes) in-line, without a heap allocation.
+`Tendril` is also smaller than `String` on 64-bit platforms — 16 bytes versus
+24. `Option<Tendril>` is the same size as `Tendril`, thanks to
+[`NonZero`][NonZero].
+
+The maximum length of a tendril is 4 GB. The library will panic if you attempt
+to go over the limit.
+
+## Formats and encoding
+
+`Tendril` uses
+[phantom types](https://doc.rust-lang.org/stable/rust-by-example/generics/phantom.html)
+to track a buffer's format. This determines at compile time which
+operations are available on a given tendril. For example, `Tendril<UTF8>` and
+`Tendril<Bytes>` can be borrowed as `&str` and `&[u8]` respectively.
+
+`Tendril` also integrates with
+[rust-encoding](https://github.com/lifthrasiir/rust-encoding) and has
+preliminary support for [WTF-8][] buffers.
+
+## Plans for the future
+
+### Ropes
+
+[html5ever][] will use `Tendril` as a zero-copy text representation. It would
+be good to preserve this all the way through to Servo's DOM. This would reduce
+memory consumption, and possibly speed up text shaping and painting. However,
+DOM text may conceivably be larger than 4 GB, and in any case will not be
+contiguous in memory around, for example, a character entity reference.
+
+*Solution:* Build a **[rope][] on top of these strings** and use that as
+Servo's representation of DOM text. We can perhaps do text shaping and/or
+painting in parallel for different chunks of a rope. html5ever can additionally
+use this rope type as a replacement for `BufferQueue`.
+
+Because the underlying buffers are reference-counted, the bulk of this rope
+is already a [persistent data structure][]. Consider what happens when
+appending two ropes to get a "new" rope. A vector-backed rope would copy a
+vector of small structs, one for each chunk, and would bump the corresponding
+refcounts. But it would not copy any of the string data.
+
+If we want more sharing, then a [2-3 finger tree][] could be a good choice.
+We would probably stick with `VecDeque` for ropes under a certain size.
+
+### UTF-16 compatibility
+
+SpiderMonkey expects text to be in UCS-2 format for the most part. The
+semantics of JavaScript strings are difficult to implement on UTF-8. This also
+applies to HTML parsing via `document.write`. Also, passing SpiderMonkey a
+string that isn't contiguous in memory will incur additional overhead and
+complexity, if not a full copy.
+
+*Solution:* Use **WTF-8 in parsing** and in the DOM. Servo will **convert to
+contiguous UTF-16 when necessary**. The conversion can easily be parallelized,
+if we find a practical need to convert huge chunks of text all at once.
+
+### Source span information
+
+Some html5ever API consumers want to know the originating location in the HTML
+source file(s) of each token or parse error. An example application would be a
+command-line HTML validator with diagnostic output similar to `rustc`'s.
+
+*Solution:* Accept **some metadata along with each input string**. The type of
+metadata is chosen by the API consumer; it defaults to `()`, which has size
+zero. For any non-inline string, we can provide the associated metadata as well
+as a byte offset.
+
+[NonZero]: https://doc.rust-lang.org/core/nonzero/struct.NonZero.html
+[html5ever]: https://github.com/servo/html5ever
+[WTF-8]: https://simonsapin.github.io/wtf-8/
+[rope]: https://en.wikipedia.org/wiki/Rope_%28data_structure%29
+[persistent data structure]: https://en.wikipedia.org/wiki/Persistent_data_structure
+[2-3 finger tree]: https://www.staff.city.ac.uk/~ross/papers/FingerTree.html
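
Note on the copy-on-write behavior described in the README's Introduction: the sketch below illustrates the general idea using only the Rust standard library. It is not tendril's implementation or API. `CowBuf`, `is_shared`, and `push_slice` are hypothetical names chosen for illustration, and the small-string inline storage and 16-byte layout the README mentions are not modeled. `Rc` stands in for the thread-local (non-atomic) reference count; because `Rc` is not `Send`, the compiler rejects moving such a handle to another thread, which mirrors the restriction the README describes.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a shared, reference-counted buffer.
/// This is an illustration only, not tendril's actual layout.
#[derive(Clone)]
struct CowBuf {
    // Non-atomic reference count; `Rc` is not `Send`.
    data: Rc<Vec<u8>>,
}

impl CowBuf {
    fn new(bytes: &[u8]) -> Self {
        CowBuf { data: Rc::new(bytes.to_vec()) }
    }

    /// True when at least one other handle points at the same allocation.
    fn is_shared(&self) -> bool {
        Rc::strong_count(&self.data) > 1
    }

    /// Appends bytes. `Rc::make_mut` clones the buffer first only if it is
    /// shared; otherwise the mutation happens in place.
    fn push_slice(&mut self, bytes: &[u8]) {
        Rc::make_mut(&mut self.data).extend_from_slice(bytes);
    }

    fn as_bytes(&self) -> &[u8] {
        &self.data
    }
}

fn main() {
    let mut a = CowBuf::new(b"hello");
    let b = a.clone(); // bumps the refcount; no bytes are copied
    assert!(a.is_shared());

    a.push_slice(b" world"); // shared, so `a` gets its own copy first
    assert!(!a.is_shared());
    assert_eq!(a.as_bytes(), &b"hello world"[..]);
    assert_eq!(b.as_bytes(), &b"hello"[..]); // `b` still sees the old data

    a.push_slice(b"!"); // unshared now, so no further copy is made
    assert_eq!(a.as_bytes(), &b"hello world!"[..]);
}
```

The design choice to note is that sharing is cheap (a counter increment) and the cost of copying is deferred until a shared handle is actually mutated.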