diff options
Diffstat (limited to 'vendor/tendril/README.md')
-rw-r--r-- | vendor/tendril/README.md | 96 |
1 files changed, 96 insertions, 0 deletions
diff --git a/vendor/tendril/README.md b/vendor/tendril/README.md new file mode 100644 index 000000000..fced4b70d --- /dev/null +++ b/vendor/tendril/README.md @@ -0,0 +1,96 @@ +# tendril + +**Warning**: This library is at a very early stage of development, and it +contains a substantial amount of `unsafe` code. Use at your own risk! + +[![Build Status](https://github.com/servo/tendril/workflows/CI/badge.svg)](https://github.com/servo/tendril/actions) + +[API Documentation](https://doc.servo.org/tendril/index.html) + +## Introduction + +`Tendril` is a compact string/buffer type, optimized for zero-copy parsing. +Tendrils have the semantics of owned strings, but are sometimes views into +shared buffers. When you mutate a tendril, an owned copy is made if necessary. +Further mutations occur in-place until the string becomes shared, e.g. with +`clone()` or `subtendril()`. + +Buffer sharing is accomplished through thread-local (non-atomic) reference +counting, which has very low overhead. The Rust type system will prevent +you at compile time from sending a tendril between threads. (See below +for thoughts on relaxing this restriction.) + +Whereas `String` allocates in the heap for any non-empty string, `Tendril` can +store small strings (up to 8 bytes) in-line, without a heap allocation. +`Tendril` is also smaller than `String` on 64-bit platforms — 16 bytes versus +24. `Option<Tendril>` is the same size as `Tendril`, thanks to +[`NonZero`][NonZero]. + +The maximum length of a tendril is 4 GB. The library will panic if you attempt +to go over the limit. + +## Formats and encoding + +`Tendril` uses +[phantom types](https://doc.rust-lang.org/stable/rust-by-example/generics/phantom.html) +to track a buffer's format. This determines at compile time which +operations are available on a given tendril. For example, `Tendril<UTF8>` and +`Tendril<Bytes>` can be borrowed as `&str` and `&[u8]` respectively. + +`Tendril` also integrates with +[rust-encoding](https://github.com/lifthrasiir/rust-encoding) and has +preliminary support for [WTF-8][] buffers. + +## Plans for the future + +### Ropes + +[html5ever][] will use `Tendril` as a zero-copy text representation. It would +be good to preserve this all the way through to Servo's DOM. This would reduce +memory consumption, and possibly speed up text shaping and painting. However, +DOM text may conceivably be larger than 4 GB, and will anyway not be contiguous +in memory around e.g. a character entity reference. + +*Solution:* Build a **[rope][] on top of these strings** and use that as +Servo's representation of DOM text. We can perhaps do text shaping and/or +painting in parallel for different chunks of a rope. html5ever can additionally +use this rope type as a replacement for `BufferQueue`. + +Because the underlying buffers are reference-counted, the bulk of this rope +is already a [persistent data structure][]. Consider what happens when +appending two ropes to get a "new" rope. A vector-backed rope would copy a +vector of small structs, one for each chunk, and would bump the corresponding +refcounts. But it would not copy any of the string data. + +If we want more sharing, then a [2-3 finger tree][] could be a good choice. +We would probably stick with `VecDeque` for ropes under a certain size. + +### UTF-16 compatibility + +SpiderMonkey expects text to be in UCS-2 format for the most part. The +semantics of JavaScript strings are difficult to implement on UTF-8. This also +applies to HTML parsing via `document.write`. Also, passing SpiderMonkey a +string that isn't contiguous in memory will incur additional overhead and +complexity, if not a full copy. + +*Solution:* Use **WTF-8 in parsing** and in the DOM. Servo will **convert to +contiguous UTF-16 when necessary**. The conversion can easily be parallelized, +if we find a practical need to convert huge chunks of text all at once. + +### Source span information + +Some html5ever API consumers want to know the originating location in the HTML +source file(s) of each token or parse error. An example application would be a +command-line HTML validator with diagnostic output similar to `rustc`'s. + +*Solution:* Accept **some metadata along with each input string**. The type of +metadata is chosen by the API consumer; it defaults to `()`, which has size +zero. For any non-inline string, we can provide the associated metadata as well +as a byte offset. + +[NonZero]: https://doc.rust-lang.org/core/nonzero/struct.NonZero.html +[html5ever]: https://github.com/servo/html5ever +[WTF-8]: https://simonsapin.github.io/wtf-8/ +[rope]: https://en.wikipedia.org/wiki/Rope_%28data_structure%29 +[persistent data structure]: https://en.wikipedia.org/wiki/Persistent_data_structure +[2-3 finger tree]: https://www.staff.city.ac.uk/~ross/papers/FingerTree.html |