diff options
Diffstat (limited to 'third_party/rust/encoding_rs/Ideas.md')
-rw-r--r-- | third_party/rust/encoding_rs/Ideas.md | 106 |
1 files changed, 106 insertions, 0 deletions
diff --git a/third_party/rust/encoding_rs/Ideas.md b/third_party/rust/encoding_rs/Ideas.md new file mode 100644 index 0000000000..e6fccdda14 --- /dev/null +++ b/third_party/rust/encoding_rs/Ideas.md @@ -0,0 +1,106 @@ +This document contains notes about various ideas that for one reason or another +are not being actively pursued. + +## Next byte is non-ASCII after ASCII optimization + +The current plan for a SIMD-accelerated inner loop for handling ASCII bytes +makes no use of the bit of information that if the buffers didn't end but the +ASCII loop exited, the next byte will not be an ASCII byte. + +## Handling ASCII with table lookups when decoding single-byte to UTF-16 + +Both uconv and ICU outperform encoding_rs when decoding single-byte to UTF-16. +unconv doesn't even do anything fancy to manually unroll the loop (see below). +Both handle even the ASCII range using table lookup. That is, there's no branch +for checking if we're in the lower or upper half of the encoding. + +However, adding SIMD acceleration for the ASCII half will likely be a bigger +win than eliminating the branch to decide ASCII vs. non-ASCII. + +## Manual loop unrolling for single-byte encodings + +ICU currently outperforms encoding_rs (by over x2!) when decoding a single-byte +encoding to UTF-16. This appears to be thanks to manually unrolling the +conversion loop by 16. See [ucnv_MBCSSingleToBMPWithOffsets][1]. + +[1]: https://ssl.icu-project.org/repos/icu/icu/tags/release-55-1/source/common/ucnvmbcs.cpp + +Notably, none of the single-byte encodings have bytes that'd decode to the +upper half of BMP. Therefore, if the unmappable marker has the highest bit set +instead of being zero, the check for unmappables within a 16-character stride +can be done either by ORing the BMP characters in the stride together and +checking the high bit or by loading the upper halves of the BMP charaters +in a `u8x8` register and checking the high bits using the `_mm_movemask_epi8` +/ `pmovmskb` SSE2 instruction. + +## After non-ASCII, handle ASCII punctuation without SIMD + +Since the failure mode of SIMD ASCII acceleration involves wasted aligment +checks and a wasted SIMD read when the next code unit is non-ASCII and non-Latin +scripts have runs of non-ASCII even if ASCII spaces and punctuation is used, +consider handling the next two or three bytes following non-ASCII as non-SIMD +before looping back to the SIMD mode. Maybe move back to SIMD ASCII faster if +there's ASCII that's not space or punctuation. Maybe with the "space or +punctuation" check in place, this code can be allowed to be in place even for +UTF-8 and Latin single-byte (i.e. not having different code for Latin and +non-Latin single-byte). + +## Prefer maintaining aligment + +Instead of returning to acceleration directly after non-ASCII, consider +continuing to the alignment boundary without acceleration. + +## Read from SIMD lanes instead of RAM (cache) when ASCII check fails + +When the SIMD ASCII check fails, the data has already been read from memory. +Test whether it's faster to read the data by lane from the SIMD register than +to read it again from RAM (cache). + +## Use Level 2 Hanzi and Level 2 Kanji ordering + +These two are ordered by radical and then by stroke count, so in principle, +they should be mostly Unicode-ordered, although at least Level 2 Hanzi isn't +fully Unicode-ordered. Is "mostly" good enough for encode accelelation? + +## Create a `divmod_94()` function + +Experiment with a function that computes `(i / 94, i % 94)` more efficiently +than generic code. + +## Align writes on Aarch64 + +On [Cortex-A57](https://stackoverflow.com/questions/45714535/performance-of-unaligned-simd-load-store-on-aarch64/45938112#45938112 +), it might be a good idea to move the destination into 16-byte alignment. + +## Unalign UTF-8 validation on Aarch64 + +Currently, Aarch64 runs the generic ALU UTF-8 validation code that aligns +reads. That's probably unnecessary on Aarch64. (SIMD was slower than ALU!) + +## Table-driven UTF-8 validation + +When there are at least four bytes left, read all four. With each byte +index into tables corresponding to magic values indexable by byte in +each position. + +In the value read from the table indexed by lead byte, encode the +following in 16 bits: advance 2 bits (2, 3 or 4 bytes), 9 positional +bits one of which is set to indicate the type of lead byte (8 valid +types, in the 8 lowest bits, and invalid, ASCII would be tenth type), +and the mask for extracting the payload bits from the lead byte +(for conversion to UTF-16 or UTF-32). + +In the tables indexable by the trail bytes, in each positions +corresponding byte the lead byte type, store 1 if the trail is +invalid given the lead and 0 if valid given the lead. + +Use the low 8 bits of the of the 16 bits read from the first +table to mask (bitwise AND) one positional bit from each of the +three other values. Bitwise OR the results together with the +bit that is 1 if the lead is invalid. If the result is zero, +the sequence is valid. Otherwise it's invalid. + +Use the advance to advance. In the conversion to UTF-16 or +UTF-32 case, use the mast for extracting the meaningful +bits from the lead byte to mask them from the lead. Shift +left by 6 as many times as the advance indicates, etc. |