1 files changed, 106 insertions, 0 deletions
diff --git a/vendor/encoding_rs/Ideas.md b/vendor/encoding_rs/Ideas.md
new file mode 100644
index 000000000..e6fccdda1
--- /dev/null
+++ b/vendor/encoding_rs/Ideas.md
@@ -0,0 +1,106 @@
+This document contains notes about various ideas that for one reason or another
+are not being actively pursued.
+
+## Next byte is non-ASCII after ASCII optimization
+
+The current plan for a SIMD-accelerated inner loop for handling ASCII bytes
+makes no use of the bit of information that if the buffers didn't end but the
+ASCII loop exited, the next byte will not be an ASCII byte.
+
+## Handling ASCII with table lookups when decoding single-byte to UTF-16
+
+Both uconv and ICU outperform encoding_rs when decoding single-byte to UTF-16.
+uconv doesn't even do anything fancy to manually unroll the loop (see below).
+Both handle even the ASCII range using table lookup. That is, there's no branch
+for checking if we're in the lower or upper half of the encoding.
+
+However, adding SIMD acceleration for the ASCII half will likely be a bigger
+win than eliminating the branch to decide ASCII vs. non-ASCII.
+
+## Manual loop unrolling for single-byte encodings
+
+ICU currently outperforms encoding_rs (by over 2x!) when decoding a single-byte
+encoding to UTF-16. This appears to be thanks to manually unrolling the
+conversion loop by 16. See [ucnv_MBCSSingleToBMPWithOffsets][1].
+
+[1]: https://ssl.icu-project.org/repos/icu/icu/tags/release-55-1/source/common/ucnvmbcs.cpp
+
+Notably, none of the single-byte encodings have bytes that'd decode to the
+upper half of BMP. Therefore, if the unmappable marker has the highest bit set
+instead of being zero, the check for unmappables within a 16-character stride
+can be done either by ORing the BMP characters in the stride together and
+checking the high bit or by loading the upper halves of the BMP characters
+in a `u8x16` register and checking the high bits using the `_mm_movemask_epi8`
+/ `pmovmskb` SSE2 instruction.
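+
+A minimal sketch of the ORing variant, assuming (as stated above) that every
+mapped value stays below U+8000. The `UNMAPPABLE` constant, the function name
+and the table layout are hypothetical and only serve this illustration; the
+loop is left for the compiler to unroll.
+
+```rust
+/// Hypothetical marker for unmappable bytes; any value with bit 15 set works.
+const UNMAPPABLE: u16 = 0xFFFF;
+
+/// Decodes one 16-byte stride through `table` and returns `true` if every
+/// byte was mappable. The check happens once per stride, not once per byte.
+fn decode_stride(table: &[u16; 256], src: &[u8; 16], dst: &mut [u16; 16]) -> bool {
+    let mut seen = 0u16;
+    for (d, &s) in dst.iter_mut().zip(src.iter()) {
+        let unit = table[s as usize];
+        *d = unit;
+        // Accumulate all code units; bit 15 survives only if a marker was seen.
+        seen |= unit;
+    }
+    seen & 0x8000 == 0
+}
+```
+
+On a `false` return, the caller would presumably rescan the stride to find the
+first unmappable byte, since the accumulated value doesn't say where it was.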
+
+## After non-ASCII, handle ASCII punctuation without SIMD
+
+The failure mode of SIMD ASCII acceleration is wasted alignment checks and a
+wasted SIMD read when the next code unit is non-ASCII. Text in non-Latin
+scripts has runs of non-ASCII even when ASCII spaces and punctuation are used.
+Therefore, consider handling the next two or three bytes following non-ASCII
+without SIMD before looping back to the SIMD mode. Maybe move back to SIMD
+ASCII faster if there's ASCII that's not space or punctuation. Maybe with the
+"space or punctuation" check in place, this code can be allowed to be in place
+even for UTF-8 and Latin single-byte (i.e. not having different code for Latin
+and non-Latin single-byte).
+
+## Prefer maintaining alignment
+
+Instead of returning to acceleration directly after non-ASCII, consider
+continuing to the alignment boundary without acceleration.
+
+## Read from SIMD lanes instead of RAM (cache) when ASCII check fails
+
+When the SIMD ASCII check fails, the data has already been read from memory.
+Test whether it's faster to read the data by lane from the SIMD register than
+to read it again from RAM (cache).
+
+## Use Level 2 Hanzi and Level 2 Kanji ordering
+
+These two are ordered by radical and then by stroke count, so in principle,
+they should be mostly Unicode-ordered, although at least Level 2 Hanzi isn't
+fully Unicode-ordered. Is "mostly" good enough for encode acceleration?
+
+## Create a `divmod_94()` function
+
+Experiment with a function that computes `(i / 94, i % 94)` more efficiently
+than generic code.
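+
+A minimal sketch of one way to experiment with this, assuming the index fits
+in 16 bits (the `u16` parameter type and the constants are assumptions of this
+sketch, not existing encoding_rs code). It replaces the division with a
+multiplication by a rounded-up reciprocal of 94 and a shift. Compilers
+typically apply a similar strength reduction to division by a constant
+already, so whether this wins anything would have to be measured.
+
+```rust
+/// Computes `(i / 94, i % 94)` without a division instruction.
+///
+/// 89241 is ceil(2^23 / 94). Since 89241 * 94 - 2^23 = 46 and
+/// 46 * 65535 < 2^23, the rounding error never reaches the next quotient
+/// for any 16-bit input.
+fn divmod_94(i: u16) -> (u16, u16) {
+    let quotient = ((i as u64 * 89241) >> 23) as u16;
+    // quotient * 94 <= i, so neither operation can overflow.
+    let remainder = i - quotient * 94;
+    (quotient, remainder)
+}
+```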
+
+## Align writes on Aarch64
+
+On [Cortex-A57](https://stackoverflow.com/questions/45714535/performance-of-unaligned-simd-load-store-on-aarch64/45938112#45938112),
+it might be a good idea to move the destination into 16-byte alignment.
+
+## Unalign UTF-8 validation on Aarch64
+
+Currently, Aarch64 runs the generic ALU UTF-8 validation code that aligns
+reads. That's probably unnecessary on Aarch64. (SIMD was slower than ALU!)
+
+## Table-driven UTF-8 validation
+
+When there are at least four bytes left, read all four. Use each byte to
+index into the table of magic values for that byte's position.
+
+In the value read from the table indexed by lead byte, encode the
+following in 16 bits: the advance in 2 bits (2, 3 or 4 bytes), 9 positional
+bits, one of which is set to indicate the type of lead byte (the 8 valid
+types in the 8 lowest bits, plus invalid; ASCII would be a tenth type),
+and the mask for extracting the payload bits from the lead byte
+(for conversion to UTF-16 or UTF-32).
+
+In the tables indexable by the trail bytes, in the bit position
+corresponding to each lead byte type, store 1 if the trail is
+invalid given the lead and 0 if valid given the lead.
+
+Use the low 8 bits of the 16 bits read from the first
+table to mask (bitwise AND) one positional bit from each of the
+three other values. Bitwise OR the results together with the
+bit that is 1 if the lead is invalid. If the result is zero,
+the sequence is valid. Otherwise it's invalid.
+
+Use the advance to advance. In the conversion to UTF-16 or
+UTF-32 case, use the mask for extracting the meaningful
+bits from the lead byte to mask them from the lead. Shift
+left by 6 as many times as the advance indicates, etc.
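+
+The scheme above could be sketched roughly as follows. This is only an
+illustration under assumptions: the bit layout (type bits 0 to 7, invalid flag
+in bit 8, advance minus one in bits 9 to 10, payload mask in bits 11 to 15),
+the split into exactly these eight lead types, and all names are choices made
+for this sketch. ASCII leads are treated as invalid here on the assumption
+that a separate path handles them.
+
+```rust
+/// Bit 8 of a lead entry: the byte cannot start a multi-byte sequence.
+const INVALID_LEAD: u16 = 1 << 8;
+
+/// (first lead, last lead, length, lowest and highest valid second byte)
+const LEAD_TYPES: [(u8, u8, u16, u8, u8); 8] = [
+    (0xC2, 0xDF, 2, 0x80, 0xBF),
+    (0xE0, 0xE0, 3, 0xA0, 0xBF),
+    (0xE1, 0xEC, 3, 0x80, 0xBF),
+    (0xED, 0xED, 3, 0x80, 0x9F),
+    (0xEE, 0xEF, 3, 0x80, 0xBF),
+    (0xF0, 0xF0, 4, 0x90, 0xBF),
+    (0xF1, 0xF3, 4, 0x80, 0xBF),
+    (0xF4, 0xF4, 4, 0x80, 0x8F),
+];
+
+struct Tables {
+    lead: [u16; 256],
+    /// One table each for the second, third and fourth byte.
+    trail: [[u8; 256]; 3],
+}
+
+fn build_tables() -> Tables {
+    let mut t = Tables { lead: [INVALID_LEAD; 256], trail: [[0u8; 256]; 3] };
+    for (ty, &(first, last, len, lo, hi)) in LEAD_TYPES.iter().enumerate() {
+        let type_bit = 1u8 << ty;
+        // 0x1F, 0x0F or 0x07 for sequence lengths 2, 3 and 4.
+        let payload_mask: u16 = 0x7F >> len;
+        for lead in first..=last {
+            t.lead[lead as usize] =
+                (type_bit as u16) | ((len - 1) << 9) | (payload_mask << 11);
+        }
+        for b in 0..=255u8 {
+            let i = b as usize;
+            let continuation = (0x80..=0xBF).contains(&b);
+            // A set bit means "this trail is invalid after this lead type".
+            if b < lo || b > hi { t.trail[0][i] |= type_bit; }
+            if len >= 3 && !continuation { t.trail[1][i] |= type_bit; }
+            if len == 4 && !continuation { t.trail[2][i] |= type_bit; }
+        }
+    }
+    t
+}
+
+/// Returns the advance if `b` starts with a valid non-ASCII sequence.
+fn check4(t: &Tables, b: [u8; 4]) -> Option<usize> {
+    let lead = t.lead[b[0] as usize];
+    let type_bits = lead as u8; // one-hot lead type, or zero if invalid
+    let bad = u16::from(type_bits & t.trail[0][b[1] as usize])
+        | u16::from(type_bits & t.trail[1][b[2] as usize])
+        | u16::from(type_bits & t.trail[2][b[3] as usize])
+        | (lead & INVALID_LEAD);
+    if bad == 0 { Some(((lead >> 9) & 3) as usize + 1) } else { None }
+}
+
+/// Validates and decodes one scalar value, returning it with the advance.
+fn decode4(t: &Tables, b: [u8; 4]) -> Option<(u32, usize)> {
+    let len = check4(t, b)?;
+    let mask = (t.lead[b[0] as usize] >> 11) as u8;
+    let mut scalar = (b[0] & mask) as u32;
+    for &trail in &b[1..len] {
+        scalar = (scalar << 6) | (trail & 0x3F) as u32;
+    }
+    Some((scalar, len))
+}
+```
+
+Bytes past the end of a 2-byte or 3-byte sequence are still looked up, but
+their table entries carry no bit for those lead types, so they cannot make a
+valid sequence fail.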