This document contains notes about various ideas that, for one reason or another, are not being actively pursued.

## Next byte is non-ASCII after ASCII optimization

The current plan for a SIMD-accelerated inner loop for handling ASCII bytes makes no use of the fact that if the ASCII loop exited before the buffers ended, the next byte is not an ASCII byte.

## Handling ASCII with table lookups when decoding single-byte to UTF-16

Both uconv and ICU outperform encoding_rs when decoding single-byte to UTF-16. uconv doesn't even do anything fancy to manually unroll the loop (see below). Both handle even the ASCII range using table lookup. That is, there's no branch for checking if we're in the lower or upper half of the encoding.

However, adding SIMD acceleration for the ASCII half will likely be a bigger win than eliminating the branch to decide ASCII vs. non-ASCII.

## Manual loop unrolling for single-byte encodings

ICU currently outperforms encoding_rs (by over 2x!) when decoding a single-byte encoding to UTF-16. This appears to be thanks to manually unrolling the conversion loop by 16. See [ucnv_MBCSSingleToBMPWithOffsets][1].

[1]: https://ssl.icu-project.org/repos/icu/icu/tags/release-55-1/source/common/ucnvmbcs.cpp

Notably, none of the single-byte encodings have bytes that'd decode to the upper half of the BMP. Therefore, if the unmappable marker has the highest bit set instead of being zero, the check for unmappables within a 16-character stride can be done in either of two ways: by ORing the BMP characters in the stride together and checking the high bit, or by loading the upper halves of the BMP characters into a `u8x8` register and checking the high bits using the `_mm_movemask_epi8` / `pmovmskb` SSE2 instruction.

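
The OR-based variant of the unmappable check could look roughly like the following. This is a minimal scalar sketch, not encoding_rs or ICU code: the `UNMAPPABLE` value, the `toy_table()` mapping and the `decode_strides()` function are hypothetical names invented for illustration, and the only property relied on is that every mappable entry has its high bit clear.

```rust
/// Assumed marker for unmappable bytes: any table entry with the high bit set.
const UNMAPPABLE: u16 = 0xFFFF;
/// Stride length taken from the 16-way unrolling discussed above.
const STRIDE: usize = 16;

/// Builds a toy 256-entry decode table (hypothetical, not a real encoding):
/// every byte maps to the code unit of the same value, except 0x81,
/// which is treated as unmappable.
fn toy_table() -> [u16; 256] {
    let mut table = [0u16; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        *entry = i as u16;
    }
    table[0x81] = UNMAPPABLE;
    table
}

/// Decodes as many whole strides as fit in both buffers.
/// Returns `Ok(n)` with the number of bytes decoded, or `Err(start)` with the
/// offset of the first stride that contains at least one unmappable byte.
fn decode_strides(table: &[u16; 256], src: &[u8], dst: &mut [u16]) -> Result<usize, usize> {
    let mut offset = 0;
    while src.len() - offset >= STRIDE && dst.len() - offset >= STRIDE {
        // OR all code units of the stride together; no branch per byte.
        let mut acc = 0u16;
        for i in 0..STRIDE {
            let unit = table[src[offset + i] as usize];
            dst[offset + i] = unit;
            acc |= unit;
        }
        // One check per stride: the high bit is set only if some entry
        // in the stride was the unmappable marker.
        if acc & 0x8000 != 0 {
            return Err(offset);
        }
        offset += STRIDE;
    }
    Ok(offset)
}
```

On `Err(start)`, a caller would presumably re-scan the failing stride byte by byte to locate the exact unmappable byte, so the slow path runs only when a stride actually contains one.
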
## After non-ASCII, handle ASCII punctuation without SIMD

The failure mode of SIMD ASCII acceleration involves wasted alignment checks and a wasted SIMD read when the next code unit is non-ASCII. Non-Latin scripts have runs of non-ASCII even if ASCII spaces and punctuation are used. Therefore, consider handling the next two or three bytes following non-ASCII as non-SIMD before looping back to the SIMD mode. Maybe move back to SIMD ASCII faster if there's ASCII that's not space or punctuation. Maybe with the "space or punctuation" check in place, this code can be allowed to be in place even for UTF-8 and Latin single-byte (i.e. not having different code for Latin and non-Latin single-byte).

## Prefer maintaining alignment

Instead of returning to acceleration directly after non-ASCII, consider continuing to the alignment boundary without acceleration.

## Read from SIMD lanes instead of RAM (cache) when ASCII check fails

When the SIMD ASCII check fails, the data has already been read from memory. Test whether it's faster to read the data by lane from the SIMD register than to read it again from RAM (cache).

## Use Level 2 Hanzi and Level 2 Kanji ordering

These two are ordered by radical and then by stroke count, so in principle, they should be mostly Unicode-ordered, although at least Level 2 Hanzi isn't fully Unicode-ordered. Is "mostly" good enough for encode acceleration?

## Create a `divmod_94()` function

Experiment with a function that computes `(i / 94, i % 94)` more efficiently than generic code.

## Align writes on Aarch64

On [Cortex-A57](https://stackoverflow.com/questions/45714535/performance-of-unaligned-simd-load-store-on-aarch64/45938112#45938112), it might be a good idea to move the destination into 16-byte alignment.

## Unalign UTF-8 validation on Aarch64

Currently, Aarch64 runs the generic ALU UTF-8 validation code that aligns reads. That's probably unnecessary on Aarch64. (SIMD was slower than ALU!)

## Table-driven UTF-8 validation

When there are at least four bytes left, read all four. Use each byte to index into a table of magic values for that byte position.

In the value read from the table indexed by the lead byte, encode the following in 16 bits:

* the advance in 2 bits (2, 3 or 4 bytes),
* 9 positional bits, one of which is set to indicate the type of lead byte (8 valid types in the 8 lowest bits, plus invalid; ASCII would be a tenth type), and
* the mask for extracting the payload bits from the lead byte (for conversion to UTF-16 or UTF-32).

In the tables indexed by the trail bytes, in the bit position corresponding to each lead byte type, store 1 if the trail is invalid given the lead and 0 if it is valid given the lead.

Use the low 8 bits of the 16 bits read from the first table to mask (bitwise AND) one positional bit from each of the three other values. Bitwise OR the results together with the bit that is 1 if the lead is invalid. If the result is zero, the sequence is valid. Otherwise, it's invalid. Use the advance to advance.

In the conversion to UTF-16 or UTF-32 case, use the mask for extracting the meaningful bits from the lead byte to mask them from the lead. Shift left by 6 as many times as the advance indicates, etc.
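
One way to make the scheme above concrete is sketched below. It is an illustration under several assumptions rather than a worked-out design: the partition into seven lead types (instead of eight), the bit layout of the lead entry, the tables being built at run time, and all names (`Tables`, `lead_entry`, `second_entry`, `build_tables`, `check_sequence`) are choices made for this sketch. ASCII is assumed to be handled elsewhere, so an ASCII byte in the lead position is simply reported as invalid here.

```rust
// Assumed one-hot lead types; the exact partition is this sketch's choice.
const T_2: u16 = 1 << 0; // C2..=DF
const T_E0: u16 = 1 << 1; // E0 (second byte must be A0..=BF)
const T_3: u16 = 1 << 2; // E1..=EC, EE..=EF
const T_ED: u16 = 1 << 3; // ED (second byte must be 80..=9F)
const T_F0: u16 = 1 << 4; // F0 (second byte must be 90..=BF)
const T_4: u16 = 1 << 5; // F1..=F3
const T_F4: u16 = 1 << 6; // F4 (second byte must be 80..=8F)
const INVALID: u16 = 1 << 8;

struct Tables {
    lead: [u16; 256],
    second: [u8; 256],
    third: [u8; 256],
    fourth: [u8; 256],
}

// Assumed layout: bits 0..=7 lead type, bit 8 invalid, bits 9..=10 advance - 2,
// bits 11..=15 mask for the payload bits of the lead byte.
fn lead_entry(b: u8) -> u16 {
    let (ty, advance, mask): (u16, u16, u16) = match b {
        0xC2..=0xDF => (T_2, 2, 0x1F),
        0xE0 => (T_E0, 3, 0x0F),
        0xE1..=0xEC | 0xEE..=0xEF => (T_3, 3, 0x0F),
        0xED => (T_ED, 3, 0x0F),
        0xF0 => (T_F0, 4, 0x07),
        0xF1..=0xF3 => (T_4, 4, 0x07),
        0xF4 => (T_F4, 4, 0x07),
        _ => return INVALID, // ASCII, stray trails, C0, C1, F5..=FF
    };
    ty | ((advance - 2) << 9) | (mask << 11)
}

// A set bit means: this byte is invalid as the second byte after that lead type.
fn second_entry(b: u8) -> u8 {
    let mut bad = 0u16;
    if !(0x80..=0xBF).contains(&b) { bad |= T_2 | T_3 | T_4; }
    if !(0xA0..=0xBF).contains(&b) { bad |= T_E0; }
    if !(0x80..=0x9F).contains(&b) { bad |= T_ED; }
    if !(0x90..=0xBF).contains(&b) { bad |= T_F0; }
    if !(0x80..=0x8F).contains(&b) { bad |= T_F4; }
    bad as u8
}

fn build_tables() -> Tables {
    let mut t = Tables { lead: [0; 256], second: [0; 256], third: [0; 256], fourth: [0; 256] };
    // Lead types that require a third byte, and those that require a fourth.
    let three_plus = (T_E0 | T_3 | T_ED | T_F0 | T_4 | T_F4) as u8;
    let four = (T_F0 | T_4 | T_F4) as u8;
    for i in 0..256usize {
        let b = i as u8;
        let is_trail = (0x80..=0xBF).contains(&b);
        t.lead[i] = lead_entry(b);
        t.second[i] = second_entry(b);
        t.third[i] = if is_trail { 0 } else { three_plus };
        t.fourth[i] = if is_trail { 0 } else { four };
    }
    t
}

/// Checks the non-ASCII sequence starting at `bytes[0]`.
/// Returns the advance and the decoded scalar value if the sequence is valid.
fn check_sequence(t: &Tables, bytes: &[u8; 4]) -> Option<(usize, u32)> {
    let lead = t.lead[bytes[0] as usize];
    let ty = lead as u8; // low 8 bits: one-hot lead type, zero if invalid
    let bad = ((lead >> 8) as u8 & 1)
        | (t.second[bytes[1] as usize] & ty)
        | (t.third[bytes[2] as usize] & ty)
        | (t.fourth[bytes[3] as usize] & ty);
    if bad != 0 {
        return None;
    }
    let advance = 2 + ((lead >> 9) & 0b11) as usize;
    // Mask the payload bits out of the lead, then shift in 6 bits per trail.
    let mut scalar = u32::from(bytes[0]) & u32::from(lead >> 11);
    for &trail in &bytes[1..advance] {
        scalar = (scalar << 6) | u32::from(trail & 0x3F);
    }
    Some((advance, scalar))
}
```

Because the third and fourth tables store 0 in the bit positions of shorter lead types, the bytes following a two- or three-byte sequence never influence the verdict, which is what allows all four bytes to be read unconditionally.
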