2024-02-20 - Ring buffer v2 =========================== Goals: - improve the multi-thread performance of rings so that traces can be written from all threads in parallel without the huge bottleneck of the lock that is currently necessary to protect the buffer. This is important for mmapped areas that are left as a file when the process crashes. - keep traces synchronous within a given thread, i.e. when the TRACE() call returns, the trace is either written into the ring or lost due to slow readers. - try hard to limit the cache line bounces between threads due to the use of a shared work area. - make waiting threads not disturb working ones - continue to work on all supported platforms, with a particular focus on performance for modern platforms (memory ordering, DWCAS etc can be used if they provide any benefit), with a fallback for inferior platforms. - do not reorder traces within a given thread. - do not break existing features - do not significantly increase memory usage Analysis of the current situation ================================= Currently, there is a read lock around the call to __sink_write() in order to make sure that an attempt to write the number of lost messages is delivered with highest priority and is consistent with the lost counter. This doesn't seem to pose any problem at this point though if it were, it could possibly be revisited. __sink_write() calls ring_write() which first measures the input string length from the multiple segments, and locks the ring: - while trying to free space - while copying the message, due to the buffer's API Because of this, there is a huge serialization and threads wait in queue. Tests involving a split of the lock and a release around the message copy have shown a +60% performance increase, which is still not acceptable. First proposed approach ======================= The first approach would have consisted in writing messages in small parts: 1) write 0xFF in the tag to mean "size not filled yet" 2) write the message's length and write a zero tag after the message's location 3) replace the first tag to 0xFE to indicate the size is known, but the message is not filled yet. 4) memcpy() of the message to the area 5) replace the first tag to 0 to mark the entry as valid. It's worth noting that doing that without any lock will allow a second thread looping on the first tag to jump to the second tag after step 3. But the cost is high: in a 64-thread scenario where each of them wants to send one message, the work would look like this: - 64 threads try to CAS the tag. One gets it, 63 fail. They loop on the byte in question in read-only mode, waiting for the byte to change. This loop constantly forces the cache line to switch from MODIFIED to SHARED in the writer thread, and makes it a pain for it to write the message's length just after it. - once the first writer thread finally manages to write the length (step 2), it writes 0xFE on the tag to release the waiting threads, and starts with step 4. At this point, 63 threads try a CAS on the same entry, and this hammering further complicates the memcpy() of step 4 for the first 63 bytes of the message (well, 32 on avg since the tag is not necessarily aligned). One thread wins, 62 fail. All read the size field and jump to the next tag, waiting in read loops there. The second thread starts to write its size and faces the same difficulty as described above, facing 62 competitors when writing its size and the beginning of its message. - when the first writer thread writes the end of its message, it gets close to the final tag where the 62 waiting threads are still reading, causing a slow down again with the loss of exclusivity on the cache line. This is the same for the second thread etc. Thus, on average, a writing thread is hindered by N-1 threads at the beginning of its message area (in the first 32 bytes on avg) and by N-2 threads at the end of its area (in the last 32 bytes on avg). Given that messages are roughly 218 bytes on avg for HTTP/1, this means that roughly 1/3 of the message is written under severe cache contention. In addition to this, the buffer's tail needs to be updated once all threads are ready, something that adds the need for synchronization so that the last writing threads (the most likely to complete fast due to less perturbations) needs to wait for all previous ones. This also means N atomic writes to the tail. New proposal ============ In order to address the contention scenarios above, let's try to factor the work as much as possible. The principle is that threads that want to write will either do it themselves or declare their intent and wait for a writing thread to do it for them. This aims at ensuring a maximum usage of read-only data between threads, and to leave the work area read-write between very few threads, and exclusive for multiple messages at once, avoiding the bounces. First, the buffer will have 2 indexes: - head: where the valid data start - tail: where new data need to be appended When a thread starts to work, it will keep a copy of $tail and push it forward by as many bytes as needed to write all the messages it has to. In order to guarantee that neither the previous nor the new $tail point to an outdated or overwritten location but that there is always a tag there, $tail contains a lock bit in its highest bit that will guarantee that only one at a time will update it. The goal here is to perform as few atomic ops as possible in the contended path so as to later amortize the costs and make sure to limit the number of atomic ops on the wait path to the strict minimum so that waiting threads do not hinder the workers: Fast path: 1 load($tail) to check the topmost bit 1 CAS($tail,$tail|BIT63) to set the bit (atomic_fetch_or / atomic_bts also work) 1 store(1 byte tag=0xFF) at the beginning to mark the area busy 1 store($tail) to update the new value 1 copy of the whole message 1 store(1 byte tag=0) at the beginning to release the message Contented path: N load($tail) while waiting for the bit to be zero M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others 1 store(1 byte tag=0xFF) at the beginning to mark the area busy 1 store($tail) to update the new value 1 copy of the whole message 1 store(1 byte tag=0) at the beginning to release the message Queue ----- In order to limit the contention, writers will not start to write but will wait in a queue, announcing their message pointers/lengths and total lengths. The queue is made of a (ptr, len) pair that points to one such descriptor, located in the waiter thread's stack, that itself points to the next pair. In fact messages are ordered in a LIFO fashion but that isn't important since intra- thread ordering is preserved (and in the worst case it will also be possible to write them from end to beginning). The approach is the following: a writer loasd $tail and sees it's busy, there's no point continuing, it will add itself to the queue, announcing (ptr, len + next->len) so that by just reading the first entry, one knows the total size of the queue. And it will wait there as long as $tail has its topmost bit set and the queue points to itself (meaning it's the queue's leader), so that only one thread in the queue watches $tail, limiting the number of cache line bounces. If the queue doesn't point anymore to the current thread, it means another thread has taken it over so there's no point continuing, this thread just becomes passive. If the lock bit is dropped from $tail, the watching thread needs to re-check that it's still the queue's leader before trying to grab the lock, so that only the leading thread will attempt it. Indeed, a few of the last leading threads might still be looping, unaware that they're no longer leaders. A CAS(&queue, self, self) will do it. Upon failure, the thread just becomes a passive thread. Upon success, the thread is a confirmed leader, it must then try to grab the tail lock. Only this thread and a few potential newcomers will compete on this one. If the leading thread wins, it brings all the queue with it and the newcomers will queue again. If the leading thread loses, it needs to loop back to the point above, watching $tail and the queue. In this case a newcomer might have grabbed the lock. It will notice the non-empty queue and will take it with it. Thus in both cases the winner thread does a CAS(queue, queue, NULL) to reset the queue, keeping the previous pointer. At this point the winner thread considers its own message size plus the retrieved queue's size as the total required size and advances $tail by as much, and will iterate over all messages to copy them in turn. The passive threads are released by doing XCHG(&ptr->next, ptr) for each message, that is normally impossible otherwise. As such, a passive thread just has to loop over its own value, stored in its own stack, reading from its L1 cache in loops without any risk of disturbing others, hence no need for EBO. During the time it took to update $tail, more messages will have been accumulating in the queue from various other threads, and once $tail is written, one thread can pick them up again. The benefit here is that the longer it takes one thread to free some space, the more messages add up in the queue and the larger the next batch, so that there are always very few contenders on the ring area and on the tail index. At worst, the queue pointer is hammered but it's not on the fast path, since wasting time here means all waiters will be queued. Also, if we keep the first tag unchanged after it's set to 0xFF, it allows to avoid atomic ops inside all the message. Indeed there's no reader in the area as long as the tag is 0xFF, so we can just write all contents at once including the varints and subsequent message tags without ever using atomic ops, hence not forcing ordered writes. So maybe in the end there is some value in writing the messages backwards from end to beginning, and just writing the first tag atomically but not the rest. The scenario would look like this: (without queue) - before starting to work: do { while (ret=(load(&tail) & BIT63)) ; } while (!cas(&tail, &ret, ret | BIT63)); - at this point, alone on it and guaranteed not to change - after new size is calculated, write it and drop the lock: store(&tail, new_tail & ~BIT63); - that's sufficient to unlock other waiters. (with queue) in_queue = 0; do { ret = load(&tail); if (ret & BIT63) { if (!in_queue) { queue_this_node(); in_queue = 1; } while (ret & BIT63) ; } } while (!cas(&tail, &ret, ret | BIT63)); dequeue(in_queue) etc. Fast path: 1 load($tail) to check the topmost bit 1 CAS($tail,$tail|BIT63) to set the bit (atomic_fetch_or / atomic_bts also work) 1 load of the queue to see that it's empty 1 store(1 byte tag=0xFF) at the beginning to mark the area busy 1 store($tail) to update the new value 1 copy of the whole message 1 store(1 byte tag=0) at the beginning to release the message Contented path: 1 load($tail) to see the tail is changing M CAS(queue,queue,self) to try to add the thread to the queue (avgmax nbthr/2) N load($tail) while waiting for the lock bit to become zero 1 CAS(queue,self,self) to check the leader still is M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others 1 CAS(queue,queue,NULL) to reset the queue 1 store(1 byte tag=0xFF) at the beginning to mark the area busy 1 store($tail) to update the new value 1 copy of the whole message P copies of individual messages P stores of individual pointers to release writers 1 store(1 byte tag=0) at the beginning to release the message Optimal approach (later if needed?): multiple queues. Each thread has one queue assigned, either from a thread group, or using a modulo from the thread ID. Same as above then. Steps ----- It looks that the queue is what allows the process to scale by amortizing a single lock for every N messages, but that it's not a prerequisite to start, without a queue threads can just wait on $tail. Options ------- It is possible to avoid the extra check on CAS(queue,self,self) by forcing writers into the queue all the time. It would slow down the fast path but may improve the slow path, both of which would become the same: Contented path: 1 XCHG(queue,self) to try to add the thread to the queue N load($tail) while waiting for the lock bit to become zero M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others 1 CAS(queue,self,NULL) to reset the queue 1 store(1 byte tag=0xFF) at the beginning to mark the area busy 1 store($tail) to update the new value 1 copy of the whole message P copies of individual messages P stores of individual pointers to release writers 1 store(1 byte tag=0) at the beginning to release the message There seems to remain a race when resetting the queue, where a newcomer thread would queue itself while not being the leader. It seems it can be addressed by deciding that whoever gets the bit is not important, what matters is the thread that manages to reset the queue. This can then be done using another XCHG: 1 XCHG(queue,self) to try to add the thread to the queue N load($tail) while waiting for the lock bit to become zero M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others 1 XCHG(queue,NULL) to reset the queue 1 store(1 byte tag=0xFF) at the beginning to mark the area busy 1 store($tail) to update the new value 1 copy of the whole message P copies of individual messages P stores of individual pointers to release writers 1 store(1 byte tag=0) at the beginning to release the message However this time this can cause fragmentation of multiple sub-queues that will need to be reassembled. So finally the CAS is better, the leader thread should recognize itself. It seems tricky to reliably store the next pointer in each element, and a DWCAS wouldn't help here either. Maybe uninitialized elements should just have a special value (eg 0x1) for their next pointer, meaning "not initialized yet", and that the thread will then replace with the previous queue pointer. A reader would have to wait on this value when meeting it, knowing the pointer is not filled yet but is coming.