diff options
Diffstat (limited to 'doc/design-thoughts')
-rw-r--r-- | doc/design-thoughts/ring-v2.txt | 312 |
1 files changed, 312 insertions, 0 deletions
diff --git a/doc/design-thoughts/ring-v2.txt b/doc/design-thoughts/ring-v2.txt new file mode 100644 index 0000000..48c539a --- /dev/null +++ b/doc/design-thoughts/ring-v2.txt @@ -0,0 +1,312 @@ +2024-02-20 - Ring buffer v2 +=========================== + +Goals: + - improve the multi-thread performance of rings so that traces can be written + from all threads in parallel without the huge bottleneck of the lock that + is currently necessary to protect the buffer. This is important for mmapped + areas that are left as a file when the process crashes. + + - keep traces synchronous within a given thread, i.e. when the TRACE() call + returns, the trace is either written into the ring or lost due to slow + readers. + + - try hard to limit the cache line bounces between threads due to the use of + a shared work area. + + - make waiting threads not disturb working ones + + - continue to work on all supported platforms, with a particular focus on + performance for modern platforms (memory ordering, DWCAS etc can be used if + they provide any benefit), with a fallback for inferior platforms. + + - do not reorder traces within a given thread. + + - do not break existing features + + - do not significantly increase memory usage + + +Analysis of the current situation +================================= + +Currently, there is a read lock around the call to __sink_write() in order to +make sure that an attempt to write the number of lost messages is delivered +with highest priority and is consistent with the lost counter. This doesn't +seem to pose any problem at this point though if it were, it could possibly +be revisited. + +__sink_write() calls ring_write() which first measures the input string length +from the multiple segments, and locks the ring: + - while trying to free space + - while copying the message, due to the buffer's API + +Because of this, there is a huge serialization and threads wait in queue. Tests +involving a split of the lock and a release around the message copy have shown +a +60% performance increase, which is still not acceptable. + + +First proposed approach +======================= + +The first approach would have consisted in writing messages in small parts: + 1) write 0xFF in the tag to mean "size not filled yet" + 2) write the message's length and write a zero tag after the message's + location + 3) replace the first tag to 0xFE to indicate the size is known, but the + message is not filled yet. + 4) memcpy() of the message to the area + 5) replace the first tag to 0 to mark the entry as valid. + +It's worth noting that doing that without any lock will allow a second thread +looping on the first tag to jump to the second tag after step 3. But the cost +is high: in a 64-thread scenario where each of them wants to send one message, +the work would look like this: + - 64 threads try to CAS the tag. One gets it, 63 fail. They loop on the byte + in question in read-only mode, waiting for the byte to change. This loop + constantly forces the cache line to switch from MODIFIED to SHARED in the + writer thread, and makes it a pain for it to write the message's length + just after it. + + - once the first writer thread finally manages to write the length (step 2), + it writes 0xFE on the tag to release the waiting threads, and starts with + step 4. At this point, 63 threads try a CAS on the same entry, and this + hammering further complicates the memcpy() of step 4 for the first 63 bytes + of the message (well, 32 on avg since the tag is not necessarily aligned). + One thread wins, 62 fail. All read the size field and jump to the next tag, + waiting in read loops there. The second thread starts to write its size and + faces the same difficulty as described above, facing 62 competitors when + writing its size and the beginning of its message. + + - when the first writer thread writes the end of its message, it gets close + to the final tag where the 62 waiting threads are still reading, causing + a slow down again with the loss of exclusivity on the cache line. This is + the same for the second thread etc. + +Thus, on average, a writing thread is hindered by N-1 threads at the beginning +of its message area (in the first 32 bytes on avg) and by N-2 threads at the +end of its area (in the last 32 bytes on avg). Given that messages are roughly +218 bytes on avg for HTTP/1, this means that roughly 1/3 of the message is +written under severe cache contention. + +In addition to this, the buffer's tail needs to be updated once all threads are +ready, something that adds the need for synchronization so that the last writing +threads (the most likely to complete fast due to less perturbations) needs to +wait for all previous ones. This also means N atomic writes to the tail. + + +New proposal +============ + +In order to address the contention scenarios above, let's try to factor the +work as much as possible. The principle is that threads that want to write will +either do it themselves or declare their intent and wait for a writing thread +to do it for them. This aims at ensuring a maximum usage of read-only data +between threads, and to leave the work area read-write between very few +threads, and exclusive for multiple messages at once, avoiding the bounces. + +First, the buffer will have 2 indexes: + - head: where the valid data start + - tail: where new data need to be appended + +When a thread starts to work, it will keep a copy of $tail and push it forward +by as many bytes as needed to write all the messages it has to. In order to +guarantee that neither the previous nor the new $tail point to an outdated or +overwritten location but that there is always a tag there, $tail contains a +lock bit in its highest bit that will guarantee that only one at a time will +update it. The goal here is to perform as few atomic ops as possible in the +contended path so as to later amortize the costs and make sure to limit the +number of atomic ops on the wait path to the strict minimum so that waiting +threads do not hinder the workers: + + Fast path: + 1 load($tail) to check the topmost bit + 1 CAS($tail,$tail|BIT63) to set the bit (atomic_fetch_or / atomic_bts also work) + 1 store(1 byte tag=0xFF) at the beginning to mark the area busy + 1 store($tail) to update the new value + 1 copy of the whole message + 1 store(1 byte tag=0) at the beginning to release the message + + Contented path: + N load($tail) while waiting for the bit to be zero + M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others + 1 store(1 byte tag=0xFF) at the beginning to mark the area busy + 1 store($tail) to update the new value + 1 copy of the whole message + 1 store(1 byte tag=0) at the beginning to release the message + +Queue +----- + +In order to limit the contention, writers will not start to write but will wait +in a queue, announcing their message pointers/lengths and total lengths. The +queue is made of a (ptr, len) pair that points to one such descriptor, located +in the waiter thread's stack, that itself points to the next pair. In fact +messages are ordered in a LIFO fashion but that isn't important since intra- +thread ordering is preserved (and in the worst case it will also be possible +to write them from end to beginning). + +The approach is the following: a writer loasd $tail and sees it's busy, there's +no point continuing, it will add itself to the queue, announcing (ptr, len + +next->len) so that by just reading the first entry, one knows the total size +of the queue. And it will wait there as long as $tail has its topmost bit set +and the queue points to itself (meaning it's the queue's leader), so that only +one thread in the queue watches $tail, limiting the number of cache line +bounces. If the queue doesn't point anymore to the current thread, it means +another thread has taken it over so there's no point continuing, this thread +just becomes passive. If the lock bit is dropped from $tail, the watching +thread needs to re-check that it's still the queue's leader before trying to +grab the lock, so that only the leading thread will attempt it. Indeed, a few +of the last leading threads might still be looping, unaware that they're no +longer leaders. A CAS(&queue, self, self) will do it. Upon failure, the thread +just becomes a passive thread. Upon success, the thread is a confirmed leader, +it must then try to grab the tail lock. Only this thread and a few potential +newcomers will compete on this one. If the leading thread wins, it brings all +the queue with it and the newcomers will queue again. If the leading thread +loses, it needs to loop back to the point above, watching $tail and the +queue. In this case a newcomer might have grabbed the lock. It will notice +the non-empty queue and will take it with it. Thus in both cases the winner +thread does a CAS(queue, queue, NULL) to reset the queue, keeping the previous +pointer. + +At this point the winner thread considers its own message size plus the +retrieved queue's size as the total required size and advances $tail by as +much, and will iterate over all messages to copy them in turn. The passive +threads are released by doing XCHG(&ptr->next, ptr) for each message, that +is normally impossible otherwise. As such, a passive thread just has to +loop over its own value, stored in its own stack, reading from its L1 cache +in loops without any risk of disturbing others, hence no need for EBO. + +During the time it took to update $tail, more messages will have been +accumulating in the queue from various other threads, and once $tail is +written, one thread can pick them up again. + +The benefit here is that the longer it takes one thread to free some space, +the more messages add up in the queue and the larger the next batch, so that +there are always very few contenders on the ring area and on the tail index. +At worst, the queue pointer is hammered but it's not on the fast path, since +wasting time here means all waiters will be queued. + +Also, if we keep the first tag unchanged after it's set to 0xFF, it allows to +avoid atomic ops inside all the message. Indeed there's no reader in the area +as long as the tag is 0xFF, so we can just write all contents at once including +the varints and subsequent message tags without ever using atomic ops, hence +not forcing ordered writes. So maybe in the end there is some value in writing +the messages backwards from end to beginning, and just writing the first tag +atomically but not the rest. + +The scenario would look like this: + + (without queue) + + - before starting to work: + do { + while (ret=(load(&tail) & BIT63)) + ; + } while (!cas(&tail, &ret, ret | BIT63)); + + - at this point, alone on it and guaranteed not to change + - after new size is calculated, write it and drop the lock: + + store(&tail, new_tail & ~BIT63); + + - that's sufficient to unlock other waiters. + + (with queue) + + in_queue = 0; + do { + ret = load(&tail); + if (ret & BIT63) { + if (!in_queue) { + queue_this_node(); + in_queue = 1; + } + while (ret & BIT63) + ; + } + } while (!cas(&tail, &ret, ret | BIT63)); + + dequeue(in_queue) etc. + + Fast path: + 1 load($tail) to check the topmost bit + 1 CAS($tail,$tail|BIT63) to set the bit (atomic_fetch_or / atomic_bts also work) + 1 load of the queue to see that it's empty + 1 store(1 byte tag=0xFF) at the beginning to mark the area busy + 1 store($tail) to update the new value + 1 copy of the whole message + 1 store(1 byte tag=0) at the beginning to release the message + + Contented path: + 1 load($tail) to see the tail is changing + M CAS(queue,queue,self) to try to add the thread to the queue (avgmax nbthr/2) + N load($tail) while waiting for the lock bit to become zero + 1 CAS(queue,self,self) to check the leader still is + M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others + 1 CAS(queue,queue,NULL) to reset the queue + 1 store(1 byte tag=0xFF) at the beginning to mark the area busy + 1 store($tail) to update the new value + 1 copy of the whole message + P copies of individual messages + P stores of individual pointers to release writers + 1 store(1 byte tag=0) at the beginning to release the message + +Optimal approach (later if needed?): multiple queues. Each thread has one queue +assigned, either from a thread group, or using a modulo from the thread ID. +Same as above then. + + +Steps +----- + +It looks that the queue is what allows the process to scale by amortizing a +single lock for every N messages, but that it's not a prerequisite to start, +without a queue threads can just wait on $tail. + + +Options +------- + +It is possible to avoid the extra check on CAS(queue,self,self) by forcing +writers into the queue all the time. It would slow down the fast path but +may improve the slow path, both of which would become the same: + + Contented path: + 1 XCHG(queue,self) to try to add the thread to the queue + N load($tail) while waiting for the lock bit to become zero + M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others + 1 CAS(queue,self,NULL) to reset the queue + 1 store(1 byte tag=0xFF) at the beginning to mark the area busy + 1 store($tail) to update the new value + 1 copy of the whole message + P copies of individual messages + P stores of individual pointers to release writers + 1 store(1 byte tag=0) at the beginning to release the message + +There seems to remain a race when resetting the queue, where a newcomer thread +would queue itself while not being the leader. It seems it can be addressed by +deciding that whoever gets the bit is not important, what matters is the thread +that manages to reset the queue. This can then be done using another XCHG: + + 1 XCHG(queue,self) to try to add the thread to the queue + N load($tail) while waiting for the lock bit to become zero + M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others + 1 XCHG(queue,NULL) to reset the queue + 1 store(1 byte tag=0xFF) at the beginning to mark the area busy + 1 store($tail) to update the new value + 1 copy of the whole message + P copies of individual messages + P stores of individual pointers to release writers + 1 store(1 byte tag=0) at the beginning to release the message + +However this time this can cause fragmentation of multiple sub-queues that will +need to be reassembled. So finally the CAS is better, the leader thread should +recognize itself. + +It seems tricky to reliably store the next pointer in each element, and a DWCAS +wouldn't help here either. Maybe uninitialized elements should just have a +special value (eg 0x1) for their next pointer, meaning "not initialized yet", +and that the thread will then replace with the previous queue pointer. A reader +would have to wait on this value when meeting it, knowing the pointer is not +filled yet but is coming. |