2024-02-20 - Ring buffer v2
===========================

Goals:
  - improve the multi-thread performance of rings so that traces can be written
    from all threads in parallel without the huge bottleneck of the lock that
    is currently necessary to protect the buffer. This is important for mmapped
    areas that are left as a file when the process crashes.

  - keep traces synchronous within a given thread, i.e. when the TRACE() call
    returns, the trace is either written into the ring or lost due to slow
    readers.

  - try hard to limit the cache line bounces between threads due to the use of
    a shared work area.

  - make waiting threads not disturb working ones

  - continue to work on all supported platforms, with a particular focus on
    performance for modern platforms (memory ordering, DWCAS etc can be used if
    they provide any benefit), with a fallback for inferior platforms.

  - do not reorder traces within a given thread.

  - do not break existing features

  - do not significantly increase memory usage


Analysis of the current situation
=================================

Currently, there is a read lock around the call to __sink_write() in order to
make sure that an attempt to write the number of lost messages is delivered
with highest priority and is consistent with the lost counter. This doesn't
seem to pose any problem at this point though if it were, it could possibly
be revisited.

__sink_write() calls ring_write() which first measures the input string length
from the multiple segments, and locks the ring:
  - while trying to free space
  - while copying the message, due to the buffer's API

Because of this, there is a huge serialization and threads wait in queue. Tests
involving a split of the lock and a release around the message copy have shown
a +60% performance increase, which is still not acceptable.


First proposed approach
=======================

The first approach would have consisted in writing messages in small parts:
  1) write 0xFF in the tag to mean "size not filled yet"
  2) write the message's length and write a zero tag after the message's
     location
  3) replace the first tag to 0xFE to indicate the size is known, but the
     message is not filled yet.
  4) memcpy() of the message to the area
  5) replace the first tag to 0 to mark the entry as valid.

It's worth noting that doing that without any lock will allow a second thread
looping on the first tag to jump to the second tag after step 3. But the cost
is high: in a 64-thread scenario where each of them wants to send one message,
the work would look like this:
  - 64 threads try to CAS the tag. One gets it, 63 fail. They loop on the byte
    in question in read-only mode, waiting for the byte to change. This loop
    constantly forces the cache line to switch from MODIFIED to SHARED in the
    writer thread, and makes it a pain for it to write the message's length
    just after it.

  - once the first writer thread finally manages to write the length (step 2),
    it writes 0xFE on the tag to release the waiting threads, and starts with
    step 4. At this point, 63 threads try a CAS on the same entry, and this
    hammering further complicates the memcpy() of step 4 for the first 63 bytes
    of the message (well, 32 on avg since the tag is not necessarily aligned).
    One thread wins, 62 fail. All read the size field and jump to the next tag,
    waiting in read loops there. The second thread starts to write its size and
    faces the same difficulty as described above, facing 62 competitors when
    writing its size and the beginning of its message.

  - when the first writer thread writes the end of its message, it gets close
    to the final tag where the 62 waiting threads are still reading, causing
    a slow down again with the loss of exclusivity on the cache line. This is
    the same for the second thread etc.

Thus, on average, a writing thread is hindered by N-1 threads at the beginning
of its message area (in the first 32 bytes on avg) and by N-2 threads at the
end of its area (in the last 32 bytes on avg). Given that messages are roughly
218 bytes on avg for HTTP/1, this means that roughly 1/3 of the message is
written under severe cache contention.

In addition to this, the buffer's tail needs to be updated once all threads are
ready, something that adds the need for synchronization so that the last writing
threads (the most likely to complete fast due to less perturbations) needs to
wait for all previous ones. This also means N atomic writes to the tail.


New proposal
============

In order to address the contention scenarios above, let's try to factor the
work as much as possible. The principle is that threads that want to write will
either do it themselves or declare their intent and wait for a writing thread
to do it for them. This aims at ensuring a maximum usage of read-only data
between threads, and to leave the work area read-write between very few
threads, and exclusive for multiple messages at once, avoiding the bounces.

First, the buffer will have 2 indexes:
  - head: where the valid data start
  - tail: where new data need to be appended

When a thread starts to work, it will keep a copy of $tail and push it forward
by as many bytes as needed to write all the messages it has to. In order to
guarantee that neither the previous nor the new $tail point to an outdated or
overwritten location but that there is always a tag there, $tail contains a
lock bit in its highest bit that will guarantee that only one at a time will
update it. The goal here is to perform as few atomic ops as possible in the
contended path so as to later amortize the costs and make sure to limit the
number of atomic ops on the wait path to the strict minimum so that waiting
threads do not hinder the workers:

  Fast path:
    1 load($tail) to check the topmost bit
    1 CAS($tail,$tail|BIT63) to set the bit (atomic_fetch_or / atomic_bts also work)
    1 store(1 byte tag=0xFF) at the beginning to mark the area busy
    1 store($tail) to update the new value
    1 copy of the whole message
    1 store(1 byte tag=0) at the beginning to release the message

  Contented path:
    N load($tail) while waiting for the bit to be zero
    M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others
    1 store(1 byte tag=0xFF) at the beginning to mark the area busy
    1 store($tail) to update the new value
    1 copy of the whole message
    1 store(1 byte tag=0) at the beginning to release the message

Queue
-----

In order to limit the contention, writers will not start to write but will wait
in a queue, announcing their message pointers/lengths and total lengths. The
queue is made of a (ptr, len) pair that points to one such descriptor, located
in the waiter thread's stack, that itself points to the next pair. In fact
messages are ordered in a LIFO fashion but that isn't important since intra-
thread ordering is preserved (and in the worst case it will also be possible
to write them from end to beginning).

The approach is the following: a writer loasd $tail and sees it's busy, there's
no point continuing, it will add itself to the queue, announcing (ptr, len +
next->len) so that by just reading the first entry, one knows the total size
of the queue. And it will wait there as long as $tail has its topmost bit set
and the queue points to itself (meaning it's the queue's leader), so that only
one thread in the queue watches $tail, limiting the number of cache line
bounces. If the queue doesn't point anymore to the current thread, it means
another thread has taken it over so there's no point continuing, this thread
just becomes passive. If the lock bit is dropped from $tail, the watching
thread needs to re-check that it's still the queue's leader before trying to
grab the lock, so that only the leading thread will attempt it. Indeed, a few
of the last leading threads might still be looping, unaware that they're no
longer leaders. A CAS(&queue, self, self) will do it. Upon failure, the thread
just becomes a passive thread. Upon success, the thread is a confirmed leader,
it must then try to grab the tail lock. Only this thread and a few potential
newcomers will compete on this one. If the leading thread wins, it brings all
the queue with it and the newcomers will queue again. If the leading thread
loses, it needs to loop back to the point above, watching $tail and the
queue. In this case a newcomer might have grabbed the lock. It will notice
the non-empty queue and will take it with it. Thus in both cases the winner
thread does a CAS(queue, queue, NULL) to reset the queue, keeping the previous
pointer.

At this point the winner thread considers its own message size plus the
retrieved queue's size as the total required size and advances $tail by as
much, and will iterate over all messages to copy them in turn. The passive
threads are released by doing XCHG(&ptr->next, ptr) for each message, that
is normally impossible otherwise. As such, a passive thread just has to
loop over its own value, stored in its own stack, reading from its L1 cache
in loops without any risk of disturbing others, hence no need for EBO.

During the time it took to update $tail, more messages will have been
accumulating in the queue from various other threads, and once $tail is
written, one thread can pick them up again.

The benefit here is that the longer it takes one thread to free some space,
the more messages add up in the queue and the larger the next batch, so that
there are always very few contenders on the ring area and on the tail index.
At worst, the queue pointer is hammered but it's not on the fast path, since
wasting time here means all waiters will be queued.

Also, if we keep the first tag unchanged after it's set to 0xFF, it allows to
avoid atomic ops inside all the message. Indeed there's no reader in the area
as long as the tag is 0xFF, so we can just write all contents at once including
the varints and subsequent message tags without ever using atomic ops, hence
not forcing ordered writes. So maybe in the end there is some value in writing
the messages backwards from end to beginning, and just writing the first tag
atomically but not the rest.

The scenario would look like this:

  (without queue)

    - before starting to work:
        do {
            while (ret=(load(&tail) & BIT63))
                ;
        } while (!cas(&tail, &ret, ret | BIT63));

    - at this point, alone on it and guaranteed not to change
    - after new size is calculated, write it and drop the lock:

        store(&tail, new_tail & ~BIT63);

    - that's sufficient to unlock other waiters.

  (with queue)

      in_queue = 0;
      do {
         ret = load(&tail);
         if (ret & BIT63) {
             if (!in_queue) {
                 queue_this_node();
                 in_queue = 1;
             }
             while (ret & BIT63)
                 ;
         }
      } while (!cas(&tail, &ret, ret | BIT63));

      dequeue(in_queue) etc.

  Fast path:
    1 load($tail) to check the topmost bit
    1 CAS($tail,$tail|BIT63) to set the bit (atomic_fetch_or / atomic_bts also work)
    1 load of the queue to see that it's empty
    1 store(1 byte tag=0xFF) at the beginning to mark the area busy
    1 store($tail) to update the new value
    1 copy of the whole message
    1 store(1 byte tag=0) at the beginning to release the message

  Contented path:
    1 load($tail) to see the tail is changing
    M CAS(queue,queue,self) to try to add the thread to the queue (avgmax nbthr/2)
    N load($tail) while waiting for the lock bit to become zero
    1 CAS(queue,self,self) to check the leader still is
    M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others
    1 CAS(queue,queue,NULL) to reset the queue
    1 store(1 byte tag=0xFF) at the beginning to mark the area busy
    1 store($tail) to update the new value
    1 copy of the whole message
    P copies of individual messages
    P stores of individual pointers to release writers
    1 store(1 byte tag=0) at the beginning to release the message

Optimal approach (later if needed?): multiple queues. Each thread has one queue
assigned, either from a thread group, or using a modulo from the thread ID.
Same as above then.


Steps
-----

It looks that the queue is what allows the process to scale by amortizing a
single lock for every N messages, but that it's not a prerequisite to start,
without a queue threads can just wait on $tail.


Options
-------

It is possible to avoid the extra check on CAS(queue,self,self) by forcing
writers into the queue all the time. It would slow down the fast path but
may improve the slow path, both of which would become the same:

  Contented path:
    1 XCHG(queue,self) to try to add the thread to the queue
    N load($tail) while waiting for the lock bit to become zero
    M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others
    1 CAS(queue,self,NULL) to reset the queue
    1 store(1 byte tag=0xFF) at the beginning to mark the area busy
    1 store($tail) to update the new value
    1 copy of the whole message
    P copies of individual messages
    P stores of individual pointers to release writers
    1 store(1 byte tag=0) at the beginning to release the message

There seems to remain a race when resetting the queue, where a newcomer thread
would queue itself while not being the leader. It seems it can be addressed by
deciding that whoever gets the bit is not important, what matters is the thread
that manages to reset the queue. This can then be done using another XCHG:

    1 XCHG(queue,self) to try to add the thread to the queue
    N load($tail) while waiting for the lock bit to become zero
    M CAS($tail,$tail|BIT63) to try to set the bit on tail, competing with others
    1 XCHG(queue,NULL) to reset the queue
    1 store(1 byte tag=0xFF) at the beginning to mark the area busy
    1 store($tail) to update the new value
    1 copy of the whole message
    P copies of individual messages
    P stores of individual pointers to release writers
    1 store(1 byte tag=0) at the beginning to release the message

However this time this can cause fragmentation of multiple sub-queues that will
need to be reassembled. So finally the CAS is better, the leader thread should
recognize itself.

It seems tricky to reliably store the next pointer in each element, and a DWCAS
wouldn't help here either. Maybe uninitialized elements should just have a
special value (eg 0x1) for their next pointer, meaning "not initialized yet",
and that the thread will then replace with the previous queue pointer. A reader
would have to wait on this value when meeting it, knowing the pointer is not
filled yet but is coming.