Diffstat (limited to 'doc/design-thoughts/thread-group.txt')
-rw-r--r--   doc/design-thoughts/thread-group.txt   528
1 files changed, 528 insertions, 0 deletions
diff --git a/doc/design-thoughts/thread-group.txt b/doc/design-thoughts/thread-group.txt
new file mode 100644
index 0000000..f1aad7b
--- /dev/null
+++ b/doc/design-thoughts/thread-group.txt
@@ -0,0 +1,528 @@

Thread groups
#############

2021-07-13 - first draft
==========

Objective
---------
- support multi-socket systems with limited cache-line bouncing between
  physical CPUs and/or L3 caches

- overcome the 64-thread limitation

- support a reasonable number of groups. I.e. if modern CPUs arrive with
  core complexes made of 8 cores, with 8 CC per chip and 2 chips in a
  system, it makes sense to support 16 groups.


Non-objective
-------------
- no need to optimize to the last possible cycle. I.e. some algos like
  leastconn will remain shared across all threads, servers will keep a
  single queue, etc. Global information remains global.

- no stubborn enforcement of FD sharing. Per-server idle connection lists
  can become per-group; listeners can (and probably should) be per-group.
  Other mechanisms (like SO_REUSEADDR) can already overcome this.

- no need to go beyond 64 threads per group.


Identified tasks
================

General
-------
Everywhere tid_bit is used we absolutely need to find a complement using
either the current group or a specific one. Thread debugging will need to
be extended as masks are used extensively.


Scheduler
---------
The global run queue and global wait queue must become per-group. This
means that a task may only be queued into one of them at a time. It
sounds like tasks may only belong to a given group, but doing so would
bring back the original issue that it's impossible to perform remote
wake-ups.

We could probably ignore the group if we don't need to set the thread mask
in the task. The task's thread_mask is never manipulated using atomics, so
it's safe to complement it with a group.

The sleeping_thread_mask should become per-group. It is thus possible that
a wakeup may only be performed on the assigned group, meaning that either
a task is not assigned, in which case it will be self-assigned (like
today), otherwise the tg to be woken up will be retrieved from the task
itself.

Task creation currently takes a thread mask of either tid_bit, a specific
mask, or MAX_THREADS_MASK. How to create a task able to run anywhere
(checks, Lua, ...)?
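As a rough illustration, here is a minimal sketch of how a task's
thread_mask could be complemented with a group, and how a wakeup could
self-assign an unassigned task to the caller's group. All types and names
below are hypothetical, not the current code:

  #include <stdint.h>

  #define MAX_TGROUPS 16

  /* hypothetical per-group scheduler state (run queue, wait queue,
   * sleeping_thread_mask, ...), one instance per thread group
   */
  struct tgroup_sched {
      unsigned long sleeping_thread_mask;
      /* ... per-group run queue and wait queue ... */
  };

  /* hypothetical task affinity: a group ID plus a mask *within* that group */
  struct task_affinity {
      uint16_t      tgid;        /* 0 = not assigned yet */
      unsigned long thread_mask; /* per-group mask, not process-wide */
  };

  extern struct tgroup_sched tg_sched[MAX_TGROUPS];

  /* Pick the group a wakeup must target: the task's own group when it has
   * one, otherwise self-assign it to the caller's group (like today).
   */
  static inline struct tgroup_sched *
  wakeup_target(struct task_affinity *aff, uint16_t caller_tgid)
  {
      if (!aff->tgid)
          aff->tgid = caller_tgid;
      return &tg_sched[aff->tgid - 1];   /* groups numbered from 1 */
  }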
Profiling
---------
There should be one task_profiling_mask per thread group. Enabling or
disabling profiling should be made per group (possibly by iterating).

Thread isolation
----------------
Thread isolation is difficult as we solely rely on atomic ops to figure
out who can complete. Such an operation is rare, so maybe we could have a
global read_mostly flag containing a mask of the groups that require
isolation. Then the threads_want_rdv_mask etc. can become per-group.
However setting and clearing the bits will become problematic as this
will happen in two steps, hence will require careful ordering.

FD
--
tid_bit is used in a number of atomic ops on the running_mask. If we have
one fdtab[] per group, the mask implies that it's within the group.
Theoretically we should never face a situation where an FD is reported or
manipulated for a remote group.

There will still be one poller per thread, except that this time all
operations will be related to the current thread group. No FD may appear
in two thread groups at once, but we probably cannot prevent that (e.g.
delayed close and reopen). Should we instead have a single shared fdtab[]
(less memory usage also)? Maybe adding the group in the fdtab entry would
work, but when does a thread know it can leave it? Currently this is
solved by running_mask and by update_mask. Having two tables could help
with this (each table sees the FD in a different group with a different
mask) but this looks overkill.

There's polled_mask[] which needs to be decided upon. It should probably
be doubled as well. Note, polled_mask left fdtab[] for cacheline
alignment reasons in commit cb92f5cae4.

If we have one fdtab[] per group, what *really* prevents the same FD from
being used in multiple groups? _fd_delete_orphan() and fd_update_events()
need to check for no-thread usage before closing the FD. This could be
a limiting factor. Enabling could require waking every poller.

Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short,
used only during creation and close)?

Another problem: if we have one fdtab[] per TG, disabling/enabling an FD
(e.g. pause/resume on a listener) can become a problem if it's not
necessarily on the current TG. We'll then need a way to figure that one
out. It sounds like FDs from listeners and receivers are very specific
and suffer from problems all other ones under high load do not suffer
from. Maybe something specific ought to be done for them, if we can
guarantee there is no risk of accidental reuse (e.g. locate the TG info
in the receiver and have a "MT" bit in the FD's flags). The risk is
always that a close() can result in an instant pop-up of the same FD on
any other thread of the same process.

Observations: right now fdtab[].thread_mask more or less corresponds to a
declaration of interest; it's very close to meaning "active per thread".
It is in fact located in the FD while it ought to do nothing there, as it
should be where the FD is used, since it rules accesses to a shared
resource that is not the FD but what uses it. Indeed, if neither
polled_mask nor running_mask have a thread's bit, the FD is unknown to
that thread and the element using it may only be reached from above and
not from the FD. As such we ought to have a thread_mask on a listener and
another one on connections. These will indicate who uses them. A takeover
could then be simplified (atomically set exclusivity on the FD's
running_mask; upon success, take over the connection, then clear the
running mask). The change ought probably to be performed at the
connection level first, not the FD level, by the way. But running and
polled are the two relevant elements: one indicates userland knowledge,
the other one kernel knowledge. For listeners there's no exclusivity so
it's a bit different, but the rule remains the same: we don't have to
know what threads are *interested* in the FD, only its holder.

Not exact in fact, see the FD notes below.
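As a rough sketch of the simplified takeover mentioned in the observations
above (hypothetical structure and field names, done at the connection
level): exclusivity is grabbed atomically on a per-connection running
mask, and only then is the connection rebound:

  #include <stdatomic.h>

  /* hypothetical: per-connection running mask, one bit per thread of the group */
  struct xfer_conn {
      _Atomic unsigned long running_mask;
      /* ... owner, tasklet, subscriptions, ... */
  };

  /* Try to take over a connection: succeed only if nobody is currently
   * running on it, by atomically swapping 0 -> our bit. On success the
   * caller owns the connection exclusively, rebinds it, then releases.
   */
  static int conn_try_takeover(struct xfer_conn *c, unsigned long my_bit)
  {
      unsigned long expected = 0;

      if (!atomic_compare_exchange_strong(&c->running_mask, &expected, my_bit))
          return -1;                      /* someone is using it right now */

      /* ... rebind the connection to the local thread/group here ... */

      atomic_store(&c->running_mask, 0);  /* release exclusivity */
      return 0;
  }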
activity
--------
There should be one activity array per thread group. The dump should
simply scan them all since the accumulated values are not very important
anyway.

applets
-------
They use tid_bit only for the task. It looks like the appctx's
thread_mask is never used (now removed). Furthermore, it looks like the
argument is *always* tid_bit.

CPU binding
-----------
This is going to be tough. It will be needed to detect that threads
overlap and are not bound (i.e. all threads on the same mask). In this
case, if the number of threads is higher than the number of threads per
physical socket, one must try hard to spread them evenly among physical
sockets (e.g. one thread group per physical socket) and start as many
threads as needed on each, bound to all threads/cores of each socket. If
there is a single socket, the same job may be done based on L3 caches.
Maybe it could always be done based on L3 caches. The difficulty behind
this is the number of sockets to be bound: it is not possible to bind
several FDs per listener. Maybe with a new bind keyword we could imagine
automatically duplicating listeners? In any case, the initially bound
cpumap (via taskset) must always be respected, and everything should
probably start from there.

Frontend binding
----------------
We'll have to define a list of threads and thread groups per frontend.
Having a group mask and the same thread mask for each group would
probably suffice.

Threads should have two numbers:
  - the per-process number (e.g. 1..256)
  - the per-group number (1..64)

The "bind-thread" lines ought to use the following syntax:
  - bind 45       ## bind to process' thread 45
  - bind 1/45     ## bind to group 1's thread 45
  - bind all/45   ## bind to thread 45 in each group
  - bind 1/all    ## bind to all threads in group 1
  - bind all      ## bind to all threads
  - bind all/all  ## bind to all threads in all groups (=all)
  - bind 1/65     ## rejected
  - bind 65       ## OK if there are enough
  - bind 35-45    ## depends. Rejected if it crosses a group boundary.

The global directive "nbthread 28" means 28 total threads for the
process. The number of groups will sub-divide this. E.g. 4 groups will
very likely imply 7 threads per group. At the beginning, the nbgroup
should be manual since it implies config adjustments to bind lines.

There should be a trivial way to map a global thread to a group and local
ID and to do the opposite.
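For instance, with contiguous groups such a mapping could be as trivial
as the sketch below (hypothetical layout where each group owns a
contiguous range of global IDs; names are made up for the example):

  #include <stdint.h>

  #define MAX_TGROUPS 16

  /* hypothetical: each group knows its first global thread ID and its size */
  struct tgroup_info {
      uint16_t base;    /* global ID of the group's first thread (from 1) */
      uint16_t count;   /* number of threads in this group (<= 64) */
  };

  extern struct tgroup_info tgroups[MAX_TGROUPS];

  /* global thread ID -> (group, local ID); groups and threads numbered from 1 */
  static inline void thread_to_group(uint16_t global_id,
                                     uint16_t *grp, uint16_t *local)
  {
      uint16_t g;

      for (g = 0; g < MAX_TGROUPS && tgroups[g].count; g++) {
          if (global_id < tgroups[g].base + tgroups[g].count) {
              *grp   = g + 1;
              *local = global_id - tgroups[g].base + 1;
              return;
          }
      }
  }

  /* (group, local ID) -> global thread ID */
  static inline uint16_t group_to_thread(uint16_t grp, uint16_t local)
  {
      return tgroups[grp - 1].base + local - 1;
  }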
Panic handler + watchdog
------------------------
Will probably depend on what's done for thread_isolate.

Per-thread arrays inside structures
-----------------------------------
- listeners have a thr_conn[] array, currently limited to MAX_THREADS.
  Should we simply bump the limit?
- same for servers with idle connections.
  => doesn't seem very practical.
- another solution might be to point to dynamically allocated arrays of
  arrays (e.g. nbthread * nbgroup) or a first level per group and a
  second per thread.
  => dynamic allocation based on the global number

Other
-----
- what about dynamic thread start/stop (e.g. for containers/VMs)?
  E.g. if we decide to start $MANY threads in 4 groups, and only use
  one, in the end it will not be possible to use less than one thread
  per group, and at most 64 will be present in each group.


FD Notes
--------
  - updt_fd_polling() uses thread_mask to figure out where to send the
    update, the local list or a shared list, and which bits to set in
    update_mask. This could be changed so that it takes the update mask
    as an argument. The call from the poller's fork would just have to
    broadcast everywhere.

  - pollers use it to figure out whether or not they're concerned by the
    activity update. This looks important as otherwise we could re-enable
    polling on an FD that changed to another thread.

  - thread_mask being a per-thread active mask looks more exact and is
    precisely used this way by _update_fd(). In this case using it
    instead of running_mask to gauge a change, or temporarily locking it
    during a removal, could make sense.

  - running should be conditioned by thread; polled not (since deferred
    or migrated). In this case testing thread_mask can be enough most of
    the time, but this requires synchronization that will have to be
    extended to the tgid. But migration seems a different beast that we
    shouldn't care about here: if first performed at the higher level it
    ought to be safe.

In practice the update_mask can be dropped to zero by the first
fd_delete() as the only authority allowed to fd_delete() is *the* owner,
and as soon as all running_mask bits are gone, the FD will be closed,
hence removed from all pollers. This will be the only way to make sure
that update_mask always refers to the current tgid.

However, it may happen that a takeover within the same group causes a
thread to read the update_mask late, while the FD is being wiped by
another thread. That other thread may close it, causing another thread in
another group to catch it, and change the tgid and start to update the
update_mask. This means that it would be possible for a thread entering
do_poll() to see the correct tgid, then the FD would be closed, reopened
and reassigned to another tgid, and the thread would see its bit in the
update_mask, being confused. Right now this should already happen when
the update_mask is not cleared, except that upon wakeup a migration would
be detected and that would be all.

Thus we might need to set the running bit to prevent the FD from
migrating before reading update_mask, which also implies closing on
fd_clr_running() == 0 :-(

Also even fd_update_events() leaves a risk of updating update_mask after
clearing running, thus affecting the wrong one. update_mask should
probably be updated before clearing running_mask there. Also, how about
not creating an update on a close? Not trivial if done before running,
unless thread_mask==0.

###########################################################

Current state:


Mux / takeover / fd_delete() code                 ||| poller code
--------------------------------------------------|||--------------------------------------------------
                                                  \|/
mux_takeover():                                    | fd_set_running():
   if (fd_takeover() < 0)                          |    old = {running, thread};
      return fail;                                 |    new = {tid_bit, tid_bit};
   ...                                             |
fd_takeover():                                     |    do {
   atomic_or(running, tid_bit);                    |       if (!(old.thread & tid_bit))
   old = {running, thread};                        |          return -1;
   new = {tid_bit, tid_bit};                       |       new = { running | tid_bit, old.thread }
   if (owner != expected) {                        |    } while (!dwcas({running, thread}, &old, &new));
      atomic_and(running, ~tid_bit);               |
      return -1; // fail                           | fd_clr_running():
   }                                               |    return atomic_and_fetch(running, ~tid_bit);
                                                   |
   while (old == {tid_bit, != 0}) {                | poll():
      if (dwcas({running, thread}, &old, &new)) {  |    if (!owner)
         atomic_and(running, ~tid_bit);            |       continue;
         return 0; // success                      |
      }                                            |    if (!(thread_mask & tid_bit)) {
   }                                               |       epoll_ctl_del();
                                                   |       continue;
   atomic_and(running, ~tid_bit);                  |    }
   return -1; // fail                              |
                                                   |    // via fd_update_events()
fd_delete():                                       |    if (fd_set_running() != -1) {
   atomic_or(running, tid_bit);                    |       iocb();
   atomic_store(thread, 0);                        |       if (fd_clr_running() == 0 && !thread_mask)
   if (fd_clr_running(fd) == 0)                    |          fd_delete_orphan();
      fd_delete_orphan();                          |    }
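For clarity, here is a self-contained model of the fd_takeover() scheme
shown above, with the {running, thread} pair packed into a single 64-bit
word so that a plain C11 CAS can stand in for the double-word CAS. This is
only a simplified illustration, not the actual code:

  #include <stdatomic.h>
  #include <stdint.h>

  /* low 32 bits = running_mask, high 32 bits = thread_mask */
  static _Atomic uint64_t fd_state;        /* one per FD in reality */

  #define RUNNING(v)  ((uint32_t)(v))
  #define THREAD(v)   ((uint32_t)((v) >> 32))
  #define PACK(r, t)  ((uint64_t)(r) | ((uint64_t)(t) << 32))

  /* Claim exclusivity on the FD: succeeds only while we are the sole
   * runner and the FD still belongs to at least one thread.
   */
  static int fd_takeover(uint32_t tid_bit)
  {
      uint64_t old, new;

      atomic_fetch_or(&fd_state, (uint64_t)tid_bit);        /* set running bit */
      old = atomic_load(&fd_state);

      while (RUNNING(old) == tid_bit && THREAD(old) != 0) {
          new = PACK(tid_bit, tid_bit);                      /* become sole owner */
          if (atomic_compare_exchange_weak(&fd_state, &old, new)) {
              atomic_fetch_and(&fd_state, ~(uint64_t)tid_bit); /* clear running */
              return 0;                                      /* success */
          }
          /* CAS failure reloaded 'old'; the loop re-checks the conditions */
      }
      atomic_fetch_and(&fd_state, ~(uint64_t)tid_bit);       /* failure path */
      return -1;
  }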
The idle_conns_lock prevents the connection from being *picked* and
released while someone else is reading it. What it does is guarantee that
on idle connections, the caller of the IOCB will not dereference the
task's context while the connection is still in the idle list, since it
might be picked and then freed at the same instant by another thread. As
soon as the IOCB manages to get that lock, it removes the connection from
the list so that it cannot be taken over anymore. Conversely, the mux's
takeover() code runs under that lock so that if it frees the connection
and task, this will appear atomic to the IOCB. The timeout task (which is
another entry point for connection deletion) does the same. Thus, when
coming from the low level (I/O or timeout):
  - the task always exists, but the ctx checked under the lock validates
    it; removal of the conn from the list prevents takeover().
  - t->context is stable, except during changes under the takeover lock.
    So h2_timeout_task may well run on a different thread than h2_io_cb().

Coming from the top:
  - takeover(), done under the lock, clears the task's ctx and possibly
    closes the FD (unless some running bit remains present).

Unlikely but currently possible situations:
  - multiple pollers (up to N) may have an idle connection's FD being
    polled, if the connection was passed from thread to thread. The first
    event on the connection would wake all of them. Most of them would
    see fdtab[].owner set (the late ones might miss it). All but one
    would see that their bit is missing from fdtab[].thread_mask and give
    up. However, just after this test, others might take over the
    connection, so in practice, if terribly unlucky, all but one could
    see their bit in thread_mask just before it gets removed, all of them
    set their bit in running_mask, and all of them call iocb()
    (sock_conn_iocb()). Thus all of them dereference the connection and
    touch the subscriber with no protection, then end up in
    conn_notify_mux() which will call the mux's wake().

  - multiple pollers (up to N-1) might still be in fd_update_events()
    manipulating fdtab[].state. The cause is that the "locked" variable
    is determined by atleast2(thread_mask), but that thread_mask is read
    at a random instant (i.e. it may be stolen by another one during a
    takeover) since we don't yet hold running to prevent this from being
    done. Thus we can arrive here with thread_mask==something_else
    (1 bit), locked==0 and fdtab[].state assigned non-atomically.

  - it looks like nothing prevents h2_release() from being called on a
    thread (e.g. from the top or the task timeout) while sock_conn_iocb()
    dereferences the connection on another thread. Those killing the
    connection don't yet consider the fact that it's an FD that others
    might currently be waking up on.

###################

Problem with a "users" counter:

A users count doesn't say who's using the FD, and two users can do the
same close in turn. The thread_mask should define who's responsible for
closing the FD, and all those with a bit in it ought to do it.


2021-08-25 - update with minimal locking on tgid value
==========

  - tgid + refcount at once using CAS
  - idle_conns lock during updates
  - update:
      if tgid differs => close happened, thus drop update
      otherwise normal stuff. Lock tgid until running if needed.
  - poll report:
      if tgid differs => closed
      if thread differs => stop polling (migrated)
      keep tgid lock until running
  - test on thread_id (sketched in C right after this list):
      if (xadd(&tgid, 65536) != my_tgid) {
         // was closed
         sub(&tgid, 65536)
         return -1
      }
      if !(thread_id & tidbit) => migrated/closed
      set_running()
      sub(tgid, 65536)
  - note: either fd_insert() or the final close() ought to set
    polled and update to 0.
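The thread_id test above could look roughly like this with C11 atomics,
assuming a hypothetical encoding where the tgid lives in the low 16 bits
and the refcount in the upper bits (hence the 65536 increment); this is
only a sketch of the idea:

  #include <stdatomic.h>
  #include <stdint.h>

  #define TGID_REFC_UNIT  0x10000u          /* refcount lives above bit 15 */
  #define TGID(v)         ((v) & 0xffffu)   /* group ID in the low 16 bits */

  /* hypothetical per-FD fields for this sketch */
  struct fd_entry {
      _Atomic uint32_t tgid;        /* tgid + refcount packed together */
      _Atomic uint64_t thread_id;   /* threads of that group using the FD */
  };

  /* Returns 0 if the FD was safely grabbed for running, -1 if it was
   * closed or migrated under our feet. Holding the refcount pins the
   * tgid while thread_id is checked and the running bit is set.
   */
  static int fd_grab(struct fd_entry *fd, uint32_t my_tgid, uint64_t tid_bit)
  {
      uint32_t old = atomic_fetch_add(&fd->tgid, TGID_REFC_UNIT);

      if (TGID(old) != my_tgid) {                      /* was closed/reassigned */
          atomic_fetch_sub(&fd->tgid, TGID_REFC_UNIT);
          return -1;
      }
      if (!(atomic_load(&fd->thread_id) & tid_bit)) {  /* migrated or closed */
          atomic_fetch_sub(&fd->tgid, TGID_REFC_UNIT);
          return -1;
      }
      /* set_running() would happen here, then the refcount can be dropped */
      atomic_fetch_sub(&fd->tgid, TGID_REFC_UNIT);
      return 0;
  }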
2021-09-13 - tid / tgroups etc.
==========

  - tid currently is the thread's global ID. It's essentially used as an
    index for arrays. It must be clearly stated that it works this way.

  - tid_bit makes no sense process-wide, so it must be redefined to
    represent the thread's tid within its group. The name is not very
    welcome though, but there are 286 occurrences of it that are not
    going to be changed that fast.

  - just like "ti" is the thread_info, we need to have "tg" pointing to
    the thread_group.

  - other less commonly used elements should be retrieved from ti->xxx,
    e.g. the thread's local ID.

  - lock debugging must reproduce the tgid.

  - an offset might be placed in the tgroup so that even with 64 threads
    max we could have completely separate tid_bits over several groups.

2021-09-15 - bind + listen() + rx
==========

  - thread_mask (in bind_conf->rx_settings) should become an array of
    MAX_TGROUP longs.
  - when parsing "thread 123" or "thread 2/37", the proper bit is set,
    assuming the array is either a contiguous bitfield or a tgroup array.
    An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on
    how the thread num was parsed, so that we reject mixes.
  - end of parsing: entries translated to the cleanest form (to be
    determined)
  - binding: for each socket()/bind()/listen()... just perform one extra
    dup() for each tgroup and store the multiple FDs into an FD array
    indexed on MAX_TGROUP (see the sketch right after this list).
    => allows using one FD per tgroup for the same socket, hence having
    multiple entries in all tgroup pollers without requiring the user to
    duplicate the bind line.
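The extra dup() per tgroup could look like the sketch below; the receiver
structure and function names are made up for the example and error
handling is trimmed:

  #include <unistd.h>

  #define MAX_TGROUPS 16

  /* hypothetical receiver holding one FD per thread group for one socket */
  struct rx_fds {
      int fd[MAX_TGROUPS];   /* fd[0] is the original socket, others are dups */
  };

  /* After socket()/bind()/listen() produced <listen_fd>, duplicate it once
   * per additional thread group so that each group's poller gets its own
   * fdtab entry for the same underlying socket.
   */
  static int rx_dup_per_group(struct rx_fds *rx, int listen_fd, int nbtgroups)
  {
      int grp;

      rx->fd[0] = listen_fd;
      for (grp = 1; grp < nbtgroups; grp++) {
          rx->fd[grp] = dup(listen_fd);
          if (rx->fd[grp] < 0)
              return -1;     /* caller is expected to clean up */
      }
      return 0;
  }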
2021-09-15 - global thread masks
==========

Some global variables currently expect to know about thread IDs and it's
uncertain what must be done with them:

  - global_tasks_mask /* Mask of threads with tasks in the global runqueue */
    => touched under the rq lock. Change it per-group? What exact use is
       made of it?

  - sleeping_thread_mask /* Threads that are about to sleep in poll() */
    => it seems that it can be made per-group

  - all_threads_mask: a bit complicated, derived from nbthread and used
    with masks and with my_ffsl() to wake threads up. Should probably be
    per-group but we might miss something for global.

  - stopping_thread_mask: used in combination with all_threads_mask,
    should move per-group.

  - threads_harmless_mask: indicates all threads that are currently
    harmless in that they promise not to access a shared resource. Must
    be made per-group, but then we'll likely need a second stage to have
    the harmless groups mask. threads_idle_mask, threads_sync_mask and
    threads_want_rdv_mask go with the one above. Maybe the right approach
    will be to request harmless on a group mask so that we can detect
    collisions and arbitrate them like today, but on top of this it
    becomes possible to request harmless only on the local group if
    desired. The subtlety is that requesting harmless at the group level
    does not mean it's achieved, since the requester cannot vouch for the
    other ones in the same group.

In addition, some variables are related to the global run queue:

  __decl_aligned_spinlock(rq_lock);        /* spin lock related to run queue */
  struct eb_root rqueue;                   /* tree constituting the global run queue, accessed under rq_lock */
  unsigned int grq_total;                  /* total number of entries in the global run queue, atomic */
  static unsigned int global_rqueue_ticks; /* insertion count in the grq, use rq_lock */

And others to the global wait queue:

  struct eb_root timers;                   /* sorted timers tree, global, accessed under wq_lock */
  __decl_aligned_rwlock(wq_lock);          /* RW lock related to the wait queue */


2021-09-29 - group designation and masks
==========

Neither FDs nor tasks will belong to incomplete subsets of threads
spanning over multiple thread groups. In addition there may be a
difference between configuration and operation (for FDs). This allows us
to fix the following rules:

  group  mask   description
    0     0     bind_conf: groups & thread not set. Bind to any/all.
                task: it would be nice to mean "run on the same as the
                caller".

    0    xxx    bind_conf: thread set but not group: thread IDs are
                global.
                FD/task: group 0, mask xxx.

   G>0    0     bind_conf: only the group is set: bind to all threads of
                group G.
                FD/task: mask 0 not permitted (= not owned). May be used
                to mean "any thread of this group", though already
                covered by G/xxx like today.

   G>0   xxx    bind_conf: bind to these threads of this group.
                FD/task: group G, mask xxx.

It looks like keeping groups starting at zero internally complicates
everything though. But forcing them to start at 1 might also require that
we rescan all tasks to replace 0 with 1 upon startup. This would also
allow group 0 to be special and be used as the default group for any new
thread creation, so that group0.count would keep the number of unassigned
threads. Let's try:

  group  mask   description
    0     0     bind_conf: groups & thread not set. Bind to any/all.
                task: "run on the same group & thread as the caller".

    0    xxx    bind_conf: thread set but not group: thread IDs are
                global.
                FD/task: invalid. Or maybe for a task we could use this
                to mean "run on the current group, thread XXX", which
                would cover the need for health checks (g/t 0/0 while
                sleeping, 0/xxx while running) and have
                wake_expired_tasks() detect 0/0 and wake them up to a
                random group.

   G>0    0     bind_conf: only the group is set: bind to all threads of
                group G.
                FD/task: mask 0 not permitted (= not owned). May be used
                to mean "any thread of this group", though already
                covered by G/xxx like today.

   G>0   xxx    bind_conf: bind to these threads of this group.
                FD/task: group G, mask xxx.

With a single group declared in the config, group 0 would implicitly find
the first one.


The problem with the approach above is that a task queued in one
group+thread's wait queue could very well receive a signal from another
thread and/or group, and that there is no indication about where the task
is queued, nor how to dequeue it. Thus it seems that it's up to the
application itself to unbind/rebind a task. This contradicts the
principle of leaving a task waiting in a wait queue and waking it
anywhere.

Another possibility might be to decide that a task having a defined group
but a mask of zero is shared and will always be queued into its group's
wait queue. However, upon expiry, the scheduler would notice the
thread-mask 0 and would broadcast it to any group.
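To make the (group, mask) pair above concrete, a hypothetical compact
representation and the resolution of the two special cases at wakeup time
could look like this (names and semantics are only an assumption following
the second table):

  #include <stdint.h>

  /* hypothetical compact thread-set designation */
  struct thread_set {
      uint16_t tgid;   /* 0 = unspecified, otherwise groups numbered from 1 */
      uint64_t mask;   /* thread mask within the group; 0 = unspecified */
  };

  /* Resolve the special cases of the second table at wakeup time:
   *   0/0   : run on the caller's own group and thread
   *   G>0/0 : shared within group G, any thread of that group may take it
   */
  static struct thread_set resolve_thread_set(struct thread_set ts,
                                              uint16_t caller_tgid,
                                              uint64_t caller_tid_bit)
  {
      if (!ts.tgid && !ts.mask) {
          ts.tgid = caller_tgid;      /* same group & thread as the caller */
          ts.mask = caller_tid_bit;
      }
      else if (ts.tgid && !ts.mask) {
          ts.mask = ~0ULL;            /* any thread of the group; the
                                         scheduler will pick one */
      }
      return ts;
  }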
Right now in the code we have:
  - 18 calls to task_new(tid_bit)
  - 18 calls to task_new(MAX_THREADS_MASK)
  - 2 calls with a single bit

Thus it looks like "task_new_anywhere()", "task_new_on()" and
"task_new_here()" would be sufficient.
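Those three helpers could then be thin wrappers around a single
constructor taking an explicit group and per-group mask. The signatures
and the encoding of "anywhere" below are only assumptions for the sake of
illustration:

  #include <stdint.h>

  struct task;                               /* opaque here */

  /* assumed low-level constructor taking a group and a per-group mask */
  struct task *task_new(uint16_t tgid, uint64_t thread_mask);

  /* caller's identity; thread-local in the real code */
  extern uint16_t my_tgid;
  extern uint64_t my_tid_bit;

  /* run on any thread of any group (checks, Lua, ...); the 0 / full-mask
   * encoding is a placeholder until the final designation is settled
   */
  static inline struct task *task_new_anywhere(void)
  {
      return task_new(0, ~0ULL);
  }

  /* run on one explicit group/thread set */
  static inline struct task *task_new_on(uint16_t tgid, uint64_t thread_mask)
  {
      return task_new(tgid, thread_mask);
  }

  /* run on the calling thread only */
  static inline struct task *task_new_here(void)
  {
      return task_new(my_tgid, my_tid_bit);
  }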