From da76459dc21b5af2449af2d36eb95226cb186ce2 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 28 Apr 2024 11:35:11 +0200 Subject: Adding upstream version 2.6.12. Signed-off-by: Daniel Baumann --- doc/internals/fd-migration.txt | 138 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 138 insertions(+) create mode 100644 doc/internals/fd-migration.txt (limited to 'doc/internals/fd-migration.txt') diff --git a/doc/internals/fd-migration.txt b/doc/internals/fd-migration.txt new file mode 100644 index 0000000..aaddad3 --- /dev/null +++ b/doc/internals/fd-migration.txt @@ -0,0 +1,138 @@ +2021-07-30 - File descriptor migration between threads + +An FD migration may happen on any idle connection that experiences a takeover() +operation by another thread. In this case the acting thread becomes the owner +of the connection (and FD) while previous one(s) need to forget about it. + +File descriptor migration between threads is a fairly complex operation because +it is required to maintain a durable consistency between the pollers states and +the haproxy's desired state. Indeed, very often the FD is registered within one +thread's poller and that thread might be waiting in the system, so there is no +way to synchronously update it. This is where thread_mask, polled_mask and per +thread updates are used: + + - a thread knows if it's allowed to manipulate an FD by looking at its bit in + the FD's thread_mask ; + + - each thread knows if it was polling an FD by looking at its bit in the + polled_mask field ; a recent migration is usually indicated by a bit being + present in polled_mask and absent from thread_mask. + + - other threads know whether it's safe to take over an FD by looking at the + running mask: if it contains any other thread's bit, then other threads are + using it and it's not safe to take it over. + + - sleeping threads are notified about the need to update their polling via + local or global updates to the FD. Each thread has its own local update + list and its own bit in the update_mask to know whether there are pending + updates for it. This allows to reconverge polling with the desired state + at the last instant before polling. + +While the description above could be seen as "progressive" (it technically is) +in that there is always a transition and convergence period in a migrated FD's +life, functionally speaking it's perfectly atomic thanks to the running bit and +to the per-thread idle connections lock: no takeover is permitted without +holding the idle_conns lock, and takeover may only happen by atomically picking +a connection from the list that is also protected by this lock. In practice, an +FD is never taken over by itself, but always in the context of a connection, +and by atomically removing a connection from an idle list, it is possible to +guarantee that a connection will not be picked, hence that its FD will not be +taken over. + +same thread as list! + +The possible entry points to a race to use a file descriptor are the following +ones, with their respective sequences: + + 1) takeover: requested by conn_backend_get() on behalf of connect_server() + - take the idle_conns_lock, protecting against a parallel access from the + I/O tasklet or timeout task + - pick the first connection from the list + - attempt an fd_takeover() on this connection's fd. Usually it works, + unless a late wakeup of the owning thread shows up in the FD's running + mask. The operation is performed in fd_takeover() using a DWCAS which + tries to switch both running and thread_mask to the caller's tid_bit. A + concurrent bit in running is enough to make it fail. This guarantees + another thread does not wakeup from I/O in the middle of the takeover. + In case of conflict, this FD is skipped and the attempt is tried again + with the next connection. + - resets the task/tasklet contexts to NULL, as a signal that they are not + allowed to run anymore. The tasks retrieve their execution context from + the scheduler in the arguments, but will check the tasks' context from + the structure under the lock to detect this possible change, and abort. + - at this point the takeover succeeded, the idle_conns_lock is released and + the connection and its FD are now owned by the caller + + 2) poll report: happens on late rx, shutdown or error on idle conns + - fd_set_running() is called to atomically set the running_mask and check + that the caller's tid_bit is still present in the thread_mask. Upon + failure the caller arranges itself to stop reporting that FD (e.g. by + immediate removal or by an asynchronous update). Upon success, it's + guaranteed that any concurrent fd_takeover() will fail the DWCAS and that + another connection will need to be picked instead. + - FD's state is possibly updated + - the iocb is called if needed (almost always) + - if the iocb didn't kill the connection, release the bit from running_mask + making the connection possibly available to a subsequent fd_takeover(). + + 3) I/O tasklet, timeout task: timeout or subscribed wakeup + - start by taking the idle_conns_lock, ensuring no takeover() will pick the + same connection from this point. + - check the task/tasklet's context to verify that no recently completed + takeover() stole the connection. If it's NULL, the connection was lost, + the lock is released and the task/tasklet killed. Otherwise it is + guaranteed that no other thread may use that connection (current takeover + candidates are waiting on the lock, previous owners waking from poll() + lost their bit in the thread_mask and will not touch the FD). + - the connection is removed from the idle conns list. From this point on, + no other thread will even find it there nor even try fd_takeover() on it. + - the idle_conns_lock is now released, the connection is protected and its + FD is not reachable by other threads anymore. + - the task does what it has to do + - if the connection is still usable (i.e. not upon timeout), it's inserted + again into the idle conns list, meaning it may instantly be taken over + by a competing thread. + + 4) wake() callback: happens on last user after xfers (may free() the conn) + - the connection is still owned by the caller, it's still subscribed to + polling but the connection is idle thus inactive. Errors or shutdowns + may be reported late, via sock_conn_iocb() and conn_notify_mux(), thus + the running bit is set (i.e. a concurrent fd_takeover() will fail). + - if the connection is in the list, the idle_conns_lock is grabbed, the + connection is removed from the list, and the lock is released. + - mux->wake() is called + - if the connection previously was in the list, it's reinserted under the + idle_conns_lock. + + +With the DWCAS removal between running_mask & thread_mask: + +fd_takeover: + 1 if (!CAS(&running_mask, 0, tid_bit)) + 2 return fail; + 3 atomic_store(&thread_mask, tid_bit); + 4 atomic_and(&running_mask, ~tid_bit); + +poller: + 1 do { + 2 /* read consistent running_mask & thread_mask */ + 3 do { + 4 run = atomic_load(&running_mask); + 5 thr = atomic_load(&thread_mask); + 6 } while (run & ~thr); + 7 + 8 if (!(thr & tid_bit)) { + 9 /* takeover has started */ + 10 goto disable_fd; + 11 } + 12 } while (!CAS(&running_mask, run, run | tid_bit)); + +fd_delete: + 1 atomic_or(&running_mask, tid_bit); + 2 atomic_store(&thread_mask, 0); + 3 atomic_and(&running_mask, ~tid_bit); + +The loop in poller:3-6 is used to make sure the thread_mask we read matches +the last updated running_mask. If nobody can give up on fd_takeover(), it +might even be possible to spin on thread_mask only. Late pollers will not +set running anymore with this. -- cgit v1.2.3