TODO:
- [ ] keep a global op in-flight counter? (might need locking)
- [-] scheduling (who does what, more than one select thread? How does the
      proxy work get distributed between threads?)
- [ ] managing timeouts?
- [X] outline locking policy: seems like there might be a lock inversion in the
      design looming: when working with an op, we might need a lock on both
      client and upstream, but depending on where we started, we might want to
      start by locking one, then the other
- [ ] how to deal with the balancer running out of fds? Especially when we hit
      the limit, then lose an upstream connection and accept() a client, we
      wouldn't be able to initiate a new one. A bit of a DoS... But probably
      not a concern for Ericsson
- [ ] non-Linux? No idea how anything other than poll works (moot if building a
      libevent/libuv-based load balancer since they take care of that, except
      edge-triggered I/O?)
- [-] rootDSE? Controls and exops might have different semantics and need
      binding to the same upstream connection.
- [ ] Just piggybacking on OpenLDAP as a module? Would still need some updates
      in the core and the module/subsystem would be a very invasive one. On the
      other hand, it allows exposing live configuration and monitoring over
      LDAP over the current slapd listeners without re-inventing the wheel.

Expecting to handle only LDAPv3

terms:
- server - a configured target
- upstream - a single connection to a server
- client - an incoming connection

To maintain fairness `G( requested => ( F( progressed | failed ) ) )`, use
queues and put timeouts in place.

Runtime organisation
------
- main thread with its own event base handling signals
- one thread (later possibly more) listening on the rendezvous sockets, handing
  the new sockets to worker threads
- n worker threads dealing with client and server I/O (dispatching actual work
  to the thread pool most likely)
- a thread pool to handle the actual work (see the sketch below)
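As a rough illustration of the layout above, assuming the libevent route
floated in the TODO list is taken: one event base per worker thread plus a
dedicated base for the signal-handling main thread. Everything here
(`NUM_WORKERS`, `worker_loop`, the callback names) is a placeholder, not a
committed interface.

```c
/* Sketch only: thread/event-base layout, assuming libevent is used.
 * The listener thread (not shown) would hand accepted sockets to one of
 * the worker bases; heavy work is pushed to the thread pool. */
#include <event2/event.h>
#include <event2/thread.h>
#include <pthread.h>
#include <signal.h>

#define NUM_WORKERS 4

static struct event_base *worker_bases[NUM_WORKERS];

static void *
worker_loop( void *arg )
{
    /* keep looping even while no fds are registered yet (libevent 2.1) */
    event_base_loop( arg, EVLOOP_NO_EXIT_ON_EMPTY );
    return NULL;
}

static void
shutdown_cb( evutil_socket_t sig, short what, void *arg )
{
    event_base_loopexit( arg, NULL );
}

int
main( void )
{
    struct event_base *main_base;
    struct event *sig_event;
    pthread_t tids[NUM_WORKERS];
    int i;

    evthread_use_pthreads();    /* events get added from other threads */
    main_base = event_base_new();

    for ( i = 0; i < NUM_WORKERS; i++ ) {
        worker_bases[i] = event_base_new();
        pthread_create( &tids[i], NULL, worker_loop, worker_bases[i] );
    }

    /* the main thread only deals with signals */
    sig_event = evsignal_new( main_base, SIGTERM, shutdown_cb, main_base );
    event_add( sig_event, NULL );
    event_base_dispatch( main_base );
    return 0;
}
```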
Operational behaviour
------

- client read -> upstream write:
  - client read:
    - if TLS_SETUP, keep processing, set state back when finished and note that
      we're under TLS
    - ber_get_next(), if we don't have a tag, finished (unless we have true
      edge-triggered I/O, also put the fd back into the ones we're waiting for)
    - peek at op tag:
      - unbind:
        - with a single lock, mark all pending ops in upstreams abandoned, clear
          client link (would it be fast enough if we removed them from the
          upstream map instead?)
        - locked per op:
          - remove op from upstream map
          - check upstream is not write-suspended, if it is ...
          - try to write the abandon op to upstream, suspend upstream if not
            fully sent
          - remove op from client map (how, if we're in avl_apply? another
            pass?)
        - would be nice if we could wipe the complete client map then, otherwise
          we need to queue it to have it freed when all abandons get passed onto
          the upstream (just dropping them might put extra strain on upstreams,
          we will probably have a queue on each client/upstream anyway, not just
          a single Ber)
      - bind:
        - check mechanism is not EXTERNAL (or implement it)
        - abandon existing ops (see unbind)
        - set state to BINDING, put DN into authzid
        - pick upstream, create PDU and send
      - abandon:
        - find op, mark for abandon, send to appropriate upstream
      - Exop:
        - check not BINDING (unless it's a cancel?)
        - check OID:
          - STARTTLS:
            - check we don't have TLS yet
            - abandon all
            - set state to TLS_SETUP
            - send the hello
          - VC(?):
            - similar to bind except for the abandons/state change
      - other:
        - check not BINDING
        - pick an upstream
        - create a PDU, send (marking upstream suspended if not written in
          full)
    - check if we should read again (keep a counter of the number of times to
      read off a connection in a single pass so that we maintain fairness)
    - if we read enough requests and can still read, re-queue ourselves (if we
      don't have true edge-triggered I/O, we can just register the fd again)
  - upstream write (only when suspended):
    - flush the current BER
    - there shouldn't be anything else?
- upstream read -> client write:
  - upstream read:
    - ber_get_next(), if we don't have a tag, finished (unless we have true
      edge-triggered I/O, also put the fd back into the ones we're waiting for)
    - when we get it, peek at msgid, resolve client connection, lock, check:
      - if unsolicited, handle as close (and mark connection closing)
      - if op is abandoned or does not exist, drop PDU and op, update counters
      - if client is backlogged, suspend upstream, register callback to
        unsuspend (on progress when writing to client or abandon from client
        (connection death, abandon proper, ...))
    - reconstruct final PDU, write BER to client, if it did not write fully,
      suspend client
    - if a final response, decrement operation counts on upstream and client
    - check if we should read again (keep a counter of the number of responses
      to read off a connection in a single pass so that we don't starve any?)
  - client write ready (only checked for when suspended):
    - write the rest of the pending BER, if any
    - on successful write, pick all pending ops that need a failure response,
      push to client (are there any controls that need to be present in the
      response even in the case of failure? what to do with them?)
    - on successfully flushing them, walk through suspended upstreams, picking
      the pending PDU (unsuspending the upstream) and writing; if the PDU
      flushed successfully, pick the next upstream
    - if we successfully flushed all suspended upstreams, unsuspend the client
      (and disable the write callback)
- upstream close/error:
  - look up pending ops, try to write to clients, mark as suspended those
    clients that have ops needing responses (another queue associated with the
    client to speed this up?)
  - schedule opening a new connection
- client close/error:
  - same as unbind
- client inactive (no pending ops and nothing happened in x seconds):
  - might just send a notice of disconnection and close
- op timeout handling:
  - mark for abandon
  - send abandon
  - send timeLimitExceeded/adminLimitExceeded to the client

Picking an upstream:
- while there is a level available:
  - pick a random ordering of upstreams based on weights
  - while there is an upstream in the level:
    - check the number of ops in-flight (this is where we lock the upstream map)
    - find the least busy connection (and check whether a new connection should
      be opened)
    - try to lock for socket write; if available (no BER queued), we have our
      upstream (rough sketch below)
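A rough shape of that selection loop, with hypothetical `server` and
`upstream_conn` structures; the weighted shuffle and the map locking are
elided to keep the sketch short.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical structures; the real layout, naming and map locking are
 * left out to keep the sketch short. */
typedef struct upstream_conn {
    int ops_in_flight;              /* outstanding operations */
    int ber_queued;                 /* non-zero if a partial BER is pending */
    pthread_mutex_t write_mutex;
    struct upstream_conn *next;
} upstream_conn;

typedef struct server {
    int weight;                     /* used by the weighted shuffle (elided) */
    upstream_conn *conns;           /* open connections to this target */
    struct server *next;            /* next server within the same level */
} server;

/*
 * Walk one level: servers are assumed to already be in a weight-biased
 * random order; for each server take the least busy connection whose
 * write side is free.  Returns the chosen connection with write_mutex
 * held, or NULL so the caller can move on to the next level.
 */
static upstream_conn *
pick_upstream_in_level( server *level )
{
    server *srv;

    for ( srv = level; srv != NULL; srv = srv->next ) {
        upstream_conn *best = NULL, *c;

        for ( c = srv->conns; c != NULL; c = c->next ) {
            if ( best == NULL || c->ops_in_flight < best->ops_in_flight )
                best = c;
        }

        /* "try to lock for socket write; if available (no BER queued),
         * we have our upstream" */
        if ( best != NULL && pthread_mutex_trylock( &best->write_mutex ) == 0 ) {
            if ( !best->ber_queued )
                return best;        /* caller writes the PDU, then unlocks */
            pthread_mutex_unlock( &best->write_mutex );
        }
    }
    return NULL;
}
```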
PDU processing:
- request (have an upstream selected):
  - get a new msgid from the upstream
  - create an Op structure (actually, with the need for a freelist lock, we can
    make it a cache of freed operation structures, avoiding some malloc
    traffic; to reset, we need slap_sl_mem_create( ,,, 1 ))
  - check proxyauthz is not present? or just let the upstream reject it if
    there are two?
  - add our own controls at the end:
    - construct proxyauthz from authzid
    - construct session tracking from remote IP, own name, authzid
  - send over
  - insert Op into client and upstream maps
- response/intermediate/entry:
  - look up Op in the upstream's map
  - write the old msgid, the rest of the response can go unchanged
  - if a response, remove Op from all maps (client and upstream)

Managing upstreams:
- async connect up to min_connections (is there a point in having a connection
  count range if we can't use it when needed, since all of the below is async?)
- when connected, set up TLS (if requested)
- when done, send a bind
- go through the bind interaction
- when done, add it to the upstream's connection list
- (if a connection is suspended or connections are over 75 % of the op limit,
  schedule setting up a new connection unless the connection limit has been
  hit)

Managing timeouts:
- two options:
  - maintain a separate locked priority queue to give a perfect ordering of
    when each operation is to time out; would need to maintain yet another
    place where operations can be found
    - the locking protocol for disposing of the operation would need to be
      adjusted and might become even more complicated; might do the alternative
      initially and then attempt this if it helps performance
  - just do a sweep over all clients (that mutex is less contended) every so
    often; with many in-flight operations this might be a lot of wasted work
    - we still need to sweep over all clients to check if they should be killed
      anyway

Dispatcher thread (2^n of them, fd x is handled by thread no. x % (2^n)):
- poll on all registered fds
- remove each fd that's ready from the registered list and schedule the work
- work threads can put their fd back in if they deem it necessary (= not
  suspended)
- this works as a poor man's edge-triggered polling; with enough workers,
  should we do proper edge-triggered I/O? What about non-Linux?

Listener thread:
- slapd has just one, which then reassigns the sockets to separate I/O threads

Threading:
- if using slap_sl_malloc, how much perf do we gain? To allocate a context per
  op, we should have a dedicated parent context so that when we free it, we can
  use that exclusively. The parent context's parent would be the main thread's
  context. This implies a lot of slap_sl_mem_setctx/slap_sl_mem_create( ,,, 0 )
  and making sure an op does not allocate/free things from two threads at the
  same time (might need an Op mutex after all? Not such a huge cost if we
  routinely reuse Op structures; see the freelist sketch below)
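Since both the PDU-processing and threading notes above lean on reusing Op
structures to cut malloc traffic, here is a minimal freelist sketch. The `op`
layout, the names and the slab (`slap_sl_*`) integration are all placeholders.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: a trivial cache of freed operation structures,
 * protected by the freelist lock mentioned above.  The real Op layout
 * and its slab-context (slap_sl_*) handling are left out. */
typedef struct op {
    int client_msgid;
    int upstream_msgid;
    struct op *next_free;           /* freelist link, valid only when freed */
    /* ... map linkage, refcounts, pending BER, controls ... */
} op;

static op *op_freelist;
static pthread_mutex_t op_freelist_mutex = PTHREAD_MUTEX_INITIALIZER;

static op *
op_alloc( void )
{
    op *o;

    pthread_mutex_lock( &op_freelist_mutex );
    o = op_freelist;
    if ( o != NULL )
        op_freelist = o->next_free;
    pthread_mutex_unlock( &op_freelist_mutex );

    if ( o == NULL )
        o = malloc( sizeof(*o) );
    if ( o != NULL )
        memset( o, 0, sizeof(*o) );     /* "reset" before reuse */
    return o;
}

static void
op_release( op *o )
{
    pthread_mutex_lock( &op_freelist_mutex );
    o->next_free = op_freelist;
    op_freelist = o;
    pthread_mutex_unlock( &op_freelist_mutex );
}
```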
Locking policy:
- read mutexes are unnecessary, we only have one thread receiving data from the
  connection - the one started from the dispatcher
- two reference counters on operation structures (an op is accessible from the
  client and upstream maps, each counter is consistent when a thread has a lock
  on the corresponding map); when decreasing a counter to zero, start the
  freeing procedure (sketched below)
- a place to mark disposal finished for each side, consistency enforced by
  holding the freelist lock when reading/manipulating
- when an op is created, we already have a write lock on the upstream socket
  and map: start writing, insert into the upstream map with upstream refcount
  1, unlock, lock client, insert (client refcount 0), unlock, lock upstream,
  decrement refcount (triggers a test whether we need to drop it now), unlock
  upstream, done
- when the upstream processes a PDU, it locks its map, increments the counter,
  (potentially removes it if it's a response), unlocks, locks the client's map,
  write mutex (this order?) and the full client mutex (if a bind response)
- when the client side wants to work with a PDU (abandon, (un)bind), it locks
  its map, increases the refcount, unlocks, locks the upstream map and write
  mutex, sends or queues the abandon, unlocks the write mutex, initiates the
  freeing procedure from the upstream side (or, if we have to remember we've
  already increased the client-side refcount, mark for deletion, release the
  upstream lock, lock the client, decref, either triggering deletion from the
  client or marking for it)
- if we have an operation lock, we can simplify a bit (no need for the
  three-stage locking above)
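A compressed sketch of that two-counter disposal, under the assumption that
each side's refcount is only touched under its map mutex and the "disposal
finished" flags only under the freelist mutex; the types and helper names are
hypothetical, and only the final handshake is shown (the queued-abandon paths
and the three-stage creation sequence are left out).

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical layout: each side's refcount is only touched while that
 * side's map mutex is held; the *_done flags are only touched under the
 * freelist mutex, and whichever side finishes disposal last frees the op. */
typedef struct op op;
struct op {
    int client_refcnt;              /* protected by the client map mutex */
    int upstream_refcnt;            /* protected by the upstream map mutex */
    int client_done;                /* protected by op_freelist_mutex */
    int upstream_done;              /* protected by op_freelist_mutex */
    /* ... */
};

static pthread_mutex_t op_freelist_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Called with the relevant side's map mutex held, right after removing the
 * op from that map because the side's refcount dropped to zero. */
static void
op_finish_side( op *o, int client_side )
{
    int free_now;

    pthread_mutex_lock( &op_freelist_mutex );
    if ( client_side )
        o->client_done = 1;
    else
        o->upstream_done = 1;
    free_now = o->client_done && o->upstream_done;
    pthread_mutex_unlock( &op_freelist_mutex );

    if ( free_now )
        free( o );      /* or hand it back to the freelist sketched earlier */
}
```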
Shutdown:
- stop the accept() thread(s) - potentially add a channel to hand these
  listening sockets over for a zero-downtime restart
- if very gentle, mark connections as closing, start a timeout and:
  - when a new non-abandon PDU comes in from a client - return LDAP_UNAVAILABLE
  - when receiving a PDU from an upstream, send it over to the client; if no
    ops are pending, send an unsolicited response and close (RFC 4511 suggests
    the unsolicited response is the last PDU coming from the upstream and
    libldap agrees, so we can't send it for a socket we want to shut down more
    gracefully)
- gentle (or very gentle timed out):
  - set a timeout
  - mark all ops as abandoned
  - send unbind to all upstreams
  - send unsolicited to all clients
- imminent (or gentle timed out):
  - async close all connections?
  - exit()

RootDSE:
- the default option is not to care, and if a control/exop has special
  restrictions, it is the admin's job to flag it as such in the load balancer's
  config
- another is not to care about the search request but to check each search
  entry being passed back: check the DN and if it's a rootDSE, filter the list
  of controls/exops/SASL mechs (EXTERNAL!) that are supported
- the last one is to check all search requests for the DN/scope and synthesise
  the response locally - probably not (we would need to configure the complete
  list of controls, exops, SASL mechs and naming contexts in the balancer)

Potential red flags:
- we suspend upstreams; if we ever suspend clients we need to be sure we can't
  create dependency cycles
  - is this an issue when only suspending the read side of each? Because even
    if we stop reading from everything, we should eventually flush data to
    those we can still talk to; as upstreams are flushed, we can start sending
    new requests from live clients (those that are suspended are due to their
    own inability to accept data)
  - we might need to suspend a client if there is a reason to choose a
    particular upstream (multi-request operation - bind, VC, PR, TXN, ...)
    - a SASL bind, but that means there are no outstanding ops to receive;
      it holds that !suspended(client) or !suspended(upstream), so they
      cannot participate in a cycle
    - VC - multiple binds at the same time - !!! more analysis needed
    - PR - should only be able to have one per connection (that's a problem
      for later, maybe even needs a dedicated upstream connection)
    - TXN - ??? probably the same situation as PR
  - or if we have a queue for pending Bers on the server, we need not suspend
    clients: an upstream is only chosen if the queue is free or there is a
    reason to send to that particular upstream (multi-stage bind/VC, PR, ...),
    but that still makes it possible for a client to exhaust all our memory by
    sending requests (VC or other ones bound to a slow upstream, or by not
    reading the responses at all) - a bounded-queue sketch follows
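One way to contain that last memory-exhaustion scenario is to cap the
per-connection queue of pending BERs and treat overflow as a signal to stop
reading from (or drop) the offending client. A rough sketch, with made-up
types and an arbitrary limit:

```c
#include <stddef.h>

/* Illustrative only: a bounded queue of serialised PDUs waiting to be
 * written out.  MAX_PENDING_PDUS and the types are placeholders; a real
 * limit would presumably be configurable (count and/or total bytes). */
#define MAX_PENDING_PDUS 16

typedef struct pending_pdu {
    struct pending_pdu *next;
    /* ... BerElement or raw buffer ... */
} pending_pdu;

typedef struct conn {
    pending_pdu *pending_head, *pending_tail;
    int pending_count;
} conn;

/* Returns 0 on success, -1 if the queue is full and the caller should
 * suspend reads on (or drop) the connection that produced the PDU. */
static int
conn_queue_pdu( conn *c, pending_pdu *p )
{
    if ( c->pending_count >= MAX_PENDING_PDUS )
        return -1;

    p->next = NULL;
    if ( c->pending_tail != NULL )
        c->pending_tail->next = p;
    else
        c->pending_head = p;
    c->pending_tail = p;
    c->pending_count++;
    return 0;
}
```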