TODO:
- [ ] keep a global op in-flight counter? (might need locking)
- [-] scheduling (who does what, more than one select thread? How is the proxy
  work distributed between threads?)
- [ ] managing timeouts?
- [X] outline locking policy: a lock inversion seems to be looming in the
  design: when working with an op, we might need a lock on both the client and
  the upstream, but depending on where we started, we might want to lock one
  first, then the other
- [ ] how to deal with the balancer running out of fds? Especially when we hit
  the limit, then lose an upstream connection and accept() a client, we
  wouldn't be able to initiate a new one. A bit of a DoS... But probably not
  a concern for Ericsson
- [ ] non-Linux? No idea how anything other than poll works (moot if building a
  libevent/libuv-based load balancer since they take care of that, except
  edge-triggered I/O?)
- [-] rootDSE? Controls and exops might have different semantics and need
  binding to the same upstream connection.
- [ ] Just piggybacking on OpenLDAP as a module? Would still need some updates
  in the core and the module/subsystem would be a very invasive one. On the
  other hand, it allows us to expose live configuration and monitoring over
  LDAP over the current slapd listeners without re-inventing the wheel.


Expecting to handle only LDAPv3.

Terms:
- server - a configured target
- upstream - a single connection to a server
- client - an incoming connection

To maintain fairness `G( requested => ( F( progressed | failed ) ) )` - every
requested operation eventually progresses or fails - use queues and put
timeouts in place.

Runtime organisation
------
- main thread with its own event base handling signals
- one thread (later possibly more) listening on the rendezvous sockets, handing
  the new sockets to worker threads
- n worker threads dealing with client and server I/O (dispatching actual work
  to the thread pool most likely)
- a thread pool to handle actual work
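
A minimal sketch of that thread layout, assuming the libevent option mentioned
in the TODO list above; the worker count, `worker_main` and all other names
here are illustrative, not a committed design:

```c
#include <event2/event.h>
#include <event2/thread.h>
#include <pthread.h>
#include <signal.h>

#define N_WORKERS 4     /* illustrative */

static void
sig_cb( evutil_socket_t sig, short what, void *arg )
{
    event_base_loopexit( arg, NULL );
}

/* Each worker owns its own event base; client and upstream fds get
 * registered with one of these and are handled there. */
static void *
worker_main( void *arg )
{
    /* libevent 2.1: keep looping even while no fds are registered yet */
    event_base_loop( arg, EVLOOP_NO_EXIT_ON_EMPTY );
    return NULL;
}

int
main( void )
{
    struct event_base *main_base, *worker_base[N_WORKERS];
    pthread_t workers[N_WORKERS];
    struct event *sigint;
    int i;

    evthread_use_pthreads();

    /* main thread: its own event base handling signals only */
    main_base = event_base_new();
    sigint = evsignal_new( main_base, SIGINT, sig_cb, main_base );
    event_add( sigint, NULL );

    /* n worker threads dealing with client and server I/O */
    for ( i = 0; i < N_WORKERS; i++ ) {
        worker_base[i] = event_base_new();
        pthread_create( &workers[i], NULL, worker_main, worker_base[i] );
    }

    /* the listener thread(s) and the thread pool for actual work are
     * omitted from this sketch */
    event_base_dispatch( main_base );
    return 0;
}
```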

Operational behaviour
------

- client read -> upstream write (see the read-loop sketch after this list):
  - client read:
    - if TLS_SETUP, keep processing, set the state back when finished and note
      that we're under TLS
    - ber_get_next(), if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back into the ones we're
      waiting for)
    - peek at the op tag:
      - unbind:
        - with a single lock, mark all pending ops in upstreams abandoned,
          clear the client link (would it be fast enough if we removed them
          from the upstream map instead?)
        - locked per op:
          - remove the op from the upstream map
          - check the upstream is not write-suspended, if it is ...
          - try to write the abandon op to the upstream, suspend the upstream
            if not fully sent
        - remove the op from the client map (how, if we're in avl_apply?
          Another pass?)
        - it would be nice if we could wipe the complete client map then,
          otherwise we need to queue it to have it freed when all abandons get
          passed on to the upstream (just dropping them might put extra strain
          on upstreams, we will probably have a queue on each client/upstream
          anyway, not just a single Ber)
      - bind:
        - check the mechanism is not EXTERNAL (or implement it)
        - abandon existing ops (see unbind)
        - set state to BINDING, put the DN into authzid
        - pick an upstream, create the PDU and send it
      - abandon:
        - find the op, mark it for abandon, send to the appropriate upstream
      - Exop:
        - check we're not BINDING (unless it's a cancel?)
        - check the OID:
          - STARTTLS:
            - check we don't have TLS yet
            - abandon all
            - set state to TLS_SETUP
            - send the hello
          - VC(?):
            - similar to bind except for the abandons/state change
      - other:
        - check we're not BINDING
        - pick an upstream
        - create a PDU, send it (marking the upstream suspended if it wasn't
          written in full)
    - check if we should read again (keep a counter of the number of PDUs to
      read off a connection in a single pass so that we maintain fairness)
    - if we have read enough requests and can still read, re-queue ourselves
      (if we don't have true edge-triggered I/O, we can just register the fd
      again)
  - upstream write (only when suspended):
    - flush the current BER
    - there shouldn't be anything else?
- upstream read -> client write:
  - upstream read:
    - ber_get_next(), if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back into the ones we're
      waiting for)
    - when we get it, peek at the msgid, resolve the client connection, lock,
      check:
      - if unsolicited, handle as a close (and mark the connection closing)
      - if the op is abandoned or does not exist, drop the PDU and the op,
        update counters
      - if the client is backlogged, suspend the upstream, register a callback
        to unsuspend it (on progress when writing to the client or on abandon
        from the client (connection death, abandon proper, ...))
    - reconstruct the final PDU, write the BER to the client, if it did not
      write fully, suspend the client
    - if it is a final response, decrement the operation counts on the
      upstream and the client
    - check if we should read again (keep a counter of the number of responses
      to read off a connection in a single pass so that we don't starve any
      connection)
  - client write ready (only checked for when suspended):
    - write the rest of the pending BER, if any
    - on a successful write, pick all pending ops that need a failure
      response, push them to the client (are there any controls that need to
      be present in a response even in the case of failure? What to do with
      them?)
    - on successfully flushing them, walk through the suspended upstreams,
      picking the pending PDU (unsuspending the upstream) and writing; if the
      PDU flushed successfully, pick the next upstream
    - if we successfully flushed all suspended upstreams, unsuspend the client
      (and disable the write callback)
- upstream close/error:
  - look up pending ops, try to write to clients, mark clients suspended that
    have ops that need responses (another queue associated with the client to
    speed this up?)
  - schedule a new connection open
- client close/error:
  - same as unbind
- client inactive (no pending ops and nothing happened in x seconds):
  - might just send a notice of disconnection and close
- op timeout handling:
  - mark for abandon
  - send the abandon
  - send timeLimitExceeded/adminLimitExceeded to the client
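
A minimal sketch of the client read loop described above, assuming liblber's
ber_get_next()/ber_peek_tag() and a per-pass PDU counter for fairness; the
connection structure, `MAX_PDUS_PER_CYCLE` and the commented-out handler names
are illustrative only:

```c
#include <ldap.h>
#include <lber.h>

#define MAX_PDUS_PER_CYCLE 10   /* illustrative fairness limit */

/* Hypothetical client connection, keeping the Sockbuf and any partially
 * read BerElement between event callbacks. */
typedef struct Client {
    Sockbuf *c_sb;
    BerElement *c_pending_ber;
    int c_state;                /* e.g. TLS_SETUP, BINDING, ... */
} Client;

/* Returns non-zero if the fd should be registered for reading again. */
int
client_read_cb( Client *c )
{
    int i;

    for ( i = 0; i < MAX_PDUS_PER_CYCLE; i++ ) {
        BerElement *ber = c->c_pending_ber;
        ber_len_t len;
        ber_tag_t tag;
        ber_int_t msgid;

        if ( ber == NULL ) {
            ber = ber_alloc_t( LBER_USE_DER );
        }

        tag = ber_get_next( c->c_sb, &len, ber );
        if ( tag == LBER_DEFAULT ) {
            /* PDU not complete yet (or a hard error on this socket),
             * keep the partial BER around and wait for more data */
            c->c_pending_ber = ber;
            return 1;
        }
        c->c_pending_ber = NULL;

        /* messageID, then peek at the protocolOp choice */
        tag = ber_get_int( ber, &msgid );
        tag = ber_peek_tag( ber, &len );

        switch ( tag ) {
        case LDAP_REQ_UNBIND:   /* handle_unbind( c, ber ); */ break;
        case LDAP_REQ_BIND:     /* handle_bind( c, ber ); */ break;
        case LDAP_REQ_ABANDON:  /* handle_abandon( c, ber ); */ break;
        case LDAP_REQ_EXTENDED: /* handle_exop( c, ber ); */ break;
        default:                /* pick an upstream, forward */ break;
        }
        ber_free( ber, 1 );
    }

    /* We have read our fair share for this pass: let other connections
     * run and re-queue ourselves (re-register the fd). */
    return 1;
}
```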

Picking an upstream:
- while there is a level available:
  - pick a random ordering of upstreams based on weights
  - while there is an upstream in the level:
    - check the number of ops in-flight (this is where we lock the upstream
      map)
    - find the least busy connection (and check if a new connection should be
      opened)
    - try to lock for socket write, if available (no BER queued) we have our
      upstream
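
A sketch of the weighted pick and the least-busy connection choice; the
structures, field names and locking granularity are illustrative assumptions.
Calling `weighted_pick()` repeatedly, removing each chosen server, yields the
weighted random ordering mentioned above:

```c
#include <stdlib.h>
#include <pthread.h>

/* Illustrative structures */
typedef struct UpstreamConn {
    struct UpstreamConn *next;
    int ops_in_flight;
    int suspended;              /* a pending BER is already queued */
    pthread_mutex_t write_mutex;
} UpstreamConn;

typedef struct Server {
    UpstreamConn *conns;
    int weight;
} Server;

/* Pick one server from a level at random, proportionally to its weight. */
static Server *
weighted_pick( Server **servers, int n, int total_weight )
{
    int r = rand() % total_weight, i;

    for ( i = 0; i < n; i++ ) {
        r -= servers[i]->weight;
        if ( r < 0 ) return servers[i];
    }
    return servers[n - 1];
}

/* Find the least busy connection of a server and try to lock it for a
 * socket write; returns NULL if the best candidate is suspended or its
 * write mutex is taken, so the caller moves on to the next server. */
static UpstreamConn *
least_busy_writable( Server *srv )
{
    UpstreamConn *c, *best = NULL;

    for ( c = srv->conns; c != NULL; c = c->next ) {
        if ( best == NULL || c->ops_in_flight < best->ops_in_flight )
            best = c;
    }
    if ( best && !best->suspended &&
            pthread_mutex_trylock( &best->write_mutex ) == 0 )
        return best;    /* caller writes the PDU, then unlocks */
    return NULL;
}
```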

PDU processing:
- request (an upstream has been selected):
  - get a new msgid from the upstream
  - create an Op structure (actually, since we need a freelist lock anyway, we
    can make the freelist a cache of freed operation structures, avoiding some
    malloc traffic; to reset one, we need slap_sl_mem_create( ,,, 1 ))
  - check proxyauthz is not present? Or just let the upstream reject it if
    there are two?
  - add our own controls at the end:
    - construct proxyauthz from authzid
    - construct session tracking from the remote IP, our own name and authzid
  - send it over
  - insert the Op into the client and upstream maps
- response/intermediate/entry:
  - look up the Op in the upstream's map
  - write the old (client) msgid, the rest of the response can go unchanged
  - if it is a final response, remove the Op from all maps (client and
    upstream)
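
A sketch of the Op bookkeeping and the msgid swap on both paths. The `Op` and
`Connection` layouts and the `map_insert`/`map_find` helpers are hypothetical
placeholders for whatever map structure (e.g. an AVL tree) ends up being used,
and the refcounting described under "Locking policy" below is left out here:

```c
#include <stdlib.h>
#include <pthread.h>
#include <lber.h>

typedef struct Connection Connection;

/* One outstanding operation, reachable from both the client and the
 * upstream connection maps. */
typedef struct Op {
    ber_int_t o_client_msgid;   /* msgid as the client sent it */
    ber_int_t o_upstream_msgid; /* msgid we allocated on the upstream */
    Connection *o_client;
    Connection *o_upstream;
} Op;

struct Connection {
    ber_int_t c_next_msgid;
    pthread_mutex_t c_map_mutex;
    void *c_ops;                /* hypothetical msgid -> Op map */
};

/* hypothetical placeholder stubs, a real map would go here */
static void map_insert( void *m, ber_int_t k, Op *op ) { (void)m; (void)k; (void)op; }
static Op *map_find( void *m, ber_int_t k ) { (void)m; (void)k; return NULL; }

/* Request path: allocate an upstream msgid and register the op. */
Op *
op_create( Connection *client, Connection *upstream, ber_int_t client_msgid )
{
    Op *op = calloc( 1, sizeof(Op) ); /* or taken from the freelist cache */

    op->o_client = client;
    op->o_upstream = upstream;
    op->o_client_msgid = client_msgid;

    pthread_mutex_lock( &upstream->c_map_mutex );
    op->o_upstream_msgid = upstream->c_next_msgid++;
    map_insert( upstream->c_ops, op->o_upstream_msgid, op );
    pthread_mutex_unlock( &upstream->c_map_mutex );

    pthread_mutex_lock( &client->c_map_mutex );
    map_insert( client->c_ops, op->o_client_msgid, op );
    pthread_mutex_unlock( &client->c_map_mutex );

    return op;
}

/* Response path: resolve the op by the upstream msgid; the response is
 * then re-emitted to the client with op->o_client_msgid instead. */
Op *
op_resolve( Connection *upstream, ber_int_t upstream_msgid )
{
    Op *op;

    pthread_mutex_lock( &upstream->c_map_mutex );
    op = map_find( upstream->c_ops, upstream_msgid );
    pthread_mutex_unlock( &upstream->c_map_mutex );
    return op;
}
```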

Managing upstreams:
- async connect up to min_connections (is there a point in having a connection
  count range if we can't use it when needed, since all of the below is async?)
- when connected, set up TLS (if requested)
- when done, send a bind
- go through the bind interaction
- when done, add the connection to the upstream's connection list
- (if a connection is suspended or connections are over 75 % of the op limit,
  schedule setting up a new connection unless the connection limit has been
  hit)
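
A sketch of the scale-up check from the last point; the structure fields and
limits are illustrative, only the 75 % threshold is taken from the note above:

```c
/* Illustrative per-server connection list with an op counter and a
 * suspended flag, plus configured limits. */
typedef struct Conn {
    struct Conn *next;
    int ops_in_flight;
    int suspended;
} Conn;

typedef struct Server {
    Conn *conns;
    int n_conns;
    int max_conns;          /* hard connection limit */
    int max_ops_per_conn;
} Server;

/* Decide whether another upstream connection should be scheduled. */
int
should_open_new_connection( Server *srv )
{
    Conn *c;

    if ( srv->n_conns >= srv->max_conns )
        return 0;

    for ( c = srv->conns; c != NULL; c = c->next ) {
        /* any suspended connection, or any connection over 75 % of its
         * op limit, is a reason to start setting up a new one */
        if ( c->suspended ||
                4 * c->ops_in_flight > 3 * srv->max_ops_per_conn )
            return 1;
    }
    return 0;
}
```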

Managing timeouts:
- two options:
  - maintain a separate locked priority queue giving a total ordering on when
    each operation is due to time out; this means maintaining yet another
    place where operations can be found
    - the locking protocol for disposing of an operation would need to be
      adjusted and might become even more complicated; we might do the
      alternative below initially and attempt this if it helps performance
  - just do a sweep over all clients (that mutex is less contended) every so
    often; with many in-flight operations this might be a lot of wasted work
    - we still need to sweep over all clients to check whether they should be
      killed anyway
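
A sketch of the sweep option, assuming each client keeps a list of its pending
ops with a start timestamp; every name and field here is illustrative:

```c
#include <time.h>
#include <pthread.h>

typedef struct TOp {
    struct TOp *next;
    time_t started;
    int abandoned;
} TOp;

typedef struct TClient {
    struct TClient *next;
    pthread_mutex_t mutex;
    TOp *ops;
    time_t last_activity;
} TClient;

/* Walk all clients every so often: abandon ops that exceeded the op
 * timeout and flag idle clients for a notice of disconnection. */
void
timeout_sweep( TClient *clients, time_t op_timeout, time_t idle_timeout )
{
    time_t now = time( NULL );
    TClient *c;
    TOp *op;

    for ( c = clients; c != NULL; c = c->next ) {
        pthread_mutex_lock( &c->mutex );
        for ( op = c->ops; op != NULL; op = op->next ) {
            if ( !op->abandoned && now - op->started > op_timeout ) {
                op->abandoned = 1;
                /* send an abandon upstream and a timeLimitExceeded
                 * response to the client (omitted) */
            }
        }
        if ( c->ops == NULL && now - c->last_activity > idle_timeout ) {
            /* send a notice of disconnection and close (omitted) */
        }
        pthread_mutex_unlock( &c->mutex );
    }
}
```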

Dispatcher thread (2^n of them, fd x is handled by thread no. x % 2^n):
- poll on all registered fds
- remove each fd that is ready from the registered list and schedule the work
- worker threads can put their fd back in if they deem it necessary (= not
  suspended)
- this works as a poor man's edge-triggered polling; with enough workers,
  should we do proper edge-triggered I/O? What about non-Linux?
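
The fd-to-dispatcher assignment is a simple hash; a sketch, assuming the
dispatcher count is kept at a power of two so the modulo reduces to a bit
mask:

```c
#define N_DISPATCHERS 4     /* must stay a power of two */

/* Map a file descriptor to the dispatcher thread that owns it.  Since
 * N_DISPATCHERS is a power of two,
 * fd % N_DISPATCHERS == fd & (N_DISPATCHERS - 1). */
static inline int
dispatcher_for_fd( int fd )
{
    return fd & ( N_DISPATCHERS - 1 );
}
```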

Listener thread:
- slapd has just one, which then reassigns the sockets to separate I/O threads

Threading:
- if using slap_sl_malloc, how much performance do we gain? To allocate a
  context per op, we should have a dedicated parent context so that when we
  free it, we can use that exclusively. The parent context's parent would be
  the main thread's context. This implies a lot of
  slap_sl_mem_setctx/slap_sl_mem_create( ,,, 0 ) and making sure an op does
  not allocate/free things from two threads at the same time (might need an Op
  mutex after all? Not such a huge cost if we routinely reuse Op structures)

Locking policy:
- read mutexes are unnecessary, we only have one thread receiving data from a
  connection - the one started from the dispatcher
- two reference counters on operation structures (an op is accessible from the
  client and the upstream map, each counter is consistent when a thread holds
  a lock on the corresponding map); when a counter drops to zero, start the
  freeing procedure (see the sketch after this list)
- a place to mark disposal finished for each side, consistency enforced by
  holding the freelist lock when reading/manipulating it
- when an op is created, we already have a write lock on the upstream socket
  and map: start writing, insert into the upstream map with upstream refcount
  1, unlock, lock the client, insert (client refcount 0), unlock, lock the
  upstream, decrement the refcount (which triggers a test whether we need to
  drop it now), unlock the upstream, done
- when the upstream side processes a PDU, it locks its map, increments the
  counter (potentially removing the op if it is a final response), unlocks,
  then locks the client's map, the write mutex (in this order?) and the full
  client mutex (if it is a bind response)
- when the client side wants to work with a PDU (abandon, (un)bind), it locks
  its map, increases the refcount, unlocks, locks the upstream map and write
  mutex, sends or queues the abandon, unlocks the write mutex and initiates
  the freeing procedure from the upstream side (or, if we would have to
  remember that we already increased the client-side refcount, marks it for
  deletion, drops the upstream lock, locks the client, decrements the
  refcount, either triggering deletion from the client side or marking it for
  deletion)
- if we have a per-operation lock, we can simplify this a bit (no need for the
  three-stage locking above)
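
A sketch of the two-counter scheme from the second point above; the struct
layout and helper names are illustrative, each counter is only touched while
the corresponding map mutex is held, and the per-side "released" flags under
the shared freelist mutex mirror the "place to mark disposal finished"
point:

```c
#include <stdlib.h>
#include <pthread.h>

typedef struct RefOp {
    int client_refcnt;          /* protected by the client map mutex */
    int upstream_refcnt;        /* protected by the upstream map mutex */
    int client_released;        /* protected by the freelist mutex */
    int upstream_released;      /* protected by the freelist mutex */
    pthread_mutex_t *client_map_mutex;
    pthread_mutex_t *upstream_map_mutex;
} RefOp;

static pthread_mutex_t freelist_mutex = PTHREAD_MUTEX_INITIALIZER;

static void
op_destroy( RefOp *op )
{
    free( op );     /* or return it to the freelist cache */
}

/* Called once a side's refcount has dropped to zero and the op has been
 * removed from that side's map; the freelist mutex arbitrates which of
 * the two sides performs the final destruction. */
static void
op_mark_released( RefOp *op, int from_client )
{
    int destroy;

    pthread_mutex_lock( &freelist_mutex );
    if ( from_client )
        op->client_released = 1;
    else
        op->upstream_released = 1;
    destroy = op->client_released && op->upstream_released;
    pthread_mutex_unlock( &freelist_mutex );

    if ( destroy )
        op_destroy( op );
}

/* Drop the client-side reference (the upstream side is symmetric). */
void
op_release_client( RefOp *op )
{
    int zero;

    pthread_mutex_lock( op->client_map_mutex );
    zero = ( --op->client_refcnt == 0 );
    /* the op would also be removed from the client map here */
    pthread_mutex_unlock( op->client_map_mutex );

    if ( zero )
        op_mark_released( op, 1 );
}
```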

Shutdown:
- stop the accept() thread(s) - potentially add a channel to hand the
  listening sockets over for a zero-downtime restart
- if very gentle, mark connections as closing, start a timeout and:
  - when a new non-abandon PDU comes in from a client, return LDAP_UNAVAILABLE
  - when receiving a PDU from an upstream, send it over to the client; if no
    ops are pending, send an unsolicited response and close (RFC 4511 suggests
    the unsolicited response is the last PDU coming from the server and
    libldap agrees, so we can't send it for a socket we want to shut down more
    gracefully)
- gentle (or very gentle timed out):
  - set a timeout
  - mark all ops as abandoned
  - send an unbind to all upstreams
  - send an unsolicited response to all clients
- imminent (or gentle timed out):
  - async close all connections?
  - exit()
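
A sketch of building the unsolicited Notice of Disconnection mentioned above
with liblber; the result code and diagnostic message are illustrative choices,
while the msgid 0 and the responseName OID 1.3.6.1.4.1.1466.20036 come from
RFC 4511:

```c
#include <ldap.h>
#include <lber.h>

/* Build the Notice of Disconnection PDU: an ExtendedResponse with
 * msgid 0 carrying the OID from RFC 4511 section 4.4.1.  The caller
 * writes the returned BER to the client and then closes the socket. */
BerElement *
build_notice_of_disconnection( void )
{
    BerElement *ber = ber_alloc_t( LBER_USE_DER );

    if ( ber == NULL )
        return NULL;

    ber_printf( ber, "{it{ess" /*}}*/,
            0,                              /* msgid 0: unsolicited */
            LDAP_RES_EXTENDED,
            LDAP_UNAVAILABLE,               /* resultCode */
            "",                             /* matchedDN */
            "connection is being closed" ); /* diagnosticMessage */
    ber_printf( ber, "ts", LDAP_TAG_EXOP_RES_OID,
            "1.3.6.1.4.1.1466.20036" );     /* responseName */
    ber_printf( ber, /*{{*/ "}}" );

    return ber;
}
```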

RootDSE:
- the default option is not to care: if a control/exop has special
  restrictions, it is the admin's job to flag it as such in the load
  balancer's config
- another option is not to care about the search request but to check each
  search entry being passed back: check the DN and, if it is the rootDSE,
  filter the list of controls/exops/SASL mechs (EXTERNAL!) that are supported
- the last option is to check all search requests for the DN/scope and
  synthesise the response locally - probably not (we would need to configure
  the complete list of controls, exops, SASL mechs and naming contexts in the
  balancer)

Potential red flags:
- we suspend upstreams; if we ever suspend clients, we need to be sure we
  can't create dependency cycles
  - is this an issue when only suspending the read side of each? Because even
    if we stop reading from everything, we should eventually flush data to
    those we can still talk to; as upstreams are flushed, we can start sending
    new requests from live clients (those that stay suspended are so due to
    their own inability to accept data)
  - we might need to suspend a client if there is a reason to choose a
    particular upstream (multi-request operation - bind, VC, PR, TXN, ...)
    - a SASL bind, but then there are no outstanding ops to receive, so it
      holds that !suspended(client) || !suspended(upstream) and the two cannot
      participate in a cycle
    - VC - multiple binds at the same time - !!! more analysis needed
    - PR - should only be able to have one per connection (that's a problem
      for later, maybe it even needs a dedicated upstream connection)
    - TXN - ??? probably the same situation as PR
  - alternatively, if we have a queue of pending BERs on the server, we need
    not suspend clients: an upstream is only chosen if the queue is free or
    there is a reason to send the request to that particular upstream
    (multi-stage bind/VC, PR, ...), but that still makes it possible for a
    client to exhaust all our memory by sending requests (VC or other ones
    bound to a slow upstream, or by not reading the responses at all)