TODO:
- [ ] keep a global op in-flight counter? (might need locking)
- [-] scheduling (who does what, more than one select thread? How is the proxy
  work distributed between threads?)
- [ ] managing timeouts?
- [X] outline locking policy: a lock inversion seems to be looming in the
  design: when working with an op, we might need a lock on both the client and
  the upstream, but depending on where we started, we might want to lock one
  first, then the other
- [ ] how to deal with the balancer running out of fds? Especially when we hit
  the limit, then lose an upstream connection and accept() a client, we
  wouldn't be able to initiate a new one. A bit of a DoS... But probably not
  a concern for Ericsson
- [ ] non-Linux? No idea how anything other than poll works (moot if building a
  libevent/libuv-based load balancer since they take care of that, except
  edge-triggered I/O?)
- [-] rootDSE? Controls and exops might have different semantics and need
  binding to the same upstream connection.
- [ ] Just piggybacking on OpenLDAP as a module? Would still need some updates
  in the core and the module/subsystem would be a very invasive one. On the
  other hand, it allows us to expose live configuration and monitoring over
  LDAP over the current slapd listeners without re-inventing the wheel.


Expecting to handle only LDAPv3.

Terms:
- server - a configured target
- upstream - a single connection to a server
- client - an incoming connection

To maintain fairness `G( requested => ( F( progressed | failed ) ) )` - every
requested operation eventually progresses or fails - use queues and put
timeouts in place.

Runtime organisation
------
- main thread with its own event base handling signals
- one thread (later possibly more) listening on the rendezvous sockets, handing
  the new sockets to worker threads
- n worker threads dealing with client and server I/O (dispatching actual work
  to the thread pool most likely)
- a thread pool to handle actual work
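
A minimal sketch of that thread layout, assuming the libevent option mentioned
in the TODO list above; the worker count, `worker_main` and all other names
here are illustrative, not a committed design:

```c
#include <event2/event.h>
#include <event2/thread.h>
#include <pthread.h>
#include <signal.h>

#define N_WORKERS 4     /* illustrative */

static void
sig_cb( evutil_socket_t sig, short what, void *arg )
{
    event_base_loopexit( arg, NULL );
}

/* Each worker owns its own event base; client and upstream fds get
 * registered with one of these and are handled there. */
static void *
worker_main( void *arg )
{
    /* libevent 2.1: keep looping even while no fds are registered yet */
    event_base_loop( arg, EVLOOP_NO_EXIT_ON_EMPTY );
    return NULL;
}

int
main( void )
{
    struct event_base *main_base, *worker_base[N_WORKERS];
    pthread_t workers[N_WORKERS];
    struct event *sigint;
    int i;

    evthread_use_pthreads();

    /* main thread: its own event base handling signals only */
    main_base = event_base_new();
    sigint = evsignal_new( main_base, SIGINT, sig_cb, main_base );
    event_add( sigint, NULL );

    /* n worker threads dealing with client and server I/O */
    for ( i = 0; i < N_WORKERS; i++ ) {
        worker_base[i] = event_base_new();
        pthread_create( &workers[i], NULL, worker_main, worker_base[i] );
    }

    /* the listener thread(s) and the thread pool for actual work are
     * omitted from this sketch */
    event_base_dispatch( main_base );
    return 0;
}
```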

Operational behaviour
------

- client read -> upstream write (see the read-loop sketch after this list):
  - client read:
    - if TLS_SETUP, keep processing, set the state back when finished and note
      that we're under TLS
    - ber_get_next(), if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back into the ones we're
      waiting for)
    - peek at the op tag:
      - unbind:
        - with a single lock, mark all pending ops in upstreams abandoned,
          clear the client link (would it be fast enough if we removed them
          from the upstream map instead?)
        - locked per op:
          - remove the op from the upstream map
          - check the upstream is not write-suspended, if it is ...
          - try to write the abandon op to the upstream, suspend the upstream
            if not fully sent
        - remove the op from the client map (how, if we're in avl_apply?
          Another pass?)
        - it would be nice if we could wipe the complete client map then,
          otherwise we need to queue it to have it freed when all abandons get
          passed on to the upstream (just dropping them might put extra strain
          on upstreams, we will probably have a queue on each client/upstream
          anyway, not just a single Ber)
      - bind:
        - check the mechanism is not EXTERNAL (or implement it)
        - abandon existing ops (see unbind)
        - set state to BINDING, put the DN into authzid
        - pick an upstream, create the PDU and send it
      - abandon:
        - find the op, mark it for abandon, send to the appropriate upstream
      - Exop:
        - check we're not BINDING (unless it's a cancel?)
        - check the OID:
          - STARTTLS:
            - check we don't have TLS yet
            - abandon all
            - set state to TLS_SETUP
            - send the hello
          - VC(?):
            - similar to bind except for the abandons/state change
      - other:
        - check we're not BINDING
        - pick an upstream
        - create a PDU, send it (marking the upstream suspended if it wasn't
          written in full)
    - check if we should read again (keep a counter of the number of PDUs to
      read off a connection in a single pass so that we maintain fairness)
    - if we have read enough requests and can still read, re-queue ourselves
      (if we don't have true edge-triggered I/O, we can just register the fd
      again)
  - upstream write (only when suspended):
    - flush the current BER
    - there shouldn't be anything else?
- upstream read -> client write:
  - upstream read:
    - ber_get_next(), if we don't have a tag, we're finished (unless we have
      true edge-triggered I/O, also put the fd back into the ones we're
      waiting for)
    - when we get it, peek at the msgid, resolve the client connection, lock,
      check:
      - if unsolicited, handle as a close (and mark the connection closing)
      - if the op is abandoned or does not exist, drop the PDU and the op,
        update counters
      - if the client is backlogged, suspend the upstream, register a callback
        to unsuspend it (on progress when writing to the client or on abandon
        from the client (connection death, abandon proper, ...))
    - reconstruct the final PDU, write the BER to the client, if it did not
      write fully, suspend the client
    - if it is a final response, decrement the operation counts on the
      upstream and the client
    - check if we should read again (keep a counter of the number of responses
      to read off a connection in a single pass so that we don't starve any
      connection)
  - client write ready (only checked for when suspended):
    - write the rest of the pending BER, if any
    - on a successful write, pick all pending ops that need a failure
      response, push them to the client (are there any controls that need to
      be present in a response even in the case of failure? What to do with
      them?)
    - on successfully flushing them, walk through the suspended upstreams,
      picking the pending PDU (unsuspending the upstream) and writing; if the
      PDU flushed successfully, pick the next upstream
    - if we successfully flushed all suspended upstreams, unsuspend the client
      (and disable the write callback)
- upstream close/error:
  - look up pending ops, try to write to clients, mark clients suspended that
    have ops that need responses (another queue associated with the client to
    speed this up?)
  - schedule a new connection open
- client close/error:
  - same as unbind
- client inactive (no pending ops and nothing happened in x seconds):
  - might just send a notice of disconnection and close
- op timeout handling:
  - mark for abandon
  - send the abandon
  - send timeLimitExceeded/adminLimitExceeded to the client
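
A minimal sketch of the client read loop described above, assuming liblber's
ber_get_next()/ber_peek_tag() and a per-pass PDU counter for fairness; the
connection structure, `MAX_PDUS_PER_CYCLE` and the commented-out handler names
are illustrative only:

```c
#include <ldap.h>
#include <lber.h>

#define MAX_PDUS_PER_CYCLE 10   /* illustrative fairness limit */

/* Hypothetical client connection, keeping the Sockbuf and any partially
 * read BerElement between event callbacks. */
typedef struct Client {
    Sockbuf *c_sb;
    BerElement *c_pending_ber;
    int c_state;                /* e.g. TLS_SETUP, BINDING, ... */
} Client;

/* Returns non-zero if the fd should be registered for reading again. */
int
client_read_cb( Client *c )
{
    int i;

    for ( i = 0; i < MAX_PDUS_PER_CYCLE; i++ ) {
        BerElement *ber = c->c_pending_ber;
        ber_len_t len;
        ber_tag_t tag;
        ber_int_t msgid;

        if ( ber == NULL ) {
            ber = ber_alloc_t( LBER_USE_DER );
        }

        tag = ber_get_next( c->c_sb, &len, ber );
        if ( tag == LBER_DEFAULT ) {
            /* PDU not complete yet (or a hard error on this socket),
             * keep the partial BER around and wait for more data */
            c->c_pending_ber = ber;
            return 1;
        }
        c->c_pending_ber = NULL;

        /* messageID, then peek at the protocolOp choice */
        tag = ber_get_int( ber, &msgid );
        tag = ber_peek_tag( ber, &len );

        switch ( tag ) {
        case LDAP_REQ_UNBIND:   /* handle_unbind( c, ber ); */ break;
        case LDAP_REQ_BIND:     /* handle_bind( c, ber ); */ break;
        case LDAP_REQ_ABANDON:  /* handle_abandon( c, ber ); */ break;
        case LDAP_REQ_EXTENDED: /* handle_exop( c, ber ); */ break;
        default:                /* pick an upstream, forward */ break;
        }
        ber_free( ber, 1 );
    }

    /* We have read our fair share for this pass: let other connections
     * run and re-queue ourselves (re-register the fd). */
    return 1;
}
```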

Picking an upstream:
- while there is a level available:
  - pick a random ordering of upstreams based on weights
  - while there is an upstream in the level:
    - check the number of ops in-flight (this is where we lock the upstream
      map)
    - find the least busy connection (and check if a new connection should be
      opened)
    - try to lock for socket write, if available (no BER queued) we have our
      upstream
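
A sketch of the weighted pick and the least-busy connection choice; the
structures, field names and locking granularity are illustrative assumptions.
Calling `weighted_pick()` repeatedly, removing each chosen server, yields the
weighted random ordering mentioned above:

```c
#include <stdlib.h>
#include <pthread.h>

/* Illustrative structures */
typedef struct UpstreamConn {
    struct UpstreamConn *next;
    int ops_in_flight;
    int suspended;              /* a pending BER is already queued */
    pthread_mutex_t write_mutex;
} UpstreamConn;

typedef struct Server {
    UpstreamConn *conns;
    int weight;
} Server;

/* Pick one server from a level at random, proportionally to its weight. */
static Server *
weighted_pick( Server **servers, int n, int total_weight )
{
    int r = rand() % total_weight, i;

    for ( i = 0; i < n; i++ ) {
        r -= servers[i]->weight;
        if ( r < 0 ) return servers[i];
    }
    return servers[n - 1];
}

/* Find the least busy connection of a server and try to lock it for a
 * socket write; returns NULL if the best candidate is suspended or its
 * write mutex is taken, so the caller moves on to the next server. */
static UpstreamConn *
least_busy_writable( Server *srv )
{
    UpstreamConn *c, *best = NULL;

    for ( c = srv->conns; c != NULL; c = c->next ) {
        if ( best == NULL || c->ops_in_flight < best->ops_in_flight )
            best = c;
    }
    if ( best && !best->suspended &&
            pthread_mutex_trylock( &best->write_mutex ) == 0 )
        return best;    /* caller writes the PDU, then unlocks */
    return NULL;
}
```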

PDU processing:
- request (an upstream has been selected):
  - get a new msgid from the upstream
  - create an Op structure (actually, since we need a freelist lock anyway, we
    can make the freelist a cache of freed operation structures, avoiding some
    malloc traffic; to reset one, we need slap_sl_mem_create( ,,, 1 ))
  - check proxyauthz is not present? Or just let the upstream reject it if
    there are two?
  - add our own controls at the end:
    - construct proxyauthz from authzid
    - construct session tracking from the remote IP, our own name and authzid
  - send it over
  - insert the Op into the client and upstream maps
- response/intermediate/entry:
  - look up the Op in the upstream's map
  - write the old (client) msgid, the rest of the response can go unchanged
  - if it is a final response, remove the Op from all maps (client and
    upstream)
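
A sketch of the Op bookkeeping and the msgid swap on both paths. The `Op` and
`Connection` layouts and the `map_insert`/`map_find` helpers are hypothetical
placeholders for whatever map structure (e.g. an AVL tree) ends up being used,
and the refcounting described under "Locking policy" below is left out here:

```c
#include <stdlib.h>
#include <pthread.h>
#include <lber.h>

typedef struct Connection Connection;

/* One outstanding operation, reachable from both the client and the
 * upstream connection maps. */
typedef struct Op {
    ber_int_t o_client_msgid;   /* msgid as the client sent it */
    ber_int_t o_upstream_msgid; /* msgid we allocated on the upstream */
    Connection *o_client;
    Connection *o_upstream;
} Op;

struct Connection {
    ber_int_t c_next_msgid;
    pthread_mutex_t c_map_mutex;
    void *c_ops;                /* hypothetical msgid -> Op map */
};

/* hypothetical placeholder stubs, a real map would go here */
static void map_insert( void *m, ber_int_t k, Op *op ) { (void)m; (void)k; (void)op; }
static Op *map_find( void *m, ber_int_t k ) { (void)m; (void)k; return NULL; }

/* Request path: allocate an upstream msgid and register the op. */
Op *
op_create( Connection *client, Connection *upstream, ber_int_t client_msgid )
{
    Op *op = calloc( 1, sizeof(Op) ); /* or taken from the freelist cache */

    op->o_client = client;
    op->o_upstream = upstream;
    op->o_client_msgid = client_msgid;

    pthread_mutex_lock( &upstream->c_map_mutex );
    op->o_upstream_msgid = upstream->c_next_msgid++;
    map_insert( upstream->c_ops, op->o_upstream_msgid, op );
    pthread_mutex_unlock( &upstream->c_map_mutex );

    pthread_mutex_lock( &client->c_map_mutex );
    map_insert( client->c_ops, op->o_client_msgid, op );
    pthread_mutex_unlock( &client->c_map_mutex );

    return op;
}

/* Response path: resolve the op by the upstream msgid; the response is
 * then re-emitted to the client with op->o_client_msgid instead. */
Op *
op_resolve( Connection *upstream, ber_int_t upstream_msgid )
{
    Op *op;

    pthread_mutex_lock( &upstream->c_map_mutex );
    op = map_find( upstream->c_ops, upstream_msgid );
    pthread_mutex_unlock( &upstream->c_map_mutex );
    return op;
}
```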

Managing upstreams:
- async connect up to min_connections (is there a point in having a connection
  count range if we can't use it when needed, since all of the below is async?)
- when connected, set up TLS (if requested)
- when done, send a bind
- go through the bind interaction
- when done, add the connection to the upstream's connection list
- (if a connection is suspended or connections are over 75 % of the op limit,
  schedule setting up a new connection unless the connection limit has been
  hit)
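
A sketch of the scale-up check from the last point; the structure fields and
limits are illustrative, only the 75 % threshold is taken from the note above:

```c
/* Illustrative per-server connection list with an op counter and a
 * suspended flag, plus configured limits. */
typedef struct Conn {
    struct Conn *next;
    int ops_in_flight;
    int suspended;
} Conn;

typedef struct Server {
    Conn *conns;
    int n_conns;
    int max_conns;          /* hard connection limit */
    int max_ops_per_conn;
} Server;

/* Decide whether another upstream connection should be scheduled. */
int
should_open_new_connection( Server *srv )
{
    Conn *c;

    if ( srv->n_conns >= srv->max_conns )
        return 0;

    for ( c = srv->conns; c != NULL; c = c->next ) {
        /* any suspended connection, or any connection over 75 % of its
         * op limit, is a reason to start setting up a new one */
        if ( c->suspended ||
                4 * c->ops_in_flight > 3 * srv->max_ops_per_conn )
            return 1;
    }
    return 0;
}
```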

Managing timeouts:
- two options:
  - maintain a separate locked priority queue giving a total ordering on when
    each operation is due to time out; this means maintaining yet another
    place where operations can be found
    - the locking protocol for disposing of an operation would need to be
      adjusted and might become even more complicated; we might do the
      alternative below initially and attempt this if it helps performance
  - just do a sweep over all clients (that mutex is less contended) every so
    often; with many in-flight operations this might be a lot of wasted work
    - we still need to sweep over all clients to check whether they should be
      killed anyway
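
A sketch of the sweep option, assuming each client keeps a list of its pending
ops with a start timestamp; every name and field here is illustrative:

```c
#include <time.h>
#include <pthread.h>

typedef struct TOp {
    struct TOp *next;
    time_t started;
    int abandoned;
} TOp;

typedef struct TClient {
    struct TClient *next;
    pthread_mutex_t mutex;
    TOp *ops;
    time_t last_activity;
} TClient;

/* Walk all clients every so often: abandon ops that exceeded the op
 * timeout and flag idle clients for a notice of disconnection. */
void
timeout_sweep( TClient *clients, time_t op_timeout, time_t idle_timeout )
{
    time_t now = time( NULL );
    TClient *c;
    TOp *op;

    for ( c = clients; c != NULL; c = c->next ) {
        pthread_mutex_lock( &c->mutex );
        for ( op = c->ops; op != NULL; op = op->next ) {
            if ( !op->abandoned && now - op->started > op_timeout ) {
                op->abandoned = 1;
                /* send an abandon upstream and a timeLimitExceeded
                 * response to the client (omitted) */
            }
        }
        if ( c->ops == NULL && now - c->last_activity > idle_timeout ) {
            /* send a notice of disconnection and close (omitted) */
        }
        pthread_mutex_unlock( &c->mutex );
    }
}
```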

Dispatcher thread (2^n of them, fd x is handled by thread no. x % 2^n):
- poll on all registered fds
- remove each fd that is ready from the registered list and schedule the work
- worker threads can put their fd back in if they deem it necessary (= not
  suspended)
- this works as a poor man's edge-triggered polling; with enough workers,
  should we do proper edge-triggered I/O? What about non-Linux?
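
The fd-to-dispatcher assignment is a simple hash; a sketch, assuming the
dispatcher count is kept at a power of two so the modulo reduces to a bit
mask:

```c
#define N_DISPATCHERS 4     /* must stay a power of two */

/* Map a file descriptor to the dispatcher thread that owns it.  Since
 * N_DISPATCHERS is a power of two,
 * fd % N_DISPATCHERS == fd & (N_DISPATCHERS - 1). */
static inline int
dispatcher_for_fd( int fd )
{
    return fd & ( N_DISPATCHERS - 1 );
}
```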

Listener thread:
- slapd has just one, which then reassigns the sockets to separate I/O threads

Threading:
- if using slap_sl_malloc, how much performance do we gain? To allocate a
  context per op, we should have a dedicated parent context so that when we
  free it, we can use that exclusively. The parent context's parent would be
  the main thread's context. This implies a lot of
  slap_sl_mem_setctx/slap_sl_mem_create( ,,, 0 ) and making sure an op does
  not allocate/free things from two threads at the same time (might need an Op
  mutex after all? Not such a huge cost if we routinely reuse Op structures)

Locking policy:
- read mutexes are unnecessary, we only have one thread receiving data from a
  connection - the one started from the dispatcher
- two reference counters on operation structures (an op is accessible from the
  client and the upstream map, each counter is consistent when a thread holds
  a lock on the corresponding map); when a counter drops to zero, start the
  freeing procedure (see the sketch after this list)
- a place to mark disposal finished for each side, consistency enforced by
  holding the freelist lock when reading/manipulating it
- when an op is created, we already have a write lock on the upstream socket
  and map: start writing, insert into the upstream map with upstream refcount
  1, unlock, lock the client, insert (client refcount 0), unlock, lock the
  upstream, decrement the refcount (which triggers a test whether we need to
  drop it now), unlock the upstream, done
- when the upstream side processes a PDU, it locks its map, increments the
  counter (potentially removing the op if it is a final response), unlocks,
  then locks the client's map, the write mutex (in this order?) and the full
  client mutex (if it is a bind response)
- when the client side wants to work with a PDU (abandon, (un)bind), it locks
  its map, increases the refcount, unlocks, locks the upstream map and write
  mutex, sends or queues the abandon, unlocks the write mutex and initiates
  the freeing procedure from the upstream side (or, if we would have to
  remember that we already increased the client-side refcount, marks it for
  deletion, drops the upstream lock, locks the client, decrements the
  refcount, either triggering deletion from the client side or marking it for
  deletion)
- if we have a per-operation lock, we can simplify this a bit (no need for the
  three-stage locking above)
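
A sketch of the two-counter scheme from the second point above; the struct
layout and helper names are illustrative, each counter is only touched while
the corresponding map mutex is held, and the per-side "released" flags under
the shared freelist mutex mirror the "place to mark disposal finished"
point:

```c
#include <stdlib.h>
#include <pthread.h>

typedef struct RefOp {
    int client_refcnt;          /* protected by the client map mutex */
    int upstream_refcnt;        /* protected by the upstream map mutex */
    int client_released;        /* protected by the freelist mutex */
    int upstream_released;      /* protected by the freelist mutex */
    pthread_mutex_t *client_map_mutex;
    pthread_mutex_t *upstream_map_mutex;
} RefOp;

static pthread_mutex_t freelist_mutex = PTHREAD_MUTEX_INITIALIZER;

static void
op_destroy( RefOp *op )
{
    free( op );     /* or return it to the freelist cache */
}

/* Called once a side's refcount has dropped to zero and the op has been
 * removed from that side's map; the freelist mutex arbitrates which of
 * the two sides performs the final destruction. */
static void
op_mark_released( RefOp *op, int from_client )
{
    int destroy;

    pthread_mutex_lock( &freelist_mutex );
    if ( from_client )
        op->client_released = 1;
    else
        op->upstream_released = 1;
    destroy = op->client_released && op->upstream_released;
    pthread_mutex_unlock( &freelist_mutex );

    if ( destroy )
        op_destroy( op );
}

/* Drop the client-side reference (the upstream side is symmetric). */
void
op_release_client( RefOp *op )
{
    int zero;

    pthread_mutex_lock( op->client_map_mutex );
    zero = ( --op->client_refcnt == 0 );
    /* the op would also be removed from the client map here */
    pthread_mutex_unlock( op->client_map_mutex );

    if ( zero )
        op_mark_released( op, 1 );
}
```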

Shutdown:
- stop the accept() thread(s) - potentially add a channel to hand the
  listening sockets over for a zero-downtime restart
- if very gentle, mark connections as closing, start a timeout and:
  - when a new non-abandon PDU comes in from a client, return LDAP_UNAVAILABLE
  - when receiving a PDU from an upstream, send it over to the client; if no
    ops are pending, send an unsolicited response and close (RFC 4511 suggests
    the unsolicited response is the last PDU coming from the server and
    libldap agrees, so we can't send it for a socket we want to shut down more
    gracefully)
- gentle (or very gentle timed out):
  - set a timeout
  - mark all ops as abandoned
  - send an unbind to all upstreams
  - send an unsolicited response to all clients
- imminent (or gentle timed out):
  - async close all connections?
  - exit()
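
A sketch of building the unsolicited Notice of Disconnection mentioned above
with liblber; the result code and diagnostic message are illustrative choices,
while the msgid 0 and the responseName OID 1.3.6.1.4.1.1466.20036 come from
RFC 4511:

```c
#include <ldap.h>
#include <lber.h>

/* Build the Notice of Disconnection PDU: an ExtendedResponse with
 * msgid 0 carrying the OID from RFC 4511 section 4.4.1.  The caller
 * writes the returned BER to the client and then closes the socket. */
BerElement *
build_notice_of_disconnection( void )
{
    BerElement *ber = ber_alloc_t( LBER_USE_DER );

    if ( ber == NULL )
        return NULL;

    ber_printf( ber, "{it{ess" /*}}*/,
            0,                              /* msgid 0: unsolicited */
            LDAP_RES_EXTENDED,
            LDAP_UNAVAILABLE,               /* resultCode */
            "",                             /* matchedDN */
            "connection is being closed" ); /* diagnosticMessage */
    ber_printf( ber, "ts", LDAP_TAG_EXOP_RES_OID,
            "1.3.6.1.4.1.1466.20036" );     /* responseName */
    ber_printf( ber, /*{{*/ "}}" );

    return ber;
}
```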

RootDSE:
- the default option is not to care: if a control/exop has special
  restrictions, it is the admin's job to flag it as such in the load
  balancer's config
- another option is not to care about the search request but to check each
  search entry being passed back: check the DN and, if it is the rootDSE,
  filter the list of controls/exops/SASL mechs (EXTERNAL!) that are supported
- the last option is to check all search requests for the DN/scope and
  synthesise the response locally - probably not (we would need to configure
  the complete list of controls, exops, SASL mechs and naming contexts in the
  balancer)

Potential red flags:
- we suspend upstreams; if we ever suspend clients, we need to be sure we
  can't create dependency cycles
  - is this an issue when only suspending the read side of each? Because even
    if we stop reading from everything, we should eventually flush data to
    those we can still talk to; as upstreams are flushed, we can start sending
    new requests from live clients (those that stay suspended are so due to
    their own inability to accept data)
  - we might need to suspend a client if there is a reason to choose a
    particular upstream (multi-request operation - bind, VC, PR, TXN, ...)
    - a SASL bind, but then there are no outstanding ops to receive, so it
      holds that !suspended(client) || !suspended(upstream) and the two cannot
      participate in a cycle
    - VC - multiple binds at the same time - !!! more analysis needed
    - PR - should only be able to have one per connection (that's a problem
      for later, maybe it even needs a dedicated upstream connection)
    - TXN - ??? probably the same situation as PR
  - alternatively, if we have a queue of pending BERs on the server, we need
    not suspend clients: an upstream is only chosen if the queue is free or
    there is a reason to send the request to that particular upstream
    (multi-stage bind/VC, PR, ...), but that still makes it possible for a
    client to exhaust all our memory by sending requests (VC or other ones
    bound to a slow upstream, or by not reading the responses at all)