1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
|
TODO:
- [ ] keep a global op in-flight counter? (might need locking)
- [-] scheduling (who does what, more than one select thread? How does the proxy
work get distributed between threads?)
- [ ] managing timeouts?
- [X] outline locking policy: seems like there might be a lock inversion in the
design looming: when working with op, might need a lock on both client and
upstream but depending on where we started, we might want to start with
locking one, then other
- [ ] how to deal with the balancer running out of fds? Especially when we hit
the limit, then lose an upstream connection and accept() a client, we
wouldn't be able to initiate a new one. A bit of a DoS... But probably not
a concern for Ericsson
- [ ] non-Linux? No idea how anything other than poll works (moot if building a
libevent/libuv-based load balancer since they take care of that, except
edge-triggered I/O?)
- [-] rootDSE? Controls and exops might have different semantics and need
binding to the same upstream connection.
- [ ] Just piggybacking on OpenLDAP as a module? Would still need some updates
in the core and the module/subsystem would be a very invasive one. On the
other hand, allows to expose live configuration and monitoring over LDAP
over the current slapd listeners without re-inventing the wheel.
Expecting to handle only LDAPv3
terms:
server - configured target
upstream - a single connection to a server
client - an incoming connection
To maintain fairness `G( requested => ( F( progressed | failed ) ) )`, use
queues and put timeouts in
Runtime organisation
------
- main thread with its own event base handling signals
- one thread (later possibly more) listening on the rendezvous sockets, handing
the new sockets to worker threads
- n worker threads dealing with client and server I/O (dispatching actual work
to the thread pool most likely)
- a thread pool to handle actual work
Operational behaviour
------
- client read -> upstream write:
- client read:
- if TLS_SETUP, keep processing, set state back when finished and note that
we're under TLS
- ber_get_next(), if we don't have a tag, finished (unless we have true
edge-triggered I/O, also put the fd back into the ones we're waiting for)
- peek at op tag:
- unbind:
- with a single lock, mark all pending ops in upstreams abandoned, clear
client link (would it be fast enough if we remove them from upstream
map instead?)
- locked per op:
- remove op from upstream map
- check upstream is not write-suspended, if it is ...
- try to write the abandon op to upstream, suspend upstream if not
fully sent
- remove op from client map (how if we're in avl_apply?, another pass?)
- would be nice if we could wipe the complete client map then, otherwise
we need to queue it to have it freed when all abandons get passed onto
the upstream (just dropping them might put extra strain on upstreams,
will probably have a queue on each client/upstream anyway, not just a
single Ber)
- bind:
- check mechanism is not EXTERNAL (or implement it)
- abandon existing ops (see unbind)
- set state to BINDING, put DN into authzid
- pick upstream, create PDU and sent
- abandon:
- find op, mark for abandon, send to appropriate upstream
- Exop:
- check not BINDING (unless it's a cancel?)
- check OID:
- STARTTLS:
- check we don't have TLS yet
- abandon all
- set state to TLS_SETUP
- send the hello
- VC(?):
- similar to bind except for the abandons/state change
- other:
- check not BINDING
- pick an upstream
- create a PDU, send (marking upstream suspended if not written in full)
- check if should read again (keep a counter of number of times to read
off a connection in a single pass so that we maintain fairness)
- if read enough requests and can still read, re-queue ourselves (if we
don't have true edge-triggered I/O, we can just register the fd again)
- upstream write (only when suspended):
- flush the current BER
- there shouldn't be anything else?
- upstream read -> client write:
- upstream read:
- ber_get_next(), if we don't have a tag, finished (unless we have true
edge-triggered I/O, also put the fd back into the ones we're waiting for)
- when we get it, peek at msgid, resolve client connection, lock, check:
- if unsolicited, handle as close (and mark connection closing)
- if op is abandoned or does not exist, drop PDU and op, update counters
- if client backlogged, suspend upstream, register callback to unsuspend
(on progress when writing to client or abandon from client (connection
death, abandon proper, ...))
- reconstruct final PDU, write BER to client, if did not write fully,
suspend client
- if a final response, decrement operation counts on upstream and client
- check if should read again (keep a counter of number of responses to read
off a connection in a single pass so that we don't starve any?)
- client write ready (only checked for when suspended):
- write the rest of pending BER if any
- on successful write, pick all pending ops that need failure response, push
to client (are there any controls that need to be present in response even
in the case of failure?, what to do with them?)
- on successfully flushing them, walk through suspended upstreams, picking
the pending PDU (unsuspending the upstream) and writing, if PDU flushed
successfully, pick next upstream
- if we successfully flushed all suspended upstreams, unsuspend client
(and disable the write callback)
- upstream close/error:
- look up pending ops, try to write to clients, mark clients suspended that
have ops that need responses (another queue associated with client to speed
up?)
- schedule a new connection open
- client close/error:
- same as unbind
- client inactive (no pending ops and nothing happened in x seconds)
- might just send notice of disconnection and close
- op timeout handling:
- mark for abandon
- send abandon
- send timeLimitExceeded/adminLimitExceeded to client
Picking an upstream:
- while there is a level available:
- pick a random ordering of upstreams based on weights
- while there is an upstream in the level:
- check number of ops in-flight (this is where we lock the upstream map)
- find the least busy connection (and check if a new connection should be
opened)
- try to lock for socket write, if available (no BER queued) we have our
upstream
PDU processing:
- request (have an upstream selected):
- get new msgid from upstream
- create an Op structure (actually, with the need for freelist lock, we can
make it a cache for freed operation structures, avoiding some malloc
traffic, to reset, we need slap_sl_mem_create( ,,, 1 ))
- check proxyauthz is not present? or just let upstream reject it if there are
two?
- add own controls at the end:
- construct proxyauthz from authzid
- construct session tracking from remote IP, own name, authzid
- send over
- insert Op into client and upstream maps
- response/intermediate/entry:
- look up Op in upstream's map
- write old msgid, rest of the response can go unchanged
- if a response, remove Op from all maps (client and upstream)
Managing upstreams:
- async connect up to min_connections (is there a point in having a connection
count range if we can't use it when needed since all of the below is async?)
- when connected, set up TLS (if requested)
- when done, send a bind
- go for the bind interaction
- when done, add it to the upstream's connection list
- (if a connection is suspended or connections are over 75 % op limit, schedule
creating a new connection setup unless connection limit has been hit)
Managing timeouts:
- two options:
- maintain a separate locked priority queue to give a perfect ordering to when
each operation is to time out, would need to maintain yet another place
where operations can be found.
- the locking protocol for disposing of the operation would need to be
adjusted and might become even more complicated, might do the alternative
initially and then attempt this if it helps performance
- just do a sweep over all clients (that mutex is less contended) every so
often. With many in-flight operations might be a lot of wasted work.
- we still need to sweep over all clients to check if they should be killed
anyway
Dispatcher thread (2^n of them, fd x is handled by thread no x % (2^n)):
- poll on all registered fds
- remove each fd that's ready from the registered list and schedule the work
- work threads can put their fd back in if they deem necessary (=not suspended)
- this works as a poor man's edge-triggered polling, with enough workers, should
we do proper edge triggered I/O? What about non-Linux?
Listener thread:
- slapd has just one, which then reassigns the sockets to separate I/O
threads
Threading:
- if using slap_sl_malloc, how much perf do we gain? To allocate a context per
op, we should have a dedicated parent context so that when we free it, we can
use that exclusively. The parent context's parent would be the main thread's
context. This implies a lot of slap_sl_mem_setctx/slap_sl_mem_create( ,,, 0 )
and making sure an op does not allocate/free things from two threads at the
same time (might need an Op mutex after all? Not such a huge cost if we
routinely reuse Op structures)
Locking policy:
- read mutexes are unnecessary, we only have one thread receiving data from the
connection - the one started from the dispatcher
- two reference counters of operation structures (an op is accessible from
client and upstream map, each counter is consistent when thread has a lock on
corresponding map), when decreasing the counter to zero, start freeing
procedure
- place to mark disposal finished for each side, consistency enforced by holding
the freelist lock when reading/manipulating
- when op is created, we already have a write lock on upstream socket and map,
start writing, insert to upstream map with upstream refcount 1, unlock, lock
client, insert (client refcount 0), unlock, lock upstream, decrement refcount
(triggers a test if we need to drop it now), unlock upstream, done
- when upstream processes a PDU, locks its map, increments counter, (potentially
removes if it's a response), unlocks, locks client's map, write mutex (this
order?) and full client mutex (if a bind response)
- when client side wants to work with a PDU (abandon, (un)bind), locks its map,
increase refcount, unlocks, locks upstream map, write mutex, sends or queues
abandon, unlocks write mutex, initiates freeing procedure from upstream side
(or if having to remember we've already increased client-side refcount, mark
for deletion, lose upstream lock, lock client, decref, either triggering
deletion from client or mark for it)
- if we have operation lock, we can simplify a bit (no need for three-stage
locking above)
Shutdown:
- stop accept() thread(s) - potentially add a channel to hand these listening
sockets over for zero-downtime restart
- if very gentle, mark connections as closing, start timeout and:
- when a new non-abandon PDU comes in from client - return LDAP_UNAVAILABLE
- when receiving a PDU from upstream, send over to client, if no ops pending,
send unsolicited response and close (RFC4511 suggests unsolicited response
is the last PDU coming from the upstream and libldap agrees, so we can't
send it for a socket we want to shut down more gracefully)
- gentle (or very gentle timed out):
- set timeout
- mark all ops as abandoned
- send unbind to all upstreams
- send unsolicited to all clients
- imminent (or gentle timed out):
- async close all connections?
- exit()
RootDSE:
- default option is not to care and if a control/exop has special restrictions,
it is the admin's job to flag it as such in the load-balancer's config
- another is not to care about the search request but check each search entry
being passed back, check DN and if it's a rootDSE, filter the list of
controls/exops/sasl mechs (external!) that are supported
- last one is to check all search requests for the DN/scope and synthesise the
response locally - probably not (would need to configure the complete list of
controls, exops, sasl mechs, naming contexts in the balancer)
Potential red flags:
- we suspend upstreams, if we ever suspend clients we need to be sure we can't
create dependency cycles
- is this an issue when only suspending the read side of each? Because even if
we stop reading from everything, we should eventually flush data to those we
can still talk to, as upstreams are flushed, we can start sending new
requests from live clients (those that are suspended are due to their own
inability to accept data)
- we might need to suspend a client if there is a reason to choose a
particular upstream (multi-request operation - bind, VC, PR, TXN, ...)
- a SASL bind, but that means there are no outstanding ops to receive
it holds that !suspended(client) \or !suspended(upstream), so they
cannot participate in a cycle
- VC - multiple binds at the same time - !!! more analysis needed
- PR - should only be able to have one per connection (that's a problem
for later, maybe even needs a dedicated upstream connection)
- TXN - ??? probably same situation as PR
- or if we have a queue for pending Bers on the server, we not need to suspend
clients, upstream is only chosen if the queue is free or there is a reason
to send it to that particular upstream (multi-stage bind/VC, PR, ...), but
that still makes it possible for a client to exhaust all our memory by
sending requests (VC or other ones bound to a slow upstream or by not
reading the responses at all)
|