2018-02-21 - Layering in haproxy 1.9
------------------------------------

2 main zones :
  - application : reads from conn_streams, writes to conn_streams, often uses
    streams

  - connection : receives data from the network, presents them into buffers
    made available via conn_streams, sends data to the network


The connection zone contains multiple layers which behave independently in each
direction. The Rx direction is activated upon callbacks from the lower layers.
The Tx direction is activated recursively from the upper layers. Between every
two layers there may be a buffer, in each direction. When a buffer is full in
either the Tx or Rx direction, that direction is paused between the network
layer and the location where the congestion is encountered. Upon end of
congestion (cs_recv() from the upper layer, or sendto() at the lower layers), a
tasklet_wakeup() is performed on the blocked layer so that suspended operations
can be resumed. In this case, the Rx side restarts propagating data upwards
from the lowest blocked level, while the Tx side restarts propagating data
downwards from the highest blocked level. Proceeding like this ensures that
information known to the producer may always be used to tailor the buffer sizes
or decide on a strategy to best aggregate data. Additionally, each time a layer
is crossed without transformation, it becomes possible to send without copying.
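
To make this wakeup principle more concrete, here is a minimal stand-alone
sketch of the Rx side. Every type and name below (sbuf, layer, rx_resume,
layer_rx_unblock) is a hypothetical simplification used for illustration only;
in haproxy the resume callback would really be a tasklet woken up with
tasklet_wakeup().

  /* Sketch : wake the blocked producer once the congestion is gone.
   * All structures below are simplified stand-ins, not real haproxy types.
   */
  #include <stddef.h>

  struct sbuf {
          char   data[1024];
          size_t len;                     /* bytes currently stored */
  };

  struct layer {
          struct sbuf rxbuf;              /* buffer towards the upper layer */
          int         rx_blocked;         /* set when rxbuf was full */
          void      (*rx_resume)(struct layer *l);  /* Rx tasklet stand-in */
  };

  /* Called by the upper layer once it has consumed data from l->rxbuf (end
   * of congestion) : if the producer below was paused because the buffer
   * was full, wake it so that it restarts pushing data upwards.
   */
  static void layer_rx_unblock(struct layer *l)
  {
          if (l->rx_blocked && l->rxbuf.len < sizeof(l->rxbuf.data)) {
                  l->rx_blocked = 0;
                  l->rx_resume(l);        /* real code : tasklet_wakeup() */
          }
  }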

The Rx side notifies the application of data readiness using a wakeup or a
callback. The Tx side notifies the application of room availability once data
have been moved resulting in the uppermost buffer having some free space.

When crossing a mux downwards, it is possible that the sender is not allowed to
access the buffer because it is not yet its turn. This is not a problem : the
data remain in the conn_stream's buffer (or the stream's one) and the send will
be restarted once the mux is ready to consume these data.


          cs_recv()        -------.           cs_send()
     ^          +-------->  |||||| -------------+       ^
     |          |          -------'             |       |             stream
   --|----------|-------------------------------|-------|-------------------
     |          |                               V       |         connection
    data      .---.                           |   |    room
    ready!    |---|                           |---|    available!
              |---|                           |---|
              |---|                           |---|
              |   |                           '---'
                ^   +------------+-------+      |
                |   |            ^       |      /
                /   V            |       V      /
                / recvfrom()     |     sendto() |
   -------------|----------------|--------------|---------------------------
                |                | poll!        V                     kernel


The cs_recv() function should act on pointers to buffer pointers, so that the
callee may decide to pass its own buffer directly by simply swapping pointers.
Similarly for cs_send() it is desirable to let the callee steal the buffer by
swapping the pointers. This way it remains possible to implement zero-copy
forwarding.
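
As an illustration of this pointer-swapping idea, here is a hedged sketch. The
struct buffer layout and the sketch_cs_recv() prototype are assumptions made
for this example only and do not describe the final API : <dst> is the
caller's buffer pointer, <src> the callee's.

  /* Sketch : zero-copy receive by swapping buffer pointers when the
   * destination is empty, otherwise a bounded copy. Simplified types only.
   */
  #include <stddef.h>
  #include <string.h>

  struct buffer {
          char  *area;                    /* storage */
          size_t size;                    /* allocated bytes */
          size_t data;                    /* used bytes */
  };

  static size_t sketch_cs_recv(struct buffer **dst, struct buffer **src)
  {
          if (!(*dst)->data) {
                  /* destination empty : steal the source buffer, zero copy */
                  struct buffer *tmp = *dst;

                  *dst = *src;
                  *src = tmp;
                  return (*dst)->data;
          }

          /* otherwise merge as much as fits into the destination */
          size_t room = (*dst)->size - (*dst)->data;
          size_t copy = (*src)->data < room ? (*src)->data : room;

          memcpy((*dst)->area + (*dst)->data, (*src)->area, copy);
          (*dst)->data += copy;
          memmove((*src)->area, (*src)->area + copy, (*src)->data - copy);
          (*src)->data -= copy;
          return copy;
  }

cs_send() would be symmetrical, letting the callee steal the caller's buffer
by the same pointer swap.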

Some operation flags will be needed on cs_recv() :
  - RECV_ZERO_COPY : refuse to merge new data into the current buffer if it
    will result in a data copy (ie the buffer is not empty), unless no more
    than XXX bytes have to be copied (eg: copying 2 cache lines may be cheaper
    than waiting and playing with pointers)

  - RECV_AT_ONCE : only perform the operation if it will result in the source
    buffer becoming empty at the end of the operation, so that no two buffers
    remain allocated afterwards. It will most of the time result in either a
    small read or a zero-copy operation.

  - RECV_PEEK : retrieve a copy of pending data without removing these data
    from the source buffer. Maybe an alternate solution could consist in
    finding the pointer to the source buffer and accessing these data directly,
    except that it might be less interesting for the long term, thread-wise.

  - RECV_MIN : receive minimum X bytes (or less with a shutdown), or fail.
    This should help various protocol parsers which need to receive a complete
    frame before proceeding.

  - RECV_ENOUGH : no more data expected after this read if it's of the
    requested size, thus no need to re-enable receiving on the lower layers.

  - RECV_ONE_SHOT : perform a single read without re-enabling reading on the
    lower layers, like we currently do when receiving an HTTP/1 request. Like
    RECV_ENOUGH where any size is enough. The two could probably be merged
    (eg: by having a MIN argument like RECV_MIN).


Some operation flags will be needed on cs_send() :
  - SEND_ZERO_COPY : refuse to merge the presented data with existing data and
    prefer to wait for current data to leave and try again, unless the consumer
    considers the amount of data acceptable for a copy.

  - SEND_AT_ONCE : only perform the operation if it will result in the source
    buffer becoming empty at the end of the operation, so that no two buffers
    remain allocated afterwards. It will most of the time result in either a
    small write or a zero-copy operation.


Both operations should return a composite status :
  - number of bytes transferred
  - status flags (shutr, shutw, reset, empty, full, ...)
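
For illustration only, the flags and the composite status above might be
encoded as in the following sketch; all names and values are assumptions, not
a committed API.

  /* Sketch : hypothetical flag and result definitions for cs_recv()/cs_send() */
  #include <stddef.h>
  #include <stdint.h>

  enum {                                  /* cs_recv() operation flags */
          RECV_ZERO_COPY = 0x0001,        /* no merge unless copy is tiny */
          RECV_AT_ONCE   = 0x0002,        /* only if the source ends up empty */
          RECV_PEEK      = 0x0004,        /* copy without consuming */
          RECV_MIN       = 0x0008,        /* at least <min> bytes or fail */
          RECV_ENOUGH    = 0x0010,        /* no need to re-enable lower layers */
          RECV_ONE_SHOT  = 0x0020,        /* single read, no re-enabling */
  };

  enum {                                  /* cs_send() operation flags */
          SEND_ZERO_COPY = 0x0001,        /* no merge, wait for data to leave */
          SEND_AT_ONCE   = 0x0002,        /* only if the source ends up empty */
  };

  enum {                                  /* status flags in the result */
          CS_IO_SHUTR = 0x01,
          CS_IO_SHUTW = 0x02,
          CS_IO_RESET = 0x04,
          CS_IO_EMPTY = 0x08,
          CS_IO_FULL  = 0x10,
  };

  struct cs_io_result {
          size_t   xfered;                /* bytes (or messages) transferred */
          uint32_t flags;                 /* CS_IO_* status flags */
  };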


2018-07-23 - Update after merging rxbuf
---------------------------------------

It becomes visible that the mux will not always be welcome to decode incoming
data because doing so will sometimes imply extra memory copies and/or extra
memory usage for no benefit.

Ideally, when a stream is instantiated based on incoming data, these incoming
data should be passed and the upper layers called, but it should then be up to
these upper layers to peek more data in certain circumstances. Typically if the
pending connection data are larger than what is expected to be passed above, it
means some data may cause head-of-line blocking (HOL) for other streams, and
need to be pushed up through the layers to let other streams continue to work.
Similarly, very large H2 data frames following header frames should probably
not be passed immediately as they may require copies that could be avoided if
passed later. However if the decoded frame fits into the conn_stream's buffer,
there is an opportunity to use a single buffer for the conn_stream and the
channel. The H2 demux could set a blocking flag indicating it's waiting for the
upper stream to take over demuxing. This flag would be purged once the upper
stream starts reading, or when extra data come and change the conditions.

Forcing structured headers and raw data to coexist within a single buffer is
quite challenging for many code parts. For example it's perfectly possible to
see a fragmented buffer containing series of headers, then a small data chunk
that was received at the same time, then a few other headers added by request
processing, then another data block received afterwards, then possibly yet
another header added by option http-send-name-header, and yet another data
block. This causes some pain for compression, which still needs to know where
compressed and uncompressed data start and stop. It also makes it very
difficult to account for the exact bytes to pass through the various layers.

One solution consists in thinking about buffers using 3 representations :

  - a structured message, which is used for the internal HTTP representation.
    This message may only be atomically processed. It has no clear byte count,
    it's a message.

  - a raw stream, consisting of sequences of bytes. That's typically what
    happens in data sequences or in tunnel.

  - a pipe, which contains data to be forwarded, and that haproxy cannot have
    access to.

The processing efficiency decreases as the complexity of the representation
increases, but the capabilities increase. The structured message can contain
anything, including serialized data blocks to be processed or forwarded. The
raw stream contains data blocks to be processed or forwarded. The pipe only
contains data blocks to be forwarded. The latter ones are only an optimization
of the former ones.

Thus ideally a channel should have access to these 3 storage areas at once,
depending on the use case :
  (1) a structured message,
  (2) a raw stream,
  (3) a pipe
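
A hypothetical layout making this enumeration concrete is sketched below; the
structure and field names are illustrative only, not an existing or proposed
data structure.

  /* Sketch : one channel viewing the three storage areas at once. */
  #include <stddef.h>

  struct sk_msg  { size_t msgs;  };       /* (1) structured message stand-in */
  struct sk_raw  { size_t bytes; };       /* (2) raw byte stream stand-in */
  struct sk_pipe { int fds[2]; size_t bytes; }; /* (3) pipe, opaque to haproxy */

  struct sketch_channel {
          struct sk_msg  *msg;            /* atomic unit, no clear byte count */
          struct sk_raw  *raw;            /* bytes to process or forward */
          struct sk_pipe *pipe;           /* data to forward only */
  };

  /* bytes queued for forwarding; the structured message is counted as a
   * whole message rather than in bytes, so it is not added here.
   */
  static size_t sketch_channel_pending_bytes(const struct sketch_channel *chn)
  {
          size_t total = 0;

          if (chn->raw)
                  total += chn->raw->bytes;
          if (chn->pipe)
                  total += chn->pipe->bytes;
          return total;
  }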

Right now a channel only has (2) and (3) but after the native HTTP rework, it
will only have (1) and (3). Placing a raw stream exclusively in (1) comes with
some performance drawbacks which are not easily recovered, and with some quite
difficult management still involving the reserve to ensure that a data block
doesn't prevent headers from being appended. But during header processing, the
payload may be necessary so we cannot decide to drop this option.

A long-term approach would consist in ensuring that a single channel may have
access to all 3 representations at once, and in enumerating priority rules to
define how they interact together. That's exactly what is currently being done
with the pipe and the raw buffer. Doing so would also save the need for storing
the payload in the structured message and remove the requirement for the
reserve. But it would cost more memory to process POST data and server
responses. Thus an intermediary step consists in keeping this model in mind but
not implementing everything yet.

Short term proposal : a channel has access to a buffer and a pipe. A non-empty
buffer is either in structured message format OR raw stream format. Only the
channel knows. However a structured buffer MAY contain raw data in a properly
formatted way (using the envelope defined by the structured message format).

By default, when a demux writes to a CS rxbuf, it will try to use the lowest
possible level for what is being done (i.e. splice if possible, otherwise raw
stream, otherwise structured message). If the buffer already contains a
structured message, then this format is exclusive. From this point the MUX has
two options : either encode the incoming data to match the structured message
format, or refrain from receiving into the CS's rxbuf and wait until the upper
layer requests those data.
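
The default level selection described above can be summarized by the following
hypothetical helper; the names are illustrative only.

  /* Sketch : pick the lowest usable level when the demux fills a CS rxbuf. */
  enum sk_level { SK_LVL_SPLICE, SK_LVL_RAW, SK_LVL_STRUCT };

  static enum sk_level sketch_pick_rx_level(int can_splice, int buf_has_msg)
  {
          if (buf_has_msg)
                  return SK_LVL_STRUCT;   /* exclusive : encode or refrain */
          if (can_splice)
                  return SK_LVL_SPLICE;   /* lowest possible level */
          return SK_LVL_RAW;
  }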

This opens a simplified option which could be suited even for the long term :
  - cs_recv() will take one or two flags to indicate if a buffer already
    contains a structured message or not ; the upper layer knows it.

  - cs_recv() will take two flags to indicate what the upper layer is willing
    to take :
      - structured message only
      - raw stream only
      - any of them

    From this point the mux can decide to either pass anything or refrain from
    doing so.

  - the demux stores the knowledge it has about the contents into some CS
    flags to indicate whether or not some structured messages are still
    available, and whether or not some raw data are still available. Thus the
    caller knows whether or not extra data are available (see the flags sketch
    after this list).

  - when the demux works on its own, it refrains from passing structured data
    to a non-empty buffer, unless these data are causing trouble to other
    streams (HOL).

  - when a demux has to encapsulate raw data into a structured message, it will
    always have to respect a configured reserve so that extra header processing
    can be done on the structured message inside the buffer, regardless of the
    supposed available room. In addition, the upper layer may indicate using an
    extra recv() flag whether it wants the demux to defragment serialized data
    (for example by moving trailing headers apart) or if it's not necessary.
    This flag will be set by the stream interface if compression is required or
    if the http-buffer-request option is set, for example. Using to_forward==0
    is probably a stronger indication that the reserve must be respected.

  - cs_recv() and cs_send() when fed with a message, should not return byte
    counts but message counts (i.e. 0 or 1). This implies that a single call to
    either of these functions cannot mix raw data and structured messages at
    the same time.
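
The flag set evoked in the list above might look like the following sketch;
every name below is hypothetical and only illustrates the proposal.

  /* Sketch : hypothetical flags for the simplified option described above. */

  enum {                                  /* cs_recv() input flags */
          CO_RCV_BUF_MSG  = 0x01,         /* caller's buffer already holds a
                                           * structured message */
          CO_RCV_WANT_MSG = 0x02,         /* caller accepts structured messages */
          CO_RCV_WANT_RAW = 0x04,         /* caller accepts raw stream data */
          CO_RCV_WANT_ANY = CO_RCV_WANT_MSG | CO_RCV_WANT_RAW,
          CO_RCV_DEFRAG   = 0x08,         /* ask the demux to defragment
                                           * serialized data */
  };

  enum {                                  /* CS flags reported by the demux */
          CS_FL_MSG_PENDING = 0x01,       /* structured message(s) still below */
          CS_FL_RAW_PENDING = 0x02,       /* raw data still available below */
  };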

At this point it looks like the conn_stream will have some encapsulation work
to do for the payload if it needs to be encapsulated into a message. This
further magnifies the importance of *not* decoding DATA frames into the CS's
rxbuf until really needed.

The CS will probably need to hold an indication of what is available at the mux
level, not only in the CS. Eg: we know that payload is still available.

Using these elements, it should be possible to ensure that full header frames
may be received without enforcing any reserve, that too large frames which do
not fit will be detected because they return 0 messages while indicating that
such a message is still pending, and that data availability is correctly
detected (later we may expect that the stream-interface allocates a larger or
second buffer to place the payload).

Regarding the ability for the channel to forward data, it looks like having a
new function "cs_xfer(src_cs, dst_cs, count)" could be very productive in
optimizing the forwarding to make use of splicing when available. It is not yet
totally clear whether it will split into "cs_xfer_in(src_cs, pipe, count)"
followed by "cs_xfer_out(dst_cs, pipe, count)" or anything different, and it
still needs to be studied. The general idea seems to be that the receiver might
have to call the sender directly once they agree on how to transfer data (pipe
or buffer). If the transfer is incomplete, the cs_xfer() return value and/or
flags will indicate the current situation (src empty, dst full, etc) so that
the caller may register for notifications on the appropriate event and wait to
be called again to continue.
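
Under that uncertainty, a rough sketch of the shape such helpers could take is
given below; cs_xfer_in()/cs_xfer_out(), the stand-in types and the result
structure are all hypothetical.

  /* Sketch : possible split of the transfer helper discussed above. */
  #include <stddef.h>
  #include <stdint.h>

  struct sketch_cs;                       /* opaque conn_stream stand-in */
  struct sketch_pipe { int fds[2]; };     /* kernel pipe stand-in */

  enum {                                  /* reasons for an incomplete transfer */
          XFER_SRC_EMPTY = 0x01,          /* source had no more data */
          XFER_DST_FULL  = 0x02,          /* destination was full */
          XFER_SHUT      = 0x04,          /* a shutdown was met */
  };

  struct xfer_result {
          size_t   done;                  /* bytes effectively transferred */
          uint32_t flags;                 /* XFER_* : why we stopped */
  };

  /* pull up to <count> bytes from the source CS into the pipe ... */
  struct xfer_result cs_xfer_in(struct sketch_cs *src_cs,
                                struct sketch_pipe *pipe, size_t count);

  /* ... then push up to <count> bytes from the pipe into the destination CS */
  struct xfer_result cs_xfer_out(struct sketch_cs *dst_cs,
                                 struct sketch_pipe *pipe, size_t count);

The caller would inspect the returned flags to decide which event to subscribe
to before being called again.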

Short term implementation :
  1) add new CS flags to qualify what the buffer contains and what we expect
     to read into it;

  2) set these flags to pretend we have a structured message when receiving
     headers (after all, H1 is an atomic header as well) and see what it
     implies for the code; for H1 it's unclear whether it makes sense to try
     to set it without the H1 mux.

  3) use these flags to refrain from sending DATA frames after HEADERS frames
     in H2.

  4) flush the flags at the stream interface layer when performing a cs_send().

  5) use the flags to enforce receipt of data only when necessary.

We should be able to end up with sequential receipt in H2 modelling what is
needed for other protocols without interfering with the native H1 devs.


2018-08-17 - Considerations after killing cs_recv()
---------------------------------------------------

With the ongoing reorganisation of the I/O layers, it's visible that cs_recv()
will have to transfer data between the cs' rxbuf and the channel's buffer while
not being aware of the data format. Moreover, in case there's no data there, it
needs to recursively call the mux's rcv_buf() to trigger a decoding, while this
function is sometimes replaced with cs_recv(). All this shows that cs_recv() is
in fact needed while data are pushed upstream from the lower layers, and is not
suitable for the "pull" mode. Thus it was decided to remove this function and
put its code back into h2_rcv_buf(). The H1 mux's rcv_buf() already couldn't be
replaced with cs_recv() since it is the only one knowing about the buffer's
format.

This opportunity brought a simplification : if the cs's rxbuf is only read by
the mux's rcv_buf() method, then it doesn't need to be located in the CS and is
better placed in the mux's representation of the stream. This has an important
impact for H2 as it offers more freedom to the mux to allocate/free/reallocate
this buffer, and it ensures the mux always has access to it.

Furthermore, the conn_stream's txbuf experienced the same fate. Indeed, the H1
mux has already uncovered the difficulty related to the channel shutting down
on output, with data stuck in the CS's txbuf. Since the CS is tightly coupled
to the stream and the stream can close immediately once its buffers are empty,
this required a way to support orphaned CSes with pending data in their txbuf.
This is something that the H2 mux already has to deal with, by carefully
leaving the data in the channel's buffer. But since the snd_buf() call is
top-down, it is always possible to push the stream's data via the mux's
snd_buf() call without requiring a CS txbuf anymore. Thus the txbuf (when
needed) is only implemented in the mux and attached to the mux's representation
of the stream, and doing so allows the channel to be released immediately once
the data are safe in the mux's buffer.

This is an important change which clarifies the roles and responsibilities of
each layer in the chain : when receiving data from a mux, it's the mux's
responsibility to make sure it can correctly decode the incoming data and to
buffer the possible excess of data it cannot pass to the requester. This means
that decoding an H2 frame, which is not retryable since it has an impact on the
HPACK decompression context, and which cannot be reordered for the same reason,
simply needs to be performed into the H2 stream's rxbuf, which will then be
passed to the stream when the latter calls h2_rcv_buf(), even if it reads one
byte at a time. Similarly when calling h2_snd_buf(), it's the mux's
responsibility to read as much as it needs to be able to restart later,
possibly by buffering some data into a local buffer. And it's only once all the
output data have been consumed by snd_buf() that the stream is free to
disappear.
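
A minimal sketch of this "consume everything, park the excess locally"
contract follows; the types and the sketch_mux_snd_buf() function are
simplified stand-ins, not the H2 mux's real code.

  /* Sketch : the mux takes what the stream offers and keeps the remainder
   * in its own per-stream buffer, so the stream and its channel may be
   * released once everything has been consumed.
   */
  #include <stddef.h>
  #include <string.h>

  struct sk_buf {
          char   area[1024];
          size_t data;
  };

  struct sk_mux_stream {
          struct sk_buf txbuf;            /* mux-owned, survives the stream */
  };

  /* returns the number of bytes taken from <src> */
  static size_t sketch_mux_snd_buf(struct sk_mux_stream *ms,
                                   const char *src, size_t len)
  {
          size_t room = sizeof(ms->txbuf.area) - ms->txbuf.data;
          size_t take = len < room ? len : room;

          memcpy(ms->txbuf.area + ms->txbuf.data, src, take);
          ms->txbuf.data += take;

          /* a real mux would now try to emit txbuf to the connection */
          return take;
  }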

This model presents the nice benefit of being infinitely stackable and solving
the last identified showstoppers to move towards a structured message internal
representation, as it will give full power to the rcv_buf() and snd_buf() to
process what they need.

For now the conn_stream's flags indicating whether a shutdown has been seen in
any direction or if an end of stream was seen will remain in the conn_stream,
though it's likely that some of them will move to the mux's representation of
the stream after structured messages are implemented.