2018-02-21 - Layering in haproxy 1.9 ------------------------------------ 2 main zones : - application : reads from conn_streams, writes to conn_streams, often uses streams - connection : receives data from the network, presented into buffers available via conn_streams, sends data to the network The connection zone contains multiple layers which behave independently in each direction. The Rx direction is activated upon callbacks from the lower layers. The Tx direction is activated recursively from the upper layers. Between every two layers there may be a buffer, in each direction. When a buffer is full either in Tx or Rx direction, this direction is paused from the network layer and the location where the congestion is encountered. Upon end of congestion (cs_recv() from the upper layer, of sendto() at the lower layers), a tasklet_wakeup() is performed on the blocked layer so that suspended operations can be resumed. In this case, the Rx side restarts propagating data upwards from the lowest blocked level, while the Tx side restarts propagating data downwards from the highest blocked level. Proceeding like this ensures that information known to the producer may always be used to tailor the buffer sizes or decide of a strategy to best aggregate data. Additionally, each time a layer is crossed without transformation, it becomes possible to send without copying. The Rx side notifies the application of data readiness using a wakeup or a callback. The Tx side notifies the application of room availability once data have been moved resulting in the uppermost buffer having some free space. When crossing a mux downwards, it is possible that the sender is not allowed to access the buffer because it is not yet its turn. It is not a problem, the data remains in the conn_stream's buffer (or the stream one) and will be restarted once the mux is ready to consume these data. cs_recv() -------. cs_send() ^ +--------> |||||| -------------+ ^ | | -------' | | stream --|----------|-------------------------------|-------|------------------- | | V | connection data .---. | | room ready! |---| |---| available! |---| |---| |---| |---| | | '---' ^ +------------+-------+ | | | ^ | / / V | V / / recvfrom() | sendto() | -------------|----------------|--------------|--------------------------- | | poll! V kernel The cs_recv() function should act on pointers to buffer pointers, so that the callee may decide to pass its own buffer directly by simply swapping pointers. Similarly for cs_send() it is desirable to let the callee steal the buffer by swapping the pointers. This way it remains possible to implement zero-copy forwarding. Some operation flags will be needed on cs_recv() : - RECV_ZERO_COPY : refuse to merge new data into the current buffer if it will result in a data copy (ie the buffer is not empty), unless no more than XXX bytes have to be copied (eg: copying 2 cache lines may be cheaper than waiting and playing with pointers) - RECV_AT_ONCE : only perform the operation if it will result in the source buffer to become empty at the end of the operation so that no two buffers remain allocated at the end. It will most of the time result in either a small read or a zero-copy operation. - RECV_PEEK : retrieve a copy of pending data without removing these data from the source buffer. Maybe an alternate solution could consist in finding the pointer to the source buffer and accessing these data directly, except that it might be less interesting for the long term, thread-wise. - RECV_MIN : receive minimum X bytes (or less with a shutdown), or fail. This should help various protocol parsers which need to receive a complete frame before proceeding. - RECV_ENOUGH : no more data expected after this read if it's of the requested size, thus no need to re-enable receiving on the lower layers. - RECV_ONE_SHOT : perform a single read without re-enabling reading on the lower layers, like we currently do when receiving an HTTP/1 request. Like RECV_ENOUGH where any size is enough. Probably that the two could be merged (eg: by having a MIN argument like RECV_MIN). Some operation flags will be needed on cs_send() : - SEND_ZERO_COPY : refuse to merge the presented data with existing data and prefer to wait for current data to leave and try again, unless the consumer considers the amount of data acceptable for a copy. - SEND_AT_ONCE : only perform the operation if it will result in the source buffer to become empty at the end of the operation so that no two buffers remain allocated at the end. It will most of the time result in either a small write or a zero-copy operation. Both operations should return a composite status : - number of bytes transferred - status flags (shutr, shutw, reset, empty, full, ...) 2018-07-23 - Update after merging rxbuf --------------------------------------- It becomes visible that the mux will not always be welcome to decode incoming data because it will sometimes imply extra memory copies and/or usage for no benefit. Ideally, when when a stream is instantiated based on incoming data, these incoming data should be passed and the upper layers called, but it should then be up these upper layers to peek more data in certain circumstances. Typically if the pending connection data are larger than what is expected to be passed above, it means some data may cause head-of-line blocking (HOL) to other streams, and needs to be pushed up through the layers to let other streams continue to work. Similarly very large H2 data frames after header frames should probably not be passed as they may require copies that could be avoided if passed later. However if the decoded frame fits into the conn_stream's buffer, there is an opportunity to use a single buffer for the conn_stream and the channel. The H2 demux could set a blocking flag indicating it's waiting for the upper stream to take over demuxing. This flag would be purged once the upper stream would start reading, or when extra data come and change the conditions. Forcing structured headers and raw data to coexist within a single buffer is quite challenging for many code parts. For example it's perfectly possible to see a fragmented buffer containing series of headers, then a small data chunk that was received at the same time, then a few other headers added by request processing, then another data block received afterwards, then possibly yet another header added by option http-send-name-header, and yet another data block. This causes some pain for compression which still needs to know where compressed and uncompressed data start/stop. It also makes it very difficult to account the exact bytes to pass through the various layers. One solution consists in thinking about buffers using 3 representations : - a structured message, which is used for the internal HTTP representation. This message may only be atomically processed. It has no clear byte count, it's a message. - a raw stream, consisting in sequences of bytes. That's typically what happens in data sequences or in tunnel. - a pipe, which contains data to be forwarded, and that haproxy cannot have access to. The processing efficiency decreases with the higher complexity above, but the capabilities increase. The structured message can contain anything including serialized data blocks to be processed or forwarded. The raw stream contains data blocks to be processed or forwarded. The pipe only contains data blocks to be forwarded. The the latter ones are only an optimization of the former ones. Thus ideally a channel should have access to all such 3 storage areas at once, depending on the use case : (1) a structured message, (2) a raw stream, (3) a pipe Right now a channel only has (2) and (3) but after the native HTTP rework, it will only have (1) and (3). Placing a raw stream exclusively in (1) comes with some performance drawbacks which are not easily recovered, and with some quite difficult management still involving the reserve to ensure that a data block doesn't prevent headers from being appended. But during header processing, the payload may be necessary so we cannot decide to drop this option. A long-term approach would consist in ensuring that a single channel may have access to all 3 representations at once, and to enumerate priority rules to define how they interact together. That's exactly what is currently being done with the pipe and the raw buffer right now. Doing so would also save the need for storing payload in the structured message and void the requirement for the reserve. But it would cost more memory to process POST data and server responses. Thus an intermediary step consists in keeping this model in mind but not implementing everything yet. Short term proposal : a channel has access to a buffer and a pipe. A non-empty buffer is either in structured message format OR raw stream format. Only the channel knows. However a structured buffer MAY contain raw data in a properly formatted way (using the envelope defined by the structured message format). By default, when a demux writes to a CS rxbuf, it will try to use the lowest possible level for what is being done (i.e. splice if possible, otherwise raw stream, otherwise structured message). If the buffer already contains a structured message, then this format is exclusive. From this point the MUX has two options : either encode the incoming data to match the structured message format, or refrain from receiving into the CS's rxbuf and wait until the upper layer request those data. This opens a simplified option which could be suited even for the long term : - cs_recv() will take one or two flags to indicate if a buffer already contains a structured message or not ; the upper layer knows it. - cs_recv() will take two flags to indicate what the upper layer is willing to take : - structured message only - raw stream only - any of them From this point the mux can decide to either pass anything or refrain from doing so. - the demux stores the knowledge it has from the contents into some CS flags to indicate whether or not some structured message are still available, and whether or not some raw data are still available. Thus the caller knows whether or not extra data are available. - when the demux works on its own, it refrains from passing structured data to a non-empty buffer, unless these data are causing trouble to other streams (HOL). - when a demux has to encapsulate raw data into a structured message, it will always have to respect a configured reserve so that extra header processing can be done on the structured message inside the buffer, regardless of the supposed available room. In addition, the upper layer may indicate using an extra recv() flag whether it wants the demux to defragment serialized data (for example by moving trailing headers apart) or if it's not necessary. This flag will be set by the stream interface if compression is required or if the http-buffer-request option is set for example. Probably that using to_forward==0 is a stronger indication that the reserve must be respected. - cs_recv() and cs_send() when fed with a message, should not return byte counts but message counts (i.e. 0 or 1). This implies that a single call to either of these functions cannot mix raw data and structured messages at the same time. At this point it looks like the conn_stream will have some encapsulation work to do for the payload if it needs to be encapsulated into a message. This further magnifies the importance of *not* decoding DATA frames into the CS's rxbuf until really needed. The CS will probably need to hold indication of what is available at the mux level, not only in the CS. Eg: we know that payload is still available. Using these elements, it should be possible to ensure that full header frames may be received without enforcing any reserve, that too large frames that do not fit will be detected because they return 0 message and indicate that such a message is still pending, and that data availability is correctly detected (later we may expect that the stream-interface allocates a larger or second buffer to place the payload). Regarding the ability for the channel to forward data, it looks like having a new function "cs_xfer(src_cs, dst_cs, count)" could be very productive in optimizing the forwarding to make use of splicing when available. It is not yet totally clear whether it will split into "cs_xfer_in(src_cs, pipe, count)" followed by "cs_xfer_out(dst_cs, pipe, count)" or anything different, and it still needs to be studied. The general idea seems to be that the receiver might have to call the sender directly once they agree on how to transfer data (pipe or buffer). If the transfer is incomplete, the cs_xfer() return value and/or flags will indicate the current situation (src empty, dst full, etc) so that the caller may register for notifications on the appropriate event and wait to be called again to continue. Short term implementation : 1) add new CS flags to qualify what the buffer contains and what we expect to read into it; 2) set these flags to pretend we have a structured message when receiving headers (after all, H1 is an atomic header as well) and see what it implies for the code; for H1 it's unclear whether it makes sense to try to set it without the H1 mux. 3) use these flags to refrain from sending DATA frames after HEADERS frames in H2. 4) flush the flags at the stream interface layer when performing a cs_send(). 5) use the flags to enforce receipt of data only when necessary We should be able to end up with sequential receipt in H2 modelling what is needed for other protocols without interfering with the native H1 devs. 2018-08-17 - Considerations after killing cs_recv() --------------------------------------------------- With the ongoing reorganisation of the I/O layers, it's visible that cs_recv() will have to transfer data between the cs' rxbuf and the channel's buffer while not being aware of the data format. Moreover, in case there's no data there, it needs to recursively call the mux's rcv_buf() to trigger a decoding, while this function is sometimes replaced with cs_recv(). All this shows that cs_recv() is in fact needed while data are pushed upstream from the lower layers, and is not suitable for the "pull" mode. Thus it was decided to remove this function and put its code back into h2_rcv_buf(). The H1 mux's rcv_buf() already couldn't be replaced with cs_recv() since it is the only one knowing about the buffer's format. This opportunity simplified something : if the cs's rxbuf is only read by the mux's rcv_buf() method, then it doesn't need to be located into the CS and is well placed into the mux's representation of the stream. This has an important impact for H2 as it offers more freedom to the mux to allocate/free/reallocate this buffer, and it ensures the mux always has access to it. Furthermore, the conn_stream's txbuf experienced the same fate. Indeed, the H1 mux has already uncovered the difficulty related to the channel shutting down on output, with data stuck into the CS's txbuf. Since the CS is tightly coupled to the stream and the stream can close immediately once its buffers are empty, it required a way to support orphaned CS with pending data in their txbuf. This is something that the H2 mux already has to deal with, by carefully leaving the data in the channel's buffer. But due to the snd_buf() call being top-down, it is always possible to push the stream's data via the mux's snd_buf() call without requiring a CS txbuf anymore. Thus the txbuf (when needed) is only implemented in the mux and attached to the mux's representation of the stream, and doing so allows to immediately release the channel once the data are safe in the mux's buffer. This is an important change which clarifies the roles and responsibilities of each layer in the chain : when receiving data from a mux, it's the mux's responsibility to make sure it can correctly decode the incoming data and to buffer the possible excess of data it cannot pass to the requester. This means that decoding an H2 frame, which is not retryable since it has an impact on the HPACK decompression context, and which cannot be reordered for the same reason, simply needs to be performed to the H2 stream's rxbuf which will then be passed to the stream when this one calls h2_rcv_buf(), even if it reads one byte at a time. Similarly when calling h2_snd_buf(), it's the mux's responsibility to read as much as it needs to be able to restart later, possibly by buffering some data into a local buffer. And it's only once all the output data has been consumed by snd_buf() that the stream is free to disappear. This model presents the nice benefit of being infinitely stackable and solving the last identified showstoppers to move towards a structured message internal representation, as it will give full power to the rcv_buf() and snd_buf() to process what they need. For now the conn_stream's flags indicating whether a shutdown has been seen in any direction or if an end of stream was seen will remain in the conn_stream, though it's likely that some of them will move to the mux's representation of the stream after structured messages are implemented.