Diffstat (limited to 'doc/design-thoughts/http2.txt')
-rw-r--r-- | doc/design-thoughts/http2.txt | 277 |
1 files changed, 277 insertions, 0 deletions
diff --git a/doc/design-thoughts/http2.txt b/doc/design-thoughts/http2.txt
new file mode 100644
index 0000000..c21ac10
--- /dev/null
+++ b/doc/design-thoughts/http2.txt
@@ -0,0 +1,277 @@
+2014/10/23 - design thoughts for HTTP/2
+
+- connections : HTTP/2 depends a lot more on a connection than HTTP/1 because a
+  connection holds a compression context (headers table, etc...). We probably
+  need to have an h2_conn struct.
+
+- multiple transactions will be handled in parallel for a given h2_conn. They
+  are called streams in HTTP/2 terminology.
+
+- multiplexing : for a given client-side h2 connection, we can have multiple
+  server-side h2 connections. And for a server-side h2 connection, we can have
+  multiple client-side h2 connections. Streams circulate in N-to-N fashion.
+
+- flow control : flow control will be applied between multiple streams. Special
+  care must be taken so that an H2 client cannot block some H2 servers by
+  sending requests spread over multiple servers to the point where one server
+  response is blocked and prevents other responses from the same server from
+  reaching their clients. H2 connection buffers must always be empty or nearly
+  empty. The per-stream flow control needs to be respected as well as the
+  connection's buffers. It is important to implement some fairness between all
+  the streams so that it's not always the same which gets the bandwidth when
+  the connection is congested.
+
+- some clients can be H1 with an H2 server (is this really needed ?). Most of
+  the initial use case will be H2 clients to H1 servers. It is important to keep
+  in mind that H1 servers do not do flow control and that we don't want them to
+  block transfers (eg: post upload).
+
+- internal tasks : some H2 clients will be internal tasks (eg: health checks).
+  Some H2 servers will be internal tasks (eg: stats, cache). The model must be
+  compatible with this use case.
+
+- header indexing : headers are transported compressed, with a reference to a
+  static or a dynamic header, or a literal, possibly huffman-encoded. Indexing
+  is specific to the H2 connection. This means there is no way any binary data
+  can flow between both sides, headers will have to be decoded according to the
+  incoming connection's context and re-encoded according to the outgoing
+  connection's context, which can significantly differ. In order to avoid the
+  parsing trouble we currently face, headers will have to be clearly split
+  between name and value. It is worth noting that neither the incoming nor the
+  outgoing connections' contexts will be of any use while processing the
+  headers. At best we can have some shortcuts for well-known names that map
+  well to the static ones (eg: use the first static entry with same name), and
+  maybe have a few special cases for static name+value as well. Probably we can
+  classify headers in such categories :
+
+    - static name + value
+    - static name + other value
+    - dynamic name + other value
+
+  This will allow for better processing in some specific cases. Headers
+  supporting a single value (:method, :status, :path, ...) should probably
+  be stored in a single location with a direct access. That would allow us
+  to retrieve a method using hdr[METHOD]. All such indexing must be performed
+  while parsing. That also means that HTTP/1 will have to be converted to this
+  representation very early in the parser and possibly converted back to H/1
+  after processing.
+
+  Header names/values will have to be placed in a small memory area that will
+  inevitably get fragmented as headers are rewritten. An automatic packing
+  mechanism must be implemented so that when there's no more room, headers are
+  simply defragmented/packed to a new table and the old one is released. Just
+  like for the static chunks, we need to have a few such tables pre-allocated
+  and ready to be swapped at any moment. Repacking must not change any index
+  nor affect the way headers are compressed so that it can happen late after a
+  retry (send-name-header for example).
+
+- header processing : can still happen on a (header, value) basis. Reqrep/
+  rsprep completely disappear and will have to be replaced with something else
+  to support renaming headers and rewriting url/path/...
+
+- push_promise : servers can push dummy requests+responses. They advertise
+  the stream ID in the push_promise frame indicating the associated stream ID.
+  This means that it is possible to initiate a client-server stream from the
+  information coming from the server and make the data flow as if the client
+  had made it. It's likely that we'll have to support two types of server
+  connections: those which support push and those which do not. That way client
+  streams will be distributed to existing server connections based on their
+  capabilities. It's important to keep in mind that PUSH will not be rewritten
+  in responses.
+
+- stream ID mapping : since the stream ID is per H2 connection, stream IDs will
+  have to be mapped. Thus a given stream is an entity with two IDs (one per
+  side). Or more precisely a stream has two end points, each one carrying an ID
+  when it ends on an HTTP2 connection. Also, for each stream ID we need to
+  quickly find the associated transaction in progress. Using a small quick
+  unique tree seems indicated considering the wide range of valid values.
+
+- frame sizes : frames have to be remapped between both sides as multiplexed
+  connections won't always have the same characteristics. Thus some frames
+  might be spliced and others will be sliced.
+
+- error processing : care must be taken to never break a connection unless it
+  is dead or corrupt at the protocol level. Stats counters must exist to observe
+  the causes. Timeouts are a great problem because silent connections might
+  die out of inactivity. Ping frames should probably be scheduled a few seconds
+  before the connection timeout so that an unused connection is verified before
+  being killed. Abnormal requests must be dealt with using RST_STREAM.
+
+- ALPN : ALPN must be observed on the client side, and transmitted to the server
+  side.
+
+- proxy protocol : proxy protocol makes little to no sense in a multiplexed
+  protocol. A per-stream equivalent will surely be needed if implementations
+  do not quickly generalize the use of Forward.
+
+- simplified protocol for local devices (eg: haproxy->varnish in clear and
+  without handshake, and possibly even with splicing if the connection's
+  settings are shared)
+
+- logging : logging must report extra information such as the
+  stream ID, and whether the transaction was initiated by the client or by the
+  server (which can be deduced from the stream ID's parity). In case of push,
+  the number of the associated stream must also be reported.
+
+- memory usage : H2 increases memory usage by mandating use of 16384 bytes
+  frame size minimum. That means slightly more than 16kB of buffer in each
+  direction to process any frame. It will definitely have an impact on the
+  deployed maxconn setting in places using less than this (4..8kB are common).
+  Also, the header list is persistent per connection, so if we reach the same
+  size as the request, that's another 16kB in each direction, resulting in
+  about 48kB of memory where 8 were previously used. A more careful encoder
+  can work with a much smaller set even if that implies evicting entries
+  between multiple headers of the same message.
+
+- HTTP/1.0 should very carefully be transported over H2. Since there's no way
+  to pass version information in the protocol, the server could use some
+  features of HTTP/1.1 that are unsafe in HTTP/1.0 (compression, trailers,
+  ...).
+
+- host / :authority : ":authority" is the norm, and "host" will be absent when
+  H2 clients generate :authority. This probably means that a dummy Host header
+  will have to be produced internally from :authority and removed when passing
+  to H2 behind. This can cause some trouble when passing H2 requests to H1
+  proxies, because there's no way to know if the request should contain scheme
+  and authority in H1 or not based on the H2 request. Thus a "proxy" option
+  will have to be explicitly mentioned on HTTP/1 server lines. One of the
+  problems that it creates is that it's no longer possible to pass H/1 requests
+  to H/1 proxies without an explicit configuration. Maybe a table of the
+  various combinations is needed.
+
+                        :scheme   :authority   host
+    HTTP/2 request      present   present      absent
+    HTTP/1 server req   absent    absent       present
+    HTTP/1 proxy req    present   present      present
+
+  So in the end the issue is only with H/2 requests passed to H/1 proxies.
+
+- ping frames : they don't indicate any stream ID so by definition they cannot
+  be forwarded to any server. The H2 connection should deal with them only.
+
+There's a layering problem with H2. The framing layer has to be aware of the
+upper layer semantics. We can't simply re-encode HTTP/1 to HTTP/2 then pass
+it over a framing layer to mux the streams, the frame type must be passed below
+so that frames are properly arranged. Header encoding is connection-based and
+all streams using the same connection will interact in the way their headers
+are encoded. Thus the encoder *has* to be placed in the h2_conn entity, and
+this entity has to know for each stream what its headers are.
+
+We should probably remove *all* headers from transported data and move
+them on the fly to a parallel structure that can be shared between H1 and H2
+and consumed at the appropriate level. That means buffers only transport data.
+Trailers have to be dealt with differently.
+
+So if we consider an H1 request being forwarded between a client and a server,
+it would look approximately like this :
+
+  - request header + body land into a stream's receive buffer
+  - headers are indexed and stripped out so that only the body and whatever
+    follows remain in the buffer
+  - both the header index and the buffer with the body stay attached to the
+    stream
+  - the sender can rebuild the whole headers. Since they're found in a table
+    supposed to be stable, it can rebuild them as many times as desired and
+    will always get the same result, so it's safe to build them into the trash
+    buffer for immediate sending, just as we do for the PROXY protocol.
+  - the upper protocol should probably provide a build_hdr() callback which
+    when called by the socket layer, builds this header block based on the
+    current stream's header list, ready to be sent.
+  - the socket layer has to know how many bytes from the headers are left to be
+    forwarded prior to processing the body.
+  - the socket layer needs to consume only the acceptable part of the body and
+    must not release the buffer if any data remains in it (eg: pipelining over
+    H1). This is already handled by channel->o and channel->to_forward.
+  - we could possibly have another optional callback to send a preamble before
+    data, that could be used to send chunk sizes in H1. The danger is that it
+    absolutely needs to be stable if it has to be retried. But it could
+    considerably simplify de-chunking.
+
+When the request is sent to an H2 server, an H2 stream request must be made
+to the server, we find an existing connection whose settings are compatible
+with our needs (eg: tls/clear, push/no-push), and with a spare stream ID. If
+none is found, a new connection must be established, unless maxconn is reached.
+
+Servers must have a maxstream setting just like they have a maxconn. The same
+queue may be used for that.
+
+The "tcp-request content" ruleset must apply to the TCP layer. But with HTTP/2
+that becomes impossible (and useless). We still need something like the
+"tcp-request session" hook to apply just after the SSL handshake is done.
+
+It is impossible to defragment the body on the fly in HTTP/2. Since multiple
+messages are interleaved, we cannot wait for all of them and block the head of
+line. Thus if body analysis is required, it will have to use the stream's
+buffer, which necessarily implies a copy. That means that with each H2 end we
+necessarily have at least one copy. Sometimes we might be able to "splice" some
+bytes from one side to the other without copying into the stream buffer (same
+rules as for TCP splicing).
+
+In theory, only data should flow through the channel buffer, so each side's
+connector is responsible for encoding data (H1: linear/chunks, H2: frames).
+Maybe the same mechanism could be extrapolated to tunnels / TCP.
+
+Since we'd use buffers only for data (and for receipt of headers), we need to
+have dynamic buffer allocation.
+
+Thus :
+- Tx buffers do not exist. We allocate a buffer on the fly when we're ready to
+  send something that we need to build and that needs to be persistent in case
+  of partial send. H1 headers are built on the fly from the header table to a
+  temporary buffer that is immediately sent and whose amount of sent bytes is
+  the only information kept (like for PROXY protocol). H2 headers are more
+  complex since the encoding depends on what was successfully sent. Thus we
+  need to build them and put them into a temporary buffer that remains
+  persistent in case send() fails. It is possible to have a limited pool of
+  Tx buffers and refrain from sending if there is no more buffer available in
+  the pool. In that case we need a wake-up mechanism once a buffer is
+  available. Once the data are sent, the Tx buffer is then immediately recycled
+  in its pool. Note that no tx buffer being used (eg: for hdr or control) means
+  that we have to be able to serialize access to the connection and retry with
+  the same stream. It also means that a stream that times out while waiting for
+  the connector to read the second half of its request has to stay there, or at
+  least needs to be handled gracefully. However if the connector cannot read
+  the data to be sent, it means that the buffer is congested and the connection
+  is dead, so that probably means it can be killed.
+
+- Rx buffers have to be pre-allocated just before calling recv(). A connection
+  will first try to pick a buffer and disable reception if it fails, then
+  subscribe to the list of tasks waiting for an Rx buffer.
+
+- full Rx buffers might sometimes be moved around to the next buffer instead of
+  experiencing a copy. That means that channels and connectors must use the
+  same format of buffer, and that only the channel will have to see its
+  pointers adjusted.
+
+- Tx of data should be made as much as possible without copying. That possibly
+  means by directly looking into the connection buffer on the other side if
+  the local Tx buffer does not exist and the stream buffer is not allocated, or
+  even performing a splice() call between the two sides. One of the problems in
+  doing this is that it requires proper ordering of the operations (eg: when
+  multiple readers are attached to the same buffer). If the splitting occurs upon
+  receipt, there's no problem. If we expect to retrieve data directly from the
+  original buffer, it's harder since it contains various things in an order
+  which does not even indicate what belongs to whom. Thus possibly the only
+  mechanism to implement is the buffer permutation which guarantees zero-copy
+  and only in the 100% safe case. Also it's atomic and does not cause HOL
+  blocking.
+
+It makes sense to choose the frontend_accept() function right after the
+handshake ended. It is then possible to check the ALPN, the SNI, the ciphers
+and to accept to switch to the h2_conn_accept handler only if everything is OK.
+The h2_conn_accept handler will have to deal with the connection setup,
+initialization of the header table, exchange of the settings frames and
+preparing whatever is needed to fire new streams upon receipt of unknown
+stream IDs. Note: most of the time it will not be possible to splice() because
+we need to know in advance the amount of bytes to write the header, and here it
+will not be possible.
+
+H2 health checks must be seen as regular transactions/streams. The check runs a
+normal client which seeks an available stream from a server. The server then
+finds one on an existing connection or initiates a new H2 connection. The H2
+checks will have to be configurable for sharing streams or not. Another option
+could be to specify how many requests can be made over existing connections
+before insisting on getting a separate connection. Note that such separate
+connections might end up stacking up once released. So they probably need
+to be recycled very quickly (eg: fix how many unused ones can exist max).
+
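
Note on the Rx buffer rule in the patch above (pick a buffer just before
recv(), disable reception if none is available, then subscribe to a wait list):
the following is a minimal sketch of how that could look, not code from the
patch or from haproxy. All names (rx_pool, h2_conn_sketch, conn_prepare_recv,
conn_release_rx) and the pool size are hypothetical; only the 16384-byte
buffer size comes from the text.

```c
#include <assert.h>
#include <stddef.h>

#define RX_POOL_SIZE 2      /* hypothetical: tiny pool so exhaustion is easy to hit */
#define RX_BUF_SIZE  16384  /* frame size quoted in the design note */

struct rx_buf {
    struct rx_buf *next;                /* free-list link */
    char data[RX_BUF_SIZE];
};

/* Hypothetical stand-in for the h2_conn entity, reduced to what the sketch needs. */
struct h2_conn_sketch {
    struct rx_buf *rx;                  /* buffer held for recv(), or NULL */
    int recv_enabled;                   /* 0 while reception is disabled */
    int waiting;                        /* 1 while queued in the wait list */
    struct h2_conn_sketch *next_waiter; /* wait-list link */
};

struct rx_pool {
    struct rx_buf bufs[RX_POOL_SIZE];
    struct rx_buf *free_list;
    struct h2_conn_sketch *wait_head, *wait_tail;
};

void rx_pool_init(struct rx_pool *p)
{
    p->free_list = NULL;
    p->wait_head = p->wait_tail = NULL;
    for (size_t i = 0; i < RX_POOL_SIZE; i++) {
        p->bufs[i].next = p->free_list;
        p->free_list = &p->bufs[i];
    }
}

/* To be called just before recv(). Returns 1 if the connection holds a
 * buffer, 0 if reception was disabled and the connection was queued. */
int conn_prepare_recv(struct rx_pool *p, struct h2_conn_sketch *c)
{
    if (c->rx)
        return 1;
    if (p->free_list) {
        c->rx = p->free_list;
        p->free_list = c->rx->next;
        c->rx->next = NULL;
        c->recv_enabled = 1;
        return 1;
    }
    c->recv_enabled = 0;
    if (!c->waiting) {                  /* subscribe only once */
        c->waiting = 1;
        c->next_waiter = NULL;
        if (p->wait_tail)
            p->wait_tail->next_waiter = c;
        else
            p->wait_head = c;
        p->wait_tail = c;
    }
    return 0;
}

/* To be called once the buffer's contents have been consumed. The buffer is
 * handed to the oldest waiter if there is one, else returned to the pool. */
void conn_release_rx(struct rx_pool *p, struct h2_conn_sketch *c)
{
    struct rx_buf *b = c->rx;
    struct h2_conn_sketch *w = p->wait_head;

    if (!b)
        return;
    c->rx = NULL;
    if (w) {
        p->wait_head = w->next_waiter;
        if (!p->wait_head)
            p->wait_tail = NULL;
        w->next_waiter = NULL;
        w->waiting = 0;
        w->rx = b;
        w->recv_enabled = 1;            /* "wake-up": reception re-enabled */
        return;
    }
    b->next = p->free_list;
    p->free_list = b;
}
```

In this sketch a released buffer goes straight to the oldest waiter, which
gives FIFO fairness between connections. A real implementation would more
likely wake a task and let it retry, which is an assumption on my side and not
something the note specifies.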