author    Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-28 09:35:11 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org>  2024-04-28 09:35:11 +0000
commit    da76459dc21b5af2449af2d36eb95226cb186ce2 (patch)
tree      542ebb3c1e796fac2742495b8437331727bbbfa0 /doc/internals/api
parent    Initial commit. (diff)
download  haproxy-da76459dc21b5af2449af2d36eb95226cb186ce2.tar.xz
          haproxy-da76459dc21b5af2449af2d36eb95226cb186ce2.zip
Adding upstream version 2.6.12.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/internals/api')
-rw-r--r--  doc/internals/api/appctx.txt       142
-rw-r--r--  doc/internals/api/buffer-api.txt   653
-rw-r--r--  doc/internals/api/filters.txt     1186
-rw-r--r--  doc/internals/api/htx-api.txt      570
-rw-r--r--  doc/internals/api/initcalls.txt    360
-rw-r--r--  doc/internals/api/ist.txt          167
-rw-r--r--  doc/internals/api/layers.txt       190
-rw-r--r--  doc/internals/api/list.txt         195
-rw-r--r--  doc/internals/api/pools.txt        577
-rw-r--r--  doc/internals/api/scheduler.txt    226
10 files changed, 4266 insertions(+), 0 deletions(-)
diff --git a/doc/internals/api/appctx.txt b/doc/internals/api/appctx.txt
new file mode 100644
index 0000000..137ec7b
--- /dev/null
+++ b/doc/internals/api/appctx.txt
@@ -0,0 +1,142 @@
Instantiation of applet contexts (appctx) in 2.6.


1. Background

Most applets are in fact simplified services that are called by the CLI when a
registered keyword is matched. Some of them only have a ->parse() function
which immediately returns with a final result, while others will return zero,
asking for the ->io_handler() function to be called until completion. For the
latter, a context is generally needed between calls to know where to restart
from.

Other applets are completely autonomous applets with their init function and
an I/O handler, and these ones also need a persistent context between calls to
the I/O handler. These ones are typically instantiated by "use-service" or by
other means.

Originally a few integers were provided to keep a trivial state (st0, st1,
st2) and these progressively proved insufficient, leading to a "ctx.cli"
sub-context that was allowed to use extra fields of various types. Other
applets preferred to use their own context definition.

All this resulted in appctx->ctx containing a myriad of definitions of various
service contexts, in some services abusing other services' definitions out of
laziness, and in others being extended to use their own definition after
having run for a long time on the generic types, some of which went unnoticed
and mistakenly used the same storage locations by accident. A massive cleanup
was needed.


2. New approach in 2.6

In 2.6, there's an "svcctx" pointer that's initialized to NULL before any
instantiation of an applet or of a CLI keyword's function. Applets and keyword
handlers are free to make it point wherever they want, and to find it
unaltered between subsequent calls, including up to the ->release() call. The
"st2" state that was totally abused with random enums is not used anymore and
was marked as deprecated. It's still initialized to zero before the first call
though.

One special area, "svc.storage[]", is large enough to contain any of the
contexts that used to be present under "appctx->ctx". The "svcctx" may be set
to point to this area so that a small structure can be allocated for free and
without requiring error checking. In order to make this easier, a specially
purposed function is provided: "applet_reserve_svcctx()". This function will
require the caller to indicate how large an area it needs, and will return a
pointer to this area after checking that it fits. If it does not, haproxy will
crash. This is purposely done so that it's known during development that if a
small structure doesn't fit, a different approach is required.

As such, for the vast majority of commands, the process is the following one:

    struct foo_ctx {
        int myfield1;
        int myfield2;
        char *myfield3;
    };

    int io_handler(struct appctx *appctx)
    {
        struct foo_ctx *ctx = applet_reserve_svcctx(appctx, sizeof(*ctx));

        if (!ctx->myfield1) {
            /* first call */
            ctx->myfield1++;
        }
        ...
    }
The pointer may be directly accessed from the I/O handler if it's known that
it was already reserved by the init handler or parsing function. Otherwise it
is guaranteed to be NULL, so that it can also serve as a test for a first
call:

    int parse_handler(struct appctx *appctx)
    {
        struct foo_ctx *ctx = applet_reserve_svcctx(appctx, sizeof(*ctx));

        ctx->myfield1 = 12;
        return 0;
    }

    int io_handler(struct appctx *appctx)
    {
        struct foo_ctx *ctx = appctx->svcctx;

        for (; ctx->myfield1; ctx->myfield1--) {
            do_something();
        }
        ...
    }

There is no need to free anything because that space is not allocated but
just points to a reserved area.

If the area is too small for the desired context (its size is
APPLET_MAX_SVCCTX bytes), it is preferable to use dynamically allocated
structures instead (pools, malloc, etc.). For example:

    int io_handler(struct appctx *appctx)
    {
        struct foo_ctx *ctx = appctx->svcctx;

        if (!ctx) {
            /* first call */
            ctx = pool_alloc(pool_foo_ctx);
            if (!ctx)
                return 1;
            appctx->svcctx = ctx;
        }
        ...
    }

    void io_release(struct appctx *appctx)
    {
        pool_free(pool_foo_ctx, appctx->svcctx);
    }

The CLI code itself uses this mechanism for the cli_print_*() functions. Since
these functions are terminal (i.e. not meant to be used in the middle of an
I/O handler as they share the same contextual space), they always reset the
svcctx pointer to make it point to the "cli_print_ctx" mapped in
->svc.storage.


3. Transition for old code

A lot of care was taken to make the transition as smooth as possible for
out-of-tree code since that's an API change. A dummy "ctx.cli" struct still
exists in the appctx struct, and it happens to map perfectly to the one set by
cli_print_*, so that if some code uses a mix of both, it will still work.
However, it will build with "deprecated" warnings, allowing the remaining
places to be spotted. It's a good exercise to rename "ctx.cli" in the appctx
struct and see if the code still compiles.

Regarding the "st2" sub-state, it will disappear as well after 2.6, but is
still provided and initialized so that code relying on it will still work,
even if it builds with deprecation warnings. The correct approach is to move
this state into the newly defined applet's context, and to stop using the
stats enums STAT_ST_* that often barely match the needs and result in code
that is more complicated than desired (the STAT_ST_* enum values have also
been marked as deprecated).

The code dealing with "show fd", "show sess" and the peers applet provides
good examples of how to convert a registered keyword or an applet.

All this transition code requires complex layouts that will be removed during
2.7-dev, so there is no long-term option other than updating the code (or
better, getting it merged if it can be useful to other users).
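
For completeness, here is a hedged sketch of how such a pair of handlers is
typically wired to a CLI keyword. The "show foo" command is purely
hypothetical, and the exact layout of struct cli_kw should be checked against
include/haproxy/cli-t.h for the version in use:

    /* Hypothetical "show foo" command using the svcctx mechanism above. */
    static int cli_parse_show_foo(char **args, char *payload,
                                  struct appctx *appctx, void *private)
    {
        struct foo_ctx *ctx = applet_reserve_svcctx(appctx, sizeof(*ctx));

        ctx->myfield1 = 10;        /* e.g. number of entries left to dump */
        return 0;                  /* let ->io_handler() run */
    }

    static struct cli_kw_list cli_kws = {{ }, {
        { { "show", "foo", NULL }, "show foo : dump foo entries",
          cli_parse_show_foo, io_handler, io_release },
        {{},}
    }};

    INITCALL1(STG_REGISTER, cli_register_kw, &cli_kws);

The io_handler and io_release above are the ones from the previous examples;
registration happens at startup through the INITCALL mechanism described in
initcalls.txt.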
diff --git a/doc/internals/api/buffer-api.txt b/doc/internals/api/buffer-api.txt
new file mode 100644
index 0000000..ac35300
--- /dev/null
+++ b/doc/internals/api/buffer-api.txt
@@ -0,0 +1,653 @@
2018-07-13 - HAProxy Internal Buffer API


1. Background

HAProxy uses a "struct buffer" internally to store data received from
external agents, as well as data to be sent to external agents. These buffers
are also used during data transformation such as compression, header
insertion or defragmentation, and are used to carry intermediary
representations between the various internal layers. They support wrapping at
the end, and they carry their own size information so that in theory it would
be possible to use different buffer sizes in parallel even though this is not
currently implemented.

The format of this structure has evolved over time, to reach a point where it
is convenient and versatile enough to have allowed several internal types to
converge into a single one (specifically the struct chunk disappeared).


2. Representation as of 1.9-dev1

The current buffer representation consists of a linear storage area of known
size, with a head position indicating the oldest data, and a total data count
expressed in bytes. The head position, data count and size are expressed as
integers and are zero or positive. By convention, the head position is
strictly smaller than the buffer size and the data count is smaller than or
equal to the size, so that wrapping can be resolved with a single subtract. A
buffer not respecting these rules is said to be degenerate. Unless specified
otherwise, the various API functions will adopt an undefined behaviour when
passed such a degenerate buffer.

  Buffer declaration :

    struct buffer {
        size_t size;  // size of the storage area (wrapping point)
        char *area;   // start of the storage area
        size_t data;  // contents length after head
        size_t head;  // start offset of remaining data relative to area
    };


  Linear buffer representation :

    area
     |
     V<--------------------------------------------------------->| size
     +-----------+---------------------------------+-------------+
     |           |/////////////////////////////////|             |
     +-----------+---------------------------------+-------------+
     |<--------->|<------------------------------->|
          head                  data               ^
                                                   |
                                                  tail


  Wrapping buffer representation :

    area
     |
     V<--------------------------------------------------------->| size
     +---------------+------------------------+------------------+
     |///////////////|                        |//////////////////|
     +---------------+------------------------+------------------+
     |<-------------------------------------->| head
     |-------------->|  ...data        data...|<-----------------|
                     ^
                     |
                    tail


3. Terminology

Manipulating a buffer just based on a head and a wrapping data count is not
very convenient, so we define a certain number of terms for important
elements characterizing a buffer :

  - origin    : pointer to relative position 0 in the storage area. Undefined
                when the buffer is not allocated.

  - size      : the allocated size of the storage area starting at the
                origin, expressed in bytes. A buffer whose size is zero is
                said not to be allocated, and its origin in this case is
                undefined.

  - data      : the amount of data the buffer contains, in bytes. It is
                always lower than or equal to the buffer's size, hence it is
                always 0 for an unallocated buffer.

  - emptiness : a buffer is said to be empty when it contains no data, hence
                data == 0. It is possible for such buffers not to be
                allocated and to have size == 0 as well.

  - room      : the available space in the buffer. This is its size minus
                data.

  - head      : position relative to origin where the oldest data byte is
                found (it typically is what send() uses to pick outgoing
                data). The head is strictly smaller than the size.

  - tail      : position relative to origin where the first spare byte is
                found (it typically is what recv() uses to store incoming
                data). It is always equal to the buffer's data added to its
                head modulo the buffer's size.
  - wrapping  : the byte following the last one of the storage area loops
                back to position 0. This is called wrapping. The wrapping
                point is the first position relative to origin which doesn't
                belong to the storage area. There is no wrapping when a
                buffer is not allocated. Wrapping requires special care and
                means that the regular string manipulation functions are not
                usable on most buffers, unless it is known that no wrapping
                happens. Free space may wrap as well if the buffer only
                contains data in the middle.

  - alignment : a buffer is said to be aligned if its data do not wrap. That
                is, its head is strictly before the tail, or the buffer is
                empty and the head is null. Aligning a buffer may be required
                to use regular string manipulation functions which have no
                support for wrapping.


A buffer may be in three different states :
  - unallocated : size == 0, area == 0  (b_is_null() is true)
  - waiting     : size == 0, area != 0
  - allocated   : size  > 0, area  > 0

It is not permitted to have area == 0 with a non-null size. In addition, the
waiting state may also be used to indicate a read-only buffer which does not
wrap and which must not be freed (e.g. for use with error messages).

The basic API only covers allocated buffers. Switching to/from the other
states is covered by the management API since it requires specific allocation
and free calls.
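
As a quick illustration of these rules, here is a hedged sketch (not part of
the API) showing how the tail offset and the room derive from the structure's
fields; the real b_tail_ofs() and b_room() accessors behave this way :

    /* Illustrative only. Since head < size and data <= size, head + data
     * is always below 2*size, so a single subtract resolves the wrapping,
     * exactly as the convention above requires.
     */
    static size_t illustrative_tail_ofs(const struct buffer *b)
    {
        size_t tail = b->head + b->data;

        if (tail >= b->size)
            tail -= b->size;
        return tail;
    }

    static size_t illustrative_room(const struct buffer *b)
    {
        /* the room is the size minus the data count, wherever it sits */
        return b->size - b->data;
    }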


4. Using buffers

Buffers are defined in a few files :
  - include/common/buf.h    : structure definition, and manipulation functions
  - include/common/buffer.h : resource management (alloc/free/wait lists)
  - include/common/istbuf.h : advanced string manipulation


4.1. Basic API

The basic API is made of the functions which abstract accesses to the buffers
and which help calculating their state, free space or used space.

====================+==================+=======================================
Function            | Arguments/Return | Description
--------------------+------------------+---------------------------------------
b_is_null()         | const buffer *buf| returns true if (and only if) the
                    | ret: int         | buffer is not yet allocated and thus
                    |                  | points to a NULL area
--------------------+------------------+---------------------------------------
b_orig()            | const buffer *buf| returns the pointer to the origin of
                    | ret: char *      | the storage, which is the location of
                    |                  | the byte at offset zero. This is
                    |                  | mostly used by functions which handle
                    |                  | the wrapping by themselves
--------------------+------------------+---------------------------------------
b_size()            | const buffer *buf| returns the size of the buffer
                    | ret: size_t      |
--------------------+------------------+---------------------------------------
b_wrap()            | const buffer *buf| returns the pointer to the wrapping
                    | ret: char *      | position of the buffer area, which is
                    |                  | by definition the first byte not part
                    |                  | of the buffer
--------------------+------------------+---------------------------------------
b_data()            | const buffer *buf| returns the number of bytes present in
                    | ret: size_t      | the buffer
--------------------+------------------+---------------------------------------
b_room()            | const buffer *buf| returns the amount of room left in the
                    | ret: size_t      | buffer
--------------------+------------------+---------------------------------------
b_full()            | const buffer *buf| returns true if the buffer is full
                    | ret: int         |
--------------------+------------------+---------------------------------------
__b_stop()          | const buffer *buf| returns a pointer to the byte
                    | ret: char *      | following the end of the buffer, which
                    |                  | may be out of the buffer if the buffer
                    |                  | ends on the last byte of the area. It
                    |                  | is the caller's responsibility to
                    |                  | either know that the buffer does not
                    |                  | wrap or to check that the result does
                    |                  | not wrap
--------------------+------------------+---------------------------------------
__b_stop_ofs()      | const buffer *buf| returns an origin-relative offset
                    | ret: size_t      | pointing to the byte following the end
                    |                  | of the buffer, which may be out of the
                    |                  | buffer if the buffer ends on the last
                    |                  | byte of the area. It's the caller's
                    |                  | responsibility to either know that the
                    |                  | buffer does not wrap or to check that
                    |                  | the result does not wrap
--------------------+------------------+---------------------------------------
b_stop()            | const buffer *buf| returns the pointer to the byte
                    | ret: char *      | following the end of the buffer, which
                    |                  | may be out of the buffer if the buffer
                    |                  | ends on the last byte of the area
--------------------+------------------+---------------------------------------
b_stop_ofs()        | const buffer *buf| returns an origin-relative offset
                    | ret: size_t      | pointing to the byte following the end
                    |                  | of the buffer, which may be out of the
                    |                  | buffer if the buffer ends on the last
                    |                  | byte of the area
--------------------+------------------+---------------------------------------
__b_peek()          | const buffer *buf| returns a pointer to the data at
                    | size_t ofs       | position <ofs> relative to the head of
                    | ret: char *      | the buffer. Will typically point to
                    |                  | input data if called with the amount
                    |                  | of output data. It's the caller's
                    |                  | responsibility to either know that the
                    |                  | buffer does not wrap or to check that
                    |                  | the result does not wrap
--------------------+------------------+---------------------------------------
__b_peek_ofs()      | const buffer *buf| returns an origin-relative offset
                    | size_t ofs       | pointing to the data at position <ofs>
                    | ret: size_t      | relative to the head of the buffer.
                    |                  | Will typically point to input data if
                    |                  | called with the amount of output data.
                    |                  | It's the caller's responsibility to
                    |                  | either know that the buffer does not
                    |                  | wrap or to check that the result does
                    |                  | not wrap
--------------------+------------------+---------------------------------------
b_peek()            | const buffer *buf| returns a pointer to the data at
                    | size_t ofs       | position <ofs> relative to the head of
                    | ret: char *      | the buffer. Will typically point to
                    |                  | input data if called with the amount
                    |                  | of output data. If applying <ofs> to
                    |                  | the buffer's head results in a
                    |                  | position between <size> and 2*<size>-1
                    |                  | included, a wrapping compensation is
                    |                  | applied to the result
--------------------+------------------+---------------------------------------
b_peek_ofs()        | const buffer *buf| returns an origin-relative offset
                    | size_t ofs       | pointing to the data at position <ofs>
                    | ret: size_t      | relative to the head of the buffer.
                    |                  | Will typically point to input data if
                    |                  | called with the amount of output data.
                    |                  | If applying <ofs> to the buffer's head
                    |                  | results in a position between <size>
                    |                  | and 2*<size>-1 included, a wrapping
                    |                  | compensation is applied to the result
--------------------+------------------+---------------------------------------
__b_head()          | const buffer *buf| returns the pointer to the buffer's
                    | ret: char *      | head, which is the location of the
                    |                  | next byte to be dequeued. The result
                    |                  | is undefined for unallocated buffers
--------------------+------------------+---------------------------------------
__b_head_ofs()      | const buffer *buf| returns an origin-relative offset
                    | ret: size_t      | pointing to the buffer's head, which
                    |                  | is the location of the next byte to be
                    |                  | dequeued. The result is undefined for
                    |                  | unallocated buffers
--------------------+------------------+---------------------------------------
b_head()            | const buffer *buf| returns the pointer to the buffer's
                    | ret: char *      | head, which is the location of the
                    |                  | next byte to be dequeued. The result
                    |                  | is undefined for unallocated buffers
--------------------+------------------+---------------------------------------
b_head_ofs()        | const buffer *buf| returns an origin-relative offset
                    | ret: size_t      | pointing to the buffer's head, which
                    |                  | is the location of the next byte to be
                    |                  | dequeued. The result is undefined for
                    |                  | unallocated buffers
--------------------+------------------+---------------------------------------
__b_tail()          | const buffer *buf| returns the pointer to the tail of the
                    | ret: char *      | buffer, which is the location of the
                    |                  | first byte where it is possible to
                    |                  | enqueue new data. The result is
                    |                  | undefined for unallocated buffers
--------------------+------------------+---------------------------------------
__b_tail_ofs()      | const buffer *buf| returns an origin-relative offset
                    | ret: size_t      | pointing to the tail of the buffer,
                    |                  | which is the location of the first
                    |                  | byte where it is possible to enqueue
                    |                  | new data. The result is undefined for
                    |                  | unallocated buffers
--------------------+------------------+---------------------------------------
b_tail()            | const buffer *buf| returns the pointer to the tail of the
                    | ret: char *      | buffer, which is the location of the
                    |                  | first byte where it is possible to
                    |                  | enqueue new data. The result is
                    |                  | undefined for unallocated buffers
--------------------+------------------+---------------------------------------
b_tail_ofs()        | const buffer *buf| returns an origin-relative offset
                    | ret: size_t      | pointing to the tail of the buffer,
                    |                  | which is the location of the first
                    |                  | byte where it is possible to enqueue
                    |                  | new data. The result is undefined for
                    |                  | unallocated buffers
--------------------+------------------+---------------------------------------
b_next()            | const buffer *buf| for an absolute pointer <p> pointing
                    | const char *p    | to a valid location within buffer <b>,
                    | ret: char *      | returns the absolute pointer to the
                    |                  | next byte, which usually is at (p + 1)
                    |                  | unless p reaches the wrapping point
                    |                  | and wrapping is needed
--------------------+------------------+---------------------------------------
b_next_ofs()        | const buffer *buf| for an origin-relative offset <o>
                    | size_t o         | pointing to a valid location within
                    | ret: size_t      | buffer <b>, returns the relative
                    |                  | offset pointing to the next byte,
                    |                  | which usually is at (o + 1) unless o
                    |                  | reaches the wrapping point and
                    |                  | wrapping is needed
--------------------+------------------+---------------------------------------
b_dist()            | const buffer *buf| returns the distance between two
                    | const char *from | pointers, taking into account the
                    | const char *to   | ability to wrap around the buffer's
                    | ret: size_t      | end. The operation is not defined if
                    |                  | either of the pointers does not belong
                    |                  | to the buffer or if their distance is
                    |                  | greater than the buffer's size
--------------------+------------------+---------------------------------------
b_almost_full()     | const buffer *buf| returns 1 if the buffer uses at least
                    | ret: int         | 3/4 of its capacity, otherwise zero.
                    |                  | Buffers of size zero are considered
                    |                  | full
--------------------+------------------+---------------------------------------
b_space_wraps()     | const buffer *buf| returns non-zero only if the buffer's
                    | ret: int         | free space wraps, which means that the
                    |                  | buffer contains data that are not
                    |                  | touching at least one edge
--------------------+------------------+---------------------------------------
b_contig_data()     | const buffer *buf| returns the amount of data that can
                    | size_t start     | contiguously be read at once starting
                    | ret: size_t      | from a relative offset <start> (which
                    |                  | allows to easily pre-compute blocks
                    |                  | for memcpy). The start point will
                    |                  | typically contain the amount of past
                    |                  | data already returned by a previous
                    |                  | call to this function
--------------------+------------------+---------------------------------------
b_contig_space()    | const buffer *buf| returns the amount of bytes that can
                    | ret: size_t      | be appended to the buffer at once
--------------------+------------------+---------------------------------------
b_getblk()          | const buffer *buf| gets one full block of data at once
                    | char *blk        | from a buffer, starting from offset
                    | size_t len       | <offset> after the buffer's head, and
                    | size_t offset    | limited to no more than <len> bytes.
                    | ret: size_t      | The caller is responsible for ensuring
                    |                  | that neither <offset> nor <offset> +
                    |                  | <len> exceed the total number of bytes
                    |                  | available in the buffer. Returns zero
                    |                  | if not enough data was available, in
                    |                  | which case blk is left undefined, or
                    |                  | the number of bytes read which is
                    |                  | equal to the requested size
--------------------+------------------+---------------------------------------
b_getblk_nc()       | const buffer *buf| gets one or two blocks of data at once
                    | const char **blk1| from a buffer, starting from offset
                    | size_t *len1     | <ofs> after the beginning of its
                    | const char **blk2| output, and limited to no more than
                    | size_t *len2     | <max> bytes. The caller is responsible
                    | size_t ofs       | for ensuring that neither <ofs> nor
                    | size_t max       | <ofs>+<max> exceed the total number of
                    | ret: int         | bytes available in the buffer. Returns
                    |                  | 0 if not enough data were available,
                    |                  | or the number of blocks filled (1 or
                    |                  | 2). <blk1> is always filled before
                    |                  | <blk2>. The unused blocks are left
                    |                  | undefined, and the buffer is left
                    |                  | unaffected
--------------------+------------------+---------------------------------------
b_reset()           | buffer *buf      | resets a buffer. The size is not
                    | ret: void        | touched. In practice it resets the
                    |                  | head and the data length
--------------------+------------------+---------------------------------------
b_sub()             | buffer *buf      | decreases the buffer length by <count>
                    | size_t count     | without touching the head position
                    | ret: void        | (only the tail moves). This may mostly
                    |                  | be used to trim pending data before
                    |                  | reusing a buffer. The caller is
                    |                  | responsible for not removing more than
                    |                  | the available data
--------------------+------------------+---------------------------------------
b_add()             | buffer *buf      | increases the buffer length by <count>
                    | size_t count     | without touching the head position
                    | ret: void        | (only the tail moves). This is used
                    |                  | when adding data at the tail of a
                    |                  | buffer. The caller is responsible for
                    |                  | not adding more than the available
                    |                  | room
--------------------+------------------+---------------------------------------
b_set_data()        | buffer *buf      | sets the buffer's length, by adjusting
                    | size_t len       | the buffer's tail only. The caller is
                    | ret: void        | responsible for passing a valid length
--------------------+------------------+---------------------------------------
b_del()             | buffer *buf      | deletes <del> bytes at the head of
                    | size_t del       | buffer <b> and updates the head. The
                    | ret: void        | caller is responsible for not removing
                    |                  | more than the available data. This is
                    |                  | used after sending data from the
                    |                  | buffer
--------------------+------------------+---------------------------------------
b_realign_if_empty()| buffer *buf      | realigns a buffer if it's empty, does
                    | ret: void        | nothing otherwise. This is mostly used
                    |                  | after b_del() to make an empty
                    |                  | buffer's free space contiguous
--------------------+------------------+---------------------------------------
b_slow_realign()    | buffer *buf      | realigns a possibly wrapping buffer so
                    | char *swap       | that the part remaining to be parsed
                    | size_t output    | is contiguous and starts at the
                    | ret: void        | beginning of the buffer and the
                    |                  | already parsed output part ends at the
                    |                  | end of the buffer. This provides the
                    |                  | best conditions since it allows the
                    |                  | largest inputs to be processed at once
                    |                  | and ensures that once the output data
                    |                  | leaves, the whole buffer is available
                    |                  | at once. The number of output bytes
                    |                  | supposedly present at the beginning of
                    |                  | the buffer and which need to be moved
                    |                  | to the end must be passed in <output>.
                    |                  | It will effectively make this offset
                    |                  | the new wrapping point. A temporary
                    |                  | swap area at least as large as b->size
                    |                  | must be provided in <swap>. It's up to
                    |                  | the caller to ensure <output> is no
                    |                  | larger than the difference between the
                    |                  | whole buffer's length and its input
--------------------+------------------+---------------------------------------
b_putchar()         | buffer *buf      | tries to append char <c> at the end of
                    | char c           | buffer <b>. Supports wrapping. New
                    | ret: void        | data are silently discarded if the
                    |                  | buffer is already full
--------------------+------------------+---------------------------------------
b_putblk()          | buffer *buf      | tries to append block <blk> at the end
                    | const char *blk  | of buffer <b>. Supports wrapping. Data
                    | size_t len       | are truncated if the buffer is too
                    | ret: size_t      | short or if not enough space is
                    |                  | available. It returns the number of
                    |                  | bytes really copied
--------------------+------------------+---------------------------------------
b_move()            | buffer *buf      | moves block (src,len) left or right
                    | size_t src       | by <shift> bytes, supporting wrapping
                    | size_t len       | and overlapping
                    | size_t shift     |
                    | ret: void        |
--------------------+------------------+---------------------------------------
b_rep_blk()         | buffer *buf      | writes the block <blk> at position
                    | char *pos        | <pos> which must be in buffer <b>, and
                    | char *end        | moves the part between <end> and the
                    | const char *blk  | buffer's tail just after the end of
                    | size_t len       | the copy of <blk>. This effectively
                    | ret: int         | replaces the part located between
                    |                  | <pos> and <end> with a copy of <blk>
                    |                  | of length <len>. The buffer's length
                    |                  | is automatically updated. This is used
                    |                  | to replace a block with another one
                    |                  | inside a buffer. The shift value
                    |                  | (positive or negative) is returned. If
                    |                  | there's no space left, the move is not
                    |                  | done. If <len> is null, the <blk>
                    |                  | pointer is allowed to be null, in
                    |                  | order to erase a block
--------------------+------------------+---------------------------------------
b_xfer()            | buffer *src      | transfers at most <count> bytes from
                    | buffer *dst      | buffer <src> to buffer <dst> and
                    | size_t count     | returns the number of bytes copied.
                    | ret: size_t      | The bytes are removed from <src> and
                    |                  | added to <dst>. The caller guarantees
                    |                  | that <count> is <= b_room(dst)
====================+==================+=======================================
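
To show how these basic calls combine, here is a hedged sketch of a typical
fill-then-drain cycle; the function names match the table above, but the
surrounding logic is purely illustrative :

    /* Illustrative only: append a message if it fits entirely, then drop
     * <sent> bytes once they have been written out, and realign so the
     * next receive call gets a contiguous free area.
     */
    static int fill_then_drain(struct buffer *b, const char *msg, size_t len,
                               size_t sent)
    {
        if (b_room(b) < len)
            return 0;           /* would be truncated by b_putblk() */

        b_putblk(b, msg, len);  /* append at the tail, handles wrapping */

        b_del(b, sent);         /* caller guarantees sent <= b_data(b) */
        b_realign_if_empty(b);  /* make free space contiguous if empty */
        return 1;
    }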


4.2. String API

The string API aims at providing both convenient and efficient ways to read
and write to/from buffers using indirect strings (ist). These strings and
some associated functions are defined in ist.h.

====================+==================+=======================================
Function            | Arguments/Return | Description
--------------------+------------------+---------------------------------------
b_isteq()           | const buffer *b  | returns > 0 if the first <n>
                    | size_t o         | characters of buffer <b> starting at
                    | size_t n         | offset <o> relative to the buffer's
                    | const ist ist    | head match <ist> (empty strings do
                    | ret: int         | match). It is designed to be used with
                    |                  | reasonably small strings (it matches a
                    |                  | single byte per loop iteration). It is
                    |                  | expected to be used with an offset to
                    |                  | skip old data. Returns the number of
                    |                  | matching bytes if > 0, not enough
                    |                  | bytes or empty string if 0, or
                    |                  | non-matching byte found if < 0
--------------------+------------------+---------------------------------------
b_isteat()          | struct buffer *b | "eats" string <ist> from the head of
                    | const ist ist    | buffer <b>. Wrapping data is
                    | ret: ssize_t     | explicitly supported. It matches a
                    |                  | single byte per iteration so strings
                    |                  | should remain reasonably small.
                    |                  | Returns the number of bytes matched
                    |                  | and eaten if > 0, not enough bytes or
                    |                  | matched empty string if 0, or
                    |                  | non-matching byte found if < 0
--------------------+------------------+---------------------------------------
b_istput()          | struct buffer *b | injects string <ist> at the tail of
                    | const ist ist    | output buffer <b> provided that it
                    | ret: ssize_t     | fits. Wrapping is supported. It's
                    |                  | designed for small strings as it only
                    |                  | writes a single byte per iteration.
                    |                  | Returns the number of characters
                    |                  | copied (ist.len), 0 if it temporarily
                    |                  | does not fit, or -1 if it will never
                    |                  | fit. It will only modify the buffer
                    |                  | upon success. In all cases, the
                    |                  | contents are copied prior to reporting
                    |                  | an error, so that the destination at
                    |                  | least contains a valid but truncated
                    |                  | string
--------------------+------------------+---------------------------------------
b_putist()          | struct buffer *b | tries to copy as much as possible of
                    | const ist ist    | string <ist> into buffer <b> and
                    | ret: size_t      | returns the number of bytes copied
                    |                  | (truncation is possible). It uses
                    |                  | b_putblk() and is suitable for large
                    |                  | blocks
====================+==================+=======================================
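
As a hedged usage sketch of this API, the function below consumes an expected
prefix from the head of a buffer; it assumes the 1.9-era include layout
mentioned above and the ist() initializer from ist.h :

    #include <common/ist.h>
    #include <common/istbuf.h>

    /* Illustrative only: try to consume a "PING\r\n" command at the head
     * of buffer <b>. Returns 1 when eaten, 0 to wait for more bytes, or
     * -1 on a non-matching byte.
     */
    static int eat_ping(struct buffer *b)
    {
        ssize_t ret = b_isteat(b, ist("PING\r\n"));

        if (ret > 0)
            return 1;    /* matched and removed from the head */
        if (ret == 0)
            return 0;    /* not enough bytes yet, keep waiting */
        return -1;       /* the pending data is not a PING command */
    }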


4.3. Management API

The management API makes a distinction between an empty buffer, which by
definition is not allocated but is ready to be allocated at any time, and a
buffer which failed an allocation and is waiting for an available area to be
offered. The functions allow one to register on a list to be notified about
buffer availability, and to notify others of a number of buffers just
released. All allocations are made through the standard buffer pools.

====================+==================+=======================================
Function            | Arguments/Return | Description
--------------------+------------------+---------------------------------------
buffer_almost_full()| const buffer *buf| returns true if the buffer is not null
                    | ret: int         | and at least 3/4 of the buffer's space
                    |                  | are used. A waiting buffer will match
--------------------+------------------+---------------------------------------
b_alloc()           | buffer *buf      | ensures that <buf> is allocated or
                    | ret: buffer *    | allocates a buffer and assigns it to
                    |                  | *buf. If no memory is available, (1)
                    |                  | is assigned instead with a zero size.
                    |                  | The allocated buffer is returned, or
                    |                  | NULL in case no memory is available
--------------------+------------------+---------------------------------------
__b_free()          | buffer *buf      | releases <buf> which must be allocated
                    | ret: void        | and marks it empty
--------------------+------------------+---------------------------------------
b_free()            | buffer *buf      | releases <buf> only if it is allocated
                    | ret: void        | and marks it empty
--------------------+------------------+---------------------------------------
offer_buffers()     | void *from       | offers a buffer currently belonging to
                    | uint threshold   | target <from> to whoever needs one.
                    | ret: void        | Any pointer is valid for <from>,
                    |                  | including NULL. Its purpose is to
                    |                  | avoid passing a buffer to oneself in
                    |                  | case of failed allocations (e.g. need
                    |                  | two buffers, get one, fail, release it
                    |                  | and wake up self again). In case of
                    |                  | normal buffer release where it is
                    |                  | expected that the caller is not
                    |                  | waiting for a buffer, NULL is fine
====================+==================+=======================================


5. Porting code from older versions

The previous buffer API introduced in 1.5-dev9 (May 2012) used to look like
the following (with the struct renamed to old_buffer here to avoid confusion
during quick lookups in the doc). It's worth noting that the "data" field
used to be part of the struct but with a different type and meaning. It's
important to be careful about potential code making use of &b->data as it
will silently compile but fail.

  Previous buffer declaration :

    struct old_buffer {
        char *p;            /* buffer's start pointer, separates in and out data */
        unsigned int size;  /* buffer size in bytes */
        unsigned int i;     /* number of input bytes pending for analysis in the buffer */
        unsigned int o;     /* number of out bytes the sender can consume from this buffer */
        char data[0];       /* <size> bytes */
    };

  Previous linear buffer representation :

    data                                 p
     |                                   |
     V                                   V
     +-----------+--------------------+------------+-------------+
     |           |////////////////////|////////////|             |
     +-----------+--------------------+------------+-------------+
     <--------------------------------------------------------->   size
                 <------------------> <---------->
                          o                i

There is the following correspondence between old and new fields (some
expressions involve a knowledge of the channel when the output byte count is
required) :

    Old     | New
    --------+----------------------------------------------------
    p       | data + head + co_data(channel)  // ci_head(channel)
    size    | size
    i       | data - co_data(channel)         // ci_data(channel)
    o       | co_data(channel)                // channel->output
    data    | area
    --------+----------------------------------------------------

Then some common expressions can be mapped like this :

    Old                    | New
    -----------------------+---------------------------------------
    b->data                | b_orig(b)
    &b->data               | b_orig(b)
    bi_ptr(b)              | ci_head(channel)
    bi_end(b)              | b_tail(b)
    bo_ptr(b)              | b_head(b)
    bo_end(b)              | co_tail(channel)
    bi_putblk(b,s,l)       | b_putblk(b,s,l)
    bo_getblk(b,s,l,o)     | b_getblk(b,s,l,o)
    bo_getblk_nc(b,s,l,o)  | b_getblk_nc(b,s,l,o,0,co_data(channel))
    b->i + b->o            | b_data(b)
    b->data + b->size      | b_wrap(b)
    b->i += len            | b_add(b, len)
    b->i -= len            | b_sub(b, len)
    b->i = len             | b_set_data(b, co_data(channel) + len)
    b->o += len            | b_add(b, len); channel->output += len
    b->o -= len            | b_del(b, len); channel->output -= len
    -----------------------+---------------------------------------

The buffer modification functions are less straightforward and depend a lot
on the context where they are used. It is strongly advised to check in the
list of functions above what is available, based on what the existing code
attempts to do.

Note that it is very likely that any out-of-tree code relying on buffers will
not use both ->i and ->o but instead will use exclusively ->i on the side
producing data and exclusively ->o on the side consuming data (such as in a
mux or in an applet). In both cases, it should be assumed that the other side
is always zero and that either ->i or ->o is replaced with ->data, making the
remaining code much simpler (no more code duplication based on the data
direction).
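
For a concrete feel of the mapping, here is a hedged before/after sketch of a
typical consumer-side pattern (a mux-style user where ->i was always zero, as
described above, so the channel->output bookkeeping from the table does not
apply); the old calls are shown in comments only since they no longer exist :

    /* Old API (1.5-dev9 style) consumer side:
     *
     *     ret = send(fd, bo_ptr(b), b->o, 0);
     *     if (ret > 0)
     *         b->o -= ret;
     *
     * New API equivalent for the same pattern, per the mapping table:
     * bo_ptr(b) became b_head(b), and "b->o -= ret" became b_del().
     */
    static void ack_sent_data(struct buffer *b, size_t ret)
    {
        b_del(b, ret);
        b_realign_if_empty(b);
    }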
diff --git a/doc/internals/api/filters.txt b/doc/internals/api/filters.txt
new file mode 100644
index 0000000..eee74cf
--- /dev/null
+++ b/doc/internals/api/filters.txt
@@ -0,0 +1,1186 @@
                     -----------------------------------------
                           Filters Guide - version 2.5
                            ( Last update: 2021-02-24 )
                     -----------------------------------------
                     Author : Christopher Faulet
                     Contact : christopher dot faulet at capflam dot org


ABSTRACT
--------

The filters support is a new feature of HAProxy 1.7. It is a way to extend
HAProxy without touching its core code and, to a certain extent, without
knowing its internals. This feature will ease contributions, reducing the
impact of changes. Another advantage will be to simplify HAProxy by replacing
some parts with filters. As we will see, and as an example, the HTTP
compression is the first feature moved into a filter.

This document describes how to write a filter and what to keep in mind to do
so. It also talks about the known limits and the pitfalls to avoid.

As said, filters are quite new for now. The API is not frozen and will be
updated/modified/improved/extended as needed.



SUMMARY
-------

    1. Filters introduction
    2. How to use filters
    3. How to write a new filter
        3.1. API Overview
        3.2. Defining the filter name and its configuration
        3.3. Managing the filter lifecycle
            3.3.1. Dealing with threads
        3.4. Handling the streams activity
        3.5. Analyzing the channels activity
        3.6. Filtering the data exchanged
    4. FAQ



1. FILTERS INTRODUCTION
-----------------------

First of all, to fully understand how filters work and how to create one, it
is best to know, at least from a distance, what a proxy (frontend/backend), a
stream and a channel are in HAProxy and how these entities are linked to each
other. doc/internals/entities.pdf is a good overview.

Then, to support filters, many callbacks have been added to HAProxy at
different places, mainly around channel analyzers. Their purpose is to allow
filters to be involved in the data processing, from the stream
creation/destruction to the data forwarding. Depending on what it should do,
a filter can implement all or part of these callbacks. For now, existing
callbacks are focused on streams. But future improvements could enlarge the
filters scope. For instance, it could be useful to handle events at the
connection level.

In the HAProxy configuration file, a filter is declared in a proxy section,
except defaults. So the configuration corresponding to a filter declaration
is attached to a specific proxy, and will be shared by all its instances. It
is opaque from the HAProxy point of view: it is the filter's responsibility
to manage it. Each filter declaration matches a unique configuration. Several
declarations of the same filter in the same proxy will be handled as
different filters by HAProxy.

A filter instance is represented by a partially opaque context (or a state)
attached to a stream and passed as argument to callbacks. Through this
context, filter instances are stateful. Depending on whether the filter is
declared in a frontend or a backend section, its instances will be created,
respectively, when a stream is created or when a backend is selected. Their
behaviors will also be different. Only instances of filters declared in a
frontend section will be aware of the creation and the destruction of the
stream, and will take part in the channels analyzing before the backend is
defined.

It is important to remember that the configuration of a filter is shared by
all its instances, while the context of an instance is owned by a unique
stream.

Filters are designed to be chained. It is possible to declare several filters
in the same proxy section. The declaration order is important because filters
will be called one after the other respecting this order. Frontend and
backend filters are also chained, frontend ones being called first. Even if
the filters processing is serialized, each filter will behave as if it were
alone (unless it was developed to be aware of other filters). For all that,
some constraints are imposed on filters, especially when data exchanged
between the client and the server are processed. We will come back to these
constraints when tackling the subject of writing a filter.


2. HOW TO USE FILTERS
---------------------

To use a filter, the parameter 'filter' should be used, followed by the
filter name and, optionally, its configuration in the desired listen,
frontend or backend section. For instance :

    listen test
        ...
        filter trace name TST
        ...


See doc/configuration.txt for a formal definition of the parameter 'filter'.
Note that additional parameters on the filter line must be parsed by the
filter itself.

The list of available filters is reported by 'haproxy -vv' :

    $> haproxy -vv
    HAProxy version 1.7-dev2-3a1d4a-33 2016/03/21
    Copyright 2000-2016 Willy Tarreau <willy@haproxy.org>

    [...]

    Available filters :
            [COMP] compression
            [TRACE] trace


Multiple filter lines can be used in a proxy section to chain filters.
Filters will be called in the declaration order.

Some filters can support implicit declarations in certain circumstances
(without the filter line). This is not recommended for new features but is
useful for existing features moved into a filter, for backward compatibility
reasons. Implicit declarations are supported when there is only one filter
used on a proxy. When several filters are used, explicit declarations are
mandatory. The HTTP compression filter is one of these filters. Alone, using
the 'compression' keywords is enough to use it. But when at least a second
filter is used, a filter line must be added.

    # filter line is optional
    listen t1
        bind *:80
        compression algo gzip
        compression offload
        server srv x.x.x.x:80

    # filter line is mandatory for the compression filter
    listen t2
        bind *:81
        filter trace name T2
        filter compression
        compression algo gzip
        compression offload
        server srv x.x.x.x:80




3. HOW TO WRITE A NEW FILTER
----------------------------

To write a filter, there are two header files to explore :

    * include/haproxy/filters-t.h : This is the main header file, containing
                                    all important structures to use. It
                                    represents the filter API.

    * include/haproxy/filters.h : This header file contains helper functions
                                  that may be used. It also contains the
                                  internal API used by HAProxy to handle
                                  filters.

To ease the filters integration, it is better to follow some conventions :

    * Use the 'flt_' prefix to name the filter (e.g flt_http_comp or
      flt_trace).

    * Keep everything related to the filter in the same file.

The filter 'trace' can be used as a template to write a new filter. It is a
good starting point to see how filters really work.

3.1 API OVERVIEW
----------------

Writing a filter comes down to writing functions and attaching them to the
existing callbacks. Available callbacks are listed in the following
structure :

    struct flt_ops {
        /*
         * Callbacks to manage the filter lifecycle
         */
        int  (*init)             (struct proxy *p, struct flt_conf *fconf);
        void (*deinit)           (struct proxy *p, struct flt_conf *fconf);
        int  (*check)            (struct proxy *p, struct flt_conf *fconf);
        int  (*init_per_thread)  (struct proxy *p, struct flt_conf *fconf);
        void (*deinit_per_thread)(struct proxy *p, struct flt_conf *fconf);

        /*
         * Stream callbacks
         */
        int  (*attach)            (struct stream *s, struct filter *f);
        int  (*stream_start)      (struct stream *s, struct filter *f);
        int  (*stream_set_backend)(struct stream *s, struct filter *f, struct proxy *be);
        void (*stream_stop)       (struct stream *s, struct filter *f);
        void (*detach)            (struct stream *s, struct filter *f);
        void (*check_timeouts)    (struct stream *s, struct filter *f);

        /*
         * Channel callbacks
         */
        int (*channel_start_analyze)(struct stream *s, struct filter *f,
                                     struct channel *chn);
        int (*channel_pre_analyze)  (struct stream *s, struct filter *f,
                                     struct channel *chn,
                                     unsigned int an_bit);
        int (*channel_post_analyze) (struct stream *s, struct filter *f,
                                     struct channel *chn,
                                     unsigned int an_bit);
        int (*channel_end_analyze)  (struct stream *s, struct filter *f,
                                     struct channel *chn);

        /*
         * HTTP callbacks
         */
        int  (*http_headers)(struct stream *s, struct filter *f,
                             struct http_msg *msg);
        int  (*http_payload)(struct stream *s, struct filter *f,
                             struct http_msg *msg, unsigned int offset,
                             unsigned int len);
        int  (*http_end)    (struct stream *s, struct filter *f,
                             struct http_msg *msg);

        void (*http_reset)  (struct stream *s, struct filter *f,
                             struct http_msg *msg);
        void (*http_reply)  (struct stream *s, struct filter *f,
                             short status,
                             const struct buffer *msg);

        /*
         * TCP callbacks
         */
        int (*tcp_payload) (struct stream *s, struct filter *f,
                            struct channel *chn, unsigned int offset,
                            unsigned int len);
    };


We will explain in the following parts when these callbacks are called and
what they should do.

Filters are declared in proxy sections. So each proxy has an ordered list of
filters, possibly empty if no filter is used. When the configuration of a
proxy is parsed, each filter line represents an entry in this list. In the
structure 'proxy', the filters configurations are stored in the field
'filter_configs', each one of type 'struct flt_conf *' :

    /*
     * Structure representing the filter configuration, attached to a proxy
     * and accessible from a filter when instantiated in a stream
     */
    struct flt_conf {
        const char     *id;    /* The filter id */
        struct flt_ops *ops;   /* The filter callbacks */
        void           *conf;  /* The filter configuration */
        struct list     list;  /* Next filter for the same proxy */
        unsigned int    flags; /* FLT_CFG_FL_* */
    };

    * 'flt_conf.id' is an identifier, defined by the filter. It can be
      NULL. HAProxy does not use this field. Filters can use it in log
      messages or as a unique identifier to check multiple declarations. It
      is the filter's responsibility to free it, if necessary.

    * 'flt_conf.conf' is opaque. It is the internal configuration of a
      filter, generally allocated and filled by its parsing function (See
      § 3.2). It is the filter's responsibility to free it.

    * 'flt_conf.ops' references the callbacks implemented by the filter.
      This field must be set during the parsing phase (See § 3.2) and can be
      refined during the initialization phase (See § 3.3). If it is
      dynamically allocated, it is the filter's responsibility to free it.

    * 'flt_conf.flags' is a bitfield to specify the filter capabilities. For
      now, only FLT_CFG_FL_HTX may be set when a filter is able to process
      HTX streams. If not set, the filter is excluded from the HTTP
      filtering.

The filter configuration is global and shared by all its instances. A filter
instance is created in the context of a stream and attached to this stream.
In the structure 'stream', the field 'strm_flt' is the state of all filter
instances attached to a stream :

    /*
     * Structure representing the "global" state of filters attached to a
     * stream.
     */
    struct strm_flt {
        struct list    filters;             /* List of filters attached to a stream */
        struct filter *current[2];          /* From which filter resume processing, for a specific channel.
                                             * This is used for resumable callbacks only,
                                             * If NULL, we start from the first filter.
                                             * 0: request channel, 1: response channel */
        unsigned short flags;               /* STRM_FL_* */
        unsigned char  nb_req_data_filters; /* Number of data filters registered on the request channel */
        unsigned char  nb_rsp_data_filters; /* Number of data filters registered on the response channel */
        unsigned long long offset[2];       /* global offset of input data already filtered for a specific channel
                                             * 0: request channel, 1: response channel */
    };


Filter instances attached to a stream are stored in the field
'strm_flt.filters', each instance being of type 'struct filter *' :

    /*
     * Structure representing a filter instance attached to a stream
     *
     * 2D-Array fields are used to store info per channel. The first index
     * stands for the request channel, and the second one for the response
     * channel. Especially, <next> and <fwd> are offsets representing
     * amounts of data that the filter has, respectively, parsed and
     * forwarded on a channel. Filters can access these values using FLT_NXT
     * and FLT_FWD macros.
     */
    struct filter {
        struct flt_conf *config;         /* the filter's configuration */
        void            *ctx;            /* The filter context (opaque) */
        unsigned short   flags;          /* FLT_FL_* */
        unsigned long long offset[2];    /* Offset of input data already filtered for a specific channel
                                          * 0: request channel, 1: response channel */
        unsigned int     pre_analyzers;  /* bit field indicating analyzers to
                                          * pre-process */
        unsigned int     post_analyzers; /* bit field indicating analyzers to
                                          * post-process */
        struct list      list;           /* Next filter for the same proxy/stream */
    };

    * 'filter.config' is the filter configuration previously described. All
      instances of a filter share it.

    * 'filter.ctx' is an opaque context. It is managed by the filter, so it
      is its responsibility to free it.

    * 'filter.pre_analyzers' and 'filter.post_analyzers' will be described
      later (See § 3.5).

    * 'filter.offset' will be described later (See § 3.6).
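
Since 'filter.ctx' is entirely filter-managed, the natural place to allocate
and release it is in the attach/detach callbacks described in § 3.4 (whose
skeletons are shown there). The hedged sketch below only adds the allocation
and free of a hypothetical per-stream context, and assumes the documented
convention that returning 0 from attach ignores the filter on this stream :

    /* Hypothetical per-stream state for "my_filter". */
    struct my_filter_ctx {
        unsigned long long bytes_seen;  /* purely illustrative counter */
    };

    static int
    my_filter_attach(struct stream *s, struct filter *filter)
    {
        struct my_filter_ctx *ctx = calloc(1, sizeof(*ctx));

        if (!ctx)
            return 0;       /* no memory: ignore the filter on this stream */
        filter->ctx = ctx;  /* owned by this filter instance */
        return 1;
    }

    static void
    my_filter_detach(struct stream *s, struct filter *filter)
    {
        free(filter->ctx);  /* the filter must free its own context */
        filter->ctx = NULL;
    }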


3.2. DEFINING THE FILTER NAME AND ITS CONFIGURATION
---------------------------------------------------

During the filter development, the first thing to do is to add it to the
supported filters. To do so, its name must be registered as a valid keyword
on the filter line :

    /* Declare the filter parser for "my_filter" keyword */
    static struct flt_kw_list flt_kws = { "MY_FILTER_SCOPE", { }, {
            { "my_filter", parse_my_filter_cfg, NULL /* private data */ },
            { NULL, NULL, NULL },
        }
    };

    INITCALL1(STG_REGISTER, flt_register_keywords, &flt_kws);


Then the filter internal configuration must be defined. For instance :

    struct my_filter_config {
        struct proxy *proxy;
        char         *name;
        /* ... */
    };


All callbacks implemented by the filter must then be declared. Here, a global
variable is used :

    struct flt_ops my_filter_ops = {
        .init   = my_filter_init,
        .deinit = my_filter_deinit,
        .check  = my_filter_config_check,

        /* ... */
    };


Finally, the function to parse the filter configuration must be written, here
'parse_my_filter_cfg'. This function must parse all remaining keywords on the
filter line :

    /* Return -1 on error, else 0 */
    static int
    parse_my_filter_cfg(char **args, int *cur_arg, struct proxy *px,
                        struct flt_conf *flt_conf, char **err, void *private)
    {
        struct my_filter_config *my_conf;
        int pos = *cur_arg;

        /* Allocate the internal configuration used by the filter */
        my_conf = calloc(1, sizeof(*my_conf));
        if (!my_conf) {
            memprintf(err, "%s : out of memory", args[*cur_arg]);
            return -1;
        }
        my_conf->proxy = px;

        /* ... */

        /* Parse all keywords supported by the filter and fill the internal
         * configuration */
        pos++; /* Skip the filter name */
        while (*args[pos]) {
            if (!strcmp(args[pos], "name")) {
                if (!*args[pos + 1]) {
                    memprintf(err, "'%s' : '%s' option without value",
                              args[*cur_arg], args[pos]);
                    goto error;
                }
                my_conf->name = strdup(args[pos + 1]);
                if (!my_conf->name) {
                    memprintf(err, "%s : out of memory", args[*cur_arg]);
                    goto error;
                }
                pos += 2;
            }

            /* ... parse other keywords ... */
        }
        *cur_arg = pos;

        /* Set callbacks supported by the filter */
        flt_conf->ops = &my_filter_ops;

        /* Last, save the internal configuration */
        flt_conf->conf = my_conf;
        return 0;

      error:
        if (my_conf->name)
            free(my_conf->name);
        free(my_conf);
        return -1;
    }


WARNING : In this parsing function, 'flt_conf->ops' must be initialized. All
          arguments of the filter line must also be parsed. This is
          mandatory.

In the previous example, the filter line should be read as follows :

    filter my_filter name MY_NAME ...


Optionally, by implementing the 'flt_ops.check' callback, an extra step is
added to check the internal configuration of the filter after the parsing
phase, when the HAProxy configuration is fully defined. For instance :

    /* Check configuration of a trace filter for a specified proxy.
     * Return 1 on error, else 0. */
    static int
    my_filter_config_check(struct proxy *px, struct flt_conf *my_conf)
    {
        if (px->mode != PR_MODE_HTTP) {
            ha_alert("The filter 'my_filter' cannot be used in non-HTTP mode.\n");
            return 1;
        }

        /* ... */

        return 0;
    }


3.3. MANAGING THE FILTER LIFECYCLE
----------------------------------

Once the configuration is parsed and checked, filters are ready to be used.
There are two main callbacks to manage the filter lifecycle :

    * 'flt_ops.init' : It initializes the filter for a proxy. This callback
                       may be defined to finish the filter configuration.

    * 'flt_ops.deinit' : It cleans up what the parsing function and the init
                         callback have done. This callback is useful to
                         release memory allocated for the filter
                         configuration.

Here is an example :

    /* Initialize the filter. Returns -1 on error, else 0. */
    static int
    my_filter_init(struct proxy *px, struct flt_conf *fconf)
    {
        struct my_filter_config *my_conf = fconf->conf;

        /* ... */

        return 0;
    }

    /* Free resources allocated by the filter. */
    static void
    my_filter_deinit(struct proxy *px, struct flt_conf *fconf)
    {
        struct my_filter_config *my_conf = fconf->conf;

        if (my_conf) {
            free(my_conf->name);
            /* ... */
            free(my_conf);
        }
        fconf->conf = NULL;
    }


3.3.1 DEALING WITH THREADS
--------------------------

When HAProxy is compiled with threads support and started with more than one
thread (global.nbthread > 1), then it is possible to manage the filter per
thread with the following callbacks :

    * 'flt_ops.init_per_thread': It initializes the filter for each
                                 thread. It works the same way as
                                 'flt_ops.init' but in the context of a
                                 thread. This callback is called after the
                                 thread creation.

    * 'flt_ops.deinit_per_thread': It cleans up what the init_per_thread
                                   callback has done. It is called in the
                                   context of a thread, before exiting it.

It is the filter's responsibility to deal with concurrency. The check, init
and deinit callbacks are called on the main thread. All others are called on
a "worker" thread (not always the same one). It is also the filter's
responsibility to know if HAProxy is started with more than one thread. If
it is started with one thread (or compiled without threads support), these
callbacks will be silently ignored (in this case, global.nbthread will
always be equal to one).
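
The per-thread callbacks follow the same shape as init/deinit. Here is a
hedged sketch allocating a per-thread scratch area; the 'scratch' array, its
size and the use of the global thread id 'tid' for lock-free indexing are
assumptions made for illustration only :

    #define MY_SCRATCH_SIZE 1024  /* hypothetical per-thread scratch size */

    /* Called once per worker thread, after its creation.
     * Returns -1 on error, else 0 (same convention as flt_ops.init). */
    static int
    my_filter_init_per_thread(struct proxy *px, struct flt_conf *fconf)
    {
        struct my_filter_config *my_conf = fconf->conf;

        /* hypothetical field: char *scratch[MAX_THREADS]; indexing by the
         * current thread id avoids any locking between workers */
        my_conf->scratch[tid] = malloc(MY_SCRATCH_SIZE);
        if (!my_conf->scratch[tid])
            return -1;
        return 0;
    }

    /* Called in the context of the same thread, before exiting it. */
    static void
    my_filter_deinit_per_thread(struct proxy *px, struct flt_conf *fconf)
    {
        struct my_filter_config *my_conf = fconf->conf;

        free(my_conf->scratch[tid]);
        my_conf->scratch[tid] = NULL;
    }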
+ + * 'flt_ops.detach' : It is called when a filter instance is detached from a + stream, before its destruction. This happens when the + stream is stopped for filters defined on the stream's + frontend and when the analyze ends for filters defined on + the stream's backend. + +For instance : + + /* Called when a filter instance is created and attach to a stream */ + static int + my_filter_attach(struct stream *s, struct filter *filter) + { + struct my_filter_config *my_conf = FLT_CONF(filter); + + if (/* ... */) + return 0; /* Ignore the filter here */ + return 1; + } + + /* Called when a filter instance is detach from a stream, just before its + * destruction */ + static void + my_filter_detach(struct stream *s, struct filter *filter) + { + struct my_filter_config *my_conf = FLT_CONF(filter); + + /* ... */ + } + +Finally, it may be interesting to notify the filter when the stream is woken up +because of an expired timer. This could let a chance to check some internal +timeouts, if any. To do so the following callback must be used : + + * 'flt_opt.check_timeouts' : It is called when a stream is woken up because of + an expired timer. + +For instance : + + /* Called when a stream is woken up because of an expired timer */ + static void + my_filter_check_timeouts(struct stream *s, struct filter *filter) + { + struct my_filter_config *my_conf = FLT_CONF(filter); + + /* ... */ + } + + +3.5. ANALYZING THE CHANNELS ACTIVITY +------------------------------------ + +The main purpose of filters is to take part in the channels analyzing. To do so, +there is 2 callbacks, 'flt_ops.channel_pre_analyze' and +'flt_ops.channel_post_analyze', called respectively before and after each +analyzer attached to a channel, except analyzers responsible for the data +forwarding (TCP or HTTP). Concretely, on the request channel, these callbacks +could be called before following analyzers : + + * tcp_inspect_request (AN_REQ_INSPECT_FE and AN_REQ_INSPECT_BE) + * http_wait_for_request (AN_REQ_WAIT_HTTP) + * http_wait_for_request_body (AN_REQ_HTTP_BODY) + * http_process_req_common (AN_REQ_HTTP_PROCESS_FE) + * process_switching_rules (AN_REQ_SWITCHING_RULES) + * http_process_req_ common (AN_REQ_HTTP_PROCESS_BE) + * http_process_tarpit (AN_REQ_HTTP_TARPIT) + * process_server_rules (AN_REQ_SRV_RULES) + * http_process_request (AN_REQ_HTTP_INNER) + * tcp_persist_rdp_cookie (AN_REQ_PRST_RDP_COOKIE) + * process_sticking_rules (AN_REQ_STICKING_RULES) + +And on the response channel : + + * tcp_inspect_response (AN_RES_INSPECT) + * http_wait_for_response (AN_RES_WAIT_HTTP) + * process_store_rules (AN_RES_STORE_RULES) + * http_process_res_common (AN_RES_HTTP_PROCESS_BE) + +Unlike the other callbacks previously seen before, 'flt_ops.channel_pre_analyze' +can interrupt the stream processing. So a filter can decide to not execute the +analyzer that follows and wait the next iteration. If there are more than one +filter, following ones are skipped. On the next iteration, the filtering resumes +where it was stopped, i.e. on the filter that has previously stopped the +processing. So it is possible for a filter to stop the stream processing on a +specific analyzer for a while before continuing. Moreover, this callback can be +called many times for the same analyzer, until it finishes its processing. For +instance : + + /* Called before a processing happens on a given channel. + * Returns a negative value if an error occurs, 0 if it needs to wait, + * any other value otherwise. 
*/ + static int + my_filter_chn_pre_analyze(struct stream *s, struct filter *filter, + struct channel *chn, unsigned an_bit) + { + struct my_filter_config *my_conf = FLT_CONF(filter); + + switch (an_bit) { + case AN_REQ_WAIT_HTTP: + if (/* wait that a condition is verified before continuing */) + return 0; + break; + /* ... * / + } + return 1; + } + + * 'an_bit' is the analyzer id. All analyzers are listed in + 'include/haproxy/channels-t.h'. + + * 'chn' is the channel on which the analyzing is done. It is possible to + determine if it is the request or the response channel by testing if + CF_ISRESP flag is set : + + │ ((chn->flags & CF_ISRESP) == CF_ISRESP) + + +In previous example, the stream processing is blocked before receipt of the HTTP +request until a condition is verified. + +'flt_ops.channel_post_analyze', for its part, is not resumable. It returns a +negative value if an error occurs, any other value otherwise. It is called when +a filterable analyzer finishes its processing, so once for the same analyzer. +For instance : + + /* Called after a processing happens on a given channel. + * Returns a negative value if an error occurs, any other + * value otherwise. */ + static int + my_filter_chn_post_analyze(struct stream *s, struct filter *filter, + struct channel *chn, unsigned an_bit) + { + struct my_filter_config *my_conf = FLT_CONF(filter); + struct http_msg *msg; + + switch (an_bit) { + case AN_REQ_WAIT_HTTP: + if (/* A test on received headers before any other treatment */) { + msg = ((chn->flags & CF_ISRESP) ? &s->txn->rsp : &s->txn->req); + txn->status = 400; + msg->msg_state = HTTP_MSG_ERROR; + http_reply_and_close(s, s->txn->status, http_error_message(s)); + return -1; /* This is an error ! */ + } + break; + /* ... * / + } + return 1; + } + + +Pre and post analyzer callbacks of a filter are not automatically called. They +must be regiesterd explicitly on analyzers, updating the value of +'filter.pre_analyzers' and 'filter.post_analyzers' bit fields. All analyzer bits +are listed in 'include/types/channels.h'. Here is an example : + + static int + my_filter_stream_start(struct stream *s, struct filter *filter) + { + /* ... * / + + /* Register the pre analyzer callback on all request and response + * analyzers */ + filter->pre_analyzers |= (AN_REQ_ALL | AN_RES_ALL) + + /* Register the post analyzer callback of only on AN_REQ_WAIT_HTTP and + * AN_RES_WAIT_HTTP analyzers */ + filter->post_analyzers |= (AN_REQ_WAIT_HTTP | AN_RES_WAIT_HTTP) + + /* ... * / + return 0; + } + + +To surround activity of a filter during the channel analyzing, two new analyzers +has been added : + + * 'flt_start_analyze' (AN_REQ/RES_FLT_START_FE/AN_REQ_RES_FLT_START_BE) : For + a specific filter, this analyzer is called before any call to the + 'channel_analyze' callback. From the filter point of view, it calls the + 'flt_ops.channel_start_analyze' callback. + + * 'flt_end_analyze' (AN_REQ/RES_FLT_END) : For a specific filter, this + analyzer is called when all other analyzers have finished their + processing. From the filter point of view, it calls the + 'flt_ops.channel_end_analyze' callback. + +These analyzers are called only once per streams. + +'flt_ops.channel_start_analyze' and 'flt_ops.channel_end_analyze' callbacks can +interrupt the stream processing, as 'flt_ops.channel_analyze'. Here is an +example : + + /* Called when analyze starts for a given channel + * Returns a negative value if an error occurs, 0 if it needs to wait, + * any other value otherwise. 
*/ + static int + my_filter_chn_start_analyze(struct stream *s, struct filter *filter, + struct channel *chn) + { + struct my_filter_config *my_conf = FLT_CONF(filter); + + /* ... TODO ... */ + + return 1; + } + + /* Called when analyze ends for a given channel + * Returns a negative value if an error occurs, 0 if it needs to wait, + * any other value otherwise. */ + static int + my_filter_chn_end_analyze(struct stream *s, struct filter *filter, + struct channel *chn) + { + struct my_filter_config *my_conf = FLT_CONF(filter); + + /* ... TODO ... */ + + return 1; + } + + +Workflow on channels can be summarized as following : + + FE: Called for filters defined on the stream's frontend + BE: Called for filters defined on the stream's backend + + +------->---------+ + | | | + +----------------------+ | +----------------------+ + | flt_ops.attach (FE) | | | flt_ops.attach (BE) | + +----------------------+ | +----------------------+ + | | | + V | V + +--------------------------+ | +------------------------------------+ + | flt_ops.stream_start (FE)| | | flt_ops.stream_set_backend (FE+BE) | + +--------------------------+ | +------------------------------------+ + | | | + ... | ... + | | | + | ^ | + | --+ | | --+ + +------<----------+ | | +--------<--------+ | + | | | | | | | + V | | | V | | ++-------------------------------+ | | | +-------------------------------+ | | +| flt_start_analyze (FE) +-+ | | | flt_start_analyze (BE) +-+ | +|(flt_ops.channel_start_analyze)| | F | |(flt_ops.channel_start_analyze)| | ++---------------+---------------+ | R | +-------------------------------+ | + | | O | | | + +------<---------+ | N ^ +--------<-------+ | B + | | | T | | | | A ++---------------|------------+ | | E | +---------------|------------+ | | C +|+--------------V-------------+ | | N | |+--------------V-------------+ | | K +||+----------------------------+ | | D | ||+----------------------------+ | | E +|||flt_ops.channel_pre_analyze | | | | |||flt_ops.channel_pre_analyze | | | N +||| V | | | | ||| V | | | D +||| analyzer (FE) +-+ | | ||| analyzer (FE+BE) +-+ | ++|| V | | | +|| V | | + +|flt_ops.channel_post_analyze| | | +|flt_ops.channel_post_analyze| | + +----------------------------+ | | +----------------------------+ | + | --+ | | | + +------------>------------+ ... | + | | + [ data filtering (see below) ] | + | | + ... | + | | + +--------<--------+ | + | | | + V | | + +-------------------------------+ | | + | flt_end_analyze (FE+BE) +-+ | + | (flt_ops.channel_end_analyze) | | + +---------------+---------------+ | + | --+ + V + +----------------------+ + | flt_ops.detach (BE) | + +----------------------+ + | + V + +--------------------------+ + | flt_ops.stream_stop (FE) | + +--------------------------+ + | + V + +----------------------+ + | flt_ops.detach (FE) | + +----------------------+ + | + V + +By zooming on an analyzer box we have: + + ... + | + V + | + +-----------<-----------+ + | | + +-----------------+--------------------+ | + | | | | + | +--------<---------+ | | + | | | | | + | V | | | + | flt_ops.channel_pre_analyze ->-+ | ^ + | | | | + | | | | + | V | | + | analyzer --------->-----+--+ + | | | + | | | + | V | + | flt_ops.channel_post_analyze | + | | | + | | | + +-----------------+--------------------+ + | + V + ... + + + 3.6. FILTERING THE DATA EXCHANGED +----------------------------------- + +WARNING : To fully understand this part, it is important to be aware on how the + buffers work in HAProxy. 
For the HTTP part, it is also important to
+          understand how data are parsed and structured, and how the internal
+          representation, called HTX, works. See
+          doc/internals/api/buffer-api.txt and doc/internals/api/htx-api.txt
+          for details.
+
+An extended feature of the filters is the data filtering. By default a filter
+does not look into data exchanged between the client and the server because it
+is expensive. Indeed, instead of forwarding data without any processing, each
+byte needs to be buffered.
+
+So, to enable the data filtering on a channel, at any time, in one of the
+previous callbacks, the 'register_data_filter' function must be called. And
+conversely, to disable it, the 'unregister_data_filter' function must be
+called. For instance :
+
+    static int
+    my_filter_http_headers(struct stream *s, struct filter *filter,
+                           struct http_msg *msg)
+    {
+        struct my_filter_config *my_conf = FLT_CONF(filter);
+
+        /* 'chn' must be the request channel */
+        if (!(msg->chn->flags & CF_ISRESP)) {
+            struct htx         *htx;
+            struct ist          hdr;
+            struct http_hdr_ctx ctx;
+
+            htx = htxbuf(&msg->chn->buf);
+
+            /* Enable the data filtering for the request if 'X-Filter' header
+             * is set to 'true'. */
+            hdr = ist("X-Filter");
+            ctx.blk = NULL;
+            if (http_find_header(htx, hdr, &ctx, 0) &&
+                ctx.value.len >= 4 && memcmp(ctx.value.ptr, "true", 4) == 0)
+                register_data_filter(s, msg->chn, filter);
+        }
+
+        return 1;
+    }
+
+Here, the data filtering is enabled if the HTTP header 'X-Filter' is found and
+set to 'true'.
+
+If several filters are declared, the evaluation order remains the same,
+regardless of the order of the registrations to the data filtering. Data
+registrations must be performed before the data forwarding step. However, a
+filter may be unregistered from the data filtering at any time.
+
+Depending on the stream type, TCP or HTTP, the way to handle data filtering is
+different. HTTP data are structured while TCP data are raw. And there are more
+callbacks for HTTP streams to fully handle all steps of an HTTP transaction.
+But the main part is the same. The data filtering is performed in one callback,
+called in a loop on input data starting at a specific offset for a given
+length. Data analyzed by a filter are considered as forwarded from its point of
+view. Because filters are chained, a filter never analyzes more data than its
+predecessors. Thus only data analyzed by the last filter are effectively
+forwarded. This means, at any time, any filter may choose to not analyze all
+available data (available from its point of view), blocking the data
+forwarding.
+
+Internally, filters own two offsets representing the number of bytes already
+analyzed in the available input data, one per channel. There is also an offset
+couple at the stream level, in the strm_flt object, representing the total
+number of bytes already forwarded. These offsets may be retrieved and updated
+using the following macros :
+
+  * FLT_OFF(flt, chn)
+
+  * FLT_STRM_OFF(s, chn)
+
+where 'flt' is the 'struct filter' passed as argument in all callbacks, 's' the
+filtered stream and 'chn' is the considered channel. However, there is no
+reason for a filter to use these macros or take care of these offsets.
+
+
+3.6.1 FILTERING DATA ON TCP STREAMS
+-----------------------------------
+
+The data filtering for TCP streams is the easy case, because HAProxy does not
+parse these data. Data are stored raw in the buffer. So there is only one
+callback to consider:
+
+  * 'flt_ops.tcp_payload' : This callback is called when input data are
+    available. If not defined, all available data will be considered as
+    analyzed and forwarded from the filter point of view.
+
+This callback is called only if the filter is registered to analyze TCP
+data. Here is an example :
+
+    /* Returns a negative value if an error occurs, else the number of
+     * consumed bytes. */
+    static int
+    my_filter_tcp_payload(struct stream *s, struct filter *filter,
+                          struct channel *chn, unsigned int offset,
+                          unsigned int len)
+    {
+        struct my_filter_config *my_conf = FLT_CONF(filter);
+        int                      ret = len;
+
+        /* Do not parse more than 'my_conf->max_parse' bytes at a time */
+        if (my_conf->max_parse != 0 && ret > my_conf->max_parse)
+            ret = my_conf->max_parse;
+
+        /* if available data are not completely parsed, wake up the stream to
+         * be sure to not freeze it. The best is probably to set a
+         * chn->analyse_exp timer */
+        if (ret != len)
+            task_wakeup(s->task, TASK_WOKEN_MSG);
+        return ret;
+    }
+
+But it is important to note that tunnelled data of an HTTP stream may also be
+filtered via this callback. Tunnelled data are data exchanged after an HTTP
+tunnel is established between the client and the server, via an HTTP CONNECT or
+via a protocol upgrade. In this case, the data are structured. Of course, to do
+so, the filter must be able to parse HTX data and must have the FLT_CFG_FL_HTX
+flag set. At any time, the IS_HTX_STRM() macro may be used on the stream to
+know if it is an HTX stream or a TCP stream.
+
+
+3.6.2 FILTERING DATA ON HTTP STREAMS
+------------------------------------
+
+The HTTP data filtering is a bit more complex because HAProxy data are
+structured and represented in an internal format, called HTX. So basically
+there is the HTTP counterpart to the previous callback :
+
+  * 'flt_ops.http_payload' : This callback is called when input data are
+    available. If not defined, all available data will be considered as
+    analyzed and forwarded from the filter point of view.
+
+But the prototype for this callback is slightly different. Instead of having
+the channel as parameter, we have the HTTP message (struct http_msg). This
+callback is called only if the filter is registered to analyze HTTP data. Here
+is an example :
+
+    /* Returns a negative value if an error occurs, else the number of
+     * consumed bytes. */
+    static int
+    my_filter_http_payload(struct stream *s, struct filter *filter,
+                           struct http_msg *msg, unsigned int offset,
+                           unsigned int len)
+    {
+        struct my_filter_config *my_conf = FLT_CONF(filter);
+        struct htx *htx = htxbuf(&msg->chn->buf);
+        struct htx_ret htxret = htx_find_offset(htx, offset);
+        struct htx_blk *blk;
+
+        blk = htxret.blk;
+        offset = htxret.ret;
+        for (; blk; blk = htx_get_next_blk(htx, blk)) {
+            enum htx_blk_type type = htx_get_blk_type(blk);
+
+            if (type == HTX_BLK_UNUSED)
+                continue;
+            else if (type == HTX_BLK_DATA) {
+                /* filter data */
+            }
+            else
+                break;
+        }
+
+        return len;
+    }
+
+In addition, there are two other callbacks :
+
+  * 'flt_ops.http_headers' : This callback is called just before the HTTP body
+    forwarding and after any processing on the request/response HTTP
+    headers. When defined, this callback is always called for HTTP streams
+    (i.e. without the need of a registration to the data filtering).
+    Here is an example :
+
+    /* Returns a negative value if an error occurs, 0 if it needs to wait,
+     * any other value otherwise. */
+    static int
+    my_filter_http_headers(struct stream *s, struct filter *filter,
+                           struct http_msg *msg)
+    {
+        struct my_filter_config *my_conf = FLT_CONF(filter);
+        struct htx *htx = htxbuf(&msg->chn->buf);
+        struct htx_sl *sl = http_get_stline(htx);
+        int32_t pos;
+
+        for (pos = htx_get_first(htx); pos != -1; pos = htx_get_next(htx, pos)) {
+            struct htx_blk *blk = htx_get_blk(htx, pos);
+            enum htx_blk_type type = htx_get_blk_type(blk);
+            struct ist n, v;
+
+            if (type == HTX_BLK_EOH)
+                break;
+            if (type != HTX_BLK_HDR)
+                continue;
+
+            n = htx_get_blk_name(htx, blk);
+            v = htx_get_blk_value(htx, blk);
+            /* Do something on the header name/value */
+        }
+
+        return 1;
+    }
+
+  * 'flt_ops.http_end' : This callback is called when the whole HTTP message
+    was processed. It may interrupt the stream processing. So, it could be used
+    to synchronize the HTTP request with the HTTP response, for instance :
+
+    /* Returns a negative value if an error occurs, 0 if it needs to wait,
+     * any other value otherwise. */
+    static int
+    my_filter_http_end(struct stream *s, struct filter *filter,
+                       struct http_msg *msg)
+    {
+        struct my_filter_ctx *my_ctx = filter->ctx;
+
+
+        if (!(msg->chn->flags & CF_ISRESP)) /* The request */
+            my_ctx->end_of_req = 1;
+        else                                /* The response */
+            my_ctx->end_of_rsp = 1;
+
+        /* Both the request and the response are finished */
+        if (my_ctx->end_of_req == 1 && my_ctx->end_of_rsp == 1)
+            return 1;
+
+        /* Wait */
+        return 0;
+    }
+
+Then, to finish, there are two informational callbacks :
+
+  * 'flt_ops.http_reset' : This callback is called when an HTTP message is
+    reset. This happens either when a 1xx informational response is received,
+    or if we're retrying to send the request to the server after it failed. It
+    could be useful to reset the filter context before receiving the true
+    response.
+    By checking s->txn->status, it is possible to know why this callback is
+    called. If it's a 1xx, we're called because of an informational
+    message. Otherwise, it is a L7 retry.
+
+  * 'flt_ops.http_reply' : This callback is called when, at any time, HAProxy
+    decides to stop the processing on an HTTP message and to send an internal
+    response to the client. This mainly happens when an error or a redirect
+    occurs.
+
+
+3.6.3 REWRITING DATA
+--------------------
+
+The last and trickiest part of the data filtering is the data rewriting. For
+now, the filter API does not offer a lot of functions to handle it. There are
+only functions to notify HAProxy that the data size has changed, to let it
+update the internal state of filters. It is the developer's responsibility to
+update the data itself, i.e. the buffer offsets, using the following function :
+
+  * 'flt_update_offsets()' : This function must be called when a filter alters
+    incoming data. It updates offsets of the stream and of all filters
+    preceding the calling one. Not calling this function when a filter changes
+    the size of incoming data leads to an undefined behavior.
+
+A good example of a filter changing the data size is the HTTP compression
+filter.
diff --git a/doc/internals/api/htx-api.txt b/doc/internals/api/htx-api.txt
new file mode 100644
index 0000000..971328b
--- /dev/null
+++ b/doc/internals/api/htx-api.txt
@@ -0,0 +1,570 @@
+                     -----------------------------------------------
+                                       HTX API
+                                     Version 1.1
+                             ( Last update: 2021-02-24 )
+                     -----------------------------------------------
+                          Author : Christopher Faulet
+                          Contact : cfaulet at haproxy dot com
+
+1. Background
+
+Historically, HAProxy stored HTTP messages in a raw fashion in buffers, keeping
+parsing information separately in a "struct http_msg" owned by the stream. It
+was optimized for data transfer, but not so much for rewrites. It was also
+HTTP/1 centered. While it was the only HTTP version supported, it was not a
+problem. But with the rise of HTTP/2, it became hard to keep using this
+representation.
+
+In the early days of HTTP/2 in HAProxy, H2 messages were converted into
+H1. This was terribly inefficient because it required two parsing passes, a
+first one in H2 and a second one in H1, with a conversion in the middle. And of
+course, the same was also true in the opposite direction : outgoing H1 messages
+had to be converted back to H2 to be sent. Even worse, because of the H2->H1
+conversion, only client H2 connections were supported.
+
+So, to address all these problems, we decided to replace the old raw
+representation by a version-agnostic and self-structured internal HTTP
+representation, the HTX. As an additional benefit, with this new
+representation, the message parsing and its processing are now separated,
+making all the HTTP analysis simpler and cleaner. The parsing of HTTP messages
+is now handled by the multiplexers (h1 or h2).
+
+
+2. The HTX message
+
+The HTX is a structure containing useful information about an HTTP message
+followed by a contiguous array with some parts of the message. These parts are
+called blocks. A block is composed of metadata (htx_blk) and an associated
+payload. Blocks' metadata are stored starting from the end of the array while
+their payloads are stored at the beginning. Blocks' metadata are often simply
+called blocks. This is a misuse of language that simplifies explanations.
+
+Internally, this structure is "hidden" in a buffer. This way, there are few
+changes in the intermediate layers (stream-interface and channels). They still
+manipulate buffers. Only the multiplexer and the stream have to know how data
+are really stored. From the HTX perspective, a buffer is just a memory
+area. When an HTX message is stored in a buffer, this one appears as full.
+
+ * General view of an HTX message :
+
+
+        buffer->area
+        |
+        |<------------ buffer->size == buffer->data ----------------------|
+        |                                                                 |
+        |     |<------------- Blocks array (htx->size) ------------------>|
+        V     |                                                           |
+        +-----+-----------------+-------------------------+---------------+
+        | HTX |  PAYLOADS ==>   |                         | <== HTX_BLKs  |
+        +-----+-----------------+-------------------------+---------------+
+              |                 |                         |               |
+              |<-payloads part->|<----- free space ------>|<-blocks part->|
+                 (htx->data)
+
+
+The blocks part remains linear and sorted. It may be seen as an array with
+negative indexes. But, instead of using negative indexes, we use positive
+positions to identify a block. This position is then converted to an address
+relative to the beginning of the blocks array.
+
+          tail                                  head
+            |                                    |
+            V                                    V
+    .....--+----+-----------------------+------+------+
+           | Bn | ...                   |  B1  |  B0  |
+    .....--+----+-----------------------+------+------+
+           ^                            ^      ^
+      Addr of the block        Addr of the block    Addr of the block
+      at the position N        at the position 1    at the position 0
+
+
+In the HTX structure, 3 "special" positions are stored :
+
+  - tail  : Position of the newest inserted block
+  - head  : Position of the oldest inserted block
+  - first : Position of the first block to (re)start the analyse
+
+The blocks part never wraps. If we have no space to allocate a new block and if
+there is a hole at the beginning of the blocks part (so at the end of the
+blocks array), we move back all blocks.
+
+
+       tail  head                                  tail  head
+         |    |                                      |    |
+         V    V                                      V    V
+   ...+--------------+---------+   blocks    ...+---------+--------------+
+      | <== HTX_BLKs |         |   defrag        |        | <== HTX_BLKs |
+   ...+--------------+---------+   =====>    ...+---------+--------------+
+
+
+The payloads part is a raw space that may wrap. A block's payload must never be
+accessed directly. Instead a block must be selected to retrieve the address of
+its payload.
+
+
+      +--------------------------------( B0.addr )----------------------+
+      |    +---------------------------( B1.addr )----------------+     |
+      |    |      +--------------------( B2.addr )---------+      |     |
+      V    V      V                                        |      |     |
+   +-----+----+-------+----+--------+-------------+-------+----+----+----+
+   | HTX | P0 |  P1   | P2 | ...==> |             | <=... | B2 | B1 | B0 |
+   +-----+----+-------+----+--------+-------------+-------+----+----+----+
+
+
+Because the payloads part may wrap, there are two usable free spaces :
+
+  - The free space in front of the blocks part. This one is used if and only if
+    the other one was not used yet.
+
+  - The free space at the beginning of the message. Once this one is used, the
+    other one is never used again, until a message defragmentation.
+
+
+ * Linear payloads part :
+
+
+             head_addr                end_addr                tail_addr
+                 |                        |                        |
+                 V                        V                        V
+   +-----+--------------------+-------------+--------------------+-------...
+   | HTX |                    |  PAYLOADS   |                    | HTX_BLKs
+   +-----+--------------------+-------------+--------------------+-------...
+         |<-- free space 2 -->|             |<-- free space 1 -->|
+   (used if the other is too small)          (used in priority)
+
+
+ * Wrapping payloads part :
+
+
+         head_addr  end_addr                  tail_addr
+             |          |                         |
+             V          V                         V
+   +-----+----+----------------+--------+----------------+-------+-------...
+   | HTX |    | PAYLOADS part2 |        | PAYLOADS part1 |       | HTX_BLKs
+   +-----+----+----------------+--------+----------------+-------+-------...
+         |<-->|                |<------>|                |<----->|
+       unusable             free space                  unusable
+      free space                                       free space
+
+
+Finally, when the usable free space is not enough to store a new block, the
+unusable parts may be reclaimed with a full defragmentation. The payloads part
+is then realigned at the beginning of the blocks array and the free space
+becomes continuous again.
+
+
+3. The HTX blocks
+
+An HTX block can be a start-line, a header, a body part or a trailer. For all
+these types of block, a payload is attached to the block. It can also be a
+marker, the end-of-headers or the end-of-trailers. For these blocks, there is
+no payload but it counts for a byte. It is important to not skip it when data
+are forwarded.
+
+As already said, a block is composed of metadata and a payload. Metadata are
+stored in the blocks part and are composed of two fields :
+
+  - info : It is a 32-bit field containing the block's type on 4 bits,
+           followed by the payload length. See below for details.
+
+  - addr : The payload's address, if any, relative to the beginning of the
+           array used to store part of the HTTP message itself.
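+
+In C terms, this metadata may be sketched as follows (a simplified view; see
+include/haproxy/htx-t.h for the authoritative definition) :
+
+    struct htx_blk {
+        uint32_t addr;  /* relative address of the block's payload */
+        uint32_t info;  /* block's type (4 bits) and payload length */
+    };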
+
+
+ * Block's info representation :
+
+     0b 0000 0000 0000 0000 0000 0000 0000 0000
+        ---- ------------------------ ---------
+        type     value (1 MB max)     name length (header/trailer - 256B max)
+             ----------------------------------
+                  data length (256 MB max)
+                  (body, method, path, version, status, reason)
+
+
+Supported types are :
+
+  - 0000 (0)  : The request start-line
+  - 0001 (1)  : The response start-line
+  - 0010 (2)  : A header block
+  - 0011 (3)  : The end-of-headers marker
+  - 0100 (4)  : A data block
+  - 0101 (5)  : A trailer block
+  - 0110 (6)  : The end-of-trailers marker
+  - 1111 (15) : An unused block
+
+Other types are unused for now and reserved for future extensions.
+
+An HTX message is typically composed of the following blocks, in this order :
+
+  - a start-line
+  - zero or more header blocks
+  - an end-of-headers marker
+  - zero or more data blocks
+  - zero or more trailer blocks (optional)
+  - an end-of-trailers marker (optional but always set if there is at least
+    one trailer block)
+
+Only one HTTP request at a time can be stored in an HTX message. For HTTP
+responses, it is more complicated. Only one "final" response can be stored in
+an HTX message. It is a response with a status code of 101, or greater than or
+equal to 200. But it may be preceded by several 1xx informational responses.
+Such responses are part of the same HTX message.
+
+When the end of the message is reached, a special flag is set on the message
+(HTX_FL_EOM). It means no more data are expected for this message, except
+tunneled data. But tunneled data will never be mixed with message data to avoid
+ambiguities. Thus once the flag marking the end of the message is set, it is
+easy to know that the message ends. The end is reached if the HTX message is
+empty or on the tail HTX block in the HTX message. Once all blocks of the HTX
+message are consumed, tunneled data, if any, may be transferred.
+
+
+3.1. The start-line
+
+Every HTX message starts with a start-line. Its payload is a "struct htx_sl".
+In addition to the parts of the HTTP start-line, this structure contains some
+information about the represented HTTP message, mainly in the form of flags
+(HTX_SL_F_*). For instance, if an HTTP message contains the header
+"content-length", then the flag HTX_SL_F_CLEN is set.
+
+Each HTTP message has its own start-line. So an HTX request has one and only
+one start-line because it must contain only one HTTP request at a time. But an
+HTX response may have more than one start-line if the final HTTP response is
+preceded by some 1xx informational responses.
+
+In HTTP/2, there is no start-line. So the H2 multiplexer must create one when
+it converts an H2 message to HTX :
+
+  - For the request, it uses the pseudo headers ":method", ":path" or
+    ":authority" depending on the method and the hardcoded version "HTTP/2.0".
+
+  - For the response, it uses the hardcoded version "HTTP/2.0", the
+    pseudo-header ":status" and an empty reason.
+
+
+3.2. The headers and trailers
+
+HTX headers and trailers are quite similar. Different types are used to
+simplify headers processing. But from the HTX point of view, there is no real
+difference, except their position in the HTX message. The header blocks always
+follow an HTX start-line while trailer blocks come after the data. If there is
+no data, they follow the end-of-headers marker.
+
+Headers and trailers are the only blocks containing a Key/Value payload. The
+corresponding end-of marker must always be placed after each group to mark, as
+its name suggests, the end.
+
+In HTTP/1, trailers are only present on chunked messages. But chunked messages
+do not always have trailers. In this case, the end-of-trailers block may or may
+not be present. Multiplexers must be able to handle both situations. In HTTP/2,
+trailers are only present if a HEADERS frame is sent after DATA frames.
+
+
+3.3. The data
+
+The payload body of an HTTP message is stored as DATA blocks in the HTX
+message. For HTTP/1 messages, it is the message body without the chunks
+formatting, if any. For HTTP/2, it is the payload of DATA frames.
+
+The DATA blocks are the only HTX blocks that may be partially processed (copied
+or removed). All other types of block must be entirely processed. This means
+DATA blocks can be resized.
+
+
+3.4. The end-of markers
+
+These blocks are used to delimit parts of an HTX message. Two markers exist :
+
+  - end-of-headers  (EOH)
+  - end-of-trailers (EOT)
+
+EOH is always present in an HTX message. EOT is optional.
+
+
+4. The HTX API
+
+
+4.1. Get/set HTX message from/to the underlying buffer
+
+The first thing to do to process an HTX message is to get it from the
+underlying buffer. There are two functions to do so, the second one relying on
+the first :
+
+  - htxbuf() returns an HTX message from a buffer. It does not modify the
+    buffer. It only initializes the HTX message if the buffer is empty.
+
+  - htx_from_buf() uses htxbuf(). But it also updates the underlying buffer so
+    that it appears as full.
+
+Both functions return a "zero-sized" HTX message if the buffer is null. This
+way, the HTX message is always valid. The first function is the default
+function to use. The second one is only useful when some content will be
+added. For instance, it is used by the HTX analyzers when HAProxy generates a
+response. Thus, the buffer is in the right state.
+
+Once the processing is done, if the HTX message has been modified, the
+underlying buffer must also be updated, unless htx_from_buf() was used _AND_
+data was only added. For all other cases, the function htx_to_buf() must be
+called.
+
+Finally, the function htx_reset() may be called at any time to reset an HTX
+message. And the function buf_room_for_htx_data() may be called to know if a
+raw buffer is full from the HTX perspective. It is used during conversion
+from/to the HTX.
+
+
+4.2. Helpers to deal with free space in an HTX message
+
+Once with an HTX message, the following functions may help to process it :
+
+  - htx_used_space() and htx_meta_space() return, respectively, the total
+    space used in an HTX message and the space used by block's metadata only.
+
+  - htx_free_space() and htx_free_data_space() return, respectively, the total
+    free space in an HTX message and the free space available for the payload
+    if a new HTX block is stored (so it is the total free space minus the size
+    of an HTX block).
+
+  - htx_is_empty() and htx_is_not_empty() are boolean functions to know if an
+    HTX message is empty or not.
+
+  - htx_get_max_blksz() returns the maximum size available for the payload,
+    not exceeding a maximum, metadata included.
+
+  - htx_almost_full() should be used to know if an HTX message uses at least
+    3/4 of its capacity.
+
+
+4.3. HTX Blocks manipulations
+
+Once the available space in an HTX message is known, the next step is to add
+HTX blocks. First of all, the function htx_nbblks() returns the number of
+blocks allocated in an HTX message. Then, there is an add function per block
+type :
+
+  - htx_add_stline() adds a start-line. The type (request or response) and the
+    flags of the start-line must be provided, as well as its three parts
+    (method, uri, version or version, status-code, reason).
+
+  - htx_add_header() and htx_add_trailer() are similar. The name and the
+    value must be provided. The inserted HTX block is returned on success or
+    NULL if an error occurred.
+
+  - htx_add_endof() must be used to add any end-of marker. The block's type
+    (EOH or EOT) must be specified. The inserted HTX block is returned on
+    success or NULL if an error occurred.
+
+  - htx_add_all_headers() and htx_add_all_trailers() add, respectively, a list
+    of headers and a list of trailers, followed by the appropriate end-of
+    marker. On success, this marker is returned. Otherwise, NULL is
+    returned. Note there is no rollback on the HTX message when an error
+    occurs. Some headers or trailers may have been added. So it is the
+    caller's responsibility to take care of that.
+
+  - htx_add_data() must be used to add a DATA block. Unlike previous
+    functions, this one returns the number of bytes copied or 0 if nothing was
+    copied. If possible, the data are appended to the tail block if it is a
+    DATA block. Only a part of the payload may be copied because this function
+    will try to limit the message defragmentation and the wrapping of blocks
+    as far as possible.
+
+  - htx_add_data_atonce() must be used if all data must be added or nothing.
+    It tries to insert all the payload. This function returns the inserted
+    block on success. Otherwise it returns NULL.
+
+When an HTX block is added, it is always the last one (the tail). But, if a
+block must be added at a specific place, it is not really handy. Two functions
+may help (others could be added) :
+
+  - htx_add_last_data() adds a DATA block just after all other DATA blocks and
+    before any trailers and EOT marker. It relies on htx_add_data_atonce(), so
+    a defragmentation may be performed.
+
+  - htx_move_blk_before() moves a specific block just after another one. Both
+    blocks must already be in the HTX message and the block to move must
+    always be placed after the "pivot".
+
+Once added, there are several functions to update the block's payload :
+
+  - htx_replace_stline() updates a start-line. The HTX block must be passed as
+    argument. Only string parts of the start-line are updated by this
+    function. On success, it returns the new start-line. So it is pretty easy
+    to update its flags. NULL is returned if an error occurred.
+
+  - htx_replace_header() fully replaces a header (its name and its value) by a
+    new one. The HTX block must be passed as argument, as well as its new name
+    and its new value. The new header can be smaller or larger than the old
+    one. This function returns the new HTX block on success, or NULL if an
+    error occurred.
+
+  - htx_replace_blk_value() replaces a part of a block's payload or its
+    totality. It works for HEADERS, TRAILERS or DATA blocks. The HTX block
+    must be provided with the part to remove and the new one. The new part can
+    be smaller or larger than the old one. This function returns the new HTX
+    block on success, or NULL if an error occurred.
+
+  - htx_change_blk_value_len() changes the size of the value. It is the
+    caller's responsibility to change the value itself, make sure there is
+    enough space and update the allocated value. This function updates the HTX
+    message accordingly.
+
+  - htx_set_blk_value_len() changes the size of the value. It is the caller's
+    responsibility to change the value itself, make sure there is enough space
+    and update the allocated value. Unlike the function
+    htx_change_blk_value_len(), this one does not update the HTX message. So
+    it should be used with caution.
+
+  - htx_cut_data_blk() removes <n> bytes from the beginning of a DATA
+    block. The block's start address and its length are adjusted, and the
+    htx's total data count is updated. This is used to mark that part of some
+    data were transferred from a DATA block without removing this DATA
+    block. No sanity check is performed, the caller is responsible for doing
+    this exclusively on DATA blocks, and never removing more than the block's
+    size.
+
+Finally, a block may be removed using the function htx_remove_blk(). This
+function returns the block following the one removed or NULL if it is the tail
+block.
+
+
+4.4. The HTX start-line
+
+Unlike other HTX blocks, the start-line is a bit special because its payload is
+a structure followed by its three parts :
+
+                +--------+-------+-------+-------+
+                | HTX_SL | PART1 | PART2 | PART3 |
+                +--------+-------+-------+-------+
+
+Some macros and functions may help to manipulate these parts :
+
+  - HTX_SL_P{N}_LEN() and HTX_SL_P{N}_PTR() are macros to get the length of a
+    part and a pointer on it. {N} should be 1, 2 or 3.
+
+  - HTX_SL_REQ_MLEN(), HTX_SL_REQ_ULEN(), HTX_SL_REQ_VLEN(),
+    HTX_SL_REQ_MPTR(), HTX_SL_REQ_UPTR() and HTX_SL_REQ_VPTR() are macros to
+    get info about a request start-line. These macros only wrap HTX_SL_P*
+    ones.
+
+  - HTX_SL_RES_VLEN(), HTX_SL_RES_CLEN(), HTX_SL_RES_RLEN(),
+    HTX_SL_RES_VPTR(), HTX_SL_RES_CPTR() and HTX_SL_RES_RPTR() are macros to
+    get info about a response start-line. These macros only wrap HTX_SL_P*
+    ones.
+
+  - htx_sl_p1(), htx_sl_p2() and htx_sl_p3() are functions to get the ist
+    corresponding to the right part of a start-line.
+
+  - htx_sl_req_meth(), htx_sl_req_uri() and htx_sl_req_vsn() get the ist
+    corresponding to the right part of a request start-line.
+
+  - htx_sl_res_vsn(), htx_sl_res_code() and htx_sl_res_reason() get the ist
+    corresponding to the right part of a response start-line.
+
+
+4.5. Iterate on the HTX message
+
+To iterate on an HTX message, the first thing to do is to get the HTX block to
+start the loop. There are three special blocks in an HTX message that may be
+good candidates to start a loop :
+
+  - the head block. It is the oldest inserted block. Multiplexers always start
+    to consume an HTX message from this block. The function htx_get_head()
+    returns its position and htx_get_head_blk() returns the block itself. In
+    addition, the function htx_get_head_type() returns its block's type.
+
+  - the tail block. It is the newest inserted block. The function
+    htx_get_tail() returns its position and htx_get_tail_blk() returns the
+    block itself. In addition, the function htx_get_tail_type() returns its
+    block's type.
+
+  - the first block. It is the block where to (re)start the analyse. It is
+    used as start point by HTX analyzers. The function htx_get_first() returns
+    its position and htx_get_first_blk() returns the block itself. In
+    addition, the function htx_get_first_type() returns its block's type.
+
+For all these functions, if the HTX message is empty, -1 is returned for the
+block's position, NULL instead of a block and HTX_BLK_UNUSED for its type.
+
+Then, to iterate on blocks, forward or backward :
+
+  - htx_get_prev() and htx_get_next() return, respectively, the position of
+    the previous block or the next block, given a specific position. Or -1 if
+    an edge is reached.
+
+  - htx_get_prev_blk() and htx_get_next_blk() return, respectively, the
+    previous block or the next one, given a specific block. Or NULL if an edge
+    is reached.
+
+4.6. Access block content and info
+
+The following functions may be used to retrieve information about a specific
+HTX block :
+
+  - htx_get_blk_pos() returns the position of a block. It must be in the HTX
+    message.
+
+  - htx_get_blk_ptr() returns a pointer on the payload of a block.
+
+  - htx_get_blk_type() returns the type of a block.
+
+  - htx_get_blksz() returns the payload size of a block.
+
+  - htx_get_blk_name() returns the name of a block, only if it is a header or
+    a trailer. Otherwise, it returns an empty string.
+
+  - htx_get_blk_value() returns the value of a block, depending on its
+    type. For header and trailer blocks, it is the value field. For markers
+    (EOH or EOT), an empty string is returned. For other blocks an ist
+    pointing on the block payload is returned.
+
+  - htx_is_unique_blk() may be used to know if a block is the only one
+    remaining inside an HTX message, excluding unused blocks. This function is
+    pretty useful to determine the end of an HTX message, in conjunction with
+    the HTX_FL_EOM flag.
+
+4.7. Advanced functions
+
+Some more advanced functions may be used to do complex processing on the HTX
+message. These functions are used by HTX analyzers or by multiplexers.
+
+  - htx_truncate() removes all blocks after the one containing a specific
+    offset relative to the head block of the HTX message. If the offset is
+    inside a DATA block, it is truncated. For all other blocks, the removal
+    starts at the next block.
+
+  - htx_drain() tries to remove a specific amount of bytes of payload. If the
+    tail block is a DATA block, it may be truncated if necessary. All other
+    blocks are removed at once or kept. This function returns a mixed value,
+    with the first block not removed, or NULL if everything was removed, and
+    the amount of data drained.
+
+  - htx_xfer_blks() transfers HTX blocks from an HTX message to another,
+    stopping on the first block of a specified type or when a specific amount
+    of bytes, including meta-data, was moved. If the tail block is a DATA
+    block, it may be partially moved. All other blocks are transferred at once
+    or kept. This function returns a mixed value, with the last block moved,
+    or NULL if nothing was moved, and the amount of data transferred. When
+    HEADERS or TRAILERS blocks must be transferred, this function transfers
+    all of them. Otherwise, if it is not possible, it triggers an error. It is
+    the caller's responsibility to transfer all headers or trailers at once.
+
+  - htx_append_msg() appends an HTX message to another one. Either the whole
+    message is copied or nothing at all. So, if an error occurs, a rollback is
+    performed. This function returns 1 on success and 0 on error.
+
+  - htx_reserve_max_data() reserves the maximum possible size for an HTX data
+    block, by extending an existing one or by creating a new one. It returns a
+    compound result with the HTX block and the position where new data must be
+    inserted (0 for a new block). If an error occurs or if there is no space
+    left, NULL is returned instead of a pointer on an HTX block.
+ + - htx_find_offset() looks for the HTX block containing a specific offset, + starting at the HTX message's head. The function returns the found HTX + block and the position inside this block where the offset is. If the + offset is outside of the HTX message, NULL is returned. + + - htx_defrag() defragments an HTX message. It removes unused blocks and + unwraps the payloads part. A temporary buffer is used to do so. This + function never fails. A referenced block may be provided. If so, the + corresponding new block is returned. Otherwise, NULL is returned. diff --git a/doc/internals/api/initcalls.txt b/doc/internals/api/initcalls.txt new file mode 100644 index 0000000..30d8737 --- /dev/null +++ b/doc/internals/api/initcalls.txt @@ -0,0 +1,360 @@ +Initialization stages aka how to get your code initialized at the right moment + + +1. Background + +Originally all subsystems were initialized via a dedicated function call +from the huge main() function. Then some code started to become conditional +or a bit more modular and the #ifdef placed there became a mess, resulting +in init code being moved to function constructors in each subsystem's own +file. Then pools of various things were introduced, starting to make the +whole init sequence more complicated due to some forms of internal +dependencies. Later epoll was introduced, requiring a post-fork callback, +and finally threads arrived also requiring some post-thread init/deinit +and allocation, marking the old architecture's last breath. Finally the +whole thing resulted in lots of init code duplication and was simplified +in 1.9 with the introduction of initcalls and initialization stages. + + +2. New architecture + +The new architecture relies on two layers : + - the registration functions + - the INITCALL macros and initialization stages + +The first ones are mostly used to add a callback to a list. The second ones +are used to specify when to call a function. Both are totally independent, +however they are generally combined via another set consisting in the REGISTER +macros which make some registration functions be called at some specific points +during the init sequence. + + +3. Registration functions + +Registration functions never fail. Or more precisely, if they fail it will only +be on out-of-memory condition, and they will cause the process to immediately +exit. As such they do not return any status and the caller doesn't have to care +about their success. + +All available functions are described below in alphanumeric ordering. Please +make sure to respect this ordering when adding new ones. + +- void hap_register_build_opts(const char *str, int must_free) + + This appends the zero-terminated constant string <str> to the list of known + build options that will be reported on the output of "haproxy -vv". A line + feed character ('\n') will automatically be appended after the string when it + is displayed. The <must_free> argument must be zero, unless the string was + allocated by any malloc-compatible function such as malloc()/calloc()/ + realloc()/strdup() or memprintf(), in which case it's better to pass a + non-null value so that the string is freed upon exit. Note that despite the + function's prototype taking a "const char *", the pointer will actually be + cast and freed. The const char* is here to leave more freedom to use consts + when making such options lists. 
+
+- void hap_register_per_thread_alloc(int (*fct)())
+
+  This adds a call to function <fct> to the list of functions to be called when
+  threads are started, at the beginning of the polling loop. This is also valid
+  for the main thread and will be called even if threads are disabled, so that
+  it is guaranteed that this function will be called in any circumstance. Each
+  thread will first call all these functions exactly once when it starts. Calls
+  are serialized by the init_mutex, so that locking is not necessary in these
+  functions. There is no relation between the thread numbers and the callback
+  ordering. The function is expected to return non-zero on success, or zero on
+  failure. A failure will make the process emit a succinct error message and
+  immediately exit. See also hap_register_per_thread_free() for functions
+  called after these ones.
+
+- void hap_register_per_thread_deinit(void (*fct)())
+
+  This adds a call to function <fct> to the list of functions to be called when
+  threads are gracefully stopped, at the end of the polling loop. This is also
+  valid for the main thread and will be called even if threads are disabled, so
+  that it is guaranteed that this function will be called in any circumstance
+  if the process experiences a soft stop. Each thread will call this function
+  exactly once when it stops. However contrary to _alloc() and _init(), the
+  calls are made without any protection, thus if any shared resource is touched
+  by the function, the function is responsible for protecting it. The reason
+  behind this is that such resources are very likely to be still in use in one
+  other thread and that most of the time the functions will in fact only touch
+  a refcount or deinitialize their private resources. See also
+  hap_register_per_thread_free() for functions called after these ones.
+
+- void hap_register_per_thread_free(void (*fct)())
+
+  This adds a call to function <fct> to the list of functions to be called when
+  threads are gracefully stopped, at the end of the polling loop, after all
+  calls to _deinit() callbacks are done for this thread. This is also valid for
+  the main thread and will be called even if threads are disabled, so that it
+  is guaranteed that this function will be called in any circumstance if the
+  process experiences a soft stop. Each thread will call this function exactly
+  once when it stops. However contrary to _alloc() and _init(), the calls are
+  made without any protection, thus if any shared resource is touched by the
+  function, the function is responsible for protecting it. The reason behind
+  this is that such resources are very likely to be still in use in one other
+  thread and that most of the time the functions will in fact only touch a
+  refcount or deinitialize their private resources. See also
+  hap_register_per_thread_deinit() for functions called before these ones.
+
+- void hap_register_per_thread_init(int (*fct)())
+
+  This adds a call to function <fct> to the list of functions to be called when
+  threads are started, at the beginning of the polling loop, right after the
+  list of _alloc() functions. This is also valid for the main thread and will
+  be called even if threads are disabled, so that it is guaranteed that this
+  function will be called in any circumstance. Each thread will call this
+  function exactly once when it starts, and calls are serialized by the
+  init_mutex which is held over all _alloc() and _init() calls, so that locking
+  is not necessary in these functions. In other words for all threads but the
+  current one, the sequence of _alloc() and _init() calls will be atomic. There
+  is no relation between the thread numbers and the callback ordering. The
+  function is expected to return non-zero on success, or zero on failure. A
+  failure will make the process emit a succinct error message and immediately
+  exit. See also hap_register_per_thread_alloc() for functions called before
+  these ones.
+
+- void hap_register_pre_check(int (*fct)())
+
+  This adds a call to function <fct> to the list of functions to be called at
+  the step just before the configuration validity checks. This is useful when
+  you need to create things as would have been done during the configuration
+  parsing, with the initialization then continuing in the configuration check.
+  It could be used for example to generate a proxy with multiple servers using
+  the configuration parser itself. At this step the trash buffers are
+  allocated. Threads are not yet started so no protection is required. The
+  function is expected to return non-zero on success, or zero on failure. A
+  failure will make the process emit a succinct error message and immediately
+  exit.
+
+- void hap_register_post_check(int (*fct)())
+
+  This adds a call to function <fct> to the list of functions to be called at
+  the end of the configuration validity checks, just at the point where the
+  program either forks or exits depending on whether it's called with "-c" or
+  not. Such calls are suited for memory allocation or internal table
+  pre-computation that would preferably not be done on the fly to avoid
+  inducing extra time to a pure configuration check. Threads are not yet
+  started so no protection is required. The function is expected to return
+  non-zero on success, or zero on failure. A failure will make the process
+  emit a succinct error message and immediately exit.
+
+- void hap_register_post_deinit(void (*fct)())
+
+  This adds a call to function <fct> to the list of functions to be called when
+  freeing the global sections at the end of deinit(), after everything is
+  stopped. The process is single-threaded at this point, thus these functions
+  are suitable for releasing configuration elements provided that no other
+  _deinit() function uses them, i.e. only close/release what is strictly
+  private to the subsystem. Since such functions are mostly only called during
+  soft stops (reloads) or failed startups, they tend to experience much less
+  test coverage than others despite being more exposed, and as such a lot of
+  care must be taken to test them especially when facing partial subsystem
+  initializations followed by errors.
+
+- void hap_register_post_proxy_check(int (*fct)(struct proxy *))
+
+  This adds a call to function <fct> to the list of functions to be called for
+  each proxy, after the calls to _post_server_check(). This can allow, for
+  example, to pre-configure default values for an option in a frontend based on
+  the "bind" lines or something in a backend based on the "server" lines. It's
+  worth being aware that such a function must be careful not to waste too much
+  time in order not to significantly slow down configurations with tens of
+  thousands of backends. The function is expected to return non-zero on
+  success, or zero on failure. A failure will make the process emit a succinct
+  error message and immediately exit.
+ +- void hap_register_post_server_check(int (*fct)(struct server *)) + + This adds a call to function <fct> to the list of functions to be called for + each server, after the call to check_config_validity(). This can allow, for + example, to preset a health state on a server or to allocate a protocol- + specific memory area. It's worth being aware that such a function must be + careful not to waste too much time in order not to significantly slow down + configurations with tens of thousands of servers. The function is expected + to return non-zero on success, or zero on failure. A failure will make the + process emit a succinct error message and immediately exit. + +- void hap_register_proxy_deinit(void (*fct)(struct proxy *)) + + This adds a call to function <fct> to the list of functions to be called when + freeing the resources during deinit(). These functions will be called as part + of the proxy's resource cleanup. Note that some of the proxy's fields will + already have been freed and others not, so such a function must not use any + information from the proxy that is subject to being released. In particular, + all servers have already been deleted. Since such functions are mostly only + called during soft stops (reloads) or failed startups, they tend to + experience much less test coverage than others despite being more exposed, + and as such a lot of care must be taken to test them especially when facing + partial subsystem initializations followed by errors. It's worth mentioning + that too slow functions could have a significant impact on the configuration + check or exit time especially on large configurations. + +- void hap_register_server_deinit(void (*fct)(struct server *)) + + This adds a call to function <fct> to the list of functions to be called when + freeing the resources during deinit(). These functions will be called as part + of the server's resource cleanup. Note that some of the server's fields will + already have been freed and others not, so such a function must not use any + information from the server that is subject to being released. Since such + functions are mostly only called during soft stops (reloads) or failed + startups, they tend to experience much less test coverage than others despite + being more exposed, and as such a lot of care must be taken to test them + especially when facing partial subsystem initializations followed by errors. + It's worth mentioning that too slow functions could have a significant impact + on the configuration check or exit time especially on large configurations. + + +4. Initialization stages + +In order to offer some guarantees, the startup of the program is split into +several stages. Some callbacks can be placed into each of these stages using +an INITCALL macro, with 0 to 3 arguments, respectively called INITCALL0 to +INITCALL3. These macros must be placed anywhere at the top level of a C file, +preferably at the end so that the referenced symbols have already been met, +but it may also be fine to place them right after the callbacks themselves. + +Such callbacks are referenced into small structures containing a pointer to the +function and 3 arguments. NULL replaces unused arguments. The callbacks are +cast to (void (*)(void *, void *, void *)) and the arguments to (void *). + +The first argument to the INITCALL macro is the initialization stage. The +second one is the callback function, and others if any are the arguments. 
+
+
+4. Initialization stages
+
+In order to offer some guarantees, the startup of the program is split into
+several stages. Some callbacks can be placed into each of these stages using
+an INITCALL macro, with 0 to 3 arguments, respectively called INITCALL0 to
+INITCALL3. These macros must be placed anywhere at the top level of a C
+file, preferably at the end so that the referenced symbols have already been
+met, but it may also be fine to place them right after the callbacks
+themselves.
+
+Such callbacks are referenced into small structures containing a pointer to
+the function and 3 arguments. NULL replaces unused arguments. The callbacks
+are cast to (void (*)(void *, void *, void *)) and the arguments to
+(void *).
+
+The first argument to the INITCALL macro is the initialization stage. The
+second one is the callback function, and the others, if any, are the
+arguments. The init stage must be among the values of the "init_stage"
+enum, currently defined in the following execution order:
+
+  - STG_PREPARE  : used to preset variables, pre-initialize lookup tables
+                   and pre-initialize list heads
+  - STG_LOCK     : used to pre-initialize locks
+  - STG_REGISTER : used to register static lists such as keywords
+  - STG_ALLOC    : used to allocate the required structures
+  - STG_POOL     : used to create pools
+  - STG_INIT     : used to initialize subsystems
+
+Each stage is guaranteed that previous stages have successfully completed.
+This means that an INITCALL placed at stage STG_INIT is guaranteed that all
+pools were already created and will be usable. Conversely, an INITCALL
+placed at stage STG_REGISTER must not rely on any field that requires
+preliminary allocation nor initialization. A callback cannot rely on other
+callbacks of the same stage, as the execution order within a stage is
+undefined and essentially depends on the linking order.
+
+The STG_REGISTER level is made for run-time linking of the various modules
+that compose the executable. Keywords, protocols and various other elements
+that are only known locally to each compilation unit will be appended into
+common lists at boot time. This is why this stage is placed just before
+STG_ALLOC.
+
+Example: register a very early call to init_log() with no argument, and
+         another call to cli_register_kw(&cli_kws) much later:
+
+    INITCALL0(STG_PREPARE, init_log);
+    INITCALL1(STG_REGISTER, cli_register_kw, &cli_kws);
+
+Technically speaking, each call to such a macro adds a distinct local symbol
+whose dynamic name involves the line number. These symbols are placed into a
+separate section and the beginning and end section pointers are provided by
+the linker. When too old a linker is used, a fallback is applied, consisting
+in placing them into a linked list which is built by a constructor function
+for each initcall (this takes more room).
+
+Due to the symbols internally using the line number, it is very important
+not to place more than one INITCALL per line in the source file.
+
+It is also strongly recommended that functions and referenced arguments are
+static symbols local to the source file, unless they are global registration
+functions like in the example above with cli_register_kw(), where only the
+argument is a local keywords table.
+
+INITCALLs do not expect the callback function to return anything and as such
+do not perform any error check. As such, they are very similar to
+constructors offered by the compiler, except that they are segmented in
+stages. It is thus the responsibility of the called functions to perform
+their own error checking and to exit in case of error. This may change in
+the future.
+
+
+5. REGISTER family of macros
+
+The association of INITCALLs and registration functions allows to perform
+some early dynamic registration of functions to be used anywhere, as well as
+values to be added to existing lists without having to manipulate list
+elements. For the sake of simplification, these combinations are available
+as a set of REGISTER macros which register calls to certain functions at the
+appropriate init stage. Such macros must be used at the top level in a file,
+just like INITCALL macros. The following macros are currently supported.
+Please keep them alphanumerically ordered:
+
+- REGISTER_BUILD_OPTS(str)
+
+  Adds the constant string <str> to the list of build options. This is done
+  by registering a call to hap_register_build_opts(str, 0) at stage
+  STG_REGISTER.
+  The string will not be freed.
+
+- REGISTER_CONFIG_POSTPARSER(name, parser)
+
+  Adds a call to function <parser> at the end of the config parsing. The
+  function is called at the very end of check_config_validity() and may be
+  used to initialize a subsystem based on global settings for example. This
+  is done by registering a call to cfg_register_postparser(name, parser) at
+  stage STG_REGISTER.
+
+- REGISTER_CONFIG_SECTION(name, parse, post)
+
+  Registers a new config section name <name> which will be parsed by
+  function <parse> (if not null), and with an optional call to function
+  <post> at the end of the section. Function <parse> must be of type
+  (int (*parse)(const char *file, int linenum, char **args, int inv)), and
+  returns 0 on success or an error code among the ERR_* set on failure. The
+  <post> callback takes no argument and returns a similar error code. This
+  is achieved by registering a call to cfg_register_section() with the three
+  arguments at stage STG_REGISTER. An example is given after this list.
+
+- REGISTER_PER_THREAD_ALLOC(fct)
+
+  Registers a call to register_per_thread_alloc(fct) at stage STG_REGISTER.
+
+- REGISTER_PER_THREAD_DEINIT(fct)
+
+  Registers a call to register_per_thread_deinit(fct) at stage STG_REGISTER.
+
+- REGISTER_PER_THREAD_FREE(fct)
+
+  Registers a call to register_per_thread_free(fct) at stage STG_REGISTER.
+
+- REGISTER_PER_THREAD_INIT(fct)
+
+  Registers a call to register_per_thread_init(fct) at stage STG_REGISTER.
+
+- REGISTER_POOL(ptr, name, size)
+
+  Used internally to declare a new pool. This is made by calling function
+  create_pool_callback() with these arguments at stage STG_POOL. Do not use
+  it directly, use either DECLARE_POOL() or DECLARE_STATIC_POOL() instead.
+
+- REGISTER_PRE_CHECK(fct)
+
+  Registers a call to register_pre_check(fct) at stage STG_REGISTER.
+
+- REGISTER_POST_CHECK(fct)
+
+  Registers a call to register_post_check(fct) at stage STG_REGISTER.
+
+- REGISTER_POST_DEINIT(fct)
+
+  Registers a call to register_post_deinit(fct) at stage STG_REGISTER.
+
+- REGISTER_POST_PROXY_CHECK(fct)
+
+  Registers a call to register_post_proxy_check(fct) at stage STG_REGISTER.
+
+- REGISTER_POST_SERVER_CHECK(fct)
+
+  Registers a call to register_post_server_check(fct) at stage STG_REGISTER.
+
+- REGISTER_PROXY_DEINIT(fct)
+
+  Registers a call to register_proxy_deinit(fct) at stage STG_REGISTER.
+
+- REGISTER_SERVER_DEINIT(fct)
+
+  Registers a call to register_server_deinit(fct) at stage STG_REGISTER.
+
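+As an example, here is what the registration of a hypothetical "my-section"
+configuration section could look like, following the prototype given for
+REGISTER_CONFIG_SECTION() above (the section and function names are
+illustrative only):
+
+    static int parse_my_section(const char *file, int linenum,
+                                char **args, int inv)
+    {
+        /* args[0] contains either the section name on its opening line,
+         * or the first word of any keyword line inside the section.
+         */
+        ...
+        return 0;   /* or a combination of ERR_* codes upon failure */
+    }
+
+    REGISTER_CONFIG_SECTION("my-section", parse_my_section, NULL);
+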
diff --git a/doc/internals/api/ist.txt b/doc/internals/api/ist.txt
new file mode 100644
index 0000000..0f118d6
--- /dev/null
+++ b/doc/internals/api/ist.txt
@@ -0,0 +1,167 @@
+2021-11-08 - Indirect Strings (IST) API
+
+
+1. Background
+-------------
+
+When parsing traffic, most of the standard C string functions are unusable
+since they rely on a trailing zero. In addition, for the rare ones that
+support a length, we have to constantly maintain both the pointer and the
+length. But then, it's easy to come up with complex length and offset
+calculations all over the place, rendering the code hard to read and bugs
+hard to avoid or spot.
+
+IST provides a solution to this by defining a structure made of exactly two
+word size elements, that most C ABIs know how to handle as a register when
+used as a function argument or a function's return value. The functions are
+inlined to leave a maximum set of opportunities to the compiler for
+optimization and expression reduction, and as a result they are often
+inexpensive to use. It is important however to keep in mind that all of
+these are designed for minimal code size when dealing with short strings
+(i.e. parsing tokens in protocols), and they are not optimal for processing
+large blocks.
+
+
+2. API description
+------------------
+
+ISTs are defined like this:
+
+    struct ist {
+        char *ptr;  // pointer to the string's first byte
+        size_t len; // number of valid bytes starting from ptr
+    };
+
+A string is not set if its ->ptr member is NULL. In this case .len is
+undefined and is recommended to be zero.
+
+Declaring a function returning an IST:
+
+    struct ist produce_ist(int ok)
+    {
+        return ok ? IST("OK") : IST("KO");
+    }
+
+Declaring a function consuming an IST:
+
+    void say_ist(struct ist i)
+    {
+        write(1, istptr(i), istlen(i));
+    }
+
+Chaining the two:
+
+    void say_ok(int ok)
+    {
+        say_ist(produce_ist(ok));
+    }
+
+Notes:
+  - the arguments are passed by value, not by reference, so there's no need
+    for any "const" in their declaration (except to catch coding mistakes).
+    Pointers to ist may benefit from being marked "const" however.
+
+  - similarly for the return value, there's no point in marking it "const"
+    as this would protect the pointer and length, not the data.
+
+  - use ist0() to append a trailing zero to a variable string for use with
+    printf()'s "%s" format, or for use with functions that work on
+    NUL-terminated strings, but beware of not doing this with constants.
+
+  - the API provides a starting pointer and current length, but does not
+    provide an allocated size. It remains up to the caller to know how large
+    the allocated area is when adding data, though most functions make this
+    easy.
+
+The following macros and functions are defined. Those whose name starts with
+underscores require special care and must not be used without being certain
+they are properly used (typically subject to buffer overflows if misused).
+Note that most functions were added over time depending on instant needs,
+and some are very close to each other. Many useful functions are still
+missing and would deserve being added.
+
+Below, arguments "i1","i2" are all of type "ist". Arguments "s" are
+NUL-terminated strings of type "char*", and "cs" are of type "const char *".
+Arguments "c" are of type "char", and "n" are of type size_t.
+
+  IST(cs):ist            make constant IST from a NUL-terminated const string
+  IST_NULL:ist           return an unset IST = ist2(NULL,0)
+  __istappend(i1,c):ist  append character <c> at the end of ist <i1>
+  ist(s):ist             return an IST from a nul-terminated string
+  ist0(i1):char*         write a \0 at the end of an IST, return the string
+  ist2(cs,l):ist         return a variable IST from a const string and length
+  ist2bin(s,i1):ist      copy IST into a buffer, return the result
+  ist2bin_lc(s,i1):ist   like ist2bin() but turning to lower case
+  ist2bin_uc(s,i1):ist   like ist2bin() but turning to upper case
+  ist2str(s,i1):ist      copy IST into a buffer, add NUL and return the result
+  ist2str_lc(s,i1):ist   like ist2str() but turning to lower case
+  ist2str_uc(s,i1):ist   like ist2str() but turning to upper case
+  ist_find(i1,c):ist     return first occurrence of char <c> in <i1>
+  ist_find_ctl(i1):char* return pointer to first CTL char in <i1> or NULL
+  ist_skip(i1,c):ist     return first occurrence of char not <c> in <i1>
+  istadv(i1,n):ist       advance the string by <n> characters
+  istalloc(n):ist        return allocated string of zero initial length
+  istcat(d,s,n):ssize_t  copy <s> after <d> for <n> chars max, return len or -1
+  istchr(i1,c):char*     return pointer to first occurrence of <c> in <i1>
+  istclear(i1*):size_t   return previous size and set size to zero
+  istcpy(d,s,n):ssize_t  copy <s> over <d> for <n> chars max, return len or -1
+  istdiff(i1,i2):int     return the ordinal difference, like strcmp()
+  istdup(i1):ist         allocate new ist and copy original one into it
+  istend(i1):char*       return pointer to first character after the IST
+  isteq(i1,i2):int       return non-zero if strings are equal
+  isteqi(i1,i2):int      like isteq() but case-insensitive
+  istfree(i1*)           free the allocated <i1> and set it to IST_NULL
+  istissame(i1,i2):int   return true if pointers and lengths are equal
+  istist(i1,i2):ist      return first occurrence of <i2> in <i1>
+  istlen(i1):size_t      return the length of the IST (number of characters)
+  istmatch(i1,i2):int    return non-zero if i1 starts like i2 (empty OK)
+  istmatchi(i1,i2):int   like istmatch() but case insensitive
+  istneq(i1,i2,n):int    like isteq() but limited to the first <n> chars
+  istnext(i1):ist        return the IST advanced by one character
+  istnmatch(i1,i2,n):int like istmatch() but limited to the first <n> chars
+  istpad(s,i1):ist       copy IST into a buffer, add a NUL, return the result
+  istptr(i1):char*       return the starting pointer of the IST
+  istscat(d,s,n):ssize_t same as istcat() but always place a NUL at the end
+  istscpy(d,s,n):ssize_t same as istcpy() but always place a NUL at the end
+  istshift(i1*):char     return the first character and advance the IST by one
+  istsplit(i1*,c):ist    return part before <c>, make ist start from <c>
+  iststop(i1,c):ist      truncate ist before first occurrence of <c>
+  isttest(i1):int        return true if ist is not NULL, false otherwise
+  isttrim(i1,n):ist      return ist trimmed to no more than <n> characters
+  istzero(i1,n):ist      trim to <n> chars, trailing zero included.
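+
+Example: a minimal sketch combining a few of the functions above to strip
+         an optional "www." prefix from a host name (illustrative only):
+
+    struct ist strip_www(struct ist host)
+    {
+        if (istmatch(host, ist("www.")))
+            host = istadv(host, 4);   /* skip the 4-character prefix */
+        return host;
+    }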
+
+
+3. Quick index by typical C construct or function
+-------------------------------------------------
+
+Some common C constructs may be adjusted to use ist instead. The mapping is
+not always one-to-one, but usually the computations on the length part tend
+to disappear in the refactoring, allowing function calls to be chained
+directly. The entries below are hints about which function to look for in
+order to rewrite some common use cases.
+
+  char*                  IST equivalent
+
+  strchr()               istchr(), ist_find(), iststop()
+  strstr()               istist()
+  strcpy()               istcpy()
+  strscpy()              istscpy()
+  strlcpy()              istscpy()
+  strcat()               istcat()
+  strscat()              istscat()
+  strlcat()              istscat()
+  strcmp()               istdiff()
+  strdup()               istdup()
+  !strcmp()              isteq()
+  !strncmp()             istneq(), istmatch(), istnmatch()
+  !strcasecmp()          isteqi()
+  !strncasecmp()         istneqi(), istmatchi()
+  strtok()               istsplit()
+  return NULL            return IST_NULL
+  s = malloc()           s = istalloc()
+  free(s); s = NULL      istfree(&s)
+  p != NULL              isttest(p)
+  c = *(p++)             c = istshift(p)
+  *(p++) = c             __istappend(p, c)
+  p += n                 istadv(p, n)
+  p + strlen(p)          istend(p)
+  p[max] = 0             isttrim(p, max)
+  p[max+1] = 0           istzero(p, max)
diff --git a/doc/internals/api/layers.txt b/doc/internals/api/layers.txt
new file mode 100644
index 0000000..b5c35f4
--- /dev/null
+++ b/doc/internals/api/layers.txt
@@ -0,0 +1,190 @@
+2022-05-27 - Stream layers in HAProxy 2.6
+
+
+1. Background
+
+There are streams at many levels in haproxy, essentially due to the
+introduction of multiplexed protocols which provide high-level streams on
+top of low-level streams, themselves either based on stream-oriented
+protocols or datagram-oriented protocols.
+
+The refactoring of the appctx and muxes that allowed to drop a lot of
+duplicate code between 2.5 and 2.6-dev6 raised another concern with some
+entities like "conn_stream" that were not specific to connections anymore,
+"endpoints" that became entities on their own, and "targets" whose life had
+been extended to last all along a connection.
+
+It was time to rename all such legacy entities introduced in 1.8 and which
+had turned particularly confusing over time as their roles evolved.
+
+
+2. Naming principles
+
+The global renaming of some entities between streams and connections was
+articulated around several principles:
+
+  - avoid the confusing use of "context" in shared places. For example, the
+    endpoint's connection is in "ctx" and nothing makes it obvious that the
+    endpoint's context is a connection, especially when an applet is there.
+
+  - reserve relative nouns for pointers and not for types. "endpoint", just
+    like "owner" or "peer" is relative, but when accessed from a different
+    layer it starts to make no sense at all, or to make one believe it's
+    something else, particularly with void*.
+
+  - avoid too generic terms that have multiple meanings, or words that are
+    synonyms in a same place (e.g. "peer" and "remote", or "endpoint" and
+    "target"). If two synonyms are needed to designate two distinct
+    entities, there's probably a problem elsewhere, or the problem is
+    poorly defined.
+
+  - make it clearer that all that is manipulated is related to streams. This
+    is particularly important in sample fetch functions for example, which
+    tend to require low-level access and could be misled into following the
+    wrong chain when trying to get information about a connection.
+
+  - use easily spellable short names that abbreviate unambiguously when used
+    together in adjacent contexts
+
+
+3. Current state as of 2.6
+
+- when a name is required to designate the lower block that starts at the
+  mux stream or the appctx, it is spoken of as a "stream endpoint", and
+  abbreviated "se". It's okay because while "endpoint" itself is relative,
+  "stream endpoint" unequivocally designates one extremity of a stream. If a
+  type is needed for this in the future (e.g. via obj_type), then the type
+  "stendp" may be used. Before 2.6-dev6 there was no name for this, it was
+  known as conn_stream->ctx.
+
+- the 2.6-dev6 cs_endpoint which preserves the state of a mux stream or an
+  appctx and abstracts them in front of a conn_stream becomes a "stream
+  endpoint descriptor", of type "sedesc" and often abbreviated "sd", "sed"
+  or "ed". Its "target" pointer became "se" as per the rule above. Before
+  2.6-dev6, these elements were mixed with others inside conn_stream. From
+  the appctx it's called "sedesc" (few occurrences hence long name OK).
+
+- the conn_stream which is always attached to either a stream or a health
+  check and that is used to reach a mux or an applet becomes a "stream
+  connector" of type "stconn", generally abbreviated "sc". Its "endp"
+  pointer becomes "sedesc" as per the rule above, and that one has a back
+  pointer "sc". The stream uses "scf" and "scb" as the respective front and
+  back pointers to the stconns. Prior to 2.6-dev6, these parts were split
+  between conn_stream and stream_interface.
+
+- the sedesc's "ctx", which is solely used to store the connection as of
+  now, is renamed "conn" to avoid any doubt in the context of applets or
+  even muxes. In the future the connection should be attached to the "se"
+  instead and this pointer should disappear (or be recycled for anything
+  else).
+
+The new 2.6 model looks like this:
+
+              +------------------------+
+              | stream or health check |
+              +------------------------+
+                       ^    \ scf, scb
+                      /      \
+                      |      |
+                      \      /
+                  app  \    v
+                   +----------+
+                   |  stconn  |
+                   +----------+
+                       ^    \ sedesc
+                      /      \
+      . . . . . . . .|. . . . |. . . . . .  split point (retries etc)
+                      \      /
+                  sc   \    v
+                   +----------+
+      flags  <-----|  sedesc  |          :  sedesc  :
+                   +----------+   ...    +----------+
+              conn /  ^    \ se            ^     \
+    +------------+/  /      \              |      \
+    | connection |<-'  |      | ... OR ...  |       |
+    +------------+     \      /             \       |
+     mux|  ^  |ctx  sd  \    v   : sedesc    \     v
+        |  |  |  +----------------------+  \          #  +----------+ svcctx
+        |  |  |  | mux stream or appctx |   |         #  |  appctx  |--.
+        |  |  |  +----------------------+   |         #  +----------+  |
+        |  |  |        ^          |        /  private #  :          :  |
+        v  |  |        |          v        >  to the  #  +----------+  |
+   mux_ops |  |  +----------------+        \  mux     #  |  svcctx  |<-'
+           |  +->| mux connection |         )         #  +----------+
+           +---- +----------------+        /          #
+
+Stream descriptors may exist in the following modes:
+  - .conn = NULL,   .se = NULL     : backend, no connection attempt yet
+  - .conn = NULL,   .se = <appctx> : frontend or backend, applet
+  - .conn = <conn>, .se = NULL     : backend, connection in progress
+  - .conn = <conn>, .se = <muxs>   : frontend or backend, connected
+
+Notes:
+  - for historical reasons (connect, forced protocol upgrades, etc), during
+    a connection setup or a rule-based protocol upgrade, the connection's
+    "ctx" may temporarily point to the stconn
+
+
+4. Invariants and cardinalities
+
+Usually a stream is created from an existing stconn from a mux or some
+applets, but may also be allocated first by other applet schedulers. After
+stream_new() a stream always has exactly one stconn per side (scf, scb),
+each of which has one ->sedesc. Each side is initialized with either one or
+no stream endpoint attached to the descriptor.
+
+Both applets and a mux stream always have a stream endpoint descriptor. AS
+SUCH IT IS NEVER NECESSARY TO TEST FOR THE EXISTENCE OF THE SEDESC FROM ANY
+SIDE, IT ALWAYS EXISTS. This explains why, as much as possible, it's
+preferable to use the sedesc to access flags and statuses from any side,
+rather than bouncing via the stconn.
+
+An applet's app layer is always a stream, which means that there are always
+channels accessible above, and there is always an opposite stream connector
+and a stream endpoint descriptor. As such, it's always safe for an applet to
+access the other side using sc_opposite().
+
+When an outgoing connection is in the process of being established, the
+backend side sedesc has its ->conn pointer pointing to the pending
+connection, and no ->se. Once the connection is established and a mux is
+chosen, it's attached to the ->se. If an applet is used instead of a mux,
+the appctx is attached to the sedesc's ->se and ->conn remains NULL.
+
+If either side wants to detach from the other, it must allocate a new virgin
+sedesc to replace the existing one, and leave the existing one to the
+endpoint, since it continues to describe the stream endpoint. The stconn
+keeps its state (modulo the updates related to the disconnection). The
+previous sedesc points to a NULL stconn. For example, disconnecting from a
+backend mux will leave the entities like this:
+
+              +------------------------+
+              | stream or health check |
+              +------------------------+
+                       ^    \ scf, scb
+                      /      \
+                      |      |
+                      \      /
+                  app  \    v
+                   +----------+
+                   |  stconn  |
+                   +----------+
+                       ^    \ sedesc
+                      /      \
+            NULL     |       |
+              ^       \      /
+           sc |    sc  \    v
+         +----------+   /  +----------+
+ flags <-| sedesc1  | . . .| sedesc2  |--> flags
+         +----------+   /  +----------+
+     conn /  ^   \ se  /  conn /   \ se
+   +------------+/    /       |     |
+   | connection |<--'  |      v     v
+   +------------+  \  /     NULL  NULL
+    mux|  ^  |ctx sd \v
+       |  |  |  +----------------------+
+       |  |  |  | mux stream or appctx |
+       |  |  |  +----------------------+
+       |  |  |        ^          |
+       v  |  |        |          v
+  mux_ops |  |  +----------------+
+          |  +->| mux connection |
+          +---- +----------------+
+
diff --git a/doc/internals/api/list.txt b/doc/internals/api/list.txt
new file mode 100644
index 0000000..d03cf03
--- /dev/null
+++ b/doc/internals/api/list.txt
@@ -0,0 +1,195 @@
+2021-11-09 - List API
+
+
+1. Background
+-------------
+
+HAProxy's lists are almost all doubly-linked and circular so that it is
+always possible to insert at the beginning, append at the end, scan them in
+any order and delete any element without having to scan to search the
+predecessor nor the successor.
+
+A list's head is just a regular list element, and an element always points
+to another list element. Such elements only have two pointers, the next and
+the previous elements. The object being pointed to is retrieved by
+subtracting the list element's offset in its structure from the list
+element's pointer. This way there is no need for any separate allocation for
+the list element, for a pointer to the object in the list, nor for a pointer
+to the list element from the object, as the list is embedded into the
+object.
+
+All basic operations are provided, as well as some iterators. Some iterators
+are safe for removal of the current element within the loop, others not. In
+any case a list cannot be freely modified while iterating over it (e.g. the
+current element's successor cannot be freed if it's saved as the restart
+point).
+
+Extreme care is taken nowadays in HAProxy to make sure that no dangling
+pointers are left in elements, so it is important to always initialize list
+heads and list elements, as well as elements that are removed from a list if
+they are not immediately freed, so that their deletion is idempotent. A rule
+of thumb is that a list pointer's validity never has to be checked: it is
+always valid to dereference it.
A lot of complex bugs have been caused in the past
+by incorrect list manipulation, such as an element being deleted twice,
+resulting in damage to previously adjacent elements' neighbours. This
+usually has serious consequences at locations that are totally different
+from the one of the bug, and that are only detected much later, so it is
+required to be particularly strict on using lists safely.
+
+The lists are not thread-safe, but mt_lists may be used instead.
+
+
+2. API description
+------------------
+
+A list is defined like this, both for the list's head, and for any other
+element:
+
+    struct list {
+        struct list *n; /* next */
+        struct list *p; /* prev */
+    };
+
+An empty list points to itself for both pointers. I.e. a list's head is both
+its own successor and its own predecessor. This guarantees that insertions
+and deletions can be done without any check and that deletion is idempotent.
+For this reason and by convention, a detached element ought to be
+represented like an empty head.
+
+Lists are manipulated using a set of macros which are used to initialize,
+add, remove, or iterate over elements. Most of these macros are extremely
+simple and are not even protected against multiple evaluation, so it is
+fundamentally important that the expressions used in the arguments are
+idempotent and that the result does not depend on the evaluation order of
+the arguments.
+
+Macro                  Description
+
+ILH
+        Initialized List Head: this is a non-NULL, non-empty list element
+        used to prevent the compiler from moving an empty list head
+        declaration to BSS, typically when it appears in an array of
+        keywords. Without this, some older versions of gcc tend to trim the
+        whole array and cause corruption.
+
+LIST_INIT(l)
+        Initialize the list as an empty list head
+
+LIST_HEAD_INIT(l)
+        Return a valid initialized empty list head pointing to this
+        element. Essentially used with assignments in declarations.
+
+LIST_INSERT(l, e)
+        Add an element at the beginning of a list and return it
+
+LIST_APPEND(l, e)
+        Add an element at the end of a list and return it
+
+LIST_SPLICE(n, o)
+        Add the contents of a list <o> at the beginning of another list <n>.
+        The old list head remains untouched.
+
+LIST_SPLICE_END_DETACHED(n, o)
+        Add the contents of a list whose first element is <o> and last one
+        is <o->p> at the end of another list <n>. The old list DOES NOT
+        have any head here.
+
+LIST_DELETE(e)
+        Remove an element from a list and return it. Safe to call on
+        initialized elements, but will not change the element itself so it
+        is not idempotent. Consider using LIST_DEL_INIT() instead unless
+        the element is freed immediately afterwards.
+
+LIST_DEL_INIT(e)
+        Remove an element from a list, initialize it and return it so that
+        a subsequent LIST_DELETE() is safe. This is faster than performing
+        a LIST_DELETE() followed by a LIST_INIT() as pointers are not
+        reloaded.
+
+LIST_ELEM(l, t, m)
+        Return a pointer of type <t> to a structure containing a list head
+        member called <m> at address <l>. Note that <l> can be the result
+        of a function or macro since it's used only once.
+
+LIST_ISEMPTY(l)
+        Check if the list head <l> is empty (=initialized) or not, and
+        return non-zero only if so.
+
+LIST_INLIST(e)
+        Check if the list element <e> was added to a list or not, thus
+        return true unless the element was initialized.
+
+LIST_INLIST_ATOMIC(e)
+        Atomically check if the list element's next pointer points to
+        anything different from itself, implying the element should be part
+        of a list.
This is usually similar to LIST_INLIST()
+        except that, while that one might be instrumented using debugging
+        code to perform further consistency checks, this macro is
+        guaranteed to always perform a single atomic test and is safe to
+        use with barriers.
+
+LIST_NEXT(l, t, m)
+        Return a pointer of type <t> to a structure following the element
+        which contains list head <l>, which is known as member <m> in
+        struct <t>.
+
+LIST_PREV(l, t, m)
+        Return a pointer of type <t> to a structure preceding the element
+        which contains list head <l>, which is known as member <m> in
+        struct <t>. Note that this macro is first undefined as it happened
+        to already exist on some old OSes.
+
+list_for_each_entry(i, l, m)
+        Iterate local variable <i> through a list of items of type
+        "typeof(*i)" which are linked via a "struct list" member named <m>.
+        A pointer to the head of the list is passed in <l>. No temporary
+        variable is needed. Note that <i> must not be modified during the
+        loop.
+
+list_for_each_entry_from(i, l, m)
+        Same as list_for_each_entry() but starting from the current value
+        of <i> instead of the list's head.
+
+list_for_each_entry_from_rev(i, l, m)
+        Same as list_for_each_entry_rev() but starting from the current
+        value of <i> instead of the list's head.
+
+list_for_each_entry_rev(i, l, m)
+        Iterate backwards local variable <i> through a list of items of
+        type "typeof(*i)" which are linked via a "struct list" member named
+        <m>. A pointer to the head of the list is passed in <l>. No
+        temporary variable is needed. Note that <i> must not be modified
+        during the loop.
+
+list_for_each_entry_safe(i, b, l, m)
+        Iterate variable <i> through a list of items of type "typeof(*i)"
+        which are linked via a "struct list" member named <m>. A pointer to
+        the head of the list is passed in <l>. A temporary backup variable
+        <b> of same type as <i> is needed so that <i> may safely be deleted
+        if needed. Note that it is only permitted to delete <i> and no
+        other element during this operation! An example is given after this
+        list of macros.
+
+list_for_each_entry_safe_from(i, b, l, m)
+        Same as list_for_each_entry_safe() but starting from the current
+        value of <i> instead of the list's head.
+
+list_for_each_entry_safe_from_rev(i, b, l, m)
+        Same as list_for_each_entry_safe_rev() but starting from the
+        current value of <i> instead of the list's head.
+
+list_for_each_entry_safe_rev(i, b, l, m)
+        Iterate backwards local variable <i> through a list of items of
+        type "typeof(*i)" which are linked via a "struct list" member named
+        <m>. A pointer to the head of the list is passed in <l>. A
+        temporary variable <b> of same type as <i> is needed so that <i>
+        may safely be deleted if needed. Note that it is only permitted to
+        delete <i> and no other element during this operation!
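+
+Example: a minimal sketch tying some of these macros together; the
+         structure and function names are illustrative only:
+
+    struct item {
+        struct list list;   /* list element embedded into the object */
+        int value;
+    };
+
+    static struct list items = LIST_HEAD_INIT(items);
+
+    static void add_item(struct item *it)
+    {
+        LIST_APPEND(&items, &it->list);
+    }
+
+    static void purge_negative_items()
+    {
+        struct item *it, *back;
+
+        /* the _safe variant permits deleting the current element */
+        list_for_each_entry_safe(it, back, &items, list) {
+            if (it->value < 0) {
+                /* plain LIST_DELETE() is fine since we free right away */
+                LIST_DELETE(&it->list);
+                free(it);
+            }
+        }
+    }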
+
+
+3. Notes
+--------
+
+- This API is quite old and some macros are missing. For example there's
+  still no list_first() so it's common to use LIST_ELEM(head->n, ...)
+  instead. Some older parts of the code also used to rely on
+  list_for_each() followed by a break to stop on the first element.
+
+- Some parts were recently renamed because LIST_ADD() used to do what
+  LIST_INSERT() currently does and was often mistaken with LIST_ADDQ()
+  which is what LIST_APPEND() now is. As such it is not totally impossible
+  that some places use a LIST_INSERT() where a LIST_APPEND() would be
+  desired.
+
+- The structure must not be modified at all (even to add debug info). Some
+  parts of the code assume that its layout is exactly this one, particularly
+  the parts ensuring the casting between MT lists and lists.
diff --git a/doc/internals/api/pools.txt b/doc/internals/api/pools.txt
new file mode 100644
index 0000000..2c54409
--- /dev/null
+++ b/doc/internals/api/pools.txt
@@ -0,0 +1,577 @@
+2022-02-24 - Pools structure and API
+
+1. Background
+-------------
+
+Memory allocation is a complex problem covered by a massive amount of
+literature. Memory allocators found in the field cover a broad spectrum of
+capabilities, performance, fragmentation, efficiency etc.
+
+The main difficulty of memory allocation comes from finding the optimal
+chunks for arbitrary sized requests, that will still preserve a low
+fragmentation level. Doing this well is often expensive in CPU usage and/or
+memory usage.
+
+In programs like HAProxy that deal with a large number of fixed size
+objects, there is no point having to endure all this risk of fragmentation,
+and the associated costs (sometimes up to several milliseconds with certain
+minimalist allocators) are simply not acceptable. A better approach consists
+in grouping frequently used objects by size, knowing that due to the high
+repetitiveness of operations, a freed object will immediately be needed for
+another operation.
+
+This grouping of objects by size is what is called a pool. Pools are created
+for certain frequently allocated objects, are usually merged together when
+they are of the same size (or almost the same size), and significantly
+reduce the number of calls to the memory allocator.
+
+With the arrival of threads, pools started to become a bottleneck so they
+now implement an optional thread-local lockless cache. Finally with the
+arrival of really efficient memory allocators in modern operating systems,
+the shared part has also become optional so that it doesn't consume memory
+if it does not bring any value.
+
+In 2.6-dev2, a number of debugging options that used to be configured at
+build time only moved to boot time and can be modified using keywords passed
+after "-dM" on the command line, which sets or clears bits in the
+pool_debugging variable. The build-time options still affect the default
+settings however. Default values may be consulted using "haproxy -dMhelp".
+
+
+2. Principles
+-------------
+
+The pools architecture is selected at build time. The main options are:
+
+  - thread-local caches and process-wide shared pool enabled (1)
+
+    This is the default situation on most operating systems. Each thread has
+    its own local cache, and when depleted it refills from the process-wide
+    pool that avoids calling the standard allocator too often. It is
+    possible to force this mode at build time by setting
+    CONFIG_HAP_GLOBAL_POOLS or at boot time with "-dMglobal".
+
+  - thread-local caches only are enabled (2)
+
+    This is the situation on operating systems where a fast and modern
+    memory allocator is detected and when it is estimated that the
+    process-wide shared pool will not bring any benefit. This detection is
+    automatic at build time, but may also be forced at build time by
+    setting CONFIG_HAP_NO_GLOBAL_POOLS or at boot time with "-dMno-global".
+
+  - pass-through to the standard allocator (3)
+
+    This is used when one absolutely wants to disable pools and rely on
+    regular malloc() and free() calls, essentially in order to trace memory
+    allocations by call points, either internally via DEBUG_MEM_STATS, or
+    externally via tools such as Valgrind.
This mode of operation may be forced at build
+    time by setting DEBUG_NO_POOLS or at boot time with "-dMno-cache".
+
+  - pass-through to an mmap-based allocator for debugging (4)
+
+    This is used only during deep debugging when trying to detect various
+    conditions such as use-after-free. In this case each allocated object's
+    size is rounded up to a multiple of a page size (4096 bytes) and an
+    integral number of pages is allocated for each object using mmap(),
+    surrounded by two inaccessible holes that aim to detect some
+    out-of-bounds accesses. Released objects are instantly freed using
+    munmap() so that any immediate subsequent access to the memory area
+    crashes the process if the area had not been reallocated yet. This mode
+    can be enabled at build time by setting DEBUG_UAF. It tends to consume
+    a lot of memory and not to scale at all with concurrent calls, which
+    tends to make the system stall. The watchdog may even trigger on some
+    slow allocations.
+
+There are no more provisions for running with a shared pool but no
+thread-local cache: the shared pool's main goal is to compensate for the
+expensive calls to the memory allocator. This gain may be huge on tiny
+systems using basic allocators, but the thread-local cache will already
+achieve this. And on larger threaded systems, the shared pool's benefit is
+visible when the underlying allocator scales poorly, but in this case the
+shared pool would suffer from the same limitations without its thread-local
+cache and wouldn't provide any benefit.
+
+Summary of the various operation modes:
+
+                  (1)          (2)          (3)          (4)
+
+                 User         User         User         User
+                   |            |            |            |
+  pool_alloc()     V            V            |            |
+              +---------+  +---------+       |            |
+              | Thread  |  | Thread  |       |            |
+              | Local   |  | Local   |       |            |
+              | Cache   |  | Cache   |       |            |
+              +---------+  +---------+       |            |
+                   |            |            |            |
+  pool_refill*()   V            |            |            |
+              +---------+       |            |            |
+              | Shared  |       |            |            |
+              |  Pool   |       |            |            |
+              +---------+       |            |            |
+                   |            |            |            |
+      malloc()     V            V            V            |
+              +---------+  +---------+  +---------+       |
+              | Library |  | Library |  | Library |       |
+              +---------+  +---------+  +---------+       |
+                   |            |            |            |
+        mmap()     V            V            V            V
+              +---------+  +---------+  +---------+  +---------+
+              |   OS    |  |   OS    |  |   OS    |  |   OS    |
+              +---------+  +---------+  +---------+  +---------+
+
+One extra build define, DEBUG_FAIL_ALLOC, is used to enforce random
+allocation failure in pool_alloc() by randomly returning NULL, to test that
+callers properly handle allocation failures. It may also be enabled at boot
+time using "-dMfail". In this case the desired average rate of allocation
+failures can be fixed by the global setting "tune.fail-alloc" expressed in
+percent.
+
+The thread-local caches contain the freshest objects, whose total size
+amounts to CONFIG_HAP_POOL_CACHE_SIZE bytes, which was typically 1MB before
+2.6 and is 512kB since. The aim is to keep hot objects that still fit in the
+CPU core's private L2 cache. Once these objects do not fit into the cache
+anymore, there's no benefit keeping them local to the thread, so they'd
+rather be returned to the shared pool or the main allocator so that any
+other thread may make use of them.
+
+
+3. Storage in thread-local caches
+---------------------------------
+
+This section describes how objects are linked in thread local caches. This
+is not meant to be a concern for users of the pools API but it can be useful
+when inspecting post-mortem dumps or when trying to figure certain size
+constraints.
+
+Objects are stored in the local cache using a doubly-linked list.
This ensures
+that they can be visited in freshness order like a stack, while at the same
+time being able to access them from oldest to newest when it is needed to
+evict the coldest ones first:
+
+  - releasing an object to the cache always puts it on the top.
+
+  - allocating an object from the cache always takes the topmost one, hence
+    the freshest one.
+
+  - scanning for older objects to evict starts from the bottom, where the
+    oldest ones are located
+
+To that end, each thread-local cache keeps a list head in the "list" member
+of its "pool_cache_head" descriptor, that links all objects cast to type
+"pool_cache_item" via their "by_pool" member.
+
+Note that the mechanism described above only works for a single pool. When
+trying to limit the total cache size to a certain value, all pools included,
+there is also a need to arrange all objects from all pools together in the
+local caches. For this, each thread_ctx maintains a list head of recently
+released objects, all pools included, in its member "pool_lru_head". All
+items in a thread-local cache are linked there via their "by_lru" member.
+
+This means that releasing an object using pool_free() consists in inserting
+it at the beginning of two lists:
+  - the local pool_cache_head's "list" list head
+  - the thread context's "pool_lru_head" list head
+
+Allocating an object consists in picking the first entry from the pool's
+"list" and deleting its "by_pool" and "by_lru" links.
+
+Evicting an object consists in scanning the thread context's
+"pool_lru_head" backwards and deleting the object's "by_pool" and "by_lru"
+links.
+
+Given that entries are both inserted and removed synchronously, we have the
+guarantee that the oldest object in the thread's LRU list is always the
+oldest object in its pool, and that the next element is the cache's list
+head. This is what allows the LRU eviction mechanism to figure out what
+pool an object belongs to when releasing it.
+
+Note:
+  | Since a pool_cache_item has two list entries, on 64-bit systems it will
+  | be 32 bytes long. This is the smallest size that a pool may be, and any
+  | smaller size will automatically be rounded up to this size.
+
+When the build option DEBUG_POOL_INTEGRITY is set, or the boot-time option
+"-dMintegrity" is passed on the command line, the area of the object between
+the two list elements and the end according to pool->size will be filled
+with pseudo-random words during pool_put_to_cache(), and these words will be
+checked against each other during pool_get_from_cache(), and the process
+will crash in case any bit differs, as this would indicate that the memory
+area was modified after the free. The pseudo-random pattern is in fact
+incremented by (~0)/3 upon each free so that roughly half of the bits change
+each time and we maximize the likelihood of detecting a single bit flip in
+either direction. In order to avoid an immediate reuse and maximize the time
+the object spends in the cache, when this option is set, objects are picked
+from the cache starting from the oldest one instead of the freshest one.
+This way even late memory corruptions have a chance to be detected.
+
+When the build option DEBUG_MEMORY_POOLS is set, or the boot-time option
+"-dMtag" is passed on the executable's command line, pool objects are
+allocated with one extra pointer compared to the requested size, so that the
+bytes that follow the memory area point to the pool descriptor itself as
+long as the object is allocated via pool_alloc().
Upon releasing via pool_free(), the pointer is
+compared and the code will crash if it differs. This allows to detect both
+memory overflows and objects released to the wrong pool (a code bug
+typically resulting from a copy-paste error).
+
+Thus an object will look like this depending on whether it's in the cache
+or currently in use:
+
+        in cache                  in use
+      +------------+           +------------+
+  <--+  by_pool.p  |           |   N bytes  |
+     |  by_pool.n  +-->        |            |
+      +------------+           |N=16 min on |
+  <--+  by_lru.p   |           |  32-bit,   |
+     |  by_lru.n   +-->        | 32 min on  |
+      +------------+           |   64-bit   |
+      :            :           :            :
+      |   N bytes  |           |            |
+      +------------+           +------------+  \  optional, only if
+      :  (unused)  :           :  pool ptr  :   > DEBUG_MEMORY_POOLS
+      +------------+           +------------+  /  is set at build time
+                                                  or -dMtag at boot time
+
+Right now no provisions are made to return objects aligned on larger
+boundaries than those currently covered by malloc() (i.e. two pointers).
+This need appears from time to time and the layout above might evolve a
+little bit if needed.
+
+
+4. Storage in the process-wide shared pool
+------------------------------------------
+
+In order for the shared pool not to be a contention point in a
+multi-threaded environment, objects are allocated from or released to shared
+pools by clusters of a few objects at once. The maximum number of objects
+that may be moved to or from a shared pool at once is defined by
+CONFIG_HAP_POOL_CLUSTER_SIZE at build time, and currently defaults to 8.
+
+In order to remain scalable, the shared pool has to make some tradeoffs to
+limit the number of atomic operations and the duration of any locked
+operation. As such, it's composed of a single-linked list of clusters,
+themselves made of a single-linked list of objects.
+
+Clusters and objects are of the same type "pool_item" and are accessed from
+the pool's "free_list" member. This member points to the latest pool_item
+inserted into the pool by a release operation. And the pool_item's "next"
+member points to the next pool_item, which was the one present in the pool's
+free_list just before the pool_item was inserted, and the last pool_item in
+the list simply has a NULL "next" field.
+
+The pool_item's "down" pointer points down to the next objects part of the
+same cluster, that will be released or allocated at the same time as the
+first one. Each of these items also has a NULL "next" field, and they are
+chained by their respective "down" pointers until the last one is detected
+by a NULL value.
+
+This results in the following layout:
+
+      pool          pool_item    pool_item    pool_item
+   +-----------+    +------+     +------+     +------+
+   | free_list +--> | next +---> | next +---> | NULL |
+   +-----------+    +------+     +------+     +------+
+                    | down |     | NULL |     | down |
+                    +--+---+     +------+     +--+---+
+                       |                         |
+                       V                         V
+                    +------+                  +------+
+                    | NULL |                  | NULL |
+                    +------+                  +------+
+                    | down |                  | NULL |
+                    +--+---+                  +------+
+                       |
+                       V
+                    +------+
+                    | NULL |
+                    +------+
+                    | NULL |
+                    +------+
+
+Allocating an entry is only a matter of performing two atomic operations on
+the free_list and reading the first pool_item's "next" value:
+
+  - atomically mark the free_list as being updated by writing a "magic"
+    pointer
+  - read the first pool_item's "next" field
+  - atomically replace the free_list with this value
+
+This results in a fast operation that instantly retrieves a cluster at
+once. Then outside of the critical section entries are walked over and
+inserted into the local cache one at a time. In order to keep the code
+simple and efficient, objects allocated from the shared pool are all placed
+into the local cache, and only then is the first one allocated from the
+cache. This operation is performed by the dedicated function
+pool_refill_local_from_shared() which is called from pool_get_from_cache()
+when the cache is empty. It means there is an overhead of two list
+insert/delete operations for the first object, and that could be avoided at
+the expense of more complex code in the fast path, but this is negligible
+since it only concerns objects that need to be visited anyway.
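+
+Expressed as a simplified sketch (ignoring memory ordering and the extra
+accounting performed by the real code, and with POOL_BUSY standing for the
+"magic" pointer), the allocation side described above may be pictured like
+this:
+
+    static struct pool_item *pop_cluster(struct pool_head *pool)
+    {
+        struct pool_item *item;
+
+        /* mark the free_list as being updated */
+        do {
+            item = HA_ATOMIC_XCHG(&pool->free_list, POOL_BUSY);
+        } while (item == POOL_BUSY);   /* another thread owns it, retry */
+
+        /* publish the next cluster (possibly NULL), releasing the list */
+        HA_ATOMIC_STORE(&pool->free_list, item ? item->next : NULL);
+        return item;   /* a whole cluster, to be walked via ->down */
+    }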
+
+Freeing a group of objects consists in performing the operation the other
+way around:
+
+  - atomically mark the free_list as being updated by writing a "magic"
+    pointer
+  - write the free_list value to the to-be-released item's "next" entry
+  - atomically replace the free_list with the pool_item's pointer
+
+The cluster will simply have to be prepared before being sent to the shared
+pool. The operation of releasing a cluster at once is performed by function
+pool_put_to_shared_cache() which is called from pool_evict_last_items()
+which itself is responsible for building the clusters.
+
+Due to the way objects are stored, it is important to try to group objects
+as much as possible when releasing them because this is what will condition
+their retrieval as groups as well. This is the reason why
+pool_evict_last_items() uses the LRU to find a first entry but tries to pick
+several items at once from a single cache. Tests have shown that
+CONFIG_HAP_POOL_CLUSTER_SIZE set to 8 achieves up to 6-6.5 objects on
+average per operation, which effectively divides the average time spent per
+object by each thread by as much, and pushes the contention point further.
+
+Also, grouping items in clusters is a property of the process-wide shared
+pool and not of the thread-local caches. This means that there is no grouped
+operation when not using the shared pool (mode "2" in the diagram above).
+
+
+5. API
+------
+
+The following functions are public and available for user code:
+
+struct pool_head *create_pool(char *name, uint size, uint flags)
+        Create a new pool named <name> for objects of size <size> bytes.
+        Pool names are truncated to their first 11 characters. Pools of
+        very similar size will usually be merged if both have set the flag
+        MEM_F_SHARED in <flags>. When DEBUG_DONT_SHARE_POOLS is set at
+        build time, or "-dMno-merge" is passed on the executable's command
+        line, the pools also need to have the exact same name to be merged.
+        In addition, unless MEM_F_EXACT is set in <flags>, the object size
+        will usually be rounded up to the size of pointers (16 or 32
+        bytes). The name that will appear in the pool upon merging is the
+        name of the first created pool. The returned pointer is the new (or
+        reused) pool head, or NULL upon error. Pools created this way must
+        be destroyed using pool_destroy().
+
+void *pool_destroy(struct pool_head *pool)
+        Destroy pool <pool>, that is, all of its unused objects are freed
+        and the structure is freed as well if the pool didn't have any used
+        objects anymore. In this case NULL is returned. If some objects
+        remain in use, the pool is preserved and its pointer is returned.
+        This ought to be used essentially on exit or in rare situations
+        where some internal entities that hold pools have to be destroyed.
+
+void pool_destroy_all(void)
+        Destroy all pools, without checking which ones still have used
+        entries. This is only meant for use on exit.
+
+void *__pool_alloc(struct pool_head *pool, uint flags)
+        Allocate an entry from the pool <pool>. The allocator will first
+        look for an object in the thread-local cache if enabled, then in
+        the shared pool if enabled, then will fall back to the operating
+        system's default allocator. NULL is returned if the object couldn't
+        be allocated (due to configured limits or lack of memory). Objects
+        allocated this way have to be released using pool_free(). Like with
+        malloc(), by default the contents of the returned object are
+        undefined. If memory poisoning is enabled, the object will be
+        filled with the poisoning byte. If the global "pool.fail-alloc"
+        setting is non-zero and DEBUG_FAIL_ALLOC is enabled, a random
+        number generator will be called to randomly return a NULL. The
+        allocator's behavior may be adjusted using a few flags passed in
+        <flags>:
+          - POOL_F_NO_POISON : when set, disables memory poisoning (e.g.
+            when pointless and expensive, like for buffers)
+          - POOL_F_MUST_ZERO : when set, the memory area will be zeroed
+            before being returned, similar to what calloc() does
+          - POOL_F_NO_FAIL : when set, disables the random allocation
+            failure, e.g. for use during early init code or critical
+            sections.
+
+void *pool_alloc(struct pool_head *pool)
+        This is an exact equivalent of __pool_alloc(pool, 0). It is the
+        regular way to allocate entries from a pool.
+
+void *pool_alloc_nocache(struct pool_head *pool)
+        Allocate an entry from the pool <pool>, bypassing the cache. If
+        shared pools are enabled, they will be consulted first. Otherwise
+        the object is allocated using the operating system's default
+        allocator. This is essentially used during early boot to
+        pre-allocate a number of objects for pools which require a minimum
+        number of entries to exist.
+
+void *pool_zalloc(struct pool_head *pool)
+        This is an exact equivalent of __pool_alloc(pool, POOL_F_MUST_ZERO).
+
+void pool_free(struct pool_head *pool, void *ptr)
+        Free an entry allocated from one of the pool_alloc() functions
+        above from pool <pool>. The object will be placed into the
+        thread-local cache if enabled, or in the shared pool if enabled, or
+        will be released using the operating system's default allocator.
+        When a local cache is enabled, if the local cache size becomes
+        larger than 75% of the maximum size configured at build time, some
+        objects will be evicted to the shared pool. Such objects are taken
+        first from the same pool, but if the total size is really huge,
+        other pools might be checked as well. Some checks enabled at build
+        time may enforce extra verifications so that the process will
+        immediately crash if the object was not allocated from this pool or
+        experienced an overflow or some memory corruption.
+
+void pool_flush(struct pool_head *pool)
+        Free all unused objects from shared pool <pool>. Thread-local
+        caches are not affected. This is essentially used when running low
+        on memory or when stopping, in order to release a maximum amount of
+        memory for the new process.
+
+void pool_gc(struct pool_head *pool)
+        Free all unused objects from all pools, but respecting the minimum
+        number of spare objects required for each of them. Then, for
+        operating systems which support it, indicate to the system that all
+        unused memory can be released. Thread-local caches are not
+        affected. This operation differs from pool_flush() in that it is
+        run locklessly, under thread isolation, and on all pools in a row.
+        It is called by the SIGQUIT signal handler and upon exit.
Note that the obsolete argument <pool>
+        is not used and the convention is to pass NULL there.
+
+void dump_pools_to_trash(void)
+        Dump the current status of all pools into the trash buffer. This is
+        essentially used by the "show pools" CLI command or the SIGQUIT
+        signal handler to dump them on stderr. The total report size may
+        not exceed the size of the trash buffer. If it does, some entries
+        will be missing.
+
+void dump_pools(void)
+        Dump the current status of all pools to stderr. This just calls
+        dump_pools_to_trash() and writes the trash to stderr.
+
+int pool_total_failures(void)
+        Report the total number of failed allocations. This is solely used
+        to report the "PoolFailed" metrics of the "show info" output. The
+        total is calculated on the fly by summing the number of failures in
+        all pools and is only meant to be used as an indicator rather than
+        a precise measure.
+
+ullong pool_total_allocated(void)
+        Report the total number of bytes allocated in all pools, for
+        reporting in the "PoolAlloc_MB" field of the "show info" output.
+        The total is calculated on the fly by summing the number of
+        allocated bytes in all pools and is only meant to be used as an
+        indicator rather than a precise measure.
+
+ullong pool_total_used(void)
+        Report the total number of bytes used in all pools, for reporting
+        in the "PoolUsed_MB" field of the "show info" output. The total is
+        calculated on the fly by summing the number of used bytes in all
+        pools and is only meant to be used as an indicator rather than a
+        precise measure. Note that objects present in caches are accounted
+        for as used.
+
+Some other functions exist and are only used by the pools code itself. While
+not strictly forbidden to use outside of this code, it is generally
+recommended to avoid touching them in order not to create undesired
+dependencies that will complicate maintenance.
+
+A few macros exist to ease the declaration of pools:
+
+DECLARE_POOL(ptr, name, size)
+        Placed at the top level of a file, this declares a global memory
+        pool as variable <ptr>, name <name> and size <size> bytes per
+        element. This is made via a call to REGISTER_POOL() and by
+        assigning the resulting pointer to variable <ptr>. <ptr> will be
+        created of type "struct pool_head *". If the pool needs to be
+        visible outside of the function (which is likely), it will also
+        need to be declared somewhere as "extern struct pool_head *<ptr>;".
+        It is recommended to place such declarations very early in the
+        source file so that the variable is already known to all subsequent
+        functions which may use it.
+
+DECLARE_STATIC_POOL(ptr, name, size)
+        Placed at the top level of a file, this declares a static memory
+        pool as variable <ptr>, name <name> and size <size> bytes per
+        element. This is made via a call to REGISTER_POOL() and by
+        assigning the resulting pointer to local variable <ptr>. <ptr> will
+        be created of type "static struct pool_head *". It is recommended
+        to place such declarations very early in the source file so that
+        the variable is already known to all subsequent functions which may
+        use it.
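+
+Example: a hypothetical subsystem allocating its objects from a dedicated
+         static pool; all names are illustrative only:
+
+    struct blob {
+        struct list list;
+        char data[128];
+    };
+
+    DECLARE_STATIC_POOL(pool_head_blob, "blob", sizeof(struct blob));
+
+    static struct blob *blob_new()
+    {
+        /* returns a zeroed object, or NULL upon allocation failure */
+        return pool_zalloc(pool_head_blob);
+    }
+
+    static void blob_release(struct blob *blob)
+    {
+        pool_free(pool_head_blob, blob);
+    }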
+
+
+6. Build options
+----------------
+
+A number of build-time defines allow to tune the pools behavior. All of
+them have to be enabled using "-Dxxx" or "-Dxxx=yyy" in the makefile's
+DEBUG variable.
+
+DEBUG_NO_POOLS
+        When this is set, pools are entirely disabled, and allocations are
+        made using malloc() instead. This is not recommended for production
+        but may be useful for tracing allocations. It corresponds to
+        "-dMno-cache" at boot time.
+
+DEBUG_MEMORY_POOLS
+        When this is set, an extra pointer is allocated at the end of each
+        object to reference the pool the object was allocated from and to
+        detect buffer overflows. Then, pool_free() will provoke a crash in
+        case it detects an anomaly (pointer at the end not matching the
+        pool). It corresponds to "-dMtag" at boot time.
+
+DEBUG_FAIL_ALLOC
+        When enabled, a global setting "tune.fail-alloc" may be set to a
+        non-zero value representing a percentage of memory allocations that
+        will be made to fail in order to stress the calling code. It
+        corresponds to "-dMfail" at boot time.
+
+DEBUG_DONT_SHARE_POOLS
+        When enabled, pools of similar sizes are not merged unless they
+        have the exact same name. It corresponds to "-dMno-merge" at boot
+        time.
+
+DEBUG_UAF
+        When enabled, pools are disabled and all allocations and releases
+        pass through mmap() and munmap(). The memory usage significantly
+        inflates and the performance degrades, but this allows to detect a
+        lot of use-after-free conditions by crashing the program at the
+        first abnormal access. This should not be used in production.
+
+DEBUG_POOL_INTEGRITY
+        When enabled, objects picked from the cache are checked for
+        corruption by comparing their contents against a pattern that was
+        placed when they were inserted into the cache. Objects are also
+        allocated in the reverse order, from the oldest one to the most
+        recent, so as to maximize the ability to detect such a corruption.
+        The goal is to detect writes after free (or possibly hardware
+        memory corruptions). Contrary to DEBUG_UAF this cannot detect reads
+        after free, but it may possibly detect later corruptions and will
+        not consume extra memory. The CPU usage will increase a bit due to
+        the cost of filling/checking the area and to the preference for the
+        cold cache instead of the hot cache, though not as much as with
+        DEBUG_UAF. This option is meant to be usable in production. It
+        corresponds to boot-time options "-dMcold-first,integrity".
+
+DEBUG_POOL_TRACING
+        When enabled, the callers of pool_alloc() and pool_free() will be
+        recorded into an extra memory area placed after the end of the
+        object. This may only be required by developers who want to get a
+        few more hints about code paths involved in some crashes, but will
+        serve no purpose outside of this. It remains compatible with (and
+        nicely complements) DEBUG_POOL_INTEGRITY above. Such information
+        becomes meaningless once the objects leave the thread-local cache.
+        It corresponds to boot-time option "-dMcaller".
+
+DEBUG_MEM_STATS
+        When enabled, all malloc/calloc/realloc/strdup/free calls are
+        accounted for per call place (file+line number), and may be
+        displayed or reset on the CLI using "debug dev memstats". This is
+        essentially used to detect potential leaks or abnormal usages. When
+        pools are enabled (default), such calls are rare and the output
+        will mostly contain calls induced by libraries. When pools are
+        disabled, about all calls to pool_alloc() and pool_free() will also
+        appear since they will be remapped to standard functions.
+
+CONFIG_HAP_GLOBAL_POOLS
+        When enabled, process-wide shared pools will be forcefully enabled
+        even if not considered useful on the platform. The default is to
+        let haproxy decide based on the OS and C library. It corresponds to
+        boot-time option "-dMglobal".
+
+CONFIG_HAP_NO_GLOBAL_POOLS
+        When enabled, process-wide shared pools will be forcefully disabled
+        even if considered useful on the platform.
+  The default is to let haproxy decide based on the OS and C library. It
+  corresponds to boot-time option "-dMno-global".
+
+CONFIG_HAP_POOL_CACHE_SIZE
+  This allows one to define the size of the per-thread cache, in bytes.
+  The default value is 512 kB (524288). Smaller values will use less
+  memory at the expense of a possibly higher CPU usage when using many
+  threads. Higher values will give diminishing returns on performance
+  while using much more memory. Usually there is no benefit in using more
+  than a per-core L2 cache size. It would be better not to set this value
+  lower than a few times the size of a buffer (bufsize, which defaults to
+  16 kB).
+
+CONFIG_HAP_POOL_CLUSTER_SIZE
+  This allows one to define the maximum number of objects that will be
+  grouped together in an allocation from the shared pool. Values of 4 to
+  8 have experimentally shown good results with 16 threads. On systems
+  with more cores, or with loosely coupled caches exhibiting slow atomic
+  operations, it could make sense to slightly increase this value.
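+
+As an illustration, a debug build enabling a couple of these options
+might be produced as follows (the build target is only an example and
+must match the local platform):
+
+    $ make -j$(nproc) TARGET=linux-glibc \
+          DEBUG="-DDEBUG_POOL_INTEGRITY -DDEBUG_DONT_SHARE_POOLS"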
diff --git a/doc/internals/api/scheduler.txt b/doc/internals/api/scheduler.txt
new file mode 100644
index 0000000..3469543
--- /dev/null
+++ b/doc/internals/api/scheduler.txt
@@ -0,0 +1,226 @@
+2021-11-17 - Scheduler API
+
+
+1. Background
+-------------
+
+The scheduler relies on two major parts:
+  - the wait queue or timers queue, which contains an ordered tree of the
+    next timers to expire
+
+  - the run queue, which contains tasks that were already woken up and
+    are waiting for a CPU slot to execute.
+
+There are two types of schedulable objects in HAProxy:
+  - tasks: they contain one timer and can be in the run queue without
+    leaving their place in the timers queue.
+
+  - tasklets: they do not have the timers part and are either sleeping or
+    running.
+
+The timers queue and the run queue each exist in two versions: one shared
+between all threads and one per thread. A task or tasklet may only be
+queued in one of each at a time. The thread-local queues are not
+thread-safe while the shared ones are. This means that it is only
+permitted to manipulate an object which is in the local queue, or in a
+shared queue after locking that queue first. As such, tasks and tasklets
+are usually pinned to threads and do not move, or move only in very
+specific ways not detailed here.
+
+In case of doubt, keep in mind that it is not permitted to manipulate
+another thread's private task or tasklet, and that any task held by
+another thread might vanish while it is being looked at.
+
+Internally, a large part of the task and tasklet structs is shared
+between the two types, which reduces code duplication and eases the
+preservation of fairness in the run queue by interleaving all of them. As
+such, some fields or flags may not always be relevant to tasklets and may
+be ignored.
+
+Tasklets do not use a thread mask but a thread ID instead, to which they
+are bound. If the thread ID is negative, the tasklet is not bound and may
+only be run on the calling thread.
+
+
+2. API
+------
+
+The scheduler exposes few functions. A few more are in fact accessible,
+but if they are not documented here they are better avoided, or used only
+when absolutely certain they are suitable, as some have delicate corner
+cases. In case of doubt, checking the sched.pdf diagram may help.
+
+int total_run_queues()
+  Return the approximate number of tasks in run queues. This is racy and
+  a bit inaccurate as it iterates over all queues, but it is sufficient
+  for stats reporting.
+
+int task_in_rq(t)
+  Return non-zero if the designated task is in the run queue (i.e. it was
+  already woken up).
+
+int task_in_wq(t)
+  Return non-zero if the designated task is in the timers queue (i.e. it
+  has a valid timeout and will eventually expire).
+
+int thread_has_tasks()
+  Return non-zero if the current thread has some work to be done in the
+  run queue. This is used to decide whether or not to sleep in poll().
+
+void task_wakeup(t, f)
+  Make sure task <t> will wake up, that is, will execute at least once at
+  some point after this function is called. The task flags <f> will be
+  ORed into the task's state, and must be chosen among TASK_WOKEN_* flags
+  exclusively. In multi-threaded environments it is safe to wake up
+  another thread's task, and even if the thread is sleeping it will be
+  woken up. Users have to keep in mind that a task running on another
+  thread might very well finish and go back to sleep before the function
+  returns. It is permitted to wake the current task up, in which case it
+  will be scheduled to run another time after it returns to the
+  scheduler.
+
+struct task *task_unlink_wq(t)
+  Remove the task from the timers queue if it was in it, and return it.
+  It may only be done for a task bound to the local thread, or for a
+  shared task that might be in the shared queue. It must not be done for
+  another thread's task.
+
+void task_queue(t)
+  Place or update task <t> in the timers queue, where it may already be,
+  scheduling it for an expiration at date t->expire. If t->expire is
+  infinite, nothing is done, so it is safe to call this function without
+  checking the expiration date first. It is only valid to call this
+  function for local tasks or for shared tasks that have the calling
+  thread in their thread mask.
+
+void task_set_affinity(t, m)
+  Change task <t>'s thread_mask to the new value <m>. This may only be
+  performed by the task itself while running. It is only used to let a
+  task voluntarily migrate to another thread.
+
+void tasklet_wakeup(tl)
+  Make sure that tasklet <tl> will wake up, that is, will execute at
+  least once. The tasklet will run on its assigned thread, or on any
+  thread if its TID is negative.
+
+void tasklet_wakeup_on(tl, thr)
+  Make sure that tasklet <tl> will wake up on thread <thr>, that is, will
+  execute at least once. The designated thread may only differ from the
+  calling one if the tasklet is already configured to run on another
+  thread, and it is not permitted to self-assign a tasklet whose tid is
+  negative, as it may already be scheduled to run somewhere else. In case
+  of doubt, use tasklet_wakeup(), which will pick the tasklet's assigned
+  thread ID.
+
+struct tasklet *tasklet_new()
+  Allocate a new tasklet and set it to run by default on the calling
+  thread. The caller may change its tid to another one before using it.
+  The new tasklet is returned.
+
+struct task *task_new_anywhere()
+  Allocate a new task to run on any thread, and return the task, or NULL
+  in case of an allocation issue. Note that such tasks will be marked as
+  shared and will go through the locked queues, thus their activity will
+  be heavier than for other ones. See also task_new_here().
+
+struct task *task_new_here()
+  Allocate a new task to run on the calling thread, and return the task,
+  or NULL in case of an allocation issue.
+
+struct task *task_new_on(t)
+  Allocate a new task to run on thread <t>, and return the task, or NULL
+  in case of an allocation issue.
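+
+As a quick illustration of the creation functions above, a subsystem
+could set up a periodic task on the local thread as follows (a minimal
+sketch; the context structure, the names and the chosen period are purely
+illustrative):
+
+    struct mystate {
+            unsigned int ticks;
+    };
+
+    static struct mystate my_state;
+
+    static struct task *my_process(struct task *t, void *ctx, uint state)
+    {
+            struct mystate *s = ctx;
+
+            s->ticks++;
+            /* ask to run again in about one second; the returned task is
+             * requeued by the scheduler according to ->expire.
+             */
+            t->expire = tick_add(now_ms, MS_TO_TICKS(1000));
+            return t;
+    }
+
+    static int my_init(void)
+    {
+            struct task *t = task_new_here();
+
+            if (!t)
+                    return 0;
+            t->process = my_process;
+            t->context = &my_state;
+            /* schedule a first run */
+            task_wakeup(t, TASK_WOKEN_OTHER);
+            return 1;
+    }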
+
+void task_destroy(t)
+  Destroy this task. The task will be unlinked from any timers queue,
+  and either immediately freed, or asynchronously killed if currently
+  running. This may only be done by one of the threads this task is
+  allowed to run on. Developers must not forget that the task's memory
+  area is not always immediately freed, and that certain misuses may
+  only have an effect later down the chain (e.g. use-after-free).
+
+void tasklet_free()
+  Free this tasklet. It must not be running, so this may only be called
+  by the thread responsible for the tasklet, typically from the tasklet's
+  own process() function.
+
+void task_schedule(t, d)
+  Schedule task <t> to run no later than date <d>. If the task is already
+  running, or scheduled for an earlier instant, nothing is done. If the
+  task was not queued, or was scheduled to run later, its timer entry
+  will be updated. This function assumes that it will never be called
+  with a timer in the past nor with TICK_ETERNITY. Only one of the
+  threads assigned to the task may call this function.
+
+The task's ->process() function receives the following arguments:
+
+  - struct task *t: a pointer to the task itself. It is always valid.
+
+  - void *ctx     : a copy of the task's ->context pointer at the moment
+                    the ->process() function was called by the scheduler.
+                    A function must use this and not task->context,
+                    because task->context might be changed by another
+                    thread; for instance, the muxes' takeover() functions
+                    do this.
+
+  - uint state    : a copy of the task's ->state field at the moment the
+                    ->process() function was executed. A function must
+                    use this and not task->state, as the latter misses
+                    the wakeup reasons and may constantly change during
+                    execution due to concurrent wakeups (threads or
+                    signals).
+
+The possible state flags to use during a call to task_wakeup(), or seen
+by the task being called, are the following; they are automatically
+cleared from the state field before the call to ->process():
+
+  - TASK_WOKEN_INIT   each creation of a task causes a first wakeup with
+                      this flag set. Applications should not set it
+                      themselves.
+
+  - TASK_WOKEN_TIMER  this indicates the task's expire date was reached
+                      in the timers queue. Applications should not set it
+                      themselves.
+
+  - TASK_WOKEN_IO     indicates the wake-up happened due to I/O activity.
+                      Now that all low-level I/O processing happens on
+                      tasklets, this notion of I/O is application-defined
+                      (for example, stream-interfaces use it to notify
+                      the stream).
+
+  - TASK_WOKEN_SIGNAL indicates that a signal the task was subscribed to
+                      was received. Applications should not set it
+                      themselves.
+
+  - TASK_WOKEN_MSG    any application-defined wake-up reason, usually for
+                      inter-task communication (e.g. filters vs streams).
+
+  - TASK_WOKEN_RES    a resource the task was waiting for was finally
+                      made available, allowing the task to continue its
+                      work. This is essentially used by buffers and
+                      queues. Applications may carefully use it for their
+                      own purpose if they are certain not to conflict
+                      with existing uses.
+
+  - TASK_WOKEN_OTHER  any other application-defined wake-up reason.
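+
+For example, a ->process() function dispatching on its wake-up causes
+could look like this (a sketch only; the reactions to each cause are
+application-specific and purely illustrative here):
+
+    static struct task *svc_process(struct task *t, void *ctx, uint state)
+    {
+            if (state & TASK_WOKEN_TIMER) {
+                    /* t->expire was reached: handle the timeout */
+            }
+
+            if (state & TASK_WOKEN_MSG) {
+                    /* another entity (e.g. a filter) sent a notification */
+            }
+
+            /* return the task to keep it alive; a handler that destroys
+             * its own task with task_destroy() must return NULL instead.
+             */
+            return t;
+    }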
+
+In addition, a few persistent flags may be observed or manipulated by the
+application, both for tasks and tasklets:
+
+  - TASK_SELF_WAKING  when set, indicates that this task was found waking
+                      itself up, and its class will change to bulk
+                      processing. If this behavior is under control and
+                      only temporarily expected, and is not supposed to
+                      happen again, it may make sense to reset this flag
+                      from the ->process() function itself.
+
+  - TASK_HEAVY        when set, indicates that this task performs such
+                      heavy processing that it becomes mandatory to give
+                      control back to I/Os, otherwise large latencies
+                      might occur. It may be set by an application that
+                      expects something heavy to happen (tens to hundreds
+                      of microseconds), and reset once finished. An
+                      example user is the TLS stack, which sets it when
+                      an imminent crypto operation is expected.
+
+  - TASK_F_USR1       this is the first application-defined persistent
+                      flag. It is always zero unless the application
+                      changes it. An example use case is the I/O handler
+                      for backend connections, to indicate whether the
+                      connection is safe to use or might have recently
+                      been migrated.
+
+Finally, when built with -DDEBUG_TASK, an extra sub-structure "debug" is
+added to both tasks and tasklets to note the code locations of the last
+two calls to task_wakeup() and tasklet_wakeup().