Diffstat (limited to '')
-rw-r--r-- doc/internals/api/appctx.txt 142
-rw-r--r-- doc/internals/api/buffer-api.txt 653
-rw-r--r-- doc/internals/api/filters.txt 1186
-rw-r--r-- doc/internals/api/htx-api.txt 570
-rw-r--r-- doc/internals/api/initcalls.txt 360
-rw-r--r-- doc/internals/api/ist.txt 167
-rw-r--r-- doc/internals/api/layers.txt 190
-rw-r--r-- doc/internals/api/list.txt 195
-rw-r--r-- doc/internals/api/pools.txt 577
-rw-r--r-- doc/internals/api/scheduler.txt 226
10 files changed, 4266 insertions, 0 deletions
diff --git a/doc/internals/api/appctx.txt b/doc/internals/api/appctx.txt
new file mode 100644
index 0000000..137ec7b
--- /dev/null
+++ b/doc/internals/api/appctx.txt
@@ -0,0 +1,142 @@
+Instantiation of applet contexts (appctx) in 2.6.
+
+
+1. Background
+
+Most applets are in fact simplified services that are called by the CLI when a
+registered keyword is matched. Some of them only have a ->parse() function
+which immediately returns with a final result, while others will return zero
+asking for the ->io_handler() one to be called till the end. For these ones, a
+context is generally needed between calls to know where to restart from.
+
+Other applets are completely autonomous applets with their init function and
+an I/O handler, and these ones also need a persistent context between calls to
+the I/O handler. These ones are typically instantiated by "use-service" or by
+other means.
+
+Originally a few integers were provided to keep a trivial state (st0, st1, st2)
+and these ones progressively proved insufficient, leading to a "ctx.cli" sub-
+context that was allowed to use extra fields of various types. Other applets
+preferred to use their own context definition.
+
+All this resulted in appctx->ctx containing a myriad of definitions of various
+service contexts. Some services lazily reused other services' definitions,
+while others were extended to use their own definition after having run for a
+long time on the generic types. Some of these went unnoticed and ended up
+using the same storage locations by accident. A massive cleanup was needed.
+
+
+2. New approach in 2.6
+
+In 2.6, there's an "svcctx" pointer that's initialized to NULL before any
+instantiation of an applet or of a CLI keyword's function. Applets and keyword
+handlers are free to make it point wherever they want, and to find it unaltered
+between subsequent calls, including up to the ->release() call. The "st2" state
+that was totally abused with random enums is not used anymore and was marked as
+deprecated. It's still initialized to zero before the first call though.
+
+One special area, "svc.storage[]", is large enough to contain any of the
+contexts that used to be present under "appctx->ctx". The "svcctx" may be set
+to point to this area so that a small structure can be allocated for free and
+without requiring error checking. In order to make this easier, a specially
+purposed function is provided: "applet_reserve_svcctx()". This function will
+require the caller to indicate how large an area it needs, and will return a
+pointer to this area after checking that it fits. If it does not, haproxy will
+crash. This is purposely done so that it's known during development that if a
+small structure doesn't fit, a different approach is required.
+
+As such, for the vast majority of commands, the process is the following one:
+
+    struct foo_ctx {
+        int myfield1;
+        int myfield2;
+        char *myfield3;
+    };
+
+    int io_handler(struct appctx *appctx)
+    {
+        struct foo_ctx *ctx = applet_reserve_svcctx(appctx, sizeof(*ctx));
+
+        if (!ctx->myfield1) {
+            /* first call */
+            ctx->myfield1++;
+        }
+        ...
+    }
+
+The pointer may be directly accessed from the I/O handler if it's known that it
+was already reserved by the init handler or parsing function. Otherwise it's
+guaranteed to be NULL so that can also serve as a test for a first call:
+
+    int parse_handler(struct appctx *appctx)
+    {
+        struct foo_ctx *ctx = applet_reserve_svcctx(appctx, sizeof(*ctx));
+        ctx->myfield1 = 12;
+        return 0;
+    }
+
+    int io_handler(struct appctx *appctx)
+    {
+        struct foo_ctx *ctx = appctx->svcctx;
+
+        /* count down from the value set by the parser */
+        for (; ctx->myfield1; ctx->myfield1--) {
+            do_something();
+        }
+        ...
+    }
+
+There is no need to free anything because that space is not allocated but just
+points to a reserved area.
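+
+The hand-off between a parsing function and an I/O handler can be sketched
+outside of HAProxy as follows. This is only an illustration: "mock_appctx",
+"MOCK_MAX_SVCCTX", "mock_reserve_svcctx()" and the "remaining"/"done" fields
+are hypothetical stand-ins and not HAProxy's definitions, and the reserved
+area is assumed to be zeroed when the context is created.
+

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not HAProxy's definitions. */
#define MOCK_MAX_SVCCTX 64

struct mock_appctx {
    void *svcctx;                     /* NULL until a handler sets it */
    union {
        char storage[MOCK_MAX_SVCCTX];
        long align;                   /* forces a suitable alignment */
    } svc;
};

struct foo_ctx {
    int remaining;                    /* iterations left, set by the parser */
    int done;                         /* iterations performed by the I/O handler */
};

/* Points svcctx to the reserved area after checking that <size> fits. */
static void *mock_reserve_svcctx(struct mock_appctx *appctx, size_t size)
{
    assert(size <= sizeof(appctx->svc.storage)); /* too large: fail loudly */
    appctx->svcctx = appctx->svc.storage;
    return appctx->svcctx;
}

/* Parsing step: reserves the context and records the work to perform. */
static int mock_parse(struct mock_appctx *appctx)
{
    struct foo_ctx *ctx = mock_reserve_svcctx(appctx, sizeof(*ctx));

    ctx->remaining = 12;
    return 0;
}

/* I/O step: reuses the context that the parsing step left in svcctx. */
static int mock_io_handler(struct mock_appctx *appctx)
{
    struct foo_ctx *ctx = appctx->svcctx;

    for (; ctx->remaining; ctx->remaining--)
        ctx->done++;
    return 1;
}

/* Runs both steps on a zeroed context and reports the iteration count. */
static int mock_run(void)
{
    struct mock_appctx appctx = { .svcctx = NULL };

    mock_parse(&appctx);
    mock_io_handler(&appctx);
    return ((struct foo_ctx *)appctx.svcctx)->done;
}
```

+
+In this sketch nothing is freed at the end: the context lives inside the
+mock appctx itself, which mirrors the point made above about the reserved
+area.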
+
+If the reserved area is too small (its size is APPLET_MAX_SVCCTX bytes), it is
+preferable to make svcctx point to a dynamically allocated structure (pools,
+malloc, etc). For example:
+
+    int io_handler(struct appctx *appctx)
+    {
+        struct foo_ctx *ctx = appctx->svcctx;
+
+        if (!ctx) {
+            /* first call */
+            ctx = pool_alloc(pool_foo_ctx);
+            if (!ctx)
+                return 1;
+            /* keep the pointer for the next calls and for ->release() */
+            appctx->svcctx = ctx;
+        }
+        ...
+    }
+
+    void io_release(struct appctx *appctx)
+    {
+        pool_free(pool_foo_ctx, appctx->svcctx);
+    }
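+
+The life cycle of a dynamically allocated context can also be sketched in a
+standalone form. The names below ("mock_appctx", "big_ctx", "mock_*") are
+hypothetical, and calloc()/free() merely stand in for whatever allocator is
+used. The sketch assumes that the handler itself stores the pointer into
+svcctx so that later calls and the release function can find it.
+

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in: only the field needed for the illustration. */
struct mock_appctx {
    void *svcctx;
};

/* A context assumed to be too large for the reserved area. */
struct big_ctx {
    char scratch[4096];
    int calls;
};

/* Returns 0 to be called again, 1 when it gives up (allocation failure). */
static int mock_io_handler(struct mock_appctx *appctx)
{
    struct big_ctx *ctx = appctx->svcctx;

    if (!ctx) {
        /* first call: allocate and remember the context */
        ctx = calloc(1, sizeof(*ctx));
        if (!ctx)
            return 1;
        appctx->svcctx = ctx;
    }
    /* from here on, svcctx always refers to the context */
    assert(appctx->svcctx == ctx);
    ctx->calls++;
    return 0;
}

/* Release step: frees the context; free(NULL) is a no-op. */
static void mock_release(struct mock_appctx *appctx)
{
    free(appctx->svcctx);
    appctx->svcctx = NULL;
}

/* Calls the handler twice, then releases; returns the observed call count. */
static int mock_lifecycle(void)
{
    struct mock_appctx appctx = { .svcctx = NULL };
    int calls;

    mock_io_handler(&appctx);
    mock_io_handler(&appctx);
    calls = appctx.svcctx ? ((struct big_ctx *)appctx.svcctx)->calls : -1;
    mock_release(&appctx);
    return appctx.svcctx == NULL ? calls : -1;
}
```

+
+The second call finds the pointer left by the first one, which is why the
+counter reaches 2 before the release step frees the structure.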
+
+The CLI code itself uses this mechanism for the cli_print_*() functions. Since
+these functions are terminal (i.e. not meant to be used in the middle of an I/O
+handler as they share the same contextual space), they always reset the svcctx
+pointer to place it to the "cli_print_ctx" mapped in ->svc.storage.
+
+
+3. Transition for old code
+
+A lot of care was taken to make the transition as smooth as possible for
+out-of-tree code since that's an API change. A dummy "ctx.cli" struct still
+exists in the appctx struct, and it happens to map perfectly to the one set by
+cli_print_*, so that if some code uses a mix of both, it will still work.
+However, it will build with "deprecated" warnings allowing to spot the
+remaining places. It's a good exercise to rename "ctx.cli" in "appctx" and see
+if the code still compiles.
+
+Regarding the "st2" sub-state, it will disappear as well after 2.6, but is
+still provided and initialized so that code relying on it will still work even
+if it builds with deprecation warnings. The correct approach is to move this
+state into the newly defined applet's context, and to stop using the stats
+enums STAT_ST_* that often barely match the needs and result in code that is
+more complicated than desired (the STAT_ST_* enum values have also been marked
+as deprecated).
+
+The code dealing with "show fd", "show sess" and the peers applet show good
+examples of how to convert a registered keyword or an applet.
+
+All this transition code requires complex layouts that will be removed during
+2.7-dev so there is no other long-term option but to update the code (or better
+get it merged if it can be useful to other users).
diff --git a/doc/internals/api/buffer-api.txt b/doc/internals/api/buffer-api.txt
new file mode 100644
index 0000000..ac35300
--- /dev/null
+++ b/doc/internals/api/buffer-api.txt
@@ -0,0 +1,653 @@
+2018-07-13 - HAProxy Internal Buffer API
+
+
+1. Background
+
+HAProxy uses a "struct buffer" internally to store data received from external
+agents, as well as data to be sent to external agents. These buffers are also
+used during data transformation such as compression, header insertion or
+defragmentation, and are used to carry intermediary representations between the
+various internal layers. They support wrapping at the end, and they carry their
+own size information so that in theory it would be possible to use different
+buffer sizes in parallel even though this is not currently implemented.
+
+The format of this structure has evolved over time, to reach a point where it
+is convenient and versatile enough to have allowed several internal types to
+converge into a single one (specifically the struct chunk disappeared).
+
+
+2. Representation as of 1.9-dev1
+
+The current buffer representation consists of a linear storage area of known
+size, with a head position indicating the oldest data, and a total data count
+expressed in bytes. The head position, data count and size are expressed as
+integers and are positive or null. By convention, the head position is strictly
+smaller than the buffer size and the data count is smaller than or equal to the
+size, so that wrapping can be resolved with a single subtract. A buffer not
+respecting these rules is said to be degenerate. Unless specified otherwise,
+the various API functions will adopt an undefined behaviour when passed such a
+degenerate buffer.
+
+  Buffer declaration :
+
+     struct buffer {
+         size_t size;     // size of the storage area (wrapping point)
+         char  *area;     // start of the storage area
+         size_t data;     // contents length after head
+         size_t head;     // start offset of remaining data relative to area
+     };
+
+
+  Linear buffer representation :
+
+       area
+         |
+         V<--------------------------------------------------------->| size
+         +-----------+---------------------------------+-------------+
+         |           |/////////////////////////////////|             |
+         +-----------+---------------------------------+-------------+
+         |<--------->|<------------------------------->|
+             head                    data              ^
+                                                       |
+                                                     tail
+
+
+  Wrapping buffer representation :
+
+       area
+         |
+         V<--------------------------------------------------------->| size
+         +---------------+------------------------+------------------+
+         |///////////////|                        |//////////////////|
+         +---------------+------------------------+------------------+
+         |<-------------------------------------->| head
+         |-------------->| ...data         data...|<-----------------|
+                         ^
+                         |
+                       tail
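+
+The conventions above (head strictly smaller than size, data no larger than
+size) are what allow the tail to be computed with a single subtract. A
+minimal sketch, where "sk_is_sane()" and "sk_tail_ofs()" are hypothetical
+helpers written for this example and the numeric values used with them are
+arbitrary:
+

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the declaration above. */
struct buffer {
    size_t size;
    char  *area;
    size_t data;
    size_t head;
};

/* Returns 1 if an allocated buffer respects the conventions, 0 if it is
 * degenerate.
 */
static int sk_is_sane(const struct buffer *b)
{
    return b->head < b->size && b->data <= b->size;
}

/* Returns the origin-relative offset of the first spare byte. Since
 * head < size and data <= size, head + data is below 2 * size, so one
 * subtract is enough to resolve wrapping.
 */
static size_t sk_tail_ofs(const struct buffer *b)
{
    size_t tail = b->head + b->data;

    assert(sk_is_sane(b));
    if (tail >= b->size)
        tail -= b->size;
    return tail;
}
```

+
+With size 60, a buffer with head 12 and 33 bytes of data is linear and its
+tail is at offset 45. With head 42 and the same amount of data, the data
+wrap and the tail is at offset 15.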
+
+
+3. Terminology
+
+Manipulating a buffer just based on a head and a wrapping data count is not
+very convenient, so we define a certain number of terms for important elements
+characterizing a buffer :
+
+ - origin : pointer to relative position 0 in the storage area. Undefined
+ when the buffer is not allocated.
+
+ - size : the allocated size of the storage area starting at the origin,
+ expressed in bytes. A buffer whose size is zero is said not to
+ be allocated, and its origin in this case is undefined.
+
+ - data : the amount of data the buffer contains, in bytes. It is always
+ lower than or equal to the buffer's size, hence it is always 0
+ for an unallocated buffer.
+
+ - emptiness : a buffer is said to be empty when it contains no data, hence
+ data == 0. It is possible for such buffers not to be allocated
+ and to have size == 0 as well.
+
+ - room : the available space in the buffer. This is its size minus data.
+
+ - head : position relative to origin where the oldest data byte is found
+ (it typically is what send() uses to pick outgoing data). The
+ head is strictly smaller than the size.
+
+ - tail : position relative to origin where the first spare byte is found
+ (it typically is what recv() uses to store incoming data). It
+ is always equal to the buffer's data added to its head modulo
+ the buffer's size.
+
+ - wrapping : the byte following the last one of the storage area loops back
+ to position 0. This is called wrapping. The wrapping point is
+ the first position relative to origin which doesn't belong to
+ the storage area. There is no wrapping when a buffer is not
+ allocated. Wrapping requires special care and means that the
+ regular string manipulation functions are not usable on most
+ buffers, unless it is known that no wrapping happens. Free
+ space may wrap as well if the buffer only contains data in the
+ middle.
+
+ - alignment : a buffer is said to be aligned if its data do not wrap. That
+ is, its head is strictly before the tail, or the buffer is
+ empty and the head is null. Aligning a buffer may be required
+ to use regular string manipulation functions which have no
+ support for wrapping.
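+
+Several of these terms reduce to simple arithmetic on size, data and head.
+The sketch below uses hypothetical "sk_*" helpers written for this example
+(they are not the API functions) and only covers allocated buffers which
+respect the head < size and data <= size conventions:
+

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: only the fields needed for the arithmetic. */
struct sk_buf {
    size_t size;
    size_t data;
    size_t head;
};

/* room: the size minus the data */
static size_t sk_room(const struct sk_buf *b)
{
    assert(b->data <= b->size);
    return b->size - b->data;
}

/* tail: head plus data, modulo the size */
static size_t sk_tail(const struct sk_buf *b)
{
    size_t tail = b->head + b->data;

    assert(b->head < b->size && b->data <= b->size);
    if (tail >= b->size)
        tail -= b->size;
    return tail;
}

/* amount of data readable from the head without crossing the wrapping point */
static size_t sk_contig_data(const struct sk_buf *b)
{
    size_t to_wrap = b->size - b->head;

    return b->data < to_wrap ? b->data : to_wrap;
}

/* amount of free space writable from the tail without crossing the
 * wrapping point nor reaching the head
 */
static size_t sk_contig_space(const struct sk_buf *b)
{
    size_t tail = sk_tail(b);

    if (b->head + b->data >= b->size)
        return b->head - tail;  /* data reach the end: free space is [tail, head) */
    return b->size - tail;      /* free space runs from the tail to the end */
}
```

+
+For a 60-byte buffer holding 33 bytes at head 12, the room is 27 bytes but
+only 15 of them are contiguous after the tail. With the same data at head
+42, only 18 bytes can be read before wrapping while the 27 free bytes are
+contiguous.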
+
+
+A buffer may be in three different states :
+ - unallocated : size == 0, area == 0 (b_is_null() is true)
+ - waiting : size == 0, area != 0
+ - allocated : size > 0, area > 0
+
+It is not permitted to have area == 0 with a non-null size. In addition, the
+waiting state may also be used to indicate a read-only buffer which does not
+wrap and which must not be freed (e.g. for use with error messages).
+
+The basic API only covers allocated buffers. Switching to/from the other states
+is covered by the management API since it requires specific allocation and free
+calls.
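+
+The three states can be told apart from size and area alone. A minimal
+sketch, in which the "sk_state" enum and both helpers are hypothetical names
+introduced for this example:
+

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the buffer declaration of section 2. */
struct buffer {
    size_t size;
    char  *area;
    size_t data;
    size_t head;
};

enum sk_state {
    SK_UNALLOCATED,   /* size == 0, area == 0 */
    SK_WAITING,       /* size == 0, area != 0 */
    SK_ALLOCATED,     /* size  > 0, area != 0 */
    SK_INVALID,       /* size  > 0, area == 0: not permitted */
};

/* Classifies a buffer according to the rules listed above. */
static enum sk_state sk_classify(const struct buffer *b)
{
    if (b->size == 0)
        return b->area ? SK_WAITING : SK_UNALLOCATED;
    return b->area ? SK_ALLOCATED : SK_INVALID;
}

/* Guard a caller could place before using functions which only cover
 * allocated buffers.
 */
static void sk_require_allocated(const struct buffer *b)
{
    assert(sk_classify(b) == SK_ALLOCATED);
}
```

+
+A read-only buffer pointing to an error message would be reported as
+waiting by this sketch, since it has a non-null area and a zero size.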
+
+
+4. Using buffers
+
+Buffers are defined in a few files :
+ - include/common/buf.h : structure definition, and manipulation functions
+ - include/common/buffer.h : resource management (alloc/free/wait lists)
+ - include/common/istbuf.h : advanced string manipulation
+
+
+4.1. Basic API
+
+The basic API is made of the functions which abstract accesses to the buffers
+and which help calculating their state, free space or used space.
+
+====================+==================+=======================================
+Function | Arguments/Return | Description
+--------------------+------------------+---------------------------------------
+b_is_null() | const buffer *buf| returns true if (and only if) the
+ | ret: int | buffer is not yet allocated and thus
+ | | points to a NULL area
+--------------------+------------------+---------------------------------------
+b_orig() | const buffer *buf| returns the pointer to the origin of
+ | ret: char * | the storage, which is the location of
+ | | byte at offset zero. This is mostly
+ | | used by functions which handle the
+ | | wrapping by themselves
+--------------------+------------------+---------------------------------------
+b_size() | const buffer *buf| returns the size of the buffer
+ | ret: size_t |
+--------------------+------------------+---------------------------------------
+b_wrap() | const buffer *buf| returns the pointer to the wrapping
+ | ret: char * | position of the buffer area, which is
+ | | by definition the first byte not part
+ | | of the buffer
+--------------------+------------------+---------------------------------------
+b_data() | const buffer *buf| returns the number of bytes present in
+ | ret: size_t | the buffer
+--------------------+------------------+---------------------------------------
+b_room() | const buffer *buf| returns the amount of room left in the
+ | ret: size_t | buffer
+--------------------+------------------+---------------------------------------
+b_full() | const buffer *buf| returns true if the buffer is full
+ | ret: int |
+--------------------+------------------+---------------------------------------
+__b_stop() | const buffer *buf| returns a pointer to the byte
+ | ret: char * | following the end of the buffer, which
+ | | may be out of the buffer if the buffer
+ | | ends on the last byte of the area. It
+ | | is the caller's responsibility to
+ | | either know that the buffer does not
+ | | wrap or to check that the result does
+ | | not wrap
+--------------------+------------------+---------------------------------------
+__b_stop_ofs() | const buffer *buf| returns an origin-relative offset
+ | ret: size_t | pointing to the byte following the end
+ | | of the buffer, which may be out of the
+ | | buffer if the buffer ends on the last
+ | | byte of the area. It's the caller's
+ | | responsibility to either know that the
+ | | buffer does not wrap or to check that
+ | | the result does not wrap
+--------------------+------------------+---------------------------------------
+b_stop() | const buffer *buf| returns the pointer to the byte
+ | ret: char * | following the end of the buffer, which
+ | | may be out of the buffer if the buffer
+ | | ends on the last byte of the area
+--------------------+------------------+---------------------------------------
+b_stop_ofs() | const buffer *buf| returns an origin-relative offset
+ | ret: size_t | pointing to the byte following the end
+ | | of the buffer, which may be out of the
+ | | buffer if the buffer ends on the last
+ | | byte of the area
+--------------------+------------------+---------------------------------------
+__b_peek() | const buffer *buf| returns a pointer to the data at
+ | size_t ofs | position <ofs> relative to the head of
+ | ret: char * | the buffer. Will typically point to
+ | | input data if called with the amount
+ | | of output data. It's the caller's
+ | | responsibility to either know that the
+ | | buffer does not wrap or to check that
+ | | the result does not wrap
+--------------------+------------------+---------------------------------------
+__b_peek_ofs() | const buffer *buf| returns an origin-relative offset
+ | size_t ofs | pointing to the data at position <ofs>
+ | ret: size_t | relative to the head of the
+ | | buffer. Will typically point to input
+ | | data if called with the amount of
+ | | output data. It's the caller's
+ | | responsibility to either know that the
+ | | buffer does not wrap or to check that
+ | | the result does not wrap
+--------------------+------------------+---------------------------------------
+b_peek() | const buffer *buf| returns a pointer to the data at
+ | size_t ofs | position <ofs> relative to the head of
+ | ret: char * | the buffer. Will typically point to
+ | | input data if called with the amount
+ | | of output data. If applying <ofs> to
+ | | the buffer's head results in a
+ | | position between <size> and 2*<size>-1
+ | | included, a wrapping compensation is
+ | | applied to the result
+--------------------+------------------+---------------------------------------
+b_peek_ofs() | const buffer *buf| returns an origin-relative offset
+ | size_t ofs | pointing to the data at position <ofs>
+ | ret: size_t | relative to the head of the
+ | | buffer. Will typically point to input
+ | | data if called with the amount of
+ | | output data. If applying <ofs> to the
+ | | buffer's head results in a position
+ | | between <size> and 2*<size>-1
+ | | included, a wrapping compensation is
+ | | applied to the result
+--------------------+------------------+---------------------------------------
+__b_head() | const buffer *buf| returns the pointer to the buffer's
+ | ret: char * | head, which is the location of the
+ | | next byte to be dequeued. The result
+ | | is undefined for unallocated buffers
+--------------------+------------------+---------------------------------------
+__b_head_ofs() | const buffer *buf| returns an origin-relative offset
+ | ret: size_t | pointing to the buffer's head, which
+ | | is the location of the next byte to be
+ | | dequeued. The result is undefined for
+ | | unallocated buffers
+--------------------+------------------+---------------------------------------
+b_head() | const buffer *buf| returns the pointer to the buffer's
+ | ret: char * | head, which is the location of the
+ | | next byte to be dequeued. The result
+ | | is undefined for unallocated
+ | | buffers. If applying <ofs> to the
+ | | buffer's head results in a position
+ | | between <size> and 2*<size>-1
+ | | included, a wrapping compensation is
+ | | applied to the result
+--------------------+------------------+---------------------------------------
+b_head_ofs() | const buffer *buf| returns an origin-relative offset
+ | ret: size_t | pointing to the buffer's head, which
+ | | is the location of the next byte to be
+ | | dequeued. The result is undefined for
+ | | unallocated buffers. If applying
+ | | <ofs> to the buffer's head results in
+ | | a position between <size> and
+ | | 2*<size>-1 included, a wrapping
+ | | compensation is applied to the result
+--------------------+------------------+---------------------------------------
+__b_tail() | const buffer *buf| returns the pointer to the tail of the
+ | ret: char * | buffer, which is the location of the
+ | | first byte where it is possible to
+ | | enqueue new data. The result is
+ | | undefined for unallocated buffers
+--------------------+------------------+---------------------------------------
+__b_tail_ofs() | const buffer *buf| returns an origin-relative offset
+ | ret: size_t | pointing to the tail of the buffer,
+ | | which is the location of the first
+ | | byte where it is possible to enqueue
+ | | new data. The result is undefined for
+ | | unallocated buffers
+--------------------+------------------+---------------------------------------
+b_tail() | const buffer *buf| returns the pointer to the tail of the
+ | ret: char * | buffer, which is the location of the
+ | | first byte where it is possible to
+ | | enqueue new data. The result is
+ | | undefined for unallocated buffers
+--------------------+------------------+---------------------------------------
+b_tail_ofs() | const buffer *buf| returns an origin-relative offset
+ | ret: size_t | pointing to the tail of the buffer,
+ | | which is the location of the first
+ | | byte where it is possible to enqueue
+ | | new data. The result is undefined for
+ | | unallocated buffers
+--------------------+------------------+---------------------------------------
+b_next() | const buffer *buf| for an absolute pointer <p> pointing
+ | const char *p | to a valid location within buffer <b>,
+ | ret: char * | returns the absolute pointer to the
+ | | next byte, which usually is at (p + 1)
+ | | unless p reaches the wrapping point
+ | | and wrapping is needed
+--------------------+------------------+---------------------------------------
+b_next_ofs() | const buffer *buf| for an origin-relative offset <o>
+ | size_t o | pointing to a valid location within
+ | ret: size_t | buffer <b>, returns the
+ | | relative offset pointing to the next
+ | | byte, which usually is at (o + 1)
+ | | unless o reaches the wrapping point
+ | | and wrapping is needed
+--------------------+------------------+---------------------------------------
+b_dist() | const buffer *buf| returns the distance between two
+ | const char *from | pointers, taking into account the
+ | const char *to | ability to wrap around the buffer's
+ | ret: size_t | end. The operation is not defined if
+ | | either of the pointers does not belong
+ | | to the buffer or if their distance is
+ | | greater than the buffer's size
+--------------------+------------------+---------------------------------------
+b_almost_full() | const buffer *buf| returns 1 if the buffer uses at least
+ | ret: int | 3/4 of its capacity, otherwise
+ | | zero. Buffers of size zero are
+ | | considered full
+--------------------+------------------+---------------------------------------
+b_space_wraps() | const buffer *buf| returns non-zero only if the buffer's
+ | ret: int | free space wraps, which means that the
+ | | buffer contains data that are not
+ | | touching at least one edge
+--------------------+------------------+---------------------------------------
+b_contig_data() | const buffer *buf| returns the amount of data that can
+ | size_t start | contiguously be read at once starting
+ | ret: size_t | from a relative offset <start> (which
+ | | allows to easily pre-compute blocks
+ | | for memcpy). The start point will
+ | | typically contain the amount of past
+ | | data already returned by a previous
+ | | call to this function
+--------------------+------------------+---------------------------------------
+b_contig_space() | const buffer *buf| returns the amount of bytes that can
+ | ret: size_t | be appended to the buffer at once
+--------------------+------------------+---------------------------------------
+b_getblk() | const buffer *buf| gets one full block of data at once
+ | char *blk | from a buffer, starting from offset
+ | size_t len | <offset> after the buffer's head, and
+ | size_t offset | limited to no more than <len> bytes.
+ | ret: size_t | The caller is responsible for ensuring
+ | | that neither <offset> nor <offset> +
+ | | <len> exceed the total number of bytes
+ | | available in the buffer. Return zero
+ | | if not enough data was available, in
+ | | which case blk is left undefined, or
+ | | the number of bytes read which is
+ | | equal to the requested size
+--------------------+------------------+---------------------------------------
+b_getblk_nc() | const buffer *buf| gets one or two blocks of data at once
+ | const char **blk1| from a buffer, starting from offset
+ | size_t *len1 | <ofs> after the beginning of its
+ | const char **blk2| output, and limited to no more than
+ | size_t *len2 | <max> bytes. The caller is responsible
+ | size_t ofs | for ensuring that neither <ofs> nor
+ | size_t max | <ofs>+<max> exceed the total number of
+ | ret: int | bytes available in the buffer. Returns
+ | | 0 if not enough data were available,
+ | | or the number of blocks filled (1 or
+ | | 2). <blk1> is always filled before
+ | | <blk2>. The unused blocks are left
+ | | undefined, and the buffer is left
+ | | unaffected. Unused buffers are left in
+ | | an undefined state
+--------------------+------------------+---------------------------------------
+b_reset() | buffer *buf | resets a buffer. The size is not
+ | ret: void | touched. In practice it resets the
+ | | head and the data length
+--------------------+------------------+---------------------------------------
+b_sub() | buffer *buf | decreases the buffer length by <count>
+ | size_t count | without touching the head position
+ | ret: void | (only the tail moves). This may mostly
+ | | be used to trim pending data before
+ | | reusing a buffer. The caller is
+ | | responsible for not removing more than
+ | | the available data
+--------------------+------------------+---------------------------------------
+b_add() | buffer *buf | increases the buffer length by <count>
+ | size_t count | without touching the head position
+ | ret: void | (only the tail moves). This is used
+ | | when adding data at the tail of a
+ | | buffer. The caller is responsible for
+ | | not adding more than the available
+ | | room
+--------------------+------------------+---------------------------------------
+b_set_data() | buffer *buf | sets the buffer's length, by adjusting
+ | size_t len | the buffer's tail only. The caller is
+ | ret: void | responsible for passing a valid length
+--------------------+------------------+---------------------------------------
+b_del() | buffer *buf | deletes <del> bytes at the head of
+ | size_t del | buffer <b> and updates the head. The
+ | ret: void | caller is responsible for not removing
+ | | more than the available data. This is
+ | | used after sending data from the
+ | | buffer
+--------------------+------------------+---------------------------------------
+b_realign_if_empty()| buffer *buf | realigns a buffer if it's empty, does
+ | ret: void | nothing otherwise. This is mostly used
+ | | after b_del() to make an empty
+ | | buffer's free space contiguous
+--------------------+------------------+---------------------------------------
+b_slow_realign() | buffer *buf | realigns a possibly wrapping buffer so
+ | char *swap | that the part remaining to be parsed
+ | size_t output | is contiguous and starts at the
+ | ret: void | beginning of the buffer and the
+ | | already parsed output part ends at the
+ | | end of the buffer. This provides the
+ | | best conditions since it allows the
+ | | largest inputs to be processed at once
+ | | and ensures that once the output data
+ | | leaves, the whole buffer is available
+ | | at once. The number of output bytes
+ | | supposedly present at the beginning of
+ | | the buffer and which need to be moved
+ | | to the end must be passed in <output>.
+ | | It will effectively make this offset
+ | | the new wrapping point. A temporary
+ | | swap area at least as large as b->size
+ | | must be provided in <swap>. It's up
+ | | to the caller to ensure <output> is no
+ | | larger than the difference between the
+ | | whole buffer's length and its input
+--------------------+------------------+---------------------------------------
+b_putchar() | buffer *buf | tries to append char <c> at the end of
+ | char c | buffer <b>. Supports wrapping. New
+ | ret: void | data are silently discarded if the
+ | | buffer is already full
+--------------------+------------------+---------------------------------------
+b_putblk() | buffer *buf | tries to append block <blk> at the end
+ | const char *blk | of buffer <b>. Supports wrapping. Data
+ | size_t len | are truncated if the buffer is too
+ | ret: size_t | short or if not enough space is
+ | | available. It returns the number of
+ | | bytes really copied
+--------------------+------------------+---------------------------------------
+b_move() | buffer *buf | moves block (src,len) left or right
+ | size_t src | by <shift> bytes, supporting wrapping
+ | size_t len | and overlapping.
+ | size_t shift |
+--------------------+------------------+---------------------------------------
+b_rep_blk() | buffer *buf | writes the block <blk> at position
+ | char *pos | <pos> which must be in buffer <b>, and
+ | char *end | moves the part between <end> and the
+ | const char *blk | buffer's tail just after the end of
+ | size_t len | the copy of <blk>. This effectively
+ | ret: int | replaces the part located between
+ | | <pos> and <end> with a copy of <blk>
+ | | of length <len>. The buffer's length
+ | | is automatically updated. This is used
+ | | to replace a block with another one
+ | | inside a buffer. The shift value
+ | | (positive or negative) is returned. If
+ | | there's no space left, the move is not
+ | | done. If <len> is null, the <blk>
+ | | pointer is allowed to be null, in
+ | | order to erase a block
+--------------------+------------------+---------------------------------------
+b_xfer() | buffer *src | transfers at most <count> bytes from
+ | buffer *dst | buffer <src> to buffer <dst> and
+ | size_t count | returns the number of bytes copied.
+ | ret: size_t | The bytes are removed from <src> and
+ | | added to <dst>. The caller guarantees
+ | | that <count> is <= b_room(dst)
+====================+==================+=======================================
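+
+The wrapping behaviour described for b_putblk(), b_getblk() and b_del() can
+be illustrated with a small standalone model. The "sk_*" functions below are
+hypothetical re-implementations written for this example from the table's
+descriptions. They are not HAProxy's code and only aim at showing how the
+head, the data count and the wrapping point interact.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sk_buf {
    size_t size;
    char  *area;
    size_t data;
    size_t head;
};

/* Appends up to <len> bytes at the tail, truncating to the available room.
 * Returns the number of bytes copied.
 */
static size_t sk_putblk(struct sk_buf *b, const char *blk, size_t len)
{
    size_t room = b->size - b->data;
    size_t tail, half;

    if (len > room)
        len = room;
    if (!len)
        return 0;

    tail = b->head + b->data;
    if (tail >= b->size)
        tail -= b->size;

    half = b->size - tail;          /* bytes left before the wrapping point */
    if (half > len)
        half = len;
    memcpy(b->area + tail, blk, half);
    memcpy(b->area, blk + half, len - half);  /* part stored after wrapping */
    b->data += len;
    return len;
}

/* Copies <len> bytes found <offset> bytes after the head into <blk>.
 * Returns <len>, or 0 if not enough data are available.
 */
static size_t sk_getblk(const struct sk_buf *b, char *blk, size_t len, size_t offset)
{
    size_t pos, half;

    if (offset > b->data || len > b->data - offset)
        return 0;

    pos = b->head + offset;
    if (pos >= b->size)
        pos -= b->size;

    half = b->size - pos;
    if (half > len)
        half = len;
    memcpy(blk, b->area + pos, half);
    memcpy(blk + half, b->area, len - half);
    return len;
}

/* Drops <del> bytes at the head, as done after sending them. */
static void sk_del(struct sk_buf *b, size_t del)
{
    assert(del <= b->data);
    b->data -= del;
    b->head += del;
    if (b->head >= b->size)
        b->head -= b->size;
}

/* Fills an 8-byte buffer, consumes part of it, appends across the wrapping
 * point, then returns the buffer's contents as a string.
 */
static const char *sk_demo(void)
{
    static char out[16];
    char area[8];
    struct sk_buf b = { .size = sizeof(area), .area = area, .data = 0, .head = 0 };
    size_t len;

    sk_putblk(&b, "abcdef", 6);     /* contents "abcdef", tail at 6 */
    sk_del(&b, 4);                  /* contents "ef", head at 4 */
    sk_putblk(&b, "ghijkl", 6);     /* 2 bytes fit at the end, 4 wrap */
    len = sk_getblk(&b, out, b.data, 0);
    out[len] = '\0';
    return out;
}
```

+
+In sk_demo() the second append lands partly at the end of the area and
+partly at its beginning, yet reading from the head returns the bytes in the
+order they were written.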
+
+
+4.2. String API
+
+The string API aims at providing both convenient and efficient ways to read and
+write to/from buffers using indirect strings (ist). These strings and some
+associated functions are defined in ist.h.
+
+====================+==================+=======================================
+Function | Arguments/Return | Description
+--------------------+------------------+---------------------------------------
+b_isteq() | const buffer *b | b_isteq() : returns > 0 if the first
+ | size_t o | <n> characters of buffer <b> starting
+ | size_t n | at offset <o> relative to the buffer's
+ | const ist ist | head match <ist>. (empty strings do
+ | ret: int | match). It is designed to be used with
+ | | reasonably small strings (it matches a
+ | | single byte per loop iteration). It is
+ | | expected to be used with an offset to
+ | | skip old data. Returns the number of
+ | | matching bytes if >0, not enough bytes
+ | | or empty string if 0, or non-matching
+ | | byte found if <0.
+--------------------+------------------+---------------------------------------
+b_isteat | struct buffer *b | b_isteat() : "eats" string <ist> from
+ | const ist ist | the head of buffer <b>. Wrapping data
+ | ret: ssize_t | is explicitly supported. It matches a
+ | | single byte per iteration so strings
+ | | should remain reasonably small.
+ | | Returns the number of bytes matched
+ | | and eaten if >0, not enough bytes or
+ | | matched empty string if 0, or non
+ | | matching byte found if <0.
+--------------------+------------------+---------------------------------------
+b_istput | struct buffer *b | b_istput() : injects string <ist> at
+ | const ist ist | the tail of output buffer <b> provided
+ | ret: ssize_t | that it fits. Wrapping is supported.
+ | | It's designed for small strings as it
+ | | only writes a single byte per
+ | | iteration. Returns the number of
+ | | characters copied (ist.len), 0 if it
+ | | temporarily does not fit, or -1 if it
+ | | will never fit. It will only modify
+ | | the buffer upon success. In all cases,
+ | | the contents are copied prior to
+ | | reporting an error, so that the
+ | | destination at least contains a valid
+ | | but truncated string.
+--------------------+------------------+---------------------------------------
+b_putist | struct buffer *b | b_putist() : tries to copy as much as
+ | const ist ist | possible of string <ist> into buffer
+ | ret: size_t | <b> and returns the number of bytes
+ | | copied (truncation is possible). It
+ | | uses b_putblk() and is suitable for
+ | | large blocks.
+====================+==================+=======================================
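+
+The three-way return value of b_isteq() described above can be modelled with a
+small standalone sketch. The names toy_buffer, toy_ist and toy_isteq() are
+hypothetical stand-ins invented for this illustration, not HAProxy's own
+definitions; only the documented behaviour (byte-per-byte comparison, wrapping
+support, >0 / 0 / <0 results) is reproduced.
+
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical stand-ins for a wrapping buffer and an indirect string.
 * Only the fields needed by this sketch are modelled. */
struct toy_buffer {
	size_t size;  /* storage size in bytes */
	char  *area;  /* start of the storage area */
	size_t data;  /* number of bytes present */
	size_t head;  /* offset of the first byte, relative to <area> */
};

struct toy_ist {
	const char *ptr;
	size_t len;
};

/* Compares <s> with the bytes found <o> bytes after the head of <b>, <n>
 * being the number of bytes the caller allows to be inspected.
 * Returns s.len on a full match (0 for an empty string), 0 if <n> is too
 * short to hold <s>, or -1 as soon as a byte differs.
 */
static inline ssize_t toy_isteq(const struct toy_buffer *b, size_t o, size_t n,
                                struct toy_ist s)
{
	size_t pos;
	size_t i;

	assert(b && b->size > 0);
	if (n < s.len)
		return 0;

	pos = (b->head + o) % b->size;
	for (i = 0; i < s.len; i++) {
		if (b->area[pos] != s.ptr[i])
			return -1;
		if (++pos == b->size)   /* wrap to the start of the area */
			pos = 0;
	}
	return (ssize_t)s.len;
}
```
+
+A caller would typically test the sign of the result first and only then use
+it as a length, since 0 covers both "not enough bytes" and "empty string".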
+
+
+4.3. Management API
+
+The management API makes a distinction between an empty buffer, which by
+definition is not allocated but is ready to be allocated at any time, and a
+buffer which failed an allocation and is waiting for an available area to be
+offered. The functions allow a caller to register on a list to be notified
+about buffer availability, and to notify others of a number of buffers just
+released. All allocations are made through the standard buffer pools.
+
+====================+==================+=======================================
+Function | Arguments/Return | Description
+--------------------+------------------+---------------------------------------
+buffer_almost_full | const buffer *buf| returns true if the buffer is not null
+ | ret: int | and at least 3/4 of the buffer's space
+ | | are used. A waiting buffer will match.
+--------------------+------------------+---------------------------------------
+b_alloc | buffer *buf | ensures that <buf> is allocated or
+ | ret: buffer * | allocates a buffer and assigns it to
+ | | *buf. If no memory is available, (1)
+ | | is assigned instead with a zero size.
+ | | The allocated buffer is returned, or
+ | | NULL in case no memory is available
+--------------------+------------------+---------------------------------------
+__b_free | buffer *buf | releases <buf> which must be allocated
+ | ret: void | and marks it empty
+--------------------+------------------+---------------------------------------
+b_free | buffer *buf | releases <buf> only if it is allocated
+ | ret: void | and marks it empty
+--------------------+------------------+---------------------------------------
+offer_buffers() | void *from | offer a buffer currently belonging to
+ | uint threshold | target <from> to whoever needs
+ | ret: void | one. Any pointer is valid for <from>,
+ | | including NULL. Its purpose is to
+ | | avoid passing a buffer to oneself in
+ | | case of failed allocations (e.g. need
+ | | two buffers, get one, fail, release it
+ | | and wake up self again). In case of
+ | | normal buffer release where it is
+ | | expected that the caller is not
+ | | waiting for a buffer, NULL is fine
+====================+==================+=======================================
+
+
+5. Porting code from older versions
+
+The previous buffer API introduced in 1.5-dev9 (May 2012) used to look like the
+following (with the struct renamed to old_buffer here to avoid confusion during
+quick lookups at the doc). It's worth noting that the "data" field used to be
+part of the struct but with a different type and meaning. It's important to be
+careful about potential code making use of &b->data as it will silently compile
+but fail.
+
+ Previous buffer declaration :
+
+ struct old_buffer {
+ char *p; /* buffer's start pointer, separates in and out data */
+ unsigned int size; /* buffer size in bytes */
+ unsigned int i; /* number of input bytes pending for analysis in the buffer */
+ unsigned int o; /* number of out bytes the sender can consume from this buffer */
+ char data[0]; /* <size> bytes */
+ };
+
+ Previous linear buffer representation :
+
+ data p
+ | |
+ V V
+ +-----------+--------------------+------------+-------------+
+ | |////////////////////|////////////| |
+ +-----------+--------------------+------------+-------------+
+ <---------------------------------------------------------> size
+ <------------------> <---------->
+ o i
+
+There is this correspondence between old and new fields (some will involve a
+knowledge of a channel when the output byte count is required) :
+
+ Old | New
+ --------+----------------------------------------------------
+ p | area + head + co_data(channel) // ci_head(channel)
+ size | size
+ i | data - co_data(channel) // ci_data(channel)
+ o | co_data(channel) // channel->output
+ data | area
+ --------+-----------------------------------------------------
+
+Then some common expressions can be mapped like this :
+
+ Old | New
+ -----------------------+---------------------------------------
+ b->data | b_orig(b)
+ &b->data | b_orig(b)
+ bi_ptr(b) | ci_head(channel)
+ bi_end(b) | b_tail(b)
+ bo_ptr(b) | b_head(b)
+ bo_end(b) | co_tail(channel)
+ bi_putblk(b,s,l) | b_putblk(b,s,l)
+ bo_getblk(b,s,l,o) | b_getblk(b,s,l,o)
+ bo_getblk_nc(b,s,l,o) | b_getblk_nc(b,s,l,o,0,co_data(channel))
+ b->i + b->o | b_data(b)
+ b->data + b->size | b_wrap(b)
+ b->i += len | b_add(b, len)
+ b->i -= len | b_sub(b, len)
+ b->i = len | b_set_data(b, co_data(channel) + len)
+ b->o += len | b_add(b, len); channel->output += len
+ b->o -= len | b_del(b, len); channel->output -= len
+ -----------------------+---------------------------------------
+
+The buffer modification functions are less straightforward and depend a lot on
+the context where they are used. It is strongly advised to look up in the list
+of functions above what is available for what the existing code is trying to
+do.
+
+Note that it is very likely that any out-of-tree code relying on buffers will
+not use both ->i and ->o but instead will use exclusively ->i on the side
+producing data and use exclusively ->o on the side consuming data (such as in a
+mux or in an applet). In both cases, it should be assumed that the other side
+is always zero and that either ->i or ->o is replaced with ->data, making the
+remaining code much simpler (no more code duplication based on the data
+direction).
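+
+The field correspondence above can be checked with a small standalone sketch.
+The old_view and new_view types and the old_to_new() helper are hypothetical
+names invented for this illustration; pointers are replaced with offsets from
+the start of the storage area so that the arithmetic is easy to follow, and
+the <output> field stands for what co_data(channel) would report.
+
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened view of the old layout. */
struct old_view {
	size_t p;     /* offset of the old ->p from the start of the storage */
	size_t size;  /* storage size in bytes */
	size_t i;     /* old ->i : input bytes, starting at <p> */
	size_t o;     /* old ->o : output bytes, ending just before <p> */
};

/* Hypothetical flattened view of the new layout plus the channel's count. */
struct new_view {
	size_t size;    /* storage size in bytes */
	size_t head;    /* offset of the first byte to be consumed */
	size_t data;    /* total number of bytes present */
	size_t output;  /* what co_data(channel) would return */
};

/* Applies the mapping from the table: size is unchanged, data is i + o,
 * the output count is o, and head sits <o> bytes before <p>, modulo size.
 */
static inline struct new_view old_to_new(struct old_view ob)
{
	struct new_view nb;

	assert(ob.size > 0 && ob.p < ob.size && ob.i + ob.o <= ob.size);
	nb.size   = ob.size;
	nb.output = ob.o;
	nb.data   = ob.i + ob.o;
	nb.head   = (ob.p + ob.size - ob.o) % ob.size;
	return nb;
}
```
+
+Reading the table the other way round, the old <p> is recovered as
+(head + output) modulo size, and the old <i> as data - output.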
diff --git a/doc/internals/api/filters.txt b/doc/internals/api/filters.txt
new file mode 100644
index 0000000..eee74cf
--- /dev/null
+++ b/doc/internals/api/filters.txt
@@ -0,0 +1,1186 @@
+ -----------------------------------------
+ Filters Guide - version 2.5
+ ( Last update: 2021-02-24 )
+ ------------------------------------------
+ Author : Christopher Faulet
+ Contact : christopher dot faulet at capflam dot org
+
+
+ABSTRACT
+--------
+
+The filters support is a new feature of HAProxy 1.7. It is a way to extend
+HAProxy without touching its core code and, to a certain extent, without knowing
+its internals. This feature will ease contributions, reducing impact of
+changes. Another advantage will be to simplify HAProxy by replacing some parts
+by filters. As we will see, and as an example, the HTTP compression is the first
+feature moved into a filter.
+
+This document describes how to write a filter and what to keep in mind to do
+so. It also talks about the known limits and the pitfalls to avoid.
+
+As said, filters are quite new for now. The API is not frozen and will be
+updated/modified/improved/extended as needed.
+
+
+
+SUMMARY
+-------
+
+ 1. Filters introduction
+ 2. How to use filters
+ 3. How to write a new filter
+ 3.1. API Overview
+ 3.2. Defining the filter name and its configuration
+ 3.3. Managing the filter lifecycle
+ 3.3.1. Dealing with threads
+ 3.4. Handling the streams activity
+ 3.5. Analyzing the channels activity
+ 3.6. Filtering the data exchanged
+ 4. FAQ
+
+
+
+1. FILTERS INTRODUCTION
+-----------------------
+
+First of all, to fully understand how filters work and how to create one, it is
+best to know, at least from a distance, what a proxy (frontend/backend), a
+stream and a channel are in HAProxy and how these entities are linked to each
+other. doc/internals/entities.pdf is a good overview.
+
+Then, to support filters, many callbacks have been added to HAProxy at different
+places, mainly around channel analyzers. Their purpose is to allow filters to
+be involved in the data processing, from the stream creation/destruction to
+the data forwarding. Depending on what it should do, a filter can implement all
+or part of these callbacks. For now, existing callbacks are focused on
+streams. But future improvements could enlarge filters scope. For instance, it
+could be useful to handle events at the connection level.
+
+In HAProxy configuration file, a filter is declared in a proxy section, except
+defaults. So the configuration corresponding to a filter declaration is attached
+to a specific proxy, and will be shared by all its instances. It is opaque from
+the HAProxy point of view; it is the filter's responsibility to manage it. Each
+filter declaration has its own unique configuration. Several declarations of
+the same filter in the same proxy will be handled as different filters by
+HAProxy.
+
+A filter instance is represented by a partially opaque context (or a state)
+attached to a stream and passed as arguments to callbacks. Through this context,
+filter instances are stateful. Depending on whether the filter is declared in a
+frontend or a backend section, its instances will be created, respectively, when
+a stream is created or when a backend is selected. Their behaviors will also be
+different. Only instances of filters declared in a frontend section will be
+aware of the creation and the destruction of the stream, and will take part in
+the channels analyzing before the backend is defined.
+
+It is important to remember the configuration of a filter is shared by all its
+instances, while the context of an instance is owned by a single stream.
+
+Filters are designed to be chained. It is possible to declare several filters in
+the same proxy section. The declaration order is important because filters will
+be called one after the other respecting this order. Frontend and backend
+filters are also chained, frontend ones called first. Even if the filters
+processing is serialized, each filter will behave as if it were alone (unless it
+was developed to be aware of other filters). For all that, some constraints are
+imposed on filters, especially when data exchanged between the client and the
+server are processed. We will discuss these constraints again when we tackle
+the subject of writing a filter.
+
+
+
+2. HOW TO USE FILTERS
+---------------------
+
+To use a filter, the parameter 'filter' should be used, followed by the filter
+name and, optionally, its configuration in the desired listen, frontend or
+backend section. For instance :
+
+ listen test
+ ...
+ filter trace name TST
+ ...
+
+
+See doc/configuration.txt for a formal definition of the parameter 'filter'.
+Note that additional parameters on the filter line must be parsed by the filter
+itself.
+
+The list of available filters is reported by 'haproxy -vv' :
+
+ $> haproxy -vv
+ HAProxy version 1.7-dev2-3a1d4a-33 2016/03/21
+ Copyright 2000-2016 Willy Tarreau <willy@haproxy.org>
+
+ [...]
+
+ Available filters :
+ [COMP] compression
+ [TRACE] trace
+
+
+Multiple filter lines can be used in a proxy section to chain filters. Filters
+will be called in the declaration order.
+
+Some filters can support implicit declarations in certain circumstances
+(without the filter line). This is not recommended for new features but is
+useful for existing ones moved into a filter, for backward compatibility
+reasons. Implicit declarations are supported when there is only one filter used
+on a proxy. When several filters are used, explicit declarations are mandatory.
+The HTTP compression filter is one of these filters. Alone, using 'compression'
+keywords is enough to use it. But when at least a second filter is used, a
+filter line must be added.
+
+ # filter line is optional
+ listen t1
+ bind *:80
+ compression algo gzip
+ compression offload
+ server srv x.x.x.x:80
+
+ # filter line is mandatory for the compression filter
+ listen t2
+ bind *:81
+ filter trace name T2
+ filter compression
+ compression algo gzip
+ compression offload
+ server srv x.x.x.x:80
+
+
+
+
+3. HOW TO WRITE A NEW FILTER
+----------------------------
+
+To write a filter, there are 2 header files to explore :
+
+ * include/haproxy/filters-t.h : This is the main header file, containing all
+ important structures to use. It represents the
+ filter API.
+
+ * include/haproxy/filters.h : This header file contains helper functions that
+ may be used. It also contains the internal API
+ used by HAProxy to handle filters.
+
+To ease the filters integration, it is better to follow some conventions :
+
+ * Use 'flt_' prefix to name the filter (e.g. flt_http_comp or flt_trace).
+
+ * Keep everything related to the filter in a same file.
+
+The filter 'trace' can be used as a template to write a new filter. It is a good
+start to see how filters really work.
+
+3.1. API OVERVIEW
+-----------------
+
+Writing a filter comes down to writing functions and attaching them to the
+existing callbacks. Available callbacks are listed in the following structure :
+
+ struct flt_ops {
+ /*
+ * Callbacks to manage the filter lifecycle
+ */
+ int (*init) (struct proxy *p, struct flt_conf *fconf);
+ void (*deinit) (struct proxy *p, struct flt_conf *fconf);
+ int (*check) (struct proxy *p, struct flt_conf *fconf);
+ int (*init_per_thread) (struct proxy *p, struct flt_conf *fconf);
+ void (*deinit_per_thread)(struct proxy *p, struct flt_conf *fconf);
+
+ /*
+ * Stream callbacks
+ */
+ int (*attach) (struct stream *s, struct filter *f);
+ int (*stream_start) (struct stream *s, struct filter *f);
+ int (*stream_set_backend)(struct stream *s, struct filter *f, struct proxy *be);
+ void (*stream_stop) (struct stream *s, struct filter *f);
+ void (*detach) (struct stream *s, struct filter *f);
+ void (*check_timeouts) (struct stream *s, struct filter *f);
+
+ /*
+ * Channel callbacks
+ */
+ int (*channel_start_analyze)(struct stream *s, struct filter *f,
+ struct channel *chn);
+ int (*channel_pre_analyze) (struct stream *s, struct filter *f,
+ struct channel *chn,
+ unsigned int an_bit);
+ int (*channel_post_analyze) (struct stream *s, struct filter *f,
+ struct channel *chn,
+ unsigned int an_bit);
+ int (*channel_end_analyze) (struct stream *s, struct filter *f,
+ struct channel *chn);
+
+ /*
+ * HTTP callbacks
+ */
+ int (*http_headers) (struct stream *s, struct filter *f,
+ struct http_msg *msg);
+ int (*http_payload) (struct stream *s, struct filter *f,
+ struct http_msg *msg, unsigned int offset,
+ unsigned int len);
+ int (*http_end) (struct stream *s, struct filter *f,
+ struct http_msg *msg);
+
+ void (*http_reset) (struct stream *s, struct filter *f,
+ struct http_msg *msg);
+ void (*http_reply) (struct stream *s, struct filter *f,
+ short status,
+ const struct buffer *msg);
+
+ /*
+ * TCP callbacks
+ */
+ int (*tcp_payload) (struct stream *s, struct filter *f,
+ struct channel *chn, unsigned int offset,
+ unsigned int len);
+ };
+
+
+We will explain in following parts when these callbacks are called and what they
+should do.
+
+Filters are declared in proxy sections. So each proxy has an ordered list of
+filters, possibly empty if no filter is used. When the configuration of a proxy
+is parsed, each filter line represents an entry in this list. In the structure
+'proxy', the filters configurations are stored in the field 'filter_configs',
+each one of type 'struct flt_conf *' :
+
+ /*
+ * Structure representing the filter configuration, attached to a proxy and
+ * accessible from a filter when instantiated in a stream
+ */
+ struct flt_conf {
+ const char *id; /* The filter id */
+ struct flt_ops *ops; /* The filter callbacks */
+ void *conf; /* The filter configuration */
+ struct list list; /* Next filter for the same proxy */
+ unsigned int flags; /* FLT_CFG_FL_* */
+ };
+
+ * 'flt_conf.id' is an identifier, defined by the filter. It can be
+ NULL. HAProxy does not use this field. Filters can use it in log messages or
+ as a unique identifier to check multiple declarations. It is the filter
+ responsibility to free it, if necessary.
+
+ * 'flt_conf.conf' is opaque. It is the internal configuration of a filter,
+ generally allocated and filled by its parsing function (See § 3.2). It is
+ the filter responsibility to free it.
+
+ * 'flt_conf.ops' references the callbacks implemented by the filter. This
+ field must be set during the parsing phase (See § 3.2) and can be refined
+ during the initialization phase (See § 3.3). If it is dynamically allocated,
+ it is the filter responsibility to free it.
+
+ * 'flt_conf.flags' is a bitfield to specify the filter capabilities. For now,
+ only FLT_CFG_FL_HTX may be set when a filter is able to process HTX
+ streams. If not set, the filter is excluded from the HTTP filtering.
+
+
+The filter configuration is global and shared by all its instances. A filter
+instance is created in the context of a stream and attached to this stream. In
+the structure 'stream', the field 'strm_flt' is the state of all filter
+instances attached to a stream :
+
+ /*
+ * Structure representing the "global" state of filters attached to a
+ * stream.
+ */
+ struct strm_flt {
+ struct list filters; /* List of filters attached to a stream */
+ struct filter *current[2]; /* From which filter resume processing, for a specific channel.
+ * This is used for resumable callbacks only,
+ * If NULL, we start from the first filter.
+ * 0: request channel, 1: response channel */
+ unsigned short flags; /* STRM_FL_* */
+ unsigned char nb_req_data_filters; /* Number of data filters registered on the request channel */
+ unsigned char nb_rsp_data_filters; /* Number of data filters registered on the response channel */
+ unsigned long long offset[2]; /* global offset of input data already filtered for a specific channel
+ * 0: request channel, 1: response channel */
+ };
+
+
+Filter instances attached to a stream are stored in the field
+'strm_flt.filters', each instance is of type 'struct filter *' :
+
+ /*
+ * Structure representing a filter instance attached to a stream
+ *
+ * 2D-Array fields are used to store info per channel. The first index
+ * stands for the request channel, and the second one for the response
+ * channel. Especially, <next> and <fwd> are offsets representing amount of
+ * data that the filter are, respectively, parsed and forwarded on a
+ * channel. Filters can access these values using FLT_NXT and FLT_FWD
+ * macros.
+ */
+ struct filter {
+ struct flt_conf *config; /* the filter's configuration */
+ void *ctx; /* The filter context (opaque) */
+ unsigned short flags; /* FLT_FL_* */
+ unsigned long long offset[2]; /* Offset of input data already filtered for a specific channel
+ * 0: request channel, 1: response channel */
+ unsigned int pre_analyzers; /* bit field indicating analyzers to
+ * pre-process */
+ unsigned int post_analyzers; /* bit field indicating analyzers to
+ * post-process */
+ struct list list; /* Next filter for the same proxy/stream */
+ };
+
+ * 'filter.config' is the filter configuration previously described. All
+ instances of a filter share it.
+
+ * 'filter.ctx' is an opaque context. It is managed by the filter, so it is its
+ responsibility to free it.
+
+ * 'filter.pre_analyzers' and 'filter.post_analyzers' will be described later
+ (See § 3.5).
+
+ * 'filter.offset' will be described later (See § 3.6).
+
+
+3.2. DEFINING THE FILTER NAME AND ITS CONFIGURATION
+---------------------------------------------------
+
+During the filter development, the first thing to do is to add it in the
+supported filters. To do so, its name must be registered as a valid keyword on
+the filter line :
+
+ /* Declare the filter parser for "my_filter" keyword */
+ static struct flt_kw_list flt_kws = { "MY_FILTER_SCOPE", { }, {
+ { "my_filter", parse_my_filter_cfg, NULL /* private data */ },
+ { NULL, NULL, NULL },
+ }
+ };
+ INITCALL1(STG_REGISTER, flt_register_keywords, &flt_kws);
+
+
+Then the filter internal configuration must be defined. For instance :
+
+ struct my_filter_config {
+ struct proxy *proxy;
+ char *name;
+ /* ... */
+ };
+
+
+All callbacks implemented by the filter must then be declared. Here, a global
+variable is used :
+
+ struct flt_ops my_filter_ops = {
+ .init = my_filter_init,
+ .deinit = my_filter_deinit,
+ .check = my_filter_config_check,
+
+ /* ... */
+ };
+
+
+Finally, the function to parse the filter configuration must be written, here
+'parse_my_filter_cfg'. This function must parse all remaining keywords on the
+filter line :
+
+ /* Return -1 on error, else 0 */
+ static int
+ parse_my_filter_cfg(char **args, int *cur_arg, struct proxy *px,
+ struct flt_conf *flt_conf, char **err, void *private)
+ {
+ struct my_filter_config *my_conf;
+ int pos = *cur_arg;
+
+ /* Allocate the internal configuration used by the filter */
+ my_conf = calloc(1, sizeof(*my_conf));
+ if (!my_conf) {
+ memprintf(err, "%s : out of memory", args[*cur_arg]);
+ return -1;
+ }
+ my_conf->proxy = px;
+
+ /* ... */
+
+ /* Parse all keywords supported by the filter and fill the internal
+ * configuration */
+ pos++; /* Skip the filter name */
+ while (*args[pos]) {
+ if (!strcmp(args[pos], "name")) {
+ if (!*args[pos + 1]) {
+ memprintf(err, "'%s' : '%s' option without value",
+ args[*cur_arg], args[pos]);
+ goto error;
+ }
+ my_conf->name = strdup(args[pos + 1]);
+ if (!my_conf->name) {
+ memprintf(err, "%s : out of memory", args[*cur_arg]);
+ goto error;
+ }
+ pos += 2;
+ }
+
+ /* ... parse other keywords ... */
+ }
+ *cur_arg = pos;
+
+ /* Set callbacks supported by the filter */
+ flt_conf->ops = &my_filter_ops;
+
+ /* Last, save the internal configuration */
+ flt_conf->conf = my_conf;
+ return 0;
+
+ error:
+ if (my_conf->name)
+ free(my_conf->name);
+ free(my_conf);
+ return -1;
+ }
+
+
+WARNING : In this parsing function, 'flt_conf->ops' must be initialized. All
+ arguments of the filter line must also be parsed. This is mandatory.
+
+In the previous example, the filter line should be read as follows :
+
+ filter my_filter name MY_NAME ...
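+
+The keyword loop of the parsing function can be exercised outside HAProxy with
+a minimal standalone sketch. The toy_filter_config type and the
+toy_parse_filter_args() helper are hypothetical names used only here, and
+reporting an error on an unknown keyword is an assumption of this sketch
+(leaving the loop is another option); what matters is that every iteration
+either consumes arguments or stops, so the loop cannot spin forever.
+
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical internal configuration, reduced to a single option. */
struct toy_filter_config {
	char *name;
};

/* Walks <args> starting at *cur_arg, which designates the filter name, until
 * the empty string terminating the list. Returns 0 and updates *cur_arg on
 * success, or -1 on error, in which case *cur_arg is left untouched.
 */
static inline int toy_parse_filter_args(char **args, int *cur_arg,
                                        struct toy_filter_config *conf)
{
	int pos;

	assert(args && cur_arg && conf);
	pos = *cur_arg + 1;  /* skip the filter name */

	while (*args[pos]) {
		if (strcmp(args[pos], "name") == 0) {
			size_t len;
			char *copy;

			if (!*args[pos + 1])
				return -1;  /* option without value */
			len = strlen(args[pos + 1]);
			copy = malloc(len + 1);
			if (!copy)
				return -1;  /* out of memory */
			memcpy(copy, args[pos + 1], len + 1);
			free(conf->name);  /* a later "name" replaces the former */
			conf->name = copy;
			pos += 2;
		}
		else
			return -1;  /* unknown keyword: stop instead of looping */
	}
	*cur_arg = pos;
	return 0;
}
```
+
+With the filter line shown above, such a loop would stop on the terminating
+empty argument with the position pointing right after "MY_NAME".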
+
+
+Optionally, by implementing the 'flt_ops.check' callback, an extra step is added
+to check the internal configuration of the filter after the parsing phase, when
+the HAProxy configuration is fully defined. For instance :
+
+ /* Check configuration of a trace filter for a specified proxy.
+ * Return 1 on error, else 0. */
+ static int
+ my_filter_config_check(struct proxy *px, struct flt_conf *my_conf)
+ {
+ if (px->mode != PR_MODE_HTTP) {
+ ha_alert("The filter 'my_filter' cannot be used in non-HTTP mode.\n");
+ return 1;
+ }
+
+ /* ... */
+
+ return 0;
+ }
+
+
+
+3.3. MANAGING THE FILTER LIFECYCLE
+----------------------------------
+
+Once the configuration is parsed and checked, filters are ready to be used.
+There are two main callbacks to manage the filter lifecycle :
+
+ * 'flt_ops.init' : It initializes the filter for a proxy. This callback may be
+ defined to finish the filter configuration.
+
+ * 'flt_ops.deinit' : It cleans up what the parsing function and the init
+ callback have done. This callback is useful to release
+ memory allocated for the filter configuration.
+
+Here is an example :
+
+ /* Initialize the filter. Returns -1 on error, else 0. */
+ static int
+ my_filter_init(struct proxy *px, struct flt_conf *fconf)
+ {
+ struct my_filter_config *my_conf = fconf->conf;
+
+ /* ... */
+
+ return 0;
+ }
+
+ /* Free resources allocated by the trace filter. */
+ static void
+ my_filter_deinit(struct proxy *px, struct flt_conf *fconf)
+ {
+ struct my_filter_config *my_conf = fconf->conf;
+
+ if (my_conf) {
+ free(my_conf->name);
+ /* ... */
+ free(my_conf);
+ }
+ fconf->conf = NULL;
+ }
+
+
+3.3.1. DEALING WITH THREADS
+---------------------------
+
+When HAProxy is compiled with the threads support and started with more than one
+thread (global.nbthread > 1), then it is possible to manage the filter per
+thread with following callbacks :
+
+ * 'flt_ops.init_per_thread': It initializes the filter for each thread. It
+ works the same way as 'flt_ops.init' but in the
+ context of a thread. This callback is called
+ after the thread creation.
+
+ * 'flt_ops.deinit_per_thread': It cleans up what the init_per_thread callback
+ has done. It is called in the context of a
+ thread, before exiting it.
+
+It is the filter responsibility to deal with concurrency. check, init and deinit
+callbacks are called on the main thread. All others are called on a "worker"
+thread (not always the same). It is also the filter responsibility to know if
+HAProxy is started with more than one thread. If it is started with one thread
+(or compiled without the threads support), these callbacks will be silently
+ignored (in this case, global.nbthread will be always equal to one).
+
+
+3.4. HANDLING THE STREAMS ACTIVITY
+-----------------------------------
+
+It may be interesting to handle streams activity. For now, there are three
+callbacks that should be defined to do so :
+
+ * 'flt_ops.stream_start' : It is called when a stream is started. This
+ callback can fail by returning a negative value. It
+ will be considered as a critical error by HAProxy
+ which disables the listener for a short time.
+
+ * 'flt_ops.stream_set_backend' : It is called when a backend is set for a
+ stream. This callback will be called for all
+ filters attached to a stream (frontend and
+ backend). Note this callback is not called if
+ the frontend and the backend are the same.
+
+ * 'flt_ops.stream_stop' : It is called when a stream is stopped. This callback
+ always succeeds. Anyway, it is too late to return an
+ error.
+
+For instance :
+
+ /* Called when a stream is created. Returns -1 on error, else 0. */
+ static int
+ my_filter_stream_start(struct stream *s, struct filter *filter)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+
+ /* ... */
+
+ return 0;
+ }
+
+ /* Called when a backend is set for a stream */
+ static int
+ my_filter_stream_set_backend(struct stream *s, struct filter *filter,
+ struct proxy *be)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+
+ /* ... */
+
+ return 0;
+ }
+
+ /* Called when a stream is destroyed */
+ static void
+ my_filter_stream_stop(struct stream *s, struct filter *filter)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+
+ /* ... */
+ }
+
+
+WARNING : Handling the streams creation and destruction is only possible for
+ filters defined on proxies with the frontend capability.
+
+In addition, it is possible to handle creation and destruction of filter
+instances using following callbacks:
+
+ * 'flt_ops.attach' : It is called after a filter instance creation, when it is
+ attached to a stream. This happens when the stream is
+ started for filters defined on the stream's frontend and
+ when the backend is set for filters declared on the
+ stream's backend. It is possible to ignore the filter, if
+ needed, by returning 0. This could be useful to have
+ conditional filtering.
+
+ * 'flt_ops.detach' : It is called when a filter instance is detached from a
+ stream, before its destruction. This happens when the
+ stream is stopped for filters defined on the stream's
+ frontend and when the analyze ends for filters defined on
+ the stream's backend.
+
+For instance :
+
+ /* Called when a filter instance is created and attached to a stream */
+ static int
+ my_filter_attach(struct stream *s, struct filter *filter)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+
+ if (/* ... */)
+ return 0; /* Ignore the filter here */
+ return 1;
+ }
+
+ /* Called when a filter instance is detached from a stream, just before its
+ * destruction */
+ static void
+ my_filter_detach(struct stream *s, struct filter *filter)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+
+ /* ... */
+ }
+
+Finally, it may be interesting to notify the filter when the stream is woken up
+because of an expired timer. This could let a chance to check some internal
+timeouts, if any. To do so the following callback must be used :
+
+ * 'flt_ops.check_timeouts' : It is called when a stream is woken up because of
+ an expired timer.
+
+For instance :
+
+ /* Called when a stream is woken up because of an expired timer */
+ static void
+ my_filter_check_timeouts(struct stream *s, struct filter *filter)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+
+ /* ... */
+ }
+
+
+3.5. ANALYZING THE CHANNELS ACTIVITY
+------------------------------------
+
+The main purpose of filters is to take part in the channels analyzing. To do so,
+there are 2 callbacks, 'flt_ops.channel_pre_analyze' and
+'flt_ops.channel_post_analyze', called respectively before and after each
+analyzer attached to a channel, except analyzers responsible for the data
+forwarding (TCP or HTTP). Concretely, on the request channel, these callbacks
+could be called before following analyzers :
+
+ * tcp_inspect_request (AN_REQ_INSPECT_FE and AN_REQ_INSPECT_BE)
+ * http_wait_for_request (AN_REQ_WAIT_HTTP)
+ * http_wait_for_request_body (AN_REQ_HTTP_BODY)
+ * http_process_req_common (AN_REQ_HTTP_PROCESS_FE)
+ * process_switching_rules (AN_REQ_SWITCHING_RULES)
+ * http_process_req_common (AN_REQ_HTTP_PROCESS_BE)
+ * http_process_tarpit (AN_REQ_HTTP_TARPIT)
+ * process_server_rules (AN_REQ_SRV_RULES)
+ * http_process_request (AN_REQ_HTTP_INNER)
+ * tcp_persist_rdp_cookie (AN_REQ_PRST_RDP_COOKIE)
+ * process_sticking_rules (AN_REQ_STICKING_RULES)
+
+And on the response channel :
+
+ * tcp_inspect_response (AN_RES_INSPECT)
+ * http_wait_for_response (AN_RES_WAIT_HTTP)
+ * process_store_rules (AN_RES_STORE_RULES)
+ * http_process_res_common (AN_RES_HTTP_PROCESS_BE)
+
+Unlike the callbacks seen previously, 'flt_ops.channel_pre_analyze' can
+interrupt the stream processing. So a filter can decide to not execute the
+analyzer that follows and wait for the next iteration. If there is more than one
+filter, following ones are skipped. On the next iteration, the filtering resumes
+where it was stopped, i.e. on the filter that has previously stopped the
+processing. So it is possible for a filter to stop the stream processing on a
+specific analyzer for a while before continuing. Moreover, this callback can be
+called many times for the same analyzer, until it finishes its processing. For
+instance :
+
+ /* Called before a processing happens on a given channel.
+ * Returns a negative value if an error occurs, 0 if it needs to wait,
+ * any other value otherwise. */
+ static int
+ my_filter_chn_pre_analyze(struct stream *s, struct filter *filter,
+ struct channel *chn, unsigned an_bit)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+
+ switch (an_bit) {
+ case AN_REQ_WAIT_HTTP:
+ if (/* wait that a condition is verified before continuing */)
+ return 0;
+ break;
+ /* ... */
+ }
+ return 1;
+ }
+
+ * 'an_bit' is the analyzer id. All analyzers are listed in
+ 'include/haproxy/channel-t.h'.
+
+ * 'chn' is the channel on which the analyzing is done. It is possible to
+ determine if it is the request or the response channel by testing if
+ CF_ISRESP flag is set :
+
+ ((chn->flags & CF_ISRESP) == CF_ISRESP)
+
+
+In previous example, the stream processing is blocked before receipt of the HTTP
+request until a condition is verified.
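+
+The resume behaviour described above (a filter returning 0 stops the chain,
+and the next iteration restarts on that same filter) can be modelled with a
+standalone sketch. The toy_chain type, toy_run_pre_analyze() and the two
+example callbacks are hypothetical names invented for this illustration; this
+is not HAProxy's implementation, only a model of the rule stated in the text.
+
```c
#include <assert.h>
#include <stddef.h>

#define TOY_NB_FILTERS 3

/* Hypothetical callback type: <0 on error, 0 to wait, >0 to continue. */
typedef int (*toy_pre_analyze_cb)(void *ctx);

struct toy_chain {
	toy_pre_analyze_cb cb[TOY_NB_FILTERS];  /* callbacks, in declaration order */
	void *ctx[TOY_NB_FILTERS];              /* opaque context of each filter */
	size_t current;                         /* index to resume from */
};

/* Calls the filters in order, starting from the one which interrupted the
 * previous iteration, if any. Returns a negative value on error, 0 if a
 * filter asked to wait (following filters are then skipped), or 1 once all
 * filters have let the analyzer run.
 */
static inline int toy_run_pre_analyze(struct toy_chain *c)
{
	size_t i;

	assert(c && c->current < TOY_NB_FILTERS);
	for (i = c->current; i < TOY_NB_FILTERS; i++) {
		int ret = c->cb[i](c->ctx[i]);

		if (ret < 0)
			return ret;
		if (ret == 0) {
			c->current = i;  /* resume on this filter next time */
			return 0;
		}
	}
	c->current = 0;  /* next analyzer starts from the first filter */
	return 1;
}

/* Example callback: counts its calls and never blocks. */
static inline int toy_count_and_continue(void *ctx)
{
	++*(int *)ctx;
	return 1;
}

/* Example callback: asks to wait as long as its counter is positive. */
static inline int toy_wait_n_times(void *ctx)
{
	int *left = ctx;

	if (*left > 0) {
		--*left;
		return 0;
	}
	return 1;
}
```
+
+In this model a filter placed before the blocking one is not called again on
+resume, which is why a filter should not assume that it is called exactly
+once per iteration.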
+
+'flt_ops.channel_post_analyze', for its part, is not resumable. It returns a
+negative value if an error occurs, any other value otherwise. It is called when
+a filterable analyzer finishes its processing, so once for the same analyzer.
+For instance :
+
+ /* Called after a processing happens on a given channel.
+ * Returns a negative value if an error occurs, any other
+ * value otherwise. */
+ static int
+ my_filter_chn_post_analyze(struct stream *s, struct filter *filter,
+ struct channel *chn, unsigned an_bit)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+ struct http_msg *msg;
+
+ switch (an_bit) {
+ case AN_REQ_WAIT_HTTP:
+ if (/* A test on received headers before any other treatment */) {
+ msg = ((chn->flags & CF_ISRESP) ? &s->txn->rsp : &s->txn->req);
+ s->txn->status = 400;
+ msg->msg_state = HTTP_MSG_ERROR;
+ http_reply_and_close(s, s->txn->status, http_error_message(s));
+ return -1; /* This is an error ! */
+ }
+ break;
+ /* ... */
+ }
+ return 1;
+ }
+
+
+Pre and post analyzer callbacks of a filter are not automatically called. They
+must be registered explicitly on analyzers, updating the value of
+'filter.pre_analyzers' and 'filter.post_analyzers' bit fields. All analyzer bits
+are listed in 'include/haproxy/channel-t.h'. Here is an example :
+
+ static int
+ my_filter_stream_start(struct stream *s, struct filter *filter)
+ {
+ /* ... */
+
+ /* Register the pre analyzer callback on all request and response
+ * analyzers */
+ filter->pre_analyzers |= (AN_REQ_ALL | AN_RES_ALL);
+
+ /* Register the post analyzer callback only on AN_REQ_WAIT_HTTP and
+ * AN_RES_WAIT_HTTP analyzers */
+ filter->post_analyzers |= (AN_REQ_WAIT_HTTP | AN_RES_WAIT_HTTP);
+
+ /* ... */
+ return 0;
+ }
+
+
+To surround activity of a filter during the channel analyzing, two new analyzers
+have been added :
+
+ * 'flt_start_analyze' (AN_REQ/RES_FLT_START_FE/AN_REQ/RES_FLT_START_BE) : For
+ a specific filter, this analyzer is called before any call to the
+ 'channel_pre_analyze' callback. From the filter point of view, it calls the
+ 'flt_ops.channel_start_analyze' callback.
+
+ * 'flt_end_analyze' (AN_REQ/RES_FLT_END) : For a specific filter, this
+ analyzer is called when all other analyzers have finished their
+ processing. From the filter point of view, it calls the
+ 'flt_ops.channel_end_analyze' callback.
+
+These analyzers are called only once per stream.
+
+'flt_ops.channel_start_analyze' and 'flt_ops.channel_end_analyze' callbacks can
+interrupt the stream processing, like 'flt_ops.channel_pre_analyze'. Here is an
+example :
+
+ /* Called when analyze starts for a given channel
+ * Returns a negative value if an error occurs, 0 if it needs to wait,
+ * any other value otherwise. */
+ static int
+ my_filter_chn_start_analyze(struct stream *s, struct filter *filter,
+ struct channel *chn)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+
+ /* ... TODO ... */
+
+ return 1;
+ }
+
+ /* Called when analyze ends for a given channel
+ * Returns a negative value if an error occurs, 0 if it needs to wait,
+ * any other value otherwise. */
+ static int
+ my_filter_chn_end_analyze(struct stream *s, struct filter *filter,
+ struct channel *chn)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+
+ /* ... TODO ... */
+
+ return 1;
+ }
+
+
+Workflow on channels can be summarized as follows :
+
+ FE: Called for filters defined on the stream's frontend
+ BE: Called for filters defined on the stream's backend
+
+ +------->---------+
+ | | |
+ +----------------------+ | +----------------------+
+ | flt_ops.attach (FE) | | | flt_ops.attach (BE) |
+ +----------------------+ | +----------------------+
+ | | |
+ V | V
+ +--------------------------+ | +------------------------------------+
+ | flt_ops.stream_start (FE)| | | flt_ops.stream_set_backend (FE+BE) |
+ +--------------------------+ | +------------------------------------+
+ | | |
+ ... | ...
+ | | |
+ | ^ |
+ | --+ | | --+
+ +------<----------+ | | +--------<--------+ |
+ | | | | | | |
+ V | | | V | |
++-------------------------------+ | | | +-------------------------------+ | |
+| flt_start_analyze (FE) +-+ | | | flt_start_analyze (BE) +-+ |
+|(flt_ops.channel_start_analyze)| | F | |(flt_ops.channel_start_analyze)| |
++---------------+---------------+ | R | +-------------------------------+ |
+ | | O | | |
+ +------<---------+ | N ^ +--------<-------+ | B
+ | | | T | | | | A
++---------------|------------+ | | E | +---------------|------------+ | | C
+|+--------------V-------------+ | | N | |+--------------V-------------+ | | K
+||+----------------------------+ | | D | ||+----------------------------+ | | E
+|||flt_ops.channel_pre_analyze | | | | |||flt_ops.channel_pre_analyze | | | N
+||| V | | | | ||| V | | | D
+||| analyzer (FE) +-+ | | ||| analyzer (FE+BE) +-+ |
++|| V | | | +|| V | |
+ +|flt_ops.channel_post_analyze| | | +|flt_ops.channel_post_analyze| |
+ +----------------------------+ | | +----------------------------+ |
+ | --+ | | |
+ +------------>------------+ ... |
+ | |
+ [ data filtering (see below) ] |
+ | |
+ ... |
+ | |
+ +--------<--------+ |
+ | | |
+ V | |
+ +-------------------------------+ | |
+ | flt_end_analyze (FE+BE) +-+ |
+ | (flt_ops.channel_end_analyze) | |
+ +---------------+---------------+ |
+ | --+
+ V
+ +----------------------+
+ | flt_ops.detach (BE) |
+ +----------------------+
+ |
+ V
+ +--------------------------+
+ | flt_ops.stream_stop (FE) |
+ +--------------------------+
+ |
+ V
+ +----------------------+
+ | flt_ops.detach (FE) |
+ +----------------------+
+ |
+ V
+
+By zooming on an analyzer box we have:
+
+ ...
+ |
+ V
+ |
+ +-----------<-----------+
+ | |
+ +-----------------+--------------------+ |
+ | | | |
+ | +--------<---------+ | |
+ | | | | |
+ | V | | |
+ | flt_ops.channel_pre_analyze ->-+ | ^
+ | | | |
+ | | | |
+ | V | |
+ | analyzer --------->-----+--+
+ | | |
+ | | |
+ | V |
+ | flt_ops.channel_post_analyze |
+ | | |
+ | | |
+ +-----------------+--------------------+
+ |
+ V
+ ...
+
+
+ 3.6. FILTERING THE DATA EXCHANGED
+-----------------------------------
+
+WARNING : To fully understand this part, it is important to be aware of how the
+ buffers work in HAProxy. For the HTTP part, it is also important to
+ understand how data are parsed and structured, and how the internal
+ representation, called HTX, works. See doc/internals/api/buffer-api.txt
+ and doc/internals/api/htx-api.txt for details.
+
+An extended feature of the filters is the data filtering. By default a filter
+does not look into data exchanged between the client and the server because it
+is expensive. Indeed, instead of forwarding data without any processing, each
+byte needs to be buffered.
+
+So, to enable the data filtering on a channel, at any time, in one of the
+previous callbacks, the 'register_data_filter' function must be called.
+Conversely, to disable it, the 'unregister_data_filter' function must be
+called. For instance :
+
+ static int
+ my_filter_http_headers(struct stream *s, struct filter *filter,
+ struct http_msg *msg)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+
+ /* 'msg->chn' must be the request channel */
+ if (!(msg->chn->flags & CF_ISRESP)) {
+ struct htx *htx;
+ struct ist hdr;
+ struct http_hdr_ctx ctx;
+
+ htx = htxbuf(&msg->chn->buf);
+
+ /* Enable the data filtering for the request if 'X-Filter' header
+ * is set to 'true'. */
+ hdr = ist("X-Filter");
+ ctx.blk = NULL;
+ if (http_find_header(htx, hdr, &ctx, 0) &&
+ ctx.value.len >= 4 && memcmp(ctx.value.ptr, "true", 4) == 0)
+ register_data_filter(s, msg->chn, filter);
+ }
+
+ return 1;
+ }
+
+Here, the data filtering is enabled if the HTTP header 'X-Filter' is found and
+set to 'true'.
+
+If several filters are declared, the evaluation order remains the same,
+regardless of the order of the registrations to the data filtering. Data
+registrations must be performed before the data forwarding step. However, a
+filter may be unregistered from the data filtering at any time.
+
+Depending on the stream type, TCP or HTTP, the way to handle data filtering is
+different. HTTP data are structured while TCP data are raw. And there are more
+callbacks for HTTP streams to fully handle all steps of an HTTP transaction. But
+the main part is the same. The data filtering is performed in one callback,
+called in a loop on input data starting at a specific offset for a given
+length. Data analyzed by a filter are considered as forwarded from its point of
+view. Because filters are chained, a filter never analyzes more data than its
+predecessors. Thus only data analyzed by the last filter are effectively
+forwarded. This means, at any time, any filter may choose to not analyze all
+available data (available from its point of view), blocking the data forwarding.
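+
+The chaining rule above can be pictured with a small standalone model. This is
+only a sketch under stated assumptions: 'struct demo_filter', its fields and
+'demo_run_chain()' are hypothetical names made up for the illustration, they
+are not part of the HAProxy filter API, and forwarding itself is not modeled.
+

```c
#include <stddef.h>

/* Hypothetical stand-in for a filter: only what the sketch needs. */
struct demo_filter {
    size_t max_parse; /* max bytes analyzed per call, 0 means unlimited */
    size_t analyzed;  /* bytes already analyzed by this filter */
};

/* Runs one pass over the chain for <avail> input bytes. Each filter only
 * sees the bytes already analyzed by its predecessor. The value returned
 * is the number of bytes analyzed by the last filter, i.e. the amount
 * that could effectively be forwarded. <avail> must not shrink between
 * two calls because forwarding is not modeled here. */
static size_t demo_run_chain(struct demo_filter *flts, size_t nb, size_t avail)
{
    size_t limit = avail; /* bytes visible to the current filter */
    size_t i;

    for (i = 0; i < nb; i++) {
        size_t len = limit - flts[i].analyzed;

        if (flts[i].max_parse && len > flts[i].max_parse)
            len = flts[i].max_parse;
        flts[i].analyzed += len;
        limit = flts[i].analyzed;
    }
    return limit;
}
```

+With two filters where the second one analyzes at most 100 bytes per call, 250
+available bytes need three passes before everything is seen by the last filter,
+even though the first filter has analyzed all of them during the first pass.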
+
+Internally, filters own 2 offsets representing the number of bytes already
+analyzed in the available input data, one per channel. There is also a pair of
+offsets at the stream level, in the strm_flt object, representing the total
+number of bytes already forwarded. These offsets may be retrieved and updated
+using the following macros :
+
+ * FLT_OFF(flt, chn)
+
+ * FLT_STRM_OFF(s, chn)
+
+where 'flt' is the 'struct filter' passed as argument in all callbacks, 's' the
+filtered stream and 'chn' is the considered channel. However, there is no reason
+for a filter to use these macros or take care of these offsets.
+
+
+3.6.1 FILTERING DATA ON TCP STREAMS
+-----------------------------------
+
+The data filtering for TCP streams is the easy case, because HAProxy does not
+parse these data. Data are stored raw in the buffer. So there is only one
+callback to consider:
+
+ * 'flt_ops.tcp_payload' : This callback is called when input data are
+ available. If not defined, all available data will be considered as analyzed
+ and forwarded from the filter point of view.
+
+This callback is called only if the filter is registered to analyze TCP
+data. Here is an example :
+
+ /* Returns a negative value if an error occurs, else the number of
+ * consumed bytes. */
+ static int
+ my_filter_tcp_payload(struct stream *s, struct filter *filter,
+ struct channel *chn, unsigned int offset,
+ unsigned int len)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+ int ret = len;
+
+ /* Do not parse more than 'my_conf->max_parse' bytes at a time */
+ if (my_conf->max_parse != 0 && ret > my_conf->max_parse)
+ ret = my_conf->max_parse;
+
+ /* if available data are not completely parsed, wake up the stream to
+ * be sure to not freeze it. The best is probably to set a
+ * chn->analyse_exp timer */
+ if (ret != len)
+ task_wakeup(s->task, TASK_WOKEN_MSG);
+ return ret;
+ }
+
+But it is important to note that tunnelled data of an HTTP stream may also be
+filtered via this callback. Tunnelled data are data exchanged after an HTTP
+tunnel is established between the client and the server, via an HTTP CONNECT or
+via a protocol upgrade. In this case, the data are structured. Of course, to do
+so, the filter must be able to parse HTX data and must have the FLT_CFG_FL_HTX
+flag set. At any time, the IS_HTX_STRM() macro may be used on the stream to know
+if it is an HTX stream or a TCP stream.
+
+
+3.6.2 FILTERING DATA ON HTTP STREAMS
+------------------------------------
+
+The HTTP data filtering is a bit more complex because HAProxy data are
+structured and represented in an internal format, called HTX. So basically
+there is the HTTP counterpart to the previous callback :
+
+ * 'flt_ops.http_payload' : This callback is called when input data are
+ available. If not defined, all available data will be considered as analyzed
+ and forwarded for the filter.
+
+But the prototype for this callback is slightly different. Instead of having
+the channel as parameter, we have the HTTP message (struct http_msg). This
+callback is called only if the filter is registered to analyze HTTP data. Here
+is an example :
+
+ /* Returns a negative value if an error occurs, else the number of
+ * consumed bytes. */
+ static int
+ my_filter_http_payload(struct stream *s, struct filter *filter,
+ struct http_msg *msg, unsigned int offset,
+ unsigned int len)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+ struct htx *htx = htxbuf(&msg->chn->buf);
+ struct htx_ret htxret = htx_find_offset(htx, offset);
+ struct htx_blk *blk;
+
+ blk = htxret.blk;
+ offset = htxret.ret;
+ for (; blk; blk = htx_get_next_blk(blk, htx)) {
+ enum htx_blk_type type = htx_get_blk_type(blk);
+
+ if (type == HTX_BLK_UNUSED)
+ continue;
+ else if (type == HTX_BLK_DATA) {
+ /* filter data */
+ }
+ else
+ break;
+ }
+
+ return len;
+ }
+
+In addition, there are two other callbacks :
+
+ * 'flt_ops.http_headers' : This callback is called just before the HTTP body
+ forwarding and after any processing on the request/response HTTP
+ headers. When defined, this callback is always called for HTTP streams
+ (i.e. without needing any registration on data filtering).
+ Here is an example :
+
+
+ /* Returns a negative value if an error occurs, 0 if it needs to wait,
+ * any other value otherwise. */
+ static int
+ my_filter_http_headers(struct stream *s, struct filter *filter,
+ struct http_msg *msg)
+ {
+ struct my_filter_config *my_conf = FLT_CONF(filter);
+ struct htx *htx = htxbuf(&msg->chn->buf);
+ struct htx_sl *sl = http_get_stline(htx);
+ int32_t pos;
+
+ for (pos = htx_get_first(htx); pos != -1; pos = htx_get_next(htx, pos)) {
+ struct htx_blk *blk = htx_get_blk(htx, pos);
+ enum htx_blk_type type = htx_get_blk_type(blk);
+ struct ist n, v;
+
+ if (type == HTX_BLK_EOH)
+ break;
+ if (type != HTX_BLK_HDR)
+ continue;
+
+ n = htx_get_blk_name(htx, blk);
+ v = htx_get_blk_value(htx, blk);
+ /* Do something on the header name/value */
+ }
+
+ return 1;
+ }
+
+ * 'flt_ops.http_end' : This callback is called when the whole HTTP message was
+ processed. It may interrupt the stream processing. So, it could be used to
+ synchronize the HTTP request with the HTTP response, for instance :
+
+ /* Returns a negative value if an error occurs, 0 if it needs to wait,
+ * any other value otherwise. */
+ static int
+ my_filter_http_end(struct stream *s, struct filter *filter,
+ struct http_msg *msg)
+ {
+ struct my_filter_ctx *my_ctx = filter->ctx;
+
+
+ if (!(msg->chn->flags & CF_ISRESP)) /* The request */
+ my_ctx->end_of_req = 1;
+ else /* The response */
+ my_ctx->end_of_rsp = 1;
+
+ /* Both the request and the response are finished */
+ if (my_ctx->end_of_req == 1 && my_ctx->end_of_rsp == 1)
+ return 1;
+
+ /* Wait */
+ return 0;
+ }
+
+Then, to finish, there are 2 informational callbacks :
+
+ * 'flt_ops.http_reset' : This callback is called when an HTTP message is
+ reset. This happens either when a 1xx informational response is received, or
+ if we're retrying to send the request to the server after it failed. It
+ could be useful to reset the filter context before receiving the true
+ response.
+ By checking s->txn->status, it is possible to know why this callback is
+ called. If it's a 1xx, we're called because of an informational
+ message. Otherwise, it is a L7 retry.
+
+ * 'flt_ops.http_reply' : This callback is called when, at any time, HAProxy
+ decides to stop the processing on an HTTP message and to send an internal
+ response to the client. This mainly happens when an error or a redirect
+ occurs.
+
+
+3.6.3 REWRITING DATA
+--------------------
+
+The last part, and the trickiest one about the data filtering, is about the data
+rewriting. For now, the filter API does not offer a lot of functions to handle
+it. There are only functions to notify HAProxy that the data size has changed to
+let it update the internal state of filters. It is the developer's
+responsibility to update the data itself. The offsets are then adjusted using
+the following function :
+
+ * 'flt_update_offsets()' : This function must be called when a filter alters
+ incoming data. It updates offsets of the stream and of all filters
+ preceding the calling one. Not calling this function when a filter changes
+ the size of incoming data leads to undefined behavior.
+
+A good example of filter changing the data size is the HTTP compression filter.
diff --git a/doc/internals/api/htx-api.txt b/doc/internals/api/htx-api.txt
new file mode 100644
index 0000000..971328b
--- /dev/null
+++ b/doc/internals/api/htx-api.txt
@@ -0,0 +1,570 @@
+ -----------------------------------------------
+ HTX API
+ Version 1.1
+ ( Last update: 2021-02-24 )
+ -----------------------------------------------
+ Author : Christopher Faulet
+ Contact : cfaulet at haproxy dot com
+
+1. Background
+
+Historically, HAProxy stored HTTP messages in a raw fashion in buffers, keeping
+parsing information separately in a "struct http_msg" owned by the stream. It was
+optimized for data transfer, but not so much for rewrites. It was also HTTP/1
+centered. While it was the only HTTP version supported, it was not a
+problem. But with the rise of HTTP/2, it became hard to keep using this
+representation.
+
+In the early days of HTTP/2 in HAProxy, H2 messages were converted into
+H1. This was terribly inefficient because it required two parsing passes, a
+first one in H2 and a second one in H1, with a conversion in the middle. And of
+course, the same was also true in the opposite direction. Outgoing H1 messages
+had to be converted back in H2 to be sent. Even worse, because of the H2->H1
+conversion, only client H2 connections were supported.
+
+So, to address all these problems, we decided to replace the old raw
+representation by a version-agnostic and self-structured internal HTTP
+representation, the HTX. As an additional benefit, with this new representation,
+the message parsing and its processing are now separated, making all the HTTP
+analysis simpler and cleaner. The parsing of HTTP messages is now handled by
+the multiplexers (h1 or h2).
+
+
+2. The HTX message
+
+The HTX is a structure containing useful information about an HTTP message
+followed by a contiguous array with some parts of the message. These parts are
+called blocks. A block is composed of metadata (htx_blk) and an associated
+payload. Blocks' metadata are stored starting from the end of the array while
+their payloads are stored at the beginning. Blocks' metadata are often simply
+called blocks. It is a misuse of language that simplifies explanations.
+
+Internally, this structure is "hidden" in a buffer. This way, there are few
+changes in intermediate layers (stream-interface and channels). They still
+manipulate buffers. Only the multiplexer and the stream have to know how data
+are really stored. From the HTX perspective, a buffer is just a memory
+area. When an HTX message is stored in a buffer, this one appears as full.
+
+ * General view of an HTX message :
+
+
+ buffer->area
+ |
+ |<------------ buffer->size == buffer->data ----------------------|
+ | |
+ | |<------------- Blocks array (htx->size) ------------------>|
+ V | |
+ +-----+-----------------+-------------------------+---------------+
+ | HTX | PAYLOADS ==> | | <== HTX_BLKs |
+ +-----+-----------------+-------------------------+---------------+
+ | | | |
+ |<-payloads part->|<----- free space ------>|<-blocks part->|
+ (htx->data)
+
+
+The blocks part remains linear and sorted. It may be seen as an array with
+negative indexes. But, instead of using negative indexes, we use positive
+positions to identify a block. This position is then converted to an address
+relative to the beginning of the blocks array.
+
+ tail head
+ | |
+ V V
+ .....--+----+-----------------------+------+------+
+ | Bn | ... | B1 | B0 |
+ .....--+----+-----------------------+------+------+
+ ^ ^ ^
+ Addr of the block Addr of the block Addr of the block
+ at the position N at the position 1 at the position 0
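+
+The conversion from a position to an address can be sketched as follows. This
+is a minimal illustration, assuming each block's metadata is a pair of 32-bit
+fields (8 bytes); 'struct demo_blk' and 'demo_pos_to_addr()' are hypothetical
+names used for the example only, not the HAProxy definitions.
+

```c
#include <stdint.h>

/* Hypothetical block metadata: two 32-bit fields, 8 bytes in total. */
struct demo_blk {
    uint32_t addr; /* payload offset in the blocks array */
    uint32_t info; /* block type and payload length */
};

/* Returns the offset, relative to the beginning of the blocks array, of the
 * metadata of the block at position <pos>. Position 0 is the last slot of
 * the array, position 1 the slot just before it, and so on. */
static uint32_t demo_pos_to_addr(uint32_t array_size, uint32_t pos)
{
    return array_size - (pos + 1) * (uint32_t)sizeof(struct demo_blk);
}
```

+With a 1024-byte blocks array, the block at position 0 sits at offset 1016 and
+the block at position 1 at offset 1008, which matches the "array with negative
+indexes" view described above.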
+
+
+In the HTX structure, 3 "special" positions are stored :
+
+ - tail : Position of the newest inserted block
+ - head : Position of the oldest inserted block
+ - first : Position of the first block to (re)start the analysis
+
+The blocks part never wraps. If we have no space to allocate a new block and if
+there is a hole at the beginning of the blocks part (so at the end of the blocks
+array), we move back all blocks.
+
+
+ tail head tail head
+ | | | |
+ V V V V
+ ...+--------------+---------+ blocks ...----------+--------------+
+ | X== HTX_BLKS | | defrag | <== HTX_BLKS |
+ ...+--------------+---------+ =====> ...----------+--------------+
+
+
+The payloads part is a raw space that may wrap. A block's payload must never be
+accessed directly. Instead a block must be selected to retrieve the address of
+its payload.
+
+
+ +------------------------( B0.addr )--------------------------+
+ | +-------------------( B1.addr )----------------------+ |
+ | | +-----------( B2.addr )----------------+ | |
+ V V V | | |
+ +-----+----+-------+----+--------+-------------+-------+----+----+----+
+ | HTX | P0 | P1 | P2 | ...==> | | <=... | B2 | B1 | B0 |
+ +-----+----+-------+----+--------+-------------+-------+----+----+----+
+
+
+Because the payloads part may wrap, there are 2 usable free spaces :
+
+ - The free space in front of the blocks part. This one is used if and only if
+ the other one was not used yet.
+
+ - The free space at the beginning of the message. Once this one is used, the
+ other one is never used again, until a message defragmentation.
+
+
+ * Linear payloads part :
+
+
+ head_addr end_addr tail_addr
+ | | |
+ V V V
+ +-----+--------------------+-------------+--------------------+-------...
+ | HTX | | PAYLOADS | | HTX_BLKs
+ +-----+--------------------+-------------+--------------------+-------...
+ |<-- free space 2 -->| |<-- free space 1 -->|
+ (used if the other is too small) (used in priority)
+
+
+ * Wrapping payloads part :
+
+
+ head_addr end_addr tail_addr
+ | | |
+ V V V
+ +-----+----+----------------+--------+----------------+-------+-------...
+ | HTX | | PAYLOADS part2 | | PAYLOADS part1 | | HTX_BLKs
+ +-----+----+----------------+--------+----------------+-------+-------...
+ |<-->| |<------>| |<----->|
+ unusable free space unusable
+ free space free space
+
+
+Finally, when the usable free space is not enough to store a new block, unusable
+parts may be recovered with a full defragmentation. The payloads part is then
+realigned at the beginning of the blocks array and the free space becomes
+contiguous again.
+
+
+3. The HTX blocks
+
+An HTX block can be a start-line, a header, a body part or a trailer. For all
+these types of block, a payload is attached to the block. It can also be a
+marker, the end-of-headers or end-of-trailers. For these blocks, there is no
+payload but it counts for a byte. It is important to not skip it when data are
+forwarded.
+
+As already said, a block is composed of metadata and a payload. Metadata are
+stored in the blocks part and are composed of 2 fields :
+
+ - info : It is a 32-bit field containing the block's type on 4 bits followed
+ by the payload length. See below for details.
+
+ - addr : The payload's address, if any, relative to the beginning of the
+ array used to store part of the HTTP message itself.
+
+
+ * Block's info representation :
+
+ 0b 0000 0000 0000 0000 0000 0000 0000 0000
+ ---- ------------------------ ---------
+ type value (1 MB max) name length (header/trailer - 256B max)
+ ----------------------------------
+ data length (256 MB max)
+ (body, method, path, version, status, reason)
+
+
+Supported types are :
+
+ - 0000 (0) : The request start-line
+ - 0001 (1) : The response start-line
+ - 0010 (2) : A header block
+ - 0011 (3) : The end-of-headers marker
+ - 0100 (4) : A data block
+ - 0101 (5) : A trailer block
+ - 0110 (6) : The end-of-trailers marker
+ - 1111 (15) : An unused block
+
+Other types are unused for now and reserved for future extensions.
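+
+The layout of the info field can be illustrated with a few helpers. This is a
+sketch derived from the representation above (type on the 4 upper bits, then
+either a 28-bit length, or a 20-bit value length followed by an 8-bit name
+length). The 'demo_*' names are hypothetical and are not the HAProxy ones.
+

```c
#include <stdint.h>

/* Block types, using the values listed in this section. */
enum demo_blk_type {
    DEMO_BLK_REQ_SL = 0,
    DEMO_BLK_RES_SL = 1,
    DEMO_BLK_HDR    = 2,
    DEMO_BLK_EOH    = 3,
    DEMO_BLK_DATA   = 4,
    DEMO_BLK_TLR    = 5,
    DEMO_BLK_EOT    = 6,
    DEMO_BLK_UNUSED = 15
};

/* Builds the info field of a header or trailer block: 8 bits for the name
 * length (lowest bits) and 20 bits for the value length. */
static uint32_t demo_info_kv(enum demo_blk_type type, uint32_t name_len,
                             uint32_t value_len)
{
    return ((uint32_t)type << 28) | ((value_len & 0xfffffu) << 8) |
           (name_len & 0xffu);
}

/* Builds the info field of any other block: 28 bits for the length. */
static uint32_t demo_info_data(enum demo_blk_type type, uint32_t len)
{
    return ((uint32_t)type << 28) | (len & 0x0fffffffu);
}

/* Extracts the block type from the 4 upper bits. */
static enum demo_blk_type demo_info_type(uint32_t info)
{
    return (enum demo_blk_type)(info >> 28);
}

/* Returns the payload size encoded in the info field. */
static uint32_t demo_info_size(uint32_t info)
{
    switch (demo_info_type(info)) {
    case DEMO_BLK_HDR:
    case DEMO_BLK_TLR:
        return (info & 0xffu) + ((info >> 8) & 0xfffffu);
    case DEMO_BLK_UNUSED:
        return 0;
    default:
        return info & 0x0fffffffu;
    }
}
```

+For instance, a header block whose name is 4 bytes long and whose value is 9
+bytes long is encoded as 0x20000904 and has a 13-byte payload.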
+
+An HTX message is typically composed of the following blocks, in this order :
+
+ - a start-line
+ - zero or more header blocks
+ - an end-of-headers marker
+ - zero or more data blocks
+ - zero or more trailer blocks (optional)
+ - an end-of-trailers marker (optional but always set if there is at least
+ one trailer block)
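+
+The ordering above can be checked with a small state machine. This is only a
+sketch for a single complete message without 1xx informational responses. The
+'ORD_*' constants reuse the type values listed earlier, and
+'demo_check_order()' is a hypothetical helper, not an HAProxy function.
+

```c
#include <stddef.h>

enum demo_ord_type {
    ORD_REQ_SL = 0, ORD_RES_SL = 1, ORD_HDR = 2, ORD_EOH = 3,
    ORD_DATA = 4, ORD_TLR = 5, ORD_EOT = 6, ORD_UNUSED = 15
};

/* Returns 1 if <blks> describes a complete message following the expected
 * block order, 0 otherwise. Unused blocks are ignored. */
static int demo_check_order(const enum demo_ord_type *blks, size_t nb)
{
    enum { ST_START, ST_HDRS, ST_BODY, ST_TLRS, ST_END } st = ST_START;
    size_t i;

    for (i = 0; i < nb; i++) {
        enum demo_ord_type t = blks[i];

        if (t == ORD_UNUSED)
            continue;

        switch (st) {
        case ST_START: /* a start-line is expected */
            if (t != ORD_REQ_SL && t != ORD_RES_SL)
                return 0;
            st = ST_HDRS;
            break;
        case ST_HDRS: /* headers, closed by the EOH marker */
            if (t == ORD_EOH)
                st = ST_BODY;
            else if (t != ORD_HDR)
                return 0;
            break;
        case ST_BODY: /* data, then optional trailers or EOT */
            if (t == ORD_TLR)
                st = ST_TLRS;
            else if (t == ORD_EOT)
                st = ST_END;
            else if (t != ORD_DATA)
                return 0;
            break;
        case ST_TLRS: /* trailers must be closed by the EOT marker */
            if (t == ORD_EOT)
                st = ST_END;
            else if (t != ORD_TLR)
                return 0;
            break;
        case ST_END: /* nothing may follow the EOT marker */
            return 0;
        }
    }
    return (st == ST_BODY || st == ST_END);
}
```

+A message ending right after its data blocks is accepted, while a message with
+trailers but no end-of-trailers marker is rejected.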
+
+Only one HTTP request at a time can be stored in an HTX message. For HTTP
+responses, it is more complicated. Only one "final" response can be stored in an
+HTX message. It is a response with status code 101, or greater than or equal to
+200. But it may be preceded by several 1xx informational responses. Such
+responses are part of the same HTX message.
+
+When the end of the message is reached a special flag is set on the message
+(HTX_FL_EOM). It means no more data are expected for this message, except
+tunneled data. But tunneled data will never be mixed with message data to avoid
+ambiguities. Thus once the flag marking the end of the message is set, it is
+easy to know the message ends. The end is reached if the HTX message is empty or
+on the tail HTX block in the HTX message. Once all blocks of the HTX message are
+consumed, tunneled data, if any, may be transferred.
+
+
+3.1. The start-line
+
+Every HTX message starts with a start-line. Its payload is a "struct htx_sl". In
+addition to the parts of the HTTP start-line, this structure contains some
+information about the represented HTTP message, mainly in the form of flags
+(HTX_SL_F_*). For instance, if an HTTP message contains the header
+"content-length", then the flag HTX_SL_F_CLEN is set.
+
+Each HTTP message has its own start-line. So an HTX request has one and only one
+start-line because it must contain only one HTTP request at a time. But an HTX
+response may have more than one start-line if the final HTTP response is
+preceded by some 1xx informational responses.
+
+In HTTP/2, there is no start-line. So the H2 multiplexer must create one when it
+converts an H2 message to HTX :
+
+ - For the request, it uses the pseudo headers ":method", ":path" or
+ ":authority" depending on the method and the hardcoded version "HTTP/2.0".
+
+ - For the response, it uses the hardcoded version "HTTP/2.0", the
+ pseudo-header ":status" and an empty reason.
+
+
+3.2. The headers and trailers
+
+HTX Headers and trailers are quite similar. Different types are used to simplify
+headers processing. But from the HTX point of view, there is no real difference,
+except their position in the HTX message. The header blocks always follow an HTX
+start-line while trailer blocks come after the data. If there is no data, they
+follow the end-of-headers marker.
+
+Headers and trailers are the only blocks containing a Key/Value payload. The
+corresponding end-of marker must always be placed after each group to mark, as
+its name suggests, the end.
+
+In HTTP/1, trailers are only present on chunked messages. But chunked messages
+do not always have trailers. In this case, the end-of-trailers block may or may
+not be present. Multiplexers must be able to handle both situations. In HTTP/2,
+trailers are only present if a HEADERS frame is sent after DATA frames.
+
+
+3.3. The data
+
+The payload body of an HTTP message is stored as DATA blocks in the HTX
+message. For HTTP/1 messages, it is the message body without the chunks
+formatting, if any. For HTTP/2, it is the payload of DATA frames.
+
+The DATA blocks are the only HTX blocks that may be partially processed (copied
+or removed). All other types of block must be entirely processed. This means
+DATA blocks can be resized.
+
+
+3.4. The end-of markers
+
+These blocks are used to delimit parts of an HTX message. There are two
+markers :
+
+ - end-of-headers (EOH)
+ - end-of-trailers (EOT)
+
+EOH is always present in an HTX message. EOT is optional.
+
+
+4. The HTX API
+
+
+4.1. Get/set HTX message from/to the underlying buffer
+
+The first thing to do to process an HTX message is to get it from the underlying
+buffer. There are 2 functions to do so, the second one relying on the first :
+
+ - htxbuf() returns an HTX message from a buffer. It does not modify the
+ buffer. It only initializes the HTX message if the buffer is empty.
+
+ - htx_from_buf() uses htxbuf(). But it also updates the underlying buffer so
+ that it appears as full.
+
+Both functions return a "zero-sized" HTX message if the buffer is null. This
+way, the HTX message is always valid. The first function is the default function
+to use. The second one is only useful when some content will be added. For
+instance, it is used by the HTX analyzers when HAProxy generates a response.
+Thus, the buffer is in the right state.
+
+Once the processing is done, if the HTX message has been modified, the
+underlying buffer must also be updated, except if htx_from_buf() was used _AND_
+data was only added. For all other cases, the function htx_to_buf() must be
+called.
+
+Finally, the function htx_reset() may be called at any time to reset an HTX
+message. And the function buf_room_for_htx_data() may be called to know if a raw
+buffer is full from the HTX perspective. It is used during conversion from/to
+the HTX.
+
+
+4.2. Helpers to deal with free space in an HTX message
+
+Once an HTX message is available, the following functions may help to process
+it :
+
+ - htx_used_space() and htx_meta_space() return, respectively, the total
+ space used in an HTX message and the space used by block's metadata only.
+
+ - htx_free_space() and htx_free_data_space() return, respectively, the total
+ free space in an HTX message and the free space available for the payload
+ if a new HTX block is stored (so it is the total free space minus the size
+ of an HTX block).
+
+ - htx_is_empty() and htx_is_not_empty() are boolean functions to know if an
+ HTX message is empty or not.
+
+ - htx_get_max_blksz() returns the maximum size available for the payload,
+ not exceeding a maximum, metadata included.
+
+ - htx_almost_full() should be used to know if an HTX message uses at least
+ 3/4 of its capacity.
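+
+The relations between these helpers can be sketched on a simplified model. It
+assumes a message is described only by its capacity and its used space, and
+that a block's metadata takes 8 bytes. 'struct demo_htx' and the 'demo_*'
+functions are hypothetical and only mirror the descriptions above, they are
+not the HAProxy implementations.
+

```c
#include <stdint.h>

#define DEMO_BLK_META_SIZE 8u /* assumed size of one block's metadata */

/* Hypothetical, simplified view of an HTX message. */
struct demo_htx {
    uint32_t size; /* capacity of the blocks array */
    uint32_t used; /* space used by payloads and metadata */
};

/* Total free space in the message. */
static uint32_t demo_free_space(const struct demo_htx *htx)
{
    return htx->size - htx->used;
}

/* Free space left for a payload once the metadata of a new block is
 * accounted for, or 0 if even the metadata does not fit. */
static uint32_t demo_free_data_space(const struct demo_htx *htx)
{
    uint32_t free = demo_free_space(htx);

    return (free > DEMO_BLK_META_SIZE) ? free - DEMO_BLK_META_SIZE : 0;
}

/* Returns 1 if at least 3/4 of the capacity is used, 0 otherwise. A
 * zero-sized message is reported as almost full. */
static int demo_almost_full(const struct demo_htx *htx)
{
    return (uint64_t)htx->used * 4 >= (uint64_t)htx->size * 3;
}
```

+For a 1024-byte message with 768 bytes used, 256 bytes are free but only 248
+are available for the payload of a new block, and the message is considered
+almost full.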
+
+
+4.3. HTX Blocks manipulations
+
+Once the available space in an HTX message is known, the next step is to add HTX
+blocks. First of all the function htx_nbblks() returns the number of blocks
+allocated in an HTX message. Then, there is an add function per block's type :
+
+ - htx_add_stline() adds a start-line. The type (request or response) and the
+ flags of the start-line must be provided, as well as its three parts
+ (method,uri,version or version,status-code,reason).
+
+ - htx_add_header() and htx_add_trailer() are similar. The name and the
+ value must be provided. The inserted HTX block is returned on success or
+ NULL if an error occurred.
+
+ - htx_add_endof() must be used to add any end-of marker. The block's type
+ (EOH or EOT) must be specified. The inserted HTX block is returned on
+ success or NULL if an error occurred.
+
+ - htx_add_all_headers() and htx_add_all_trailers() add, respectively, a list
+ of headers and a list of trailers, followed by the appropriate end-of
+ marker. On success, this marker is returned. Otherwise, NULL is
+ returned. Note there is no rollback on the HTX message when an error
+ occurred. Some headers or trailers may have been added. So it is the
+ caller's responsibility to take care of that.
+
+ - htx_add_data() must be used to add a DATA block. Unlike previous
+ functions, this one returns the number of bytes copied or 0 if nothing was
+ copied. If possible, the data are appended to the tail block if it is a
+ DATA block. Only a part of the payload may be copied because this function
+ will try to limit the message defragmentation and the wrapping of blocks
+ as far as possible.
+
+ - htx_add_data_atonce() must be used if all data must be added or nothing.
+ It tries to insert the whole payload and returns the inserted block on
+ success. Otherwise it returns NULL.
+
+When an HTX block is added, it is always the last one (the tail). But, if a
+block must be added at a specific place, it is not really handy. 2 functions may
+help (others could be added) :
+
+ - htx_add_last_data() adds a DATA block just after all other DATA blocks and
+ before any trailers and EOT marker. It relies on htx_add_data_atonce(), so
+ a defragmentation may be performed.
+
+ - htx_move_blk_before() moves a specific block just before another one. Both
+ blocks must already be in the HTX message and the block to move must
+ always be placed after the "pivot".
+
+Once added, there are several functions to update the block's payload :
+
+ - htx_replace_stline() updates a start-line. The HTX block must be passed as
+ argument. Only string parts of the start-line are updated by this
+ function. On success, it returns the new start-line. So it is pretty easy
+ to update its flags. NULL is returned if an error occurred.
+
+ - htx_replace_header() fully replaces a header (its name and its value) by a
+ new one. The HTX block must be passed as an argument, as well as its new
+ name and its new value. The new header can be smaller or larger than the
+ old one. This function returns the new HTX block on success, or NULL if an
+ error occurred.
+
+ - htx_replace_blk_value() replaces a part of a block's payload or its
+ totality. It works for HEADERS, TRAILERS or DATA blocks. The HTX block
+ must be provided with the part to remove and the new one. The new part can
+ be smaller or larger than the old one. This function returns the new HTX
+ block on success, or NULL if an error occurred.
+
+ - htx_change_blk_value_len() changes the size of the value. It is the caller's
+ responsibility to change the value itself, make sure there is enough space
+ and update allocated value. This function updates the HTX message
+ accordingly.
+
+ - htx_set_blk_value_len() changes the size of the value. It is the caller's
+ responsibility to change the value itself, make sure there is enough space
+ and update allocated value. Unlike the function
+ htx_change_blk_value_len(), this one does not update the HTX message. So
+ it should be used with caution.
+
+ - htx_cut_data_blk() removes <n> bytes from the beginning of a DATA
+ block. The block's start address and its length are adjusted, and the
+ htx's total data count is updated. This is used to mark that part of some
+ data were transferred from a DATA block without removing this DATA
+ block. No sanity check is performed, the caller is responsible for doing
+ this exclusively on DATA blocks, and never removing more than the block's
+ size.
+
+ - htx_remove_blk() removes a block from an HTX message. It returns the
+ following block or NULL if it is the tail block.
+
+Finally, a block may be removed using the function htx_remove_blk(). This
+function returns the block following the one removed or NULL if it is the tail
+block.
+
+
+4.4. The HTX start-line
+
+Unlike other HTX blocks, the start-line is a bit special because its payload is
+a structure followed by its three parts :
+
+ +--------+-------+-------+-------+
+ | HTX_SL | PART1 | PART2 | PART3 |
+ +--------+-------+-------+-------+
+
+Some macros and functions may help to manipulate these parts :
+
+ - HTX_SL_P{N}_LEN() and HTX_SL_P{N}_PTR() are macros to get the length of a
+ part and a pointer on it. {N} should be 1, 2 or 3.
+
+ - HTX_SL_REQ_MLEN(), HTX_SL_REQ_ULEN(), HTX_SL_REQ_VLEN(),
+ HTX_SL_REQ_MPTR(), HTX_SL_REQ_UPTR() and HTX_SL_REQ_VPTR() are macros to
+ get info about a request start-line. These macros only wrap HTX_SL_P*
+ ones.
+
+ - HTX_SL_RES_VLEN(), HTX_SL_RES_CLEN(), HTX_SL_RES_RLEN(),
+ HTX_SL_RES_VPTR(), HTX_SL_RES_CPTR() and HTX_SL_RES_RPTR() are macros to
+ get info about a response start-line. These macros only wrap HTX_SL_P*
+ ones.
+
+ - htx_sl_p1(), htx_sl_p2() and htx_sl_p3() are functions to get the ist
+ corresponding to the right part of a start-line.
+
+ - htx_sl_req_meth(), htx_sl_req_uri() and htx_sl_req_vsn() get the ist
+ corresponding to the right part of a request start-line.
+
+ - htx_sl_res_vsn(), htx_sl_res_code() and htx_sl_res_reason() get the ist
+ corresponding to the right part of a response start-line.
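+
+The layout above can be illustrated by the following sketch. It is only a toy
+model written for this explanation: toy_sl, toy_sl_new() and toy_sl_ptr() are
+hypothetical names, and the real HTX_SL structure holds more information than
+three lengths. It merely shows how a part is located by skipping the lengths
+of the previous ones :
+
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: a header immediately followed by its three parts.
 * toy_sl, toy_sl_new() and toy_sl_ptr() are hypothetical names.
 */
struct toy_sl {
	unsigned int len[3];   /* lengths of PART1, PART2 and PART3 */
	char data[];           /* the three parts, stored back to back */
};

/* Returns a pointer to part <n> (1, 2 or 3). Parts are not NUL-terminated. */
static char *toy_sl_ptr(struct toy_sl *sl, int n)
{
	size_t off = 0;
	int i;

	assert(n >= 1 && n <= 3);
	for (i = 0; i < n - 1; i++)
		off += sl->len[i];
	return sl->data + off;
}

/* Allocates a start-line from three NUL-terminated strings. Returns NULL on
 * allocation failure. The caller releases the result with free().
 */
static struct toy_sl *toy_sl_new(const char *p1, const char *p2, const char *p3)
{
	const char *parts[3] = { p1, p2, p3 };
	struct toy_sl *sl;
	size_t total = 0;
	int i;

	for (i = 0; i < 3; i++)
		total += strlen(parts[i]);

	sl = malloc(sizeof(*sl) + total);
	if (!sl)
		return NULL;

	/* each part is copied right after the previous one */
	for (i = 0; i < 3; i++) {
		sl->len[i] = strlen(parts[i]);
		memcpy(toy_sl_ptr(sl, i + 1), parts[i], sl->len[i]);
	}
	return sl;
}
```
+
+With the real API, HTX_SL_P{N}_PTR() and HTX_SL_P{N}_LEN() return the pointer
+and the length of a part, so callers do not compute such offsets themselves.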
+
+
+4.5. Iterate on the HTX message
+
+To iterate on an HTX message, the first thing to do is to get the HTX block to
+start the loop. There are three special blocks in an HTX message that may be
+good candidates to start a loop :
+
+ - the head block. It is the oldest inserted block. Multiplexers always start
+ to consume an HTX message from this block. The function htx_get_head()
+ returns its position and htx_get_head_blk() returns the block itself. In
+ addition, the function htx_get_head_type() returns its block's type.
+
+ - the tail block. It is the newest inserted block. The function
+ htx_get_tail() returns its position and htx_get_tail_blk() returns the
+ block itself. In addition, the function htx_get_tail_type() returns its
+ block's type.
+
+ - the first block. It is the block where to (re)start the analysis. It is
+ used as start point by HTX analyzers. The function htx_get_first() returns
+ its position and htx_get_first_blk() returns the block itself. In
+ addition, the function htx_get_first_type() returns its block's type.
+
+For all these functions, if the HTX message is empty, -1 is returned for the
+block's position, NULL instead of a block and HTX_BLK_UNUSED for its type.
+
+Then to iterate on blocks, forward or backward :
+
+ - htx_get_prev() and htx_get_next() return, respectively, the position of
+ the previous block or the next block, given a specific position. Or -1 if
+ an edge is reached.
+
+ - htx_get_prev_blk() and htx_get_next_blk() return, respectively, the
+ previous block or the next one, given a specific block. Or NULL if an edge
+ is reached.
+
+4.6. Access block content and info
+
+Following functions may be used to retrieve information about a specific HTX
+block :
+
+ - htx_get_blk_pos() returns the position of a block. It must be in the HTX
+ message.
+
+ - htx_get_blk_ptr() returns a pointer on the payload of a block.
+
+ - htx_get_blk_type() returns the type of a block.
+
+ - htx_get_blksz() returns the payload size of a block.
+
+ - htx_get_blk_name() returns the name of a block, only if it is a header or
+ a trailer. Otherwise, it returns an empty string.
+
+ - htx_get_blk_value() returns the value of a block, depending on its
+ type. For header and trailer blocks, it is the value field. For markers
+ (EOH or EOT), an empty string is returned. For other blocks an ist
+ pointing on the block payload is returned.
+
+ - htx_is_unique_blk() may be used to know if a block is the only one
+ remaining inside an HTX message, excluding unused blocks. This function is
+ pretty useful to determine the end of an HTX message, in conjunction with
+ HTX_FL_EOM flag.
+
+4.7. Advanced functions
+
+Some more advanced functions may be used to do complex processing on the HTX
+message. These functions are used by HTX analyzers or by multiplexers.
+
+ - htx_truncate() removes all blocks after the one containing a specific
+ offset relatively to the head block of the HTX message. If the offset is
+ inside a DATA block, it is truncated. For all other blocks, the removal
+ starts at the next block.
+
+ - htx_drain() tries to remove a specific amount of bytes of payload. If the
+ tail block is a DATA block, it may be truncated if necessary. All other
+ blocks are removed at once or kept. This function returns a mixed value,
+ with the first block not removed, or NULL if everything was removed, and
+ the amount of data drained.
+
+ - htx_xfer_blks() transfers HTX blocks from an HTX message to another,
+ stopping on the first block of a specified type or when a specific amount
+ of bytes, including meta-data, was moved. If the tail block is a DATA
+ block, it may be partially moved. All other blocks are transferred at once
+ or kept. This function returns a mixed value, with the last block moved,
+ or NULL if nothing was moved, and the amount of data transferred. When
+ HEADERS or TRAILERS blocks must be transferred, this function transfers
+ all of them. Otherwise, if it is not possible, it triggers an error. It is
+ the caller's responsibility to transfer all headers or trailers at once.
+
+ - htx_append_msg() appends an HTX message to another one. Either the whole
+ message is copied or nothing is. So, if an error occurred, a rollback is
+ performed. This function returns 1 on success and 0 on error.
+
+ - htx_reserve_max_data() reserves the maximum possible size for an HTX data
+ block, by extending an existing one or by creating a new one. It returns a
+ compound result with the HTX block and the position where new data must be
+ inserted (0 for a new block). If an error occurs or if there is no space
+ left, NULL is returned instead of a pointer on an HTX block.
+
+ - htx_find_offset() looks for the HTX block containing a specific offset,
+ starting at the HTX message's head. The function returns the found HTX
+ block and the position inside this block where the offset is. If the
+ offset is outside of the HTX message, NULL is returned.
+
+ - htx_defrag() defragments an HTX message. It removes unused blocks and
+ unwraps the payloads part. A temporary buffer is used to do so. This
+ function never fails. A referenced block may be provided. If so, the
+ corresponding new block is returned. Otherwise, NULL is returned.
diff --git a/doc/internals/api/initcalls.txt b/doc/internals/api/initcalls.txt
new file mode 100644
index 0000000..30d8737
--- /dev/null
+++ b/doc/internals/api/initcalls.txt
@@ -0,0 +1,360 @@
+Initialization stages aka how to get your code initialized at the right moment
+
+
+1. Background
+
+Originally all subsystems were initialized via a dedicated function call
+from the huge main() function. Then some code started to become conditional
+or a bit more modular and the #ifdef placed there became a mess, resulting
+in init code being moved to function constructors in each subsystem's own
+file. Then pools of various things were introduced, starting to make the
+whole init sequence more complicated due to some forms of internal
+dependencies. Later epoll was introduced, requiring a post-fork callback,
+and finally threads arrived also requiring some post-thread init/deinit
+and allocation, marking the old architecture's last breath. Finally the
+whole thing resulted in lots of init code duplication and was simplified
+in 1.9 with the introduction of initcalls and initialization stages.
+
+
+2. New architecture
+
+The new architecture relies on two layers :
+ - the registration functions
+ - the INITCALL macros and initialization stages
+
+The first ones are mostly used to add a callback to a list. The second ones
+are used to specify when to call a function. Both are totally independent,
+however they are generally combined via another set consisting in the REGISTER
+macros which make some registration functions be called at some specific points
+during the init sequence.
+
+
+3. Registration functions
+
+Registration functions never fail. Or more precisely, if they fail it will only
+be on out-of-memory condition, and they will cause the process to immediately
+exit. As such they do not return any status and the caller doesn't have to care
+about their success.
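+
+As an illustration only, the sketch below shows the general shape of such a
+registration function: it appends a callback to a list, returns nothing, and
+treats memory exhaustion as fatal. All toy_* names are hypothetical and the
+code is not taken from HAProxy. Unlike the real startup code, which exits on
+the first failing callback, toy_run_checks() reports the failure to its
+caller so that the behavior can be observed :
+
```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy model: every name below is hypothetical, none is HAProxy's own */
struct toy_check_fct {
	int (*fct)(void);              /* returns non-zero on success */
	struct toy_check_fct *next;
};

static struct toy_check_fct *toy_check_list;
static struct toy_check_fct **toy_check_tail = &toy_check_list;
static int toy_calls;                  /* number of callbacks executed */

/* Appends <fct> to the list. Nothing is returned: running out of memory is
 * the only possible failure and it terminates the process.
 */
static void toy_register_check(int (*fct)(void))
{
	struct toy_check_fct *e;

	assert(fct != NULL);
	e = calloc(1, sizeof(*e));
	if (!e) {
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	e->fct = fct;
	*toy_check_tail = e;
	toy_check_tail = &e->next;
}

/* Calls the registered functions in registration order. Returns 1 if all of
 * them succeeded, or 0 as soon as one of them returns zero.
 */
static int toy_run_checks(void)
{
	struct toy_check_fct *e;

	for (e = toy_check_list; e; e = e->next) {
		if (!e->fct())
			return 0;
	}
	return 1;
}

/* Two sample callbacks following the "non-zero on success" convention */
static int toy_cb_ok(void)   { toy_calls++; return 1; }
static int toy_cb_fail(void) { toy_calls++; return 0; }
```
+
+Since the list is only appended to, callbacks run in the order in which they
+were registered, and those placed after a failing one are never reached.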
+
+All available functions are described below in alphanumeric ordering. Please
+make sure to respect this ordering when adding new ones.
+
+- void hap_register_build_opts(const char *str, int must_free)
+
+ This appends the zero-terminated constant string <str> to the list of known
+ build options that will be reported on the output of "haproxy -vv". A line
+ feed character ('\n') will automatically be appended after the string when it
+ is displayed. The <must_free> argument must be zero, unless the string was
+ allocated by any malloc-compatible function such as malloc()/calloc()/
+ realloc()/strdup() or memprintf(), in which case it's better to pass a
+ non-null value so that the string is freed upon exit. Note that despite the
+ function's prototype taking a "const char *", the pointer will actually be
+ cast and freed. The const char* is here to leave more freedom to use consts
+ when making such options lists.
+
+- void hap_register_per_thread_alloc(int (*fct)())
+
+ This adds a call to function <fct> to the list of functions to be called when
+ threads are started, at the beginning of the polling loop. This is also valid
+ for the main thread and will be called even if threads are disabled, so that
+ it is guaranteed that this function will be called in any circumstance. Each
+ thread will first call all these functions exactly once when it starts. Calls
+ are serialized by the init_mutex, so that locking is not necessary in these
+ functions. There is no relation between the thread numbers and the callback
+ ordering. The function is expected to return non-zero on success, or zero on
+ failure. A failure will make the process emit a succinct error message and
+ immediately exit. See also hap_register_per_thread_free() for functions
+ called after these ones.
+
+- void hap_register_per_thread_deinit(void (*fct)())
+
+ This adds a call to function <fct> to the list of functions to be called when
+ threads are gracefully stopped, at the end of the polling loop. This is also
+ valid for the main thread and will be called even if threads are disabled, so
+ that it is guaranteed that this function will be called in any circumstance
+ if the process experiences a soft stop. Each thread will call this function
+ exactly once when it stops. However contrary to _alloc() and _init(), the
+ calls are made without any protection, thus if any shared resource is touched
+ by the function, the function is responsible for protecting it. The reason
+ behind this is that such resources are very likely to be still in use in one
+ other thread and that most of the time the functions will in fact only touch
+ a refcount or deinitialize their private resources. See also
+ hap_register_per_thread_free() for functions called after these ones.
+
+- void hap_register_per_thread_free(void (*fct)())
+
+ This adds a call to function <fct> to the list of functions to be called when
+ threads are gracefully stopped, at the end of the polling loop, after all calls
+ to _deinit() callbacks are done for this thread. This is also valid for the
+ main thread and will be called even if threads are disabled, so that it is
+ guaranteed that this function will be called in any circumstance if the
+ process experiences a soft stop. Each thread will call this function exactly
+ once when it stops. However contrary to _alloc() and _init(), the calls are
+ made without any protection, thus if any shared resource is touched by the
+ function, the function is responsible for protecting it. The reason behind
+ this is that such resources are very likely to be still in use in one other
+ thread and that most of the time the functions will in fact only touch a
+ refcount or deinitialize their private resources. See also
+ hap_register_per_thread_deinit() for functions called before these ones.
+
+- void hap_register_per_thread_init(int (*fct)())
+
+ This adds a call to function <fct> to the list of functions to be called when
+ threads are started, at the beginning of the polling loop, right after the
+ list of _alloc() functions. This is also valid for the main thread and will
+ be called even if threads are disabled, so that it is guaranteed that this
+ function will be called in any circumstance. Each thread will call this
+ function exactly once when it starts, and calls are serialized by the
+ init_mutex which is held over all _alloc() and _init() calls, so that locking
+ is not necessary in these functions. In other words for all threads but the
+ current one, the sequence of _alloc() and _init() calls will be atomic. There
+ is no relation between the thread numbers and the callback ordering. The
+ function is expected to return non-zero on success, or zero on failure. A
+ failure will make the process emit a succinct error message and immediately
+ exit. See also hap_register_per_thread_alloc() for functions called before
+ these ones.
+
+- void hap_register_pre_check(int (*fct)())
+
+ This adds a call to function <fct> to the list of functions to be called at
+ the step just before the configuration validity checks. This is useful to
+ create elements the same way the configuration parser would have created
+ them, when their initialization has to continue during the configuration
+ check.
+ It could be used for example to generate a proxy with multiple servers using
+ the configuration parser itself. At this step the trash buffers are allocated.
+ Threads are not yet started so no protection is required. The function is
+ expected to return non-zero on success, or zero on failure. A failure will make
+ the process emit a succinct error message and immediately exit.
+
+- void hap_register_post_check(int (*fct)())
+
+ This adds a call to function <fct> to the list of functions to be called at
+ the end of the configuration validity checks, just at the point where the
+ program either forks or exits depending whether it's called with "-c" or not.
+ Such calls are suited for memory allocation or internal table pre-computation
+ that would preferably not be done on the fly to avoid inducing extra time to
+ a pure configuration check. Threads are not yet started so no protection is
+ required. The function is expected to return non-zero on success, or zero on
+ failure. A failure will make the process emit a succinct error message and
+ immediately exit.
+
+- void hap_register_post_deinit(void (*fct)())
+
+ This adds a call to function <fct> to the list of functions to be called when
+ freeing the global sections at the end of deinit(), after everything is
+ stopped. The process is single-threaded at this point, thus these functions
+ are suitable for releasing configuration elements provided that no other
+ _deinit() function uses them, i.e. only close/release what is strictly
+ private to the subsystem. Since such functions are mostly only called during
+ soft stops (reloads) or failed startups, they tend to experience much less
+ test coverage than others despite being more exposed, and as such a lot of
+ care must be taken to test them especially when facing partial subsystem
+ initializations followed by errors.
+
+- void hap_register_post_proxy_check(int (*fct)(struct proxy *))
+
+ This adds a call to function <fct> to the list of functions to be called for
+ each proxy, after the calls to _post_server_check(). This can allow, for
+ example, to pre-configure default values for an option in a frontend based on
+ the "bind" lines or something in a backend based on the "server" lines. It's
+ worth being aware that such a function must be careful not to waste too much
+ time in order not to significantly slow down configurations with tens of
+ thousands of backends. The function is expected to return non-zero on
+ success, or zero on failure. A failure will make the process emit a succinct
+ error message and immediately exit.
+
+- void hap_register_post_server_check(int (*fct)(struct server *))
+
+ This adds a call to function <fct> to the list of functions to be called for
+ each server, after the call to check_config_validity(). This can allow, for
+ example, to preset a health state on a server or to allocate a protocol-
+ specific memory area. It's worth being aware that such a function must be
+ careful not to waste too much time in order not to significantly slow down
+ configurations with tens of thousands of servers. The function is expected
+ to return non-zero on success, or zero on failure. A failure will make the
+ process emit a succinct error message and immediately exit.
+
+- void hap_register_proxy_deinit(void (*fct)(struct proxy *))
+
+ This adds a call to function <fct> to the list of functions to be called when
+ freeing the resources during deinit(). These functions will be called as part
+ of the proxy's resource cleanup. Note that some of the proxy's fields will
+ already have been freed and others not, so such a function must not use any
+ information from the proxy that is subject to being released. In particular,
+ all servers have already been deleted. Since such functions are mostly only
+ called during soft stops (reloads) or failed startups, they tend to
+ experience much less test coverage than others despite being more exposed,
+ and as such a lot of care must be taken to test them especially when facing
+ partial subsystem initializations followed by errors. It's worth mentioning
+ that too slow functions could have a significant impact on the configuration
+ check or exit time especially on large configurations.
+
+- void hap_register_server_deinit(void (*fct)(struct server *))
+
+ This adds a call to function <fct> to the list of functions to be called when
+ freeing the resources during deinit(). These functions will be called as part
+ of the server's resource cleanup. Note that some of the server's fields will
+ already have been freed and others not, so such a function must not use any
+ information from the server that is subject to being released. Since such
+ functions are mostly only called during soft stops (reloads) or failed
+ startups, they tend to experience much less test coverage than others despite
+ being more exposed, and as such a lot of care must be taken to test them
+ especially when facing partial subsystem initializations followed by errors.
+ It's worth mentioning that too slow functions could have a significant impact
+ on the configuration check or exit time especially on large configurations.
+
+
+4. Initialization stages
+
+In order to offer some guarantees, the startup of the program is split into
+several stages. Some callbacks can be placed into each of these stages using
+an INITCALL macro, with 0 to 3 arguments, respectively called INITCALL0 to
+INITCALL3. These macros must be placed anywhere at the top level of a C file,
+preferably at the end so that the referenced symbols have already been met,
+but it may also be fine to place them right after the callbacks themselves.
+
+Such callbacks are referenced into small structures containing a pointer to the
+function and 3 arguments. NULL replaces unused arguments. The callbacks are
+cast to (void (*)(void *, void *, void *)) and the arguments to (void *).
+
+The first argument to the INITCALL macro is the initialization stage. The
+second one is the callback function, and others if any are the arguments.
+The init stage must be among the values of the "init_stage" enum, currently,
+and in this execution order:
+
+ - STG_PREPARE : used to preset variables, pre-initialize lookup tables and
+ pre-initialize list heads
+ - STG_LOCK : used to pre-initialize locks
+ - STG_REGISTER : used to register static lists such as keywords
+ - STG_ALLOC : used to allocate the required structures
+ - STG_POOL : used to create pools
+ - STG_INIT : used to initialize subsystems
+
+Each stage is guaranteed that previous stages have successfully completed. This
+means that an INITCALL placed at stage STG_INIT is guaranteed that all pools
+were already created and will be usable. Conversely, an INITCALL placed at
+stage STG_REGISTER must not rely on any field that requires preliminary
+allocation nor initialization. A callback cannot rely on other callbacks of the
+same stage, as the execution order within a stage is undefined and essentially
+depends on the linking order.
+
+The STG_REGISTER level is made for run-time linking of the various modules that
+compose the executable. Keywords, protocols and various other elements that are
+locally known to each compilation unit can be appended into common lists at
+boot time. This is why this call is placed just before STG_ALLOC.
+
+Example: register a very early call to init_log() with no argument, and another
+ call to cli_register_kw(&cli_kws) much later:
+
+ INITCALL0(STG_PREPARE, init_log);
+ INITCALL1(STG_REGISTER, cli_register_kw, &cli_kws);
+
+Technically speaking, each call to such a macro adds a distinct local symbol
+whose dynamic name involves the line number. These symbols are placed into a
+separate section and the beginning and end section pointers are provided by the
+linker. When too old a linker is used, a fallback is applied consisting in
+placing them into a linked list which is built by a constructor function for
+each initcall (this takes more room).
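+
+The fallback described above may be pictured with the simplified sketch
+below. It is a toy model, not HAProxy's macro: the TOY_* and toy_* names are
+hypothetical, only one argument is supported, and __attribute__((constructor))
+is a GCC/Clang extension. Each use of the macro creates a descriptor whose name
+embeds the line number, and a constructor links it into the list of its
+stage :
+
```c
#include <assert.h>
#include <stddef.h>

/* Toy stages, executed in this order (hypothetical names) */
enum toy_stage { TOY_STG_PREPARE, TOY_STG_REGISTER, TOY_STG_INIT, TOY_STG_SIZE };

struct toy_initcall {
	void (*fct)(void *, void *, void *);
	void *arg1, *arg2, *arg3;          /* unused arguments stay NULL */
	struct toy_initcall *next;
};

static struct toy_initcall *toy_stages[TOY_STG_SIZE];

#define TOY_CAT_(a, b) a##b
#define TOY_CAT(a, b)  TOY_CAT_(a, b)

/* Declares a descriptor named after the current line and a constructor which
 * links it into its stage's list before main() starts. At most one per line.
 */
#define TOY_INITCALL(stg, function, a1) \
	static struct toy_initcall TOY_CAT(toy_ic_, __LINE__); \
	static void __attribute__((constructor)) \
	TOY_CAT(toy_ic_reg_, __LINE__)(void) \
	{ \
		TOY_CAT(toy_ic_, __LINE__).next = toy_stages[stg]; \
		toy_stages[stg] = &TOY_CAT(toy_ic_, __LINE__); \
	} \
	static struct toy_initcall TOY_CAT(toy_ic_, __LINE__) = { \
		.fct = (function), .arg1 = (a1), \
	}

#define TOY_LOG_MAX 8
static const char *toy_log[TOY_LOG_MAX];
static int toy_log_len;

/* Sample callback: records its first argument. It takes all three arguments
 * so that the toy needs no function pointer cast.
 */
static void toy_note(void *name, void *unused2, void *unused3)
{
	(void)unused2;
	(void)unused3;
	assert(toy_log_len < TOY_LOG_MAX);
	toy_log[toy_log_len++] = name;
}

TOY_INITCALL(TOY_STG_INIT, toy_note, "init");
TOY_INITCALL(TOY_STG_PREPARE, toy_note, "prepare");
TOY_INITCALL(TOY_STG_REGISTER, toy_note, "register");

/* Runs all stages in order and returns the number of callbacks executed */
static int toy_run_initcalls(void)
{
	struct toy_initcall *ic;
	int stg, count = 0;

	for (stg = 0; stg < TOY_STG_SIZE; stg++) {
		for (ic = toy_stages[stg]; ic; ic = ic->next) {
			ic->fct(ic->arg1, ic->arg2, ic->arg3);
			count++;
		}
	}
	return count;
}
```
+
+In this sketch the order of the entries inside a stage depends on the order
+in which the constructors run, so only the ordering between stages can be
+relied upon.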
+
+Due to the symbols internally using the line number, it is very important not
+to place more than one INITCALL per line in the source file.
+
+It is also strongly recommended that functions and referenced arguments are
+static symbols local to the source file, unless they are global registration
+functions like in the example above with cli_register_kw(), where only the
+argument is a local keywords table.
+
+INITCALLs do not expect the callback function to return anything and as such
+do not perform any error check. As such, they are very similar to constructors
+offered by the compiler except that they are segmented in stages. It is thus
+the responsibility of the called functions to perform their own error checking
+and to exit in case of error. This may change in the future.
+
+
+5. REGISTER family of macros
+
+The association of INITCALLs and registration functions allows to perform some
+early dynamic registration of functions to be used anywhere, as well as values
+to be added to existing lists without having to manipulate list elements. For
+the sake of simplification, these combinations are available as a set of
+REGISTER macros which register calls to certain functions at the appropriate
+init stage. Such macros must be used at the top level in a file, just like
+INITCALL macros. The following macros are currently supported. Please keep them
+alphanumerically ordered:
+
+- REGISTER_BUILD_OPTS(str)
+
+ Adds the constant string <str> to the list of build options. This is done by
+ registering a call to hap_register_build_opts(str, 0) at stage STG_REGISTER.
+ The string will not be freed.
+
+- REGISTER_CONFIG_POSTPARSER(name, parser)
+
+ Adds a call to function <parser> at the end of the config parsing. The
+ function is called at the very end of check_config_validity() and may be used
+ to initialize a subsystem based on global settings for example. This is done
+ by registering a call to cfg_register_postparser(name, parser) at stage
+ STG_REGISTER.
+
+- REGISTER_CONFIG_SECTION(name, parse, post)
+
+ Registers a new config section name <name> which will be parsed by function
+ <parse> (if not null), and with an optional call to function <post> at the
+ end of the section. Function <parse> must be of type (int (*parse)(const char
+ *file, int linenum, char **args, int inv)), and returns 0 on success or an
+ error code among the ERR_* set on failure. The <post> callback takes no
+ argument and returns a similar error code. This is achieved by registering a
+ call to cfg_register_section() with the three arguments at stage
+ STG_REGISTER.
+
+- REGISTER_PER_THREAD_ALLOC(fct)
+
+ Registers a call to hap_register_per_thread_alloc(fct) at stage STG_REGISTER.
+
+- REGISTER_PER_THREAD_DEINIT(fct)
+
+ Registers a call to hap_register_per_thread_deinit(fct) at stage
+ STG_REGISTER.
+
+- REGISTER_PER_THREAD_FREE(fct)
+
+ Registers a call to hap_register_per_thread_free(fct) at stage STG_REGISTER.
+
+- REGISTER_PER_THREAD_INIT(fct)
+
+ Registers a call to hap_register_per_thread_init(fct) at stage STG_REGISTER.
+
+- REGISTER_POOL(ptr, name, size)
+
+ Used internally to declare a new pool. This is made by calling function
+ create_pool_callback() with these arguments at stage STG_POOL. Do not use it
+ directly, use either DECLARE_POOL() or DECLARE_STATIC_POOL() instead.
+
+- REGISTER_PRE_CHECK(fct)
+
+ Registers a call to hap_register_pre_check(fct) at stage STG_REGISTER.
+
+- REGISTER_POST_CHECK(fct)
+
+ Registers a call to hap_register_post_check(fct) at stage STG_REGISTER.
+
+- REGISTER_POST_DEINIT(fct)
+
+ Registers a call to hap_register_post_deinit(fct) at stage STG_REGISTER.
+
+- REGISTER_POST_PROXY_CHECK(fct)
+
+ Registers a call to hap_register_post_proxy_check(fct) at stage STG_REGISTER.
+
+- REGISTER_POST_SERVER_CHECK(fct)
+
+ Registers a call to hap_register_post_server_check(fct) at stage
+ STG_REGISTER.
+
+- REGISTER_PROXY_DEINIT(fct)
+
+ Registers a call to hap_register_proxy_deinit(fct) at stage STG_REGISTER.
+
+- REGISTER_SERVER_DEINIT(fct)
+
+ Registers a call to hap_register_server_deinit(fct) at stage STG_REGISTER.
+
diff --git a/doc/internals/api/ist.txt b/doc/internals/api/ist.txt
new file mode 100644
index 0000000..0f118d6
--- /dev/null
+++ b/doc/internals/api/ist.txt
@@ -0,0 +1,167 @@
+2021-11-08 - Indirect Strings (IST) API
+
+
+1. Background
+-------------
+
+When parsing traffic, most of the standard C string functions are unusable
+since they rely on a trailing zero. In addition, for the rare ones that support
+a length, we have to constantly maintain both the pointer and the length. But
+then, it's easy to come up with complex lengths and offsets calculations all
+over the place, rendering the code hard to read and bugs hard to avoid or spot.
+
+IST provides a solution to this by defining a structure made of exactly two
+word size elements, that most C ABIs know how to handle as a register when
+used as a function argument or a function's return value. The functions are
+inlined to leave a maximum set of opportunities to the compiler for optimization
+and expression reduction, and as a result they are often inexpensive to use. It
+is important however to keep in mind that all of these are designed for minimal
+code size when dealing with short strings (i.e. parsing tokens in protocols),
+and they are not optimal for processing large blocks.
+
+
+2. API description
+------------------
+
+IST are defined like this:
+
+ struct ist {
+     char *ptr;   // pointer to the string's first byte
+     size_t len;  // number of valid bytes starting from ptr
+ };
+
+A string is not set if its ->ptr member is NULL. In this case .len is undefined
+and is recommended to be zero.
+
+Declaring a function returning an IST:
+
+ struct ist produce_ist(int ok)
+ {
+     return ok ? IST("OK") : IST("KO");
+ }
+
+Declaring a function consuming an IST:
+
+ void say_ist(struct ist i)
+ {
+     write(1, istptr(i), istlen(i));
+ }
+
+Chaining the two:
+
+ void say_ok(int ok)
+ {
+     say_ist(produce_ist(ok));
+ }
+
+Notes:
+ - the arguments are passed as value, not reference, so there's no need for
+ any "const" in their declaration (except to catch coding mistakes).
+ Pointers to ist may benefit from being marked "const" however.
+
+ - similarly for the return value, there's no point in marking it "const" as
+ this would protect the pointer and length, not the data.
+
+ - use ist0() to append a trailing zero to a variable string for use with
+ printf()'s "%s" format, or for use with functions that work on NUL-
+ terminated strings, but never do this on constants since ist0() writes
+ into the string's storage.
+
+ - the API provides a starting pointer and current length, but does not
+ provide an allocated size. It remains up to the caller to know how large
+ the allocated area is when adding data, though most functions make this
+ easy.
+
+The following macros and functions are defined. Those whose name starts with
+underscores require special care and must not be used without being certain
+they are properly used (typically subject to buffer overflows if misused). Note
+that most functions were added over time depending on instant needs, and some
+are very close to each other. Many useful functions are still missing and would
+deserve being added.
+
+Below, arguments "i1","i2" are all of type "ist". Arguments "s" are
+NUL-terminated strings of type "char*", and "cs" are of type "const char *".
+Arguments "c" are of type "char", and "n" are of type size_t.
+
+ IST(cs):ist             make constant IST from a NUL-terminated const string
+ IST_NULL:ist            return an unset IST = ist2(NULL,0)
+ __istappend(i1,c):ist   append character <c> at the end of ist <i1>
+ ist(s):ist              return an IST from a nul-terminated string
+ ist0(i1):char*          write a \0 at the end of an IST, return the string
+ ist2(cs,l):ist          return a variable IST from a const string and length
+ ist2bin(s,i1):ist       copy IST into a buffer, return the result
+ ist2bin_lc(s,i1):ist    like ist2bin() but turning to lower case
+ ist2bin_uc(s,i1):ist    like ist2bin() but turning to upper case
+ ist2str(s,i1):ist       copy IST into a buffer, add NUL and return the result
+ ist2str_lc(s,i1):ist    like ist2str() but turning to lower case
+ ist2str_uc(s,i1):ist    like ist2str() but turning to upper case
+ ist_find(i1,c):ist      return first occurrence of char <c> in <i1>
+ ist_find_ctl(i1):char*  return pointer to first CTL char in <i1> or NULL
+ ist_skip(i1,c):ist      return first occurrence of char not <c> in <i1>
+ istadv(i1,n):ist        advance the string by <n> characters
+ istalloc(n):ist         return allocated string of zero initial length
+ istcat(d,s,n):ssize_t   copy <s> after <d> for <n> chars max, return len or -1
+ istchr(i1,c):char*      return pointer to first occurrence of <c> in <i1>
+ istclear(i1*):size_t    return previous size and set size to zero
+ istcpy(d,s,n):ssize_t   copy <s> over <d> for <n> chars max, return len or -1
+ istdiff(i1,i2):int      return the ordinal difference, like strcmp()
+ istdup(i1):ist          allocate new ist and copy original one into it
+ istend(i1):char*        return pointer to first character after the IST
+ isteq(i1,i2):int        return non-zero if strings are equal
+ isteqi(i1,i2):int       like isteq() but case-insensitive
+ istfree(i1*)            free of allocated <i1>/IST_NULL and set it to IST_NULL
+ istissame(i1,i2):int    return true if pointers and lengths are equal
+ istist(i1,i2):ist       return first occurrence of <i2> in <i1>
+ istlen(i1):size_t       return the length of the IST (number of characters)
+ istmatch(i1,i2):int     return non-zero if i1 starts like i2 (empty OK)
+ istmatchi(i1,i2):int    like istmatch() but case insensitive
+ istneq(i1,i2,n):int     like isteq() but limited to the first <n> chars
+ istnext(i1):ist         return the IST advanced by one character
+ istnmatch(i1,i2,n):int  like istmatch() but limited to the first <n> chars
+ istpad(s,i1):ist        copy IST into a buffer, add a NUL, return the result
+ istptr(i1):char*        return the starting pointer of the IST
+ istscat(d,s,n):ssize_t  same as istcat() but always place a NUL at the end
+ istscpy(d,s,n):ssize_t  same as istcpy() but always place a NUL at the end
+ istshift(i1*):char      return the first character and advance the IST by one
+ istsplit(i1*,c):ist     return part before <c>, make ist start from <c>
+ iststop(i1,c):ist       truncate ist before first occurrence of <c>
+ isttest(i1):int         return true if ist is not NULL, false otherwise
+ isttrim(i1,n):ist       return ist trimmed to no more than <n> characters
+ istzero(i1,n):ist       trim to <n> chars, trailing zero included.
+
+
+3. Quick index by typical C construct or function
+-------------------------------------------------
+
+Some common C constructs may be adjusted to use ist instead. The mapping is not
+always one-to-one, but usually the computations on the length part tend to
+disappear in the refactoring, allowing function calls to be chained directly.
+The entries below are hints to figure out which function to look for in order
+to rewrite some common use cases.
+
+ char* IST equivalent
+
+ strchr() istchr(), ist_find(), iststop()
+ strstr() istist()
+ strcpy() istcpy()
+ strscpy() istscpy()
+ strlcpy() istscpy()
+ strcat() istcat()
+ strscat() istscat()
+ strlcat() istscat()
+ strcmp() istdiff()
+ strdup() istdup()
+ !strcmp() isteq()
+ !strncmp() istneq(), istmatch(), istnmatch()
+ !strcasecmp() isteqi()
+ !strncasecmp() istneqi(), istmatchi()
+ strtok() istsplit()
+ return NULL return IST_NULL
+ s = malloc() s = istalloc()
+ free(s); s = NULL istfree(&s)
+ p != NULL isttest(p)
+ c = *(p++) c = istshift(p)
+ *(p++) = c __istappend(p, c)
+ p += n istadv(p, n)
+ p + strlen(p) istend(p)
+ p[max] = 0 isttrim(p, max)
+ p[max+1] = 0 istzero(p, max)
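+
+As an illustration of how the length computations disappear, here is a minimal
+sketch that splits "name=value" without copying and without writing any NUL.
+It does not use the real header: the "my_ist" type and the "my_*" helpers are
+simplified local stand-ins written for this example, which only mimic the roles
+of ist(), isteq(), iststop() and istadv() as summarized above.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for the real type: a pointer plus a length, no NUL needed. */
struct my_ist {
	const char *ptr;
	size_t len;
};

/* Build a my_ist from a NUL-terminated string (role of ist()). */
static struct my_ist my_ist_from(const char *s)
{
	struct my_ist i = { s, strlen(s) };
	return i;
}

/* Non-zero if both have the same length and contents (role of isteq()). */
static int my_isteq(struct my_ist a, struct my_ist b)
{
	return a.len == b.len && memcmp(a.ptr, b.ptr, a.len) == 0;
}

/* Truncate before the first occurrence of <c> (role of iststop()). */
static struct my_ist my_iststop(struct my_ist i, char c)
{
	const char *p = memchr(i.ptr, c, i.len);

	if (p)
		i.len = (size_t)(p - i.ptr);
	return i;
}

/* Advance by <n> characters, capped to the length (role of istadv()). */
static struct my_ist my_istadv(struct my_ist i, size_t n)
{
	if (n > i.len)
		n = i.len;
	i.ptr += n;
	i.len -= n;
	return i;
}

/* Split "host=example" into its two parts; returns 1 on success. */
static int demo_split(void)
{
	struct my_ist line  = my_ist_from("host=example");
	struct my_ist name  = my_iststop(line, '=');
	struct my_ist value = my_istadv(line, name.len + 1);

	return my_isteq(name, my_ist_from("host")) &&
	       my_isteq(value, my_ist_from("example"));
}
```

+
+Note that both parts still point into the original buffer, which is why no
+allocation nor temporary NUL is needed.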
diff --git a/doc/internals/api/layers.txt b/doc/internals/api/layers.txt
new file mode 100644
index 0000000..b5c35f4
--- /dev/null
+++ b/doc/internals/api/layers.txt
@@ -0,0 +1,190 @@
+2022-05-27 - Stream layers in HAProxy 2.6
+
+
+1. Background
+
+There are streams at plenty of levels in haproxy, essentially due to the
+introduction of multiplexed protocols which provide high-level streams on top
+of low-level streams, themselves either based on stream-oriented protocols or
+datagram-oriented protocols.
+
+The refactoring of the appctx and muxes, which allowed a lot of duplicate code
+to be dropped between 2.5 and 2.6-dev6, raised another concern with some
+entities like "conn_stream" that were not specific to connections anymore,
+"endpoints" that became entities on their own, and "targets" whose life had
+been extended to last all along a connection.
+
+It was time to rename all such legacy entities introduced in 1.8 and which had
+turned particularly confusing over time as their roles evolved.
+
+
+2. Naming principles
+
+The global renaming of some entities between streams and connections was
+articulated around several principles:
+
+ - avoid the confusing use of "context" in shared places. For example, the
+ endpoint's connection is in "ctx" and nothing makes it obvious that the
+ endpoint's context is a connection, especially when an applet is there.
+
+ - reserve relative nouns for pointers and not for types. "endpoint", just
+ like "owner" or "peer" is relative, but when accessed from a different
+ layer it starts to make no sense at all, or to make one believe it's
+ something else, particularly with void*.
+
+ - avoid too generic terms that have multiple meanings, or words that are
+ synonyms in a same place (e.g. "peer" and "remote", or "endpoint" and
+ "target"). If two synonyms are needed to designate two distinct entities,
+ there's probably a problem elsewhere, or the problem is poorly defined.
+
+ - make it clearer that all that is manipulated is related to streams. This is
+ particularly important in sample fetch functions for example, which tend
+ to require low-level access and could be misled into following the
+ wrong chain when trying to get information about a connection.
+
+ - use easily spellable short names that abbreviate unambiguously when used
+ together in adjacent contexts.
+
+
+3. Current state as of 2.6
+
+- when a name is required to designate the lower block that starts at the mux
+ stream or the appctx, it is spoken of as a "stream endpoint", and abbreviated
+ "se". It's okay because while "endpoint" itself is relative, "stream
+ endpoint" unequivocally designates one extremity of a stream. If a type is
+ needed for this in the future (e.g. via obj_type), then the type "stendp"
+ may be used. Before 2.6-dev6 there was no name for this, it was known as
+ conn_stream->ctx.
+
+- the 2.6-dev6 cs_endpoint which preserves the state of a mux stream or an
+ appctx and abstracts them in front of a conn_stream becomes a "stream
+ endpoint descriptor", of type "sedesc" and often abbreviated "sd", "sed"
+ or "ed". Its "target" pointer became "se" as per the rule above. Before
+ 2.6-dev6, these elements were mixed with others inside conn_stream. From
+ the appctx it's called "sedesc" (few occurrences hence long name OK).
+
+- the conn_stream which is always attached to either a stream or a health check
+ and that is used to reach a mux or an applet becomes a "stream connector" of
+ type "stconn", generally abbreviated "sc". Its "endp" pointer becomes
+ "sedesc" as per the rule above, and that one has a back pointer "sc". The
+ stream uses "scf" and "scb" as the respective front and back pointers to the
+ stconns. Prior to 2.6-dev6, these parts were split between conn_stream and
+ stream_interface.
+
+- the sedesc's "ctx" which is solely used to store the connection as of now, is
+ renamed "conn" to avoid any doubt in the context of applets or even muxes. In
+ the future the connection should be attached to the "se" instead and this
+ pointer should disappear (or be recycled for anything else).
+
+The new 2.6 model looks like this:
+
+ +------------------------+
+ | stream or health check |
+ +------------------------+
+ ^ \ scf, scb
+ / \
+ | |
+ \ /
+ app \ v
+ +----------+
+ | stconn |
+ +----------+
+ ^ \ sedesc
+ / \
+ . . . . | . . . | . . . . . split point (retries etc)
+ \ /
+ sc \ v
+ +----------+
+ flags <--| sedesc | : sedesc :
+ +----------+ ... +----------+
+ conn / ^ \ se ^ \
+ +------------+ / / \ | \
+ | connection |<--' | | ... OR ... | |
+ +------------+ \ / \ |
+ mux| ^ |ctx sd \ v : sedesc \ v
+ | | | +----------------------+ \ # +----------+ svcctx
+ | | | | mux stream or appctx | | # | appctx |--.
+ | | | +----------------------+ | # +----------+ |
+ | | | ^ | / private # : : |
+ v | | | v > to the # +----------+ |
+ mux_ops | | +----------------+ \ mux # | svcctx |<-'
+ | +---->| mux connection | ) # +----------+
+ +------ +----------------+ / #
+
+Stream endpoint descriptors may exist in the following modes:
+ - .conn = NULL, .se = NULL : backend, no connection attempt yet
+ - .conn = NULL, .se = <appctx> : frontend or backend, applet
+ - .conn = <conn>, .se = NULL : backend, connection in progress
+ - .conn = <conn>, .se = <muxs> : frontend or backend, connected
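+
+The four modes above can be told apart from the two pointers alone. The sketch
+below assumes a reduced, hypothetical "my_sedesc" structure holding only these
+two pointers; the structure, the enum and the helper are illustrative names
+made up for this example and are not part of the code base.
+

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced descriptor: only the two pointers discussed above. */
struct my_sedesc {
	void *conn; /* connection, or NULL */
	void *se;   /* mux stream or appctx, or NULL */
};

enum my_sd_mode {
	MY_SD_IDLE,       /* .conn = NULL, .se = NULL : no attempt yet */
	MY_SD_APPLET,     /* .conn = NULL, .se = appctx */
	MY_SD_CONNECTING, /* .conn set,    .se = NULL : in progress */
	MY_SD_CONNECTED,  /* .conn set,    .se = mux stream */
};

/* Map the (conn, se) pair to one of the four documented modes. */
static enum my_sd_mode my_sd_get_mode(const struct my_sedesc *sd)
{
	if (!sd->conn)
		return sd->se ? MY_SD_APPLET : MY_SD_IDLE;
	return sd->se ? MY_SD_CONNECTED : MY_SD_CONNECTING;
}
```
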
+
+Notes:
+ - for historical reasons (connect, forced protocol upgrades, etc), during a
+ connection setup or a rule-based protocol upgrade, the connection's "ctx"
+ may temporarily point to the stconn
+
+
+4. Invariants and cardinalities
+
+Usually a stream is created from an existing stconn from a mux or some applets,
+but may also be allocated first by other applets schedulers. After stream_new()
+a stream always has exactly one stconn per side (scf, scb), each of which has
+one ->sedesc. Each side is initialized with either one or no stream endpoint
+attached to the descriptor.
+
+Both applets and a mux stream always have a stream endpoint descriptor. AS SUCH
+IT IS NEVER NECESSARY TO TEST FOR THE EXISTENCE OF THE SEDESC FROM ANY SIDE, IT
+ALWAYS EXISTS. This explains why as much as possible it's preferable to use the
+sedesc to access flags and statuses from any side, rather than bouncing via the
+stconn.
+
+An applet's app layer is always a stream, which means that there are always
+channels accessible above, and there is always an opposite stream connector and
+a stream endpoint descriptor. As such, it's always safe for an applet to access
+the other side using sc_opposite().
+
+When an outgoing connection is in the process of being established, the backend
+side sedesc has its ->conn pointer pointing to the pending connection, and no
+->se. Once the connection is established and a mux is chosen, it's attached to
+the ->se. If an applet is used instead of a mux, the appctx is attached to the
+sedesc's ->se and ->conn remains NULL.
+
+If either side wants to detach from the other, it must allocate a new virgin
+sedesc to replace the existing one, and leave the existing one to the endpoint,
+since it continues to describe the stream endpoint. The stconn keeps its state
+(modulo the updates related to the disconnection). The previous sedesc points
+to a NULL stconn. For example, disconnecting from a backend mux will leave the
+entities like this:
+
+ +------------------------+
+ | stream or health check |
+ +------------------------+
+ ^ \ scf, scb
+ / \
+ | |
+ \ /
+ app \ v
+ +----------+
+ | stconn |
+ +----------+
+ ^ \ sedesc
+ / \
+ NULL | |
+ ^ \ /
+ sc | / sc \ v
+ +----------+ / +----------+
+ flags <--| sedesc1 | . . . . . | sedesc2 |--> flags
+ +----------+ / +----------+
+ conn / ^ \ se / conn / \ se
+ +------------+ / / \ | |
+ | connection |<--' | | v v
+ +------------+ \ / NULL NULL
+ mux| ^ |ctx sd \ v
+ | | | +----------------------+
+ | | | | mux stream or appctx |
+ | | | +----------------------+
+ | | | ^ |
+ v | | | v
+ mux_ops | | +----------------+
+ | +---->| mux connection |
+ +------ +----------------+
+
diff --git a/doc/internals/api/list.txt b/doc/internals/api/list.txt
new file mode 100644
index 0000000..d03cf03
--- /dev/null
+++ b/doc/internals/api/list.txt
@@ -0,0 +1,195 @@
+2021-11-09 - List API
+
+
+1. Background
+-------------
+
+HAProxy's lists are almost all doubly-linked and circular so that it is always
+possible to insert at the beginning, append at the end, scan them in any order
+and delete any element without having to scan to search the predecessor nor the
+successor.
+
+A list's head is just a regular list element, and an element always points to
+another list element. Such elements only have two pointers, the next and the
+previous elements. The object being pointed to is retrieved by subtracting the
+list element's offset in its structure from the list element's pointer. This
+way there is no need for any separate allocation for the list element, for a
+pointer to the object in the list, nor for a pointer to the list element from
+the object, as the list is embedded into the object.
+
+All basic operations are provided, as well as some iterators. Some iterators
+are safe for removal of the current element within the loop, others not. In any
+case a list cannot be freely modified while iterating over it (e.g. the current
+element's successor cannot be freed if it's saved as the restart point).
+
+Extreme care is taken nowadays in HAProxy to make sure that no dangling
+pointers are left in elements, so it is important to always initialize list
+heads and list elements, as well as elements that are removed from a list if
+they are not immediately freed, so that their deletion is idempotent. A rule of
+thumb is that a list pointer's validity never has to be checked, it is always
+valid to dereference it. A lot of complex bugs have been caused in the past by
+incorrect list manipulation, such as an element being deleted twice, resulting
+in damaging previously adjacent elements' neighbours. This usually has serious
+consequences at locations that are totally different from the one of the bug,
+and that are only detected much later, so it is required to be particularly
+strict on using lists safely.
+
+The lists are not thread-safe, but mt_lists may be used instead.
+
+
+2. API description
+------------------
+
+A list is defined like this, both for the list's head, and for any other
+element:
+
+ struct list {
+ struct list *n; /* next */
+ struct list *p; /* prev */
+ };
+
+An empty list points to itself for both pointers. I.e. a list's head is both
+its own successor and its own predecessor. This guarantees that insertions
+and deletions can be done without any check and that deletion is idempotent.
+For this reason and by convention, a detached element ought to be represented
+like an empty head.
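+
+The self-pointing convention and the offset subtraction can be illustrated with
+the minimal sketch below. It deliberately does not use the real macros: the
+"my_*" functions and the "my_item" structure are simplified local equivalents
+written for this example, so only the principle should be taken from it.
+

```c
#include <assert.h>
#include <stddef.h>

struct list {
	struct list *n; /* next */
	struct list *p; /* prev */
};

/* Hypothetical object embedding its list element. */
struct my_item {
	int value;
	struct list link;
};

/* Make <l> an empty head (or a detached element): it points to itself. */
static void my_list_init(struct list *l)
{
	l->n = l->p = l;
}

/* Link <e> just before the head, i.e. at the end of the list. */
static void my_list_append(struct list *head, struct list *e)
{
	e->p = head->p;
	e->n = head;
	head->p->n = e;
	head->p = e;
}

/* Unlink <e> then reinitialize it, so that doing it again is harmless. */
static void my_list_del_init(struct list *e)
{
	e->n->p = e->p;
	e->p->n = e->n;
	my_list_init(e);
}

/* Recover the object from its embedded element by subtracting the offset. */
static struct my_item *my_item_of(struct list *l)
{
	return (struct my_item *)((char *)l - offsetof(struct my_item, link));
}

/* Returns 1 if all the expected properties hold. */
static int demo_list(void)
{
	struct list head;
	struct my_item a = { 1, { NULL, NULL } };
	struct my_item b = { 2, { NULL, NULL } };

	my_list_init(&head);
	my_list_init(&a.link);
	my_list_init(&b.link);
	my_list_append(&head, &a.link);
	my_list_append(&head, &b.link);
	if (my_item_of(head.n)->value != 1 || my_item_of(head.p)->value != 2)
		return 0;

	my_list_del_init(&a.link);
	my_list_del_init(&a.link); /* second call changes nothing */
	return head.n == &b.link && head.p == &b.link && a.link.n == &a.link;
}
```
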
+
+Lists are manipulated using a set of macros which are used to initialize, add,
+remove, or iterate over elements. Most of these macros are extremely simple and
+are not even protected against multiple evaluation, so it is fundamentally
+important that the expressions used in the arguments are idempotent and that
+the result does not depend on the evaluation order of the arguments.
+
+Macro Description
+
+ILH
+ Initialized List Head : this is a non-NULL, non-empty list element used
+ to prevent the compiler from moving an empty list head declaration to
+ BSS, typically when it appears in an array of keywords. Without this,
+ some older versions of gcc tend to trim all the array and cause
+ corruption.
+
+LIST_INIT(l)
+ Initialize the list as an empty list head
+
+LIST_HEAD_INIT(l)
+ Return a valid initialized empty list head pointing to this
+ element. Essentially used with assignments in declarations.
+
+LIST_INSERT(l, e)
+ Add an element at the beginning of a list and return it
+
+LIST_APPEND(l, e)
+ Add an element at the end of a list and return it
+
+LIST_SPLICE(n, o)
+ Add the contents of a list <o> at the beginning of another list <n>.
+ The old list head remains untouched.
+
+LIST_SPLICE_END_DETACHED(n, o)
+ Add the contents of a list whose first element is <o> and last one
+ is <o->p> at the end of another list <n>. The old list DOES NOT have
+ any head here.
+
+LIST_DELETE(e)
+ Remove an element from a list and return it. Safe to call on
+ initialized elements, but will not change the element itself so it is
+ not idempotent. Consider using LIST_DEL_INIT() instead unless the
+ element is freed immediately afterwards.
+
+LIST_DEL_INIT(e)
+ Remove an element from a list, initialize it and return it so that a
+ subsequent LIST_DELETE() is safe. This is faster than performing a
+ LIST_DELETE() followed by a LIST_INIT() as pointers are not reloaded.
+
+LIST_ELEM(l, t, m)
+ Return a pointer of type <t> to a structure containing a list head
+ member called <m> at address <l>. Note that <l> can be the result of a
+ function or macro since it's used only once.
+
+LIST_ISEMPTY(l)
+ Check if the list head <l> is empty (=initialized) or not, and return
+ non-zero only if so.
+
+LIST_INLIST(e)
+ Check if the list element <e> was added to a list or not, thus return
+ true unless the element was initialized.
+
+LIST_INLIST_ATOMIC(e)
+ Atomically check if the list element's next pointer points to anything
+ different from itself, implying the element should be part of a
+ list. This usually is similar to LIST_INLIST() except that while that
+ one might be instrumented using debugging code to perform further
+ consistency checks, the macro below guarantees to always perform a
+ single atomic test and is safe to use with barriers.
+
+LIST_NEXT(l, t, m)
+ Return a pointer of type <t> to a structure following the element which
+ contains list head <l>, which is known as member <m> in struct <t>.
+
+LIST_PREV(l, t, m)
+ Return a pointer of type <t> to a structure preceding the element which
+ contains list head <l>, which is known as member <m> in struct <t>.
+ Note that this macro is first undefined as it happened to already exist
+ on some old OSes.
+
+list_for_each_entry(i, l, m)
+ Iterate local variable <i> through a list of items of type "typeof(*i)"
+ which are linked via a "struct list" member named <m>. A pointer to the
+ head of the list is passed in <l>. No temporary variable is needed.
+ Note that <i> must not be modified during the loop.
+
+list_for_each_entry_from(i, l, m)
+ Same as list_for_each_entry() but starting from current value of <i>
+ instead of the list's head.
+
+list_for_each_entry_from_rev(i, l, m)
+ Same as list_for_each_entry_rev() but starting from current value of <i>
+ instead of the list's head.
+
+list_for_each_entry_rev(i, l, m)
+ Iterate backwards local variable <i> through a list of items of type
+ "typeof(*i)" which are linked via a "struct list" member named <m>. A
+ pointer to the head of the list is passed in <l>. No temporary variable
+ is needed. Note that <i> must not be modified during the loop.
+
+list_for_each_entry_safe(i, b, l, m)
+ Iterate variable <i> through a list of items of type "typeof(*i)" which
+ are linked via a "struct list" member named <m>. A pointer to the head
+ of the list is passed in <l>. A temporary backup variable <b> of same
+ type as <i> is needed so that <i> may safely be deleted if needed. Note
+ that it is only permitted to delete <i> and no other element during
+ this operation!
+
+list_for_each_entry_safe_from(i, b, l, m)
+ Same as list_for_each_entry_safe() but starting from current value of
+ <i> instead of the list's head.
+
+list_for_each_entry_safe_from_rev(i, b, l, m)
+ Same as list_for_each_entry_safe_rev() but starting from current value
+ of <i> instead of the list's head.
+
+list_for_each_entry_safe_rev(i, b, l, m)
+ Iterate backwards local variable <i> through a list of items of type
+ "typeof(*i)" which are linked via a "struct list" member named <m>. A
+ pointer to the head of the list is passed in <l>. A temporary variable
+ <b> of same type as <i> is needed so that <i> may safely be deleted if
+ needed. Note that it is only permitted to delete <i> and no other
+ element during this operation!
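+
+The role of the backup variable in the "safe" iterators can be sketched with a
+hand-written loop. The code below is only an approximation of the idea, using
+local "my_*" helpers and a hypothetical "my_conn" structure instead of the real
+macros: the successor is saved before the body runs, so the current element
+(and only this one) may be unlinked.
+

```c
#include <assert.h>
#include <stddef.h>

struct list {
	struct list *n; /* next */
	struct list *p; /* prev */
};

/* Hypothetical object type used for the example. */
struct my_conn {
	int id;
	struct list link;
};

#define MY_ELEM(l) \
	((struct my_conn *)((char *)(l) - offsetof(struct my_conn, link)))

static void my_list_init(struct list *l)
{
	l->n = l->p = l;
}

static void my_list_append(struct list *head, struct list *e)
{
	e->p = head->p;
	e->n = head;
	head->p->n = e;
	head->p = e;
}

static void my_list_del_init(struct list *e)
{
	e->n->p = e->p;
	e->p->n = e->n;
	my_list_init(e);
}

/* Unlink all elements with an odd id, return how many were removed. */
static int my_drop_odd(struct list *head)
{
	struct list *cur, *back;
	int removed = 0;

	/* <back> is loaded before the body runs, so unlinking <cur> is safe */
	for (cur = head->n, back = cur->n; cur != head; cur = back, back = cur->n) {
		if (MY_ELEM(cur)->id & 1) {
			my_list_del_init(cur);
			removed++;
		}
	}
	return removed;
}

/* Build a list with ids 1..4, drop odd ones, check that 2 and 4 remain. */
static int demo_drop_odd(void)
{
	struct list head;
	struct my_conn c[4];
	int i;

	my_list_init(&head);
	for (i = 0; i < 4; i++) {
		c[i].id = i + 1;
		my_list_append(&head, &c[i].link);
	}
	if (my_drop_odd(&head) != 2)
		return 0;
	return MY_ELEM(head.n)->id == 2 && MY_ELEM(head.n->n)->id == 4 &&
	       head.n->n->n == &head;
}
```

+
+Unlinking <back> instead of <cur> inside the loop would leave the iteration
+restarting from a detached element, which is why only <cur> may be deleted.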
+
+3. Notes
+--------
+
+- This API is quite old and some macros are missing. For example there's still
+ no list_first() so it's common to use LIST_ELEM(head->n, ...) instead. Some
+ older parts of the code also used to rely on list_for_each() followed by a
+ break to stop on the first element.
+
+- Some parts were recently renamed because LIST_ADD() used to do what
+ LIST_INSERT() currently does and was often mistaken with LIST_ADDQ() which is
+ what LIST_APPEND() now is. As such it is not totally impossible that some
+ places use a LIST_INSERT() where a LIST_APPEND() would be desired.
+
+- The structure must not be modified at all (even to add debug info). Some
+ parts of the code assume that its layout is exactly this one, particularly
+ the parts ensuring the casting between MT lists and lists.
diff --git a/doc/internals/api/pools.txt b/doc/internals/api/pools.txt
new file mode 100644
index 0000000..2c54409
--- /dev/null
+++ b/doc/internals/api/pools.txt
@@ -0,0 +1,577 @@
+2022-02-24 - Pools structure and API
+
+1. Background
+-------------
+
+Memory allocation is a complex problem covered by a massive amount of
+literature. Memory allocators found in the field cover a broad spectrum of
+capabilities, performance, fragmentation, efficiency etc.
+
+The main difficulty of memory allocation comes from finding the optimal chunks
+for arbitrary sized requests, that will still preserve a low fragmentation
+level. Doing this well is often expensive in CPU usage and/or memory usage.
+
+In programs like HAProxy that deal with a large number of fixed size objects,
+there is no point having to endure all this risk of fragmentation, and the
+associated costs (sometimes up to several milliseconds with certain minimalist
+allocators) are simply not acceptable. A better approach consists in grouping
+frequently used objects by size, knowing that due to the high repetitiveness of
+operations, a freed object will immediately be needed for another operation.
+
+This grouping of objects by size is what is called a pool. Pools are created
+for certain frequently allocated objects, are usually merged together when they
+are of the same size (or almost the same size), and significantly reduce the
+number of calls to the memory allocator.
+
+With the arrival of threads, pools started to become a bottleneck so they now
+implement an optional thread-local lockless cache. Finally with the arrival of
+really efficient memory allocators in modern operating systems, the shared part
+has also become optional so that it doesn't consume memory if it does not bring
+any value.
+
+In 2.6-dev2, a number of debugging options that used to be configured at build
+time only changed to boot-time and can be modified using keywords passed after
+"-dM" on the command line, which sets or clears bits in the pool_debugging
+variable. The build-time options still affect the default settings however.
+Default values may be consulted using "haproxy -dMhelp".
+
+
+2. Principles
+-------------
+
+The pools architecture is selected at build time. The main options are:
+
+ - thread-local caches and process-wide shared pool enabled (1)
+
+ This is the default situation on most operating systems. Each thread has
+ its own local cache, and when depleted it refills from the process-wide
+ pool that avoids calling the standard allocator too often. It is possible
+ to force this mode at build time by setting CONFIG_HAP_GLOBAL_POOLS or at
+ boot time with "-dMglobal".
+
+ - thread-local caches only are enabled (2)
+
+ This is the situation on operating systems where a fast and modern memory
+ allocator is detected and when it is estimated that the process-wide shared
+ pool will not bring any benefit. This detection is automatic at build time,
+ but may also be forced at build time by setting CONFIG_HAP_NO_GLOBAL_POOLS
+ or at boot time with "-dMno-global".
+
+ - pass-through to the standard allocator (3)
+
+ This is used when one absolutely wants to disable pools and rely on regular
+ malloc() and free() calls, essentially in order to trace memory allocations
+ by call points, either internally via DEBUG_MEM_STATS, or externally via
+ tools such as Valgrind. This mode of operation may be forced at build time
+ by setting DEBUG_NO_POOLS or at boot time with "-dMno-cache".
+
+ - pass-through to an mmap-based allocator for debugging (4)
+
+ This is used only during deep debugging when trying to detect various
+ conditions such as use-after-free. In this case each allocated object's
+ size is rounded up to a multiple of a page size (4096 bytes) and an
+ integral number of pages is allocated for each object using mmap(),
+ surrounded by two inaccessible holes that aim to detect some out-of-bounds
+ accesses. Released objects are instantly freed using munmap() so that any
+ immediate subsequent access to the memory area crashes the process if the
+ area had not been reallocated yet. This mode can be enabled at build time
+ by setting DEBUG_UAF. It tends to consume a lot of memory and not to scale
+ at all with concurrent calls, which tends to make the system stall. The
+ watchdog may even trigger on some slow allocations.
+
+There are no more provisions for running with a shared pool but no thread-local
+cache: the shared pool's main goal is to compensate for the expensive calls to
+the memory allocator. This gain may be huge on tiny systems using basic
+allocators, but the thread-local cache will already achieve this. And on larger
+threaded systems, the shared pool's benefit is visible when the underlying
+allocator scales poorly, but in this case the shared pool would suffer from
+the same limitations without its thread-local cache and wouldn't provide any
+benefit.
+
+Summary of the various operation modes:
+
+                    (1)          (2)          (3)          (4)
+
+                   User         User         User         User
+                     |            |            |            |
+   pool_alloc()      V            V            |            |
+                +---------+  +---------+       |            |
+                | Thread  |  | Thread  |       |            |
+                | Local   |  | Local   |       |            |
+                | Cache   |  | Cache   |       |            |
+                +---------+  +---------+       |            |
+                     |            |            |            |
+   pool_refill*()    V            |            |            |
+                +---------+       |            |            |
+                | Shared  |       |            |            |
+                | Pool    |       |            |            |
+                +---------+       |            |            |
+                     |            |            |            |
+   malloc()          V            V            V            |
+                +---------+  +---------+  +---------+       |
+                | Library |  | Library |  | Library |       |
+                +---------+  +---------+  +---------+       |
+                     |            |            |            |
+   mmap()            V            V            V            V
+                +---------+  +---------+  +---------+  +---------+
+                | OS      |  | OS      |  | OS      |  | OS      |
+                +---------+  +---------+  +---------+  +---------+
+
+One extra build define, DEBUG_FAIL_ALLOC, is used to enforce random allocation
+failure in pool_alloc() by randomly returning NULL, to test that callers
+properly handle allocation failures. It may also be enabled at boot time using
+"-dMfail". In this case the desired average rate of allocation failures can be
+fixed by global setting "tune.fail-alloc" expressed in percent.
+
+The thread-local caches contain the freshest objects whose total size amounts
+to CONFIG_HAP_POOL_CACHE_SIZE bytes, which was typically 1MB before 2.6 and
+is 512kB since. The aim is to keep hot objects that still fit in the CPU core's
+private L2 cache. Once these objects do not fit into the cache anymore, there's
+no benefit keeping them local to the thread, so they'd rather be returned to
+the shared pool or the main allocator so that any other thread may make use of
+them.
+
+
+3. Storage in thread-local caches
+---------------------------------
+
+This section describes how objects are linked in thread local caches. This is
+not meant to be a concern for users of the pools API but it can be useful when
+inspecting post-mortem dumps or when trying to figure certain size constraints.
+
+Objects are stored in the local cache using a doubly-linked list. This ensures
+that they can be visited by freshness order like a stack, while at the same
+time being able to access them from oldest to newest when it is needed to
+evict coldest ones first:
+
+ - releasing an object to the cache always puts it on the top.
+
+ - allocating an object from the cache always takes the topmost one, hence the
+ freshest one.
+
+ - scanning for older objects to evict starts from the bottom, where the
+ oldest ones are located
+
+To that end, each thread-local cache keeps a list head in the "list" member of
+its "pool_cache_head" descriptor, that links all objects cast to type
+"pool_cache_item" via their "by_pool" member.
+
+Note that the mechanism described above only works for a single pool. When
+trying to limit the total cache size to a certain value, all pools included,
+there is also a need to arrange all objects from all pools together in the
+local caches. For this, each thread_ctx maintains a list head of recently
+released objects, all pools included, in its member "pool_lru_head". All items
+in a thread-local cache are linked there via their "by_lru" member.
+
+This means that releasing an object using pool_free() consists in inserting
+it at the beginning of two lists:
+ - the local pool_cache_head's "list" list head
+ - the thread context's "pool_lru_head" list head
+
+Allocating an object consists in picking the first entry from the pool's "list"
+and deleting its "by_pool" and "by_lru" links.
+
+Evicting an object consists in scanning the thread context's "pool_lru_head"
+backwards and deleting the object's "by_pool" and "by_lru" links.
+
+Given that entries are both inserted and removed synchronously, we have the
+guarantee that the oldest object in the thread's LRU list is always the oldest
+object in its pool, and that the next element is the cache's list head. This is
+what allows the LRU eviction mechanism to figure what pool an object belongs to
+when releasing it.
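+
+The interaction between the per-pool list and the per-thread LRU list can be
+sketched as below. This is a simplified model, not the actual implementation:
+"my_cache_item" and the "my_*" functions are hypothetical names, list heads are
+passed explicitly instead of living in a pool_cache_head or a thread_ctx, and
+no accounting of sizes is performed.
+

```c
#include <assert.h>
#include <stddef.h>

struct list {
	struct list *n; /* next */
	struct list *p; /* prev */
};

/* Hypothetical reduced cache item, overlaid on the start of a free object. */
struct my_cache_item {
	struct list by_pool; /* link in the per-pool list */
	struct list by_lru;  /* link in the per-thread LRU list */
};

static void my_init(struct list *l)
{
	l->n = l->p = l;
}

/* Insert <e> at the beginning of the list (freshest position). */
static void my_insert(struct list *head, struct list *e)
{
	e->n = head->n;
	e->p = head;
	head->n->p = e;
	head->n = e;
}

static void my_del(struct list *e)
{
	e->n->p = e->p;
	e->p->n = e->n;
	my_init(e);
}

/* Release: the object becomes the freshest one in both lists. */
static void my_put(struct list *pool, struct list *lru, struct my_cache_item *it)
{
	my_insert(pool, &it->by_pool);
	my_insert(lru, &it->by_lru);
}

/* Allocate: pick the first (freshest) entry of the pool, drop both links. */
static struct my_cache_item *my_get(struct list *pool)
{
	struct my_cache_item *it;

	if (pool->n == pool)
		return NULL;
	it = (struct my_cache_item *)((char *)pool->n -
	                              offsetof(struct my_cache_item, by_pool));
	my_del(&it->by_pool);
	my_del(&it->by_lru);
	return it;
}

/* Evict: pick the last (oldest) entry of the LRU, drop both links. */
static struct my_cache_item *my_evict(struct list *lru)
{
	struct my_cache_item *it;

	if (lru->p == lru)
		return NULL;
	it = (struct my_cache_item *)((char *)lru->p -
	                              offsetof(struct my_cache_item, by_lru));
	my_del(&it->by_pool);
	my_del(&it->by_lru);
	return it;
}

/* Two pools sharing one LRU; returns 1 if the ordering is as described. */
static int demo_cache(void)
{
	struct list pool1, pool2, lru;
	struct my_cache_item a, b, c;

	my_init(&pool1);
	my_init(&pool2);
	my_init(&lru);
	my_put(&pool1, &lru, &a); /* oldest */
	my_put(&pool2, &lru, &b);
	my_put(&pool1, &lru, &c); /* freshest */

	return my_get(&pool1) == &c &&  /* freshest of pool1 */
	       my_evict(&lru) == &a &&  /* oldest of all pools */
	       my_get(&pool1) == NULL && /* pool1 is now empty */
	       my_get(&pool2) == &b;
}
```
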
+
+Note:
+ | Since a pool_cache_item has two list entries, on 64-bit systems it will be
+ | 32-bytes long. This is the smallest size that a pool may be, and any smaller
+ | size will automatically be rounded up to this size.
+
+When build option DEBUG_POOL_INTEGRITY is set, or the boot-time option
+"-dMintegrity" is passed on the command line, the area of the object between
+the two list elements and the end according to pool->size will be filled with
+pseudo-random words during pool_put_to_cache(), and these words will be
+compared between each other during pool_get_from_cache(), and the process will
+crash in case any bit differs, as this would indicate that the memory area was
+modified after the free. The pseudo-random pattern is in fact incremented by
+(~0)/3 upon each free so that roughly half of the bits change each time and we
+maximize the likelihood of detecting a single bit flip in either direction. In
+order to avoid an immediate reuse and maximize the time the object spends in
+the cache, when this option is set, objects are picked from the cache from the
+oldest one instead of the freshest one. This way even late memory corruptions
+have a chance to be detected.
+
+When build option DEBUG_MEMORY_POOLS is set, or the boot-time option "-dMtag"
+is passed on the executable's command line, pool objects are allocated with
+one extra pointer compared to the requested size, so that the bytes that follow
+the memory area point to the pool descriptor itself as long as the object is
+allocated via pool_alloc(). Upon releasing via pool_free(), the pointer is
+compared and the code will crash if it differs. This makes it possible to
+detect both memory overflows and objects released to the wrong pool (code bug
+resulting from a copy-paste error typically).
+
+Thus an object will look like this depending whether it's in the cache or is
+currently in use:
+
+           in cache                  in use
+        +------------+          +------------+
+     <--+ by_pool.p  |          |  N bytes   |
+        | by_pool.n  +-->       |            |
+        +------------+          |N=16 min on |
+     <--+ by_lru.p   |          |  32-bit,   |
+        | by_lru.n   +-->       | 32 min on  |
+        +------------+          |   64-bit   |
+        :            :          :            :
+        |  N bytes   |          |            |
+        +------------+          +------------+  \ optional, only if
+        :  (unused)  :          :  pool ptr  :   > DEBUG_MEMORY_POOLS
+        +------------+          +------------+  / is set at build time
+                                                   or -dMtag at boot time
+
+Right now no provisions are made to return objects aligned on larger boundaries
+than those currently covered by malloc() (i.e. two pointers). This need appears
+from time to time and the layout above might evolve a little bit if needed.
+
+
+4. Storage in the process-wide shared pool
+------------------------------------------
+
+In order for the shared pool not to be a contention point in a multi-threaded
+environment, objects are allocated from or released to shared pools by clusters
+of a few objects at once. The maximum number of objects that may be moved to or
+from a shared pool at once is defined by CONFIG_HAP_POOL_CLUSTER_SIZE at build
+time, and currently defaults to 8.
+
+In order to remain scalable, the shared pool has to make some tradeoffs to
+limit the number of atomic operations and the duration of any locked operation.
+As such, it's composed of a single-linked list of clusters, themselves made of
+a single-linked list of objects.
+
+Clusters and objects are of the same type "pool_item" and are accessed from the
+pool's "free_list" member. This member points to the latest pool_item inserted
+into the pool by a release operation. And the pool_item's "next" member points
+to the next pool_item, which was the one present in the pool's free_list just
+before the pool_item was inserted, and the last pool_item in the list simply
+has a NULL "next" field.
+
+The pool_item's "down" pointer points down to the next objects part of the same
+cluster, that will be released or allocated at the same time as the first one.
+Each of these items also has a NULL "next" field, and they are chained by their
+respective "down" pointers until the last one is detected by a NULL value.
+
+This results in the following layout:
+
+ pool pool_item pool_item pool_item
+ +-----------+ +------+ +------+ +------+
+ | free_list +--> | next +--> | next +--> | NULL |
+ +-----------+ +------+ +------+ +------+
+ | down | | NULL | | down |
+ +--+---+ +------+ +--+---+
+ | |
+ V V
+ +------+ +------+
+ | NULL | | NULL |
+ +------+ +------+
+ | down | | NULL |
+ +--+---+ +------+
+ |
+ V
+ +------+
+ | NULL |
+ +------+
+ | NULL |
+ +------+
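+
+As an illustration only, the layout above can be modelled with a simplified
+stand-in for "pool_item". The helper names cluster_len() and pool_len() are
+made up for this sketch and are not part of the pools code; they merely show
+that "next" is followed between clusters and "down" inside a cluster:
+
```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the shared pool's item type: "next" links the
 * first item of each cluster, "down" links the items of a single cluster.
 */
struct pool_item {
	struct pool_item *next;
	struct pool_item *down;
};

/* Count the objects of the cluster starting at <head> by following "down". */
size_t cluster_len(const struct pool_item *head)
{
	size_t n = 0;

	for (; head; head = head->down) {
		/* only the first item of a cluster may carry a "next" pointer */
		assert(n == 0 || head->next == NULL);
		n++;
	}
	return n;
}

/* Count all objects reachable from <free_list>, one cluster at a time. */
size_t pool_len(const struct pool_item *free_list)
{
	size_t n = 0;

	for (; free_list; free_list = free_list->next)
		n += cluster_len(free_list);
	return n;
}
```
+
+With the layout drawn above (clusters of 3, 1 and 2 objects), pool_len() called
+on the free_list reports 6 objects.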
+
+Allocating an entry is only a matter of performing two atomic operations on
+the free_list and reading the first pool_item's "next" value:
+
+ - atomically mark the free_list as being updated by writing a "magic" pointer
+ - read the first pool_item's "next" field
+ - atomically replace the free_list with this value
+
+This results in a fast operation that instantly retrieves a cluster at once.
+Then outside of the critical section entries are walked over and inserted into
+the local cache one at a time. In order to keep the code simple and efficient,
+objects allocated from the shared pool are all placed into the local cache, and
+only then the first one is allocated from the cache. This operation is
+performed by the dedicated function pool_refill_local_from_shared() which is
+called from pool_get_from_cache() when the cache is empty. It means there is an
+overhead of two list insert/delete operations for the first object and that
+could be avoided at the expense of more complex code in the fast path, but this
+is negligible since it only concerns objects that need to be visited anyway.
+
+Freeing a group of objects consists in performing the operation the other way
+around:
+
+ - atomically mark the free_list as being updated by writing a "magic" pointer
+ - write the free_list value to the to-be-released item's "next" entry
+ - atomically replace the free_list with the pool_item's pointer
+
+The cluster will simply have to be prepared before being sent to the shared
+pool. The operation of releasing a cluster at once is performed by function
+pool_put_to_shared_cache() which is called from pool_evict_last_items() which
+itself is responsible for building the clusters.
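+
+The two sequences above (allocating and releasing a cluster) may be sketched
+with C11 atomics as follows. This is a minimal model and not the actual code:
+the LIST_BUSY marker value, the "shared_pool" type and the lock_head(),
+get_cluster() and put_cluster() names are assumptions made for the example.
+
```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct pool_item {
	struct pool_item *next;
	struct pool_item *down;
};

struct shared_pool {
	_Atomic(struct pool_item *) free_list;
};

/* Hypothetical "magic" pointer meaning "free_list is being updated". */
#define LIST_BUSY ((struct pool_item *)1)

/* Atomically take ownership of the list head by writing the marker, and
 * retry for as long as another thread's marker is read back. Returns the
 * previous head, which may be NULL.
 */
static struct pool_item *lock_head(struct shared_pool *pool)
{
	struct pool_item *head;

	do {
		head = atomic_exchange(&pool->free_list, LIST_BUSY);
	} while (head == LIST_BUSY);
	return head;
}

/* Detach and return the first cluster, or NULL if the pool is empty. */
struct pool_item *get_cluster(struct shared_pool *pool)
{
	struct pool_item *head = lock_head(pool);

	/* publishing the new head also removes the marker */
	atomic_store(&pool->free_list, head ? head->next : NULL);
	if (head)
		head->next = NULL;
	return head;
}

/* Insert the cluster whose first item is <item> at the head of the pool. */
void put_cluster(struct shared_pool *pool, struct pool_item *item)
{
	assert(item != NULL && item != LIST_BUSY);
	item->next = lock_head(pool);
	atomic_store(&pool->free_list, item);
}
```
+
+In this model the magic pointer acts as a very short lock: a thread which reads
+it back from the exchange simply retries until the owner publishes the new
+head. The "down" chain is never touched while the marker is in place.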
+
+Due to the way objects are stored, it is important to try to group objects as
+much as possible when releasing them because this is what will condition their
+retrieval as groups as well. This is the reason why pool_evict_last_items()
+uses the LRU to find a first entry but tries to pick several items at once from
+a single cache. Tests have shown that CONFIG_HAP_POOL_CLUSTER_SIZE set to 8
+achieves up to 6-6.5 objects on average per operation, which effectively
+divides by as much the average time spent per object by each thread and pushes
+the contention point further.
+
+Also, grouping items in clusters is a property of the process-wide shared pool
+and not of the thread-local caches. This means that there is no grouped
+operation when not using the shared pool (mode "2" in the diagram above).
+
+
+5. API
+------
+
+The following functions are public and available for user code:
+
+struct pool_head *create_pool(char *name, uint size, uint flags)
+ Create a new pool named <name> for objects of size <size> bytes. Pool
+ names are truncated to their first 11 characters. Pools of very similar
+ size will usually be merged if both have set the flag MEM_F_SHARED in
+ <flags>. When DEBUG_DONT_SHARE_POOLS was set at build time, or
+ "-dMno-merge" is passed on the executable's command line, the pools
+ also need to have the exact same name to be merged. In addition, unless
+ MEM_F_EXACT is set in <flags>, the object size will usually be rounded
+ up to the size of pointers (16 or 32 bytes). The name that will appear
+ in the pool upon merging is the name of the first created pool. The
+ returned pointer is the new (or reused) pool head, or NULL upon error.
+ Pools created this way must be destroyed using pool_destroy().
+
+void *pool_destroy(struct pool_head *pool)
+ Destroy pool <pool>, that is, all of its unused objects are freed and
+ the structure is freed as well if the pool didn't have any used objects
+ anymore. In this case NULL is returned. If some objects remain in use,
+ the pool is preserved and its pointer is returned. This ought to be
+ used essentially on exit or in rare situations where some internal
+ entities that hold pools have to be destroyed.
+
+void pool_destroy_all(void)
+ Destroy all pools, without checking which ones still have used entries.
+ This is only meant for use on exit.
+
+void *__pool_alloc(struct pool_head *pool, uint flags)
+ Allocate an entry from the pool <pool>. The allocator will first look
+ for an object in the thread-local cache if enabled, then in the shared
+ pool if enabled, then will fall back to the operating system's default
+ allocator. NULL is returned if the object couldn't be allocated (due to
+ configured limits or lack of memory). Objects allocated this way have to
+ be released using pool_free(). Like with malloc(), by default the
+ contents of the returned object are undefined. If memory poisoning is
+ enabled, the object will be filled with the poisoning byte. If the
+ global "tune.fail-alloc" setting is non-zero and DEBUG_FAIL_ALLOC is
+ enabled, a random number generator will be called to randomly return a
+ NULL. The allocator's behavior may be adjusted using a few flags passed
+ in <flags>:
+ - POOL_F_NO_POISON : when set, disables memory poisoning (e.g. when
+ pointless and expensive, like for buffers)
+ - POOL_F_MUST_ZERO : when set, the memory area will be zeroed before
+ being returned, similar to what calloc() does
+ - POOL_F_NO_FAIL : when set, disables the random allocation failure,
+ e.g. for use during early init code or critical sections.
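+
+ The effect of the first two flags can be modelled outside of HAProxy with
+ a toy allocator built on malloc(). Everything prefixed with "toy_" or
+ "TOY_" below, including the flag values and the poison byte, is made up
+ for this sketch; giving zeroing precedence over poisoning is also an
+ assumption of the sketch:
+
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag values and poison byte, chosen only for this sketch. */
#define TOY_F_NO_POISON 0x1u
#define TOY_F_MUST_ZERO 0x2u
#define TOY_POISON_BYTE 0x50

/* Allocate <size> bytes, then post-process the area according to <flags>.
 * <poison_enabled> stands for the global "memory poisoning" setting.
 */
void *toy_alloc(size_t size, unsigned int flags, int poison_enabled)
{
	unsigned char *p;

	assert(size > 0);
	p = malloc(size);
	if (!p)
		return NULL;

	if (flags & TOY_F_MUST_ZERO)
		memset(p, 0, size);               /* calloc()-like result */
	else if (poison_enabled && !(flags & TOY_F_NO_POISON))
		memset(p, TOY_POISON_BYTE, size); /* make stale reads visible */
	return p;
}
```
+
+ A caller requesting TOY_F_MUST_ZERO always gets a zeroed area, while a
+ caller passing no flag gets the poison byte whenever poisoning is on.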
+
+void *pool_alloc(struct pool_head *pool)
+ This is an exact equivalent of __pool_alloc(pool, 0). It is the regular
+ way to allocate entries from a pool.
+
+void *pool_alloc_nocache(struct pool_head *pool)
+ Allocate an entry from the pool <pool>, bypassing the cache. If shared
+ pools are enabled, they will be consulted first. Otherwise the object
+ is allocated using the operating system's default allocator. This is
+ essentially used during early boot to pre-allocate a number of objects
+ for pools which require a minimum number of entries to exist.
+
+void *pool_zalloc(struct pool_head *pool)
+ This is an exact equivalent of __pool_alloc(pool, POOL_F_MUST_ZERO).
+
+void pool_free(struct pool_head *pool, void *ptr)
+ Free an entry allocated from one of the pool_alloc() functions above
+ from pool <pool>. The object will be placed into the thread-local cache
+ if enabled, or in the shared pool if enabled, or will be released using
+ the operating system's default allocator. When a local cache is
+ enabled, if the local cache size becomes larger than 75% of the maximum
+ size configured at build time, some objects will be evicted to the
+ shared pool. Such objects are taken first from the same pool, but if
+ the total size is really huge, other pools might be checked as well.
+ Some debugging options enabled at build time may enforce extra checks so
+ that the process will immediately crash if the object was not allocated
+ from this pool or experienced an overflow or some memory corruption.
+
+void pool_flush(struct pool_head *pool)
+ Free all unused objects from shared pool <pool>. Thread-local caches
+ are not affected. This is essentially used when running low on memory
+ or when stopping, in order to release a maximum amount of memory for
+ the new process.
+
+void pool_gc(struct pool_head *pool)
+ Free all unused objects from all pools, but respecting the minimum
+ number of spare objects required for each of them. Then, for operating
+ systems which support it, indicate to the system that all unused memory
+ can be released. Thread-local caches are not affected. This operation
+ differs from pool_flush() in that it is run locklessly, under thread
+ isolation, and on all pools in a row. It is called by the SIGQUIT
+ signal handler and upon exit. Note that the obsolete argument <pool> is
+ not used and the convention is to pass NULL there.
+
+void dump_pools_to_trash(void)
+ Dump the current status of all pools into the trash buffer. This is
+ essentially used by the "show pools" CLI command or the SIGQUIT signal
+ handler to dump them on stderr. The total report size may not exceed
+ the size of the trash buffer. If it does, some entries will be missing.
+
+void dump_pools(void)
+ Dump the current status of all pools to stderr. This just calls
+ dump_pools_to_trash() and writes the trash to stderr.
+
+int pool_total_failures(void)
+ Report the total number of failed allocations. This is solely used to
+ report the "PoolFailed" metrics of the "show info" output. The total
+ is calculated on the fly by summing the number of failures in all pools
+ and is only meant to be used as an indicator rather than a precise
+ measure.
+
+ullong pool_total_allocated(void)
+ Report the total number of bytes allocated in all pools, for reporting
+ in the "PoolAlloc_MB" field of the "show info" output. The total is
+ calculated on the fly by summing the number of allocated bytes in all
+ pools and is only meant to be used as an indicator rather than a
+ precise measure.
+
+ullong pool_total_used(void)
+ Report the total number of bytes used in all pools, for reporting in
+ the "PoolUsed_MB" field of the "show info" output. The total is
+ calculated on the fly by summing the number of used bytes in all pools
+ and is only meant to be used as an indicator rather than a precise
+ measure. Note that objects present in caches are accounted as used.
+
+Some other functions exist and are only used by the pools code itself. While
+not strictly forbidden to use outside of this code, it is generally recommended
+to avoid touching them in order not to create undesired dependencies that will
+complicate maintenance.
+
+A few macros exist to ease the declaration of pools:
+
+DECLARE_POOL(ptr, name, size)
+ Placed at the top level of a file, this declares a global memory pool
+ as variable <ptr>, name <name> and size <size> bytes per element. This
+ is made via a call to REGISTER_POOL() and by assigning the resulting
+ pointer to variable <ptr>. <ptr> will be created of type "struct
+ pool_head *". If the pool needs to be visible outside of the file
+ (which is likely), it will also need to be declared somewhere as
+ "extern struct pool_head *<ptr>;". It is recommended to place such
+ declarations very early in the source file so that the variable is
+ already known to all subsequent functions which may use it.
+
+DECLARE_STATIC_POOL(ptr, name, size)
+ Placed at the top level of a file, this declares a static memory pool
+ as variable <ptr>, name <name> and size <size> bytes per element. This
+ is made via a call to REGISTER_POOL() and by assigning the resulting
+ pointer to local variable <ptr>. <ptr> will be created of type "static
+ struct pool_head *". It is recommended to place such declarations very
+ early in the source file so that the variable is already known to all
+ subsequent functions which may use it.
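+
+ To give an idea of what such a top-level declaration achieves, here is a
+ toy equivalent which can be built outside of HAProxy. It is only a
+ sketch: TOY_DECLARE_STATIC_POOL(), toy_create_pool() and "struct
+ toy_pool" are invented names, and the GCC/Clang "constructor" attribute
+ is used as a stand-in for REGISTER_POOL(), whose internals are not
+ described here:
+
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_pool {
	char name[12];      /* 11 characters plus the trailing zero */
	unsigned int size;  /* size of one object, in bytes */
};

/* Create a pool descriptor; the name is truncated to 11 characters. */
struct toy_pool *toy_create_pool(const char *name, unsigned int size)
{
	struct toy_pool *pool = calloc(1, sizeof(*pool));

	if (pool) {
		strncpy(pool->name, name, sizeof(pool->name) - 1);
		pool->size = size;
	}
	return pool;
}

/* Declare a file-local pool pointer and fill it before main() starts. */
#define TOY_DECLARE_STATIC_POOL(ptr, name, size)                        \
	static struct toy_pool *ptr;                                    \
	static void __attribute__((constructor)) toy_init_##ptr(void)   \
	{                                                               \
		ptr = toy_create_pool(name, size);                      \
		assert(ptr != NULL);                                    \
	}

TOY_DECLARE_STATIC_POOL(pool_demo, "demo_pool_with_long_name", 64)
```
+
+ Any function placed after the declaration may then use pool_demo
+ directly, since the pointer is already assigned when main() is entered.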
+
+
+6. Build options
+----------------
+
+A number of build-time defines allow tuning the pools' behavior. All of them
+have to be enabled using "-Dxxx" or "-Dxxx=yyy" in the makefile's DEBUG
+variable.
+
+DEBUG_NO_POOLS
+ When this is set, pools are entirely disabled, and allocations are made
+ using malloc() instead. This is not recommended for production but may
+ be useful for tracing allocations. It corresponds to "-dMno-cache" at
+ boot time.
+
+DEBUG_MEMORY_POOLS
+ When this is set, an extra pointer is allocated at the end of each
+ object to reference the pool the object was allocated from and detect
+ buffer overflows. Then, pool_free() will provoke a crash in case it
+ detects an anomaly (pointer at the end not matching the pool). It
+ corresponds to "-dMtag" at boot time.
+
+DEBUG_FAIL_ALLOC
+ When enabled, a global setting "tune.fail-alloc" may be set to a non-
+ zero value representing a percentage of memory allocations that will be
+ made to fail in order to stress the calling code. It corresponds to
+ "-dMfail" at boot time.
+
+DEBUG_DONT_SHARE_POOLS
+ When enabled, pools of similar sizes are not merged unless they have the
+ exact same name. It corresponds to "-dMno-merge" at boot time.
+
+DEBUG_UAF
+ When enabled, pools are disabled and all allocations and releases pass
+ through mmap() and munmap(). The memory usage significantly inflates
+ and the performance degrades, but this allows detecting a lot of
+ use-after-free conditions by crashing the program at the first abnormal
+ access. This should not be used in production.
+
+DEBUG_POOL_INTEGRITY
+ When enabled, objects picked from the cache are checked for corruption
+ by comparing their contents against a pattern that was placed when they
+ were inserted into the cache. Objects are also allocated in the reverse
+ order, from the oldest one to the most recent, so as to maximize the
+ ability to detect such a corruption. The goal is to detect writes after
+ free (or possibly hardware memory corruptions). Contrary to DEBUG_UAF
+ this cannot detect reads after free, but may possibly detect later
+ corruptions and will not consume extra memory. The CPU usage will
+ increase a bit due to the cost of filling/checking the area and for the
+ preference for cold cache instead of hot cache, though not as much as
+ with DEBUG_UAF. This option is meant to be usable in production. It
+ corresponds to boot-time options "-dMcold-first,integrity".
+
+DEBUG_POOL_TRACING
+ When enabled, the callers of pool_alloc() and pool_free() will be
+ recorded into an extra memory area placed after the end of the object.
+ This may only be required by developers who want to get a few more
+ hints about code paths involved in some crashes, but will serve no
+ purpose outside of this. It remains compatible with (and complements well)
+ DEBUG_POOL_INTEGRITY above. Such information becomes meaningless once
+ the objects leave the thread-local cache. It corresponds to boot-time
+ option "-dMcaller".
+
+DEBUG_MEM_STATS
+ When enabled, all malloc/calloc/realloc/strdup/free calls are accounted
+ for per call place (file+line number), and may be displayed or reset on
+ the CLI using "debug dev memstats". This is essentially used to detect
+ potential leaks or abnormal usages. When pools are enabled (default),
+ such calls are rare and the output will mostly contain calls induced by
+ libraries. When pools are disabled, about all calls to pool_alloc() and
+ pool_free() will also appear since they will be remapped to standard
+ functions.
+
+CONFIG_HAP_GLOBAL_POOLS
+ When enabled, process-wide shared pools will be forcefully enabled even
+ if not considered useful on the platform. The default is to let haproxy
+ decide based on the OS and C library. It corresponds to boot-time
+ option "-dMglobal".
+
+CONFIG_HAP_NO_GLOBAL_POOLS
+ When enabled, process-wide shared pools will be forcefully disabled
+ even if considered useful on the platform. The default is to let
+ haproxy decide based on the OS and C library. It corresponds to
+ boot-time option "-dMno-global".
+
+CONFIG_HAP_POOL_CACHE_SIZE
+ This allows one to define the size of the per-thread cache, in bytes.
+ The default value is 512 kB (524288). Smaller values will use less
+ memory at the expense of a possibly higher CPU usage when using many
+ threads. Higher values will give diminishing returns on performance
+ while using much more memory. Usually there is no benefit in using
+ more than a per-core L2 cache size. It would be better not to set this
+ value lower than a few times the size of a buffer (bufsize, defaults to
+ 16 kB).
+
+CONFIG_HAP_POOL_CLUSTER_SIZE
+ This allows one to define the maximum number of objects that will be
+ grouped together in an allocation from the shared pool. Values 4 to 8
+ have experimentally shown good results with 16 threads. On systems with
+ more cores or loosely coupled caches exhibiting slow atomic operations,
+ it could possibly make sense to slightly increase this value.
diff --git a/doc/internals/api/scheduler.txt b/doc/internals/api/scheduler.txt
new file mode 100644
index 0000000..3469543
--- /dev/null
+++ b/doc/internals/api/scheduler.txt
@@ -0,0 +1,226 @@
+2021-11-17 - Scheduler API
+
+
+1. Background
+-------------
+
+The scheduler relies on two major parts:
+ - the wait queue or timers queue, which contains an ordered tree of the next
+ timers to expire
+
+ - the run queue, which contains tasks that were already woken up and are
+ waiting for a CPU slot to execute.
+
+There are two types of schedulable objects in HAProxy:
+ - tasks: they contain one timer and can be in the run queue without leaving
+ their place in the timers queue.
+
+ - tasklets: they do not have the timers part and are either sleeping or
+ running.
+
+The timers queue and the run queue each exist in two forms: one shared between
+all threads, and one per thread. A task or tasklet may only be queued in one of
+each at a time. The thread-local queues are not thread-safe while the shared
+ones are. This means that it is only permitted to manipulate an object which
+is in the local queue, or in a shared queue but only after locking it. As such
+tasks and tasklets are usually pinned to threads and do not move, or only in
+very specific ways not detailed here.
+
+In case of doubt, keep in mind that it's not permitted to manipulate another
+thread's private task or tasklet, and that any task held by another thread
+might vanish while it's being looked at.
+
+Internally a large part of the task and tasklet struct is shared between
+the two types, which reduces code duplication and eases the preservation
+of fairness in the run queue by interleaving all of them. As such, some
+fields or flags may not always be relevant to tasklets and may be ignored.
+
+
+Tasklets do not use a thread mask but use a thread ID instead, to which they
+are bound. If the thread ID is negative, the tasklet is not bound but may only
+be run on the calling thread.
+
+
+2. API
+------
+
+There are few functions exposed by the scheduler. A few more are in fact
+accessible, but if not documented here they are better avoided or used only
+when absolutely certain they're suitable, as some have delicate corner cases.
+When in doubt, checking the sched.pdf diagram may help.
+
+int total_run_queues()
+ Return the approximate number of tasks in run queues. This is racy
+ and a bit inaccurate as it iterates over all queues, but it is
+ sufficient for stats reporting.
+
+int task_in_rq(t)
+ Return non-zero if the designated task is in the run queue (i.e. it was
+ already woken up).
+
+int task_in_wq(t)
+ Return non-zero if the designated task is in the timers queue (i.e. it
+ has a valid timeout and will eventually expire).
+
+int thread_has_tasks()
+ Return non-zero if the current thread has some work to be done in the
+ run queue. This is used to decide whether or not to sleep in poll().
+
+void task_wakeup(t, f)
+ Will make sure task <t> will wake up, that is, will execute at least
+ once after the function is called. The task flags <f> will
+ be ORed on the task's state, among TASK_WOKEN_* flags exclusively. In
+ multi-threaded environments it is safe to wake up another thread's task
+ and even if the thread is sleeping it will be woken up. Users have to
+ keep in mind that a task running on another thread might very well
+ finish and go back to sleep before the function returns. It is
+ permitted to wake the current task up, in which case it will be
+ scheduled to run another time after it returns to the scheduler.
+
+struct task *task_unlink_wq(t)
+ Remove the task from the timers queue if it was in it, and return it.
+ It may only be done for the local thread, or for a shared task that
+ might be in the shared queue. It must not be done for another thread's
+ task.
+
+void task_queue(t)
+ Place or update task <t> into the timers queue, where it may already
+ be, scheduling it for an expiration at date t->expire. If t->expire is
+ infinite, nothing is done, so it's safe to call this function without
+ prior checking the expiration date. It is only valid to call this
+ function for local tasks or for shared tasks which have the calling
+ thread in their thread mask.
+
+void task_set_affinity(t, m)
+ Change task <t>'s thread_mask to new value <m>. This may only be
+ performed by the task itself while running. This is only used to let a
+ task voluntarily migrate to another thread.
+
+void tasklet_wakeup(tl)
+ Make sure that tasklet <tl> will wake up, that is, will execute at
+ least once. The tasklet will run on its assigned thread, or on any
+ thread if its TID is negative.
+
+void tasklet_wakeup_on(tl, thr)
+ Make sure that tasklet <tl> will wake up on thread <thr>, that is, will
+ execute at least once. The designated thread may only differ from the
+ calling one if the tasklet is already configured to run on another
+ thread, and it is not permitted to self-assign a tasklet if its tid is
+ negative, as it may already be scheduled to run somewhere else. Just in
+ case, only use tasklet_wakeup() which will pick the tasklet's assigned
+ thread ID.
+
+struct tasklet *tasklet_new()
+ Allocate a new tasklet and set it to run by default on the calling
+ thread. The caller may change its tid to another one before using it.
+ The new tasklet is returned.
+
+struct task *task_new_anywhere()
+ Allocate a new task to run on any thread, and return the task, or NULL
+ in case of allocation issue. Note that such tasks will be marked as
+ shared and will go through the locked queues, thus their activity will
+ be heavier than for other ones. See also task_new_here().
+
+struct task *task_new_here()
+ Allocate a new task to run on the calling thread, and return the task,
+ or NULL in case of allocation issue.
+
+struct task *task_new_on(t)
+ Allocate a new task to run on thread <t>, and return the task, or NULL
+ in case of allocation issue.
+
+void task_destroy(t)
+ Destroy this task. The task will be unlinked from any timers queue,
+ and either immediately freed, or asynchronously killed if currently
+ running. This may only be done by one of the threads this task is
+ allowed to run on. Developers must not forget that the task's memory
+ area is not always immediately freed, and that certain misuses could
+ only have effect later down the chain (e.g. use-after-free).
+
+void tasklet_free(tl)
+ Free tasklet <tl>, which must not be running, so that may only be
+ called by the thread responsible for the tasklet, typically the
+ tasklet's process() function itself.
+
+void task_schedule(t, d)
+ Schedule task <t> to run no later than date <d>. If the task is already
+ running, or scheduled for an earlier instant, nothing is done. If the
+ task was not queued or was scheduled to run later, its timer entry
+ will be updated. This function assumes that it will never be called
+ with a timer in the past nor with TICK_ETERNITY. Only one of the
+ threads assigned to the task may call this function.
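+
+ The rules above may be summarized by the following toy model. It is
+ not the scheduler's code: the "toy_timer" structure, the function names
+ and the TOY_TICK_ETERNITY value are assumptions made for this sketch,
+ and dates are plain free-running counters compared in a wrap-safe way
+ (assuming a 32-bit unsigned int):
+
```c
#include <assert.h>
#include <stdbool.h>

#define TOY_TICK_ETERNITY 0u /* hypothetical "no expiration" value */

struct toy_timer {
	bool running;        /* already in the run queue */
	bool queued;         /* present in the timers queue */
	unsigned int expire; /* expiration date, meaningful when queued */
};

/* Wrap-safe "a is strictly before b" for free-running tick counters. */
bool toy_tick_is_lt(unsigned int a, unsigned int b)
{
	return (int)(a - b) < 0;
}

/* Make sure the timer fires no later than <when>. Returns true if the
 * timer entry was created or moved to an earlier date.
 */
bool toy_schedule(struct toy_timer *t, unsigned int when)
{
	assert(when != TOY_TICK_ETERNITY);

	if (t->running)
		return false;   /* will run anyway, nothing to do */
	if (t->queued && !toy_tick_is_lt(when, t->expire))
		return false;   /* already due at <when> or earlier */
	t->expire = when;
	t->queued = true;
	return true;
}
```
+
+ Calling toy_schedule() with a later date than the current one is thus a
+ no-op, which is what makes "no later than" requests cheap to repeat.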
+
+The task's ->process() function receives the following arguments:
+
+ - struct task *t: a pointer to the task itself. It is always valid.
+
+ - void *ctx : a copy of the task's ->context pointer at the moment
+ the ->process() function was called by the scheduler. A
+ function must use this and not task->context, because
+ task->context might possibly be changed by another thread.
+ For instance, the muxes' takeover() functions do this.
+
+ - uint state : a copy of the task's ->state field at the moment the
+ ->process() function was executed. A function must use
+ this and not task->state as the latter misses the wakeup
+ reasons and may constantly change during execution along
+ concurrent wakeups (threads or signals).
+
+The possible state flags to use during a call to task_wakeup() or seen by the
+task being called are the following; they're automatically cleaned from the
+state field before the call to ->process()
+
+ - TASK_WOKEN_INIT each creation of a task causes a first wakeup with this
+ flag set. Applications should not set it themselves.
+
+ - TASK_WOKEN_TIMER this indicates the task's expire date was reached in the
+ timers queue. Applications should not set it themselves.
+
+ - TASK_WOKEN_IO indicates the wake-up happened due to I/O activity. Now
+ that all low-level I/O processing happens on tasklets,
+ this notion of I/O is now application-defined (for
+ example stream-interfaces use it to notify the stream).
+
+ - TASK_WOKEN_SIGNAL indicates that a signal the task was subscribed to was
+ received. Applications should not set it themselves.
+
+ - TASK_WOKEN_MSG any application-defined wake-up reason, usually for
+ inter-task communication (e.g. filters vs streams).
+
+ - TASK_WOKEN_RES a resource the task was waiting for was finally made
+ available, allowing the task to continue its work. This
+ is essentially used by buffers and queues. Applications
+ may carefully use it for their own purpose if they're
+ certain not to rely on existing ones.
+
+ - TASK_WOKEN_OTHER any other application-defined wake-up reason.
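+
+The reason why ->process() must rely on its <state> argument rather than on
+task->state can be shown with a small model. This is a sketch under stated
+assumptions: the "toy_task" structure, the TOY_* bit values and the function
+names are invented, and the handler returns the reasons it saw only to keep
+the example short:
+
```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical bit values; only the idea of wake-up reasons is modelled. */
#define TOY_WOKEN_TIMER 0x01u
#define TOY_WOKEN_IO    0x02u
#define TOY_WOKEN_MSG   0x04u
#define TOY_WOKEN_ANY   0x07u
#define TOY_RUNNING     0x10u /* stands for any persistent state bit */

struct toy_task {
	atomic_uint state;
	unsigned int (*process)(struct toy_task *t, void *ctx, unsigned int state);
	void *context;
};

/* OR a wake-up reason into the task's state; callable from any thread. */
void toy_wakeup(struct toy_task *t, unsigned int reason)
{
	assert((reason & ~TOY_WOKEN_ANY) == 0);
	atomic_fetch_or(&t->state, reason);
}

/* Snapshot the state while clearing the wake-up reasons from the task,
 * then call the handler with copies of the state and of the context.
 */
unsigned int toy_run(struct toy_task *t)
{
	unsigned int seen = atomic_fetch_and(&t->state, ~TOY_WOKEN_ANY);
	void *ctx = t->context;

	return t->process(t, ctx, seen);
}

/* Example handler: report the reasons found in the copy it was given. */
unsigned int toy_handler(struct toy_task *t, void *ctx, unsigned int state)
{
	(void)t;
	(void)ctx;
	return state & TOY_WOKEN_ANY;
}
```
+
+After two wake-ups with distinct reasons, the handler sees both of them in its
+argument while the task's own state field only retains the persistent bits.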
+
+
+In addition, a few persistent flags may be observed or manipulated by the
+application, both for tasks and tasklets:
+
+ - TASK_SELF_WAKING when set, indicates that this task was found waking
+ itself up, and its class will change to bulk processing.
+ If this behavior is under control and temporarily expected,
+ and it is not expected to happen again, it may make
+ sense to reset this flag from the ->process() function
+ itself.
+
+ - TASK_HEAVY when set, indicates that this task performs such heavy
+ processing that it will become mandatory to give back
+ control to I/Os, otherwise big latencies might occur. It
+ may be set by an application that expects something
+ heavy to happen (tens to hundreds of microseconds), and
+ reset once finished. An example user is the TLS stack
+ which sets it when an imminent crypto operation is
+ expected.
+
+ - TASK_F_USR1 This is the first application-defined persistent flag.
+ It is always zero unless the application changes it. An
+ example of use cases is the I/O handler for backend
+ connections, to mention whether the connection is safe
+ to use or might have recently been migrated.
+
+Finally, when built with -DDEBUG_TASK, an extra sub-structure "debug" is added
+to both tasks and tasklets to note the code locations of the last two calls to
+task_wakeup() and tasklet_wakeup().