1 files changed, 585 insertions, 0 deletions
diff --git a/doc/internals/api/pools.txt b/doc/internals/api/pools.txt
new file mode 100644
index 0000000..d84fb9d
--- /dev/null
+++ b/doc/internals/api/pools.txt
@@ -0,0 +1,585 @@
+2022-02-24 - Pools structure and API
+
+1. Background
+-------------
+
+Memory allocation is a complex problem covered by a massive amount of
+literature. Memory allocators found in the field cover a broad spectrum of
+capabilities, performance, fragmentation, efficiency etc.
+
+The main difficulty of memory allocation comes from finding the optimal chunks
+for arbitrarily sized requests while still preserving a low fragmentation
+level. Doing this well is often expensive in CPU usage and/or memory usage.
+
+In programs like HAProxy that deal with a large number of fixed size objects,
+there is no point having to endure all this risk of fragmentation, and the
+associated costs (sometimes up to several milliseconds with certain minimalist
+allocators) are simply not acceptable. A better approach consists in grouping
+frequently used objects by size, knowing that due to the high repetitiveness of
+operations, a freed object will immediately be needed for another operation.
+
+This grouping of objects by size is what is called a pool. Pools are created
+for certain frequently allocated objects, are usually merged together when they
+are of the same size (or almost the same size), and significantly reduce the
+number of calls to the memory allocator.
+
+With the arrival of threads, pools started to become a bottleneck so they now
+implement an optional thread-local lockless cache. Finally, with the arrival of
+really efficient memory allocators in modern operating systems, the shared part
+has also become optional so that it doesn't consume memory if it does not bring
+any value.
+
+In 2.6-dev2, a number of debugging options that used to be configurable at
+build time only became boot-time settings. They can be modified using keywords
+passed after "-dM" on the command line, which set or clear bits in the
+pool_debugging variable. The build-time options still affect the default
+settings, however. Default values may be consulted using "haproxy -dMhelp".
+
+
+2. Principles
+-------------
+
+The pools architecture is selected at build time. The main options are:
+
+  - thread-local caches and process-wide shared pool enabled (1)
+
+    This is the default situation on most operating systems. Each thread has
+    its own local cache, and when depleted it refills from the process-wide
+    pool that avoids calling the standard allocator too often. It is possible
+    to force this mode at build time by setting CONFIG_HAP_GLOBAL_POOLS or at
+    boot time with "-dMglobal".
+
+  - thread-local caches only are enabled (2)
+
+    This is the situation on operating systems where a fast and modern memory
+    allocator is detected and when it is estimated that the process-wide shared
+    pool will not bring any benefit. This detection is automatic at build time,
+    but may also be forced at build time by setting CONFIG_HAP_NO_GLOBAL_POOLS
+    or at boot time with "-dMno-global".
+
+  - pass-through to the standard allocator (3)
+
+    This is used when one absolutely wants to disable pools and rely on regular
+    malloc() and free() calls, essentially in order to trace memory allocations
+    by call points, either internally via DEBUG_MEM_STATS, or externally via
+    tools such as Valgrind. This mode of operation may be forced at build time
+    by setting DEBUG_NO_POOLS or at boot time with "-dMno-cache".
+
+  - pass-through to an mmap-based allocator for debugging (4)
+
+    This is used only during deep debugging when trying to detect various
+    conditions such as use-after-free. In this case each allocated object's
+    size is rounded up to a multiple of a page size (4096 bytes) and an
+    integral number of pages is allocated for each object using mmap(),
+    surrounded by two inaccessible holes that aim to detect some out-of-bounds
+    accesses. Released objects are instantly freed using munmap() so that any
+    immediate subsequent access to the memory area crashes the process if the
+    area had not been reallocated yet. This mode can be enabled at build time
+    by setting DEBUG_UAF, or at run time by disabling pools and enabling UAF
+    with "-dMuaf". It tends to consume a lot of memory and not to scale at all
+    with concurrent calls, which tends to make the system stall. The watchdog
+    may even trigger on some slow allocations.
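+
+As an illustration only, the mmap-based mode (4) can be sketched as below. This
+is not HAProxy's actual code: the names round_to_pages(), uaf_alloc(),
+uaf_free() and PAGE_SZ are hypothetical, and the two inaccessible holes are
+modeled here as PROT_NONE guard pages, which is an assumption about how such
+holes may be obtained. The sketch also assumes 4096-byte system pages.
+
+```c
+#include <stddef.h>
+#include <sys/mman.h>
+
+#define PAGE_SZ 4096UL	/* assumed page size, as stated in the text above */
+
+/* round <size> up to the next multiple of the page size */
+static size_t round_to_pages(size_t size)
+{
+	return (size + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
+}
+
+/* map one guard page, the object's pages, then another guard page */
+static void *uaf_alloc(size_t size)
+{
+	size_t len = round_to_pages(size);
+	char *base;
+
+	base = mmap(NULL, len + 2 * PAGE_SZ, PROT_NONE,
+	            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (base == MAP_FAILED)
+		return NULL;
+
+	/* only the middle part becomes accessible */
+	if (mprotect(base + PAGE_SZ, len, PROT_READ | PROT_WRITE) != 0) {
+		munmap(base, len + 2 * PAGE_SZ);
+		return NULL;
+	}
+	return base + PAGE_SZ;
+}
+
+/* unmap everything at once: any later access to the area will fault */
+static void uaf_free(void *ptr, size_t size)
+{
+	munmap((char *)ptr - PAGE_SZ, round_to_pages(size) + 2 * PAGE_SZ);
+}
+```
+
+With this layout, each object costs at least three pages of address space and
+two system calls per allocation, which is consistent with the memory usage and
+scalability issues mentioned above.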
+
+There are no more provisions for running with a shared pool but no thread-local
+cache: the shared pool's main goal is to compensate for the expensive calls to
+the memory allocator. This gain may be huge on tiny systems using basic
+allocators, but the thread-local cache will already achieve this. And on larger
+threaded systems, the shared pool's benefit is visible when the underlying
+allocator scales poorly, but in this case the shared pool would suffer from
+the same limitations without its thread-local cache and wouldn't provide any
+benefit.
+
+Summary of the various operation modes:
+
+                  (1)            (2)            (3)            (4)
+
+                  User           User           User           User
+                   |              |              |              |
+      pool_alloc() V              V              |              |
+              +---------+    +---------+         |              |
+              | Thread  |    | Thread  |         |              |
+              |  Local  |    |  Local  |         |              |
+              |  Cache  |    |  Cache  |         |              |
+              +---------+    +---------+         |              |
+                   |              |              |              |
+    pool_refill*() V              |              |              |
+              +---------+         |              |              |
+              | Shared  |         |              |              |
+              |  Pool   |         |              |              |
+              +---------+         |              |              |
+                   |              |              |              |
+          malloc() V              V              V              |
+              +---------+    +---------+    +---------+         |
+              | Library |    | Library |    | Library |         |
+              +---------+    +---------+    +---------+         |
+                   |              |              |              |
+            mmap() V              V              V              V
+              +---------+    +---------+    +---------+    +---------+
+              |   OS    |    |   OS    |    |   OS    |    |   OS    |
+              +---------+    +---------+    +---------+    +---------+
+
+One extra build define, DEBUG_FAIL_ALLOC, is used to enforce random allocation
+failure in pool_alloc() by randomly returning NULL, to test that callers
+properly handle allocation failures. It may also be enabled at boot time using
+"-dMfail". In this case the desired average rate of allocation failures can be
+set by the global setting "tune.fail-alloc", expressed in percent.
+
+The thread-local caches contain the freshest objects. The total size of each
+cache is bounded by the number of bytes set in global.tune.pool_cache_size,
+which may be adjusted by the "tune.memory.hot-size" global option and defaults
+to the build-time setting CONFIG_HAP_POOL_CACHE_SIZE (1MB before 2.6, 512kB
+since). The aim is to keep hot objects that still fit in the CPU core's private
+L2 cache. Once these objects do not fit into the cache anymore, there's no
+benefit keeping them local to the thread, so they'd rather be returned to the
+shared pool or the main allocator so that any other thread may make use of
+them. Under extreme thread contention the cost of accessing shared structures
+in the global cache or in malloc() may still be important and it may prove
+useful to increase the thread-local cache size.
+
+
+3. Storage in thread-local caches
+---------------------------------
+
+This section describes how objects are linked in thread-local caches. This is
+not meant to be a concern for users of the pools API, but it can be useful when
+inspecting post-mortem dumps or when working out certain size constraints.
+
+Objects are stored in the local cache using a doubly-linked list. This ensures
+that they can be visited by freshness order like a stack, while at the same
+time being able to access them from oldest to newest when it is needed to
+evict coldest ones first:
+
+  - releasing an object to the cache always puts it on the top.
+
+  - allocating an object from the cache always takes the topmost one, hence the
+    freshest one.
+
+  - scanning for older objects to evict starts from the bottom, where the
+    oldest ones are located.
+
+To that end, each thread-local cache keeps a list head in the "list" member of
+its "pool_cache_head" descriptor, that links all objects cast to type
+"pool_cache_item" via their "by_pool" member.
+
+Note that the mechanism described above only works for a single pool. When
+trying to limit the total cache size to a certain value, all pools included,
+there is also a need to arrange all objects from all pools together in the
+local caches. For this, each thread_ctx maintains a list head of recently
+released objects, all pools included, in its member "pool_lru_head". All items
+in a thread-local cache are linked there via their "by_lru" member.
+
+This means that releasing an object using pool_free() consists in inserting
+it at the beginning of two lists:
+  - the local pool_cache_head's "list" list head
+  - the thread context's "pool_lru_head" list head
+
+Allocating an object consists in picking the first entry from the pool's "list"
+and deleting its "by_pool" and "by_lru" links.
+
+Evicting an object consists in scanning the thread context's "pool_lru_head"
+backwards and deleting the object's "by_pool" and "by_lru" links.
+
+Given that entries are both inserted and removed synchronously, we have the
+guarantee that the oldest object in the thread's LRU list is always the oldest
+object in its pool, and that the next element is the cache's list head. This is
+what allows the LRU eviction mechanism to figure out which pool an object
+belongs to when releasing it.
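+
+The three operations above may be sketched as follows. This is a simplified
+model and not the actual implementation: the list helpers, the ITEM_OF() macro
+and the cache_put(), cache_get() and cache_evict_oldest() functions are
+hypothetical names, and a single global pool_lru_head stands in for the one
+found in each thread_ctx.
+
+```c
+#include <stddef.h>
+
+struct list { struct list *n, *p; };	/* n = next, p = previous */
+
+/* get a pointer to the structure of type <type> embedding <member> at <ptr> */
+#define ITEM_OF(ptr, type, member) \
+	((type *)((char *)(ptr) - offsetof(type, member)))
+
+static void list_init(struct list *h) { h->n = h->p = h; }
+
+/* insert <e> at the top (just after the head <h>) */
+static void list_insert(struct list *h, struct list *e)
+{
+	e->n = h->n;
+	e->p = h;
+	h->n->p = e;
+	h->n = e;
+}
+
+static void list_delete(struct list *e)
+{
+	e->p->n = e->n;
+	e->n->p = e->p;
+}
+
+struct pool_cache_item { struct list by_pool, by_lru; };
+struct pool_cache_head { struct list list; };
+
+static struct list pool_lru_head;	/* one per thread in the real code */
+
+/* release: insert at the top of both lists */
+static void cache_put(struct pool_cache_head *pch, void *obj)
+{
+	struct pool_cache_item *it = obj;
+
+	list_insert(&pch->list, &it->by_pool);
+	list_insert(&pool_lru_head, &it->by_lru);
+}
+
+/* allocate: pick the freshest entry of this pool, unlink it from both lists */
+static void *cache_get(struct pool_cache_head *pch)
+{
+	struct pool_cache_item *it;
+
+	if (pch->list.n == &pch->list)
+		return NULL;
+	it = ITEM_OF(pch->list.n, struct pool_cache_item, by_pool);
+	list_delete(&it->by_pool);
+	list_delete(&it->by_lru);
+	return it;
+}
+
+/* evict: take the oldest LRU entry. Being also the oldest one of its pool,
+ * its "by_pool" next element is the cache's list head, which gives the owner.
+ */
+static void *cache_evict_oldest(struct pool_cache_head **owner)
+{
+	struct pool_cache_item *it;
+
+	if (pool_lru_head.p == &pool_lru_head)
+		return NULL;
+	it = ITEM_OF(pool_lru_head.p, struct pool_cache_item, by_lru);
+	*owner = ITEM_OF(it->by_pool.n, struct pool_cache_head, list);
+	list_delete(&it->by_pool);
+	list_delete(&it->by_lru);
+	return it;
+}
+```
+
+All list heads, including pool_lru_head, have to be initialized with
+list_init() before the first use.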
+
+Note:
+ | Since a pool_cache_item has two list entries, on 64-bit systems it will be
+ | 32 bytes long. This is the smallest size that a pool may be, and any smaller
+ | size will automatically be rounded up to this size.
+
+When build option DEBUG_POOL_INTEGRITY is set, or the boot-time option
+"-dMintegrity" is passed on the command line, the area of the object between
+the two list elements and the end according to pool->size will be filled with
+pseudo-random words during pool_put_to_cache(), and these words will be
+compared with each other during pool_get_from_cache(), and the process will
+crash in case any bit differs, as this would indicate that the memory area was
+modified after the free. The pseudo-random pattern is in fact incremented by
+(~0)/3 upon each free so that roughly half of the bits change each time and we
+maximize the likelihood of detecting a single bit flip in either direction. In
+order to avoid an immediate reuse and maximize the time the object spends in
+the cache, when this option is set, objects are picked from the cache from the
+oldest one instead of the freshest one. This way even late memory corruptions
+have a chance to be detected.
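+
+The fill and check steps may be pictured with the sketch below. It is only an
+approximation made under assumptions: fill_on_free(), check_on_alloc(),
+LIST_AREA and fill_pattern are hypothetical names, 32-bit words are used for
+the pattern, and the check returns a status instead of crashing the process.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* room taken by the two list elements at the beginning of a cached object */
+#define LIST_AREA (4 * sizeof(void *))
+
+static uint32_t fill_pattern;	/* hypothetical pattern, changes on each free */
+
+/* fill the area between the list elements and the end with the pattern */
+static void fill_on_free(void *obj, size_t size)
+{
+	uint32_t *w = (uint32_t *)((char *)obj + LIST_AREA);
+	size_t n = (size - LIST_AREA) / sizeof(*w);
+	size_t i;
+
+	fill_pattern += UINT32_MAX / 3;	/* (~0)/3 = 0x55555555 */
+	for (i = 0; i < n; i++)
+		w[i] = fill_pattern;
+}
+
+/* returns 1 if all words are still identical, 0 if the area was modified */
+static int check_on_alloc(const void *obj, size_t size)
+{
+	const uint32_t *w = (const uint32_t *)((const char *)obj + LIST_AREA);
+	size_t n = (size - LIST_AREA) / sizeof(*w);
+	size_t i;
+
+	for (i = 1; i < n; i++)
+		if (w[i] != w[0])
+			return 0;
+	return 1;
+}
+```
+
+Comparing the words with each other rather than with a stored reference means
+that the checker does not need to remember which pattern was used for a given
+object.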
+
+When build option DEBUG_MEMORY_POOLS is set, or the boot-time option "-dMtag"
+is passed on the executable's command line, pool objects are allocated with
+one extra pointer compared to the requested size, so that the bytes that follow
+the memory area point to the pool descriptor itself as long as the object is
+allocated via pool_alloc(). Upon releasing via pool_free(), the pointer is
+compared and the code will crash if it differs. This makes it possible to
+detect both memory overflows and objects released to the wrong pool (typically
+a code bug resulting from a copy-paste error).
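+
+A minimal sketch of this tagging scheme is shown below. It is not the real
+code: the reduced struct pool_head and the functions tagged_alloc() and
+tag_matches() are hypothetical, malloc() stands in for the pool allocator, and
+a mismatch is reported to the caller instead of crashing.
+
+```c
+#include <stdlib.h>
+#include <string.h>
+
+struct pool_head { size_t size; };	/* reduced to what the sketch needs */
+
+/* allocate room for one extra pointer and store the pool's address there */
+static void *tagged_alloc(struct pool_head *pool)
+{
+	char *obj = malloc(pool->size + sizeof(pool));
+
+	if (obj)
+		memcpy(obj + pool->size, &pool, sizeof(pool));
+	return obj;
+}
+
+/* returns 1 if the pointer that follows the object still designates <pool> */
+static int tag_matches(const struct pool_head *pool, const void *obj)
+{
+	const struct pool_head *tag;
+
+	memcpy(&tag, (const char *)obj + pool->size, sizeof(tag));
+	return tag == pool;
+}
+```
+
+The tag is accessed using memcpy() here because nothing guarantees that the
+end of the object is suitably aligned for a pointer.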
+
+Thus an object will look like this depending on whether it's in the cache or is
+currently in use:
+
+             in cache                 in use
+          +------------+          +------------+
+       <--+  by_pool.p |          |   N bytes  |
+          |  by_pool.n +-->       |            |
+          +------------+          |N=16 min on |
+       <--+  by_lru.p  |          |  32-bit,   |
+          |  by_lru.n  +-->       |  32 min on |
+          +------------+          |  64-bit    |
+          :            :          :            :
+          |   N bytes  |          |            |
+          +------------+          +------------+ \   optional, only if
+          :  (unused)  :          :  pool ptr  :  >  DEBUG_MEMORY_POOLS
+          +------------+          +------------+ /   is set at build time
+                                                     or -dMtag at boot time
+
+Right now no provisions are made to return objects aligned on larger boundaries
+than those currently covered by malloc() (i.e. two pointers). This need appears
+from time to time and the layout above might evolve a little bit if needed.
+
+
+4. Storage in the process-wide shared pool
+------------------------------------------
+
+In order for the shared pool not to be a contention point in a multi-threaded
+environment, objects are allocated from or released to shared pools by clusters
+of a few objects at once. The maximum number of objects that may be moved to or
+from a shared pool at once is defined by CONFIG_HAP_POOL_CLUSTER_SIZE at build
+time, and currently defaults to 8.
+
+In order to remain scalable, the shared pool has to make some tradeoffs to
+limit the number of atomic operations and the duration of any locked operation.
+As such, it's composed of a singly-linked list of clusters, themselves made of
+a singly-linked list of objects.
+
+Clusters and objects are of the same type "pool_item" and are accessed from the
+pool's "free_list" member. This member points to the latest pool_item inserted
+into the pool by a release operation. And the pool_item's "next" member points
+to the next pool_item, which was the one present in the pool's free_list just
+before the pool_item was inserted, and the last pool_item in the list simply
+has a NULL "next" field.
+
+The pool_item's "down" pointer points down to the next object that is part of
+the same cluster, and that will be released or allocated at the same time as
+the first one. Each of these items also has a NULL "next" field, and they are
+chained by their respective "down" pointers until the last one is detected by
+a NULL value.
+
+This results in the following layout:
+
+      pool           pool_item   pool_item   pool_item
+    +-----------+    +------+    +------+    +------+
+    | free_list +--> | next +--> | next +--> | NULL |
+    +-----------+    +------+    +------+    +------+
+                     | down |    | NULL |    | down |
+                     +--+---+    +------+    +--+---+
+                        |                       |
+                        V                       V
+                     +------+                +------+
+                     | NULL |                | NULL |
+                     +------+                +------+
+                     | down |                | NULL |
+                     +--+---+                +------+
+                        |
+                        V
+                     +------+
+                     | NULL |
+                     +------+
+                     | NULL |
+                     +------+
+
+Allocating an entry is only a matter of performing two atomic operations on
+the free_list and reading the first pool_item's "next" value:
+
+  - atomically mark the free_list as being updated by writing a "magic" pointer
+  - read the first pool_item's "next" field
+  - atomically replace the free_list with this value
+
+This results in a fast operation that retrieves a whole cluster at once.
+Then outside of the critical section entries are walked over and inserted into
+the local cache one at a time. In order to keep the code simple and efficient,
+objects allocated from the shared pool are all placed into the local cache, and
+only then the first one is allocated from the cache. This operation is
+performed by the dedicated function pool_refill_local_from_shared() which is
+called from pool_get_from_cache() when the cache is empty. It means there is an
+overhead of two list insert/delete operations for the first object and that
+could be avoided at the expense of more complex code in the fast path, but this
+is negligible since it only concerns objects that need to be visited anyway.
+
+Freeing a group of objects consists in performing the operation the other way
+around:
+
+  - atomically mark the free_list as being updated by writing a "magic" pointer
+  - write the free_list value to the to-be-released item's "next" entry
+  - atomically replace the free_list with the pool_item's pointer
+
+The cluster will simply have to be prepared before being sent to the shared
+pool. The operation of releasing a cluster at once is performed by function
+pool_put_to_shared_cache() which is called from pool_evict_last_items() which
+itself is responsible for building the clusters.
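+
+Both sequences may be sketched with C11 atomics as shown below. This is only a
+model of the steps listed above and not the actual code: struct shared_pool,
+POOL_BUSY, shared_get_cluster() and shared_put_cluster() are hypothetical
+names, and the value chosen for the "magic" pointer is an assumption.
+
+```c
+#include <stdatomic.h>
+#include <stddef.h>
+
+struct pool_item { struct pool_item *next, *down; };
+
+/* hypothetical "magic" pointer marking the free_list as being updated */
+#define POOL_BUSY ((struct pool_item *)1)
+
+struct shared_pool { _Atomic(struct pool_item *) free_list; };
+
+/* detach and return the first cluster, or NULL if the pool is empty */
+static struct pool_item *shared_get_cluster(struct shared_pool *p)
+{
+	struct pool_item *head;
+
+	/* mark the list as busy, retry while another thread holds it */
+	do {
+		head = atomic_exchange(&p->free_list, POOL_BUSY);
+	} while (head == POOL_BUSY);
+
+	if (!head) {
+		atomic_store(&p->free_list, NULL);
+		return NULL;
+	}
+
+	/* the second cluster becomes the first one */
+	atomic_store(&p->free_list, head->next);
+	head->next = NULL;	/* the cluster is now private to the caller */
+	return head;
+}
+
+/* insert an already built cluster at the beginning of the list */
+static void shared_put_cluster(struct shared_pool *p, struct pool_item *cluster)
+{
+	struct pool_item *head;
+
+	do {
+		head = atomic_exchange(&p->free_list, POOL_BUSY);
+	} while (head == POOL_BUSY);
+
+	cluster->next = head;
+	atomic_store(&p->free_list, cluster);
+}
+```
+
+The items chained via "down" are never touched while the list is marked busy,
+so the duration of the critical section does not depend on the cluster size.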
+
+Due to the way objects are stored, it is important to try to group objects as
+much as possible when releasing them because this is what will condition their
+retrieval as groups as well. This is the reason why pool_evict_last_items()
+uses the LRU to find a first entry but tries to pick several items at once from
+a single cache. Tests have shown that CONFIG_HAP_POOL_CLUSTER_SIZE set to 8
+achieves up to 6-6.5 objects on average per operation, which effectively
+divides by as much the average time spent per object by each thread and pushes
+the contention point further.
+
+Also, grouping items in clusters is a property of the process-wide shared pool
+and not of the thread-local caches. This means that there is no grouped
+operation when not using the shared pool (mode "2" in the diagram above).
+
+
+5. API
+------
+
+The following functions are public and available for user code:
+
+struct pool_head *create_pool(char *name, uint size, uint flags)
+        Create a new pool named <name> for objects of size <size> bytes. Pool
+        names are truncated to their first 11 characters. Pools of very similar
+        size will usually be merged if both have set the flag MEM_F_SHARED in
+        <flags>. When DEBUG_DONT_SHARE_POOLS was set at build time, or
+        "-dMno-merge" is passed on the executable's command line, the pools
+        also need to have the exact same name to be merged. In addition, unless
+        MEM_F_EXACT is set in <flags>, the object size will usually be rounded
+        up to a multiple of the size of four pointers (16 or 32 bytes). The name
+        that will appear in the pool upon merging is the name of the first
+        created pool. The returned pointer is the new (or reused) pool head, or
+        NULL upon error. Pools created this way must be destroyed using
+        pool_destroy().
+
+void *pool_destroy(struct pool_head *pool)
+        Destroy pool <pool>, that is, all of its unused objects are freed and
+        the structure is freed as well if the pool didn't have any used objects
+        anymore. In this case NULL is returned. If some objects remain in use,
+        the pool is preserved and its pointer is returned. This ought to be
+        used essentially on exit or in rare situations where some internal
+        entities that hold pools have to be destroyed.
+
+void pool_destroy_all(void)
+        Destroy all pools, without checking which ones still have used entries.
+        This is only meant for use on exit.
+
+void *__pool_alloc(struct pool_head *pool, uint flags)
+        Allocate an entry from the pool <pool>. The allocator will first look
+        for an object in the thread-local cache if enabled, then in the shared
+        pool if enabled, then will fall back to the operating system's default
+        allocator. NULL is returned if the object couldn't be allocated (due to
+        configured limits or lack of memory). Objects allocated this way have
+        to be released using pool_free(). Like with malloc(), by default the
+        contents of the returned object are undefined. If memory poisoning is
+        enabled, the object will be filled with the poisoning byte. If the
+        global "tune.fail-alloc" setting is non-zero and DEBUG_FAIL_ALLOC is
+        enabled, a random number generator will be called to randomly return a
+        NULL. The allocator's behavior may be adjusted using a few flags passed
+        in <flags>:
+           - POOL_F_NO_POISON : when set, disables memory poisoning (e.g. when
+             pointless and expensive, like for buffers)
+           - POOL_F_MUST_ZERO : when set, the memory area will be zeroed before
+             being returned, similar to what calloc() does
+           - POOL_F_NO_FAIL : when set, disables the random allocation failure,
+             e.g. for use during early init code or critical sections.
+
+void *pool_alloc(struct pool_head *pool)
+        This is an exact equivalent of __pool_alloc(pool, 0). It is the regular
+        way to allocate entries from a pool.
+
+void *pool_alloc_nocache(struct pool_head *pool)
+        Allocate an entry from the pool <pool>, bypassing the cache. If shared
+        pools are enabled, they will be consulted first. Otherwise the object
+        is allocated using the operating system's default allocator. This is
+        essentially used during early boot to pre-allocate a number of objects
+        for pools which require a minimum number of entries to exist.
+
+void *pool_zalloc(struct pool_head *pool)
+        This is an exact equivalent of __pool_alloc(pool, POOL_F_MUST_ZERO).
+
+void pool_free(struct pool_head *pool, void *ptr)
+        Free an entry allocated by one of the pool_alloc() functions above
+        from pool <pool>. The object will be placed into the thread-local cache
+        if enabled, or in the shared pool if enabled, or will be released using
+        the operating system's default allocator. When a local cache is
+        enabled, if the local cache size becomes larger than 75% of the maximum
+        size configured at build time, some objects will be evicted to the
+        shared pool. Such objects are taken first from the same pool, but if
+        the total size is really huge, other pools might be checked as well.
+        Some debugging options may enable extra checks at release time so
+        that the process will immediately crash if the object was not allocated
+        from this pool or experienced an overflow or some memory corruption.
+
+void pool_flush(struct pool_head *pool)
+        Free all unused objects from shared pool <pool>. Thread-local caches
+        are not affected. This is essentially used when running low on memory
+        or when stopping, in order to release a maximum amount of memory for
+        the new process.
+
+void pool_gc(struct pool_head *pool)
+        Free all unused objects from all pools, but respecting the minimum
+        number of spare objects required for each of them. Then, for operating
+        systems which support it, indicate to the system that all unused memory
+        can be released. Thread-local caches are not affected. This operation
+        differs from pool_flush() in that it is run locklessly, under thread
+        isolation, and on all pools in a row. It is called by the SIGQUIT
+        signal handler and upon exit. Note that the obsolete argument <pool> is
+        not used and the convention is to pass NULL there.
+
+void dump_pools_to_trash(void)
+        Dump the current status of all pools into the trash buffer. This is
+        essentially used by the "show pools" CLI command or the SIGQUIT signal
+        handler to dump them on stderr. The total report size may not exceed
+        the size of the trash buffer. If it does, some entries will be missing.
+
+void dump_pools(void)
+        Dump the current status of all pools to stderr. This just calls
+        dump_pools_to_trash() and writes the trash to stderr.
+
+int pool_total_failures(void)
+        Report the total number of failed allocations. This is solely used to
+        report the "PoolFailed" metrics of the "show info" output. The total
+        is calculated on the fly by summing the number of failures in all pools
+        and is only meant to be used as an indicator rather than a precise
+        measure.
+
+ullong pool_total_allocated(void)
+        Report the total number of bytes allocated in all pools, for reporting
+        in the "PoolAlloc_MB" field of the "show info" output. The total is
+        calculated on the fly by summing the number of allocated bytes in all
+        pools and is only meant to be used as an indicator rather than a
+        precise measure.
+
+ullong pool_total_used(void)
+        Report the total number of bytes used in all pools, for reporting in
+        the "PoolUsed_MB" field of the "show info" output. The total is
+        calculated on the fly by summing the number of used bytes in all pools
+        and is only meant to be used as an indicator rather than a precise
+        measure. Note that objects present in caches are accounted as used.
+
+Some other functions exist and are only used by the pools code itself. While
+not strictly forbidden to use outside of this code, it is generally recommended
+to avoid touching them in order not to create undesired dependencies that will
+complicate maintenance.
+
+A few macros exist to ease the declaration of pools:
+
+DECLARE_POOL(ptr, name, size)
+        Placed at the top level of a file, this declares a global memory pool
+        as variable <ptr>, name <name> and size <size> bytes per element. This
+        is made via a call to REGISTER_POOL() and by assigning the resulting
+        pointer to variable <ptr>. <ptr> will be created of type "struct
+        pool_head *". If the pool needs to be visible outside of the file
+        (which is likely), it will also need to be declared somewhere as
+        "extern struct pool_head *<ptr>;". It is recommended to place such
+        declarations very early in the source file so that the variable is
+        already known to all subsequent functions which may use it.
+
+DECLARE_STATIC_POOL(ptr, name, size)
+        Placed at the top level of a file, this declares a static memory pool
+        as variable <ptr>, name <name> and size <size> bytes per element. This
+        is made via a call to REGISTER_POOL() and by assigning the resulting
+        pointer to local variable <ptr>. <ptr> will be created of type "static
+        struct pool_head *". It is recommended to place such declarations very
+        early in the source file so that the variable is already known to all
+        subsequent functions which may use it.
+
+
+6. Build options
+----------------
+
+A number of build-time defines allow tuning the pools' behavior. All of them
+have to be enabled using "-Dxxx" or "-Dxxx=yyy" in the makefile's DEBUG
+variable.
+
+DEBUG_NO_POOLS
+        When this is set, pools are entirely disabled, and allocations are made
+        using malloc() instead. This is not recommended for production but may
+        be useful for tracing allocations. It corresponds to "-dMno-cache" at
+        boot time.
+
+DEBUG_MEMORY_POOLS
+        When this is set, an extra pointer is allocated at the end of each
+        object to reference the pool the object was allocated from and detect
+        buffer overflows. Then, pool_free() will provoke a crash in case it
+        detects an anomaly (pointer at the end not matching the pool). It
+        corresponds to "-dMtag" at boot time.
+
+DEBUG_FAIL_ALLOC
+        When enabled, a global setting "tune.fail-alloc" may be set to a non-
+        zero value representing a percentage of memory allocations that will be
+        made to fail in order to stress the calling code. It corresponds to
+        "-dMfail" at boot time.
+
+DEBUG_DONT_SHARE_POOLS
+        When enabled, pools of similar sizes are not merged unless they have the
+        exact same name. It corresponds to "-dMno-merge" at boot time.
+
+DEBUG_UAF
+        When enabled, pools are disabled and all allocations and releases pass
+        through mmap() and munmap(). The memory usage significantly inflates
+        and the performance degrades, but this allows to detect a lot of
+        use-after-free conditions by crashing the program at the first abnormal
+        access. This should not be used in production. It corresponds to
+        boot-time option "-dMuaf". Caching is disabled but may be re-enabled
+        using "-dMcache".
+
+DEBUG_POOL_INTEGRITY
+        When enabled, objects picked from the cache are checked for corruption
+        by comparing their contents against a pattern that was placed when they
+        were inserted into the cache. Objects are also allocated in the reverse
+        order, from the oldest one to the most recent, so as to maximize the
+        ability to detect such a corruption. The goal is to detect writes after
+        free (or possibly hardware memory corruptions). Contrary to DEBUG_UAF
+        this cannot detect reads after free, but may possibly detect later
+        corruptions and will not consume extra memory. The CPU usage will
+        increase a bit due to the cost of filling/checking the area and for the
+        preference for cold cache instead of hot cache, though not as much as
+        with DEBUG_UAF. This option is meant to be usable in production. It
+        corresponds to boot-time options "-dMcold-first,integrity".
+
+DEBUG_POOL_TRACING
+        When enabled, the callers of pool_alloc() and pool_free() will be
+        recorded into an extra memory area placed after the end of the object.
+        This may only be required by developers who want to get a few more
+        hints about code paths involved in some crashes, but will serve no
+        purpose outside of this. It remains compatible with (and complements
+        well) DEBUG_POOL_INTEGRITY above. Such information becomes meaningless
+        once the objects leave the thread-local cache. It corresponds to
+        boot-time option "-dMcaller".
+
+DEBUG_MEM_STATS
+        When enabled, all malloc/calloc/realloc/strdup/free calls are accounted
+        for per call place (file+line number), and may be displayed or reset on
+        the CLI using "debug dev memstats". This is essentially used to detect
+        potential leaks or abnormal usages. When pools are enabled (default),
+        such calls are rare and the output will mostly contain calls induced by
+        libraries. When pools are disabled, about all calls to pool_alloc() and
+        pool_free() will also appear since they will be remapped to standard
+        functions.
+
+CONFIG_HAP_GLOBAL_POOLS
+        When enabled, process-wide shared pools will be forcefully enabled even
+        if not considered useful on the platform. The default is to let haproxy
+        decide based on the OS and C library. It corresponds to boot-time
+        option "-dMglobal".
+
+CONFIG_HAP_NO_GLOBAL_POOLS
+        When enabled, process-wide shared pools will be forcefully disabled
+        even if considered useful on the platform. The default is to let
+        haproxy decide based on the OS and C library. It corresponds to
+        boot-time option "-dMno-global".
+
+CONFIG_HAP_POOL_CACHE_SIZE
+        This allows one to define the default size of the per-thread cache, in
+        bytes. The default value is 512 kB (524288). Smaller values will use
+        less memory at the expense of a possibly higher CPU usage when using
+        many threads. Higher values will give diminishing returns on
+        performance while using much more memory. Usually there is no benefit
+        in using more than a per-core L2 cache size. It would be better not to
+        set this value lower than a few times the size of a buffer (bufsize,
+        defaults to 16 kB). In addition, keep in mind that this option may be
+        changed at runtime using "tune.memory.hot-size".
+
+CONFIG_HAP_POOL_CLUSTER_SIZE
+        This allows one to define the maximum number of objects that will be
+        grouped together in an allocation from the shared pool. Values 4 to 8
+        have experimentally shown good results with 16 threads. On systems with
+        more cores or loosely coupled caches exhibiting slow atomic operations,
+        it could possibly make sense to slightly increase this value.