Diffstat (limited to 'doc/internals/api/pools.txt')
-rw-r--r-- doc/internals/api/pools.txt 585
1 files changed, 585 insertions, 0 deletions
diff --git a/doc/internals/api/pools.txt b/doc/internals/api/pools.txt
new file mode 100644
index 0000000..d84fb9d
--- /dev/null
+++ b/doc/internals/api/pools.txt
@@ -0,0 +1,585 @@
+2022-02-24 - Pools structure and API
+
+1. Background
+-------------
+
+Memory allocation is a complex problem covered by a massive amount of
+literature. Memory allocators found in the field cover a broad spectrum of
+capabilities, performance, fragmentation, efficiency, etc.
+
+The main difficulty of memory allocation comes from finding the optimal chunks
+for arbitrary sized requests, that will still preserve a low fragmentation
+level. Doing this well is often expensive in CPU usage and/or memory usage.
+
+In programs like HAProxy that deal with a large number of fixed size objects,
+there is no point having to endure all this risk of fragmentation, and the
+associated costs (sometimes up to several milliseconds with certain minimalist
+allocators) are simply not acceptable. A better approach consists in grouping
+frequently used objects by size, knowing that due to the high repetitiveness of
+operations, a freed object will immediately be needed for another operation.
+
+This grouping of objects by size is what is called a pool. Pools are created
+for certain frequently allocated objects, are usually merged together when they
+are of the same size (or almost the same size), and significantly reduce the
+number of calls to the memory allocator.
+
+With the arrival of threads, pools started to become a bottleneck so they now
+implement an optional thread-local lockless cache. Finally with the arrival of
+really efficient memory allocators in modern operating systems, the shared part
+has also become optional so that it doesn't consume memory if it does not bring
+any value.
+
+In 2.6-dev2, a number of debugging options that used to be configurable only
+at build time became boot-time options. They can be modified using keywords
+passed after "-dM" on the command line, which set or clear bits in the
+pool_debugging variable. The build-time options still affect the default
+settings however. Default values may be consulted using "haproxy -dMhelp".
+
+
+2. Principles
+-------------
+
+The pools architecture is selected at build time. The main options are:
+
+ - thread-local caches and process-wide shared pool enabled (1)
+
+ This is the default situation on most operating systems. Each thread has
+ its own local cache, and when depleted it refills from the process-wide
+ pool that avoids calling the standard allocator too often. It is possible
+ to force this mode at build time by setting CONFIG_HAP_GLOBAL_POOLS or at
+ boot time with "-dMglobal".
+
+ - thread-local caches only are enabled (2)
+
+ This is the situation on operating systems where a fast and modern memory
+ allocator is detected and when it is estimated that the process-wide shared
+ pool will not bring any benefit. This detection is automatic at build time,
+ but may also be forced at build time by setting CONFIG_HAP_NO_GLOBAL_POOLS
+ or at boot time with "-dMno-global".
+
+ - pass-through to the standard allocator (3)
+
+ This is used when one absolutely wants to disable pools and rely on regular
+ malloc() and free() calls, essentially in order to trace memory allocations
+ by call points, either internally via DEBUG_MEM_STATS, or externally via
+ tools such as Valgrind. This mode of operation may be forced at build time
+ by setting DEBUG_NO_POOLS or at boot time with "-dMno-cache".
+
+ - pass-through to an mmap-based allocator for debugging (4)
+
+ This is used only during deep debugging when trying to detect various
+ conditions such as use-after-free. In this case each allocated object's
+ size is rounded up to a multiple of a page size (4096 bytes) and an
+ integral number of pages is allocated for each object using mmap(),
+ surrounded by two inaccessible holes that aim to detect some out-of-bounds
+ accesses. Released objects are instantly freed using munmap() so that any
+ immediate subsequent access to the memory area crashes the process if the
+ area had not been reallocated yet. This mode can be enabled at build time
+ by setting DEBUG_UAF, or at run time by disabling pools and enabling UAF
+ with "-dMuaf". It tends to consume a lot of memory and does not scale at all
+ with concurrent calls, which tends to make the system stall. The watchdog
+ may even trigger on some slow allocations.
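
The size rounding described for mode (4) can be illustrated with a short
sketch. This is only an illustration of the arithmetic, not HAProxy's code:
the names UAF_PAGE_SIZE and uaf_area_size() are hypothetical, and the 4096-byte
page size is the one quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define UAF_PAGE_SIZE 4096u   /* page size quoted in the text (assumption) */

/* Round <size> up to the next multiple of the page size. The result is the
 * length that a per-object mmap() and the matching munmap() would cover.
 */
static size_t uaf_area_size(size_t size)
{
    assert(size > 0);
    return (size + UAF_PAGE_SIZE - 1) & ~(size_t)(UAF_PAGE_SIZE - 1);
}
```

With such rounding, even a 1-byte object occupies a full page, which is one
reason why this mode consumes a lot of memory.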
+
+There are no more provisions for running with a shared pool but no thread-local
+cache: the shared pool's main goal is to compensate for the expensive calls to
+the memory allocator. This gain may be huge on tiny systems using basic
+allocators, but the thread-local cache will already achieve this. And on larger
+threaded systems, the shared pool's benefit is visible when the underlying
+allocator scales poorly, but in this case the shared pool would suffer from
+the same limitations without its thread-local cache and wouldn't provide any
+benefit.
+
+Summary of the various operation modes:
+
+                      (1)           (2)           (3)           (4)
+
+                     User          User          User          User
+                       |             |             |             |
+  pool_alloc()         V             V             |             |
+                  +---------+   +---------+        |             |
+                  | Thread  |   | Thread  |        |             |
+                  |  Local  |   |  Local  |        |             |
+                  |  Cache  |   |  Cache  |        |             |
+                  +---------+   +---------+        |             |
+                       |             |             |             |
+  pool_refill*()       V             |             |             |
+                  +---------+        |             |             |
+                  | Shared  |        |             |             |
+                  |  Pool   |        |             |             |
+                  +---------+        |             |             |
+                       |             |             |             |
+  malloc()             V             V             V             |
+                  +---------+   +---------+   +---------+        |
+                  | Library |   | Library |   | Library |        |
+                  +---------+   +---------+   +---------+        |
+                       |             |             |             |
+  mmap()               V             V             V             V
+                  +---------+   +---------+   +---------+   +---------+
+                  |   OS    |   |   OS    |   |   OS    |   |   OS    |
+                  +---------+   +---------+   +---------+   +---------+
+
+One extra build define, DEBUG_FAIL_ALLOC, is used to enforce random allocation
+failure in pool_alloc() by randomly returning NULL, to test that callers
+properly handle allocation failures. It may also be enabled at boot time using
+"-dMfail". In this case the desired average rate of allocation failures can be
+set using the global setting "tune.fail-alloc", expressed in percent.
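
The principle of a percentage-based random failure can be sketched as follows.
This is a minimal illustration assuming a plain rand() draw and a malloc()
fallback; the helper names must_fail() and alloc_or_fail() are hypothetical and
do not exist in HAProxy.

```c
#include <assert.h>
#include <stdlib.h>

/* Return non-zero when an allocation should be made to fail, given a
 * failure rate in percent (0..100) and a random draw <rnd>.
 */
static int must_fail(unsigned int rate_percent, unsigned int rnd)
{
    assert(rate_percent <= 100);
    return (rnd % 100) < rate_percent;
}

/* Hypothetical wrapper: returns NULL at the requested average rate,
 * otherwise falls back to malloc().
 */
static void *alloc_or_fail(size_t size, unsigned int rate_percent)
{
    if (must_fail(rate_percent, (unsigned int)rand()))
        return NULL;
    return malloc(size);
}
```

A rate of 0 never fails and a rate of 100 always fails, so callers that do not
check for NULL are quickly exposed when the rate is raised.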
+
+The thread-local caches contain the freshest objects. Their total size amounts
+to the number of bytes set in global.tune.pool_cache_size and may be adjusted
+by the "tune.memory.hot-size" global option, which itself defaults to build
+time setting CONFIG_HAP_POOL_CACHE_SIZE, which was 1MB before 2.6 and 512kB
+after. The aim is to keep hot objects that still fit in the CPU core's private
+L2 cache. Once these objects do not fit into the cache anymore, there's no
+benefit keeping them local to the thread, so they'd rather be returned to the
+shared pool or the main allocator so that any other thread may make use of
+them. Under extreme thread contention the cost of accessing shared structures
+in the global cache or in malloc() may still be significant and it may prove
+useful to increase the thread-local cache size.
+
+
+3. Storage in thread-local caches
+---------------------------------
+
+This section describes how objects are linked in thread local caches. This is
+not meant to be a concern for users of the pools API but it can be useful when
+inspecting post-mortem dumps or trying to figure out certain size constraints.
+
+Objects are stored in the local cache using a doubly-linked list. This ensures
+that they can be visited by freshness order like a stack, while at the same
+time being able to access them from oldest to newest when it is needed to
+evict coldest ones first:
+
+ - releasing an object to the cache always puts it on the top.
+
+ - allocating an object from the cache always takes the topmost one, hence the
+ freshest one.
+
+ - scanning for older objects to evict starts from the bottom, where the
+ oldest ones are located.
+
+To that end, each thread-local cache keeps a list head in the "list" member of
+its "pool_cache_head" descriptor, that links all objects cast to type
+"pool_cache_item" via their "by_pool" member.
+
+Note that the mechanism described above only works for a single pool. When
+trying to limit the total cache size to a certain value, all pools included,
+there is also a need to arrange all objects from all pools together in the
+local caches. For this, each thread_ctx maintains a list head of recently
+released objects, all pools included, in its member "pool_lru_head". All items
+in a thread-local cache are linked there via their "by_lru" member.
+
+This means that releasing an object using pool_free() consists in inserting
+it at the beginning of two lists:
+ - the local pool_cache_head's "list" list head
+ - the thread context's "pool_lru_head" list head
+
+Allocating an object consists in picking the first entry from the pool's "list"
+and deleting its "by_pool" and "by_lru" links.
+
+Evicting an object consists in scanning the thread context's "pool_lru_head"
+backwards and deleting the object's "by_pool" and "by_lru" links.
+
+Given that entries are both inserted and removed synchronously, we have the
+guarantee that the oldest object in the thread's LRU list is always the oldest
+object in its pool, and that the next element is the cache's list head. This is
+what allows the LRU eviction mechanism to figure out which pool an object
+belongs to when releasing it.
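
The double linking described above can be sketched with two circular
doubly-linked lists. This is a simplified illustration, not HAProxy's code:
"struct cache_item" mirrors the "by_pool" and "by_lru" members named in the
text, while the list helpers and the cache_put(), cache_get() and
cache_evict_oldest() functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct list { struct list *n, *p; };    /* next / previous */

struct cache_item {
    struct list by_pool;                /* link in the per-pool list */
    struct list by_lru;                 /* link in the thread-wide LRU */
};

static void list_init(struct list *h) { h->n = h->p = h; }

/* insert <e> right after head <h>, i.e. at the top of the list */
static void list_insert(struct list *h, struct list *e)
{
    e->n = h->n; e->p = h;
    h->n->p = e; h->n = e;
}

static void list_delete(struct list *e)
{
    e->p->n = e->n; e->n->p = e->p;
    e->n = e->p = e;
}

/* release: the object goes to the top of both lists */
static void cache_put(struct list *pool_list, struct list *lru, void *obj)
{
    struct cache_item *it = obj;

    list_insert(pool_list, &it->by_pool);
    list_insert(lru, &it->by_lru);
}

/* allocate: pick the topmost (freshest) entry of the pool's list */
static void *cache_get(struct list *pool_list)
{
    struct cache_item *it;

    if (pool_list->n == pool_list)
        return NULL;                    /* cache is empty */
    it = (struct cache_item *)((char *)pool_list->n -
                               offsetof(struct cache_item, by_pool));
    list_delete(&it->by_pool);
    list_delete(&it->by_lru);
    return it;
}

/* evict: pick the last (oldest) entry of the thread's LRU */
static void *cache_evict_oldest(struct list *lru)
{
    struct cache_item *it;

    if (lru->p == lru)
        return NULL;
    it = (struct cache_item *)((char *)lru->p -
                               offsetof(struct cache_item, by_lru));
    list_delete(&it->by_pool);
    list_delete(&it->by_lru);
    return it;
}
```

Since the item only holds two list elements, its size is four pointers, which
matches the minimum object sizes of 16 and 32 bytes mentioned in this section.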
+
+Note:
+ | Since a pool_cache_item has two list entries, on 64-bit systems it will be
+ | 32-bytes long. This is the smallest size that a pool may be, and any smaller
+ | size will automatically be rounded up to this size.
+
+When build option DEBUG_POOL_INTEGRITY is set, or the boot-time option
+"-dMintegrity" is passed on the command line, the area of the object between
+the two list elements and the end according to pool->size will be filled with
+pseudo-random words during pool_put_to_cache(), and these words will be
+compared with each other during pool_get_from_cache(), and the process will
+crash in case any bit differs, as this would indicate that the memory area was
+modified after the free. The pseudo-random pattern is in fact incremented by
+(~0)/3 upon each free so that roughly half of the bits change each time and we
+maximize the likelihood of detecting a single bit flip in either direction. In
+order to avoid an immediate reuse and maximize the time the object spends in
+the cache, when this option is set, objects are picked from the cache from the
+oldest one instead of the freshest one. This way even late memory corruptions
+have a chance to be detected.
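
The fill-and-compare principle can be sketched as below. This is an
illustration under simplifying assumptions: the area is a standalone array of
32-bit words rather than the tail of a cached object, and fill_pattern,
integrity_fill() and integrity_check() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t fill_pattern;   /* hypothetical pattern state */

/* Fill <nwords> words with the current pattern, then advance the pattern
 * by (~0)/3 = 0x55555555 so that the next fill uses a different value.
 */
static void integrity_fill(uint32_t *area, size_t nwords)
{
    size_t i;

    assert(nwords > 0);
    for (i = 0; i < nwords; i++)
        area[i] = fill_pattern;
    fill_pattern += ~(uint32_t)0 / 3;
}

/* Return 1 if all words are still identical to each other, 0 otherwise.
 * The words are only compared with one another, so the checker does not
 * need to know which pattern was used for the fill.
 */
static int integrity_check(const uint32_t *area, size_t nwords)
{
    size_t i;

    for (i = 1; i < nwords; i++)
        if (area[i] != area[0])
            return 0;
    return 1;
}
```

In this sketch a single flipped bit in any word makes the check fail; a real
implementation would crash the process at that point instead of returning 0.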
+
+When build option DEBUG_MEMORY_POOLS is set, or the boot-time option "-dMtag"
+is passed on the executable's command line, pool objects are allocated with
+one extra pointer compared to the requested size, so that the bytes that follow
+the memory area point to the pool descriptor itself as long as the object is
+allocated via pool_alloc(). Upon releasing via pool_free(), the pointer is
+compared and the code will crash if it differs. This allows detecting both
+memory overflows and objects released to the wrong pool (typically a code bug
+resulting from a copy-paste error).
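
The trailing tag can be sketched as follows, assuming a malloc()-backed
allocation. "struct pool" is a hypothetical minimal descriptor and
tagged_alloc(), tag_matches() and tagged_free() are illustrative names, not
HAProxy functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct pool { size_t size; };   /* hypothetical minimal pool descriptor */

/* Allocate <pool->size> bytes plus room for one trailing pointer, and store
 * the pool's address right after the usable area.
 */
static void *tagged_alloc(struct pool *pool)
{
    char *obj = malloc(pool->size + sizeof(void *));

    if (obj)
        memcpy(obj + pool->size, &pool, sizeof(pool));
    return obj;
}

/* Return 1 if the trailing tag still designates <pool>, 0 otherwise. */
static int tag_matches(const struct pool *pool, const void *obj)
{
    const struct pool *tag;

    memcpy(&tag, (const char *)obj + pool->size, sizeof(tag));
    return tag == pool;
}

/* Release an object; a wrong pool or an overwritten tag aborts here. */
static void tagged_free(struct pool *pool, void *obj)
{
    assert(tag_matches(pool, obj));
    free(obj);
}
```

Writing even one byte past the end of the object alters the tag, which is how
a small overflow gets reported at release time rather than going unnoticed.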
+
+Thus an object will look like this depending on whether it's in the cache or
+is currently in use:
+
+         in cache                 in use
+      +------------+          +------------+
+   <--+ by_pool.p  |          |   N bytes  |
+      | by_pool.n  +-->       |            |
+      +------------+          |N=16 min on |
+   <--+ by_lru.p   |          |   32-bit,  |
+      | by_lru.n   +-->       | 32 min on  |
+      +------------+          |   64-bit   |
+      :            :          :            :
+      |   N bytes  |          |            |
+      +------------+          +------------+ \  optional, only if
+      :  (unused)  :          :  pool ptr  :  > DEBUG_MEMORY_POOLS
+      +------------+          +------------+ /  is set at build time
+                                                or -dMtag at boot time
+
+Right now no provisions are made to return objects aligned on larger boundaries
+than those currently covered by malloc() (i.e. two pointers). This need appears
+from time to time and the layout above might evolve a little bit if needed.
+
+
+4. Storage in the process-wide shared pool
+------------------------------------------
+
+In order for the shared pool not to be a contention point in a multi-threaded
+environment, objects are allocated from or released to shared pools by clusters
+of a few objects at once. The maximum number of objects that may be moved to or
+from a shared pool at once is defined by CONFIG_HAP_POOL_CLUSTER_SIZE at build
+time, and currently defaults to 8.
+
+In order to remain scalable, the shared pool has to make some tradeoffs to
+limit the number of atomic operations and the duration of any locked operation.
+As such, it's composed of a single-linked list of clusters, themselves made of
+a single-linked list of objects.
+
+Clusters and objects are of the same type "pool_item" and are accessed from the
+pool's "free_list" member. This member points to the latest pool_item inserted
+into the pool by a release operation. And the pool_item's "next" member points
+to the next pool_item, which was the one present in the pool's free_list just
+before the pool_item was inserted, and the last pool_item in the list simply
+has a NULL "next" field.
+
+The pool_item's "down" pointer points down to the next object of the same
+cluster, which will be released or allocated at the same time as the first one.
+Each of these items also has a NULL "next" field, and they are chained by their
+respective "down" pointers until the last one is detected by a NULL value.
+
+This results in the following layout:
+
+   pool             pool_item   pool_item   pool_item
+   +-----------+    +------+    +------+    +------+
+   | free_list +--> | next +--> | next +--> | NULL |
+   +-----------+    +------+    +------+    +------+
+                    | down |    | NULL |    | down |
+                    +--+---+    +------+    +--+---+
+                       |                       |
+                       V                       V
+                    +------+                +------+
+                    | NULL |                | NULL |
+                    +------+                +------+
+                    | down |                | NULL |
+                    +--+---+                +------+
+                       |
+                       V
+                    +------+
+                    | NULL |
+                    +------+
+                    | NULL |
+                    +------+
+
+Allocating an entry is only a matter of performing two atomic operations on
+the free_list and reading the first pool_item's "next" value:
+
+ - atomically mark the free_list as being updated by writing a "magic" pointer
+ - read the first pool_item's "next" field
+ - atomically replace the free_list with this value
+
+This results in a fast operation that retrieves a whole cluster at once.
+Then outside of the critical section entries are walked over and inserted into
+the local cache one at a time. In order to keep the code simple and efficient,
+objects allocated from the shared pool are all placed into the local cache, and
+only then the first one is allocated from the cache. This operation is
+performed by the dedicated function pool_refill_local_from_shared() which is
+called from pool_get_from_cache() when the cache is empty. It means there is an
+overhead of two list insert/delete operations for the first object and that
+could be avoided at the expense of more complex code in the fast path, but this
+is negligible since it only concerns objects that need to be visited anyway.
+
+Freeing a group of objects consists in performing the operation the other way
+around:
+
+ - atomically mark the free_list as being updated by writing a "magic" pointer
+ - write the free_list value to the to-be-released item's "next" entry
+ - atomically replace the free_list with the pool_item's pointer
+
+The cluster will simply have to be prepared before being sent to the shared
+pool. The operation of releasing a cluster at once is performed by function
+pool_put_to_shared_cache() which is called from pool_evict_last_items() which
+itself is responsible for building the clusters.
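
The two three-step sequences above (retrieving and releasing a cluster) can be
sketched with C11 atomics. This is a simplified illustration and not HAProxy's
implementation: the BUSY marker value, "struct shared" and the shared_get() and
shared_put() names are assumptions, and the busy-wait loop is kept as naive as
possible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct item {
    struct item *next;      /* next cluster in the shared pool */
    struct item *down;      /* next object of the same cluster */
};

#define BUSY ((struct item *)1)     /* hypothetical "magic" pointer */

struct shared {
    _Atomic(struct item *) free_list;
};

/* Detach and return the first cluster, or NULL if the pool is empty. */
static struct item *shared_get(struct shared *s)
{
    struct item *head;

    /* mark the list as being updated; retry while another thread owns it */
    do {
        head = atomic_exchange(&s->free_list, BUSY);
    } while (head == BUSY);

    if (!head) {
        atomic_store(&s->free_list, NULL);      /* nothing to take */
        return NULL;
    }
    /* read the first item's "next" and publish it as the new head */
    atomic_store(&s->free_list, head->next);
    head->next = NULL;
    return head;
}

/* Release a prepared cluster (items already chained via "down"). */
static void shared_put(struct shared *s, struct item *cluster)
{
    struct item *head;

    assert(cluster != NULL && cluster != BUSY);
    do {
        head = atomic_exchange(&s->free_list, BUSY);
    } while (head == BUSY);

    cluster->next = head;                       /* old head goes behind */
    atomic_store(&s->free_list, cluster);       /* publish the new head */
}
```

Only the head of the list is touched while the marker is in place; walking the
"down" chain of the retrieved cluster happens afterwards, outside of the
critical section.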
+
+Due to the way objects are stored, it is important to try to group objects as
+much as possible when releasing them because this is what will condition their
+retrieval as groups as well. This is the reason why pool_evict_last_items()
+uses the LRU to find a first entry but tries to pick several items at once from
+a single cache. Tests have shown that CONFIG_HAP_POOL_CLUSTER_SIZE set to 8
+achieves up to 6-6.5 objects on average per operation, which effectively
+divides by as much the average time spent per object by each thread and pushes
+the contention point further.
+
+Also, grouping items in clusters is a property of the process-wide shared pool
+and not of the thread-local caches. This means that there is no grouped
+operation when not using the shared pool (mode "2" in the diagram above).
+
+
+5. API
+------
+
+The following functions are public and available for user code:
+
+struct pool_head *create_pool(char *name, uint size, uint flags)
+ Create a new pool named <name> for objects of size <size> bytes. Pool
+ names are truncated to their first 11 characters. Pools of very similar
+ size will usually be merged if both have set the flag MEM_F_SHARED in
+ <flags>. When DEBUG_DONT_SHARE_POOLS was set at build time, or
+ "-dMno-merge" is passed on the executable's command line, the pools
+ also need to have the exact same name to be merged. In addition, unless
+ MEM_F_EXACT is set in <flags>, the object size will usually be rounded
+ up to a multiple of four pointers (16 or 32 bytes). The name that will
+ appear in the pool upon merging is the name of the first created pool. The
+ returned pointer is the new (or reused) pool head, or NULL upon error.
+ Pools created this way must be destroyed using pool_destroy().
+
+void *pool_destroy(struct pool_head *pool)
+ Destroy pool <pool>, that is, all of its unused objects are freed and
+ the structure is freed as well if the pool didn't have any used objects
+ anymore. In this case NULL is returned. If some objects remain in use,
+ the pool is preserved and its pointer is returned. This ought to be
+ used essentially on exit or in rare situations where some internal
+ entities that hold pools have to be destroyed.
+
+void pool_destroy_all(void)
+ Destroy all pools, without checking which ones still have used entries.
+ This is only meant for use on exit.
+
+void *__pool_alloc(struct pool_head *pool, uint flags)
+ Allocate an entry from the pool <pool>. The allocator will first look
+ for an object in the thread-local cache if enabled, then in the shared
+ pool if enabled, then will fall back to the operating system's default
+ allocator. NULL is returned if the object couldn't be allocated (due to
+ configured limits or lack of memory). Objects allocated this way have to
+ be released using pool_free(). Like with malloc(), by default the
+ contents of the returned object are undefined. If memory poisoning is
+ enabled, the object will be filled with the poisoning byte. If the
+ global "tune.fail-alloc" setting is non-zero and DEBUG_FAIL_ALLOC is
+ enabled, a random number generator will be called to randomly return a
+ NULL. The allocator's behavior may be adjusted using a few flags passed
+ in <flags>:
+ - POOL_F_NO_POISON : when set, disables memory poisoning (e.g. when
+ pointless and expensive, like for buffers)
+ - POOL_F_MUST_ZERO : when set, the memory area will be zeroed before
+ being returned, similar to what calloc() does
+ - POOL_F_NO_FAIL : when set, disables the random allocation failure,
+ e.g. for use during early init code or critical sections.
+
+void *pool_alloc(struct pool_head *pool)
+ This is an exact equivalent of __pool_alloc(pool, 0). It is the regular
+ way to allocate entries from a pool.
+
+void *pool_alloc_nocache(struct pool_head *pool)
+ Allocate an entry from the pool <pool>, bypassing the cache. If shared
+ pools are enabled, they will be consulted first. Otherwise the object
+ is allocated using the operating system's default allocator. This is
+ essentially used during early boot to pre-allocate a number of objects
+ for pools which require a minimum number of entries to exist.
+
+void *pool_zalloc(struct pool_head *pool)
+ This is an exact equivalent of __pool_alloc(pool, POOL_F_MUST_ZERO).
+
+void pool_free(struct pool_head *pool, void *ptr)
+ Free an entry allocated from one of the pool_alloc() functions above
+ from pool <pool>. The object will be placed into the thread-local cache
+ if enabled, or in the shared pool if enabled, or will be released using
+ the operating system's default allocator. When a local cache is
+ enabled, if the local cache size becomes larger than 75% of the maximum
+ size configured at build time, some objects will be evicted to the
+ shared pool. Such objects are taken first from the same pool, but if
+ the total size is really huge, other pools might be checked as well.
+ Some debugging options enabled at build time may add extra checks so
+ that the process will immediately crash if the object was not allocated
+ from this pool or experienced an overflow or some memory corruption.
+
+void pool_flush(struct pool_head *pool)
+ Free all unused objects from shared pool <pool>. Thread-local caches
+ are not affected. This is essentially used when running low on memory
+ or when stopping, in order to release a maximum amount of memory for
+ the new process.
+
+void pool_gc(struct pool_head *pool)
+ Free all unused objects from all pools, but respecting the minimum
+ number of spare objects required for each of them. Then, for operating
+ systems which support it, indicate to the system that all unused memory
+ can be released. Thread-local caches are not affected. This operation
+ differs from pool_flush() in that it is run locklessly, under thread
+ isolation, and on all pools in a row. It is called by the SIGQUIT
+ signal handler and upon exit. Note that the obsolete argument <pool> is
+ not used and the convention is to pass NULL there.
+
+void dump_pools_to_trash(void)
+ Dump the current status of all pools into the trash buffer. This is
+ essentially used by the "show pools" CLI command or the SIGQUIT signal
+ handler to dump them on stderr. The total report size may not exceed
+ the size of the trash buffer. If it does, some entries will be missing.
+
+void dump_pools(void)
+ Dump the current status of all pools to stderr. This just calls
+ dump_pools_to_trash() and writes the trash to stderr.
+
+int pool_total_failures(void)
+ Report the total number of failed allocations. This is solely used to
+ report the "PoolFailed" metrics of the "show info" output. The total
+ is calculated on the fly by summing the number of failures in all pools
+ and is only meant to be used as an indicator rather than a precise
+ measure.
+
+ullong pool_total_allocated(void)
+ Report the total number of bytes allocated in all pools, for reporting
+ in the "PoolAlloc_MB" field of the "show info" output. The total is
+ calculated on the fly by summing the number of allocated bytes in all
+ pools and is only meant to be used as an indicator rather than a
+ precise measure.
+
+ullong pool_total_used(void)
+ Report the total number of bytes used in all pools, for reporting in
+ the "PoolUsed_MB" field of the "show info" output. The total is
+ calculated on the fly by summing the number of used bytes in all pools
+ and is only meant to be used as an indicator rather than a precise
+ measure. Note that objects present in caches are accounted as used.
+
+Some other functions exist and are only used by the pools code itself. While
+not strictly forbidden to use outside of this code, it is generally recommended
+to avoid touching them in order not to create undesired dependencies that will
+complicate maintenance.
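
The calling pattern for the allocation functions listed above can be sketched
as follows. To keep the example self-contained, the pool functions below are
stand-ins written for this illustration only: they reuse the documented names
and signatures but simply pass through to malloc(), in the spirit of mode (3),
and "struct pool_head" is reduced to a hypothetical minimum. "struct session"
and demo() are likewise hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* --- stand-ins, NOT HAProxy's code: same signatures, malloc()-backed --- */
struct pool_head {
    char name[12];          /* 11 characters plus the trailing zero */
    unsigned int size;      /* object size in bytes */
    unsigned int used;      /* number of objects currently in use */
};

static struct pool_head *create_pool(char *name, unsigned int size,
                                     unsigned int flags)
{
    struct pool_head *pool = calloc(1, sizeof(*pool));

    (void)flags;            /* merging and alignment are not modelled */
    if (!pool)
        return NULL;
    strncpy(pool->name, name, sizeof(pool->name) - 1);
    pool->size = size;
    return pool;
}

static void *pool_alloc(struct pool_head *pool)
{
    void *obj = malloc(pool->size);

    if (obj)
        pool->used++;
    return obj;
}

static void *pool_zalloc(struct pool_head *pool)
{
    void *obj = calloc(1, pool->size);

    if (obj)
        pool->used++;
    return obj;
}

static void pool_free(struct pool_head *pool, void *ptr)
{
    assert(pool->used > 0);
    pool->used--;
    free(ptr);
}

/* returns NULL once destroyed, or <pool> if some objects remain in use */
static void *pool_destroy(struct pool_head *pool)
{
    if (pool->used)
        return pool;
    free(pool);
    return NULL;
}

/* --- caller side: the pattern a user of the API would follow --- */
struct session { int id; char peer[40]; };

static int demo(void)
{
    struct pool_head *pool = create_pool("session", sizeof(struct session), 0);
    struct session *s;

    if (!pool)
        return -1;
    s = pool_zalloc(pool);              /* zeroed, similar to calloc() */
    if (!s) {                           /* allocation may return NULL */
        pool_destroy(pool);
        return -1;
    }
    s->id = 1;
    pool_free(pool, s);                 /* release to the same pool */
    return pool_destroy(pool) == NULL ? 0 : -1;
}
```

The points worth retaining are on the caller side: every allocation is checked
for NULL, each object is released to the pool it came from, and the pool is
only destroyed once no object remains in use.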
+
+A few macros exist to ease the declaration of pools:
+
+DECLARE_POOL(ptr, name, size)
+ Placed at the top level of a file, this declares a global memory pool
+ as variable <ptr>, name <name> and size <size> bytes per element. This
+ is made via a call to REGISTER_POOL() and by assigning the resulting
+ pointer to variable <ptr>. <ptr> will be created of type "struct
+ pool_head *". If the pool needs to be visible outside of the file
+ (which is likely), it will also need to be declared somewhere as
+ "extern struct pool_head *<ptr>;". It is recommended to place such
+ declarations very early in the source file so that the variable is
+ already known to all subsequent functions which may use it.
+
+DECLARE_STATIC_POOL(ptr, name, size)
+ Placed at the top level of a file, this declares a static memory pool
+ as variable <ptr>, name <name> and size <size> bytes per element. This
+ is made via a call to REGISTER_POOL() and by assigning the resulting
+ pointer to local variable <ptr>. <ptr> will be created of type "static
+ struct pool_head *". It is recommended to place such declarations very
+ early in the source file so that the variable is already known to all
+ subsequent functions which may use it.
+
+
+6. Build options
+----------------
+
+A number of build-time defines allow tuning the pools' behavior. All of them
+have to be enabled using "-Dxxx" or "-Dxxx=yyy" in the makefile's DEBUG
+variable.
+
+DEBUG_NO_POOLS
+ When this is set, pools are entirely disabled, and allocations are made
+ using malloc() instead. This is not recommended for production but may
+ be useful for tracing allocations. It corresponds to "-dMno-cache" at
+ boot time.
+
+DEBUG_MEMORY_POOLS
+ When this is set, an extra pointer is allocated at the end of each
+ object to reference the pool the object was allocated from and detect
+ buffer overflows. Then, pool_free() will provoke a crash in case it
+ detects an anomaly (pointer at the end not matching the pool). It
+ corresponds to "-dMtag" at boot time.
+
+DEBUG_FAIL_ALLOC
+ When enabled, a global setting "tune.fail-alloc" may be set to a non-
+ zero value representing a percentage of memory allocations that will be
+ made to fail in order to stress the calling code. It corresponds to
+ "-dMfail" at boot time.
+
+DEBUG_DONT_SHARE_POOLS
+ When enabled, pools of similar sizes are not merged unless they have the
+ exact same name. It corresponds to "-dMno-merge" at boot time.
+
+DEBUG_UAF
+ When enabled, pools are disabled and all allocations and releases pass
+ through mmap() and munmap(). The memory usage significantly inflates
+ and the performance degrades, but this allows detecting a lot of
+ use-after-free conditions by crashing the program at the first abnormal
+ access. This should not be used in production. It corresponds to
+ boot-time option "-dMuaf". Caching is disabled but may be re-enabled
+ using "-dMcache".
+
+DEBUG_POOL_INTEGRITY
+ When enabled, objects picked from the cache are checked for corruption
+ by comparing their contents against a pattern that was placed when they
+ were inserted into the cache. Objects are also allocated in the reverse
+ order, from the oldest one to the most recent, so as to maximize the
+ ability to detect such a corruption. The goal is to detect writes after
+ free (or possibly hardware memory corruptions). Contrary to DEBUG_UAF
+ this cannot detect reads after free, but may possibly detect later
+ corruptions and will not consume extra memory. The CPU usage will
+ increase a bit due to the cost of filling/checking the area and for the
+ preference for cold cache instead of hot cache, though not as much as
+ with DEBUG_UAF. This option is meant to be usable in production. It
+ corresponds to boot-time options "-dMcold-first,integrity".
+
+DEBUG_POOL_TRACING
+ When enabled, the callers of pool_alloc() and pool_free() will be
+ recorded into an extra memory area placed after the end of the object.
+ This may only be required by developers who want to get a few more
+ hints about code paths involved in some crashes, but will serve no
+ purpose outside of this. It remains compatible with (and complements
+ well) DEBUG_POOL_INTEGRITY above. Such information becomes meaningless once
+ the objects leave the thread-local cache. It corresponds to boot-time
+ option "-dMcaller".
+
+DEBUG_MEM_STATS
+ When enabled, all malloc/calloc/realloc/strdup/free calls are accounted
+ for per call place (file+line number), and may be displayed or reset on
+ the CLI using "debug dev memstats". This is essentially used to detect
+ potential leaks or abnormal usages. When pools are enabled (default),
+ such calls are rare and the output will mostly contain calls induced by
+ libraries. When pools are disabled, about all calls to pool_alloc() and
+ pool_free() will also appear since they will be remapped to standard
+ functions.
+
+CONFIG_HAP_GLOBAL_POOLS
+ When enabled, process-wide shared pools will be forcefully enabled even
+ if not considered useful on the platform. The default is to let haproxy
+ decide based on the OS and C library. It corresponds to boot-time
+ option "-dMglobal".
+
+CONFIG_HAP_NO_GLOBAL_POOLS
+ When enabled, process-wide shared pools will be forcefully disabled
+ even if considered useful on the platform. The default is to let
+ haproxy decide based on the OS and C library. It corresponds to
+ boot-time option "-dMno-global".
+
+CONFIG_HAP_POOL_CACHE_SIZE
+ This allows one to define the default size of the per-thread cache, in
+ bytes. The default value is 512 kB (524288). Smaller values will use
+ less memory at the expense of a possibly higher CPU usage when using
+ many threads. Higher values will give diminishing returns on
+ performance while using much more memory. Usually there is no benefit
+ in using more than a per-core L2 cache size. It would be better not to
+ set this value lower than a few times the size of a buffer (bufsize,
+ defaults to 16 kB). In addition, keep in mind that this option may be
+ changed at runtime using "tune.memory.hot-size".
+
+CONFIG_HAP_POOL_CLUSTER_SIZE
+ This allows one to define the maximum number of objects that will be
+ grouped together in an allocation from the shared pool. Values 4 to 8
+ have experimentally shown good results with 16 threads. On systems with
+ more cores or loosely coupled caches exhibiting slow atomic operations,
+ it could possibly make sense to slightly increase this value.