2022-02-24 - Pools structure and API

1. Background
-------------

Memory allocation is a complex problem covered by a massive amount of
literature. Memory allocators found in field cover a broad spectrum of
capabilities, performance, fragmentation, efficiency etc.

The main difficulty of memory allocation comes from finding the optimal chunks
for arbitrary sized requests, that will still preserve a low fragmentation
level. Doing this well is often expensive in CPU usage and/or memory usage.

In programs like HAProxy that deal with a large number of fixed size objects,
there is no point having to endure all this risk of fragmentation, and the
associated costs (sometimes up to several milliseconds with certain minimalist
allocators) are simply not acceptable. A better approach consists in grouping
frequently used objects by size, knowing that due to the high repetitiveness of
operations, a freed object will immediately be needed for another operation.

This grouping of objects by size is what is called a pool. Pools are created
for certain frequently allocated objects, are usually merged together when they
are of the same size (or almost the same size), and significantly reduce the
number of calls to the memory allocator.

With the arrival of threads, pools started to become a bottleneck so they now
implement an optional thread-local lockless cache. Finally with the arrival of
really efficient memory allocator in modern operating systems, the shared part
has also become optional so that it doesn't consume memory if it does not bring
any value.

In 2.6-dev2, a number of debugging options that used to be configured at build
time only changed to boot-time and can be modified using keywords passed after
"-dM" on the command line, which sets or clears bits in the pool_debugging
variable. The build-time options still affect the default settings however.
Default values may be consulted using "haproxy -dMhelp".


2. Principles
-------------

The pools architecture is selected at build time. The main options are:

  - thread-local caches and process-wide shared pool enabled (1)

    This is the default situation on most operating systems. Each thread has
    its own local cache, and when depleted it refills from the process-wide
    pool that avoids calling the standard allocator too often. It is possible
    to force this mode at build time by setting CONFIG_HAP_GLOBAL_POOLS or at
    boot time with "-dMglobal".

  - thread-local caches only are enabled (2)

    This is the situation on operating systems where a fast and modern memory
    allocator is detected and when it is estimated that the process-wide shared
    pool will not bring any benefit. This detection is automatic at build time,
    but may also be forced at build tmie by setting CONFIG_HAP_NO_GLOBAL_POOLS
    or at boot time with "-dMno-global".

  - pass-through to the standard allocator (3)

    This is used when one absolutely wants to disable pools and rely on regular
    malloc() and free() calls, essentially in order to trace memory allocations
    by call points, either internally via DEBUG_MEM_STATS, or externally via
    tools such as Valgrind. This mode of operation may be forced at build time
    by setting DEBUG_NO_POOLS or at boot time with "-dMno-cache".

  - pass-through to an mmap-based allocator for debugging (4)

    This is used only during deep debugging when trying to detect various
    conditions such as use-after-free. In this case each allocated object's
    size is rounded up to a multiple of a page size (4096 bytes) and an
    integral number of pages is allocated for each object using mmap(),
    surrounded by two unaccessible holes that aim to detect some out-of-bounds
    accesses. Released objects are instantly freed using munmap() so that any
    immediate subsequent access to the memory area crashes the process if the
    area had not been reallocated yet. This mode can be enabled at build time
    by setting DEBUG_UAF, or at run time by disabling pools and enabling UAF
    with "-dMuaf". It tends to consume a lot of memory and not to scale at all
    with concurrent calls, that tends to make the system stall. The watchdog
    may even trigger on some slow allocations.

There are no more provisions for running with a shared pool but no thread-local
cache: the shared pool's main goal is to compensate for the expensive calls to
the memory allocator. This gain may be huge on tiny systems using basic
allocators, but the thread-local cache will already achieve this. And on larger
threaded systems, the shared pool's benefit is visible when the underlying
allocator scales poorly, but in this case the shared pool would suffer from
the same limitations without its thread-local cache and wouldn't provide any
benefit.

Summary of the various operation modes:

                  (1)            (2)            (3)            (4)

                  User           User           User           User
                   |              |              |              |
      pool_alloc() V              V              |              |
              +---------+    +---------+         |              |
              | Thread  |    | Thread  |         |              |
              |  Local  |    |  Local  |         |              |
              |  Cache  |    |  Cache  |         |              |
              +---------+    +---------+         |              |
                   |              |              |              |
    pool_refill*() V              |              |              |
              +---------+         |              |              |
              | Shared  |         |              |              |
              |  Pool   |         |              |              |
              +---------+         |              |              |
                   |              |              |              |
          malloc() V              V              V              |
              +---------+    +---------+    +---------+         |
              | Library |    | Library |    | Library |         |
              +---------+    +---------+    +---------+         |
                   |              |              |              |
            mmap() V              V              V              V
              +---------+    +---------+    +---------+    +---------+
              |   OS    |    |   OS    |    |   OS    |    |   OS    |
              +---------+    +---------+    +---------+    +---------+

One extra build define, DEBUG_FAIL_ALLOC, is used to enforce random allocation
failure in pool_alloc() by randomly returning NULL, to test that callers
properly handle allocation failures. It may also be enabled at boot time using
"-dMfail". In this case the desired average rate of allocation failures can be
fixed by global setting "tune.fail-alloc" expressed in percent.

The thread-local caches contain the freshest objects. Its total size amounts to
the number of bytes set in global.tune.pool_cache_size and that may be adjusted
by the "tune.memory.hot-size" global option, which itself defaults to build
time setting CONFIG_HAP_POOL_CACHE_SIZE, which was 1MB before 2.6 and 512kB
after. The aim is to keep hot objects that still fit in the CPU core's private
L2 cache. Once these objects do not fit into the cache anymore, there's no
benefit keeping them local to the thread, so they'd rather be returned to the
shared pool or the main allocator so that any other thread may make use of
them. Under extreme thread contention the cost of accessing shared structures
in the global cache or in malloc() may still be important and it may prove
useful to increase the thread-local cache size.


3. Storage in thread-local caches
---------------------------------

This section describes how objects are linked in thread local caches. This is
not meant to be a concern for users of the pools API but it can be useful when
inspecting post-mortem dumps or when trying to figure certain size constraints.

Objects are stored in the local cache using a doubly-linked list. This ensures
that they can be visited by freshness order like a stack, while at the same
time being able to access them from oldest to newest when it is needed to
evict coldest ones first:

  - releasing an object to the cache always puts it on the top.

  - allocating an object from the cache always takes the topmost one, hence the
    freshest one.

  - scanning for older objects to evict starts from the bottom, where the
    oldest ones are located

To that end, each thread-local cache keeps a list head in the "list" member of
its "pool_cache_head" descriptor, that links all objects cast to type
"pool_cache_item" via their "by_pool" member.

Note that the mechanism described above only works for a single pool. When
trying to limit the total cache size to a certain value, all pools included,
there is also a need to arrange all objects from all pools together in the
local caches. For this, each thread_ctx maintains a list head of recently
released objects, all pools included, in its member "pool_lru_head". All items
in a thread-local cache are linked there via their "by_lru" member.

This means that releasing an object using pool_free() consists in inserting
it at the beginning of two lists:
  - the local pool_cache_head's "list" list head
  - the thread context's "pool_lru_head" list head

Allocating an object consists in picking the first entry from the pool's "list"
and deleting its "by_pool" and "by_lru" links.

Evicting an object consists in scanning the thread context's "pool_lru_head"
backwards and deleting the object's "by_pool" and "by_lru" links.

Given that entries are both inserted and removed synchronously, we have the
guarantee that the oldest object in the thread's LRU list is always the oldest
object in its pool, and that the next element is the cache's list head. This is
what allows the LRU eviction mechanism to figure what pool an object belongs to
when releasing it.

Note:
 | Since a pool_cache_item has two list entries, on 64-bit systems it will be
 | 32-bytes long. This is the smallest size that a pool may be, and any smaller
 | size will automatically be rounded up to this size.

When build option DEBUG_POOL_INTEGRITY is set, or the boot-time option
"-dMintegrity" is passed on the command line, the area of the object between
the two list elements and the end according to pool->size will be filled with
pseudo-random words during pool_put_to_cache(), and these words will be
compared between each other during pool_get_from_cache(), and the process will
crash in case any bit differs, as this would indicate that the memory area was
modified after the free. The pseudo-random pattern is in fact incremented by
(~0)/3 upon each free so that roughly half of the bits change each time and we
maximize the likelihood of detecting a single bit flip in either direction. In
order to avoid an immediate reuse and maximize the time the object spends in
the cache, when this option is set, objects are picked from the cache from the
oldest one instead of the freshest one. This way even late memory corruptions
have a chance to be detected.

When build option DEBUG_MEMORY_POOLS is set, or the boot-time option "-dMtag"
is passed on the executable's command line, pool objects are allocated with
one extra pointer compared to the requested size, so that the bytes that follow
the memory area point to the pool descriptor itself as long as the object is
allocated via pool_alloc(). Upon releasing via pool_free(), the pointer is
compared and the code will crash in if it differs. This allows to detect both
memory overflows and object released to the wrong pool (code bug resulting from
a copy-paste error typically).

Thus an object will look like this depending whether it's in the cache or is
currently in use:

             in cache                 in use
          +------------+          +------------+
       <--+  by_pool.p |          |   N bytes  |
          |  by_pool.n +-->       |            |
          +------------+          |N=16 min on |
       <--+  by_lru.p  |          |  32-bit,   |
          |  by_lru.n  +-->       |  32 min on |
          +------------+          |  64-bit    |
          :            :          :            :
          |   N bytes  |          |            |
          +------------+          +------------+ \   optional, only if
          :  (unused)  :          :  pool ptr  :  >  DEBUG_MEMORY_POOLS
          +------------+          +------------+ /   is set at build time
                                                     or -dMtag at boot time

Right now no provisions are made to return objects aligned on larger boundaries
than those currently covered by malloc() (i.e. two pointers). This need appears
from time to time and the layout above might evolve a little bit if needed.


4. Storage in the process-wide shared pool
------------------------------------------

In order for the shared pool not to be a contention point in a multi-threaded
environment, objects are allocated from or released to shared pools by clusters
of a few objects at once. The maximum number of objects that may be moved to or
from a shared pool at once is defined by CONFIG_HAP_POOL_CLUSTER_SIZE at build
time, and currently defaults to 8.

In order to remain scalable, the shared pool has to make some tradeoffs to
limit the number of atomic operations and the duration of any locked operation.
As such, it's composed of a single-linked list of clusters, themselves made of
a single-linked list of objects.

Clusters and objects are of the same type "pool_item" and are accessed from the
pool's "free_list" member. This member points to the latest pool_item inserted
into the pool by a release operation. And the pool_item's "next" member points
to the next pool_item, which was the one present in the pool's free_list just
before the pool_item was inserted, and the last pool_item in the list simply
has a NULL "next" field.

The pool_item's "down" pointer points down to the next objects part of the same
cluster, that will be released or allocated at the same time as the first one.
Each of these items also has a NULL "next" field, and are chained by their
respective "down" pointers until the last one is detected by a NULL value.

This results in the following layout:

      pool           pool_item   pool_item   pool_item
    +-----------+    +------+    +------+    +------+
    | free_list +--> | next +--> | next +--> | NULL |
    +-----------+    +------+    +------+    +------+
                     | down |    | NULL |    | down |
                     +--+---+    +------+    +--+---+
                        |                       |
                        V                       V
                     +------+                +------+
                     | NULL |                | NULL |
                     +------+                +------+
                     | down |                | NULL |
                     +--+---+                +------+
                        |
                        V
                     +------+
                     | NULL |
                     +------+
                     | NULL |
                     +------+

Allocating an entry is only a matter of performing two atomic allocations on
the free_list and reading the pool's "next" value:

  - atomically mark the free_list as being updated by writing a "magic" pointer
  - read the first pool_item's "next" field
  - atomically replace the free_list with this value

This results in a fast operation that instantly retrieves a cluster at once.
Then outside of the critical section entries are walked over and inserted into
the local cache one at a time. In order to keep the code simple and efficient,
objects allocated from the shared pool are all placed into the local cache, and
only then the first one is allocated from the cache. This operation is
performed by the dedicated function pool_refill_local_from_shared() which is
called from pool_get_from_cache() when the cache is empty. It means there is an
overhead of two list insert/delete operations for the first object and that
could be avoided at the expense of more complex code in the fast path, but this
is negligible since it only concerns objects that need to be visited anyway.

Freeing a group of objects consists in performing the operation the other way
around:

  - atomically mark the free_list as being updated by writing a "magic" pointer
  - write the free_list value to the to-be-released item's "next" entry
  - atomically replace the free_list with the pool_item's pointer

The cluster will simply have to be prepared before being sent to the shared
pool. The operation of releasing a cluster at once is performed by function
pool_put_to_shared_cache() which is called from pool_evict_last_items() which
itself is responsible for building the clusters.

Due to the way objects are stored, it is important to try to group objects as
much as possible when releasing them because this is what will condition their
retrieval as groups as well. This is the reason why pool_evict_last_items()
uses the LRU to find a first entry but tries to pick several items at once from
a single cache. Tests have shown that CONFIG_HAP_POOL_CLUSTER_SIZE set to 8
achieves up to 6-6.5 objects on average per operation, which effectively
divides by as much the average time spent per object by each thread and pushes
the contention point further.

Also, grouping items in clusters is a property of the process-wide shared pool
and not of the thread-local caches. This means that there is no grouped
operation when not using the shared pool (mode "2" in the diagram above).


5. API
------

The following functions are public and available for user code:

struct pool_head *create_pool(char *name, uint size, uint flags)
        Create a new pool named <name> for objects of size <size> bytes. Pool
        names are truncated to their first 11 characters. Pools of very similar
        size will usually be merged if both have set the flag MEM_F_SHARED in
        <flags>. When DEBUG_DONT_SHARE_POOLS was set at build time, or
        "-dMno-merge" is passed on the executable's command line, the pools
        also need to have the exact same name to be merged. In addition, unless
        MEM_F_EXACT is set in <flags>, the object size will usually be rounded
        up to the size of pointers (16 or 32 bytes). The name that will appear
        in the pool upon merging is the name of the first created pool. The
        returned pointer is the new (or reused) pool head, or NULL upon error.
        Pools created this way must be destroyed using pool_destroy().

void *pool_destroy(struct pool_head *pool)
        Destroy pool <pool>, that is, all of its unused objects are freed and
        the structure is freed as well if the pool didn't have any used objects
        anymore. In this case NULL is returned. If some objects remain in use,
        the pool is preserved and its pointer is returned. This ought to be
        used essentially on exit or in rare situations where some internal
        entities that hold pools have to be destroyed.

void pool_destroy_all(void)
        Destroy all pools, without checking which ones still have used entries.
        This is only meant for use on exit.

void *__pool_alloc(struct pool_head *pool, uint flags)
        Allocate an entry from the pool <pool>. The allocator will first look
        for an object in the thread-local cache if enabled, then in the shared
        pool if enabled, then will fall back to the operating system's default
        allocator. NULL is returned if the object couldn't be allocated (due to
        configured limits or lack of memory). Object allocated this way have to
        be released using pool_free(). Like with malloc(), by default the
        contents of the returned object are undefined. If memory poisonning is
        enabled, the object will be filled with the poisonning byte. If the
        global "pool.fail-alloc" setting is non-zero and DEBUG_FAIL_ALLOC is
        enabled, a random number generator will be called to randomly return a
        NULL. The allocator's behavior may be adjusted using a few flags passed
        in <flags>:
           - POOL_F_NO_POISON : when set, disables memory poisonning (e.g. when
             pointless and expensive, like for buffers)
           - POOL_F_MUST_ZERO : when set, the memory area will be zeroed before
             being returned, similar to what calloc() does
           - POOL_F_NO_FAIL : when set, disables the random allocation failure,
             e.g. for use during early init code or critical sections.

void *pool_alloc(struct pool_head *pool)
        This is an exact equivalent of __pool_alloc(pool, 0). It is the regular
        way to allocate entries from a pool.

void *pool_alloc_nocache(struct pool_head *pool)
        Allocate an entry from the pool <pool>, bypassing the cache. If shared
        pools are enabled, they will be consulted first. Otherwise the object
        is allocated using the operating system's default allocator. This is
        essentially used during early boot to pre-allocate a number of objects
        for pools which require a minimum number of entries to exist.

void *pool_zalloc(struct pool_head *pool)
        This is an exact equivalent of __pool_alloc(pool, POOL_F_MUST_ZERO).

void pool_free(struct pool_head *pool, void *ptr)
        Free an entry allocate from one of the pool_alloc() functions above
        from pool <pool>. The object will be placed into the thread-local cache
        if enabled, or in the shared pool if enabled, or will be released using
        the operating system's default allocator. When a local cache is
        enabled, if the local cache size becomes larger than 75% of the maximum
        size configured at build time, some objects will be evicted to the
        shared pool. Such objects are taken first from the same pool, but if
        the total size is really huge, other pools might be checked as well.
        Some extra checks enabled at build time may enforce extra checks so
        that the process will immediately crash if the object was not allocated
        from this pool or experienced an overflow or some memory corruption.

void pool_flush(struct pool_head *pool)
        Free all unused objects from shared pool <pool>. Thread-local caches
        are not affected. This is essentially used when running low on memory
        or when stopping, in order to release a maximum amount of memory for
        the new process.

void pool_gc(struct pool_head *pool)
        Free all unused objects from all pools, but respecting the minimum
        number of spare objects required for each of them. Then, for operating
        systems which support it, indicate the system that all unused memory
        can be released. Thread-local caches are not affected. This operation
        differs from pool_flush() in that it is run locklessly, under thread
        isolation, and on all pools in a row. It is called by the SIGQUIT
        signal handler and upon exit. Note that the obsolete argument <pool> is
        not used and the convention is to pass NULL there.

void dump_pools_to_trash(void)
        Dump the current status of all pools into the trash buffer. This is
        essentially used by the "show pools" CLI command or the SIGQUIT signal
        handler to dump them on stderr. The total report size may not exceed
        the size of the trash buffer. If it does, some entries will be missing.

void dump_pools(void)
        Dump the current status of all pools to stderr. This just calls
        dump_pools_to_trash() and writes the trash to stderr.

int pool_total_failures(void)
        Report the total number of failed allocations. This is solely used to
        report the "PoolFailed" metrics of the "show info" output. The total
        is calculated on the fly by summing the number of failures in all pools
        and is only meant to be used as an indicator rather than a precise
        measure.

ullong pool_total_allocated(void)
        Report the total number of bytes allocated in all pools, for reporting
        in the "PoolAlloc_MB" field of the "show info" output. The total is
        calculated on the fly by summing the number of allocated bytes in all
        pools and is only meant to be used as an indicator rather than a
        precise measure.

ullong pool_total_used(void)
        Report the total number of bytes used in all pools, for reporting in
        the "PoolUsed_MB" field of the "show info" output. The total is
        calculated on the fly by summing the number of used bytes in all pools
        and is only meant to be used as an indicator rather than a precise
        measure. Note that objects present in caches are accounted as used.

Some other functions exist and are only used by the pools code itself. While
not strictly forbidden to use outside of this code, it is generally recommended
to avoid touching them in order not to create undesired dependencies that will
complicate maintenance.

A few macros exist to ease the declaration of pools:

DECLARE_POOL(ptr, name, size)
        Placed at the top level of a file, this declares a global memory pool
        as variable <ptr>, name <name> and size <size> bytes per element. This
        is made via a call to REGISTER_POOL() and by assigning the resulting
        pointer to variable <ptr>. <ptr> will be created of type "struct
        pool_head *". If the pool needs to be visible outside of the function
        (which is likely), it will also need to be declared somewhere as
        "extern struct pool_head *<ptr>;". It is recommended to place such
        declarations very early in the source file so that the variable is
        already known to all subsequent functions which may use it.

DECLARE_STATIC_POOL(ptr, name, size)
        Placed at the top level of a file, this declares a static memory pool
        as variable <ptr>, name <name> and size <size> bytes per element. This
        is made via a call to REGISTER_POOL() and by assigning the resulting
        pointer to local variable <ptr>. <ptr> will be created of type "static
        struct pool_head *". It is recommended to place such declarations very
        early in the source file so that the variable is already known to all
        subsequent functions which may use it.


6. Build options
----------------

A number of build-time defines allow to tune the pools behavior. All of them
have to be enabled using "-Dxxx" or "-Dxxx=yyy" in the makefile's DEBUG
variable.

DEBUG_NO_POOLS
        When this is set, pools are entirely disabled, and allocations are made
        using malloc() instead. This is not recommended for production but may
        be useful for tracing allocations. It corresponds to "-dMno-cache" at
        boot time.

DEBUG_MEMORY_POOLS
        When this is set, an extra pointer is allocated at the end of each
        object to reference the pool the object was allocated from and detect
        buffer overflows. Then, pool_free() will provoke a crash in case it
        detects an anomaly (pointer at the end not matching the pool). It
        corresponds to "-dMtag" at boot time.

DEBUG_FAIL_ALLOC
        When enabled, a global setting "tune.fail-alloc" may be set to a non-
        zero value representing a percentage of memory allocations that will be
        made to fail in order to stress the calling code. It corresponds to
        "-dMfail" at boot time.

DEBUG_DONT_SHARE_POOLS
        When enabled, pools of similar sizes are not merged unless the have the
        exact same name. It corresponds to "-dMno-merge" at boot time.

DEBUG_UAF
        When enabled, pools are disabled and all allocations and releases pass
        through mmap() and munmap(). The memory usage significantly inflates
        and the performance degrades, but this allows to detect a lot of
        use-after-free conditions by crashing the program at the first abnormal
        access. This should not be used in production. It corresponds to
        boot-time options "-dMuaf". Caching is disabled but may be re-enabled
        using "-dMcache".

DEBUG_POOL_INTEGRITY
        When enabled, objects picked from the cache are checked for corruption
        by comparing their contents against a pattern that was placed when they
        were inserted into the cache. Objects are also allocated in the reverse
        order, from the oldest one to the most recent, so as to maximize the
        ability to detect such a corruption. The goal is to detect writes after
        free (or possibly hardware memory corruptions). Contrary to DEBUG_UAF
        this cannot detect reads after free, but may possibly detect later
        corruptions and will not consume extra memory. The CPU usage will
        increase a bit due to the cost of filling/checking the area and for the
        preference for cold cache instead of hot cache, though not as much as
        with DEBUG_UAF. This option is meant to be usable in production. It
        corresponds to boot-time options "-dMcold-first,integrity".

DEBUG_POOL_TRACING
        When enabled, the callers of pool_alloc() and pool_free() will be
        recorded into an extra memory area placed after the end of the object.
        This may only be required by developers who want to get a few more
        hints about code paths involved in some crashes, but will serve no
        purpose outside of this. It remains compatible (and completes well)
        DEBUG_POOL_INTEGRITY above. Such information become meaningless once
        the objects leave the thread-local cache. It corresponds to boot-time
        option "-dMcaller".

DEBUG_MEM_STATS
        When enabled, all malloc/calloc/realloc/strdup/free calls are accounted
        for per call place (file+line number), and may be displayed or reset on
        the CLI using "debug dev memstats". This is essentially used to detect
        potential leaks or abnormal usages. When pools are enabled (default),
        such calls are rare and the output will mostly contain calls induced by
        libraries. When pools are disabled, about all calls to pool_alloc() and
        pool_free() will also appear since they will be remapped to standard
        functions.

CONFIG_HAP_GLOBAL_POOLS
        When enabled, process-wide shared pools will be forcefully enabled even
        if not considered useful on the platform. The default is to let haproxy
        decide based on the OS and C library. It corresponds to boot-time
        option "-dMglobal".

CONFIG_HAP_NO_GLOBAL_POOLS
        When enabled, process-wide shared pools will be forcefully disabled
        even if considered useful on the platform. The default is to let
        haproxy decide based on the OS and C library. It corresponds to
        boot-time option "-dMno-global".

CONFIG_HAP_POOL_CACHE_SIZE
        This allows one to define the default size of the per-thread cache, in
        bytes. The default value is 512 kB (524288). Smaller values will use
        less memory at the expense of a possibly higher CPU usage when using
        many threads. Higher values will give diminishing returns on
        performance while using much more memory. Usually there is no benefit
        in using more than a per-core L2 cache size. It would be better not to
        set this value lower than a few times the size of a buffer (bufsize,
        defaults to 16 kB). In addition, keep in mind that this option may be
        changed at runtime using "tune.memory.hot-size".

CONFIG_HAP_POOL_CLUSTER_SIZE
        This allows one to define the maximum number of objects that will be
        groupped together in an allocation from the shared pool. Values 4 to 8
        have experimentally shown good results with 16 threads. On systems with
        more cores or loosely coupled caches exhibiting slow atomic operations,
        it could possibly make sense to slightly increase this value.