2013/11/20 - How hashing works internally in haproxy - maddalab@gmail.com

This document describes how HAProxy implements hashing, both map-based and
consistent, in versions prior to 1.5, as well as the motivation for and the
tests behind the additional options provided starting in version 1.5.

A note on hashing in general: hash functions strive to have little
correlation between input and output. The heart of a hash function is its
mixing step. The behavior of the mixing step largely determines whether the
hash function is collision-resistant. Hash functions that are collision
resistant are more likely to have an even distribution of load.

The purpose of the mixing function is to spread the effect of each message
bit throughout all the bits of the internal state. Ideally every bit in the
hash state is affected by every bit in the message. And we want to do that
as quickly as possible simply for the sake of program performance. A
function is said to satisfy the strict avalanche criterion if, whenever a
single input bit is complemented (toggled between 0 and 1), each of the
output bits changes with a probability of one half for an arbitrary
selection of the remaining input bits.
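
As a rough illustration of the criterion above, the sketch below toggles one
random input bit at a time and reports the average fraction of output bits
that change. The helper names (times33, blake2_32, flip_rate) are made up for
this example and are not HAProxy code. The average is only a coarse summary:
the strict criterion asks for one half on every individual output bit.

```python
import hashlib
import random

MASK32 = 0xFFFFFFFF

def times33(x):
    """Weak mixer: a single DJB2-style multiply, truncated to 32 bits."""
    return (x * 33) & MASK32

def blake2_32(x):
    """Reference mixer: 32 bits of BLAKE2b over the 4 input bytes."""
    digest = hashlib.blake2b(x.to_bytes(4, "little"), digest_size=4).digest()
    return int.from_bytes(digest, "little")

def flip_rate(fn, trials=2000, seed=1):
    """Average fraction of output bits flipped when one input bit is toggled."""
    rng = random.Random(seed)
    flipped = 0
    for _ in range(trials):
        x = rng.getrandbits(32)
        bit = 1 << rng.randrange(32)      # pick one input bit to complement
        diff = fn(x) ^ fn(x ^ bit)        # output bits that changed
        flipped += bin(diff).count("1")
    return flipped / (trials * 32)

if __name__ == "__main__":
    print("times33  :", flip_rate(times33))
    print("blake2_32:", flip_rate(blake2_32))
```

A well-mixed function should report a value close to 0.5, while the single
multiply stays well below that because high input bits never reach the low
output bits.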

To guard against a combination of hash function and input that results in
a high rate of collisions, haproxy implements an avalanche algorithm on the
result of the hashing function. In all versions 1.4 and prior, avalanche is
always applied when using the consistent hashing directive. It is intended
to provide quite a good distribution for little input variations. The result
is quite suited to fit over a 32-bit space with enough variations so that
a randomly picked number falls equally before any server position, which is
ideal for consistently hashed backends, a common use case for caches.
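
To give an idea of what an avalanche step applied to a hash result looks
like, here is a minimal sketch. It uses the 32-bit finalizer from MurmurHash3
as a stand-in. This is an assumption made for illustration and is not claimed
to be the mixing function HAProxy itself uses.

```python
MASK32 = 0xFFFFFFFF

def fmix32(h):
    """MurmurHash3 32-bit finalizer, used here as a stand-in avalanche step."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h

if __name__ == "__main__":
    # Nearby inputs land far apart in the 32-bit space after mixing.
    for raw in range(4):
        print(raw, hex(fmix32(raw)))
```

Every step is reversible, so distinct 32-bit inputs keep distinct outputs: the
step only spreads values over the space and adds no collisions of its own.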

In all versions 1.4 and prior, HAProxy implements the SDBM hashing function.
However, tests show that alternatives to SDBM have a better cache
distribution on different hashing criteria. In additional tests involving
alternative hash inputs and an option to trigger avalanche, we found that
different algorithms perform better on different criteria. DJB2 performs
well when hashing ASCII text and is a good choice when hashing on the Host
header. Other alternatives perform better on numbers and are a good choice
when using the source IP. The results also vary with the avalanche flag.
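
For reference, the two functions named above can be sketched as follows. The
code follows the classic formulations given on the Hash Functions page listed
under References, with the state truncated to 32 bits. It is an illustration,
and HAProxy's own implementation may differ in detail.

```python
MASK32 = 0xFFFFFFFF

def sdbm(data: bytes) -> int:
    """SDBM: h = c + (h << 6) + (h << 16) - h, i.e. h * 65599 + c."""
    h = 0
    for c in data:
        h = (c + (h << 6) + (h << 16) - h) & MASK32
    return h

def djb2(data: bytes) -> int:
    """DJB2: h = h * 33 + c, starting from 5381."""
    h = 5381
    for c in data:
        h = ((h << 5) + h + c) & MASK32
    return h

if __name__ == "__main__":
    for key in (b"www.example.com", b"10.0.0.1"):
        print(key, hex(sdbm(key)), hex(djb2(key)))
```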

The results of the testing can be found under the tests folder. Here is
a summary of the discussion of the results for one input criterion and the
methodology used to generate the results.

A note on the setup: when validating the results independently, one
would want to avoid backend server counts that may skew the results. As
an example, with DJB2 avoid 33 servers. Please see the implementations of
the hashing functions, which can be found in the links under References.
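
One way to see why 33 is an awkward server count for DJB2: each step computes
h * 33 + c, so as long as the running value has not wrapped past 32 bits, the
hash modulo 33 equals the last byte modulo 33. The sketch below checks this
on three-byte keys, which are short enough to avoid wraparound. The modulo
placement in bucket() is a hypothetical stand-in and is not HAProxy's server
selection code. For longer keys the wraparound mixes in the high bits of the
previous state, so the effect is less clean than shown here.

```python
MASK32 = 0xFFFFFFFF
SERVERS = 33  # the server count the text warns about

def djb2(data: bytes) -> int:
    """DJB2 with a 32-bit state: h = h * 33 + c, starting from 5381."""
    h = 5381
    for c in data:
        h = (h * 33 + c) & MASK32
    return h

def bucket(key: bytes, servers: int = SERVERS) -> int:
    """Hypothetical modulo placement, for illustration only."""
    return djb2(key) % servers

# Three-byte keys keep the state below 2**32, so no wraparound occurs.
keys = [b"a1x", b"b2x", b"zzx", b"a1y", b"b2y", b"zzy"]
buckets = {k: bucket(k) for k in keys}

if __name__ == "__main__":
    for k, b in buckets.items():
        print(k, "-> server", b, "(last byte % 33 =", k[-1] % SERVERS, ")")
```

Here the six keys occupy only two of the 33 buckets, one per distinct final
byte, regardless of the leading bytes.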

The following setup was used:

(a) hash-type consistent/map-based
(b) avalanche on/off
(c) balance hdr(host)
(d) 3 criteria for inputs
    - ~ 10K requests, including duplicates
    - ~ 46K requests, the unique requests extracted from 1 MM requests
    - ~ 250K requests, including duplicates
(e) 17 servers in backend, all servers were assigned the same weight

The results of the hashing were obtained across the servers by monitoring
the haproxy log files. Population standard deviation was used to evaluate
the efficacy of the hashing algorithm. A lower standard deviation indicates
a better distribution of load across the backends.
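
The metric can be reproduced with a few lines. The sketch below assumes the
per-request server names have already been extracted from the logs. The
function name and the input format are assumptions made for the example, not
part of the original test tooling.

```python
from collections import Counter
import statistics

def distribution_stddev(assignments, servers):
    """Population std dev of request counts per server.

    assignments: iterable of server names, one entry per request
    servers: every server in the backend, so idle servers count as 0
    """
    counts = Counter(assignments)
    per_server = [counts.get(name, 0) for name in servers]
    return statistics.pstdev(per_server)

if __name__ == "__main__":
    servers = ["s1", "s2", "s3"]
    even = ["s1", "s2", "s3"] * 4
    skewed = ["s1"] * 10 + ["s2"] * 2
    print(distribution_stddev(even, servers))    # perfectly even load
    print(distribution_stddev(skewed, servers))  # uneven load, larger value
```

Servers that received no requests must be included with a count of zero,
otherwise a badly skewed distribution can look better than it is.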

On 10K requests, when using consistent hashing with avalanche on host
headers, DJB2 significantly outperforms SDBM. Std dev on SDBM was 48.95
and on DJB2 was 26.29. This relationship is inverted with avalanche
disabled; however, DJB2 with avalanche enabled outperforms SDBM with
avalanche disabled.

On map-based hashing, SDBM outperforms DJB2 irrespective of the avalanche
option. SDBM without avalanche is marginally better than with avalanche.
DJB2 performs significantly worse with avalanche enabled.

Summary: The results of the testing indicate that there isn't a hashing
algorithm that can be applied across all input criteria. It is necessary
to support alternatives to SDBM, which is generally the best option, with
algorithms that are better for different inputs. Avalanche is not always
applicable and may result in a less smooth distribution.

References:
Mixing Functions/Avalanche: https://papa.bretmulvey.com/post/124027987928/hash-functions
Hash Functions: http://www.cse.yorku.ca/~oz/hash.html