summaryrefslogtreecommitdiffstats
path: root/doc/peers.txt
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--doc/peers.txt491
1 files changed, 491 insertions, 0 deletions
diff --git a/doc/peers.txt b/doc/peers.txt
new file mode 100644
index 0000000..a5da40f
--- /dev/null
+++ b/doc/peers.txt
@@ -0,0 +1,491 @@
+ +--------------------+
+ | Peers protocol 2.1 |
+ +--------------------+
+
+
+ Peers protocol has been implemented over TCP. Its aim is to transmit
+ stick-table entries information between several haproxy processes.
+
+ This protocol is symmetrical. This means that at any time, each peer
+ may connect to other peers they have been configured for, to send
+ their last stick-table updates. There is no role of client or server in this
+ protocol. As peers may connect to each others at the same time, the protocol
+ ensures that only one peer session may stay opened between a couple of peers
+ before they start sending their stick-table information, possibly in both
+ directions (or not).
+
+
+ Handshake
+ +++++++++
+
+ Just after having connected to another one, a peer must identify itself
+ and identify the remote peer, sending a "hello" message. The remote peer
+ replies with a "status" message.
+
+ A "hello" message is made of three lines terminated by a line feed character
+ as follows:
+
+ <protocol identifier> <version>\n
+ <remote peer identifier>\n
+ <local peer identifier> <process ID> <relative process ID>\n
+
+ protocol identifier : HAProxyS
+ version : 2.1
+ remote peer identifier: the peer name this "hello" message is sent to.
+ local peer identifier : the name of the peer which sends this "hello" message.
+ process ID : the ID of the process handling this peer session.
+ relative process ID : the haproxy's relative process ID (0 if nbproc == 1).
+
+ The "status" message is made of a unique line terminated by a line feed
+ character as follows:
+
+ <status code>\n
+
+ with these values as status code (a three-digit number):
+
+ +-------------+---------------------------------+
+ | status code | signification |
+ +-------------+---------------------------------+
+ | 200 | Handshake succeeded |
+ +-------------+---------------------------------+
+ | 300 | Try again later |
+ +-------------+---------------------------------+
+ | 501 | Protocol error |
+ +-------------+---------------------------------+
+ | 502 | Bad version |
+ +-------------+---------------------------------+
+ | 503 | Local peer identifier mismatch |
+ +-------------+---------------------------------+
+ | 504 | Remote peer identifier mismatch |
+ +-------------+---------------------------------+
+
+ As the protocol is symmetrical, some peers may connect to each other at the
+ same time. For efficiency reasons, the protocol ensures there may be only
+ one TCP session opened after the handshake succeeded and before transmitting
+ any stick-table data information. In fact, for each couple of peers, this is
+ the last connected peer which wins. Each time a peer A receives a "hello"
+ message from a peer B, peer A checks if it already managed to open a peer
+ session with peer B, so with a successful handshake. If it is the case,
+ peer A closes its peer session. So, this is the peer session opened by B
+ which stays opened.
+
+
+ Peer A Peer B
+ hello
+ ---------------------->
+ status 200
+ <----------------------
+ hello
+ <++++++++++++++++++++++
+ TCP/FIN-ACK
+ ---------------------->
+ TCP/FIN-ACK
+ <----------------------
+ status 200
+ ++++++++++++++++++++++>
+ data
+ <++++++++++++++++++++++
+ data
+ ++++++++++++++++++++++>
+ data
+ ++++++++++++++++++++++>
+ data
+ <++++++++++++++++++++++
+ .
+ .
+ .
+
+ As it is still possible that a couple of peers decide to close both their
+ peer sessions at the same time, the protocol ensures peers will not reconnect
+ at the same time, adding a random delay (50 up to 2050 ms) before any
+ reconnection.
+
+
+ Encoding
+ ++++++++
+
+ As some TCP data may be corrupted, for integrity reason, some data fields
+ are encoded at peer session level.
+
+ The following algorithms explain how to encode/decode the data.
+
+ encode:
+ input : val (64bits integer)
+ output: bitf (variable-length bitfield)
+
+ if val has no bit set above bit 4 (or if val is less than 0xf0)
+ set the next byte of bitf to the value of val
+ return bitf
+
+ set the next byte of bitf to the value of val OR'ed with 0xf0
+ subtract 0xf0 from val
+ right shift val by 4
+
+ while val bit 7 is set (or if val is greater or equal to 0x80):
+ set the next byte of bitf to the value of the byte made of the last
+ 7 bits of val OR'ed with 0x80
+ subtract 0x80 from val
+ right shift val by 7
+
+ set the next byte of bitf to the value of val
+ return bitf
+
+ decode:
+ input : bitf (variable-length bitfield)
+ output: val (64bits integer)
+
+ set val to the value of the first byte of bitf
+ if bit 4 up to 7 of val are not set
+ return val
+
+ set loop to 0
+ do
+ add to val the value of the next byte of bitf left shifted by (4 + 7*loop)
+ set loop to (loop + 1)
+ while the bit 7 of the next byte of bitf is set
+ return val
+
+ Example:
+
+ let's say that we must encode 0x1234.
+
+ "set the next byte of bitf to the value of val OR'ed with 0xf0"
+ => bitf[0] = (0x1234 | 0xf0) & 0xff = 0xf4
+
+ "subtract 0xf0 from val"
+ => val = 0x1144
+
+ right shift val by 4
+ => val = 0x114
+
+ "set the next byte of bitf to the value of the byte made of the last
+ 7 bits of val OR'ed with 0x80"
+ => bitf[1] = (0x114 | 0x80) & 0xff = 0x94
+
+ "subtract 0x80 from val"
+ => val= 0x94
+
+ "right shift val by 7"
+ => val = 0x1
+
+ => bitf[2] = 0x1
+
+ So, the encoded value of 0x1234 is 0xf49401.
+
+ To decode this value:
+
+ "set val to the value of the first byte of bitf"
+ => val = 0xf4
+
+ "add to val the value of the next byte of bitf left shifted by 4"
+ => val = 0xf4 + (0x94 << 4) = 0xf4 + 0x940 = 0xa34
+
+ "add to val the value of the next byte of bitf left shifted by (4 + 7)"
+ => val = 0xa34 + (0x01 << 11) = 0xa34 + 0x800 = 0x1234
+
+
+ Messages
+ ++++++++
+
+ *** General ***
+
+ After the handshake has successfully completed, peers are authorized to send
+ some messages to each others, possibly in both direction.
+
+ All the messages are made at least of a two bytes length header.
+
+ The first byte of this header identifies the class of the message. The next
+ byte identifies the type of message in the class.
+
+ Some of these messages are variable-length. Others have a fixed size.
+ Variable-length messages are identified by the value of the message type
+ byte. For such messages, it is greater than or equal to 128.
+
+ All variable-length message headers must be followed by the encoded length
+ of the remaining bytes (so the encoded length of the message minus 2 bytes
+ for the header and minus the length of the encoded length).
+
+ There exist four classes of messages:
+
+ +------------+---------------------+--------------+
+ | class byte | signification | message size |
+ +------------+---------------------+--------------+
+ | 0 | control | fixed (2) |
+ +------------+---------------------+--------------|
+ | 1 | error | fixed (2) |
+ +------------+---------------------+--------------|
+ | 10 | stick-table updates | variable |
+ +------------+---------------------+--------------|
+ | 255 | reserved | |
+ +------------+---------------------+--------------+
+
+ At this time of this writing, only control and error messages have a fixed
+ size of two bytes (header only). The stick-table updates messages are all
+ variable-length (their message type bytes are greater than 128).
+
+
+ *** Control message class ***
+
+ At this time of writing, control messages are fixed-length messages used
+ only to control the synchronizations between local and/or remote processes
+ and to emit heartbeat messages.
+
+ There exists five types of such control messages:
+
+ +------------+--------------------------------------------------------+
+ | type byte | signification |
+ +------------+--------------------------------------------------------+
+ | 0 | synchronisation request: ask a remote peer for a full |
+ | | synchronization |
+ +------------+--------------------------------------------------------+
+ | 1 | synchronization finished: signal a remote peer that |
+ | | local updates have been pushed and local is considered |
+ | | up to date. |
+ +------------+--------------------------------------------------------+
+ | 2 | synchronization partial: signal a remote peer that |
+ | | local updates have been pushed and local is not |
+ | | considered up to date. |
+ +------------+--------------------------------------------------------+
+ | 3 | synchronization confirmed: acknowledge a finished or |
+ | | partial synchronization message. |
+ +------------+--------------------------------------------------------+
+ | 4 | Heartbeat message. |
+ +------------+--------------------------------------------------------+
+
+ About hearbeat messages: a peer sends heartbeat messages to peers it is
+ connected to after periods of 3s of inactivity (i.e. when there is no
+ stick-table to synchronize for 3s). After a successful peer protocol
+ handshake between two peers, if one of them does not send any other peer
+ protocol messages (i.e. no heartbeat and no stick-table update messages)
+ during a 5s period, it is considered as no more alive by its remote peer
+ which closes the session and then tries to reconnect to the peer which
+ has just disappeared.
+
+ *** Error message class ***
+
+ There exits two types of such error messages:
+
+ +-----------+------------------+
+ | type byte | signification |
+ +-----------+------------------+
+ | 0 | protocol error |
+ +-----------+------------------+
+ | 1 | size limit error |
+ +-----------+------------------+
+
+
+ *** Stick-table update message class ***
+
+ This class is the more important one because it is in relation with the
+ stick-table entries handling between peers which is at the core of peers
+ protocol.
+
+ All the messages of this class are variable-length. Their type bytes are
+ all greater than or equal to 128.
+
+ There exits five types of such stick-table update messages:
+
+ +-----------+--------------------------------+
+ | type byte | signification |
+ +-----------+--------------------------------+
+ | 128 | Entry update |
+ +-----------+--------------------------------+
+ | 129 | Incremental entry update |
+ +-----------+--------------------------------+
+ | 130 | Stick-table definition |
+ +-----------+--------------------------------+
+ | 131 | Stick-table switch (unused) |
+ +-----------+--------------------------------+
+ | 133 | Update message acknowledgement |
+ +-----------+--------------------------------+
+
+ Note that entry update messages may be multiplexed. This means that different
+ entry update messages for different stick-tables may be sent over the same
+ peer session.
+
+ To do so, each time entry update messages have to sent, they must be preceded
+ by a stick-table definition message. This remains true for incremental entry
+ update messages.
+
+ As its name indicate, "Update message acknowledgement" messages are used to
+ acknowledge the entry update messages.
+
+ In this following paragraph, we give some information about the format of
+ each stick-table update messages. This very simple following legend will
+ contribute in understanding it. The unit used is the octet.
+
+ XX
+ +-----------+
+ | foo | Unique fixed sized "foo" field, made of XX octets.
+ +-----------+
+
+ +===========+
+ | foo | Variable-length "foo" field.
+ +===========+
+
+ +xxxxxxxxxxx+
+ | foo | Encoded variable-length "foo" field.
+ +xxxxxxxxxxx+
+
+ +###########+
+ | foo | hereunder described "foo" field.
+ +###########+
+
+
+ With this legend, all the stick-table update messages have such a header:
+
+ 1 1
+ +--------------------+------------------------+xxxxxxxxxxxxxxxx+
+ | Message Class (10) | Message type (128-133) | Message length |
+ +--------------------+------------------------+xxxxxxxxxxxxxxxx+
+
+ Note that to help in making communicate different versions of peers protocol,
+ such stick-table update messages may be extended adding non mandatory
+ fields at the end of such messages, announcing a total message length
+ which is greater than the message length of the previous versions of
+ peers protocol. After having parsed such messages, the remaining ones
+ will be skipped to parse the next message.
+
+ - Definition message format:
+
+ Before sending entry update messages, a peer must announce the configuration
+ of the stick-table in relation with these messages thanks to a
+ "Stick-table definition" message with such a following format:
+
+ +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+
+ | Stick-table ID | Stick-table name length | Stick-table name |
+ +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+
+
+ +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+
+ | Key type | Key length | Data types bitfield | Expiry |
+ +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+
+
+ +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
+ | Frequency counter #1 | Frequency counter #1 period |
+ +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
+
+ +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
+ | Frequency counter #2 | Frequency counter #2 period |
+ +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
+ .
+ .
+ .
+
+ Note that "Stick-table ID" field is an encoded integer which is used to
+ identify the stick-table without using its name (or "Stick-table name"
+ field). It is local to the process handling the stick-table. So we can have
+ two peers attached to processes which generate stick-table updates for
+ the same stick-table (same name) but with different stick-table IDs.
+
+ Also note that the list of "Frequency counter #X" and their associated
+ periods fields exists only if their underlying types are already defined
+ in "Data types bitfield" field.
+
+ "Expiry" field and the remaining ones are not used by all the existing
+ version of haproxy peers. But they are MANDATORY, so that to make a
+ stick-table aggregator peer be able to autoconfigure itself.
+
+
+ - Entry update message format:
+ 4
+ +-----------------+###########+############+
+ | Local update ID | Key | Data |
+ +-----------------+###########+############+
+
+ with "Key" described as follows:
+
+ +xxxxxxxxxxx+=======+
+ | length | value | if key type is (non null terminated) "string",
+ +xxxxxxxxxxx+=======+
+
+ 4
+ +-------+
+ | value | if key type is "integer",
+ +-------+
+
+ +=======+
+ | value | for other key types: the size is announced in
+ +=======+ the previous stick-table definition message.
+
+ "Data" field is basically a list of encoded values for each type announced
+ by the "Data types bitfield" field of the previous "Stick-table definition"
+ message:
+
+ +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+ +xxxxxxxxxxxxxxxxxxxx+
+ | Data type #1 value | Data type #2 value | .... | Data type #n value |
+ +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+ +xxxxxxxxxxxxxxxxxxxx+
+
+
+ Most of these fields are internally stored as uint32_t (see STD_T_SINT,
+ STD_T_UINT, STD_T_ULL C enumerations) or structures made of several uint32_t
+ (see STD_T_FRQP C enumeration). The remaining one STD_T_DICT is internally
+ used to store entries of LRU caches for others literal dictionary entries
+ (couples of IDs associated to strings). It is used to transmit these cache
+ entries as follows:
+
+ +xxxxxxxxxxx+xxxx+xxxxxxxxxxxxxxx+========+
+ | length | ID | string length | string |
+ +xxxxxxxxxxx+xxxx+xxxxxxxxxxxxxxx+========+
+
+ "length" is the length in bytes of the remaining data after this "length" field.
+ "string length" is the length of "string" field which follows.
+
+ Here the cache is used so that not to have to send again and again an already
+ sent string. Indeed, the second time we have to send the same dictionary entry,
+ if still cached, a peer sends only its ID:
+
+ +xxxxxxxxxxx+xxxx+
+ | length | ID |
+ +xxxxxxxxxxx+xxxx+
+
+ - Update message acknowledgement format:
+
+ These messages are responses to "Entry update" messages only.
+
+ Its format is very basic for efficiency reasons:
+
+ 4
+ +xxxxxxxxxxxxxxxx+-----------+
+ | Stick-table ID | Update ID |
+ +xxxxxxxxxxxxxxxx+-----------+
+
+
+ Note that the "Stick-table ID" field value is in relation with the one which
+ has been previously announce by a "Stick-table definition" message.
+
+ The following schema may help in understanding how to handle a stream of
+ stick-table update messages. The handshake step is not represented.
+ Stick-table IDs are preceded by a '#' character.
+
+
+ Peer A Peer B
+
+ stkt def. #1
+ ---------------------->
+ updates (1-5)
+ ---------------------->
+ stkt def. #3
+ ---------------------->
+ updates (1000-1005)
+ ---------------------->
+
+ stkt def. #2
+ <----------------------
+ updates (10-15)
+ <----------------------
+ ack 5 for #1
+ <----------------------
+ ack 1005 for #3
+ <----------------------
+ stkt def. #4
+ <----------------------
+ updates (100-105)
+ <----------------------
+
+ ack 10 for #2
+ ---------------------->
+ ack 105 for #4
+ ---------------------->
+ (from here, on both sides, all stick-table updates
+ are considered as received)
+