                            +--------------------+
                            | Peers protocol 2.1 |
                            +--------------------+


    The peers protocol is implemented over TCP. Its aim is to transmit
    stick-table entry information between several haproxy processes.

    This protocol is symmetrical. This means that at any time, each peer
    may connect to the other peers it has been configured for, to send
    its latest stick-table updates. There is no client or server role in this
    protocol. As peers may connect to each other at the same time, the protocol
    ensures that only one peer session stays open between a pair of peers
    before they start sending their stick-table information, possibly in both
    directions.


    Handshake
    +++++++++

    Just after connecting to another peer, a peer must identify itself and
    the remote peer by sending a "hello" message. The remote peer replies
    with a "status" message.

    A "hello" message is made of three lines, each terminated by a line feed
    character, as follows:

    <protocol identifier> <version>\n
    <remote peer identifier>\n
    <local peer identifier> <process ID> <relative process ID>\n

    protocol identifier   : HAProxyS
    version               : 2.1
    remote peer identifier: the peer name this "hello" message is sent to.
    local peer identifier : the name of the peer which sends this "hello" message.
    process ID            : the ID of the process handling this peer session.
    relative process ID   : the haproxy's relative process ID (0 if nbproc == 1).
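
    The layout above can be illustrated with a minimal sketch. The function
    name build_hello() and the peer names used below are hypothetical and
    only serve the example; ASCII peer names are assumed.

```python
def build_hello(remote_peer, local_peer, pid, relative_pid=0):
    """Build the three-line "hello" message described above."""
    lines = [
        "HAProxyS 2.1",                        # <protocol identifier> <version>
        remote_peer,                           # <remote peer identifier>
        f"{local_peer} {pid} {relative_pid}",  # <local peer identifier> <process ID> <relative process ID>
    ]
    # Every line, including the last one, is terminated by a line feed.
    return "".join(line + "\n" for line in lines).encode("ascii")


# Hypothetical peer names and process ID, for illustration only.
hello = build_hello("peerB", "peerA", 1234)
```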

    The "status" message is made of a single line terminated by a line feed
    character as follows:

    <status code>\n

    with these values as status code (a three-digit number):

                 +-------------+---------------------------------+
                 | status code |             meaning             |
                 +-------------+---------------------------------+
                 |     200     |      Handshake succeeded        |
                 +-------------+---------------------------------+
                 |     300     |        Try again later          |
                 +-------------+---------------------------------+
                 |     501     |         Protocol error          |
                 +-------------+---------------------------------+
                 |     502     |          Bad version            |
                 +-------------+---------------------------------+
                 |     503     | Local peer identifier mismatch  |
                 +-------------+---------------------------------+
                 |     504     | Remote peer identifier mismatch |
                 +-------------+---------------------------------+
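
    As an illustration, a receiver might validate and interpret a "status"
    line as sketched below. The names STATUS_MEANINGS and parse_status() are
    hypothetical; only the codes and their meanings come from the table above.

```python
STATUS_MEANINGS = {
    200: "Handshake succeeded",
    300: "Try again later",
    501: "Protocol error",
    502: "Bad version",
    503: "Local peer identifier mismatch",
    504: "Remote peer identifier mismatch",
}


def parse_status(line):
    """Return (code, meaning) for a "<status code>\\n" line given as bytes."""
    # A status line is exactly three digits followed by a line feed.
    if len(line) != 4 or not line.endswith(b"\n") or not line[:3].isdigit():
        raise ValueError(f"malformed status line: {line!r}")
    code = int(line[:3])
    return code, STATUS_MEANINGS.get(code, "unknown status code")
```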

    As the protocol is symmetrical, two peers may connect to each other at the
    same time. For efficiency reasons, the protocol ensures that only one TCP
    session remains open once the handshake has succeeded and before any
    stick-table data is transmitted. For each pair of peers, the last peer to
    connect wins. Each time a peer A receives a "hello" message from a peer B,
    peer A checks whether it has already opened a peer session with peer B,
    that is, with a successful handshake. If so, peer A closes its own peer
    session, so the peer session opened by B is the one which stays open.


                           Peer A               Peer B
                                     hello
                            ---------------------->
                                   status 200
                            <----------------------
                                     hello
                            <++++++++++++++++++++++
                                   TCP/FIN-ACK
                            ---------------------->
                                   TCP/FIN-ACK
                            <----------------------
                                   status 200
                            ++++++++++++++++++++++>
                                      data
                            <++++++++++++++++++++++
                                      data
                            ++++++++++++++++++++++>
                                      data
                            ++++++++++++++++++++++>
                                      data
                            <++++++++++++++++++++++
                                       .
                                       .
                                       .

     As it is still possible for two peers to close both of their peer
     sessions at the same time, the protocol prevents them from reconnecting
     simultaneously by adding a random delay (from 50 to 2050 ms) before any
     reconnection.
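
     A minimal sketch of such a delay follows. The text only gives the
     bounds; the uniform distribution and the function name
     reconnect_delay_ms() are assumptions made for the example.

```python
import random


def reconnect_delay_ms(rng=random):
    """Pick a reconnection delay in milliseconds within the documented bounds.

    Assumption: the delay is drawn uniformly; the text only states the range.
    """
    return rng.randint(50, 2050)  # both bounds are inclusive
```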


     Encoding
     ++++++++

     As some TCP data may be corrupted, for integrity reasons, some data fields
     are encoded at peer session level.

     The following algorithms explain how to encode/decode the data.

     encode:
       input : val  (64bits integer)
       output: bitf (variable-length bitfield)

       if val is less than 0xf0
           set the next byte of bitf to the value of val
           return bitf

       set the next byte of bitf to the value of val OR'ed with 0xf0
       subtract 0xf0 from val
       right shift val by 4

       while val is greater than or equal to 0x80:
           set the next byte of bitf to the value of the byte made of the last
               7 bits of val OR'ed with 0x80
           subtract 0x80 from val
           right shift val by 7

       set the next byte of bitf to the value of val
       return bitf

     decode:
       input : bitf (variable-length bitfield)
       output: val  (64bits integer)

       set val to the value of the first byte of bitf
       if bits 4 to 7 of val are not all set (i.e. val is less than 0xf0)
           return val

       set loop to 0
       do
           add to val the value of the next byte of bitf left shifted by (4 + 7*loop)
           set loop to (loop + 1)
       while bit 7 of the byte just added is set
       return val

     Example:

     Let's say that we must encode 0x1234.

     "set the next byte of bitf to the value of val OR'ed with 0xf0"
     => bitf[0] = (0x1234 | 0xf0) & 0xff = 0xf4

     "subtract 0xf0 from val"
     => val = 0x1144

     "right shift val by 4"
     => val = 0x114

     "set the next byte of bitf to the value of the byte made of the last
     7 bits of val OR'ed with 0x80"
     => bitf[1] = (0x114 | 0x80) & 0xff = 0x94

     "subtract 0x80 from val"
     => val = 0x94

     "right shift val by 7"
     => val = 0x1

     "set the next byte of bitf to the value of val"
     => bitf[2] = 0x1

     So, the encoded value of 0x1234 is 0xf49401.

     To decode this value:

     "set val to the value of the first byte of bitf"
     => val = 0xf4

     "add to val the value of the next byte of bitf left shifted by 4"
     => val = 0xf4 + (0x94 << 4) = 0xf4 + 0x940 = 0xa34

     "add to val the value of the next byte of bitf left shifted by (4 + 7)"
     => val = 0xa34 + (0x01 << 11) = 0xa34 + 0x800 = 0x1234
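
     The two algorithms and the worked example above can be checked with the
     following sketch. The function names encode() and decode() are chosen
     for the example; decode() additionally returns the position of the byte
     following the encoded value, which is a convenience not described in
     the text.

```python
def encode(val):
    """Encode an unsigned 64-bit integer as a variable-length byte string."""
    if not 0 <= val < 1 << 64:
        raise ValueError("val must fit in 64 bits")
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])   # low byte OR'ed with 0xf0
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0x7F) | 0x80)      # last 7 bits OR'ed with 0x80
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def decode(buf, pos=0):
    """Decode one value from buf starting at pos; return (value, next_pos)."""
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]          # raises IndexError on truncated input
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:          # bit 7 clear: this was the last byte
            return val, pos


encoded = encode(0x1234)         # the example value used above
```

     Values below 0xf0 occupy a single byte, so small counters and lengths
     cost no more than their raw value.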


     Messages
     ++++++++

     *** General ***

     After the handshake has successfully completed, peers are authorized to send
     messages to each other, possibly in both directions.

     All messages start with a header of at least two bytes.

     The first byte of this header identifies the class of the message. The next
     byte identifies the type of message in the class.

     Some of these messages are variable-length. Others have a fixed size.
     Variable-length messages are identified by the value of the message type
     byte. For such messages, it is greater than or equal to 128.

     All variable-length message headers must be followed by the encoded length
     of the remaining bytes (that is, the length of the message minus the 2
     header bytes and minus the size of the encoded length field itself).

     There exist four classes of messages:

                +------------+---------------------+--------------+
                | class byte |       meaning       | message size |
                +------------+---------------------+--------------+
                |      0     |      control        |   fixed (2)  |
                +------------+---------------------+--------------+
                |      1     |       error         |   fixed (2)  |
                +------------+---------------------+--------------+
                |     10     | stick-table updates |   variable   |
                +------------+---------------------+--------------+
                |    255     |      reserved       |              |
                +------------+---------------------+--------------+

     At the time of this writing, only control and error messages have a fixed
     size of two bytes (header only). The stick-table update messages are all
     variable-length (their message type bytes are greater than or equal to 128).
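
     As a rough illustration, the two header bytes can be classified as
     follows. CLASS_NAMES and describe_header() are hypothetical names; the
     class values and the 128 threshold come from the text above.

```python
CLASS_NAMES = {
    0: "control",
    1: "error",
    10: "stick-table updates",
    255: "reserved",
}


def describe_header(header):
    """Return (class name, type byte, is_variable_length) for a 2-byte header."""
    if len(header) < 2:
        raise ValueError("a message header is at least two bytes long")
    msg_class, msg_type = header[0], header[1]
    # A type byte greater than or equal to 128 marks a variable-length message.
    variable_length = msg_type >= 128
    return CLASS_NAMES.get(msg_class, "unknown"), msg_type, variable_length
```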


     *** Control message class ***

     At the time of this writing, control messages are fixed-length messages used
     only to control the synchronizations between local and/or remote processes
     and to emit heartbeat messages.

     There exist five types of such control messages:

       +------------+--------------------------------------------------------+
       | type byte  |                        meaning                         |
       +------------+--------------------------------------------------------+
       |      0     | synchronization request: ask a remote peer for a full  |
       |            | synchronization                                        |
       +------------+--------------------------------------------------------+
       |      1     | synchronization finished: signal a remote peer that    |
       |            | local updates have been pushed and local is considered |
       |            | up to date.                                            |
       +------------+--------------------------------------------------------+
       |      2     | synchronization partial: signal a remote peer that     |
       |            | local updates have been pushed and local is not        |
       |            | considered up to date.                                 |
       +------------+--------------------------------------------------------+
       |      3     | synchronization confirmed: acknowledge a finished or   |
       |            | partial synchronization message.                       |
       +------------+--------------------------------------------------------+
       |      4     | Heartbeat message.                                     |
       +------------+--------------------------------------------------------+

     About heartbeat messages: a peer sends heartbeat messages to the peers it
     is connected to after periods of 3s of inactivity (i.e. when there has
     been no stick-table update to synchronize for 3s). After a successful
     peer protocol handshake between two peers, if one of them does not send
     any peer protocol message (i.e. neither heartbeat nor stick-table update
     messages) during a 5s period, its remote peer considers it dead, closes
     the session and then tries to reconnect to the peer which has just
     disappeared.
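
     The two timers can be sketched as below. The class name
     PeerSessionTimers, its methods and the use of plain floating-point
     timestamps in seconds are assumptions made for the example; only the
     3s and 5s periods and the heartbeat header bytes come from the text.

```python
HEARTBEAT_IDLE_S = 3.0       # send a heartbeat after 3s without sending
DEAD_PEER_TIMEOUT_S = 5.0    # peer considered dead after 5s of silence
HEARTBEAT_MSG = bytes([0, 4])  # control class (0), heartbeat type (4)


class PeerSessionTimers:
    """Track send/receive activity on one established peer session."""

    def __init__(self, now):
        self.last_sent = now
        self.last_received = now

    def on_sent(self, now):
        self.last_sent = now

    def on_received(self, now):
        self.last_received = now

    def heartbeat_due(self, now):
        # Nothing was sent for 3s: a heartbeat should be emitted.
        return now - self.last_sent >= HEARTBEAT_IDLE_S

    def peer_is_dead(self, now):
        # Nothing was received for 5s: close the session and reconnect.
        return now - self.last_received >= DEAD_PEER_TIMEOUT_S
```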

     *** Error message class ***

     There exist two types of such error messages:

                          +-----------+------------------+
                          | type byte |     meaning      |
                          +-----------+------------------+
                          |      0    |  protocol error  |
                          +-----------+------------------+
                          |      1    | size limit error |
                          +-----------+------------------+


     *** Stick-table update message class ***

     This class is the most important one because it deals with the handling
     of stick-table entries between peers, which is at the core of the peers
     protocol.

     All the messages of this class are variable-length. Their type bytes are
     all greater than or equal to 128.

     There exist five types of such stick-table update messages:

                    +-----------+--------------------------------+
                    | type byte |            meaning             |
                    +-----------+--------------------------------+
                    |    128    |          Entry update          |
                    +-----------+--------------------------------+
                    |    129    |    Incremental entry update    |
                    +-----------+--------------------------------+
                    |    130    |     Stick-table definition     |
                    +-----------+--------------------------------+
                    |    131    |   Stick-table switch (unused)  |
                    +-----------+--------------------------------+
                    |    133    | Update message acknowledgement |
                    +-----------+--------------------------------+

     Note that entry update messages may be multiplexed. This means that different
     entry update messages for different stick-tables may be sent over the same
     peer session.

     To do so, each time entry update messages have to be sent, they must be
     preceded by a stick-table definition message. This remains true for
     incremental entry update messages.

     As their name indicates, "Update message acknowledgement" messages are used
     to acknowledge the entry update messages.

     The following paragraphs describe the format of each stick-table update
     message. The simple legend below helps in understanding them. The unit
     used is the octet.

                     XX
          +-----------+
          |    foo    |  Single fixed-size "foo" field, made of XX octets.
          +-----------+

          +===========+
          |    foo    |  Variable-length "foo" field.
          +===========+

          +xxxxxxxxxxx+
          |    foo    |  Encoded variable-length "foo" field.
          +xxxxxxxxxxx+

          +###########+
          |    foo    |  "foo" field described below.
          +###########+


     With this legend, all stick-table update messages have the following header:

                               1                        1
          +--------------------+------------------------+xxxxxxxxxxxxxxxx+
          | Message Class (10) | Message type (128-133) | Message length |
          +--------------------+------------------------+xxxxxxxxxxxxxxxx+

     Note that, to let different versions of the peers protocol communicate,
     such stick-table update messages may be extended by appending optional
     fields at their end and announcing a total message length which is
     greater than the message length of previous versions of the peers
     protocol. After having parsed the fields it knows, a peer skips the
     remaining bytes to reach the next message.
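
     The sketch below shows how the announced length lets a receiver step
     from one message to the next, whatever trailing fields a message
     carries. frame_update() and split_messages() are hypothetical names;
     encode() and decode() repeat the integer encoding of the "Encoding"
     section. The payload bytes used in the test are arbitrary.

```python
def encode(val):
    """Variable-length integer encoding from the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0x7F) | 0x80)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def decode(buf, pos):
    """Decode one encoded integer; return (value, next_pos)."""
    val = buf[pos]
    pos += 1
    if val >= 0xF0:
        shift = 4
        while True:
            byte = buf[pos]
            pos += 1
            val += byte << shift
            shift += 7
            if byte < 0x80:
                break
    return val, pos


def frame_update(msg_type, payload):
    """Prepend the class (10), type and encoded length to a payload."""
    if not 128 <= msg_type <= 255:
        raise ValueError("stick-table update types are >= 128")
    return bytes([10, msg_type]) + encode(len(payload)) + payload


def split_messages(stream):
    """Yield (class, type, payload) for each complete message in stream."""
    pos = 0
    while pos < len(stream):
        msg_class, msg_type = stream[pos], stream[pos + 1]
        pos += 2
        payload = b""
        if msg_type >= 128:                  # variable-length message
            length, pos = decode(stream, pos)
            payload = stream[pos:pos + length]
            pos += length                    # unknown trailing fields are skipped here
        yield msg_class, msg_type, payload
```

     A parser for a given message type reads only the fields it knows from
     the payload; since the cursor advances by the announced length, extra
     fields added by a newer peer do not disturb the following message.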

     - Definition message format:

     Before sending entry update messages, a peer must announce the configuration
     of the stick-table these messages relate to, using a "Stick-table
     definition" message with the following format:

          +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+
          | Stick-table ID | Stick-table name length | Stick-table name |
          +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+

          +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+
          |  Key type  |  Key length  |  Data types bitfield  |  Expiry |
          +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+

          +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
          |   Frequency counter #1    |   Frequency counter #1 period   |
          +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+

          +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
          |   Frequency counter #2    |   Frequency counter #2 period   |
          +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
                                      .
                                      .
                                      .

     Note that the "Stick-table ID" field is an encoded integer which is used to
     identify the stick-table without using its name (or "Stick-table name"
     field). It is local to the process handling the stick-table. So we can have
     two peers attached to processes which generate stick-table updates for
     the same stick-table (same name) but with different stick-table IDs.

     Also note that the list of "Frequency counter #X" fields and their
     associated period fields exists only if their underlying types are
     defined in the "Data types bitfield" field.

     The "Expiry" field and the following ones are not used by all existing
     versions of haproxy peers. They are nevertheless MANDATORY, so that a
     stick-table aggregator peer is able to autoconfigure itself.
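
     A minimal sketch of the payload layout above follows. The function name
     build_definition_payload() is hypothetical, and so are the numeric
     values used for the key type, the data types bitfield and the frequency
     counters: the text does not define them. The table name is assumed to
     be ASCII, and each frequency counter is assumed to be passed as a pair
     of integers (counter, period).

```python
def encode(val):
    """Variable-length integer encoding from the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0x7F) | 0x80)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def build_definition_payload(table_id, name, key_type, key_length,
                             data_types, expiry, freq_counters=()):
    """Serialize the fields of a "Stick-table definition" message in order."""
    name_bytes = name.encode("ascii")
    out = encode(table_id) + encode(len(name_bytes)) + name_bytes
    out += encode(key_type) + encode(key_length)
    out += encode(data_types) + encode(expiry)
    # One (counter, period) pair per frequency counter announced in the
    # data types bitfield.
    for counter, period in freq_counters:
        out += encode(counter) + encode(period)
    return out
```

     The result is the message body only; on the wire it follows the class
     byte (10), the type byte (130) and the encoded message length.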


     - Entry update message format:
                         4
                   +-----------------+###########+############+
                   | Local update ID |    Key    |    Data    |
                   +-----------------+###########+############+

     with "Key" described as follows:

         +xxxxxxxxxxx+=======+
         |   length  | value |  if key type is (non null terminated) "string",
         +xxxxxxxxxxx+=======+

                             4
                     +-------+
                     | value |  if key type is "integer",
                     +-------+

                     +=======+
                     | value |  for other key types: the size is announced in
                     +=======+  the previous stick-table definition message.

      "Data" field is basically a list of encoded values for each type announced
      by the "Data types bitfield" field of the previous "Stick-table definition"
      message:

       +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+      +xxxxxxxxxxxxxxxxxxxx+
       | Data type #1 value | Data type #2 value | .... | Data type #n value |
       +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+      +xxxxxxxxxxxxxxxxxxxx+
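
     The layout above can be sketched as follows. The names encode_key() and
     build_entry_update_payload() are hypothetical. The text does not state
     the byte order of the 4-octet fields: network byte order (big-endian)
     is assumed here. Data values are limited to plain integers in this
     example.

```python
import struct


def encode(val):
    """Variable-length integer encoding from the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0x7F) | 0x80)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def encode_key(key_type, value):
    """Serialize the "Key" field according to the key type."""
    if key_type == "string":
        # Encoded length followed by the non null-terminated string.
        return encode(len(value)) + value
    if key_type == "integer":
        # Assumption: 4 octets in network byte order.
        return struct.pack("!I", value)
    # Other key types: raw bytes whose size was announced in the
    # previous stick-table definition message.
    return bytes(value)


def build_entry_update_payload(update_id, key_type, key, data_values):
    """Local update ID (4 octets), then the key, then the encoded data."""
    out = struct.pack("!I", update_id) + encode_key(key_type, key)
    for value in data_values:
        out += encode(value)
    return out
```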


     Most of these fields are internally stored as uint32_t (see STD_T_SINT,
     STD_T_UINT, STD_T_ULL C enumerations) or structures made of several uint32_t
     (see STD_T_FRQP C enumeration). The remaining one, STD_T_DICT, is used
     internally to store entries of LRU caches of literal dictionary entries
     (pairs of IDs associated with strings). It is used to transmit these cache
     entries as follows:

                    +xxxxxxxxxxx+xxxx+xxxxxxxxxxxxxxx+========+
                    |   length  | ID | string length | string |
                    +xxxxxxxxxxx+xxxx+xxxxxxxxxxxxxxx+========+

     "length" is the length in bytes of the remaining data after this "length" field.
     "string length" is the length of "string" field which follows.

     The cache avoids sending an already sent string again and again. The second
     time the same dictionary entry has to be sent, if it is still cached, a peer
     sends only its ID:

                              +xxxxxxxxxxx+xxxx+
                              |   length  | ID |
                              +xxxxxxxxxxx+xxxx+
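
     Both forms can be produced by the sketch below. encode_dict_entry() and
     DictSender are hypothetical names, and the sender-side cache is reduced
     to a plain set of already sent IDs, whereas the text describes an LRU
     cache whose entries may be evicted.

```python
def encode(val):
    """Variable-length integer encoding from the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0x7F) | 0x80)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def encode_dict_entry(entry_id, string=None):
    """Encode a dictionary entry, with or without its string."""
    body = encode(entry_id)
    if string is not None:
        body += encode(len(string)) + string
    # "length" covers everything that follows the length field itself.
    return encode(len(body)) + body


class DictSender:
    """Simplified sender: remembers which IDs were already transmitted."""

    def __init__(self):
        self.sent_ids = set()

    def emit(self, entry_id, string):
        if entry_id in self.sent_ids:
            return encode_dict_entry(entry_id)       # ID only
        self.sent_ids.add(entry_id)
        return encode_dict_entry(entry_id, string)   # ID and string
```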

     - Update message acknowledgement format:

     These messages are responses to "Entry update" messages only.

     Their format is very basic for efficiency reasons:

                                                      4
                         +xxxxxxxxxxxxxxxx+-----------+
                         | Stick-table ID | Update ID |
                         +xxxxxxxxxxxxxxxx+-----------+


     Note that the "Stick-table ID" field value refers to the one which has
     been previously announced by a "Stick-table definition" message.
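
     A complete acknowledgement message, header included, might be built as
     sketched below. The function name build_ack() is hypothetical, and the
     4-octet "Update ID" is assumed to be in network byte order, which the
     text does not state.

```python
import struct


def encode(val):
    """Variable-length integer encoding from the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0x7F) | 0x80)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def build_ack(table_id, update_id):
    """Build an "Update message acknowledgement" (class 10, type 133)."""
    # Encoded stick-table ID, then the update ID on 4 octets
    # (assumption: big-endian).
    payload = encode(table_id) + struct.pack("!I", update_id)
    return bytes([10, 133]) + encode(len(payload)) + payload


# Acknowledge update 5 for stick-table #1.
ack = build_ack(1, 5)
```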

     The following diagram may help in understanding how to handle a stream of
     stick-table update messages. The handshake step is not represented.
     Stick-table IDs are preceded by a '#' character.


                          Peer A               Peer B

                                 stkt def. #1
                            ---------------------->
                                 updates (1-5)
                            ---------------------->
                                 stkt def. #3
                            ---------------------->
                              updates (1000-1005)
                            ---------------------->

                                 stkt def. #2
                            <----------------------
                                updates (10-15)
                            <----------------------
                                 ack 5 for #1
                            <----------------------
                                ack 1005 for #3
                            <----------------------
                                 stkt def. #4
                            <----------------------
                               updates (100-105)
                            <----------------------

                                 ack  10 for #2
                            ---------------------->
                                 ack 105 for #4
                            ---------------------->
                (from here, on both sides, all stick-table updates
                          are considered as received)