.. _zebra:
*****
Zebra
*****
.. _zebra-protocol:
Overview of the Zebra Protocol
==============================
The Zebra protocol (or ``ZAPI``) is used by protocol daemons to
communicate with the **zebra** daemon.
Each protocol daemon may request and send information to and from the
**zebra** daemon, such as interface state, routing state,
nexthop validation, and so on. Protocol daemons may also install
routes through **zebra**. The **zebra** daemon manages which routes are
installed into the kernel's forwarding table. Some daemons use
more than one ZAPI connection. This is supported: each ZAPI session is
identified by the tuple ``{protocol, instance, session_id}``. LDPD
is an example: it uses a second, synchronous ZAPI session to manage
label blocks. The default value for ``session_id`` is zero; daemons
that use multiple ZAPI sessions must assign unique values to the
sessions' ids.
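
For illustration only, the identifying tuple can be pictured as a small
struct. The field names below are hypothetical and do not necessarily match
the actual members of ``struct zclient`` in ``lib/zclient.h``.

.. code-block:: c

   #include <stdint.h>

   /* Illustrative only: the tuple that identifies one ZAPI session. */
   struct zapi_session_key {
           uint8_t proto;       /* routing protocol type of the client daemon */
           uint16_t instance;   /* daemon instance number, 0 by default */
           uint32_t session_id; /* 0 for the primary session; extra sessions
                                   must use unique, non-zero ids */
   };
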
The Zebra protocol is a streaming protocol, with a common header. Version 0
lacks a version field and is implicitly versioned. Version 1 and all subsequent
versions have a version field. Version 0 can be distinguished from all other
versions by examining the 3rd byte of the header, which contains a marker value
of 255 (in Quagga) or 254 (in FRR) for all versions except version 0. The
marker byte corresponds to the command field in version 0, and the marker value
is a reserved command in version 0.
Version History
---------------
- Version 0
Used by all versions of GNU Zebra and all versions of Quagga up to and
including Quagga 0.98. This version has no ``version`` field, and so is
implicitly versioned as version 0.
- Version 1
Added ``marker`` and ``version`` fields, increased ``command`` field to 16
bits. Used by Quagga versions 0.99.3 through 0.99.20.
- Version 2
Used by Quagga versions 0.99.21 through 0.99.23.
- Version 3
Added ``vrf_id`` field. Used by Quagga versions 0.99.23 until FRR fork.
- Version 4
Changed the marker value to 254 to prevent people from mixing and matching Quagga and
FRR daemon binaries. Used by FRR versions 2.0 through 3.0.3.
- Version 5
Increased VRF identifier field from 16 to 32 bits. Used by FRR versions 4.0
through 5.0.1.
- Version 6
Removed the following commands:
* ZEBRA_IPV4_ROUTE_ADD
* ZEBRA_IPV4_ROUTE_DELETE
* ZEBRA_IPV6_ROUTE_ADD
* ZEBRA_IPV6_ROUTE_DELETE
Used since FRR version 6.0.
Zebra Protocol Definition
=========================
Zebra Protocol Header Field Definitions
---------------------------------------
Length
Total packet length including this header.
Marker
Static marker. The marker value, when it exists, is 255 in all versions of
Quagga. It is 254 in all versions of FRR. This is to allow version 0 headers
(which do not include version explicitly) to be distinguished from versioned
headers.
Version
Zebra protocol version number. Clients should not continue processing
messages past the version field for versions they do not recognise.
Command
The Zebra protocol command.
Current Version
^^^^^^^^^^^^^^^
::
Version 5, 6
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length | Marker | Version |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| VRF ID |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Command |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
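
As a concrete illustration, the sketch below decodes a version 5/6 header
from a raw buffer; multi-byte fields are in network byte order. The type and
function names here are hypothetical stand-ins, and the authoritative
definitions live in ``lib/zclient.h``.

.. code-block:: c

   #include <arpa/inet.h>
   #include <stdint.h>
   #include <string.h>

   /* Hypothetical names; see lib/zclient.h for the real definitions. */
   struct zapi_hdr_sketch {
           uint16_t length;  /* total message length, header included */
           uint8_t marker;   /* 254 on FRR, 255 on Quagga */
           uint8_t version;  /* 5 or 6 */
           uint32_t vrf_id;
           uint16_t command;
   };

   #define ZAPI_HDR_SKETCH_SIZE 10 /* 2 + 1 + 1 + 4 + 2 bytes */

   static int zapi_hdr_decode_sketch(const uint8_t *buf, size_t len,
                                     struct zapi_hdr_sketch *h)
   {
           if (len < ZAPI_HDR_SKETCH_SIZE)
                   return -1; /* not enough data for a full header */

           memcpy(&h->length, buf, 2);
           h->length = ntohs(h->length);
           h->marker = buf[2];
           h->version = buf[3];
           memcpy(&h->vrf_id, buf + 4, 4);
           h->vrf_id = ntohl(h->vrf_id);
           memcpy(&h->command, buf + 8, 2);
           h->command = ntohs(h->command);

           /* Reject headers from versions we do not understand. */
           if (h->marker != 254 || (h->version != 5 && h->version != 6))
                   return -1;

           return 0;
   }
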
Past Versions
^^^^^^^^^^^^^
::
Version 0
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length | Command |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
::
Version 1, 2
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length | Marker | Version |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Command |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
::
Version 3, 4
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length | Marker | Version |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| VRF ID | Command |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Zebra Protocol Commands
-----------------------
The definitions of zebra protocol commands can be found in ``lib/zclient.h``.
Dataplane batching
==================
Dataplane batching is an optimization that reduces the processing time spent
on the user space to kernel space transition for every message we want to
send.
Design
-----------
With our dataplane abstraction, we create a queue of dataplane context objects
for the messages we want to send to the kernel. In a separate pthread, we
loop over this queue and send the context objects to the appropriate
dataplane. The batching enhancement is tightly integrated with the dataplane
context objects so that they can be sent in batches to dataplanes that
support it.
There is one main change in the dataplane code: instead of calling
kernel-dependent functions one-by-one, it hands a list of work down to the
kernel level for processing.
Netlink
^^^^^^^
At the moment, this is the only dataplane that supports batched message
sending.
When messages must be sent to the kernel, they are appended consecutively to
the batch represented by ``struct nl_batch``. Context objects are first
encoded to their binary representation. All the encoding functions use the
same interface: they take a context object, a buffer, and the size of the
buffer as arguments. It is important that they handle the situation where a
message would not fit in the buffer and return a proper error. To achieve
zero-copy (in user space only), messages are encoded into the same buffer
that will later be passed to the kernel. Hence, we can theoretically hit the
boundary of the buffer.
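
That encoding contract might look like the following hedged sketch; the type
and function names are hypothetical stand-ins, not the real symbols in
zebra's netlink code.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>
   #include <sys/types.h>

   /* Hypothetical stand-in for a dataplane context object. */
   struct ctx_sketch {
           int ns;                 /* network namespace this message targets */
           const uint8_t *payload; /* already-built netlink message */
           size_t payload_len;
   };

   /* Encoder contract: write the message into 'buf', or return -1 when it
    * would not fit into 'buflen' bytes, without writing past the end. */
   static ssize_t encode_ctx_sketch(const struct ctx_sketch *ctx, uint8_t *buf,
                                    size_t buflen)
   {
           if (ctx->payload_len > buflen)
                   return -1; /* caller flushes the batch and retries */

           memcpy(buf, ctx->payload, ctx->payload_len);
           return (ssize_t)ctx->payload_len;
   }
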
Messages stored in the batch are sent when one of the following conditions
occurs (a sketch of this logic follows the list):
- When an encoding function returns the buffer overflow error. The context
object that caused this error is re-added to the new, empty batch.
- When the size of the batch hits a certain limit.
- When the namespace of the context object currently being processed differs
  from that of the previous ones. Messages for different namespaces have to
  be sent through distinct sockets, so they cannot share the same buffer.
- After the last message from the list is processed.
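
For illustration, the following sketch wires those conditions together,
reusing ``struct ctx_sketch`` and ``encode_ctx_sketch`` from the previous
sketch. Again, all names are hypothetical; the real logic lives in zebra's
netlink dataplane code. A caller would flush one final time after the last
context object, which covers the last condition above.

.. code-block:: c

   /* Hypothetical stand-in for the global batch state. */
   struct nl_batch_sketch {
           uint8_t buf[8192]; /* stands in for the shared global send buffer */
           size_t used;       /* bytes queued so far */
           size_t threshold;  /* flush threshold, smaller than sizeof(buf) */
           int ns;            /* namespace of the queued messages */
           int count;         /* number of queued messages */
   };

   /* Stand-in flush: the real code writes buf[0..used) to the netlink
    * socket of the batch's namespace and handles the kernel's replies. */
   static void batch_flush_sketch(struct nl_batch_sketch *b)
   {
           b->used = 0;
           b->count = 0;
   }

   static void batch_add_sketch(struct nl_batch_sketch *b,
                                const struct ctx_sketch *ctx)
   {
           ssize_t n;

           /* Different namespace: earlier messages must go out on another
            * socket, so they cannot share this buffer. */
           if (b->count > 0 && ctx->ns != b->ns)
                   batch_flush_sketch(b);

           n = encode_ctx_sketch(ctx, b->buf + b->used,
                                 sizeof(b->buf) - b->used);
           if (n < 0) {
                   /* Overflow: send what is queued, then re-add the message
                    * to the now-empty batch. */
                   batch_flush_sketch(b);
                   n = encode_ctx_sketch(ctx, b->buf, sizeof(b->buf));
           }
           if (n > 0) {
                   /* n < 0 here would mean the message cannot fit at all. */
                   b->used += (size_t)n;
                   b->ns = ctx->ns;
                   b->count++;
           }

           /* Size threshold reached: flush before the buffer can overflow. */
           if (b->used >= b->threshold)
                   batch_flush_sketch(b);
   }
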
As mentioned earlier, there is a special threshold that is smaller than the
size of the underlying buffer. It prevents the overflow error and thus
eliminates the case in which a message has to be encoded twice.
The buffer used for batching is global, since allocating such a large amount
of memory for every batch would not be efficient. However, its size can be
changed dynamically using the hidden vtysh command
``zebra kernel netlink batch-tx-buf (1-1048576) (1-1048576)``. This feature is
only used in tests and should not be used anywhere else.
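
For example, assuming the first argument is the buffer size and the second
the flush threshold (both in bytes), a test could shrink the buffer like
this:

::

   zebra kernel netlink batch-tx-buf 8192 4096
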
For every failed message in the batch, the kernel responds with an error
message. Error messages arrive in the same order as the requests were sent,
so parsing the response is straightforward. We use the two-pointer technique
to match requests with responses and then set the appropriate status on the
dataplane context objects. There is also a global receive buffer, and it is
assumed that whatever the kernel sends will fit into it. The payload of a
netlink error message consists of an error code and the original netlink
message of the request, so the batch response will not be larger than the
batch request plus some space for the headers.
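
A hedged sketch of the two-pointer matching step follows, with hypothetical
names and a deliberately simplified representation of the request and
response lists:

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical, simplified views of the batch and the kernel replies. */
   struct req_sketch  { uint32_t seq; int status; };
   struct resp_sketch { uint32_t seq; int error;  };

   /* Both arrays are ordered by sequence number, so a single pass with two
    * indexes is enough to pair them up (the "two pointer" technique). */
   static void match_responses_sketch(struct req_sketch *reqs, size_t nreq,
                                      const struct resp_sketch *resps,
                                      size_t nresp)
   {
           size_t r = 0;

           for (size_t i = 0; i < nreq; i++) {
                   /* Skip replies that belong to earlier requests. */
                   while (r < nresp && resps[r].seq < reqs[i].seq)
                           r++;

                   if (r < nresp && resps[r].seq == reqs[i].seq)
                           reqs[i].status = resps[r].error; /* kernel error */
                   else
                           reqs[i].status = 0; /* no reply: assume success */
           }
   }
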