# CBOR DNS Stream Format version 1 (CDSv1)

This is an experimental format for representing DNS information in CBOR
with the goals to:
- Be able to stream the information
- Support incomplete, broken and/or invalid DNS
- Have close to no data quality and signature degradation
- Support additional non-DNS metadata (such as ICMP/TCP attributes)

## Overview

In CBOR you are expected to have one root element, most likely an array or
map.  This format does not have a root element, instead you are expected to
read one CBOR array element at a time as a stream of CBOR elements with the
first array element being the stream initiator object.

```
[stream_init]
[message]
...
[message]
```
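As a rough illustration of reading such a stream, the sketch below decodes one top-level CBOR item at a time until the input ends. It is not part of the format. The names `Simple`, `read_item` and `read_stream` are made up for this example, the decoder only covers definite-length integers, byte strings, text strings, arrays and a few simple values, and the sample bytes are placeholders rather than a real capture. A real reader would more likely use an existing CBOR library.

```python
import io


class Simple(int):
    """Marks a decoded CBOR simple value (major type 7) as distinct from an integer."""


def _read_exact(stream, n):
    # Read exactly n bytes or fail; a short read means a truncated item.
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated CBOR item")
    return data


def read_item(stream, initial=None):
    """Decode one definite-length CBOR item (subset only, see lead-in)."""
    if initial is None:
        initial = _read_exact(stream, 1)[0]
    major, info = initial >> 5, initial & 0x1F
    if major == 7:
        if info < 20:
            return Simple(info)
        if info in (20, 21, 22):
            return {20: False, 21: True, 22: None}[info]
        raise ValueError("floats and extended simple values are not handled")
    if info < 24:
        arg = info
    elif info <= 27:
        # 24..27 mean the argument follows in 1, 2, 4 or 8 bytes.
        arg = int.from_bytes(_read_exact(stream, 1 << (info - 24)), "big")
    else:
        raise ValueError("indefinite lengths are not handled")
    if major == 0:
        return arg
    if major == 1:
        return -1 - arg
    if major == 2:
        return _read_exact(stream, arg)
    if major == 3:
        return _read_exact(stream, arg).decode("utf-8")
    if major == 4:
        return [read_item(stream) for _ in range(arg)]
    raise ValueError(f"major type {major} is not handled in this sketch")


def read_stream(stream):
    """Yield top-level CBOR items until the stream ends cleanly."""
    while True:
        first = stream.read(1)
        if not first:
            return
        yield read_item(stream, first[0])


# Placeholder input: two top-level arrays back to back, with no root element.
data = bytes.fromhex("816178" "830120e3")
items = list(read_stream(io.BytesIO(data)))
```

Because there is no root element, the reader never needs to know the total length up front; it stops when no further initial byte is available.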

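Here are some numbers on the compression rate compared to PCAP. `Factor` is the CDS size divided by the PCAP size in the same row, and `F/Uncompressed` is the compressed CDS size divided by the uncompressed PCAP size: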

Uncompressed | PCAP       | CDS       | Factor
-------------|------------|-----------|-------
client       | 458373     | 133640    | 0,2915
zonalizer    | 51769844   | 9450475   | 0,1825
large ditl   | 1003931674 | 298167709 | 0,2970
small ditl   | 1651252    | 603314    | 0,3653

Gzipped      | PCAP       | CDS       | Factor  | F/Uncompressed
-------------|------------|-----------|---------|---------------
client       | 108136     | 45944     | 0,4248  | 0,1002
zonalizer    | 12468329   | 2485620   | 0,1993  | 0,0480
large ditl   | 327227203  | 117569598 | 0,3592  | 0,1171
small ditl   | 539323     | 253402    | 0,4698  | 0,1534

Xzipped      | PCAP       | CDS       | Factor  | F/Uncompressed
-------------|------------|-----------|---------|---------------
client       | 76248      | 36308     | 0,4761  | 0,0792
zonalizer    | 7894356    | 1695920   | 0,2148  | 0,0327
large ditl   | 267031412  | 86747604  | 0,3248  | 0,0864
small ditl   | 442260     | 206596    | 0,4671  | 0,1251

- `client` is a couple of hours of DNS from my workstation
- `zonalizer` is half a day from [Zonalizer](https://zonalizer.makeinstall.se) which continuously tests gTLDs
- `large ditl`, `small ditl` are captures from [DITL](https://www.dns-oarc.net/oarc/data/ditl)

## Types

- `int`: A CBOR integer (major type 0x00 or 0x20)
- `uint`: A CBOR unsigned integer (value >= 0, major type 0x00)
- `nint`: A CBOR negative integer (value < 0, major type 0x20); this type has special meaning, see `Negative Integers`
- `simple`: A CBOR simple value (major type 0xe0)
- `bytes`: A CBOR byte string (major type 0x40)
- `string`: A CBOR UTF-8 string (major type 0x60)
- `any`: Any CBOR value
- `bool`: A CBOR boolean
- `rindex`: A CBOR negative integer that is a reverse index, see `Deduplication`

## Special Keywords

- `union`: Can be used to merge the given array or map into the current object
- `optional`: The attribute or object reference is optional

## Negative Integers

CBOR encodes negative numbers in a special way, and this format uses that
encoding for some non-negative numbers in order to tell them apart from
other values.

Because of that, all negative numbers need special decoding:

```
value = -value - 1
```
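A minimal sketch of that rule in Python, assuming the CBOR decoder has already produced an ordinary negative integer. The function names `decode_nint` and `encode_nint` are made up for this example:

```python
def decode_nint(raw: int) -> int:
    """Map a CBOR negative integer to the non-negative value it carries."""
    if raw >= 0:
        raise ValueError("expected a negative integer")
    return -raw - 1


def encode_nint(value: int) -> int:
    """Inverse of decode_nint: the negative integer that carries `value`."""
    if value < 0:
        raise ValueError("expected a non-negative value")
    return -value - 1
```

With this mapping the CBOR value -1 carries 0, -2 carries 1, and so on, so zero can still be represented.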

## Objects

The object code below uses:
- `[` and `]` to indicate the start and end of an array
- `type name` per object attribute
- `name` per object reference
- `...` to indicate a repetition of the previous definition
- `(`, `|` and `)` to indicate list of various types that the attribute can be

### stream_init

The initial object in the stream.

```
[
    string version,
    union stream_option option,
    ...
]
```

- `version`: The version of the format
- `option`: A list of stream option objects

### stream_option

A stream option that can specify critical information about the stream and
how it should be decoded, see `Stream Options` for more information.

```
[
    uint option_type,
    optional any option_value
]
```

- `option_type`: The type of option represented as a number
- `option_value`: The option value

### message

A message object that describes various DNS packets or other information.

```
[
    optional bool is_complete,
    union timestamp timestamp,
    simple message_bits,
    union ip_header ip_header,
    union ( icmp_message | udp_message | tcp_message | dns_message ) content
]
```

- `is_complete`: Will exist and be false if the message is not complete, in which case the following attributes may not exist
- `timestamp`: A timestamp object
- `message_bits`: Bitmap indicating message content
  - Bit 0: 0=Not DNS 1=DNS
  - Bit 1: if DNS: 0=UDP 1=TCP else: 0=ICMP/ICMPv6 1=TCP
  - Bit 2: Fragmented (0=no 1=yes)
  - Bit 3: Malformed (0=no 1=yes)
- `ip_header`: An IP header object
- `content`: The message content, may be an ICMP, UDP, TCP or DNS message object
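The `message_bits` bitmap above could be unpacked as follows. This is only a sketch: it assumes that bit 0 is the least significant bit, and the names `MessageBits` and `decode_message_bits` are made up for this example.

```python
from typing import NamedTuple


class MessageBits(NamedTuple):
    is_dns: bool
    protocol: str      # "udp", "tcp" or "icmp" (ICMP or ICMPv6)
    fragmented: bool
    malformed: bool


def decode_message_bits(bits: int) -> MessageBits:
    """Unpack message_bits, assuming bit 0 is the least significant bit."""
    is_dns = bool(bits & 0x1)
    bit1 = bool(bits & 0x2)
    if is_dns:
        # For DNS, bit 1 selects the transport: 0=UDP 1=TCP.
        protocol = "tcp" if bit1 else "udp"
    else:
        # For non-DNS, bit 1 selects: 0=ICMP/ICMPv6 1=TCP.
        protocol = "tcp" if bit1 else "icmp"
    return MessageBits(is_dns, protocol, bool(bits & 0x4), bool(bits & 0x8))
```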

### timestamp

The timestamp object of a message.

```
[
    ( uint seconds | nint diff_from_last ),
    optional uint useconds,
    optional uint nseconds
]
```

- `seconds`: The seconds of a UNIX timestamp
- `diff_from_last`: The difference from the last `timestamp.seconds`
- `useconds`: The microseconds of a UNIX timestamp, or if `diff_from_last` is used, the difference from the last `timestamp.useconds`
- `nseconds`: The nanoseconds of a UNIX timestamp, or if `diff_from_last` is used, the difference from the last `timestamp.nseconds`

### ip_header

The IP header of a message.

```
[
    ( uint | nint ) ip_bits,
    optional bytes src_addr,
    optional bytes dest_addr,
    optional ( uint | nint ) src_dest_port
]
```

- `ip_bits`: Bitmap indicating IP header content, if the type is `nint` it also indicates that it is a reverse from last, see `Deduplication` for more information
  - Bit 0: address family (0=AF_INET, 1=AF_INET6)
  - Bit 1: src_addr present
  - Bit 2: dest_addr present
  - Bit 3: port present
- `src_addr`: The source address with length specifying address family, 4 bytes is IPv4 and 16 is IPv6
- `dest_addr`: The destination address with length specifying address family, 4 bytes is IPv4 and 16 is IPv6
- `src_dest_port`: A combined source and destination port, see `Source And Destination Port`
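As an illustration, `ip_bits` could be unpacked like this. The sketch assumes that bit 0 is the least significant bit and that a negative `ip_bits` is first decoded as described in `Negative Integers` before the bits are tested. The function name and dictionary keys are made up for this example.

```python
def decode_ip_bits(raw: int) -> dict:
    """Unpack ip_bits; a negative input is treated as the 'reverse from last' form."""
    reverse = raw < 0
    # Assumption: the nint form carries the same bitmap, decoded with -value - 1.
    bits = -raw - 1 if reverse else raw
    return {
        "family": "AF_INET6" if bits & 0x1 else "AF_INET",
        "has_src_addr": bool(bits & 0x2),
        "has_dest_addr": bool(bits & 0x4),
        "has_port": bool(bits & 0x8),
        "reverse_from_last": reverse,
    }
```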

#### Source And Destination Port

The source and destination port are combined into one value.  If both source
and destination exist then the value is larger than 65535, with the
destination in the high 16 bits and the source in the low 16 bits.  If the
value is between 0 and 65535 then only the source exists.  If the value is
negative then only the destination exists.

```
if value > 0xffff then
    src_port = value & 0xffff
    dest_port = value >> 16
else if value < 0 then
    dest_port = -value - 1
else
    src_port = value
```
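The same logic as a runnable Python sketch, together with a matching encoder. The function names are made up for this example, and an absent port is represented as `None`:

```python
from typing import Optional, Tuple


def decode_src_dest_port(value: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (src_port, dest_port); a port that is not present is None."""
    if value > 0xFFFF:
        return value & 0xFFFF, value >> 16
    if value < 0:
        return None, -value - 1
    return value, None


def encode_src_dest_port(src: Optional[int], dest: Optional[int]) -> int:
    """Combine the ports into one value, mirroring the decoder above."""
    if src is not None and dest is not None:
        return (dest << 16) | src
    if dest is not None:
        return -dest - 1
    if src is not None:
        return src
    raise ValueError("at least one port is required")
```

One consequence of this layout is that a destination port of 0 combined with a source port yields a value of at most 65535, which decodes as a source port only.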

### icmp_message

`if ip_header.ip_bits.1=0 && ip_header.ip_bits.2=0`

```
[
    uint type,
    uint code
]
```

- `type`: TODO
- `code`: TODO

### udp_message

`if ip_header.ip_bits.1=1 && ip_header.ip_bits.2=0`

TODO

### tcp_message

`if ip_header.ip_bits.2=1`

```
[
    uint seq_nr,
    uint ack_nr,
    uint tcp_bits,
    uint window
]
```

- `seq_nr`: TODO
- `ack_nr`: TODO
- `tcp_bits`: TODO
  - 0: URG
  - 1: ACK
  - 2: PSH
  - 3: RST
  - 4: SYN
  - 5: FIN
- `window`: TODO

### dns_message

A DNS packet.

```
[
    optional bool is_complete,
    uint id,
    uint raw_dns_header,        # TODO
    optional nint count_bits,
    optional uint qdcount,
    optional uint ancount,
    optional uint nscount,
    optional uint arcount,
    optional simple rr_bits,
    optional [
        dns_question question,
        ...
    ],
    optional [
        resource_record answer,
        ...
    ],
    optional [
        resource_record authority,
        ...
    ],
    optional [
        resource_record additional,
        ...
    ],
    optional bytes malformed
]
```

- `is_complete`: Will exist and be false if the message is not complete, in which case the following attributes may not exist
- `id`: DNS identifier
- `raw_dns_header`: TODO
- `count_bits`: Bitmap indicating which counts are present, see `Negative Integers` and `Deduplication`
  - Bit 0: qdcount present
  - Bit 1: ancount present
  - Bit 2: nscount present
  - Bit 3: arcount present
- `qdcount`: Number of question records if different from the number of entries in `question`
- `ancount`: Number of answer resource records if different from the number of entries in `answer`
- `nscount`: Number of authority resource records if different from the number of entries in `authority`
- `arcount`: Number of additional resource records if different from the number of entries in `additional`
- `question`: The question records
- `answer`: The answer resource records
- `authority`: The authority resource records
- `additional`: The additional resource records
- `malformed`: Holds the bytes of the message that were not parsed

### question

A DNS question record.

```
[
    optional bool is_complete,
    ( bytes | compressed_name | rindex ) qname,
    optional uint qtype,
    optional nint qclass
]
```

- `is_complete`: Will exist and be false if the message is not complete, in which case the following attributes may not exist
- `qname`: The QNAME as byte string, a name compression object or a reverse index, see `Deduplication`
- `qtype`: The QTYPE, see `Deduplication`
- `qclass`: The QCLASS, see `Negative Integers` and `Deduplication`

### compressed_name

A compressed name which has references to other labels within the same message.

```
[
    ( bytes label | uint label_index | nint offset | simple extension_bits ),
    ...
]
```

- `label`: A byte string with a label part
- `label_index`: An index to the Nth byte string label in the message
- `offset`: The offset specified in the DNS message which could not be translated into a label index
- `extension_bits`: The extension bits if not 0b00 or 0b11 # TODO: add the extension bits

### resource_record

A DNS resource record.

```
[
    optional bool is_complete,
    ( bytes | compressed_name | rindex ) name,
    optional simple rr_bits,
    optional uint type,
    optional uint class,
    optional uint ttl,
    optional uint rdlength,
    ( bytes | mixed_rdata ) rdata
]
```

- `is_complete`: Will exist and be false if the message is not complete, in which case the following attributes may not exist
- `name`: The name as byte string, a name compression object or a reverse index, see `Deduplication`
- `rr_bits`: Bitmap indicating what is present, see `Deduplication`
  - Bit 0: type
  - Bit 1: class
  - Bit 2: ttl
  - Bit 3: rdlength     # TODO: reverse index for TTL?
- `type`: The resource record type
- `class`: The resource record class
- `ttl`: The resource record ttl
- `rdlength`: The resource record rdata length
- `rdata`: The resource record data

### mixed_rdata

An array mixed with resource data and compressed names.

```
[
    ( bytes | compressed_name ) rdata_part,
    ...
]
```

- `rdata_part`: The parts of the resource record's data

## Stream Options

Each option is specified here as OptionName(OptionNumber) and optional
OptionValue type.

- `RLABELS(0) uint`: Indicates how many labels should be stored in the reverse label index before discarding them
- `RLABEL_MIN_SIZE(1) uint`: The minimum size a label must be to be put in the reverse label index
- `RDATA_RINDEX_SIZE(2) uint`: Indicates how many rdata should be stored in the reverse rdata index before discarding them
- `RDATA_RINDEX_MIN_SIZE(3) uint`: The minimum size an rdata must be to be put in the reverse rdata index
- `USE_RDATA_INDEX(4)`: If present then the stream uses rdata indexing
- `RDATA_INDEX_MIN_SIZE(5) uint`: The minimum size an rdata must be to be put in the rdata index

## Deduplication

Deduplication is done in a few different ways: data may be left out to
indicate that it is the same as the previous value, an index may be used to
indicate that it is the same as the Nth previous value, and a reverse index
may be used to indicate that it is the Nth previous value looking backwards
across the stream.

In other words, with index deduplication you need to build a table of the
values you come across while decoding the stream.  This table can grow very
large.

As a smaller alternative, a reverse index can refer to often used data among
the N previous values looking back over the stream. This type of index also
reorders itself to try to keep the most used data in the index.

TODO: details of each attribute and its deduplication