xpcom/ds/tools/make_dafsa.py



#!/usr/bin/env python
# Copyright 2014 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.

"""
A Deterministic acyclic finite state automaton (DAFSA) is a compact
representation of an unordered word list (dictionary).

http://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton

This Python program converts a list of strings to a byte array in C++.
It fetches strings and return values from a gperf file and generates a
C++ file with a byte array representing a graph that can be used as a
memory-efficient replacement for the perfect hash table.

The input strings are assumed to consist of printable 7-bit ASCII characters
and the return values are assumed to be one-digit integers.

In this program a DAFSA is a diamond shaped graph starting at a common
root node and ending at a common end node. All internal nodes contain
a character and each word is represented by the characters in one path from
the root node to the end node.

The order of the operations is crucial since lookups will be performed
starting from the source with no backtracking. Thus a node must have at
most one child with a label starting with the same character. The output
is also arranged so that all jumps are to increasing addresses, thus forward
in memory.

The generated output has suffix free decoding so that the leading bits
in a link (a reference to a child node) indicate whether it has a size of one,
two or three bytes and whether it is the last outgoing link from the actual
node. A node label is terminated by a byte with the leading bit set.

The generated byte array can be described by the following BNF:

<byte> ::= < 8-bit value in range [0x00-0xFF] >

<char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] >
<end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] >
<return_value> ::= < value + 0x80, byte in range [0x80-0x8F] >

<offset1> ::= < byte in range [0x00-0x3F] >
<offset2> ::= < byte in range [0x40-0x5F] >
<offset3> ::= < byte in range [0x60-0x7F] >

<end_offset1> ::= < byte in range [0x80-0xBF] >
<end_offset2> ::= < byte in range [0xC0-0xDF] >
<end_offset3> ::= < byte in range [0xE0-0xFF] >

<prefix> ::= <char>

<label> ::= <end_char>
          | <char> <label>

<end_label> ::= <return_value>
          | <char> <end_label>

<offset> ::= <offset1>
           | <offset2> <byte>
           | <offset3> <byte> <byte>

<end_offset> ::= <end_offset1>
               | <end_offset2> <byte>
               | <end_offset3> <byte> <byte>

<offsets> ::= <end_offset>
            | <offset> <offsets>

<source> ::= <offsets>

<node> ::= <label> <offsets>
         | <prefix> <node>
         | <end_label>

<dafsa> ::= <source>
          | <dafsa> <node>

Decoding:

<char> -> printable 7-bit ASCII character
<end_char> & 0x7F -> printable 7-bit ASCII character
<return_value> & 0x0F -> integer
<offset1> & 0x3F -> integer
((<offset2> & 0x1F) << 8) + <byte> -> integer
((<offset3> & 0x1F) << 16) + (<byte> << 8) + <byte> -> integer

end_offset1, end_offset2 and end_offset3 are decoded the same way as offset1,
offset2 and offset3 respectively.
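
The offset rules above can be illustrated with a small round trip. This is
only a sketch and not part of this program: pack_offset and unpack_offset are
hypothetical names, bytes are shown in their final in-memory order, and the
"last" flag stands for the end_offset variants.

```python
def pack_offset(distance, last=False):
    # Pick the shortest of the three forms that can hold the distance.
    if not 0 < distance < (1 << 21):
        raise ValueError("distance out of range")
    if distance < (1 << 6):
        buf = [distance]
    elif distance < (1 << 13):
        buf = [0x40 | (distance >> 8), distance & 0xFF]
    else:
        buf = [0x60 | (distance >> 16), (distance >> 8) & 0xFF, distance & 0xFF]
    if last:
        # The high bit of the first byte marks the last link of a node.
        buf[0] |= 0x80
    return buf


def unpack_offset(buf):
    # Returns (distance, size_in_bytes, is_last) for the offset at buf[0].
    first = buf[0]
    is_last = bool(first & 0x80)
    kind = first & 0x60
    if kind == 0x60:
        return ((first & 0x1F) << 16) + (buf[1] << 8) + buf[2], 3, is_last
    if kind == 0x40:
        return ((first & 0x1F) << 8) + buf[1], 2, is_last
    return first & 0x3F, 1, is_last


for d in (1, 63, 64, 8191, 8192, (1 << 21) - 1):
    assert unpack_offset(pack_offset(d, last=True))[0] == d
```

In this sketch, distances below 64 take one byte, distances below 8192 take
two bytes and everything else up to 21 bits takes three bytes.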

The first offset in a list of offsets is the distance in bytes between the
offset itself and the first child node. Subsequent offsets are the distance
between previous child node and next child node. Thus each offset links a node
to a child node. The distance is always counted between start addresses, i.e.
first byte in decoded offset or first byte in child node.

Example 1:

%%
aa, 1
a, 2
%%

The input is first parsed to a list of words:
["aa1", "a2"]

This produces the following graph:
[root] --- a --- 0x02 --- [end]
            |              /
             |           /
              - a --- 0x01

A C++ representation of the compressed graph is generated:

const unsigned char dafsa[7] = {
  0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
};

The bytes in the generated array have the following meaning:

 0: 0x81 <end_offset1>  child at position 0 + (0x81 & 0x3F) -> jump to 1

 1: 0xE1 <end_char>     label character (0xE1 & 0x7F) -> match "a"
 2: 0x02 <offset1>      child at position 2 + (0x02 & 0x3F) -> jump to 4

 3: 0x81 <end_offset1>  child at position 4 + (0x81 & 0x3F) -> jump to 5
 4: 0x82 <return_value> 0x82 & 0x0F -> return 2

 5: 0x61 <char>         label character 0x61 -> match "a"
 6: 0x81 <return_value> 0x81 & 0x0F -> return 1

Example 2:

%%
aa, 1
bbb, 2
baa, 1
%%

The input is first parsed to a list of words:
["aa1", "bbb2", "baa1"]

This produces the following graph:
[root] --- a --- a --- 0x01 --- [end]
 |       /           /         /
  |    /           /         /
   - b --- b --- b --- 0x02

A C++ representation of the compressed graph is generated:

const unsigned char dafsa[11] = {
  0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
};

The bytes in the generated array have the following meaning:

 0: 0x02 <offset1>      child at position 0 + (0x02 & 0x3F) -> jump to 2
 1: 0x83 <end_offset1>  child at position 2 + (0x83 & 0x3F) -> jump to 5

 2: 0xE2 <end_char>     label character (0xE2 & 0x7F) -> match "b"
 3: 0x02 <offset1>      child at position 3 + (0x02 & 0x3F) -> jump to 5
 4: 0x83 <end_offset1>  child at position 5 + (0x83 & 0x3F) -> jump to 8

 5: 0x61 <char>         label character 0x61 -> match "a"
 6: 0x61 <char>         label character 0x61 -> match "a"
 7: 0x81 <return_value> 0x81 & 0x0F -> return 1

 8: 0x62 <char>         label character 0x62 -> match "b"
 9: 0x62 <char>         label character 0x62 -> match "b"
10: 0x82 <return_value> 0x82 & 0x0F -> return 2
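
A lookup over the generated array can be sketched as follows. This is a
hedged illustration derived only from the format described above, not the
lookup code that actually consumes the array. The names next_offset and
lookup are hypothetical, and the array is assumed to be well formed.

```python
def next_offset(data, pos, offset):
    # Decode the link at data[pos] and add its distance to offset.
    # Returns (new_pos, new_offset); new_pos is None after the last link.
    first = data[pos]
    kind = first & 0x60
    if kind == 0x60:
        offset += ((first & 0x1F) << 16) + (data[pos + 1] << 8) + data[pos + 2]
        size = 3
    elif kind == 0x40:
        offset += ((first & 0x1F) << 8) + data[pos + 1]
        size = 2
    else:
        offset += first & 0x3F
        size = 1
    return (None if first & 0x80 else pos + size), offset


def lookup(data, word):
    # Returns the stored value for word, or None if word is not present.
    key = word.encode("ascii")
    i = 0       # index of the next unmatched key byte
    pos = 0     # position of the next link to read
    offset = 0  # start address of the child being examined
    while pos is not None:
        pos, offset = next_offset(data, pos, offset)
        consumed = False
        if i < len(key) and not (data[offset] & 0x80):
            if data[offset] != key[i]:
                continue  # first <char> differs, try the next child
            consumed = True
            offset += 1
            i += 1
            # Match the remaining plain <char> bytes of the label.
            while i < len(key) and not (data[offset] & 0x80):
                if data[offset] != key[i]:
                    return None
                offset += 1
                i += 1
        if i == len(key):
            if (data[offset] & 0xE0) == 0x80:
                return data[offset] & 0x0F  # <return_value>
            if consumed:
                return None
            continue
        if data[offset] != (key[i] | 0x80):
            if consumed:
                return None
            continue
        # <end_char> matched: the links of the child follow immediately.
        i += 1
        offset += 1
        pos = offset
    return None


example1 = [0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81]
example2 = [0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82]
```

With the arrays from Example 1 and Example 2, lookup(example1, "a") gives 2,
lookup(example1, "aa") gives 1, lookup(example2, "bbb") gives 2 and
lookup(example2, "b") gives None.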
"""
import struct
import sys

from incremental_dafsa import Dafsa, Node


class InputError(Exception):
    """Exception raised for errors in the input file."""


def top_sort(dafsa: Dafsa):
    """Generates list of nodes in topological sort order."""
    incoming = {}

    def count_incoming(node: Node):
        """Counts incoming references."""
        if not node.is_end_node:
            if id(node) not in incoming:
                incoming[id(node)] = 1
                for child in node.children.values():
                    count_incoming(child)
            else:
                incoming[id(node)] += 1

    for node in dafsa.root_node.children.values():
        count_incoming(node)

    for node in dafsa.root_node.children.values():
        incoming[id(node)] -= 1

    waiting = [
        node for node in dafsa.root_node.children.values() if incoming[id(node)] == 0
    ]
    nodes = []

    while waiting:
        node = waiting.pop()
        assert incoming[id(node)] == 0
        nodes.append(node)
        for child in node.children.values():
            if not child.is_end_node:
                incoming[id(child)] -= 1
                if incoming[id(child)] == 0:
                    waiting.append(child)
    return nodes


def encode_links(node: Node, offsets, current):
    """Encodes a list of children as one, two or three byte offsets."""
    # Check for children before the first child is inspected below.
    assert node.children
    if next(iter(node.children.values())).is_end_node:
        # This is an <end_label> node and no links follow such nodes
        return []
    guess = 3 * len(node.children)

    children = sorted(node.children.values(), key=lambda x: -offsets[id(x)])
    while True:
        offset = current + guess
        buf = []
        for child in children:
            last = len(buf)
            distance = offset - offsets[id(child)]
            assert distance > 0 and distance < (1 << 21)

            if distance < (1 << 6):
                # A 6-bit offset: "s0xxxxxx"
                buf.append(distance)
            elif distance < (1 << 13):
                # A 13-bit offset: "s10xxxxxxxxxxxxx"
                buf.append(0x40 | (distance >> 8))
                buf.append(distance & 0xFF)
            else:
                # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
                buf.append(0x60 | (distance >> 16))
                buf.append((distance >> 8) & 0xFF)
                buf.append(distance & 0xFF)
            # The distance in the first link is relative to the following record.
            # Distances in other links are relative to the previous link.
            offset -= distance
        if len(buf) == guess:
            break
        guess = len(buf)
    # Set most significant bit to mark end of links in this node.
    buf[last] |= 1 << 7
    buf.reverse()
    return buf


def encode_prefix(label):
    """Encodes a node label as a list of bytes without a trailing high byte.

    This method encodes a node if there is exactly one child and the
    child follows immediately after, so that no jump is needed. This label
    will then be a prefix to the label in the child node.
    """
    assert label
    return [ord(c) for c in reversed(label)]


def encode_label(label):
    """Encodes a node label as a list of bytes with a trailing high byte >0x80."""
    buf = encode_prefix(label)
    # Set most significant bit to mark end of label in this node.
    buf[0] |= 1 << 7
    return buf


def encode(dafsa: Dafsa):
    """Encodes a DAFSA to a list of bytes"""
    output = []
    offsets = {}

    for node in reversed(top_sort(dafsa)):
        if (
            len(node.children) == 1
            and not next(iter(node.children.values())).is_end_node
            and (offsets[id(next(iter(node.children.values())))] == len(output))
        ):
            output.extend(encode_prefix(node.character))
        else:
            output.extend(encode_links(node, offsets, len(output)))
            output.extend(encode_label(node.character))
        offsets[id(node)] = len(output)

    output.extend(encode_links(dafsa.root_node, offsets, len(output)))
    output.reverse()
    return output


def to_cxx(data, preamble=None):
    """Generates C++ code from a list of encoded bytes."""
    text = "/* This file is generated. DO NOT EDIT!\n\n"
    text += "The byte array encodes a dictionary of strings and values. See "
    text += "make_dafsa.py for documentation."
    text += "*/\n\n"

    if preamble:
        text += preamble
        text += "\n\n"

    text += "const unsigned char kDafsa[%s] = {\n" % len(data)
    for i in range(0, len(data), 12):
        text += "  "
        text += ", ".join("0x%02x" % byte for byte in data[i : i + 12])
        text += ",\n"
    text += "};\n"
    return text


def words_to_cxx(words, preamble=None):
    """Generates C++ code from a word list"""
    dafsa = Dafsa.from_tld_data(words)
    return to_cxx(encode(dafsa), preamble)


def words_to_bin(words):
    """Generates bytes from a word list"""
    dafsa = Dafsa.from_tld_data(words)
    data = encode(dafsa)
    return struct.pack("%dB" % len(data), *data)


def parse_gperf(infile):
    """Parses a gperf file and extracts strings and return codes."""
    lines = [line.strip() for line in infile]

    # Extract the preamble.
    first_delimiter = lines.index("%%")
    preamble = "\n".join(lines[0:first_delimiter])

    # Extract strings after the first '%%' and before the second '%%'.
    begin = first_delimiter + 1
    end = lines.index("%%", begin)
    lines = lines[begin:end]
    for line in lines:
        if line[-3:-1] != ", ":
            raise InputError('Expected "domainname, <digit>", found "%s"' % line)
        # Technically the DAFSA format could support return values in range [0-31],
        # but the values below are the only ones with a defined meaning.
        if line[-1] not in "0124":
            raise InputError(
                'Expected value to be one of {0,1,2,4}, found "%s"' % line[-1]
            )
    return (preamble, [line[:-3] + line[-1] for line in lines])


def main(outfile, infile):
    # Use a separate name for the file object so the path is not shadowed.
    with open(infile, "r") as gperf_file:
        preamble, words = parse_gperf(gperf_file)
        outfile.write(words_to_cxx(words, preamble))
    return 0


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: %s infile outfile" % sys.argv[0])
        sys.exit(1)

    with open(sys.argv[2], "w") as outfile:
        sys.exit(main(outfile, sys.argv[1]))