summaryrefslogtreecommitdiffstats
path: root/web/server/h2o/libh2o/deps/klib/README.md
blob: ddd74f470e9a0f0680d01bbc7c258f7c409ef9a6 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
#Klib: a Generic Library in C

##<a name="overview"></a>Overview

Klib is a standalone and lightweight C library distributed under [MIT/X11
license][1]. Most components are independent of external libraries, except the
standard C library, and independent of each other. To use a component of this
library, you only need to copy a couple of files to your source code tree
without worrying about library dependencies.

Klib strives for efficiency and a small memory footprint. Some components, such
as khash.h, kbtree.h, ksort.h and kvec.h, are among the most efficient
implementations of similar algorithms or data structures in all programming
languages, in terms of both speed and memory use.

A new documentation is available [here](http://attractivechaos.github.io/klib/)
which includes most information in this README file.

####Common components

* [khash.h][khash]: generic hash table based on [double hashing][2].
* [kbtree.h][kbtree]: generic search tree based on [B-tree][3].
* [ksort.h][ksort]: generic sort, including [introsort][4], [merge sort][5], [heap sort][6], [comb sort][7], [Knuth shuffle][8] and the [k-small][9] algorithm.
* [kseq.h][kseq]: generic stream buffer and a [FASTA][10]/[FASTQ][11] format parser.
* kvec.h: generic dynamic array.
* klist.h: generic single-linked list and [memory pool][12].
* kstring.{h,c}: basic string library.
* kmath.{h,c}: numerical routines including [MT19937-64][13] [pseudorandom generator][14], basic [nonlinear programming][15] and a few special math functions.

####Components for more specific use cases

* ksa.c: constructing [suffix arrays][16] for strings with multiple sentinels, based on a revised [SAIS algorithm][17].
* knetfile.{h,c}: random access to remote files on HTTP or FTP.
* kopen.c: smart stream opening.
* khmm.{h,c}: basic [HMM][18] library.
* ksw.(h,c}: Striped [Smith-Waterman algorithm][19].
* knhx.{h,c}: [Newick tree format][20] parser.


##<a name="methodology"></a>Methodology

For the implementation of generic [containers][21], klib extensively uses C
macros. To use these data structures, we usually need to instantiate methods by
expanding a long macro. This makes the source code look unusual or even ugly
and adds difficulty to debugging. Unfortunately, for efficient generic
programming in C that lacks [template][22], using macros is the only
solution. Only with macros, we can write a generic container which, once
instantiated, compete with a type-specific container in efficiency. Some
generic libraries in C, such as [Glib][23], use the `void*` type to implement
containers. These implementations are usually slower and use more memory than
klib (see [this benchmark][31]).

To effectively use klib, it is important to understand how it achieves generic
programming. We will use the hash table library as an example:

    #include "khash.h"
    KHASH_MAP_INIT_INT(m32, char)        // instantiate structs and methods
    int main() {
        int ret, is_missing;
        khint_t k;
        khash_t(m32) *h = kh_init(m32);  // allocate a hash table
        k = kh_put(m32, h, 5, &ret);     // insert a key to the hash table
        if (!ret) kh_del(m32, h, k);
        kh_value(h, k) = 10;             // set the value
        k = kh_get(m32, h, 10);          // query the hash table
        is_missing = (k == kh_end(h));   // test if the key is present
        k = kh_get(m32, h, 5);
        kh_del(m32, h, k);               // remove a key-value pair
        for (k = kh_begin(h); k != kh_end(h); ++k)  // traverse
            if (kh_exist(h, k))          // test if a bucket contains data
    			kh_value(h, k) = 1;
        kh_destroy(m32, h);              // deallocate the hash table
        return 0;
    }

In this example, the second line instantiates a hash table with `unsigned` as
the key type and `char` as the value type. `m32` names such a type of hash table.
All types and functions associated with this name are macros, which will be
explained later. Macro `kh_init()` initiates a hash table and `kh_destroy()`
frees it. `kh_put()` inserts a key and returns the iterator (or the position)
in the hash table. `kh_get()` and `kh_del()` get a key and delete an element,
respectively. Macro `kh_exist()` tests if an iterator (or a position) is filled
with data.

An immediate question is this piece of code does not look like a valid C
program (e.g. lacking semicolon, assignment to an _apparent_ function call and
_apparent_ undefined `m32` 'variable'). To understand why the code is correct,
let's go a bit further into the source code of `khash.h`, whose skeleton looks
like:

    #define KHASH_INIT(name, SCOPE, key_t, val_t, is_map, _hashf, _hasheq) \
      typedef struct { \
        int n_buckets, size, n_occupied, upper_bound; \
        unsigned *flags; \
        key_t *keys; \
        val_t *vals; \
      } kh_##name##_t; \
      SCOPE inline kh_##name##_t *init_##name() { \
        return (kh_##name##_t*)calloc(1, sizeof(kh_##name##_t)); \
      } \
      SCOPE inline int get_##name(kh_##name##_t *h, key_t k) \
      ... \
      SCOPE inline void destroy_##name(kh_##name##_t *h) { \
        if (h) { \
          free(h->keys); free(h->flags); free(h->vals); free(h); \
        } \
      }
    
    #define _int_hf(key) (unsigned)(key)
    #define _int_heq(a, b) (a == b)
    #define khash_t(name) kh_##name##_t
    #define kh_value(h, k) ((h)->vals[k])
    #define kh_begin(h, k) 0
    #define kh_end(h) ((h)->n_buckets)
    #define kh_init(name) init_##name()
    #define kh_get(name, h, k) get_##name(h, k)
    #define kh_destroy(name, h) destroy_##name(h)
    ...
    #define KHASH_MAP_INIT_INT(name, val_t) \
    	KHASH_INIT(name, static, unsigned, val_t, is_map, _int_hf, _int_heq)

`KHASH_INIT()` is a huge macro defining all the structs and methods. When this
macro is called, all the code inside it will be inserted by the [C
preprocess][37] to the place where it is called. If the macro is called
multiple times, multiple copies of the code will be inserted. To avoid naming
conflict of hash tables with different key-value types, the library uses [token
concatenation][36], which is a preprocessor feature whereby we can substitute
part of a symbol based on the parameter of the macro. In the end, the C
preprocessor will generate the following code and feed it to the compiler
(macro `kh_exist(h,k)` is a little complex and not expanded for simplicity):

    typedef struct {
      int n_buckets, size, n_occupied, upper_bound;
      unsigned *flags;
      unsigned *keys;
      char *vals;
    } kh_m32_t;
    static inline kh_m32_t *init_m32() {
      return (kh_m32_t*)calloc(1, sizeof(kh_m32_t));
    }
    static inline int get_m32(kh_m32_t *h, unsigned k)
    ...
    static inline void destroy_m32(kh_m32_t *h) {
      if (h) {
        free(h->keys); free(h->flags); free(h->vals); free(h);
      }
    }

	int main() {
		int ret, is_missing;
		khint_t k;
		kh_m32_t *h = init_m32();
		k = put_m32(h, 5, &ret);
		if (!ret) del_m32(h, k);
		h->vals[k] = 10;
		k = get_m32(h, 10);
		is_missing = (k == h->n_buckets);
		k = get_m32(h, 5);
		del_m32(h, k);
		for (k = 0; k != h->n_buckets; ++k)
			if (kh_exist(h, k)) h->vals[k] = 1;
		destroy_m32(h);
		return 0;
	}

This is the C program we know.

From this example, we can see that macros and the C preprocessor plays a key
role in klib. Klib is fast partly because the compiler knows the key-value
type at the compile time and is able to optimize the code to the same level
as type-specific code. A generic library written with `void*` will not get such
performance boost.

Massively inserting code upon instantiation may remind us of C++'s slow
compiling speed and huge binary size when STL/boost is in use. Klib is much
better in this respect due to its small code size and component independency.
Inserting several hundreds lines of code won't make compiling obviously slower.

##<a name="resources"></a>Resources

* Library documentation, if present, is available in the header files. Examples
can be found in the [test/][24] directory.
* **Obsolete** documentation of the hash table library can be found at
[SourceForge][25]. This README is partly adapted from the old documentation.
* [Blog post][26] describing the hash table library.
* [Blog post][27] on why using `void*` for generic programming may be inefficient.
* [Blog post][28] on the generic stream buffer.
* [Blog post][29] evaluating the performance of `kvec.h`.
* [Blog post][30] arguing B-tree may be a better data structure than a binary search tree.
* [Blog post][31] evaluating the performance of `khash.h` and `kbtree.h` among many other implementations.
[An older version][33] of the benchmark is also available.
* [Blog post][34] benchmarking internal sorting algorithms and implementations.
* [Blog post][32] on the k-small algorithm.
* [Blog post][35] on the Hooke-Jeeve's algorithm for nonlinear programming.

[1]: http://en.wikipedia.org/wiki/MIT_License
[2]: http://en.wikipedia.org/wiki/Double_hashing
[3]: http://en.wikipedia.org/wiki/B-tree
[4]: http://en.wikipedia.org/wiki/Introsort
[5]: http://en.wikipedia.org/wiki/Merge_sort
[6]: http://en.wikipedia.org/wiki/Heapsort
[7]: http://en.wikipedia.org/wiki/Comb_sort
[8]: http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
[9]: http://en.wikipedia.org/wiki/Selection_algorithm
[10]: http://en.wikipedia.org/wiki/FASTA_format
[11]: http://en.wikipedia.org/wiki/FASTQ_format
[12]: http://en.wikipedia.org/wiki/Memory_pool
[13]: http://en.wikipedia.org/wiki/Mersenne_twister
[14]: http://en.wikipedia.org/wiki/Pseudorandom_generator
[15]: http://en.wikipedia.org/wiki/Nonlinear_programming
[16]: http://en.wikipedia.org/wiki/Suffix_array
[17]: https://sites.google.com/site/yuta256/sais
[18]: http://en.wikipedia.org/wiki/Hidden_Markov_model
[19]: http://en.wikipedia.org/wiki/Smith-Waterman_algorithm
[20]: http://en.wikipedia.org/wiki/Newick_format
[21]: http://en.wikipedia.org/wiki/Container_(abstract_data_type)
[22]: http://en.wikipedia.org/wiki/Template_(C%2B%2B)
[23]: http://en.wikipedia.org/wiki/GLib
[24]: https://github.com/attractivechaos/klib/tree/master/test
[25]: http://klib.sourceforge.net/
[26]: http://attractivechaos.wordpress.com/2008/09/02/implementing-generic-hash-library-in-c/
[27]: http://attractivechaos.wordpress.com/2008/10/02/using-void-in-generic-c-programming-may-be-inefficient/
[28]: http://attractivechaos.wordpress.com/2008/10/11/a-generic-buffered-stream-wrapper/
[29]: http://attractivechaos.wordpress.com/2008/09/19/c-array-vs-c-vector/
[30]: http://attractivechaos.wordpress.com/2008/09/24/b-tree-vs-binary-search-tree/
[31]: http://attractivechaos.wordpress.com/2008/10/07/another-look-at-my-old-benchmark/
[32]: http://attractivechaos.wordpress.com/2008/09/13/calculating-median/
[33]: http://attractivechaos.wordpress.com/2008/08/28/comparison-of-hash-table-libraries/
[34]: http://attractivechaos.wordpress.com/2008/08/28/comparison-of-internal-sorting-algorithms/
[35]: http://attractivechaos.wordpress.com/2008/08/24/derivative-free-optimization-dfo/
[36]: http://en.wikipedia.org/wiki/C_preprocessor#Token_concatenation
[37]: http://en.wikipedia.org/wiki/C_preprocessor

[kbtree]: http://attractivechaos.github.io/klib/#KBtree%3A%20generic%20ordered%20map:%5B%5BKBtree%3A%20generic%20ordered%20map%5D%5D
[khash]: http://attractivechaos.github.io/klib/#Khash%3A%20generic%20hash%20table:%5B%5BKhash%3A%20generic%20hash%20table%5D%5D
[kseq]: http://attractivechaos.github.io/klib/#Kseq%3A%20stream%20buffer%20and%20FASTA%2FQ%20parser:%5B%5BKseq%3A%20stream%20buffer%20and%20FASTA%2FQ%20parser%5D%5D
[ksort]: http://attractivechaos.github.io/klib/#Ksort%3A%20sorting%2C%20shuffling%2C%20heap%20and%20k-small:%5B%5BKsort%3A%20sorting%2C%20shuffling%2C%20heap%20and%20k-small%5D%5D