diff options
Diffstat (limited to '')
-rw-r--r-- | docs/README | 273 | ||||
-rw-r--r-- | docs/mainpage.dox | 61 | ||||
-rw-r--r-- | docs/mutex.txt | 136 | ||||
-rw-r--r-- | docs/tdb.magic | 10 | ||||
-rw-r--r-- | docs/tracing.txt | 46 |
5 files changed, 526 insertions, 0 deletions
diff --git a/docs/README b/docs/README new file mode 100644 index 0000000..86d46a3 --- /dev/null +++ b/docs/README @@ -0,0 +1,273 @@ +tdb - a trivial database system +tridge@linuxcare.com December 1999 +================================== + +This is a simple database API. It was inspired by the realisation that +in Samba we have several ad-hoc bits of code that essentially +implement small databases for sharing structures between parts of +Samba. As I was about to add another I realised that a generic +database module was called for to replace all the ad-hoc bits. + +I based the interface on gdbm. I couldn't use gdbm as we need to be +able to have multiple writers to the databases at one time. + +Compilation +----------- + +add HAVE_MMAP=1 to use mmap instead of read/write +add NOLOCK=1 to disable locking code + +Testing +------- + +Compile tdbtest.c and link with gdbm for testing. tdbtest will perform +identical operations via tdb and gdbm then make sure the result is the +same + +Also included is tdbtool, which allows simple database manipulation +on the commandline. + +tdbtest and tdbtool are not built as part of Samba, but are included +for completeness. + +Interface +--------- + +The interface is very similar to gdbm except for the following: + +- different open interface. The tdb_open call is more similar to a + traditional open() +- no tdbm_reorganise() function +- no tdbm_sync() function. No operations are cached in the library anyway +- added a tdb_traverse() function for traversing the whole database +- added transactions support + +A general rule for using tdb is that the caller frees any returned +TDB_DATA structures. Just call free(p.dptr) to free a TDB_DATA +return value called p. This is the same as gdbm. + +here is a full list of tdb functions with brief descriptions. + + +---------------------------------------------------------------------- +TDB_CONTEXT *tdb_open(char *name, int hash_size, int tdb_flags, + int open_flags, mode_t mode) + + open the database, creating it if necessary + + The open_flags and mode are passed straight to the open call on the database + file. A flags value of O_WRONLY is invalid + + The hash size is advisory, use zero for a default value. + + return is NULL on error + + possible tdb_flags are: + TDB_CLEAR_IF_FIRST - clear database if we are the only one with it open + TDB_INTERNAL - don't use a file, instead store the data in + memory. The filename is ignored in this case. + TDB_NOLOCK - don't do any locking + TDB_NOMMAP - don't use mmap + TDB_NOSYNC - don't synchronise transactions to disk + TDB_SEQNUM - maintain a sequence number + TDB_VOLATILE - activate the per-hashchain freelist, default 5 + TDB_ALLOW_NESTING - allow transactions to nest + TDB_DISALLOW_NESTING - disallow transactions to nest + +---------------------------------------------------------------------- +TDB_CONTEXT *tdb_open_ex(char *name, int hash_size, int tdb_flags, + int open_flags, mode_t mode, + const struct tdb_logging_context *log_ctx, + tdb_hash_func hash_fn) + +This is like tdb_open(), but allows you to pass an initial logging and +hash function. Be careful when passing a hash function - all users of +the database must use the same hash function or you will get data +corruption. + + +---------------------------------------------------------------------- +char *tdb_error(TDB_CONTEXT *tdb); + + return a error string for the last tdb error + +---------------------------------------------------------------------- +int tdb_close(TDB_CONTEXT *tdb); + + close a database + +---------------------------------------------------------------------- +TDB_DATA tdb_fetch(TDB_CONTEXT *tdb, TDB_DATA key); + + fetch an entry in the database given a key + if the return value has a null dptr then a error occurred + + caller must free the resulting data + +---------------------------------------------------------------------- +int tdb_parse_record(struct tdb_context *tdb, TDB_DATA key, + int (*parser)(TDB_DATA key, TDB_DATA data, + void *private_data), + void *private_data); + + Hand a record to a parser function without allocating it. + + This function is meant as a fast tdb_fetch alternative for large records + that are frequently read. The "key" and "data" arguments point directly + into the tdb shared memory, they are not aligned at any boundary. + + WARNING: The parser is called while tdb holds a lock on the record. DO NOT + call other tdb routines from within the parser. Also, for good performance + you should make the parser fast to allow parallel operations. + + tdb_parse_record returns -1 if the record was not found. If the record was + found, the return value of "parser" is passed up to the caller. + +---------------------------------------------------------------------- +int tdb_exists(TDB_CONTEXT *tdb, TDB_DATA key); + + check if an entry in the database exists + + note that 1 is returned if the key is found and 0 is returned if not found + this doesn't match the conventions in the rest of this module, but is + compatible with gdbm + +---------------------------------------------------------------------- +int tdb_traverse(TDB_CONTEXT *tdb, int (*fn)(TDB_CONTEXT *tdb, + TDB_DATA key, TDB_DATA dbuf, void *state), void *state); + + traverse the entire database - calling fn(tdb, key, data, state) on each + element. + + return -1 on error or the record count traversed + + if fn is NULL then it is not called + + a non-zero return value from fn() indicates that the traversal + should stop. Traversal callbacks may not start transactions. + + WARNING: The data buffer given to the callback fn does NOT meet the + alignment restrictions malloc gives you. + +---------------------------------------------------------------------- +int tdb_traverse_read(TDB_CONTEXT *tdb, int (*fn)(TDB_CONTEXT *tdb, + TDB_DATA key, TDB_DATA dbuf, void *state), void *state); + + traverse the entire database - calling fn(tdb, key, data, state) on + each element, but marking the database read only during the + traversal, so any write operations will fail. This allows tdb to + use read locks, which increases the parallelism possible during the + traversal. + + return -1 on error or the record count traversed + + if fn is NULL then it is not called + + a non-zero return value from fn() indicates that the traversal + should stop. Traversal callbacks may not start transactions. + +---------------------------------------------------------------------- +TDB_DATA tdb_firstkey(TDB_CONTEXT *tdb); + + find the first entry in the database and return its key + + the caller must free the returned data + +---------------------------------------------------------------------- +TDB_DATA tdb_nextkey(TDB_CONTEXT *tdb, TDB_DATA key); + + find the next entry in the database, returning its key + + the caller must free the returned data + +---------------------------------------------------------------------- +int tdb_delete(TDB_CONTEXT *tdb, TDB_DATA key); + + delete an entry in the database given a key + +---------------------------------------------------------------------- +int tdb_store(TDB_CONTEXT *tdb, TDB_DATA key, TDB_DATA dbuf, int flag); + + store an element in the database, replacing any existing element + with the same key + + If flag==TDB_INSERT then don't overwrite an existing entry + If flag==TDB_MODIFY then don't create a new entry + + return 0 on success, -1 on failure + +---------------------------------------------------------------------- +int tdb_writelock(TDB_CONTEXT *tdb); + + lock the database. If we already have it locked then don't do anything + +---------------------------------------------------------------------- +int tdb_writeunlock(TDB_CONTEXT *tdb); + unlock the database + +---------------------------------------------------------------------- +int tdb_chainlock(TDB_CONTEXT *tdb, TDB_DATA key); + + lock one hash chain. This is meant to be used to reduce locking + contention - it cannot guarantee how many records will be locked + +---------------------------------------------------------------------- +int tdb_chainunlock(TDB_CONTEXT *tdb, TDB_DATA key); + + unlock one hash chain + +---------------------------------------------------------------------- +int tdb_transaction_start(TDB_CONTEXT *tdb) + + start a transaction. All operations after the transaction start can + either be committed with tdb_transaction_commit() or cancelled with + tdb_transaction_cancel(). + + If you call tdb_transaction_start() again on the same tdb context + while a transaction is in progress, then the same transaction + buffer is re-used. The number of tdb_transaction_{commit,cancel} + operations must match the number of successful + tdb_transaction_start() calls. + + Note that transactions are by default disk synchronous, and use a + recover area in the database to automatically recover the database + on the next open if the system crashes during a transaction. You + can disable the synchronous transaction recovery setup using the + TDB_NOSYNC flag, which will greatly speed up operations at the risk + of corrupting your database if the system crashes. + + Operations made within a transaction are not visible to other users + of the database until a successful commit. + +---------------------------------------------------------------------- +int tdb_transaction_cancel(TDB_CONTEXT *tdb) + + cancel a current transaction, discarding all write and lock + operations that have been made since the transaction started. + + +---------------------------------------------------------------------- +int tdb_transaction_commit(TDB_CONTEXT *tdb) + + commit a current transaction, updating the database and releasing + the transaction locks. + +---------------------------------------------------------------------- +int tdb_transaction_prepare_commit(TDB_CONTEXT *tdb) + + prepare to commit a current transaction, for two-phase commits. + Once prepared for commit, the only allowed calls are + tdb_transaction_commit() or tdb_transaction_cancel(). Preparing + allocates disk space for the pending updates, so a subsequent + commit should succeed (barring any hardware failures). + +---------------------------------------------------------------------- +int tdb_check(TDB_CONTEXT *tdb, + int (*check)(TDB_DATA key, TDB_DATA data, void *private_data), + void *private_data);) + + check the consistency of the database, calling back the check function + (if non-NULL) with each record. If some consistency check fails, or + the supplied check function returns -1, tdb_check returns -1, otherwise + 0. Note that logging function (if set) will be called with additional + information on the corruption found. diff --git a/docs/mainpage.dox b/docs/mainpage.dox new file mode 100644 index 0000000..d130769 --- /dev/null +++ b/docs/mainpage.dox @@ -0,0 +1,61 @@ +/** + +@mainpage + +This is a simple database API. It was inspired by the realisation that in Samba +we have several ad-hoc bits of code that essentially implement small databases +for sharing structures between parts of Samba. + +The interface is based on gdbm. gdbm couldn't be use as we needed to be able to +have multiple writers to the databases at one time. + +@section tdb_download Download + +You can download the latest releases of tdb from the +<a href="http://samba.org/ftp/tdb">tdb directory</a> on the samba public source +archive. + +You can download the latest code either via git or rsync. + +To fetch via git see the following guide: + +<a href="http://wiki.samba.org/index.php/Using_Git_for_Samba_Development">Using Git for Samba Development</a> +Once you have cloned the tree switch to the master branch and cd into the source/lib/tdb directory. + +To fetch via rsync use these commands: + +<pre> + rsync -Pavz samba.org::ftp/unpacked/standalone_projects/lib/tdb . + rsync -Pavz samba.org::ftp/unpacked/standalone_projects/lib/replace . +</pre> + +and build in tdb. It will find the replace library in the directory above +automatically. + +@section tdb_bugs Discussion and bug reports + +tdb does not currently have its own mailing list or bug tracking system. For now, +please use the +<a href="https://lists.samba.org/mailman/listinfo/samba-technical">samba-technical</a> +mailing list, and the <a href="http://bugzilla.samba.org/">Samba bugzilla</a> bug +tracking system. + + +@section tdb_compilation Compilation + +add HAVE_MMAP=1 to use mmap instead of read/write +add NOLOCK=1 to disable locking code + +@section tdb_testing Testing + +Compile tdbtest.c and link with gdbm for testing. tdbtest will perform +identical operations via tdb and gdbm then make sure the result is the +same + +Also included is tdbtool, which allows simple database manipulation +on the commandline. + +tdbtest and tdbtool are not built as part of Samba, but are included +for completeness. + +*/ diff --git a/docs/mutex.txt b/docs/mutex.txt new file mode 100644 index 0000000..a5a7542 --- /dev/null +++ b/docs/mutex.txt @@ -0,0 +1,136 @@ +Tdb is a hashtable database with multiple concurrent writer and external +record lock support. For speed reasons, wherever possible tdb uses a shared +memory mapped area for data access. In its currently released form, it uses +fcntl byte-range locks to coordinate access to the data itself. + +The tdb data is organized as a hashtable. Hash collisions are dealt with by +forming a linked list of records that share a hash value. The individual +linked lists are protected across processes with 1-byte fcntl locks on the +starting pointer of the linked list representing a hash value. + +The external locking API of tdb allows one to lock individual records. Instead of +really locking individual records, the tdb API locks a complete linked list +with a fcntl lock. + +The external locking API of tdb also allows one to lock the complete database, and +ctdb uses this facility to freeze databases during a recovery. While the +so-called allrecord lock is held, all linked lists and all individual records +are frozen alltogether. Tdb achieves this by locking the complete file range +with a single fcntl lock. Individual 1-byte locks for the linked lists +conflict with this. Access to records is prevented by the one large fnctl byte +range lock. + +Fcntl locks have been chosen for tdb for two reasons: First they are portable +across all current unixes. Secondly they provide auto-cleanup. If a process +dies while holding a fcntl lock, the lock is given up as if it was explicitly +unlocked. Thus fcntl locks provide a very robust locking scheme, if a process +dies for any reason the database will not stay blocked until reboot. This +robustness is very important for long-running services, a reboot is not an +option for most users of tdb. + +Unfortunately, during stress testing, fcntl locks have turned out to be a major +problem for performance. The particular problem that was seen happens when +ctdb on a busy server does a recovery. A recovery means that ctdb has to +freeze all tdb databases for some time, usually a few seconds. This is done +with the allrecord lock. During the recovery phase on a busy server many smbd +processes try to access the tdb file with blocking fcntl calls. The specific +test in question easily reproduces 7,000 processes piling up waiting for +1-byte fcntl locks. When ctdb is done with the recovery, it gives up the +allrecord lock, covering the whole file range. All 7,000 processes waiting for +1-byte fcntl locks are woken up, trying to acquire their lock. The special +implementation of fcntl locks in Linux (up to 2013-02-12 at least) protects +all fcntl lock operations with a single system-wide spinlock. If 7,000 process +waiting for the allrecord lock to become released this leads to a thundering +herd condition, all CPUs are spinning on that single spinlock. + +Functionally the kernel is fine, eventually the thundering herd slows down and +every process correctly gets his share and locking range, but the performance +of the system while the herd is active is worse than expected. + +The thundering herd is only the worst case scenario for fcntl lock use. The +single spinlock for fcntl operations is also a performance penalty for normal +operations. In the cluster case, every read and write SMB request has to do +two fcntl calls to provide correct SMB mandatory locks. The single spinlock +is one source of serialization for the SMB read/write requests, limiting the +parallelism that can be achieved in a multi-core system. + +While trying to tune his servers, Ira Cooper, Samba Team member, found fcntl +locks to be a problem on Solaris as well. Ira pointed out that there is a +potential alternative locking mechanism that might be more scalable: Process +shared robust mutexes, as defined by Posix 2008 for example via + +http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setpshared.html +http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setrobust.html + +Pthread mutexes provide one of the core mechanisms in posix threads to protect +in-process data structures from concurrent access by multiple threads. In the +Linux implementation, a pthread_mutex_t is represented by a data structure in +user space that requires no kernel calls in the uncontended case for locking +and unlocking. Locking and unlocking in the uncontended case is implemented +purely in user space with atomic CPU instructions and thus are very fast. + +The setpshared functions indicate to the kernel that the mutex is about to be +shared between processes in a common shared memory area. + +The process shared posix mutexes have the potential to replace fcntl locking +to coordinate mmap access for tdbs. However, they are missing the criticial +auto-cleanup property that fcntl provides when a process dies. A process that +dies hard while holding a shared mutex has no chance to clean up the protected +data structures and unlock the shared mutex. Thus with a pure process shared +mutex the mutex will remain locked forever until the data structures are +re-initialized from scratch. + +With the robust mutexes defined by Posix the process shared mutexes have been +extended with a limited auto-cleanup property. If a mutex has been declared +robust, when a process exits while holding that mutex, the next process trying +to lock the mutex will get the special error message EOWNERDEAD. This informs +the caller that the data structures the mutex protects are potentially corrupt +and need to be cleaned up. + +The error message EOWNERDEAD when trying to lock a mutex is an extension over +the fcntl functionality. A process that does a blocking fcntl lock call is not +informed about whether the lock was explicitly freed by a process still alive +or due to an unplanned process exit. At the time of this writing (February +2013), at least Linux and OpenSolaris also implement the robustness feature of +process-shared mutexes. + +Converting the tdb locking mechanism from fcntl to mutexes has to take care of +both types of locks that are used on tdb files. + +The easy part is to use mutexes to replace the 1-byte linked list locks +covering the individual hashes. Those can be represented by a mutex each. + +Covering the allrecord lock is more difficult. The allrecord lock uses a fcntl +lock spanning all hash list locks simultaneously. This basic functionality is +not easily possible with mutexes. A mutex carries 1 bit of information, a +fcntl lock can carry an arbitrary amount of information. + +In order to support the allrecord lock, we have an allrecord_lock variable +protected by an allrecord_mutex. The coordination between the allrecord lock +and the chainlocks works like this: + +- Getting a chain lock works like this: + + 1. get chain mutex + 2. return success if allrecord_lock is F_UNLCK (not locked) + 3. return success if allrecord_lock is F_RDLCK (locked readonly) + and we only need a read lock. + 4. release chain mutex + 5. wait for allrecord_mutex + 6. unlock allrecord_mutex + 7. goto 1. + +- Getting the allrecord lock: + + 1. get the allrecord mutex + 2. return error if allrecord_lock is not F_UNLCK (it's locked) + 3. set allrecord_lock to the desired value. + 4. in a loop: lock(blocking) / unlock each chain mutex. + 5. return success. + +- allrecord lock upgrade: + + 1. check we already have the allrecord lock with F_RDLCK. + 3. set allrecord_lock to F_WRLCK + 4. in a loop: lock(blocking) / unlock each chain mutex. + 5. return success. diff --git a/docs/tdb.magic b/docs/tdb.magic new file mode 100644 index 0000000..f5619e7 --- /dev/null +++ b/docs/tdb.magic @@ -0,0 +1,10 @@ +# Magic file(1) information about tdb files. +# +# Install this into /etc/magic or the corresponding location for your +# system, or pass as a -m argument to file(1). + +# You may use and redistribute this file without restriction. + +0 string TDB\ file TDB database +>32 lelong =0x2601196D version 6, little-endian +>>36 lelong x hash size %d bytes diff --git a/docs/tracing.txt b/docs/tracing.txt new file mode 100644 index 0000000..a7657ff --- /dev/null +++ b/docs/tracing.txt @@ -0,0 +1,46 @@ +How And Why To Use TDB Tracing +============================== + +You can trace all TDB operations, using TDB_TRACE. It is not complete +(error conditions which expect to the logged will not always be traced +correctly, so you should set up a logging function too), but is designed +to collect benchmark-style traces to allow us to optimize TDB. + +Note: tracing is not efficient, and the trace files are huge: a +traverse of the database is particularly large! But they compress very +well with rzip (http://rzip.samba.org) + +How to gather trace files: +-------------------------- +1) Uncomment /* #define TDB_TRACE 1 */ in tdb_private.h. +2) Rebuild TDB, and everything that uses it. +3) Run something. + +Your trace files will be called <tdbname>.trace.<pid>. These files +will not be overwritten: if the same process reopens the same TDB, an +error will be logged and tracing will be disabled. + +How to replay trace files: +-------------------------- +1) For benchmarking, remember to rebuild tdb with #define TDB_TRACE commented + out again! +2) Grab the latest "replace_trace.c" from CCAN's tdb module (tools/ dir): + http://ccan.ozlabs.org/tarballs/tdb.tar.bz2 +3) Compile up replay_trace, munging as necessary. +4) Run replay_trace <scratch-tdb-name> <tracefiles>... + +If given more than one trace file (presumably from the same tdb) +replay_trace will try to figure out the dependencies between the operations +and fire off a child to run each trace. Occasionally it gets stuck, in +which case it will add another dependency and retry. Eventually it will +give a speed value. + +replay_trace can intuit the existence of previous data in the tdb (ie. +activity prior to the trace(s) supplied) and will prepopulate as +necessary. + +You can run --quiet for straight benchmark results, and -n to run multiple +times (this saves time, since it need only calculate dependencies once). + +Good luck! +Rusty Russell <rusty@rustcorp.com.au> |