summaryrefslogtreecommitdiffstats
path: root/docs
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--docs/README273
-rw-r--r--docs/mainpage.dox61
-rw-r--r--docs/mutex.txt136
-rw-r--r--docs/tdb.magic10
-rw-r--r--docs/tracing.txt46
5 files changed, 526 insertions, 0 deletions
diff --git a/docs/README b/docs/README
new file mode 100644
index 0000000..86d46a3
--- /dev/null
+++ b/docs/README
@@ -0,0 +1,273 @@
+tdb - a trivial database system
+tridge@linuxcare.com December 1999
+==================================
+
+This is a simple database API. It was inspired by the realisation that
+in Samba we have several ad-hoc bits of code that essentially
+implement small databases for sharing structures between parts of
+Samba. As I was about to add another I realised that a generic
+database module was called for to replace all the ad-hoc bits.
+
+I based the interface on gdbm. I couldn't use gdbm as we need to be
+able to have multiple writers to the databases at one time.
+
+Compilation
+-----------
+
+add HAVE_MMAP=1 to use mmap instead of read/write
+add NOLOCK=1 to disable locking code
+
+Testing
+-------
+
+Compile tdbtest.c and link with gdbm for testing. tdbtest will perform
+identical operations via tdb and gdbm then make sure the result is the
+same
+
+Also included is tdbtool, which allows simple database manipulation
+on the commandline.
+
+tdbtest and tdbtool are not built as part of Samba, but are included
+for completeness.
+
+Interface
+---------
+
+The interface is very similar to gdbm except for the following:
+
+- different open interface. The tdb_open call is more similar to a
+ traditional open()
+- no tdbm_reorganise() function
+- no tdbm_sync() function. No operations are cached in the library anyway
+- added a tdb_traverse() function for traversing the whole database
+- added transactions support
+
+A general rule for using tdb is that the caller frees any returned
+TDB_DATA structures. Just call free(p.dptr) to free a TDB_DATA
+return value called p. This is the same as gdbm.
+
+here is a full list of tdb functions with brief descriptions.
+
+
+----------------------------------------------------------------------
+TDB_CONTEXT *tdb_open(char *name, int hash_size, int tdb_flags,
+ int open_flags, mode_t mode)
+
+ open the database, creating it if necessary
+
+ The open_flags and mode are passed straight to the open call on the database
+ file. A flags value of O_WRONLY is invalid
+
+ The hash size is advisory, use zero for a default value.
+
+ return is NULL on error
+
+ possible tdb_flags are:
+ TDB_CLEAR_IF_FIRST - clear database if we are the only one with it open
+ TDB_INTERNAL - don't use a file, instead store the data in
+ memory. The filename is ignored in this case.
+ TDB_NOLOCK - don't do any locking
+ TDB_NOMMAP - don't use mmap
+ TDB_NOSYNC - don't synchronise transactions to disk
+ TDB_SEQNUM - maintain a sequence number
+ TDB_VOLATILE - activate the per-hashchain freelist, default 5
+ TDB_ALLOW_NESTING - allow transactions to nest
+ TDB_DISALLOW_NESTING - disallow transactions to nest
+
+----------------------------------------------------------------------
+TDB_CONTEXT *tdb_open_ex(char *name, int hash_size, int tdb_flags,
+ int open_flags, mode_t mode,
+ const struct tdb_logging_context *log_ctx,
+ tdb_hash_func hash_fn)
+
+This is like tdb_open(), but allows you to pass an initial logging and
+hash function. Be careful when passing a hash function - all users of
+the database must use the same hash function or you will get data
+corruption.
+
+
+----------------------------------------------------------------------
+char *tdb_error(TDB_CONTEXT *tdb);
+
+ return a error string for the last tdb error
+
+----------------------------------------------------------------------
+int tdb_close(TDB_CONTEXT *tdb);
+
+ close a database
+
+----------------------------------------------------------------------
+TDB_DATA tdb_fetch(TDB_CONTEXT *tdb, TDB_DATA key);
+
+ fetch an entry in the database given a key
+ if the return value has a null dptr then a error occurred
+
+ caller must free the resulting data
+
+----------------------------------------------------------------------
+int tdb_parse_record(struct tdb_context *tdb, TDB_DATA key,
+ int (*parser)(TDB_DATA key, TDB_DATA data,
+ void *private_data),
+ void *private_data);
+
+ Hand a record to a parser function without allocating it.
+
+ This function is meant as a fast tdb_fetch alternative for large records
+ that are frequently read. The "key" and "data" arguments point directly
+ into the tdb shared memory, they are not aligned at any boundary.
+
+ WARNING: The parser is called while tdb holds a lock on the record. DO NOT
+ call other tdb routines from within the parser. Also, for good performance
+ you should make the parser fast to allow parallel operations.
+
+ tdb_parse_record returns -1 if the record was not found. If the record was
+ found, the return value of "parser" is passed up to the caller.
+
+----------------------------------------------------------------------
+int tdb_exists(TDB_CONTEXT *tdb, TDB_DATA key);
+
+ check if an entry in the database exists
+
+ note that 1 is returned if the key is found and 0 is returned if not found
+ this doesn't match the conventions in the rest of this module, but is
+ compatible with gdbm
+
+----------------------------------------------------------------------
+int tdb_traverse(TDB_CONTEXT *tdb, int (*fn)(TDB_CONTEXT *tdb,
+ TDB_DATA key, TDB_DATA dbuf, void *state), void *state);
+
+ traverse the entire database - calling fn(tdb, key, data, state) on each
+ element.
+
+ return -1 on error or the record count traversed
+
+ if fn is NULL then it is not called
+
+ a non-zero return value from fn() indicates that the traversal
+ should stop. Traversal callbacks may not start transactions.
+
+ WARNING: The data buffer given to the callback fn does NOT meet the
+ alignment restrictions malloc gives you.
+
+----------------------------------------------------------------------
+int tdb_traverse_read(TDB_CONTEXT *tdb, int (*fn)(TDB_CONTEXT *tdb,
+ TDB_DATA key, TDB_DATA dbuf, void *state), void *state);
+
+ traverse the entire database - calling fn(tdb, key, data, state) on
+ each element, but marking the database read only during the
+ traversal, so any write operations will fail. This allows tdb to
+ use read locks, which increases the parallelism possible during the
+ traversal.
+
+ return -1 on error or the record count traversed
+
+ if fn is NULL then it is not called
+
+ a non-zero return value from fn() indicates that the traversal
+ should stop. Traversal callbacks may not start transactions.
+
+----------------------------------------------------------------------
+TDB_DATA tdb_firstkey(TDB_CONTEXT *tdb);
+
+ find the first entry in the database and return its key
+
+ the caller must free the returned data
+
+----------------------------------------------------------------------
+TDB_DATA tdb_nextkey(TDB_CONTEXT *tdb, TDB_DATA key);
+
+ find the next entry in the database, returning its key
+
+ the caller must free the returned data
+
+----------------------------------------------------------------------
+int tdb_delete(TDB_CONTEXT *tdb, TDB_DATA key);
+
+ delete an entry in the database given a key
+
+----------------------------------------------------------------------
+int tdb_store(TDB_CONTEXT *tdb, TDB_DATA key, TDB_DATA dbuf, int flag);
+
+ store an element in the database, replacing any existing element
+ with the same key
+
+ If flag==TDB_INSERT then don't overwrite an existing entry
+ If flag==TDB_MODIFY then don't create a new entry
+
+ return 0 on success, -1 on failure
+
+----------------------------------------------------------------------
+int tdb_writelock(TDB_CONTEXT *tdb);
+
+ lock the database. If we already have it locked then don't do anything
+
+----------------------------------------------------------------------
+int tdb_writeunlock(TDB_CONTEXT *tdb);
+ unlock the database
+
+----------------------------------------------------------------------
+int tdb_chainlock(TDB_CONTEXT *tdb, TDB_DATA key);
+
+ lock one hash chain. This is meant to be used to reduce locking
+ contention - it cannot guarantee how many records will be locked
+
+----------------------------------------------------------------------
+int tdb_chainunlock(TDB_CONTEXT *tdb, TDB_DATA key);
+
+ unlock one hash chain
+
+----------------------------------------------------------------------
+int tdb_transaction_start(TDB_CONTEXT *tdb)
+
+ start a transaction. All operations after the transaction start can
+ either be committed with tdb_transaction_commit() or cancelled with
+ tdb_transaction_cancel().
+
+ If you call tdb_transaction_start() again on the same tdb context
+ while a transaction is in progress, then the same transaction
+ buffer is re-used. The number of tdb_transaction_{commit,cancel}
+ operations must match the number of successful
+ tdb_transaction_start() calls.
+
+ Note that transactions are by default disk synchronous, and use a
+ recover area in the database to automatically recover the database
+ on the next open if the system crashes during a transaction. You
+ can disable the synchronous transaction recovery setup using the
+ TDB_NOSYNC flag, which will greatly speed up operations at the risk
+ of corrupting your database if the system crashes.
+
+ Operations made within a transaction are not visible to other users
+ of the database until a successful commit.
+
+----------------------------------------------------------------------
+int tdb_transaction_cancel(TDB_CONTEXT *tdb)
+
+ cancel a current transaction, discarding all write and lock
+ operations that have been made since the transaction started.
+
+
+----------------------------------------------------------------------
+int tdb_transaction_commit(TDB_CONTEXT *tdb)
+
+ commit a current transaction, updating the database and releasing
+ the transaction locks.
+
+----------------------------------------------------------------------
+int tdb_transaction_prepare_commit(TDB_CONTEXT *tdb)
+
+ prepare to commit a current transaction, for two-phase commits.
+ Once prepared for commit, the only allowed calls are
+ tdb_transaction_commit() or tdb_transaction_cancel(). Preparing
+ allocates disk space for the pending updates, so a subsequent
+ commit should succeed (barring any hardware failures).
+
+----------------------------------------------------------------------
+int tdb_check(TDB_CONTEXT *tdb,
+ int (*check)(TDB_DATA key, TDB_DATA data, void *private_data),
+ void *private_data);)
+
+ check the consistency of the database, calling back the check function
+ (if non-NULL) with each record. If some consistency check fails, or
+ the supplied check function returns -1, tdb_check returns -1, otherwise
+ 0. Note that logging function (if set) will be called with additional
+ information on the corruption found.
diff --git a/docs/mainpage.dox b/docs/mainpage.dox
new file mode 100644
index 0000000..d130769
--- /dev/null
+++ b/docs/mainpage.dox
@@ -0,0 +1,61 @@
+/**
+
+@mainpage
+
+This is a simple database API. It was inspired by the realisation that in Samba
+we have several ad-hoc bits of code that essentially implement small databases
+for sharing structures between parts of Samba.
+
+The interface is based on gdbm. gdbm couldn't be use as we needed to be able to
+have multiple writers to the databases at one time.
+
+@section tdb_download Download
+
+You can download the latest releases of tdb from the
+<a href="http://samba.org/ftp/tdb">tdb directory</a> on the samba public source
+archive.
+
+You can download the latest code either via git or rsync.
+
+To fetch via git see the following guide:
+
+<a href="http://wiki.samba.org/index.php/Using_Git_for_Samba_Development">Using Git for Samba Development</a>
+Once you have cloned the tree switch to the master branch and cd into the source/lib/tdb directory.
+
+To fetch via rsync use these commands:
+
+<pre>
+ rsync -Pavz samba.org::ftp/unpacked/standalone_projects/lib/tdb .
+ rsync -Pavz samba.org::ftp/unpacked/standalone_projects/lib/replace .
+</pre>
+
+and build in tdb. It will find the replace library in the directory above
+automatically.
+
+@section tdb_bugs Discussion and bug reports
+
+tdb does not currently have its own mailing list or bug tracking system. For now,
+please use the
+<a href="https://lists.samba.org/mailman/listinfo/samba-technical">samba-technical</a>
+mailing list, and the <a href="http://bugzilla.samba.org/">Samba bugzilla</a> bug
+tracking system.
+
+
+@section tdb_compilation Compilation
+
+add HAVE_MMAP=1 to use mmap instead of read/write
+add NOLOCK=1 to disable locking code
+
+@section tdb_testing Testing
+
+Compile tdbtest.c and link with gdbm for testing. tdbtest will perform
+identical operations via tdb and gdbm then make sure the result is the
+same
+
+Also included is tdbtool, which allows simple database manipulation
+on the commandline.
+
+tdbtest and tdbtool are not built as part of Samba, but are included
+for completeness.
+
+*/
diff --git a/docs/mutex.txt b/docs/mutex.txt
new file mode 100644
index 0000000..a5a7542
--- /dev/null
+++ b/docs/mutex.txt
@@ -0,0 +1,136 @@
+Tdb is a hashtable database with multiple concurrent writer and external
+record lock support. For speed reasons, wherever possible tdb uses a shared
+memory mapped area for data access. In its currently released form, it uses
+fcntl byte-range locks to coordinate access to the data itself.
+
+The tdb data is organized as a hashtable. Hash collisions are dealt with by
+forming a linked list of records that share a hash value. The individual
+linked lists are protected across processes with 1-byte fcntl locks on the
+starting pointer of the linked list representing a hash value.
+
+The external locking API of tdb allows one to lock individual records. Instead of
+really locking individual records, the tdb API locks a complete linked list
+with a fcntl lock.
+
+The external locking API of tdb also allows one to lock the complete database, and
+ctdb uses this facility to freeze databases during a recovery. While the
+so-called allrecord lock is held, all linked lists and all individual records
+are frozen alltogether. Tdb achieves this by locking the complete file range
+with a single fcntl lock. Individual 1-byte locks for the linked lists
+conflict with this. Access to records is prevented by the one large fnctl byte
+range lock.
+
+Fcntl locks have been chosen for tdb for two reasons: First they are portable
+across all current unixes. Secondly they provide auto-cleanup. If a process
+dies while holding a fcntl lock, the lock is given up as if it was explicitly
+unlocked. Thus fcntl locks provide a very robust locking scheme, if a process
+dies for any reason the database will not stay blocked until reboot. This
+robustness is very important for long-running services, a reboot is not an
+option for most users of tdb.
+
+Unfortunately, during stress testing, fcntl locks have turned out to be a major
+problem for performance. The particular problem that was seen happens when
+ctdb on a busy server does a recovery. A recovery means that ctdb has to
+freeze all tdb databases for some time, usually a few seconds. This is done
+with the allrecord lock. During the recovery phase on a busy server many smbd
+processes try to access the tdb file with blocking fcntl calls. The specific
+test in question easily reproduces 7,000 processes piling up waiting for
+1-byte fcntl locks. When ctdb is done with the recovery, it gives up the
+allrecord lock, covering the whole file range. All 7,000 processes waiting for
+1-byte fcntl locks are woken up, trying to acquire their lock. The special
+implementation of fcntl locks in Linux (up to 2013-02-12 at least) protects
+all fcntl lock operations with a single system-wide spinlock. If 7,000 process
+waiting for the allrecord lock to become released this leads to a thundering
+herd condition, all CPUs are spinning on that single spinlock.
+
+Functionally the kernel is fine, eventually the thundering herd slows down and
+every process correctly gets his share and locking range, but the performance
+of the system while the herd is active is worse than expected.
+
+The thundering herd is only the worst case scenario for fcntl lock use. The
+single spinlock for fcntl operations is also a performance penalty for normal
+operations. In the cluster case, every read and write SMB request has to do
+two fcntl calls to provide correct SMB mandatory locks. The single spinlock
+is one source of serialization for the SMB read/write requests, limiting the
+parallelism that can be achieved in a multi-core system.
+
+While trying to tune his servers, Ira Cooper, Samba Team member, found fcntl
+locks to be a problem on Solaris as well. Ira pointed out that there is a
+potential alternative locking mechanism that might be more scalable: Process
+shared robust mutexes, as defined by Posix 2008 for example via
+
+http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setpshared.html
+http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setrobust.html
+
+Pthread mutexes provide one of the core mechanisms in posix threads to protect
+in-process data structures from concurrent access by multiple threads. In the
+Linux implementation, a pthread_mutex_t is represented by a data structure in
+user space that requires no kernel calls in the uncontended case for locking
+and unlocking. Locking and unlocking in the uncontended case is implemented
+purely in user space with atomic CPU instructions and thus are very fast.
+
+The setpshared functions indicate to the kernel that the mutex is about to be
+shared between processes in a common shared memory area.
+
+The process shared posix mutexes have the potential to replace fcntl locking
+to coordinate mmap access for tdbs. However, they are missing the criticial
+auto-cleanup property that fcntl provides when a process dies. A process that
+dies hard while holding a shared mutex has no chance to clean up the protected
+data structures and unlock the shared mutex. Thus with a pure process shared
+mutex the mutex will remain locked forever until the data structures are
+re-initialized from scratch.
+
+With the robust mutexes defined by Posix the process shared mutexes have been
+extended with a limited auto-cleanup property. If a mutex has been declared
+robust, when a process exits while holding that mutex, the next process trying
+to lock the mutex will get the special error message EOWNERDEAD. This informs
+the caller that the data structures the mutex protects are potentially corrupt
+and need to be cleaned up.
+
+The error message EOWNERDEAD when trying to lock a mutex is an extension over
+the fcntl functionality. A process that does a blocking fcntl lock call is not
+informed about whether the lock was explicitly freed by a process still alive
+or due to an unplanned process exit. At the time of this writing (February
+2013), at least Linux and OpenSolaris also implement the robustness feature of
+process-shared mutexes.
+
+Converting the tdb locking mechanism from fcntl to mutexes has to take care of
+both types of locks that are used on tdb files.
+
+The easy part is to use mutexes to replace the 1-byte linked list locks
+covering the individual hashes. Those can be represented by a mutex each.
+
+Covering the allrecord lock is more difficult. The allrecord lock uses a fcntl
+lock spanning all hash list locks simultaneously. This basic functionality is
+not easily possible with mutexes. A mutex carries 1 bit of information, a
+fcntl lock can carry an arbitrary amount of information.
+
+In order to support the allrecord lock, we have an allrecord_lock variable
+protected by an allrecord_mutex. The coordination between the allrecord lock
+and the chainlocks works like this:
+
+- Getting a chain lock works like this:
+
+ 1. get chain mutex
+ 2. return success if allrecord_lock is F_UNLCK (not locked)
+ 3. return success if allrecord_lock is F_RDLCK (locked readonly)
+ and we only need a read lock.
+ 4. release chain mutex
+ 5. wait for allrecord_mutex
+ 6. unlock allrecord_mutex
+ 7. goto 1.
+
+- Getting the allrecord lock:
+
+ 1. get the allrecord mutex
+ 2. return error if allrecord_lock is not F_UNLCK (it's locked)
+ 3. set allrecord_lock to the desired value.
+ 4. in a loop: lock(blocking) / unlock each chain mutex.
+ 5. return success.
+
+- allrecord lock upgrade:
+
+ 1. check we already have the allrecord lock with F_RDLCK.
+ 3. set allrecord_lock to F_WRLCK
+ 4. in a loop: lock(blocking) / unlock each chain mutex.
+ 5. return success.
diff --git a/docs/tdb.magic b/docs/tdb.magic
new file mode 100644
index 0000000..f5619e7
--- /dev/null
+++ b/docs/tdb.magic
@@ -0,0 +1,10 @@
+# Magic file(1) information about tdb files.
+#
+# Install this into /etc/magic or the corresponding location for your
+# system, or pass as a -m argument to file(1).
+
+# You may use and redistribute this file without restriction.
+
+0 string TDB\ file TDB database
+>32 lelong =0x2601196D version 6, little-endian
+>>36 lelong x hash size %d bytes
diff --git a/docs/tracing.txt b/docs/tracing.txt
new file mode 100644
index 0000000..a7657ff
--- /dev/null
+++ b/docs/tracing.txt
@@ -0,0 +1,46 @@
+How And Why To Use TDB Tracing
+==============================
+
+You can trace all TDB operations, using TDB_TRACE. It is not complete
+(error conditions which expect to the logged will not always be traced
+correctly, so you should set up a logging function too), but is designed
+to collect benchmark-style traces to allow us to optimize TDB.
+
+Note: tracing is not efficient, and the trace files are huge: a
+traverse of the database is particularly large! But they compress very
+well with rzip (http://rzip.samba.org)
+
+How to gather trace files:
+--------------------------
+1) Uncomment /* #define TDB_TRACE 1 */ in tdb_private.h.
+2) Rebuild TDB, and everything that uses it.
+3) Run something.
+
+Your trace files will be called <tdbname>.trace.<pid>. These files
+will not be overwritten: if the same process reopens the same TDB, an
+error will be logged and tracing will be disabled.
+
+How to replay trace files:
+--------------------------
+1) For benchmarking, remember to rebuild tdb with #define TDB_TRACE commented
+ out again!
+2) Grab the latest "replace_trace.c" from CCAN's tdb module (tools/ dir):
+ http://ccan.ozlabs.org/tarballs/tdb.tar.bz2
+3) Compile up replay_trace, munging as necessary.
+4) Run replay_trace <scratch-tdb-name> <tracefiles>...
+
+If given more than one trace file (presumably from the same tdb)
+replay_trace will try to figure out the dependencies between the operations
+and fire off a child to run each trace. Occasionally it gets stuck, in
+which case it will add another dependency and retry. Eventually it will
+give a speed value.
+
+replay_trace can intuit the existence of previous data in the tdb (ie.
+activity prior to the trace(s) supplied) and will prepopulate as
+necessary.
+
+You can run --quiet for straight benchmark results, and -n to run multiple
+times (this saves time, since it need only calculate dependencies once).
+
+Good luck!
+Rusty Russell <rusty@rustcorp.com.au>