5 files changed, 526 insertions, 0 deletions
diff --git a/docs/README b/docs/README
new file mode 100644
index 0000000..86d46a3
--- /dev/null
+++ b/docs/README
@@ -0,0 +1,273 @@
+tdb - a trivial database system
+tridge@linuxcare.com December 1999
+==================================
+
+This is a simple database API. It was inspired by the realisation that
+in Samba we have several ad-hoc bits of code that essentially
+implement small databases for sharing structures between parts of
+Samba. As I was about to add another I realised that a generic
+database module was called for to replace all the ad-hoc bits.
+
+I based the interface on gdbm. I couldn't use gdbm as we need to be
+able to have multiple writers to the databases at one time.
+
+Compilation
+-----------
+
+add HAVE_MMAP=1 to use mmap instead of read/write
+add NOLOCK=1 to disable locking code
+
+Testing
+-------
+
+Compile tdbtest.c and link with gdbm for testing. tdbtest will perform
+identical operations via tdb and gdbm then make sure the result is the
+same
+
+Also included is tdbtool, which allows simple database manipulation
+on the commandline.
+
+tdbtest and tdbtool are not built as part of Samba, but are included
+for completeness.
+
+Interface
+---------
+
+The interface is very similar to gdbm except for the following:
+
+- different open interface. The tdb_open call is more similar to a
+  traditional open()
+- no tdbm_reorganise() function
+- no tdbm_sync() function. No operations are cached in the library anyway
+- added a tdb_traverse() function for traversing the whole database
+- added transactions support
+
+A general rule for using tdb is that the caller frees any returned
+TDB_DATA structures. Just call free(p.dptr) to free a TDB_DATA
+return value called p. This is the same as gdbm.
+
+here is a full list of tdb functions with brief descriptions.
+
+
+----------------------------------------------------------------------
+TDB_CONTEXT *tdb_open(char *name, int hash_size, int tdb_flags,
+		      int open_flags, mode_t mode)
+
+   open the database, creating it if necessary 
+
+   The open_flags and mode are passed straight to the open call on the database
+   file. A flags value of O_WRONLY is invalid
+
+   The hash size is advisory, use zero for a default value. 
+
+   return is NULL on error
+
+   possible tdb_flags are:
+    TDB_CLEAR_IF_FIRST - clear database if we are the only one with it open
+    TDB_INTERNAL - don't use a file, instead store the data in
+                   memory. The filename is ignored in this case.
+    TDB_NOLOCK - don't do any locking
+    TDB_NOMMAP - don't use mmap
+    TDB_NOSYNC - don't synchronise transactions to disk
+    TDB_SEQNUM - maintain a sequence number
+    TDB_VOLATILE - activate the per-hashchain freelist, default 5
+    TDB_ALLOW_NESTING - allow transactions to nest
+    TDB_DISALLOW_NESTING - disallow transactions to nest
+
+----------------------------------------------------------------------
+TDB_CONTEXT *tdb_open_ex(char *name, int hash_size, int tdb_flags,
+		         int open_flags, mode_t mode,
+			 const struct tdb_logging_context *log_ctx,
+			 tdb_hash_func hash_fn)
+
+This is like tdb_open(), but allows you to pass an initial logging and
+hash function. Be careful when passing a hash function - all users of
+the database must use the same hash function or you will get data
+corruption.
+
+
+----------------------------------------------------------------------
+char *tdb_error(TDB_CONTEXT *tdb);
+
+     return a error string for the last tdb error
+
+----------------------------------------------------------------------
+int tdb_close(TDB_CONTEXT *tdb);
+
+   close a database
+
+----------------------------------------------------------------------
+TDB_DATA tdb_fetch(TDB_CONTEXT *tdb, TDB_DATA key);
+
+   fetch an entry in the database given a key 
+   if the return value has a null dptr then a error occurred
+
+   caller must free the resulting data
+
+----------------------------------------------------------------------
+int tdb_parse_record(struct tdb_context *tdb, TDB_DATA key,
+		     int (*parser)(TDB_DATA key, TDB_DATA data,
+				   void *private_data),
+		     void *private_data);
+
+   Hand a record to a parser function without allocating it.
+
+   This function is meant as a fast tdb_fetch alternative for large records
+   that are frequently read. The "key" and "data" arguments point directly
+   into the tdb shared memory, they are not aligned at any boundary.
+
+   WARNING: The parser is called while tdb holds a lock on the record. DO NOT
+   call other tdb routines from within the parser. Also, for good performance
+   you should make the parser fast to allow parallel operations.
+
+   tdb_parse_record returns -1 if the record was not found.  If the record was
+   found, the return value of "parser" is passed up to the caller.
+
+----------------------------------------------------------------------
+int tdb_exists(TDB_CONTEXT *tdb, TDB_DATA key);
+
+   check if an entry in the database exists 
+
+   note that 1 is returned if the key is found and 0 is returned if not found
+   this doesn't match the conventions in the rest of this module, but is
+   compatible with gdbm
+
+----------------------------------------------------------------------
+int tdb_traverse(TDB_CONTEXT *tdb, int (*fn)(TDB_CONTEXT *tdb,
+                 TDB_DATA key, TDB_DATA dbuf, void *state), void *state);
+
+   traverse the entire database - calling fn(tdb, key, data, state) on each 
+   element.
+
+   return -1 on error or the record count traversed
+
+   if fn is NULL then it is not called
+
+   a non-zero return value from fn() indicates that the traversal
+   should stop. Traversal callbacks may not start transactions.
+
+   WARNING: The data buffer given to the callback fn does NOT meet the
+   alignment restrictions malloc gives you.
+
+----------------------------------------------------------------------
+int tdb_traverse_read(TDB_CONTEXT *tdb, int (*fn)(TDB_CONTEXT *tdb,
+                     TDB_DATA key, TDB_DATA dbuf, void *state), void *state);
+
+   traverse the entire database - calling fn(tdb, key, data, state) on
+   each element, but marking the database read only during the
+   traversal, so any write operations will fail. This allows tdb to
+   use read locks, which increases the parallelism possible during the
+   traversal.
+
+   return -1 on error or the record count traversed
+
+   if fn is NULL then it is not called
+
+   a non-zero return value from fn() indicates that the traversal
+   should stop. Traversal callbacks may not start transactions.
+
+----------------------------------------------------------------------
+TDB_DATA tdb_firstkey(TDB_CONTEXT *tdb);
+
+   find the first entry in the database and return its key
+
+   the caller must free the returned data
+
+----------------------------------------------------------------------
+TDB_DATA tdb_nextkey(TDB_CONTEXT *tdb, TDB_DATA key);
+
+   find the next entry in the database, returning its key
+
+   the caller must free the returned data
+
+----------------------------------------------------------------------
+int tdb_delete(TDB_CONTEXT *tdb, TDB_DATA key);
+
+   delete an entry in the database given a key
+
+----------------------------------------------------------------------
+int tdb_store(TDB_CONTEXT *tdb, TDB_DATA key, TDB_DATA dbuf, int flag);
+
+   store an element in the database, replacing any existing element
+   with the same key 
+
+   If flag==TDB_INSERT then don't overwrite an existing entry
+   If flag==TDB_MODIFY then don't create a new entry
+
+   return 0 on success, -1 on failure
+
+----------------------------------------------------------------------
+int tdb_writelock(TDB_CONTEXT *tdb);
+
+   lock the database. If we already have it locked then don't do anything
+
+----------------------------------------------------------------------
+int tdb_writeunlock(TDB_CONTEXT *tdb);
+   unlock the database
+
+----------------------------------------------------------------------
+int tdb_chainlock(TDB_CONTEXT *tdb, TDB_DATA key);
+
+   lock one hash chain. This is meant to be used to reduce locking
+   contention - it cannot guarantee how many records will be locked
+
+----------------------------------------------------------------------
+int tdb_chainunlock(TDB_CONTEXT *tdb, TDB_DATA key);
+
+   unlock one hash chain
+
+----------------------------------------------------------------------
+int tdb_transaction_start(TDB_CONTEXT *tdb)
+
+   start a transaction. All operations after the transaction start can
+   either be committed with tdb_transaction_commit() or cancelled with
+   tdb_transaction_cancel(). 
+
+   If you call tdb_transaction_start() again on the same tdb context
+   while a transaction is in progress, then the same transaction
+   buffer is re-used. The number of tdb_transaction_{commit,cancel}
+   operations must match the number of successful
+   tdb_transaction_start() calls.
+
+   Note that transactions are by default disk synchronous, and use a
+   recover area in the database to automatically recover the database
+   on the next open if the system crashes during a transaction. You
+   can disable the synchronous transaction recovery setup using the
+   TDB_NOSYNC flag, which will greatly speed up operations at the risk
+   of corrupting your database if the system crashes.
+
+   Operations made within a transaction are not visible to other users
+   of the database until a successful commit.
+
+----------------------------------------------------------------------
+int tdb_transaction_cancel(TDB_CONTEXT *tdb)
+
+   cancel a current transaction, discarding all write and lock
+   operations that have been made since the transaction started.
+
+
+----------------------------------------------------------------------
+int tdb_transaction_commit(TDB_CONTEXT *tdb)
+
+   commit a current transaction, updating the database and releasing
+   the transaction locks.
+
+----------------------------------------------------------------------
+int tdb_transaction_prepare_commit(TDB_CONTEXT *tdb)
+
+   prepare to commit a current transaction, for two-phase commits.
+   Once prepared for commit, the only allowed calls are
+   tdb_transaction_commit() or tdb_transaction_cancel(). Preparing
+   allocates disk space for the pending updates, so a subsequent
+   commit should succeed (barring any hardware failures).
+
+----------------------------------------------------------------------
+int tdb_check(TDB_CONTEXT *tdb,
+	      int (*check)(TDB_DATA key, TDB_DATA data, void *private_data),
+	      void *private_data);)
+
+   check the consistency of the database, calling back the check function
+   (if non-NULL) with each record.  If some consistency check fails, or
+   the supplied check function returns -1, tdb_check returns -1, otherwise
+   0.  Note that logging function (if set) will be called with additional
+   information on the corruption found.
diff --git a/docs/mainpage.dox b/docs/mainpage.dox
new file mode 100644
index 0000000..d130769
--- /dev/null
+++ b/docs/mainpage.dox
@@ -0,0 +1,61 @@
+/**
+
+@mainpage
+
+This is a simple database API. It was inspired by the realisation that in Samba
+we have several ad-hoc bits of code that essentially implement small databases
+for sharing structures between parts of Samba.
+
+The interface is based on gdbm. gdbm couldn't be use as we needed to be able to
+have multiple writers to the databases at one time.
+
+@section tdb_download Download
+
+You can download the latest releases of tdb from the
+<a href="http://samba.org/ftp/tdb">tdb directory</a> on the samba public source
+archive.
+
+You can download the latest code either via git or rsync.
+
+To fetch via git see the following guide:
+
+<a href="http://wiki.samba.org/index.php/Using_Git_for_Samba_Development">Using Git for Samba Development</a>
+Once you have cloned the tree switch to the master branch and cd into the source/lib/tdb directory.
+
+To fetch via rsync use these commands:
+
+<pre>
+  rsync -Pavz samba.org::ftp/unpacked/standalone_projects/lib/tdb .
+  rsync -Pavz samba.org::ftp/unpacked/standalone_projects/lib/replace .
+</pre>
+
+and build in tdb. It will find the replace library in the directory above
+automatically.
+
+@section tdb_bugs Discussion and bug reports
+
+tdb does not currently have its own mailing list or bug tracking system. For now,
+please use the
+<a href="https://lists.samba.org/mailman/listinfo/samba-technical">samba-technical</a>
+mailing list, and the <a href="http://bugzilla.samba.org/">Samba bugzilla</a> bug
+tracking system.
+
+
+@section tdb_compilation Compilation
+
+add HAVE_MMAP=1 to use mmap instead of read/write
+add NOLOCK=1 to disable locking code
+
+@section tdb_testing Testing
+
+Compile tdbtest.c and link with gdbm for testing. tdbtest will perform
+identical operations via tdb and gdbm then make sure the result is the
+same
+
+Also included is tdbtool, which allows simple database manipulation
+on the commandline.
+
+tdbtest and tdbtool are not built as part of Samba, but are included
+for completeness.
+
+*/
diff --git a/docs/mutex.txt b/docs/mutex.txt
new file mode 100644
index 0000000..0003c7f
--- /dev/null
+++ b/docs/mutex.txt
@@ -0,0 +1,136 @@
+Tdb is a hashtable database with multiple concurrent writer and external
+record lock support. For speed reasons, wherever possible tdb uses a shared
+memory mapped area for data access. In its currently released form, it uses
+fcntl byte-range locks to coordinate access to the data itself.
+
+The tdb data is organized as a hashtable. Hash collisions are dealt with by
+forming a linked list of records that share a hash value. The individual
+linked lists are protected across processes with 1-byte fcntl locks on the
+starting pointer of the linked list representing a hash value.
+
+The external locking API of tdb allows one to lock individual records. Instead of
+really locking individual records, the tdb API locks a complete linked list
+with a fcntl lock.
+
+The external locking API of tdb also allows one to lock the complete database, and
+ctdb uses this facility to freeze databases during a recovery. While the
+so-called allrecord lock is held, all linked lists and all individual records
+are frozen altogether. Tdb achieves this by locking the complete file range
+with a single fcntl lock. Individual 1-byte locks for the linked lists
+conflict with this. Access to records is prevented by the one large fnctl byte
+range lock.
+
+Fcntl locks have been chosen for tdb for two reasons: First they are portable
+across all current unixes. Secondly they provide auto-cleanup. If a process
+dies while holding a fcntl lock, the lock is given up as if it was explicitly
+unlocked. Thus fcntl locks provide a very robust locking scheme, if a process
+dies for any reason the database will not stay blocked until reboot. This
+robustness is very important for long-running services, a reboot is not an
+option for most users of tdb.
+
+Unfortunately, during stress testing, fcntl locks have turned out to be a major
+problem for performance. The particular problem that was seen happens when
+ctdb on a busy server does a recovery. A recovery means that ctdb has to
+freeze all tdb databases for some time, usually a few seconds. This is done
+with the allrecord lock. During the recovery phase on a busy server many smbd
+processes try to access the tdb file with blocking fcntl calls. The specific
+test in question easily reproduces 7,000 processes piling up waiting for
+1-byte fcntl locks. When ctdb is done with the recovery, it gives up the
+allrecord lock, covering the whole file range. All 7,000 processes waiting for
+1-byte fcntl locks are woken up, trying to acquire their lock. The special
+implementation of fcntl locks in Linux (up to 2013-02-12 at least) protects
+all fcntl lock operations with a single system-wide spinlock. If 7,000 process
+waiting for the allrecord lock to become released this leads to a thundering
+herd condition, all CPUs are spinning on that single spinlock.
+
+Functionally the kernel is fine, eventually the thundering herd slows down and
+every process correctly gets his share and locking range, but the performance
+of the system while the herd is active is worse than expected.
+
+The thundering herd is only the worst case scenario for fcntl lock use. The
+single spinlock for fcntl operations is also a performance penalty for normal
+operations. In the cluster case, every read and write SMB request has to do
+two fcntl calls to provide correct SMB mandatory locks. The single spinlock
+is one source of serialization for the SMB read/write requests, limiting the
+parallelism that can be achieved in a multi-core system.
+
+While trying to tune his servers, Ira Cooper, Samba Team member, found fcntl
+locks to be a problem on Solaris as well. Ira pointed out that there is a
+potential alternative locking mechanism that might be more scalable: Process
+shared robust mutexes, as defined by Posix 2008 for example via
+
+http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setpshared.html
+http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setrobust.html
+
+Pthread mutexes provide one of the core mechanisms in posix threads to protect
+in-process data structures from concurrent access by multiple threads. In the
+Linux implementation, a pthread_mutex_t is represented by a data structure in
+user space that requires no kernel calls in the uncontended case for locking
+and unlocking. Locking and unlocking in the uncontended case is implemented
+purely in user space with atomic CPU instructions and thus are very fast.
+
+The setpshared functions indicate to the kernel that the mutex is about to be
+shared between processes in a common shared memory area.
+
+The process shared posix mutexes have the potential to replace fcntl locking
+to coordinate mmap access for tdbs. However, they are missing the critical
+auto-cleanup property that fcntl provides when a process dies. A process that
+dies hard while holding a shared mutex has no chance to clean up the protected
+data structures and unlock the shared mutex. Thus with a pure process shared
+mutex the mutex will remain locked forever until the data structures are
+re-initialized from scratch.
+
+With the robust mutexes defined by Posix the process shared mutexes have been
+extended with a limited auto-cleanup property. If a mutex has been declared
+robust, when a process exits while holding that mutex, the next process trying
+to lock the mutex will get the special error message EOWNERDEAD. This informs
+the caller that the data structures the mutex protects are potentially corrupt
+and need to be cleaned up.
+
+The error message EOWNERDEAD when trying to lock a mutex is an extension over
+the fcntl functionality. A process that does a blocking fcntl lock call is not
+informed about whether the lock was explicitly freed by a process still alive
+or due to an unplanned process exit. At the time of this writing (February
+2013), at least Linux and OpenSolaris also implement the robustness feature of
+process-shared mutexes.
+
+Converting the tdb locking mechanism from fcntl to mutexes has to take care of
+both types of locks that are used on tdb files.
+
+The easy part is to use mutexes to replace the 1-byte linked list locks
+covering the individual hashes. Those can be represented by a mutex each.
+
+Covering the allrecord lock is more difficult. The allrecord lock uses a fcntl
+lock spanning all hash list locks simultaneously. This basic functionality is
+not easily possible with mutexes. A mutex carries 1 bit of information, a
+fcntl lock can carry an arbitrary amount of information.
+
+In order to support the allrecord lock, we have an allrecord_lock variable
+protected by an allrecord_mutex. The coordination between the allrecord lock
+and the chainlocks works like this:
+
+- Getting a chain lock works like this:
+
+  1. get chain mutex
+  2. return success if allrecord_lock is F_UNLCK (not locked)
+  3. return success if allrecord_lock is F_RDLCK (locked readonly)
+     and we only need a read lock.
+  4. release chain mutex
+  5. wait for allrecord_mutex
+  6. unlock allrecord_mutex
+  7. goto 1.
+
+- Getting the allrecord lock:
+
+  1. get the allrecord mutex
+  2. return error if allrecord_lock is not F_UNLCK (it's locked)
+  3. set allrecord_lock to the desired value.
+  4. in a loop: lock(blocking) / unlock each chain mutex.
+  5. return success.
+
+- allrecord lock upgrade:
+
+  1. check we already have the allrecord lock with F_RDLCK.
+  3. set allrecord_lock to F_WRLCK
+  4. in a loop: lock(blocking) / unlock each chain mutex.
+  5. return success.
diff --git a/docs/tdb.magic b/docs/tdb.magic
new file mode 100644
index 0000000..f5619e7
--- /dev/null
+++ b/docs/tdb.magic
@@ -0,0 +1,10 @@
+# Magic file(1) information about tdb files.
+#
+# Install this into /etc/magic or the corresponding location for your
+# system, or pass as a -m argument to file(1).
+
+# You may use and redistribute this file without restriction.
+
+0	string	TDB\ file		TDB database
+>32	lelong	=0x2601196D		version 6, little-endian
+>>36	lelong	x			hash size %d bytes
diff --git a/docs/tracing.txt b/docs/tracing.txt
new file mode 100644
index 0000000..a7657ff
--- /dev/null
+++ b/docs/tracing.txt
@@ -0,0 +1,46 @@
+How And Why To Use TDB Tracing
+==============================
+
+You can trace all TDB operations, using TDB_TRACE.  It is not complete
+(error conditions which expect to the logged will not always be traced
+correctly, so you should set up a logging function too), but is designed
+to collect benchmark-style traces to allow us to optimize TDB.
+
+Note: tracing is not efficient, and the trace files are huge: a
+traverse of the database is particularly large!  But they compress very
+well with rzip (http://rzip.samba.org)
+
+How to gather trace files:
+--------------------------
+1) Uncomment /* #define TDB_TRACE 1 */ in tdb_private.h.
+2) Rebuild TDB, and everything that uses it.
+3) Run something.
+
+Your trace files will be called <tdbname>.trace.<pid>.  These files
+will not be overwritten: if the same process reopens the same TDB, an
+error will be logged and tracing will be disabled.
+
+How to replay trace files:
+--------------------------
+1) For benchmarking, remember to rebuild tdb with #define TDB_TRACE commented
+   out again!
+2) Grab the latest "replace_trace.c" from CCAN's tdb module (tools/ dir):
+	http://ccan.ozlabs.org/tarballs/tdb.tar.bz2
+3) Compile up replay_trace, munging as necessary.
+4) Run replay_trace <scratch-tdb-name> <tracefiles>...
+
+If given more than one trace file (presumably from the same tdb)
+replay_trace will try to figure out the dependencies between the operations
+and fire off a child to run each trace.  Occasionally it gets stuck, in
+which case it will add another dependency and retry.  Eventually it will
+give a speed value.
+
+replay_trace can intuit the existence of previous data in the tdb (ie.
+activity prior to the trace(s) supplied) and will prepopulate as
+necessary.
+
+You can run --quiet for straight benchmark results, and -n to run multiple
+times (this saves time, since it need only calculate dependencies once).
+
+Good luck!
+Rusty Russell <rusty@rustcorp.com.au>