1 files changed, 155 insertions, 0 deletions
diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
new file mode 100644
index 0000000..6441e8b
--- /dev/null
+++ b/src/backend/access/heap/README.tuplock
@@ -0,0 +1,155 @@
+Locking tuples
+--------------
+
+Locking tuples is not as easy as locking tables or other database objects.
+The problem is that transactions might want to lock large numbers of tuples at
+any one time, so it's not possible to keep the locks objects in shared memory.
+To work around this limitation, we use a two-level mechanism.  The first level
+is implemented by storing locking information in the tuple header: a tuple is
+marked as locked by setting the current transaction's XID as its XMAX, and
+setting additional infomask bits to distinguish this case from the more normal
+case of having deleted the tuple.  When multiple transactions concurrently
+lock a tuple, a MultiXact is used; see below.  This mechanism can accommodate
+arbitrarily large numbers of tuples being locked simultaneously.
+
+When it is necessary to wait for a tuple-level lock to be released, the basic
+delay is provided by XactLockTableWait or MultiXactIdWait on the contents of
+the tuple's XMAX.  However, that mechanism will release all waiters
+concurrently, so there would be a race condition as to which waiter gets the
+tuple, potentially leading to indefinite starvation of some waiters.  The
+possibility of share-locking makes the problem much worse --- a steady stream
+of share-lockers can easily block an exclusive locker forever.  To provide
+more reliable semantics about who gets a tuple-level lock first, we use the
+standard lock manager, which implements the second level mentioned above.  The
+protocol for waiting for a tuple-level lock is really
+
+     LockTuple()
+     XactLockTableWait()
+     mark tuple as locked by me
+     UnlockTuple()
+
+When there are multiple waiters, arbitration of who is to get the lock next
+is provided by LockTuple().  However, at most one tuple-level lock will
+be held or awaited per backend at any time, so we don't risk overflow
+of the lock table.  Note that incoming share-lockers are required to
+do LockTuple as well, if there is any conflict, to ensure that they don't
+starve out waiting exclusive-lockers.  However, if there is not any active
+conflict for a tuple, we don't incur any extra overhead.
+
+We make an exception to the above rule for those lockers that already hold
+some lock on a tuple and attempt to acquire a stronger one on it.  In that
+case, we skip the LockTuple() call even when there are conflicts, provided
+that the target tuple is being locked, updated or deleted by multiple sessions
+concurrently.  Failing to skip the lock would risk a deadlock, e.g., between a
+session that was first to record its weaker lock in the tuple header and would
+be waiting on the LockTuple() call to upgrade to the stronger lock level, and
+another session that has already done LockTuple() and is waiting for the first
+session transaction to release its tuple header-level lock.
+
+We provide four levels of tuple locking strength: SELECT FOR UPDATE obtains an
+exclusive lock which prevents any kind of modification of the tuple. This is
+the lock level that is implicitly taken by DELETE operations, and also by
+UPDATE operations if they modify any of the tuple's key fields. SELECT FOR NO
+KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal
+and modifications which might alter the tuple's key. This is the lock that is
+implicitly taken by UPDATE operations which leave all key fields unchanged.
+SELECT FOR SHARE obtains a shared lock which prevents any kind of tuple
+modification. Finally, SELECT FOR KEY SHARE obtains a shared lock which only
+prevents tuple removal and modifications of key fields. This lock level is
+just strong enough to implement RI checks, i.e. it ensures that tuples do not
+go away from under a check, without blocking transactions that want to update
+the tuple without changing its key.
+
+The conflict table is:
+
+                  UPDATE       NO KEY UPDATE    SHARE        KEY SHARE
+UPDATE           conflict        conflict      conflict      conflict
+NO KEY UPDATE    conflict        conflict      conflict
+SHARE            conflict        conflict
+KEY SHARE        conflict
+
+When there is a single locker in a tuple, we can just store the locking info
+in the tuple itself.  We do this by storing the locker's Xid in XMAX, and
+setting infomask bits specifying the locking strength.  There is one exception
+here: since infomask space is limited, we do not provide a separate bit
+for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
+that case.  (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
+presumably more commonly used due to being the standards-mandated locking
+mechanism, or heavily used by the RI code, so we want to provide fast paths
+for those.)
+
+MultiXacts
+----------
+
+A tuple header provides very limited space for storing information about tuple
+locking and updates: there is room only for a single Xid and a small number of
+infomask bits.  Whenever we need to store more than one lock, we replace the
+first locker's Xid with a new MultiXactId.  Each MultiXact provides extended
+locking data; it comprises an array of Xids plus some flags bits for each one.
+The flags are currently used to store the locking strength of each member
+transaction.  (The flags also distinguish a pure locker from an updater.)
+
+In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
+locked in shared mode by multiple transactions.  This is no longer the case; a
+MultiXact may contain an update or delete Xid.  (Keep in mind that tuple locks
+in a transaction do not conflict with other tuple locks in the same
+transaction, so it's possible to have otherwise conflicting locks in a
+MultiXact if they belong to the same transaction).
+
+Note that each lock is attributed to the subtransaction that acquires it.
+This means that a subtransaction that aborts is seen as though it releases the
+locks it acquired; concurrent transactions can then proceed without having to
+wait for the main transaction to finish.  It also means that a subtransaction
+can upgrade to a stronger lock level than an earlier transaction had, and if
+the subxact aborts, the earlier, weaker lock is kept.
+
+The possibility of having an update within a MultiXact means that they must
+persist across crashes and restarts: a future reader of the tuple needs to
+figure out whether the update committed or aborted.  So we have a requirement
+that pg_multixact needs to retain pages of its data until we're certain that
+the MultiXacts in them are no longer of interest.
+
+VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
+The lower bound used by vacuum (that is, the value below which all multixacts
+are removed) is stored as pg_class.relminmxid for each table; the minimum of
+all such values is stored in pg_database.datminmxid.  The minimum across
+all databases, in turn, is recorded in checkpoint records, and CHECKPOINT
+removes pg_multixact/ segments older than that value once the checkpoint
+record has been flushed.
+
+Infomask Bits
+-------------
+
+The following infomask bits are applicable:
+
+- HEAP_XMAX_INVALID
+  Any tuple with this bit set does not have a valid value stored in XMAX.
+
+- HEAP_XMAX_IS_MULTI
+  This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a
+  regular TransactionId).
+
+- HEAP_XMAX_LOCK_ONLY
+  This bit is set when the XMAX is a locker only; that is, if it's a
+  multixact, it does not contain an update among its members.  It's set when
+  the XMAX is a plain Xid that locked the tuple, as well.
+
+- HEAP_XMAX_KEYSHR_LOCK
+- HEAP_XMAX_SHR_LOCK
+- HEAP_XMAX_EXCL_LOCK
+  These bits indicate the strength of the lock acquired; they are useful when
+  the XMAX is not a MultiXactId.  If it's a multi, the info is to be found in
+  the member flags.  If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY
+  is set, then one of these *must* be set as well.
+
+  Note that HEAP_XMAX_EXCL_LOCK does not distinguish FOR NO KEY UPDATE from
+  FOR UPDATE; this is implemented by the HEAP_KEYS_UPDATED bit.
+
+- HEAP_KEYS_UPDATED
+  This bit lives in t_infomask2.  If set, indicates that the operation(s) done
+  by the XMAX compromise the tuple key, such as a SELECT FOR UPDATE, an UPDATE
+  that modifies the columns of the key, or a DELETE.  It's set regardless of
+  whether the XMAX is a TransactionId or a MultiXactId.
+
+We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
+is set.