src/backend/storage/lmgr/README

Locking Overview
================

Postgres uses four types of interprocess locks:

* Spinlocks.  These are intended for *very* short-term locks.  If a lock
is to be held more than a few dozen instructions, or across any sort of
kernel call (or even a call to a nontrivial subroutine), don't use a
spinlock.  Spinlocks are primarily used as infrastructure for lightweight
locks.  They are implemented using a hardware atomic-test-and-set
instruction, if available.  Waiting processes busy-loop until they can
get the lock.  There is no provision for deadlock detection, automatic
release on error, or any other nicety.  There is a timeout if the lock
cannot be acquired after a minute or so (which is approximately forever
in comparison to the intended lock hold time, so this is certainly an
error condition).

* Lightweight locks (LWLocks).  These locks are typically used to
interlock access to data structures in shared memory.  LWLocks support
both exclusive and shared lock modes (for read/write and read-only
access to a shared object).  There is no provision for deadlock
detection, but the LWLock manager will automatically release held
LWLocks during elog() recovery, so it is safe to raise an error while
holding LWLocks.  Obtaining or releasing an LWLock is quite fast (a few
dozen instructions) when there is no contention for the lock.  When a
process has to wait for an LWLock, it blocks on a SysV semaphore so as
not to consume CPU time.  Waiting processes will be granted the lock in
arrival order.  There is no timeout.

* Regular locks (a/k/a heavyweight locks).  The regular lock manager
supports a variety of lock modes with table-driven semantics, and it has
full deadlock detection and automatic release at transaction end.
Regular locks should be used for all user-driven lock requests.

* SIReadLock predicate locks.  See separate README-SSI file for details.

Acquisition of either a spinlock or a lightweight lock causes query
cancel and die() interrupts to be held off until all such locks are
released.  No such restriction exists for regular locks, however.  Also
note that we can accept query cancel and die() interrupts while waiting
for a regular lock, but we will not accept them while waiting for
spinlocks or LW locks.  It is therefore not a good idea to use LW locks
when the wait time might exceed a few seconds.

The rest of this README file discusses the regular lock manager in detail.
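
As a concrete illustration of the busy-wait behavior described above,
here is a minimal test-and-set spin loop written with C11 atomics.  This
is only a sketch, not PostgreSQL's actual s_lock implementation; real
code also needs backoff and the minute-scale timeout mentioned above:

    /*
     * Minimal illustration of a busy-wait loop around a hardware
     * test-and-set.  Not PostgreSQL's s_lock.c; a self-contained sketch.
     */
    #include <stdatomic.h>

    typedef atomic_flag slock_t;    /* initialize with ATOMIC_FLAG_INIT */

    static void
    spin_lock(slock_t *lock)
    {
        /* test-and-set returns the previous value; spin while it was set */
        while (atomic_flag_test_and_set(lock))
            ;                       /* busy-loop */
    }

    static void
    spin_unlock(slock_t *lock)
    {
        atomic_flag_clear(lock);
    }
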
Lock Data Structures
--------------------

Lock methods describe the overall locking behavior.  Currently there are
two lock methods: DEFAULT and USER.

Lock modes describe the type of the lock (read/write or shared/exclusive).
In principle, each lock method can have its own set of lock modes with
different conflict rules, but currently DEFAULT and USER methods use
identical lock mode sets.  See src/include/storage/lock.h for more details.
(Lock modes are also called lock types in some places in the code and
documentation.)

There are two main methods for recording locks in shared memory.  The
primary mechanism uses two main structures: the per-lockable-object LOCK
struct, and the per-lock-and-requestor PROCLOCK struct.  A LOCK object
exists for each lockable object that currently has locks held or requested
on it.  A PROCLOCK struct exists for each backend that is holding or
requesting lock(s) on each LOCK object.

There is also a special "fast path" mechanism which backends may use to
record a limited number of locks with very specific characteristics: they
must use the DEFAULT lockmethod; they must represent a lock on a database
relation (not a shared relation); they must be a "weak" lock which is
unlikely to conflict (AccessShareLock, RowShareLock, or RowExclusiveLock);
and the system must be able to quickly verify that no conflicting locks
could possibly be present.  See "Fast Path Locking", below, for more
details.

Each backend also maintains an unshared LOCALLOCK structure for each
lockable object and lock mode that it is currently holding or requesting.
The shared lock structures only allow a single lock grant to be made per
lockable object/lock mode/backend.  Internally to a backend, however, the
same lock may be requested and perhaps released multiple times in a
transaction, and it can also be held both transactionally and
session-wide.  The internal request counts are held in LOCALLOCK so that
the shared data structures need not be accessed to alter them.

---------------------------------------------------------------------------

The lock manager's LOCK objects contain:

tag -
    The key fields that are used for hashing locks in the shared memory
    lock hash table.  The contents of the tag essentially define an
    individual lockable object.  See include/storage/lock.h for details
    about the supported types of lockable objects.  This is declared as
    a separate struct to ensure that we always zero out the correct number
    of bytes.  It is critical that any alignment-padding bytes the compiler
    might insert in the struct be zeroed out, else the hash computation
    will be random.  (Currently, we are careful to define struct LOCKTAG
    so that there are no padding bytes.)

grantMask -
    This bitmask indicates what types of locks are currently held on the
    given lockable object.  It is used (against the lock table's conflict
    table) to determine if a new lock request will conflict with existing
    lock types held.  Conflicts are determined by bitwise AND operations
    between the grantMask and the conflict table entry for the requested
    lock type.  Bit i of grantMask is 1 if and only if granted[i] > 0.

waitMask -
    This bitmask shows the types of locks being waited for.  Bit i of
    waitMask is 1 if and only if requested[i] > granted[i].

procLocks -
    This is a shared memory queue of all the PROCLOCK structs associated
    with the lock object.  Note that both granted and waiting PROCLOCKs
    are in this list (indeed, the same PROCLOCK might have some
    already-granted locks and be waiting for more!).

waitProcs -
    This is a shared memory queue of all PGPROC structures corresponding
    to backends that are waiting (sleeping) until another backend releases
    this lock.  The process structure holds the information needed to
    determine if it should be woken up when the lock is released.

nRequested -
    Keeps a count of how many times acquisition of this lock has been
    attempted.  The count includes attempts by processes which were put
    to sleep due to conflicts.  It also counts the same backend twice
    if, for example, a backend process first acquires a read and then
    acquires a write.  (But multiple acquisitions of the same lock/lock
    mode within a backend are not multiply counted here; they are
    recorded only in the backend's LOCALLOCK structure.)

requested -
    Keeps a count of how many locks of each type have been attempted.
    Only elements 1 through MAX_LOCKMODES-1 are used, since they
    correspond to the defined lock-type constants.  Summing the values
    of requested[] should come out equal to nRequested.

nGranted -
    Keeps count of how many times this lock has been successfully
    acquired.  This count does not include attempts that are waiting due
    to conflicts.  Otherwise the counting rules are the same as for
    nRequested.

granted -
    Keeps count of how many locks of each type are currently held.  Once
    again only elements 1 through MAX_LOCKMODES-1 are used (0 is not).
    Also, like requested[], summing the values of granted[] should total
    to the value of nGranted.

We should always have 0 <= nGranted <= nRequested, and
0 <= granted[i] <= requested[i] for each i.  When all the request counts
go to zero, the LOCK object is no longer needed and can be freed.
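
To make the grantMask/conflict-table arithmetic concrete, here is a small
sketch.  The conflict table here is an illustrative stand-in, not the
actual tables in lock.c:

    /*
     * Sketch of the bitwise conflict test and grantMask maintenance
     * described above.  Names and table contents are illustrative.
     */
    #include <stdbool.h>

    #define MAX_LOCKMODES 10

    typedef int LOCKMODE;
    typedef int LOCKMASK;

    #define LOCKBIT_ON(lockmode) (1 << (lockmode))

    /* conflictTab[m] is the set of modes that conflict with mode m */
    static const LOCKMASK conflictTab[MAX_LOCKMODES] = {0};    /* fill in */

    /* A request for 'mode' conflicts iff some granted mode is in its set. */
    static bool
    lock_conflicts(LOCKMODE mode, LOCKMASK grantMask)
    {
        return (conflictTab[mode] & grantMask) != 0;
    }

    /* grantMask maintenance: bit i is set iff granted[i] > 0 */
    static LOCKMASK
    recompute_grant_mask(const int granted[MAX_LOCKMODES])
    {
        LOCKMASK    mask = 0;

        for (int i = 1; i < MAX_LOCKMODES; i++)
        {
            if (granted[i] > 0)
                mask |= LOCKBIT_ON(i);
        }
        return mask;
    }
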
---------------------------------------------------------------------------

The lock manager's PROCLOCK objects contain:

tag -
    The key fields that are used for hashing entries in the shared memory
    PROCLOCK hash table.  This is declared as a separate struct to ensure
    that we always zero out the correct number of bytes.  It is critical
    that any alignment-padding bytes the compiler might insert in the
    struct be zeroed out, else the hash computation will be random.
    (Currently, we are careful to define struct PROCLOCKTAG so that there
    are no padding bytes.)

    tag.myLock
        Pointer to the shared LOCK object this PROCLOCK is for.

    tag.myProc
        Pointer to the PGPROC of the backend process that owns this
        PROCLOCK.

    Note: it's OK to use pointers here because a PROCLOCK never outlives
    either its lock or its proc.  The tag is therefore unique for as long
    as it needs to be, even though the same tag values might mean
    something else at other times.

holdMask -
    A bitmask for the lock modes successfully acquired by this PROCLOCK.
    This should be a subset of the LOCK object's grantMask, and also a
    subset of the PGPROC object's heldLocks mask (if the PGPROC is
    currently waiting for another lock mode on this lock).

releaseMask -
    A bitmask for the lock modes due to be released during LockReleaseAll.
    This must be a subset of the holdMask.  Note that it is modified
    without taking the partition LWLock, and therefore it is unsafe for
    any backend except the one owning the PROCLOCK to examine/change it.

lockLink -
    List link for the shared memory queue of all the PROCLOCK objects for
    the same LOCK.

procLink -
    List link for the shared memory queue of all the PROCLOCK objects for
    the same backend.

---------------------------------------------------------------------------


Lock Manager Internal Locking
-----------------------------

Before PostgreSQL 8.2, all of the shared-memory data structures used by
the lock manager were protected by a single LWLock, the LockMgrLock;
any operation involving these data structures had to exclusively lock
LockMgrLock.  Not too surprisingly, this became a contention bottleneck.
To reduce contention, the lock manager's data structures have been split
into multiple "partitions", each protected by an independent LWLock.
Most operations only need to lock the single partition they are working
in.  Here are the details:

* Each possible lock is assigned to one partition according to a hash of
its LOCKTAG value.  The partition's LWLock is considered to protect all
the LOCK objects of that partition as well as their subsidiary PROCLOCKs.

* The shared-memory hash tables for LOCKs and PROCLOCKs are organized
so that different partitions use different hash chains, and thus there
is no conflict in working with objects in different partitions.  This
is supported directly by dynahash.c's "partitioned table" mechanism
for the LOCK table: we need only ensure that the partition number is
taken from the low-order bits of the dynahash hash value for the LOCKTAG
(see the sketch following this list).  To make it work for PROCLOCKs, we
have to ensure that a PROCLOCK's hash value has the same low-order bits
as its associated LOCK.  This requires a specialized hash function (see
proclock_hash).

* Formerly, each PGPROC had a single list of PROCLOCKs belonging to it.
This has now been split into per-partition lists, so that access to a
particular PROCLOCK list can be protected by the associated partition's
LWLock.  (This rule allows one backend to manipulate another backend's
PROCLOCK lists, which was not originally necessary but is now required
in connection with fast-path locking; see below.)

* The other lock-related fields of a PGPROC are only interesting when
the PGPROC is waiting for a lock, so we consider that they are protected
by the partition LWLock of the awaited lock.
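
Here is a sketch of the partition-number computation mentioned in the
hash-table bullet above.  The constants mirror the described scheme
(PostgreSQL uses 16 lock partitions, chosen from the low-order hash
bits), but the names are illustrative rather than the actual macros:

    /*
     * Sketch: the partition is just the low-order bits of the locktag's
     * dynahash hash value.
     */
    #include <stdint.h>

    #define LOG2_NUM_LOCK_PARTITIONS 4
    #define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)

    static inline uint32_t
    lock_hash_partition(uint32_t hashcode)
    {
        return hashcode % NUM_LOCK_PARTITIONS;  /* low-order bits */
    }
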
For normal lock acquisition and release, it is sufficient to lock the
partition containing the desired lock.  Deadlock checking needs to touch
multiple partitions in general; for simplicity, we just make it lock all
the partitions in partition-number order.  (To prevent LWLock deadlock,
we establish the rule that any backend needing to lock more than one
partition at once must lock them in partition-number order.)  It's
possible that deadlock checking could be done without touching every
partition in typical cases, but since in a properly functioning system
deadlock checking should not occur often enough to be performance-critical,
trying to make this work does not seem a productive use of effort.

A backend's internal LOCALLOCK hash table is not partitioned.  We do store
a copy of the locktag hash code in LOCALLOCK table entries, from which the
partition number can be computed, but this is a straight speed-versus-space
tradeoff: we could instead recalculate the partition number from the
LOCKTAG when needed.


Fast Path Locking
-----------------

Fast path locking is a special purpose mechanism designed to reduce the
overhead of taking and releasing certain types of locks which are taken
and released very frequently but rarely conflict.  Currently, this
includes two categories of locks:

(1) Weak relation locks.  SELECT, INSERT, UPDATE, and DELETE must acquire
a lock on every relation they operate on, as well as various system
catalogs that can be used internally.  Many DML operations can proceed in
parallel against the same table at the same time; only DDL operations
such as CLUSTER, ALTER TABLE, or DROP -- or explicit user action such as
LOCK TABLE -- will create lock conflicts with the "weak" locks
(AccessShareLock, RowShareLock, RowExclusiveLock) acquired by DML
operations.

(2) VXID locks.  Every transaction takes a lock on its own virtual
transaction ID.  Currently, the only operations that wait for these locks
are CREATE INDEX CONCURRENTLY and Hot Standby (in the case of a
conflict), so most VXID locks are taken and released by the owner without
anyone else needing to care.
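
The eligibility conditions for the relation-lock case can be summarized
in code.  This is a hedged sketch using simplified stand-in type and
constant names, not the actual test in lock.c:

    /*
     * Sketch of the fast-path eligibility conditions for relation locks:
     * DEFAULT lockmethod, an unshared database relation, and a "weak"
     * lock mode.
     */
    #include <stdbool.h>

    typedef enum
    {
        ACCESS_SHARE = 1,           /* the three "weak" modes... */
        ROW_SHARE,
        ROW_EXCLUSIVE,
        SHARE_UPDATE_EXCLUSIVE      /* ...this and stronger are not */
    } LockMode;

    typedef struct
    {
        int         lockmethod;
        int         locktype;
        unsigned    dbOid;          /* 0 for shared relations */
    } LockTag;

    #define DEFAULT_LOCKMETHOD 1
    #define LOCKTAG_RELATION   0

    static bool
    eligible_for_relation_fast_path(const LockTag *tag, LockMode mode)
    {
        return tag->lockmethod == DEFAULT_LOCKMETHOD &&
            tag->locktype == LOCKTAG_RELATION &&
            tag->dbOid != 0 &&              /* not a shared relation */
            mode < SHARE_UPDATE_EXCLUSIVE;  /* "weak" modes only */
    }
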
The primary locking mechanism does not cope well with this workload.  Even
though the lock manager locks are partitioned, the locktag for any given
relation still falls in one, and only one, partition.  Thus, if many short
queries are accessing the same relation, the lock manager partition lock
for that partition becomes a contention bottleneck.  This effect is
measurable even on 2-core servers, and becomes very pronounced as core
count increases.

To alleviate this bottleneck, beginning in PostgreSQL 9.2, each backend is
permitted to record a limited number of locks on unshared relations in an
array within its PGPROC structure, rather than using the primary lock
table.  This mechanism can only be used when the locker can verify that no
conflicting locks exist at the time of taking the lock.

A key point of this algorithm is that it must be possible to verify the
absence of possibly conflicting locks without fighting over a shared
LWLock or spinlock.  Otherwise, this effort would simply move the
contention bottleneck from one place to another.  We accomplish this using
an array of 1024 integer counters, which are in effect a 1024-way
partitioning of the lock space.  Each counter records the number of
"strong" locks (that is, ShareLock, ShareRowExclusiveLock, ExclusiveLock,
and AccessExclusiveLock) on unshared relations that fall into that
partition.  When this counter is non-zero, the fast path mechanism may not
be used to take new relation locks within that partition.  A strong locker
bumps the counter and then scans each per-backend array for matching
fast-path locks; any which are found must be transferred to the primary
lock table before attempting to acquire the lock, to ensure proper lock
conflict and deadlock detection.
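
A sketch of the counter check follows.  The structure and function names
are illustrative, and the locking and memory-ordering rules (discussed
next) are omitted:

    /*
     * Sketch of the 1024-way strong-lock counters.  A weak locker may use
     * the fast path only when its partition's counter is zero; a strong
     * locker bumps the counter before scanning the per-backend arrays.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define FP_STRONG_LOCK_PARTITIONS 1024

    typedef struct
    {
        /* protected by a spinlock in the real data structure */
        uint32_t    count[FP_STRONG_LOCK_PARTITIONS];
    } FastPathStrongLocks;

    static inline uint32_t
    fp_strong_partition(uint32_t locktag_hash)
    {
        return locktag_hash % FP_STRONG_LOCK_PARTITIONS;
    }

    static bool
    fast_path_usable(const FastPathStrongLocks *strong, uint32_t locktag_hash)
    {
        /* non-zero: a strong lock may exist somewhere in this partition */
        return strong->count[fp_strong_partition(locktag_hash)] == 0;
    }
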
On an SMP system, we must guarantee proper memory synchronization.  Here
we rely on the fact that LWLock acquisition acts as a memory sequence
point: if A performs a store, A and B both acquire an LWLock in either
order, and B then performs a load on the same memory location, it is
guaranteed to see A's store.  In this case, each backend's fast-path lock
queue is protected by an LWLock.  A backend wishing to acquire a fast-path
lock grabs this LWLock before examining FastPathStrongRelationLocks to
check for the presence of a conflicting strong lock.  And the backend
attempting to acquire a strong lock, because it must transfer any matching
weak locks taken via the fast-path mechanism to the shared lock table,
will acquire every LWLock protecting a backend fast-path queue in turn.
So, if we examine FastPathStrongRelationLocks and see a zero, then either
the value is truly zero, or if it is a stale value, the strong locker has
yet to acquire the per-backend LWLock we now hold (or, indeed, even the
first per-backend LWLock) and will notice any weak lock we take when it
does.

Fast-path VXID locks do not use the FastPathStrongRelationLocks table.
The first lock taken on a VXID is always the ExclusiveLock taken by its
owner.  Any subsequent lockers are share lockers waiting for the VXID to
terminate.  Indeed, the only reason VXID locks use the lock manager at all
(rather than waiting for the VXID to terminate via some other method) is
for deadlock detection.  Thus, the initial VXID lock can *always* be taken
via the fast path without checking for conflicts.  Any subsequent locker
must check whether the lock has been transferred to the main lock table,
and if not, do so.  The backend owning the VXID must be careful to clean
up any entry made in the main lock table at end of transaction.

Deadlock detection does not need to examine the fast-path data structures,
because any lock that could possibly be involved in a deadlock must have
been transferred to the main tables beforehand.


The Deadlock Detection Algorithm
--------------------------------

Since we allow user transactions to request locks in any order, deadlock
is possible.  We use a deadlock detection/breaking algorithm that is
fairly standard in essence, but there are many special considerations
needed to deal with Postgres' generalized locking model.

A key design consideration is that we want to make routine operations
(lock grant and release) run quickly when there is no deadlock, and
avoid the overhead of deadlock handling as much as possible.  We do this
using an "optimistic waiting" approach: if a process cannot acquire the
lock it wants immediately, it goes to sleep without any deadlock check.
But it also sets a delay timer, with a delay of DeadlockTimeout
milliseconds (typically set to one second).  If the delay expires before
the process is granted the lock it wants, it runs the deadlock
detection/breaking code.  Normally this code will determine that there is
no deadlock condition, and then the process will go back to sleep and
wait quietly until it is granted the lock.  But if a deadlock condition
does exist, it will be resolved, usually by aborting the detecting
process' transaction.  In this way, we avoid deadlock handling overhead
whenever the wait time for a lock is less than DeadlockTimeout, while
not unreasonably delaying detection when a deadlock actually exists.

Lock acquisition (routines LockAcquire and ProcSleep) follows these rules:

1. A lock request is granted immediately if it does not conflict with
any existing or waiting lock request, or if the process already holds an
instance of the same lock type (eg, there's no penalty to acquire a read
lock twice).  Note that a process never conflicts with itself, eg, one
can obtain a read lock when one already holds an exclusive lock.

2. Otherwise the process joins the lock's wait queue.  Normally it will
be added to the end of the queue, but there is an exception: if the
process already holds locks on this same lockable object that conflict
with the request of any pending waiter, then the process will be
inserted in the wait queue just ahead of the first such waiter.  (If we
did not make this check, the deadlock detection code would adjust the
queue order to resolve the conflict, but it's relatively cheap to make
the check in ProcSleep and avoid a deadlock timeout delay in this case.)
Note the special case when inserting before the end of the queue: if the
process's request does not conflict with any existing lock nor any
waiting request before its insertion point, then go ahead and grant the
lock without waiting.

When a lock is released, the lock release routine (ProcLockWakeup) scans
the lock object's wait queue.  Each waiter is awoken if (a) its request
does not conflict with already-granted locks, and (b) its request does
not conflict with the requests of prior un-wakable waiters.  Rule (b)
ensures that conflicting requests are granted in order of arrival.  There
are cases where a later waiter must be allowed to go in front of
conflicting earlier waiters to avoid deadlock, but it is not
ProcLockWakeup's responsibility to recognize these cases; instead, the
deadlock detection code will re-order the wait queue when necessary.
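
Rules (a) and (b) amount to a single ordered scan over the wait queue,
tracking the modes requested by waiters that could not be woken.  A
simplified sketch, with stand-in types and a hypothetical wake_up()
helper:

    /*
     * Sketch of the wakeup scan: wake each waiter whose request conflicts
     * neither with granted modes (a) nor with requests of earlier
     * not-yet-woken waiters (b).
     */
    typedef struct Waiter
    {
        int         waitLockMode;   /* mode this process is waiting for */
        struct Waiter *next;
    } Waiter;

    #define LOCKBIT_ON(mode) (1 << (mode))

    extern const int conflictTab[];     /* modes conflicting with each mode */
    extern void wake_up(Waiter *w);     /* stand-in: grant lock, wake process */

    static void
    proc_lock_wakeup(Waiter *queue, int grantMask)
    {
        int         aheadRequests = 0;  /* modes of not-yet-woken waiters */

        for (Waiter *w = queue; w != NULL; w = w->next)
        {
            int         conflicts = conflictTab[w->waitLockMode];

            if ((conflicts & grantMask) == 0 &&     /* rule (a) */
                (conflicts & aheadRequests) == 0)   /* rule (b) */
            {
                grantMask |= LOCKBIT_ON(w->waitLockMode);
                wake_up(w);
            }
            else
                aheadRequests |= LOCKBIT_ON(w->waitLockMode);
        }
    }
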
To perform deadlock checking, we use the standard method of viewing the
various processes as nodes in a directed graph (the waits-for graph or
WFG).  There is a graph edge leading from process A to process B if A
waits for B, ie, A is waiting for some lock and B holds a conflicting
lock.  There is a deadlock condition if and only if the WFG contains a
cycle.  We detect cycles by searching outward along waits-for edges to
see if we return to our starting point.  There are three possible
outcomes:

1. All outgoing paths terminate at a running process (which has no
outgoing edge).

2. A deadlock is detected by looping back to the start point.  We
resolve such a deadlock by canceling the start point's lock request and
reporting an error in that transaction, which normally leads to
transaction abort and release of that transaction's held locks.  Note
that it's sufficient to cancel one request to remove the cycle; we don't
need to kill all the transactions involved.

3. Some path(s) loop back to a node other than the start point.  This
indicates a deadlock, but one that does not involve our starting
process.  We ignore this condition on the grounds that resolving such a
deadlock is the responsibility of the processes involved --- killing our
start-point process would not resolve the deadlock.  So, cases 1 and 3
both report "no deadlock".
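
The outward search can be sketched as a depth-first traversal that
reports a cycle only when it returns to the starting process.  The
adjacency-matrix representation below is illustrative only; the real
detector walks lock wait queues:

    /*
     * Sketch: DFS over waits-for edges, ignoring cycles that do not pass
     * through the start point.
     */
    #include <stdbool.h>

    #define MAX_PROCS 64

    typedef struct
    {
        int         nprocs;
        bool        edge[MAX_PROCS][MAX_PROCS]; /* edge[a][b]: a waits for b */
    } WaitsForGraph;

    /* Call with cur == start and visited[] all false. */
    static bool
    find_lock_cycle(const WaitsForGraph *g, int start, int cur,
                    bool visited[MAX_PROCS])
    {
        if (visited[cur])
            return false;           /* loop not through start: ignore (case 3) */
        visited[cur] = true;

        for (int next = 0; next < g->nprocs; next++)
        {
            if (!g->edge[cur][next])
                continue;
            if (next == start)
                return true;        /* looped back to start: deadlock (case 2) */
            if (find_lock_cycle(g, start, next, visited))
                return true;
        }
        return false;               /* all outgoing paths terminate (case 1) */
    }
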
Postgres' situation is a little more complex than the standard discussion
of deadlock detection, for two reasons:

1. A process can be waiting for more than one other process, since there
might be multiple PROCLOCKs of (non-conflicting) lock types that all
conflict with the waiter's request.  This creates no real difficulty,
however; we simply need to be prepared to trace more than one outgoing
edge.

2. If a process A is behind a process B in some lock's wait queue, and
their requested locks conflict, then we must say that A waits for B, since
ProcLockWakeup will never awaken A before B.  This creates additional
edges in the WFG.  We call these "soft" edges, as opposed to the "hard"
edges induced by locks already held.  Note that if B already holds any
locks conflicting with A's request, then their relationship is a hard edge
not a soft edge.

A "soft" block, or wait-priority block, has the same potential for
inducing deadlock as a hard block.  However, we may be able to resolve
a soft block without aborting the transactions involved: we can instead
rearrange the order of the wait queue.  This rearrangement reverses the
direction of the soft edge between two processes with conflicting requests
whose queue order is reversed.  If we can find a rearrangement that
eliminates a cycle without creating new ones, then we can avoid an abort.
Checking for such possible rearrangements is the trickiest part of the
algorithm.

The workhorse of the deadlock detector is a routine FindLockCycle() which
is given a starting point process (which must be a waiting process).
It recursively scans outward across waits-for edges as discussed above.
If it finds no cycle involving the start point, it returns "false".
(As discussed above, we can ignore cycles not involving the start point.)
When such a cycle is found, FindLockCycle() returns "true", and as it
unwinds it also builds a list of any "soft" edges involved in the cycle.
If the resulting list is empty then there is a hard deadlock and the
configuration cannot succeed.  However, if the list is not empty, then
reversing any one of the listed edges through wait-queue rearrangement
will eliminate that cycle.  Since such a reversal might create cycles
elsewhere, we may need to try every possibility.  Therefore, we need to
be able to invoke FindLockCycle() on hypothetical configurations (wait
orders) as well as the current real order.

The easiest way to handle this seems to be to have a lookaside table that
shows the proposed new queue order for each wait queue that we are
considering rearranging.  This table is checked by FindLockCycle, and it
believes the proposed queue order rather than the real order for each lock
that has an entry in the lookaside table.

We build a proposed new queue order by doing a "topological sort" of the
existing entries.  Each soft edge that we are currently considering
reversing creates a property of the partial order that the topological
sort has to enforce.  We must use a sort method that preserves the input
ordering as much as possible, so as not to gratuitously break arrival
order for processes not involved in a deadlock.  (This is not true of the
tsort method shown in Knuth, for example, but it's easily done by a simple
doubly-nested-loop method that emits the first legal candidate at each
step.  Fortunately, we don't need a highly efficient sort algorithm, since
the number of partial order constraints is not likely to be large.)  Note
that failure of the topological sort tells us we have conflicting ordering
constraints, and therefore that the last-added soft edge reversal
conflicts with a prior edge reversal.  We need to detect this case to
avoid an infinite loop in the case where no possible rearrangement will
work: otherwise, we might try a reversal, find that it still leads to
a cycle, then try to un-reverse the reversal while trying to get rid of
that cycle, etc etc.  Topological sort failure tells us the un-reversal
is not a legitimate move in this context.
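
The doubly-nested-loop sort alluded to above can be sketched as follows:
at each output position, emit the first not-yet-emitted process (in
existing queue order) whose required predecessors have all been emitted.
Failure to find any legal candidate is the sort failure just discussed.
This sketch assumes at most 64 queue entries:

    /*
     * Sketch of an order-preserving topological sort: emit the first
     * legal candidate at each step; return false on contradictory
     * constraints.
     */
    #include <stdbool.h>

    #define MAX_QUEUE 64

    typedef struct
    {
        int         before;         /* this process must precede... */
        int         after;          /* ...this one */
    } Constraint;

    static bool
    stable_topo_sort(const int queue[], int n,
                     const Constraint cons[], int ncons,
                     int output[])
    {
        bool        emitted[MAX_QUEUE] = {false};

        for (int pos = 0; pos < n; pos++)
        {
            int         chosen = -1;

            for (int i = 0; i < n && chosen < 0; i++)
            {
                bool        ok = !emitted[i];

                for (int c = 0; ok && c < ncons; c++)
                {
                    if (cons[c].after != queue[i])
                        continue;
                    /* is the required predecessor still unemitted? */
                    for (int j = 0; j < n; j++)
                        if (queue[j] == cons[c].before && !emitted[j])
                            ok = false;
                }
                if (ok)
                    chosen = i;     /* first legal candidate */
            }
            if (chosen < 0)
                return false;       /* contradictory constraints */
            emitted[chosen] = true;
            output[pos] = queue[chosen];
        }
        return true;
    }
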
So, the basic step in our rearrangement method is to take a list of
soft edges in a cycle (as returned by FindLockCycle()) and successively
try the reversal of each one as a topological-sort constraint added to
whatever constraints we are already considering.  We recursively search
through all such sets of constraints to see if any one eliminates all
the deadlock cycles at once.  Although this might seem impossibly
inefficient, it shouldn't be a big problem in practice, because there
will normally be very few, and not very large, deadlock cycles --- if
any at all.  So the combinatorial inefficiency isn't going to hurt us.
Besides, it's better to spend some time to guarantee that we've checked
all possible escape routes than to abort a transaction when we didn't
really have to.

Each edge reversal constraint can be viewed as requesting that the waiting
process A be moved to before the blocking process B in the wait queue they
are both in.  This action will reverse the desired soft edge, as well as
any other soft edges between A and other processes it is advanced over.
No other edges will be affected.  (Note that this is actually a constraint
on our topological sort method: it must not re-order the queue more than
necessary.)  Therefore, we can be sure we have not created any new
deadlock cycles if neither FindLockCycle(A) nor FindLockCycle(B) discovers
any cycle.  Given the above-defined behavior of FindLockCycle, each of
these searches is necessary as well as sufficient, since FindLockCycle
starting at the original start point will not complain about cycles that
include A or B but not the original start point.

In short then, a proposed rearrangement of the wait queue(s) is determined
by one or more broken soft edges A->B, fully specified by the output of
topological sorts of each wait queue involved, and then tested by invoking
FindLockCycle() starting at the original start point as well as each of
the mentioned processes (A's and B's).  If none of the tests detect a
cycle, then we have a valid configuration and can implement it by
reordering the wait queues per the sort outputs (and then applying
ProcLockWakeup on each reordered queue, in case a waiter has become
wakable).  If any test detects a soft cycle, we can try to resolve it by
adding each soft link in that cycle, in turn, to the proposed
rearrangement list.  This is repeated recursively until we either find a
workable rearrangement or determine that none exists.  In the latter case,
the outer level resolves the deadlock by aborting the original start-point
transaction.
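
A heavily simplified skeleton of this recursion is shown below.  Every
helper function is a hypothetical stand-in, and the extra FindLockCycle()
tests on the moved processes described above are omitted; the real logic
lives in src/backend/storage/lmgr/deadlock.c:

    /*
     * Skeleton of the recursive constraint search: for each soft edge in
     * the current cycle, try adding its reversal as a constraint and
     * recheck; back out reversals that don't pan out.
     */
    #include <stdbool.h>

    #define MAX_SOFT_EDGES 64

    typedef struct
    {
        int         waiter;         /* A: to be moved ahead of... */
        int         blocker;        /* ...B in their shared wait queue */
    } SoftEdge;

    /* Hypothetical helpers: */
    extern bool find_lock_cycle(int startProc, SoftEdge soft[], int *nsoft);
    extern bool add_constraint_and_resort(SoftEdge e);  /* false: tsort failed */
    extern void drop_last_constraint(void);

    static bool
    test_configuration(int startProc)
    {
        SoftEdge    soft[MAX_SOFT_EDGES];
        int         nsoft = 0;

        /* No cycle through the start point: this configuration works. */
        if (!find_lock_cycle(startProc, soft, &nsoft))
            return true;
        if (nsoft == 0)
            return false;           /* hard deadlock; nothing to reverse */

        /* Try reversing each soft edge in the cycle, one at a time. */
        for (int i = 0; i < nsoft; i++)
        {
            if (!add_constraint_and_resort(soft[i]))
                continue;           /* conflicts with a prior reversal */
            if (test_configuration(startProc))
                return true;        /* this set of reversals works */
            drop_last_constraint();
        }
        return false;               /* no rearrangement eliminates the cycle */
    }
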
The particular order in which rearrangements are tried depends on the
order FindLockCycle() happens to scan in, so if there are multiple
workable rearrangements of the wait queues, then it is unspecified which
one will be chosen.  What's more important is that we guarantee to try
every queue rearrangement that could lead to success.  (For example,
if we have A before B before C and the needed order constraints are
C before A and B before C, we would first discover that A before C
doesn't work and try the rearrangement C before A before B.  This would
eventually lead to the discovery of the additional constraint B before
C.)

Got that?

Miscellaneous Notes
-------------------

1. It is easily proven that no deadlock will be missed due to our
asynchronous invocation of deadlock checking.  A deadlock cycle in the WFG
is formed when the last edge in the cycle is added; therefore the last
process in the cycle to wait (the one from which that edge is outgoing) is
certain to detect and resolve the cycle when it later runs CheckDeadLock.
This holds even if that edge addition created multiple cycles; the process
may indeed abort without ever noticing those additional cycles, but we
don't particularly care.  The only other possible creation of deadlocks is
during deadlock resolution's rearrangement of wait queues, and we already
saw that that algorithm will prove that it creates no new deadlocks before
it attempts to actually execute any rearrangement.

2. It is not certain that a deadlock will be resolved by aborting the
last-to-wait process.  If earlier waiters in the cycle have not yet run
CheckDeadLock, then the first one to do so will be the victim.

3. No live (wakable) process can be missed by ProcLockWakeup, since it
examines every member of the wait queue (this was not true in the 7.0
implementation, BTW).  Therefore, if ProcLockWakeup is always invoked
after a lock is released or a wait queue is rearranged, there can be no
failure to wake a wakable process.  One should also note that
LockErrorCleanup (abort a waiter due to outside factors) must run
ProcLockWakeup, in case the canceled waiter was soft-blocking other
waiters.

4. We can minimize excess rearrangement-trial work by being careful to
scan the wait queue from the front when looking for soft edges.  For
example, if we have queue order A,B,C and C has deadlock conflicts with
both A and B, we want to generate the "C before A" constraint first,
rather than wasting time with "C before B", which won't move C far
enough up.  So we look for soft edges outgoing from C starting at the
front of the wait queue.

5. The working data structures needed by the deadlock detection code can
be limited to numbers of entries computed from MaxBackends.  Therefore,
we can allocate the worst-case space needed during backend startup.  This
seems a safer approach than trying to allocate workspace on the fly; we
don't want to risk having the deadlock detector run out of memory, else
we really have no guarantees at all that deadlock will be detected.

6. We abuse the deadlock detector to implement autovacuum cancellation.
When we run the detector and we find that there's an autovacuum worker
involved in the waits-for graph, we store a pointer to its PGPROC, and
return a special return code (unless a hard deadlock has been detected).
The caller can then send a cancellation signal.  This implements the
principle that autovacuum has a low locking priority (eg, it must not
block DDL on the table).

Group Locking
-------------

As if all of that weren't already complicated enough, PostgreSQL now
supports parallelism (see src/backend/access/transam/README.parallel),
which means that we might need to resolve deadlocks that occur between
gangs of related processes rather than individual processes.  This doesn't
change the basic deadlock detection algorithm very much, but it makes the
bookkeeping more complicated.

We choose to regard locks held by processes in the same parallel group as
non-conflicting, with the exception of relation extension and page locks.
This means that two processes in a parallel group can hold a
self-exclusive lock on the same relation at the same time, or one process
can acquire an AccessShareLock while the other already holds
AccessExclusiveLock.  This might seem dangerous, and in some cases it
could be (more on that below), but if we didn't do this then parallel
query would be extremely prone to self-deadlock.  For example, a parallel
query against a relation on which the leader already had
AccessExclusiveLock would hang, because the workers would try to lock the
same relation and be blocked by the leader; yet the leader can't finish
until it receives completion indications from all workers.  An undetected
deadlock results.  This is far from the only scenario where such a problem
happens.  The same thing will occur if the leader holds only
AccessShareLock, the worker seeks AccessShareLock, but between the time
the leader attempts to acquire the lock and the time the worker attempts
to acquire it, some other process queues up waiting for an
AccessExclusiveLock.  In this case, too, an indefinite hang results.

It might seem that we could predict which locks the workers will attempt
to acquire and ensure before going parallel that those locks would be
acquired successfully.  But this is very difficult to make work in a
general way.  For example, a parallel worker's portion of the query plan
could involve an SQL-callable function which generates a query
dynamically, and that query might happen to hit a table on which the
leader happens to hold AccessExclusiveLock.
By imposing enough restrictions on what workers can do, we could
eventually create a situation where their behavior can be adequately
restricted, but these restrictions would be fairly onerous, and even then,
the system required to decide whether the workers will succeed at
acquiring the necessary locks would be complex and possibly buggy.

So, instead, we take the approach of deciding that locks within a lock
group do not conflict.  This eliminates the possibility of an undetected
deadlock, but also opens up some problem cases: if the leader and worker
try to do some operation at the same time which would ordinarily be
prevented by the heavyweight lock mechanism, undefined behavior might
result.  In practice, the dangers are modest.  The leader and worker share
the same transaction, snapshot, and combo CID hash, and neither can
perform any DDL or, indeed, write any data at all.  Thus, for either to
read a table locked exclusively by the other is safe enough.  Problems
would occur if the leader initiated parallelism from a point in the code
at which it had some backend-private state that made table access from
another process unsafe; for example, if parallelism were initiated after
calling SetReindexProcessing and before calling ResetReindexProcessing,
catastrophe could ensue, because the worker won't have that state.

To allow parallel inserts and parallel copy, we have ensured that relation
extension and page locks don't participate in group locking, which means
such locks can conflict among the same group members.  This is required
because it is no safer for two related processes to extend the same
relation or perform cleanup in gin indexes at the same time than for
unrelated processes to do the same.  We don't acquire a heavyweight lock
on any other object after the relation extension lock, which means such a
lock can never participate in a deadlock cycle.  After acquiring page
locks, we can acquire the relation extension lock, but the reverse never
happens, so those will also not participate in deadlock.  To allow for
other parallel writes like parallel update or parallel delete, we'll
either need to (1) further enhance the deadlock detector to handle those
tuple locks in a different way than other types; or (2) have parallel
workers use some other mutual exclusion method for such cases.  Currently,
the parallel mode is strictly read-only, but now we have the
infrastructure to allow parallel inserts and parallel copy.

Group locking adds three new members to each PGPROC: lockGroupLeader,
lockGroupMembers, and lockGroupLink.  A PGPROC's lockGroupLeader is NULL
for processes not involved in parallel query.  When a process wants to
cooperate with parallel workers, it becomes a lock group leader, which
means setting this field to point to its own PGPROC.  When a parallel
worker starts up, it points this field at the leader.  The
lockGroupMembers field is only used in the leader; it is a list of the
member PGPROCs of the lock group (the leader and all workers).  The
lockGroupLink field is the list link for this list.

All three of these fields are considered to be protected by a lock manager
partition lock.  The partition lock that protects these fields within a
given lock group is chosen by taking the leader's pgprocno modulo the
number of lock manager partitions.  This unusual arrangement has a major
advantage: the deadlock detector can count on the fact that no
lockGroupLeader field can change while the deadlock detector is running,
because it knows that it holds all the lock manager locks.  Also, holding
this single lock allows safe manipulation of the lockGroupMembers list for
the lock group.
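
In outline, the three fields and the partition-lock choice look like this
(the dlist types are stand-ins for PostgreSQL's ilist.h doubly-linked
lists, and the rest of PGPROC is elided):

    /*
     * Sketch of the group-locking fields and partition choice described
     * above.
     */
    #define NUM_LOCK_PARTITIONS 16

    typedef struct dlist_node
    {
        struct dlist_node *prev;
        struct dlist_node *next;
    } dlist_node;

    typedef struct
    {
        dlist_node  head;
    } dlist_head;

    typedef struct PGPROC
    {
        /* ... many other fields ... */
        struct PGPROC *lockGroupLeader;     /* NULL if not in a lock group */
        dlist_head  lockGroupMembers;       /* used only in the leader */
        dlist_node  lockGroupLink;          /* member's link in that list */
        int         pgprocno;
    } PGPROC;

    /* The partition lock guarding a group's fields is chosen from the leader. */
    static inline int
    lock_group_partition(const PGPROC *leader)
    {
        return leader->pgprocno % NUM_LOCK_PARTITIONS;
    }
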
We need an additional interlock when setting these fields, because a newly
started parallel worker has to try to join the leader's lock group, but it
has no guarantee that the group leader is still alive by the time it gets
started.  We try to ensure that the parallel leader dies after all workers
in normal cases, but also that the system could survive relatively intact
if that somehow fails to happen.  This is one of the precautions against
such a scenario: the leader relays its PGPROC and also its PID to the
worker, and the worker fails to join the lock group unless the given
PGPROC still has the same PID and is still a lock group leader.  We assume
that PIDs are not recycled quickly enough for this interlock to fail.


User Locks (Advisory Locks)
---------------------------

User locks are handled totally on the application side as long-term
cooperative locks which may extend beyond the normal transaction
boundaries.  Their purpose is to indicate to an application that someone
is "working" on an item.  So it is possible to put a user lock on a
tuple's oid, retrieve the tuple, work on it for an hour and then update
it and remove the lock.  While the lock is active, other clients can
still read and write the tuple, but they can be aware that it has been
locked at the application level by someone.

User locks and normal locks are completely orthogonal and they don't
interfere with each other.

User locks can be acquired either at session level or transaction level.
A session-level lock request is not automatically released at transaction
end, but must be explicitly released by the application.  (However, any
remaining locks are always released at session end.)  Transaction-level
user lock requests behave the same as normal lock requests, in that they
are released at transaction end and do not need explicit unlocking.


Locking during Hot Standby
--------------------------

The Startup process is the only backend that can make changes during
recovery; all other backends are read-only.  As a result, the Startup
process does not acquire locks on relations or objects except when the
lock level is AccessExclusiveLock.

Regular backends are only allowed to take locks on relations or objects
at RowExclusiveLock or lower.  This ensures that they do not conflict with
each other or with the Startup process, unless AccessExclusiveLocks are
requested by the Startup process.

Deadlocks involving AccessExclusiveLocks are not possible, so we need
not be concerned that a user-initiated deadlock can prevent recovery from
progressing.

AccessExclusiveLocks on the primary node generate WAL records
that are then applied by the Startup process.  Locks are released at end
of transaction just as they are in normal processing.  These locks are
held by the Startup process, acting as a proxy for the backends that
originally acquired these locks.  Again, these locks cannot conflict with
one another, so the Startup process cannot deadlock itself either.

Although deadlock is not possible, a regular backend's weak lock can
prevent the Startup process from making progress in applying WAL, which is
usually not something that should be tolerated for very long.  Mechanisms
exist to forcibly cancel a regular backend's query if it blocks the
Startup process for too long.