summaryrefslogtreecommitdiffstats
path: root/man2/futex.2
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--man2/futex.21976
1 files changed, 1976 insertions, 0 deletions
diff --git a/man2/futex.2 b/man2/futex.2
new file mode 100644
index 0000000..43b1075
--- /dev/null
+++ b/man2/futex.2
@@ -0,0 +1,1976 @@
+.\" Page by b.hubert
+.\" and Copyright (C) 2015, Thomas Gleixner <tglx@linutronix.de>
+.\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpages@gmail.com>
+.\"
+.\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE)
+.\" may be freely modified and distributed
+.\" %%%LICENSE_END
+.\"
+.\" Niki A. Rahimi (LTC Security Development, narahimi@us.ibm.com)
+.\" added ERRORS section.
+.\"
+.\" Modified 2004-06-17 mtk
+.\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE
+.\"
+.\" FIXME Still to integrate are some points from Torvald Riegel's mail of
+.\" 2015-01-23:
+.\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977
+.\"
+.\" FIXME Do we need to add some text regarding Torvald Riegel's 2015-01-24 mail
+.\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242
+.\"
+.TH futex 2 2023-05-03 "Linux man-pages 6.05.01"
+.SH NAME
+futex \- fast user-space locking
+.SH LIBRARY
+Standard C library
+.RI ( libc ", " \-lc )
+.SH SYNOPSIS
+.nf
+.PP
+.BR "#include <linux/futex.h>" " /* Definition of " FUTEX_* " constants */"
+.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */"
+.B #include <unistd.h>
+.PP
+.BI "long syscall(SYS_futex, uint32_t *" uaddr ", int " futex_op \
+", uint32_t " val ,
+.BI " const struct timespec *" timeout , \
+" \fR /* or: \fBuint32_t \fIval2\fP */"
+.BI " uint32_t *" uaddr2 ", uint32_t " val3 );
+.fi
+.PP
+.IR Note :
+glibc provides no wrapper for
+.BR futex (),
+necessitating the use of
+.BR syscall (2).
+.SH DESCRIPTION
+The
+.BR futex ()
+system call provides a method for waiting until a certain condition becomes
+true.
+It is typically used as a blocking construct in the context of
+shared-memory synchronization.
+When using futexes, the majority of
+the synchronization operations are performed in user space.
+A user-space program employs the
+.BR futex ()
+system call only when it is likely that the program has to block for
+a longer time until the condition becomes true.
+Other
+.BR futex ()
+operations can be used to wake any processes or threads waiting
+for a particular condition.
+.PP
+A futex is a 32-bit value\[em]referred to below as a
+.IR "futex word" \[em]whose
+address is supplied to the
+.BR futex ()
+system call.
+(Futexes are 32 bits in size on all platforms, including 64-bit systems.)
+All futex operations are governed by this value.
+In order to share a futex between processes,
+the futex is placed in a region of shared memory,
+created using (for example)
+.BR mmap (2)
+or
+.BR shmat (2).
+(Thus, the futex word may have different
+virtual addresses in different processes,
+but these addresses all refer to the same location in physical memory.)
+In a multithreaded program, it is sufficient to place the futex word
+in a global variable shared by all threads.
+.PP
+When executing a futex operation that requests to block a thread,
+the kernel will block only if the futex word has the value that the
+calling thread supplied (as one of the arguments of the
+.BR futex ()
+call) as the expected value of the futex word.
+The loading of the futex word's value,
+the comparison of that value with the expected value,
+and the actual blocking will happen atomically and will be totally ordered
+with respect to concurrent operations performed by other threads
+on the same futex word.
+.\" Notes from Darren Hart (Dec 2015):
+.\" Totally ordered with respect futex operations refers to semantics
+.\" of the ACQUIRE/RELEASE operations and how they impact ordering of
+.\" memory reads and writes. The kernel futex operations are protected
+.\" by spinlocks, which ensure that all operations are serialized
+.\" with respect to one another.
+.\"
+.\" This is a lot to attempt to define in this document. Perhaps a
+.\" reference to linux/Documentation/memory-barriers.txt as a footnote
+.\" would be sufficient? Or perhaps for this manual, "serialized" would
+.\" be sufficient, with a footnote regarding "totally ordered" and a
+.\" pointer to the memory-barrier documentation?
+Thus, the futex word is used to connect the synchronization in user space
+with the implementation of blocking by the kernel.
+Analogously to an atomic
+compare-and-exchange operation that potentially changes shared memory,
+blocking via a futex is an atomic compare-and-block operation.
+.\" FIXME(Torvald Riegel):
+.\" Eventually we want to have some text in NOTES to satisfy
+.\" the reference in the following sentence
+.\" See NOTES for a detailed specification of
+.\" the synchronization semantics.
+.PP
+One use of futexes is for implementing locks.
+The state of the lock (i.e., acquired or not acquired)
+can be represented as an atomically accessed flag in shared memory.
+In the uncontended case,
+a thread can access or modify the lock state with atomic instructions,
+for example atomically changing it from not acquired to acquired
+using an atomic compare-and-exchange instruction.
+(Such instructions are performed entirely in user mode,
+and the kernel maintains no information about the lock state.)
+On the other hand, a thread may be unable to acquire a lock because
+it is already acquired by another thread.
+It then may pass the lock's flag as a futex word and the value
+representing the acquired state as the expected value to a
+.BR futex ()
+wait operation.
+This
+.BR futex ()
+operation will block if and only if the lock is still acquired
+(i.e., the value in the futex word still matches the "acquired state").
+When releasing the lock, a thread has to first reset the
+lock state to not acquired and then execute a futex
+operation that wakes threads blocked on the lock flag used as a futex word
+(this can be further optimized to avoid unnecessary wake-ups).
+See
+.BR futex (7)
+for more detail on how to use futexes.
+.PP
+Besides the basic wait and wake-up futex functionality, there are further
+futex operations aimed at supporting more complex use cases.
+.PP
+Note that
+no explicit initialization or destruction is necessary to use futexes;
+the kernel maintains a futex
+(i.e., the kernel-internal implementation artifact)
+only while operations such as
+.BR FUTEX_WAIT ,
+described below, are being performed on a particular futex word.
+.\"
+.SS Arguments
+The
+.I uaddr
+argument points to the futex word.
+On all platforms, futexes are four-byte
+integers that must be aligned on a four-byte boundary.
+The operation to perform on the futex is specified in the
+.I futex_op
+argument;
+.I val
+is a value whose meaning and purpose depends on
+.IR futex_op .
+.PP
+The remaining arguments
+.RI ( timeout ,
+.IR uaddr2 ,
+and
+.IR val3 )
+are required only for certain of the futex operations described below.
+Where one of these arguments is not required, it is ignored.
+.PP
+For several blocking operations, the
+.I timeout
+argument is a pointer to a
+.I timespec
+structure that specifies a timeout for the operation.
+However, notwithstanding the prototype shown above, for some operations,
+the least significant four bytes of this argument are instead
+used as an integer whose meaning is determined by the operation.
+For these operations, the kernel casts the
+.I timeout
+value first to
+.IR "unsigned long",
+then to
+.IR uint32_t ,
+and in the remainder of this page, this argument is referred to as
+.I val2
+when interpreted in this fashion.
+.PP
+Where it is required, the
+.I uaddr2
+argument is a pointer to a second futex word that is employed
+by the operation.
+.PP
+The interpretation of the final integer argument,
+.IR val3 ,
+depends on the operation.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.SS Futex operations
+The
+.I futex_op
+argument consists of two parts:
+a command that specifies the operation to be performed,
+bitwise ORed with zero or more options that
+modify the behaviour of the operation.
+The options that may be included in
+.I futex_op
+are as follows:
+.TP
+.BR FUTEX_PRIVATE_FLAG " (since Linux 2.6.22)"
+.\" commit 34f01cc1f512fa783302982776895c73714ebbc2
+This option bit can be employed with all futex operations.
+It tells the kernel that the futex is process-private and not shared
+with another process (i.e., it is being used for synchronization
+only between threads of the same process).
+This allows the kernel to make some additional performance optimizations.
+.\" I.e., It allows the kernel choose the fast path for validating
+.\" the user-space address and avoids expensive VMA lookups,
+.\" taking reference counts on file backing store, and so on.
+.IP
+As a convenience,
+.I <linux/futex.h>
+defines a set of constants with the suffix
+.B _PRIVATE
+that are equivalents of all of the operations listed below,
+.\" except the obsolete FUTEX_FD, for which the "private" flag was
+.\" meaningless
+but with the
+.B FUTEX_PRIVATE_FLAG
+ORed into the constant value.
+Thus, there are
+.BR FUTEX_WAIT_PRIVATE ,
+.BR FUTEX_WAKE_PRIVATE ,
+and so on.
+.TP
+.BR FUTEX_CLOCK_REALTIME " (since Linux 2.6.28)"
+.\" commit 1acdac104668a0834cfa267de9946fac7764d486
+This option bit can be employed only with the
+.BR FUTEX_WAIT_BITSET ,
+.BR FUTEX_WAIT_REQUEUE_PI ,
+(since Linux 4.5)
+.\" commit 337f13046ff03717a9e99675284a817527440a49
+.BR FUTEX_WAIT ,
+and
+(since Linux 5.14)
+.\" commit bf22a6976897977b0a3f1aeba6823c959fc4fdae
+.B FUTEX_LOCK_PI2
+operations.
+.IP
+If this option is set, the kernel measures the
+.I timeout
+against the
+.B CLOCK_REALTIME
+clock.
+.IP
+If this option is not set, the kernel measures the
+.I timeout
+against the
+.B CLOCK_MONOTONIC
+clock.
+.PP
+The operation specified in
+.I futex_op
+is one of the following:
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_WAIT " (since Linux 2.6.0)"
+.\" Strictly speaking, since some time in Linux 2.5.x
+This operation tests that the value at the
+futex word pointed to by the address
+.I uaddr
+still contains the expected value
+.IR val ,
+and if so, then sleeps waiting for a
+.B FUTEX_WAKE
+operation on the futex word.
+The load of the value of the futex word is an atomic memory
+access (i.e., using atomic machine instructions of the respective
+architecture).
+This load, the comparison with the expected value, and
+starting to sleep are performed atomically
+.\" FIXME: Torvald, I think we may need to add some explanation of
+.\" "totally ordered" here.
+and totally ordered
+with respect to other futex operations on the same futex word.
+If the thread starts to sleep,
+it is considered a waiter on this futex word.
+If the futex value does not match
+.IR val ,
+then the call fails immediately with the error
+.BR EAGAIN .
+.IP
+The purpose of the comparison with the expected value is to prevent lost
+wake-ups.
+If another thread changed the value of the futex word after the
+calling thread decided to block based on the prior value,
+and if the other thread executed a
+.B FUTEX_WAKE
+operation (or similar wake-up) after the value change and before this
+.B FUTEX_WAIT
+operation, then the calling thread will observe the
+value change and will not start to sleep.
+.IP
+If the
+.I timeout
+is not NULL, the structure it points to specifies a
+timeout for the wait.
+(This interval will be rounded up to the system clock granularity,
+and is guaranteed not to expire early.)
+The timeout is by default measured according to the
+.B CLOCK_MONOTONIC
+clock, but, since Linux 4.5, the
+.B CLOCK_REALTIME
+clock can be selected by specifying
+.B FUTEX_CLOCK_REALTIME
+in
+.IR futex_op .
+If
+.I timeout
+is NULL, the call blocks indefinitely.
+.IP
+.IR Note :
+for
+.BR FUTEX_WAIT ,
+.I timeout
+is interpreted as a
+.I relative
+value.
+This differs from other futex operations, where
+.I timeout
+is interpreted as an absolute value.
+To obtain the equivalent of
+.B FUTEX_WAIT
+with an absolute timeout, employ
+.B FUTEX_WAIT_BITSET
+with
+.I val3
+specified as
+.BR FUTEX_BITSET_MATCH_ANY .
+.IP
+The arguments
+.I uaddr2
+and
+.I val3
+are ignored.
+.\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to a
+.\" different example.
+.\"
+.\" For
+.\" .BR futex (7),
+.\" this call is executed if decrementing the count gave a negative value
+.\" (indicating contention),
+.\" and will sleep until another process or thread releases
+.\" the futex and executes the
+.\" .B FUTEX_WAKE
+.\" operation.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_WAKE " (since Linux 2.6.0)"
+.\" Strictly speaking, since Linux 2.5.x
+This operation wakes at most
+.I val
+of the waiters that are waiting (e.g., inside
+.BR FUTEX_WAIT )
+on the futex word at the address
+.IR uaddr .
+Most commonly,
+.I val
+is specified as either 1 (wake up a single waiter) or
+.B INT_MAX
+(wake up all waiters).
+No guarantee is provided about which waiters are awoken
+(e.g., a waiter with a higher scheduling priority is not guaranteed
+to be awoken in preference to a waiter with a lower priority).
+.IP
+The arguments
+.IR timeout ,
+.IR uaddr2 ,
+and
+.I val3
+are ignored.
+.\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to
+.\" a different example.
+.\"
+.\" For
+.\" .BR futex (7),
+.\" this is executed if incrementing the count showed that
+.\" there were waiters,
+.\" once the futex value has been set to 1
+.\" (indicating that it is available).
+.\"
+.\" How does "incrementing the count show that there were waiters"?
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_FD " (from Linux 2.6.0 up to and including Linux 2.6.25)"
+.\" Strictly speaking, from Linux 2.5.x to Linux 2.6.25
+This operation creates a file descriptor that is associated with
+the futex at
+.IR uaddr .
+The caller must close the returned file descriptor after use.
+When another process or thread performs a
+.B FUTEX_WAKE
+on the futex word, the file descriptor indicates as being readable with
+.BR select (2),
+.BR poll (2),
+and
+.BR epoll (7)
+.IP
+The file descriptor can be used to obtain asynchronous notifications: if
+.I val
+is nonzero, then, when another process or thread executes a
+.BR FUTEX_WAKE ,
+the caller will receive the signal number that was passed in
+.IR val .
+.IP
+The arguments
+.IR timeout ,
+.IR uaddr2 ,
+and
+.I val3
+are ignored.
+.IP
+Because it was inherently racy,
+.B FUTEX_FD
+has been removed
+.\" commit 82af7aca56c67061420d618cc5a30f0fd4106b80
+from Linux 2.6.26 onward.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_REQUEUE " (since Linux 2.6.0)"
+This operation performs the same task as
+.B FUTEX_CMP_REQUEUE
+(see below), except that no check is made using the value in
+.IR val3 .
+(The argument
+.I val3
+is ignored.)
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)"
+This operation first checks whether the location
+.I uaddr
+still contains the value
+.IR val3 .
+If not, the operation fails with the error
+.BR EAGAIN .
+Otherwise, the operation wakes up a maximum of
+.I val
+waiters that are waiting on the futex at
+.IR uaddr .
+If there are more than
+.I val
+waiters, then the remaining waiters are removed
+from the wait queue of the source futex at
+.I uaddr
+and added to the wait queue of the target futex at
+.IR uaddr2 .
+The
+.I val2
+argument specifies an upper limit on the number of waiters
+that are requeued to the futex at
+.IR uaddr2 .
+.IP
+.\" FIXME(Torvald) Is the following correct? Or is just the decision
+.\" which threads to wake or requeue part of the atomic operation?
+The load from
+.I uaddr
+is an atomic memory access (i.e., using atomic machine instructions of
+the respective architecture).
+This load, the comparison with
+.IR val3 ,
+and the requeueing of any waiters are performed atomically and totally
+ordered with respect to other operations on the same futex word.
+.\" Notes from a f2f conversation with Thomas Gleixner (Aug 2015): ###
+.\" The operation is serialized with respect to operations on both
+.\" source and target futex. No other waiter can enqueue itself
+.\" for waiting and no other waiter can dequeue itself because of
+.\" a timeout or signal.
+.IP
+Typical values to specify for
+.I val
+are 0 or 1.
+(Specifying
+.B INT_MAX
+is not useful, because it would make the
+.B FUTEX_CMP_REQUEUE
+operation equivalent to
+.BR FUTEX_WAKE .)
+The limit value specified via
+.I val2
+is typically either 1 or
+.BR INT_MAX .
+(Specifying the argument as 0 is not useful, because it would make the
+.B FUTEX_CMP_REQUEUE
+operation equivalent to
+.BR FUTEX_WAIT .)
+.IP
+The
+.B FUTEX_CMP_REQUEUE
+operation was added as a replacement for the earlier
+.BR FUTEX_REQUEUE .
+The difference is that the check of the value at
+.I uaddr
+can be used to ensure that requeueing happens only under certain
+conditions, which allows race conditions to be avoided in certain use cases.
+.\" But, as Rich Felker points out, there remain valid use cases for
+.\" FUTEX_REQUEUE, for example, when the calling thread is requeuing
+.\" the target(s) to a lock that the calling thread owns
+.\" From: Rich Felker <dalias@libc.org>
+.\" Date: Wed, 29 Oct 2014 22:43:17 -0400
+.\" To: Darren Hart <dvhart@infradead.org>
+.\" CC: libc-alpha@sourceware.org, ...
+.\" Subject: Re: Add futex wrapper to glibc?
+.IP
+Both
+.B FUTEX_REQUEUE
+and
+.B FUTEX_CMP_REQUEUE
+can be used to avoid "thundering herd" wake-ups that could occur when using
+.B FUTEX_WAKE
+in cases where all of the waiters that are woken need to acquire
+another futex.
+Consider the following scenario,
+where multiple waiter threads are waiting on B,
+a wait queue implemented using a futex:
+.IP
+.in +4n
+.EX
+lock(A)
+while (!check_value(V)) {
+ unlock(A);
+ block_on(B);
+ lock(A);
+};
+unlock(A);
+.EE
+.in
+.IP
+If a waker thread used
+.BR FUTEX_WAKE ,
+then all waiters waiting on B would be woken up,
+and they would all try to acquire lock A.
+However, waking all of the threads in this manner would be pointless because
+all except one of the threads would immediately block on lock A again.
+By contrast, a requeue operation wakes just one waiter and moves
+the other waiters to lock A,
+and when the woken waiter unlocks A then the next waiter can proceed.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_WAKE_OP " (since Linux 2.6.14)"
+.\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721
+.\" Author: Jakub Jelinek <jakub@redhat.com>
+.\" Date: Tue Sep 6 15:16:25 2005 -0700
+.\" FIXME. (Torvald) The glibc condvar implementation is currently being
+.\" revised (e.g., to not use an internal lock anymore).
+.\" It is probably more future-proof to remove this paragraph.
+.\" [Torvald, do you have an update here?]
+This operation was added to support some user-space use cases
+where more than one futex must be handled at the same time.
+The most notable example is the implementation of
+.BR pthread_cond_signal (3),
+which requires operations on two futexes,
+the one used to implement the mutex and the one used in the implementation
+of the wait queue associated with the condition variable.
+.B FUTEX_WAKE_OP
+allows such cases to be implemented without leading to
+high rates of contention and context switching.
+.IP
+The
+.B FUTEX_WAKE_OP
+operation is equivalent to executing the following code atomically
+and totally ordered with respect to other futex operations on
+any of the two supplied futex words:
+.IP
+.in +4n
+.EX
+uint32_t oldval = *(uint32_t *) uaddr2;
+*(uint32_t *) uaddr2 = oldval \fIop\fP \fIoparg\fP;
+futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
+if (oldval \fIcmp\fP \fIcmparg\fP)
+ futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
+.EE
+.in
+.IP
+In other words,
+.B FUTEX_WAKE_OP
+does the following:
+.RS
+.IP \[bu] 3
+saves the original value of the futex word at
+.I uaddr2
+and performs an operation to modify the value of the futex at
+.IR uaddr2 ;
+this is an atomic read-modify-write memory access (i.e., using atomic
+machine instructions of the respective architecture)
+.IP \[bu]
+wakes up a maximum of
+.I val
+waiters on the futex for the futex word at
+.IR uaddr ;
+and
+.IP \[bu]
+dependent on the results of a test of the original value of the
+futex word at
+.IR uaddr2 ,
+wakes up a maximum of
+.I val2
+waiters on the futex for the futex word at
+.IR uaddr2 .
+.RE
+.IP
+The operation and comparison that are to be performed are encoded
+in the bits of the argument
+.IR val3 .
+Pictorially, the encoding is:
+.IP
+.in +4n
+.EX
++---+---+-----------+-----------+
+|op |cmp| oparg | cmparg |
++---+---+-----------+-----------+
+ 4 4 12 12 <== # of bits
+.EE
+.in
+.IP
+Expressed in code, the encoding is:
+.IP
+.in +4n
+.EX
+#define FUTEX_OP(op, oparg, cmp, cmparg) \e
+ (((op & 0xf) << 28) | \e
+ ((cmp & 0xf) << 24) | \e
+ ((oparg & 0xfff) << 12) | \e
+ (cmparg & 0xfff))
+.EE
+.in
+.IP
+In the above,
+.I op
+and
+.I cmp
+are each one of the codes listed below.
+The
+.I oparg
+and
+.I cmparg
+components are literal numeric values, except as noted below.
+.IP
+The
+.I op
+component has one of the following values:
+.IP
+.in +4n
+.EX
+FUTEX_OP_SET 0 /* uaddr2 = oparg; */
+FUTEX_OP_ADD 1 /* uaddr2 += oparg; */
+FUTEX_OP_OR 2 /* uaddr2 |= oparg; */
+FUTEX_OP_ANDN 3 /* uaddr2 &= \[ti]oparg; */
+FUTEX_OP_XOR 4 /* uaddr2 \[ha]= oparg; */
+.EE
+.in
+.IP
+In addition, bitwise ORing the following value into
+.I op
+causes
+.I (1\~<<\~oparg)
+to be used as the operand:
+.IP
+.in +4n
+.EX
+FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */
+.EE
+.in
+.IP
+The
+.I cmp
+field is one of the following:
+.IP
+.in +4n
+.EX
+FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */
+FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */
+FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */
+FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */
+FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */
+FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */
+.EE
+.in
+.IP
+The return value of
+.B FUTEX_WAKE_OP
+is the sum of the number of waiters woken on the futex
+.I uaddr
+plus the number of waiters woken on the futex
+.IR uaddr2 .
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_WAIT_BITSET " (since Linux 2.6.25)"
+.\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d
+This operation is like
+.B FUTEX_WAIT
+except that
+.I val3
+is used to provide a 32-bit bit mask to the kernel.
+This bit mask, in which at least one bit must be set,
+is stored in the kernel-internal state of the waiter.
+See the description of
+.B FUTEX_WAKE_BITSET
+for further details.
+.IP
+If
+.I timeout
+is not NULL, the structure it points to specifies
+an absolute timeout for the wait operation.
+If
+.I timeout
+is NULL, the operation can block indefinitely.
+.IP
+The
+.I uaddr2
+argument is ignored.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_WAKE_BITSET " (since Linux 2.6.25)"
+.\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d
+This operation is the same as
+.B FUTEX_WAKE
+except that the
+.I val3
+argument is used to provide a 32-bit bit mask to the kernel.
+This bit mask, in which at least one bit must be set,
+is used to select which waiters should be woken up.
+The selection is done by a bitwise AND of the "wake" bit mask
+(i.e., the value in
+.IR val3 )
+and the bit mask which is stored in the kernel-internal
+state of the waiter (the "wait" bit mask that is set using
+.BR FUTEX_WAIT_BITSET ).
+All of the waiters for which the result of the AND is nonzero are woken up;
+the remaining waiters are left sleeping.
+.IP
+The effect of
+.B FUTEX_WAIT_BITSET
+and
+.B FUTEX_WAKE_BITSET
+is to allow selective wake-ups among multiple waiters that are blocked
+on the same futex.
+However, note that, depending on the use case,
+employing this bit-mask multiplexing feature on a
+futex can be less efficient than simply using multiple futexes,
+because employing bit-mask multiplexing requires the kernel
+to check all waiters on a futex,
+including those that are not interested in being woken up
+(i.e., they do not have the relevant bit set in their "wait" bit mask).
+.\" According to http://locklessinc.com/articles/futex_cheat_sheet/:
+.\"
+.\" "The original reason for the addition of these extensions
+.\" was to improve the performance of pthread read-write locks
+.\" in glibc. However, the pthreads library no longer uses the
+.\" same locking algorithm, and these extensions are not used
+.\" without the bitset parameter being all ones.
+.\"
+.\" The page goes on to note that the FUTEX_WAIT_BITSET operation
+.\" is nevertheless used (with a bit mask of all ones) in order to
+.\" obtain the absolute timeout functionality that is useful
+.\" for efficiently implementing Pthreads APIs (which use absolute
+.\" timeouts); FUTEX_WAIT provides only relative timeouts.
+.IP
+The constant
+.BR FUTEX_BITSET_MATCH_ANY ,
+which corresponds to all 32 bits set in the bit mask, can be used as the
+.I val3
+argument for
+.B FUTEX_WAIT_BITSET
+and
+.BR FUTEX_WAKE_BITSET .
+Other than differences in the handling of the
+.I timeout
+argument, the
+.B FUTEX_WAIT
+operation is equivalent to
+.B FUTEX_WAIT_BITSET
+with
+.I val3
+specified as
+.BR FUTEX_BITSET_MATCH_ANY ;
+that is, allow a wake-up by any waker.
+The
+.B FUTEX_WAKE
+operation is equivalent to
+.B FUTEX_WAKE_BITSET
+with
+.I val3
+specified as
+.BR FUTEX_BITSET_MATCH_ANY ;
+that is, wake up any waiter(s).
+.IP
+The
+.I uaddr2
+and
+.I timeout
+arguments are ignored.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.SS Priority-inheritance futexes
+Linux supports priority-inheritance (PI) futexes in order to handle
+priority-inversion problems that can be encountered with
+normal futex locks.
+Priority inversion is the problem that occurs when a high-priority
+task is blocked waiting to acquire a lock held by a low-priority task,
+while tasks at an intermediate priority continuously preempt
+the low-priority task from the CPU.
+Consequently, the low-priority task makes no progress toward
+releasing the lock, and the high-priority task remains blocked.
+.PP
+Priority inheritance is a mechanism for dealing with
+the priority-inversion problem.
+With this mechanism, when a high-priority task becomes blocked
+by a lock held by a low-priority task,
+the priority of the low-priority task is temporarily raised
+to that of the high-priority task,
+so that it is not preempted by any intermediate level tasks,
+and can thus make progress toward releasing the lock.
+To be effective, priority inheritance must be transitive,
+meaning that if a high-priority task blocks on a lock
+held by a lower-priority task that is itself blocked by a lock
+held by another intermediate-priority task
+(and so on, for chains of arbitrary length),
+then both of those tasks
+(or more generally, all of the tasks in a lock chain)
+have their priorities raised to be the same as the high-priority task.
+.PP
+From a user-space perspective,
+what makes a futex PI-aware is a policy agreement (described below)
+between user space and the kernel about the value of the futex word,
+coupled with the use of the PI-futex operations described below.
+(Unlike the other futex operations described above,
+the PI-futex operations are designed
+for the implementation of very specific IPC mechanisms.)
+.\"
+.\" Quoting Darren Hart:
+.\" These opcodes paired with the PI futex value policy (described below)
+.\" defines a "futex" as PI aware. These were created very specifically
+.\" in support of PI pthread_mutexes, so it makes a lot more sense to
+.\" talk about a PI aware pthread_mutex, than a PI aware futex, since
+.\" there is a lot of policy and scaffolding that has to be built up
+.\" around it to use it properly (this is what a PI pthread_mutex is).
+.PP
+.\" mtk: The following text is drawn from the Hart/Guniguntala paper
+.\" (listed in SEE ALSO), but I have reworded some pieces
+.\" significantly.
+.\"
+The PI-futex operations described below differ from the other
+futex operations in that they impose policy on the use of the value of the
+futex word:
+.IP \[bu] 3
+If the lock is not acquired, the futex word's value shall be 0.
+.IP \[bu]
+If the lock is acquired, the futex word's value shall
+be the thread ID (TID;
+see
+.BR gettid (2))
+of the owning thread.
+.IP \[bu]
+If the lock is owned and there are threads contending for the lock,
+then the
+.B FUTEX_WAITERS
+bit shall be set in the futex word's value; in other words, this value is:
+.IP
+.in +4n
+.EX
+FUTEX_WAITERS | TID
+.EE
+.in
+.IP
+(Note that is invalid for a PI futex word to have no owner and
+.B FUTEX_WAITERS
+set.)
+.PP
+With this policy in place,
+a user-space application can acquire an unacquired
+lock or release a lock using atomic instructions executed in user mode
+(e.g., a compare-and-swap operation such as
+.I cmpxchg
+on the x86 architecture).
+Acquiring a lock simply consists of using compare-and-swap to atomically
+set the futex word's value to the caller's TID if its previous value was 0.
+Releasing a lock requires using compare-and-swap to set the futex word's
+value to 0 if the previous value was the expected TID.
+.PP
+If a futex is already acquired (i.e., has a nonzero value),
+waiters must employ the
+.B FUTEX_LOCK_PI
+or
+.B FUTEX_LOCK_PI2
+operations to acquire the lock.
+If other threads are waiting for the lock, then the
+.B FUTEX_WAITERS
+bit is set in the futex value;
+in this case, the lock owner must employ the
+.B FUTEX_UNLOCK_PI
+operation to release the lock.
+.PP
+In the cases where callers are forced into the kernel
+(i.e., required to perform a
+.BR futex ()
+call),
+they then deal directly with a so-called RT-mutex,
+a kernel locking mechanism which implements the required
+priority-inheritance semantics.
+After the RT-mutex is acquired, the futex value is updated accordingly,
+before the calling thread returns to user space.
+.PP
+It is important to note
+.\" tglx (July 2015):
+.\" If there are multiple waiters on a pi futex then a wake pi operation
+.\" will wake the first waiter and hand over the lock to this waiter. This
+.\" includes handing over the rtmutex which represents the futex in the
+.\" kernel. The strict requirement is that the futex owner and the rtmutex
+.\" owner must be the same, except for the update period which is
+.\" serialized by the futex internal locking. That means the kernel must
+.\" update the user-space value prior to returning to user space
+that the kernel will update the futex word's value prior
+to returning to user space.
+(This prevents the possibility of the futex word's value ending
+up in an invalid state, such as having an owner but the value being 0,
+or having waiters but not having the
+.B FUTEX_WAITERS
+bit set.)
+.PP
+If a futex has an associated RT-mutex in the kernel
+(i.e., there are blocked waiters)
+and the owner of the futex/RT-mutex dies unexpectedly,
+then the kernel cleans up the RT-mutex and hands it over to the next waiter.
+This in turn requires that the user-space value is updated accordingly.
+To indicate that this is required, the kernel sets the
+.B FUTEX_OWNER_DIED
+bit in the futex word along with the thread ID of the new owner.
+User space can detect this situation via the presence of the
+.B FUTEX_OWNER_DIED
+bit and is then responsible for cleaning up the stale state left over by
+the dead owner.
+.\" tglx (July 2015):
+.\" The FUTEX_OWNER_DIED bit can also be set on uncontended futexes, where
+.\" the kernel has no state associated. This happens via the robust futex
+.\" mechanism. In that case the futex value will be set to
+.\" FUTEX_OWNER_DIED. The robust futex mechanism is also available for non
+.\" PI futexes.
+.PP
+PI futexes are operated on by specifying one of the values listed below in
+.IR futex_op .
+Note that the PI futex operations must be used as paired operations
+and are subject to some additional requirements:
+.IP \[bu] 3
+.BR FUTEX_LOCK_PI ,
+.BR FUTEX_LOCK_PI2 ,
+and
+.B FUTEX_TRYLOCK_PI
+pair with
+.BR FUTEX_UNLOCK_PI .
+.B FUTEX_UNLOCK_PI
+must be called only on a futex owned by the calling thread,
+as defined by the value policy, otherwise the error
+.B EPERM
+results.
+.IP \[bu]
+.B FUTEX_WAIT_REQUEUE_PI
+pairs with
+.BR FUTEX_CMP_REQUEUE_PI .
+This must be performed from a non-PI futex to a distinct PI futex
+(or the error
+.B EINVAL
+results).
+Additionally,
+.I val
+(the number of waiters to be woken) must be 1
+(or the error
+.B EINVAL
+results).
+.PP
+The PI futex operations are as follows:
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_LOCK_PI " (since Linux 2.6.18)"
+.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
+This operation is used after an attempt to acquire
+the lock via an atomic user-mode instruction failed
+because the futex word has a nonzero value\[em]specifically,
+because it contained the (PID-namespace-specific) TID of the lock owner.
+.IP
+The operation checks the value of the futex word at the address
+.IR uaddr .
+If the value is 0, then the kernel tries to atomically set
+the futex value to the caller's TID.
+If the futex word's value is nonzero,
+the kernel atomically sets the
+.B FUTEX_WAITERS
+bit, which signals the futex owner that it cannot unlock the futex in
+user space atomically by setting the futex value to 0.
+.\" tglx (July 2015):
+.\" The operation here is similar to the FUTEX_WAIT logic. When the user
+.\" space atomic acquire does not succeed because the futex value was non
+.\" zero, then the waiter goes into the kernel, takes the kernel internal
+.\" lock and retries the acquisition under the lock. If the acquisition
+.\" does not succeed either, then it sets the FUTEX_WAITERS bit, to signal
+.\" the lock owner that it needs to go into the kernel. Here is the pseudo
+.\" code:
+.\"
+.\" lock(kernel_lock);
+.\" retry:
+.\"
+.\" /*
+.\" * Owner might have unlocked in user space before we
+.\" * were able to set the waiter bit.
+.\" */
+.\" if (atomic_acquire(futex) == SUCCESS) {
+.\" unlock(kernel_lock());
+.\" return 0;
+.\" }
+.\"
+.\" /*
+.\" * Owner might have unlocked after the above atomic_acquire()
+.\" * attempt.
+.\" */
+.\" if (atomic_set_waiters_bit(futex) != SUCCESS)
+.\" goto retry;
+.\"
+.\" queue_waiter();
+.\" unlock(kernel_lock);
+.\" block();
+.\"
+After that, the kernel:
+.RS
+.IP (1) 5
+Tries to find the thread which is associated with the owner TID.
+.IP (2)
+Creates or reuses kernel state on behalf of the owner.
+(If this is the first waiter, there is no kernel state for this
+futex, so kernel state is created by locking the RT-mutex
+and the futex owner is made the owner of the RT-mutex.
+If there are existing waiters, then the existing state is reused.)
+.IP (3)
+Attaches the waiter to the futex
+(i.e., the waiter is enqueued on the RT-mutex waiter list).
+.RE
+.IP
+If more than one waiter exists,
+the enqueueing of the waiter is in descending priority order.
+(For information on priority ordering, see the discussion of the
+.BR SCHED_DEADLINE ,
+.BR SCHED_FIFO ,
+and
+.B SCHED_RR
+scheduling policies in
+.BR sched (7).)
+The owner inherits either the waiter's CPU bandwidth
+(if the waiter is scheduled under the
+.B SCHED_DEADLINE
+policy) or the waiter's priority (if the waiter is scheduled under the
+.B SCHED_RR
+or
+.B SCHED_FIFO
+policy).
+.\" August 2015:
+.\" mtk: If the realm is restricted purely to SCHED_OTHER (SCHED_NORMAL)
+.\" processes, does the nice value come into play also?
+.\"
+.\" tglx: No. SCHED_OTHER/NORMAL tasks are handled in FIFO order
+This inheritance follows the lock chain in the case of nested locking
+.\" (i.e., task 1 blocks on lock A, held by task 2,
+.\" while task 2 blocks on lock B, held by task 3)
+and performs deadlock detection.
+.IP
+The
+.I timeout
+argument provides a timeout for the lock attempt.
+If
+.I timeout
+is not NULL, the structure it points to specifies
+an absolute timeout, measured against the
+.B CLOCK_REALTIME
+clock.
+.\" 2016-07-07 response from Thomas Gleixner on LKML:
+.\" From: Thomas Gleixner <tglx@linutronix.de>
+.\" Date: 6 July 2016 at 20:57
+.\" Subject: Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
+.\"
+.\" On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote:
+.\" > On 06/23/2016 08:28 PM, Darren Hart wrote:
+.\" > > And as a follow-on, what is the reason for FUTEX_LOCK_PI only using
+.\" > > CLOCK_REALTIME? It seems reasonable to me that a user may want to wait a
+.\" > > specific amount of time, regardless of wall time.
+.\" >
+.\" > Yes, that's another weird inconsistency.
+.\"
+.\" The reason is that phtread_mutex_timedlock() uses absolute timeouts based on
+.\" CLOCK_REALTIME. glibc folks asked to make that the default behaviour back
+.\" then when we added LOCK_PI.
+If
+.I timeout
+is NULL, the operation will block indefinitely.
+.IP
+The
+.IR uaddr2 ,
+.IR val ,
+and
+.I val3
+arguments are ignored.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_LOCK_PI2 " (since Linux 5.14)"
+.\" commit bf22a6976897977b0a3f1aeba6823c959fc4fdae
+This operation is the same as
+.BR FUTEX_LOCK_PI ,
+except that the clock against which
+.I timeout
+is measured is selectable.
+By default, the (absolute) timeout specified in
+.I timeout
+is measured against the
+.B CLOCK_MONOTONIC
+clock, but if the
+.B FUTEX_CLOCK_REALTIME
+flag is specified in
+.IR futex_op ,
+then the timeout is measured against the
+.B CLOCK_REALTIME
+clock.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_TRYLOCK_PI " (since Linux 2.6.18)"
+.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
+This operation tries to acquire the lock at
+.IR uaddr .
+It is invoked when a user-space atomic acquire did not
+succeed because the futex word was not 0.
+.IP
+Because the kernel has access to more state information than user space,
+acquisition of the lock might succeed if performed by the
+kernel in cases where the futex word
+(i.e., the state information accessible to use-space) contains stale state
+.RB ( FUTEX_WAITERS
+and/or
+.BR FUTEX_OWNER_DIED ).
+This can happen when the owner of the futex died.
+User space cannot handle this condition in a race-free manner,
+but the kernel can fix this up and acquire the futex.
+.\" Paraphrasing a f2f conversation with Thomas Gleixner about the
+.\" above point (Aug 2015): ###
+.\" There is a rare possibility of a race condition involving an
+.\" uncontended futex with no owner, but with waiters. The
+.\" kernel-user-space contract is that if a futex is nonzero, you must
+.\" go into kernel. The futex was owned by a task, and that task dies
+.\" but there are no waiters, so the futex value is non zero.
+.\" Therefore, the next locker has to go into the kernel,
+.\" so that the kernel has a chance to clean up. (CMXCH on zero
+.\" in user space would fail, so kernel has to clean up.)
+.\" Darren Hart (Oct 2015):
+.\" The trylock in the kernel has more state, so it can independently
+.\" verify the flags that user space must trust implicitly.
+.IP
+The
+.IR uaddr2 ,
+.IR val ,
+.IR timeout ,
+and
+.I val3
+arguments are ignored.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_UNLOCK_PI " (since Linux 2.6.18)"
+.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
+This operation wakes the top priority waiter that is waiting in
+.B FUTEX_LOCK_PI
+or
+.B FUTEX_LOCK_PI2
+on the futex address provided by the
+.I uaddr
+argument.
+.IP
+This is called when the user-space value at
+.I uaddr
+cannot be changed atomically from a TID (of the owner) to 0.
+.IP
+The
+.IR uaddr2 ,
+.IR val ,
+.IR timeout ,
+and
+.I val3
+arguments are ignored.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_CMP_REQUEUE_PI " (since Linux 2.6.31)"
+.\" commit 52400ba946759af28442dee6265c5c0180ac7122
+This operation is a PI-aware variant of
+.BR FUTEX_CMP_REQUEUE .
+It requeues waiters that are blocked via
+.B FUTEX_WAIT_REQUEUE_PI
+on
+.I uaddr
+from a non-PI source futex
+.RI ( uaddr )
+to a PI target futex
+.RI ( uaddr2 ).
+.IP
+As with
+.BR FUTEX_CMP_REQUEUE ,
+this operation wakes up a maximum of
+.I val
+waiters that are waiting on the futex at
+.IR uaddr .
+However, for
+.BR FUTEX_CMP_REQUEUE_PI ,
+.I val
+is required to be 1
+(since the main point is to avoid a thundering herd).
+The remaining waiters are removed from the wait queue of the source futex at
+.I uaddr
+and added to the wait queue of the target futex at
+.IR uaddr2 .
+.IP
+The
+.I val2
+.\" val2 is the cap on the number of requeued waiters.
+.\" In the glibc pthread_cond_broadcast() implementation, this argument
+.\" is specified as INT_MAX, and for pthread_cond_signal() it is 0.
+and
+.I val3
+arguments serve the same purposes as for
+.BR FUTEX_CMP_REQUEUE .
+.\"
+.\" The page at http://locklessinc.com/articles/futex_cheat_sheet/
+.\" notes that "priority-inheritance Futex to priority-inheritance
+.\" Futex requeues are currently unsupported". However, probably
+.\" the page does not need to say nothing about this, since
+.\" Thomas Gleixner commented (July 2015): "they never will be
+.\" supported because they make no sense at all"
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.TP
+.BR FUTEX_WAIT_REQUEUE_PI " (since Linux 2.6.31)"
+.\" commit 52400ba946759af28442dee6265c5c0180ac7122
+.\"
+Wait on a non-PI futex at
+.I uaddr
+and potentially be requeued (via a
+.B FUTEX_CMP_REQUEUE_PI
+operation in another task) onto a PI futex at
+.IR uaddr2 .
+The wait operation on
+.I uaddr
+is the same as for
+.BR FUTEX_WAIT .
+.IP
+The waiter can be removed from the wait on
+.I uaddr
+without requeueing on
+.I uaddr2
+via a
+.B FUTEX_WAKE
+operation in another task.
+In this case, the
+.B FUTEX_WAIT_REQUEUE_PI
+operation fails with the error
+.BR EAGAIN .
+.IP
+If
+.I timeout
+is not NULL, the structure it points to specifies
+an absolute timeout for the wait operation.
+If
+.I timeout
+is NULL, the operation can block indefinitely.
+.IP
+The
+.I val3
+argument is ignored.
+.IP
+The
+.B FUTEX_WAIT_REQUEUE_PI
+and
+.B FUTEX_CMP_REQUEUE_PI
+were added to support a fairly specific use case:
+support for priority-inheritance-aware POSIX threads condition variables.
+The idea is that these operations should always be paired,
+in order to ensure that user space and the kernel remain in sync.
+Thus, in the
+.B FUTEX_WAIT_REQUEUE_PI
+operation, the user-space application pre-specifies the target
+of the requeue that takes place in the
+.B FUTEX_CMP_REQUEUE_PI
+operation.
+.\"
+.\" Darren Hart notes that a patch to allow glibc to fully support
+.\" PI-aware pthreads condition variables has not yet been accepted into
+.\" glibc. The story is complex, and can be found at
+.\" https://sourceware.org/bugzilla/show_bug.cgi?id=11588
+.\" Darren notes that in the meantime, the patch is shipped with various
+.\" PREEMPT_RT-enabled Linux systems.
+.\"
+.\" Related to the preceding, Darren proposed that somewhere, man-pages
+.\" should document the following point:
+.\"
+.\" While the Linux kernel, since Linux 2.6.31, supports requeueing of
+.\" priority-inheritance (PI) aware mutexes via the
+.\" FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex operations,
+.\" the glibc implementation does not yet take full advantage of this.
+.\" Specifically, the condvar internal data lock remains a non-PI aware
+.\" mutex, regardless of the type of the pthread_mutex associated with
+.\" the condvar. This can lead to an unbounded priority inversion on
+.\" the internal data lock even when associating a PI aware
+.\" pthread_mutex with a condvar during a pthread_cond*_wait
+.\" operation. For this reason, it is not recommended to rely on
+.\" priority inheritance when using pthread condition variables.
+.\"
+.\" The problem is that the obvious location for this text is
+.\" the pthread_cond*wait(3) man page. However, such a man page
+.\" does not currently exist.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.SH RETURN VALUE
+In the event of an error (and assuming that
+.BR futex ()
+was invoked via
+.BR syscall (2)),
+all operations return \-1 and set
+.I errno
+to indicate the error.
+.PP
+The return value on success depends on the operation,
+as described in the following list:
+.TP
+.B FUTEX_WAIT
+Returns 0 if the caller was woken up.
+Note that a wake-up can also be caused by common futex usage patterns
+in unrelated code that happened to have previously used the futex word's
+memory location (e.g., typical futex-based implementations of
+Pthreads mutexes can cause this under some conditions).
+Therefore, callers should always conservatively assume that a return
+value of 0 can mean a spurious wake-up, and use the futex word's value
+(i.e., the user-space synchronization scheme)
+to decide whether to continue to block or not.
+.TP
+.B FUTEX_WAKE
+Returns the number of waiters that were woken up.
+.TP
+.B FUTEX_FD
+Returns the new file descriptor associated with the futex.
+.TP
+.B FUTEX_REQUEUE
+Returns the number of waiters that were woken up.
+.TP
+.B FUTEX_CMP_REQUEUE
+Returns the total number of waiters that were woken up or
+requeued to the futex for the futex word at
+.IR uaddr2 .
+If this value is greater than
+.IR val ,
+then the difference is the number of waiters requeued to the futex for the
+futex word at
+.IR uaddr2 .
+.TP
+.B FUTEX_WAKE_OP
+Returns the total number of waiters that were woken up.
+This is the sum of the woken waiters on the two futexes for
+the futex words at
+.I uaddr
+and
+.IR uaddr2 .
+.TP
+.B FUTEX_WAIT_BITSET
+Returns 0 if the caller was woken up.
+See
+.B FUTEX_WAIT
+for how to interpret this correctly in practice.
+.TP
+.B FUTEX_WAKE_BITSET
+Returns the number of waiters that were woken up.
+.TP
+.B FUTEX_LOCK_PI
+Returns 0 if the futex was successfully locked.
+.TP
+.B FUTEX_LOCK_PI2
+Returns 0 if the futex was successfully locked.
+.TP
+.B FUTEX_TRYLOCK_PI
+Returns 0 if the futex was successfully locked.
+.TP
+.B FUTEX_UNLOCK_PI
+Returns 0 if the futex was successfully unlocked.
+.TP
+.B FUTEX_CMP_REQUEUE_PI
+Returns the total number of waiters that were woken up or
+requeued to the futex for the futex word at
+.IR uaddr2 .
+If this value is greater than
+.IR val ,
+then difference is the number of waiters requeued to the futex for
+the futex word at
+.IR uaddr2 .
+.TP
+.B FUTEX_WAIT_REQUEUE_PI
+Returns 0 if the caller was successfully requeued to the futex for
+the futex word at
+.IR uaddr2 .
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.SH ERRORS
+.TP
+.B EACCES
+No read access to the memory of a futex word.
+.TP
+.B EAGAIN
+.RB ( FUTEX_WAIT ,
+.BR FUTEX_WAIT_BITSET ,
+.BR FUTEX_WAIT_REQUEUE_PI )
+The value pointed to by
+.I uaddr
+was not equal to the expected value
+.I val
+at the time of the call.
+.IP
+.BR Note :
+on Linux, the symbolic names
+.B EAGAIN
+and
+.B EWOULDBLOCK
+(both of which appear in different parts of the kernel futex code)
+have the same value.
+.TP
+.B EAGAIN
+.RB ( FUTEX_CMP_REQUEUE ,
+.BR FUTEX_CMP_REQUEUE_PI )
+The value pointed to by
+.I uaddr
+is not equal to the expected value
+.IR val3 .
+.TP
+.B EAGAIN
+.RB ( FUTEX_LOCK_PI ,
+.BR FUTEX_LOCK_PI2 ,
+.BR FUTEX_TRYLOCK_PI ,
+.BR FUTEX_CMP_REQUEUE_PI )
+The futex owner thread ID of
+.I uaddr
+(for
+.BR FUTEX_CMP_REQUEUE_PI :
+.IR uaddr2 )
+is about to exit,
+but has not yet handled the internal state cleanup.
+Try again.
+.TP
+.B EDEADLK
+.RB ( FUTEX_LOCK_PI ,
+.BR FUTEX_LOCK_PI2 ,
+.BR FUTEX_TRYLOCK_PI ,
+.BR FUTEX_CMP_REQUEUE_PI )
+The futex word at
+.I uaddr
+is already locked by the caller.
+.TP
+.B EDEADLK
+.\" FIXME . I see that kernel/locking/rtmutex.c uses EDEADLK in some
+.\" places, and EDEADLOCK in others. On almost all architectures
+.\" these constants are synonymous. Is there a reason that both
+.\" names are used?
+.\"
+.\" tglx (July 2015): "No. We should probably fix that."
+.\"
+.RB ( FUTEX_CMP_REQUEUE_PI )
+While requeueing a waiter to the PI futex for the futex word at
+.IR uaddr2 ,
+the kernel detected a deadlock.
+.TP
+.B EFAULT
+A required pointer argument (i.e.,
+.IR uaddr ,
+.IR uaddr2 ,
+or
+.IR timeout )
+did not point to a valid user-space address.
+.TP
+.B EINTR
+A
+.B FUTEX_WAIT
+or
+.B FUTEX_WAIT_BITSET
+operation was interrupted by a signal (see
+.BR signal (7)).
+Before Linux 2.6.22, this error could also be returned for
+a spurious wakeup; since Linux 2.6.22, this no longer happens.
+.TP
+.B EINVAL
+The operation in
+.I futex_op
+is one of those that employs a timeout, but the supplied
+.I timeout
+argument was invalid
+.RI ( tv_sec
+was less than zero, or
+.I tv_nsec
+was not less than 1,000,000,000).
+.TP
+.B EINVAL
+The operation specified in
+.I futex_op
+employs one or both of the pointers
+.I uaddr
+and
+.IR uaddr2 ,
+but one of these does not point to a valid object\[em]that is,
+the address is not four-byte-aligned.
+.TP
+.B EINVAL
+.RB ( FUTEX_WAIT_BITSET ,
+.BR FUTEX_WAKE_BITSET )
+The bit mask supplied in
+.I val3
+is zero.
+.TP
+.B EINVAL
+.RB ( FUTEX_CMP_REQUEUE_PI )
+.I uaddr
+equals
+.I uaddr2
+(i.e., an attempt was made to requeue to the same futex).
+.TP
+.B EINVAL
+.RB ( FUTEX_FD )
+The signal number supplied in
+.I val
+is invalid.
+.TP
+.B EINVAL
+.RB ( FUTEX_WAKE ,
+.BR FUTEX_WAKE_OP ,
+.BR FUTEX_WAKE_BITSET ,
+.BR FUTEX_REQUEUE ,
+.BR FUTEX_CMP_REQUEUE )
+The kernel detected an inconsistency between the user-space state at
+.I uaddr
+and the kernel state\[em]that is, it detected a waiter which waits in
+.B FUTEX_LOCK_PI
+or
+.B FUTEX_LOCK_PI2
+on
+.IR uaddr .
+.TP
+.B EINVAL
+.RB ( FUTEX_LOCK_PI ,
+.BR FUTEX_LOCK_PI2 ,
+.BR FUTEX_TRYLOCK_PI ,
+.BR FUTEX_UNLOCK_PI )
+The kernel detected an inconsistency between the user-space state at
+.I uaddr
+and the kernel state.
+This indicates either state corruption
+or that the kernel found a waiter on
+.I uaddr
+which is waiting via
+.B FUTEX_WAIT
+or
+.BR FUTEX_WAIT_BITSET .
+.TP
+.B EINVAL
+.RB ( FUTEX_CMP_REQUEUE_PI )
+The kernel detected an inconsistency between the user-space state at
+.I uaddr2
+and the kernel state;
+.\" From a conversation with Thomas Gleixner (Aug 2015): ###
+.\" The kernel sees: I have non PI state for a futex you tried to
+.\" tell me was PI
+that is, the kernel detected a waiter which waits via
+.B FUTEX_WAIT
+or
+.B FUTEX_WAIT_BITSET
+on
+.IR uaddr2 .
+.TP
+.B EINVAL
+.RB ( FUTEX_CMP_REQUEUE_PI )
+The kernel detected an inconsistency between the user-space state at
+.I uaddr
+and the kernel state;
+that is, the kernel detected a waiter which waits via
+.B FUTEX_WAIT
+or
+.B FUTEX_WAIT_BITSET
+on
+.IR uaddr .
+.TP
+.B EINVAL
+.RB ( FUTEX_CMP_REQUEUE_PI )
+The kernel detected an inconsistency between the user-space state at
+.I uaddr
+and the kernel state;
+that is, the kernel detected a waiter which waits on
+.I uaddr
+via
+.B FUTEX_LOCK_PI
+or
+.B FUTEX_LOCK_PI2
+(instead of
+.BR FUTEX_WAIT_REQUEUE_PI ).
+.TP
+.B EINVAL
+.RB ( FUTEX_CMP_REQUEUE_PI )
+.\" This deals with the case:
+.\" wait_requeue_pi(A, B);
+.\" requeue_pi(A, C);
+An attempt was made to requeue a waiter to a futex other than that
+specified by the matching
+.B FUTEX_WAIT_REQUEUE_PI
+call for that waiter.
+.TP
+.B EINVAL
+.RB ( FUTEX_CMP_REQUEUE_PI )
+The
+.I val
+argument is not 1.
+.TP
+.B EINVAL
+Invalid argument.
+.TP
+.B ENFILE
+.RB ( FUTEX_FD )
+The system-wide limit on the total number of open files has been reached.
+.TP
+.B ENOMEM
+.RB ( FUTEX_LOCK_PI ,
+.BR FUTEX_LOCK_PI2 ,
+.BR FUTEX_TRYLOCK_PI ,
+.BR FUTEX_CMP_REQUEUE_PI )
+The kernel could not allocate memory to hold state information.
+.TP
+.B ENOSYS
+Invalid operation specified in
+.IR futex_op .
+.TP
+.B ENOSYS
+The
+.B FUTEX_CLOCK_REALTIME
+option was specified in
+.IR futex_op ,
+but the accompanying operation was neither
+.BR FUTEX_WAIT ,
+.BR FUTEX_WAIT_BITSET ,
+.BR FUTEX_WAIT_REQUEUE_PI ,
+nor
+.BR FUTEX_LOCK_PI2 .
+.TP
+.B ENOSYS
+.RB ( FUTEX_LOCK_PI ,
+.BR FUTEX_LOCK_PI2 ,
+.BR FUTEX_TRYLOCK_PI ,
+.BR FUTEX_UNLOCK_PI ,
+.BR FUTEX_CMP_REQUEUE_PI ,
+.BR FUTEX_WAIT_REQUEUE_PI )
+A run-time check determined that the operation is not available.
+The PI-futex operations are not implemented on all architectures and
+are not supported on some CPU variants.
+.TP
+.B EPERM
+.RB ( FUTEX_LOCK_PI ,
+.BR FUTEX_LOCK_PI2 ,
+.BR FUTEX_TRYLOCK_PI ,
+.BR FUTEX_CMP_REQUEUE_PI )
+The caller is not allowed to attach itself to the futex at
+.I uaddr
+(for
+.BR FUTEX_CMP_REQUEUE_PI :
+the futex at
+.IR uaddr2 ).
+(This may be caused by a state corruption in user space.)
+.TP
+.B EPERM
+.RB ( FUTEX_UNLOCK_PI )
+The caller does not own the lock represented by the futex word.
+.TP
+.B ESRCH
+.RB ( FUTEX_LOCK_PI ,
+.BR FUTEX_LOCK_PI2 ,
+.BR FUTEX_TRYLOCK_PI ,
+.BR FUTEX_CMP_REQUEUE_PI )
+The thread ID in the futex word at
+.I uaddr
+does not exist.
+.TP
+.B ESRCH
+.RB ( FUTEX_CMP_REQUEUE_PI )
+The thread ID in the futex word at
+.I uaddr2
+does not exist.
+.TP
+.B ETIMEDOUT
+The operation in
+.I futex_op
+employed the timeout specified in
+.IR timeout ,
+and the timeout expired before the operation completed.
+.\"
+.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.\"
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux 2.6.0.
+.PP
+Initial futex support was merged in Linux 2.5.7 but with different
+semantics from what was described above.
+A four-argument system call with the semantics
+described in this page was introduced in Linux 2.5.40.
+A fifth argument was added in Linux 2.5.70,
+and a sixth argument was added in Linux 2.6.7.
+.SH EXAMPLES
+The program below demonstrates use of futexes in a program where a parent
+process and a child process use a pair of futexes located inside a
+shared anonymous mapping to synchronize access to a shared resource:
+the terminal.
+The two processes each write
+.I nloops
+(a command-line argument that defaults to 5 if omitted)
+messages to the terminal and employ a synchronization protocol
+that ensures that they alternate in writing messages.
+Upon running this program we see output such as the following:
+.PP
+.in +4n
+.EX
+$ \fB./futex_demo\fP
+Parent (18534) 0
+Child (18535) 0
+Parent (18534) 1
+Child (18535) 1
+Parent (18534) 2
+Child (18535) 2
+Parent (18534) 3
+Child (18535) 3
+Parent (18534) 4
+Child (18535) 4
+.EE
+.in
+.SS Program source
+\&
+.\" SRC BEGIN (futex.c)
+.EX
+/* futex_demo.c
+\&
+ Usage: futex_demo [nloops]
+ (Default: 5)
+\&
+ Demonstrate the use of futexes in a program where parent and child
+ use a pair of futexes located inside a shared anonymous mapping to
+ synchronize access to a shared resource: the terminal. The two
+ processes each write \[aq]num\-loops\[aq] messages to the terminal and employ
+ a synchronization protocol that ensures that they alternate in
+ writing messages.
+*/
+#define _GNU_SOURCE
+#include <err.h>
+#include <errno.h>
+#include <linux/futex.h>
+#include <stdatomic.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/wait.h>
+#include <unistd.h>
+\&
+static uint32_t *futex1, *futex2, *iaddr;
+\&
+static int
+futex(uint32_t *uaddr, int futex_op, uint32_t val,
+ const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
+{
+ return syscall(SYS_futex, uaddr, futex_op, val,
+ timeout, uaddr2, val3);
+}
+\&
+/* Acquire the futex pointed to by \[aq]futexp\[aq]: wait for its value to
+ become 1, and then set the value to 0. */
+\&
+static void
+fwait(uint32_t *futexp)
+{
+ long s;
+ const uint32_t one = 1;
+\&
+ /* atomic_compare_exchange_strong(ptr, oldval, newval)
+ atomically performs the equivalent of:
+\&
+ if (*ptr == *oldval)
+ *ptr = newval;
+\&
+ It returns true if the test yielded true and *ptr was updated. */
+\&
+ while (1) {
+\&
+ /* Is the futex available? */
+ if (atomic_compare_exchange_strong(futexp, &one, 0))
+ break; /* Yes */
+\&
+ /* Futex is not available; wait. */
+\&
+ s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0);
+ if (s == \-1 && errno != EAGAIN)
+ err(EXIT_FAILURE, "futex\-FUTEX_WAIT");
+ }
+}
+\&
+/* Release the futex pointed to by \[aq]futexp\[aq]: if the futex currently
+ has the value 0, set its value to 1 and then wake any futex waiters,
+ so that if the peer is blocked in fwait(), it can proceed. */
+\&
+static void
+fpost(uint32_t *futexp)
+{
+ long s;
+ const uint32_t zero = 0;
+\&
+ /* atomic_compare_exchange_strong() was described
+ in comments above. */
+\&
+ if (atomic_compare_exchange_strong(futexp, &zero, 1)) {
+ s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
+ if (s == \-1)
+ err(EXIT_FAILURE, "futex\-FUTEX_WAKE");
+ }
+}
+\&
+int
+main(int argc, char *argv[])
+{
+ pid_t childPid;
+ unsigned int nloops;
+\&
+ setbuf(stdout, NULL);
+\&
+ nloops = (argc > 1) ? atoi(argv[1]) : 5;
+\&
+ /* Create a shared anonymous mapping that will hold the futexes.
+ Since the futexes are being shared between processes, we
+ subsequently use the "shared" futex operations (i.e., not the
+ ones suffixed "_PRIVATE"). */
+\&
+ iaddr = mmap(NULL, sizeof(*iaddr) * 2, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_SHARED, \-1, 0);
+ if (iaddr == MAP_FAILED)
+ err(EXIT_FAILURE, "mmap");
+\&
+ futex1 = &iaddr[0];
+ futex2 = &iaddr[1];
+\&
+ *futex1 = 0; /* State: unavailable */
+ *futex2 = 1; /* State: available */
+\&
+ /* Create a child process that inherits the shared anonymous
+ mapping. */
+\&
+ childPid = fork();
+ if (childPid == \-1)
+ err(EXIT_FAILURE, "fork");
+\&
+ if (childPid == 0) { /* Child */
+ for (unsigned int j = 0; j < nloops; j++) {
+ fwait(futex1);
+ printf("Child (%jd) %u\en", (intmax_t) getpid(), j);
+ fpost(futex2);
+ }
+\&
+ exit(EXIT_SUCCESS);
+ }
+\&
+ /* Parent falls through to here. */
+\&
+ for (unsigned int j = 0; j < nloops; j++) {
+ fwait(futex2);
+ printf("Parent (%jd) %u\en", (intmax_t) getpid(), j);
+ fpost(futex1);
+ }
+\&
+ wait(NULL);
+\&
+ exit(EXIT_SUCCESS);
+}
+.EE
+.\" SRC END
+.SH SEE ALSO
+.ad l
+.BR get_robust_list (2),
+.BR restart_syscall (2),
+.BR pthread_mutexattr_getprotocol (3),
+.BR futex (7),
+.BR sched (7)
+.PP
+The following kernel source files:
+.IP \[bu] 3
+.I Documentation/pi\-futex.txt
+.IP \[bu]
+.I Documentation/futex\-requeue\-pi.txt
+.IP \[bu]
+.I Documentation/locking/rt\-mutex.txt
+.IP \[bu]
+.I Documentation/locking/rt\-mutex\-design.txt
+.IP \[bu]
+.I Documentation/robust\-futex\-ABI.txt
+.PP
+Franke, H., Russell, R., and Kirwood, M., 2002.
+\fIFuss, Futexes and Furwocks: Fast Userlevel Locking in Linux\fP
+(from proceedings of the Ottawa Linux Symposium 2002),
+.br
+.UR http://kernel.org\:/doc\:/ols\:/2002\:/ols2002\-pages\-479\-495.pdf
+.UE
+.PP
+Hart, D., 2009. \fIA futex overview and update\fP,
+.UR http://lwn.net/Articles/360699/
+.UE
+.PP
+Hart, D.\& and Guniguntala, D., 2009.
+\fIRequeue-PI: Making glibc Condvars PI-Aware\fP
+(from proceedings of the 2009 Real-Time Linux Workshop),
+.UR http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf
+.UE
+.PP
+Drepper, U., 2011. \fIFutexes Are Tricky\fP,
+.UR http://www.akkadia.org/drepper/futex.pdf
+.UE
+.PP
+Futex example library, futex\-*.tar.bz2 at
+.br
+.UR https://mirrors.kernel.org\:/pub\:/linux\:/kernel\:/people\:/rusty/
+.UE
+.\"
+.\" FIXME(Torvald) We should probably refer to the glibc code here, in
+.\" particular the glibc-internal futex wrapper functions that are
+.\" WIP, and the generic pthread_mutex_t and perhaps condvar
+.\" implementations.