Diffstat (limited to 'man2/futex.2')
-rw-r--r-- | man2/futex.2 | 1976 |
1 files changed, 1976 insertions, 0 deletions
diff --git a/man2/futex.2 b/man2/futex.2 new file mode 100644 index 0000000..43b1075 --- /dev/null +++ b/man2/futex.2 @@ -0,0 +1,1976 @@ +.\" Page by b.hubert +.\" and Copyright (C) 2015, Thomas Gleixner <tglx@linutronix.de> +.\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpages@gmail.com> +.\" +.\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE) +.\" may be freely modified and distributed +.\" %%%LICENSE_END +.\" +.\" Niki A. Rahimi (LTC Security Development, narahimi@us.ibm.com) +.\" added ERRORS section. +.\" +.\" Modified 2004-06-17 mtk +.\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE +.\" +.\" FIXME Still to integrate are some points from Torvald Riegel's mail of +.\" 2015-01-23: +.\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977 +.\" +.\" FIXME Do we need to add some text regarding Torvald Riegel's 2015-01-24 mail +.\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242 +.\" +.TH futex 2 2023-05-03 "Linux man-pages 6.05.01" +.SH NAME +futex \- fast user-space locking +.SH LIBRARY +Standard C library +.RI ( libc ", " \-lc ) +.SH SYNOPSIS +.nf +.PP +.BR "#include <linux/futex.h>" " /* Definition of " FUTEX_* " constants */" +.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */" +.B #include <unistd.h> +.PP +.BI "long syscall(SYS_futex, uint32_t *" uaddr ", int " futex_op \ +", uint32_t " val , +.BI " const struct timespec *" timeout , \ +" \fR /* or: \fBuint32_t \fIval2\fP */" +.BI " uint32_t *" uaddr2 ", uint32_t " val3 ); +.fi +.PP +.IR Note : +glibc provides no wrapper for +.BR futex (), +necessitating the use of +.BR syscall (2). +.SH DESCRIPTION +The +.BR futex () +system call provides a method for waiting until a certain condition becomes +true. +It is typically used as a blocking construct in the context of +shared-memory synchronization. +When using futexes, the majority of +the synchronization operations are performed in user space. 
+A user-space program employs the +.BR futex () +system call only when it is likely that the program has to block for +a longer time until the condition becomes true. +Other +.BR futex () +operations can be used to wake any processes or threads waiting +for a particular condition. +.PP +A futex is a 32-bit value\[em]referred to below as a +.IR "futex word" \[em]whose +address is supplied to the +.BR futex () +system call. +(Futexes are 32 bits in size on all platforms, including 64-bit systems.) +All futex operations are governed by this value. +In order to share a futex between processes, +the futex is placed in a region of shared memory, +created using (for example) +.BR mmap (2) +or +.BR shmat (2). +(Thus, the futex word may have different +virtual addresses in different processes, +but these addresses all refer to the same location in physical memory.) +In a multithreaded program, it is sufficient to place the futex word +in a global variable shared by all threads. +.PP +When executing a futex operation that requests to block a thread, +the kernel will block only if the futex word has the value that the +calling thread supplied (as one of the arguments of the +.BR futex () +call) as the expected value of the futex word. +The loading of the futex word's value, +the comparison of that value with the expected value, +and the actual blocking will happen atomically and will be totally ordered +with respect to concurrent operations performed by other threads +on the same futex word. +.\" Notes from Darren Hart (Dec 2015): +.\" Totally ordered with respect futex operations refers to semantics +.\" of the ACQUIRE/RELEASE operations and how they impact ordering of +.\" memory reads and writes. The kernel futex operations are protected +.\" by spinlocks, which ensure that all operations are serialized +.\" with respect to one another. +.\" +.\" This is a lot to attempt to define in this document. 
Perhaps a +.\" reference to linux/Documentation/memory-barriers.txt as a footnote +.\" would be sufficient? Or perhaps for this manual, "serialized" would +.\" be sufficient, with a footnote regarding "totally ordered" and a +.\" pointer to the memory-barrier documentation? +Thus, the futex word is used to connect the synchronization in user space +with the implementation of blocking by the kernel. +Analogously to an atomic +compare-and-exchange operation that potentially changes shared memory, +blocking via a futex is an atomic compare-and-block operation. +.\" FIXME(Torvald Riegel): +.\" Eventually we want to have some text in NOTES to satisfy +.\" the reference in the following sentence +.\" See NOTES for a detailed specification of +.\" the synchronization semantics. +.PP +One use of futexes is for implementing locks. +The state of the lock (i.e., acquired or not acquired) +can be represented as an atomically accessed flag in shared memory. +In the uncontended case, +a thread can access or modify the lock state with atomic instructions, +for example atomically changing it from not acquired to acquired +using an atomic compare-and-exchange instruction. +(Such instructions are performed entirely in user mode, +and the kernel maintains no information about the lock state.) +On the other hand, a thread may be unable to acquire a lock because +it is already acquired by another thread. +It then may pass the lock's flag as a futex word and the value +representing the acquired state as the expected value to a +.BR futex () +wait operation. +This +.BR futex () +operation will block if and only if the lock is still acquired +(i.e., the value in the futex word still matches the "acquired state"). +When releasing the lock, a thread has to first reset the +lock state to not acquired and then execute a futex +operation that wakes threads blocked on the lock flag used as a futex word +(this can be further optimized to avoid unnecessary wake-ups). 
+See +.BR futex (7) +for more detail on how to use futexes. +.PP +Besides the basic wait and wake-up futex functionality, there are further +futex operations aimed at supporting more complex use cases. +.PP +Note that +no explicit initialization or destruction is necessary to use futexes; +the kernel maintains a futex +(i.e., the kernel-internal implementation artifact) +only while operations such as +.BR FUTEX_WAIT , +described below, are being performed on a particular futex word. +.\" +.SS Arguments +The +.I uaddr +argument points to the futex word. +On all platforms, futexes are four-byte +integers that must be aligned on a four-byte boundary. +The operation to perform on the futex is specified in the +.I futex_op +argument; +.I val +is a value whose meaning and purpose depends on +.IR futex_op . +.PP +The remaining arguments +.RI ( timeout , +.IR uaddr2 , +and +.IR val3 ) +are required only for certain of the futex operations described below. +Where one of these arguments is not required, it is ignored. +.PP +For several blocking operations, the +.I timeout +argument is a pointer to a +.I timespec +structure that specifies a timeout for the operation. +However, notwithstanding the prototype shown above, for some operations, +the least significant four bytes of this argument are instead +used as an integer whose meaning is determined by the operation. +For these operations, the kernel casts the +.I timeout +value first to +.IR "unsigned long", +then to +.IR uint32_t , +and in the remainder of this page, this argument is referred to as +.I val2 +when interpreted in this fashion. +.PP +Where it is required, the +.I uaddr2 +argument is a pointer to a second futex word that is employed +by the operation. +.PP +The interpretation of the final integer argument, +.IR val3 , +depends on the operation. 
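As an illustration of the argument conventions described above, the following is a minimal sketch of a wrapper around syscall(2), together with the lock protocol outlined in DESCRIPTION. The names futex(), demo_lock(), and demo_unlock(), the 0/1 encoding of the lock state, and the use of the GCC/Clang __atomic builtins are assumptions made for this example; none of them is part of the kernel interface. The unconditional wake-up in demo_unlock() is the unoptimized form.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrapper (glibc provides none). Arguments that an
   operation does not use are passed as NULL or 0. */
static long
futex(uint32_t *uaddr, int futex_op, uint32_t val,
      const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}

/* Assumed encoding for this sketch: 0 = not acquired, 1 = acquired. */
static void
demo_lock(uint32_t *word)
{
    for (;;) {
        uint32_t expected = 0;

        /* Uncontended case: acquire entirely in user space. */
        if (__atomic_compare_exchange_n(word, &expected, 1, 0,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return;

        /* Contended case: block only while the word still holds 1.
           A failure with EAGAIN or EINTR simply leads to a retry. */
        futex(word, FUTEX_WAIT, 1, NULL, NULL, 0);
    }
}

static void
demo_unlock(uint32_t *word)
{
    /* First reset the lock state, then wake one blocked waiter. */
    __atomic_store_n(word, 0, __ATOMIC_RELEASE);
    futex(word, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

A caller would place a uint32_t initialized to 0 in memory visible to all participants and bracket each critical section with demo_lock() and demo_unlock() on its address.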
+.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.SS Futex operations +The +.I futex_op +argument consists of two parts: +a command that specifies the operation to be performed, +bitwise ORed with zero or more options that +modify the behaviour of the operation. +The options that may be included in +.I futex_op +are as follows: +.TP +.BR FUTEX_PRIVATE_FLAG " (since Linux 2.6.22)" +.\" commit 34f01cc1f512fa783302982776895c73714ebbc2 +This option bit can be employed with all futex operations. +It tells the kernel that the futex is process-private and not shared +with another process (i.e., it is being used for synchronization +only between threads of the same process). +This allows the kernel to make some additional performance optimizations. +.\" I.e., It allows the kernel choose the fast path for validating +.\" the user-space address and avoids expensive VMA lookups, +.\" taking reference counts on file backing store, and so on. +.IP +As a convenience, +.I <linux/futex.h> +defines a set of constants with the suffix +.B _PRIVATE +that are equivalents of all of the operations listed below, +.\" except the obsolete FUTEX_FD, for which the "private" flag was +.\" meaningless +but with the +.B FUTEX_PRIVATE_FLAG +ORed into the constant value. +Thus, there are +.BR FUTEX_WAIT_PRIVATE , +.BR FUTEX_WAKE_PRIVATE , +and so on. +.TP +.BR FUTEX_CLOCK_REALTIME " (since Linux 2.6.28)" +.\" commit 1acdac104668a0834cfa267de9946fac7764d486 +This option bit can be employed only with the +.BR FUTEX_WAIT_BITSET , +.BR FUTEX_WAIT_REQUEUE_PI , +(since Linux 4.5) +.\" commit 337f13046ff03717a9e99675284a817527440a49 +.BR FUTEX_WAIT , +and +(since Linux 5.14) +.\" commit bf22a6976897977b0a3f1aeba6823c959fc4fdae +.B FUTEX_LOCK_PI2 +operations. +.IP +If this option is set, the kernel measures the +.I timeout +against the +.B CLOCK_REALTIME +clock. +.IP +If this option is not set, the kernel measures the +.I timeout +against the +.B CLOCK_MONOTONIC +clock. 
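To illustrate how a command and option bits combine in futex_op, the sketch below ORs FUTEX_PRIVATE_FLAG and FUTEX_CLOCK_REALTIME into FUTEX_WAIT_BITSET, which takes an absolute timeout. The helper names wait_until_realtime() and realtime_after_ms() are hypothetical and exist only for this example. The sketch also assumes a platform where struct timespec matches the layout that SYS_futex expects, as on typical 64-bit systems.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sleep while *uaddr == expected, until the
   absolute CLOCK_REALTIME deadline. Returns 0 if woken, or -1 with
   errno set (for example, ETIMEDOUT or EAGAIN). */
static long
wait_until_realtime(uint32_t *uaddr, uint32_t expected,
                    const struct timespec *deadline)
{
    /* Command ORed with option bits, as described above. */
    int op = FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME;

    return syscall(SYS_futex, uaddr, op, expected, deadline, NULL,
                   FUTEX_BITSET_MATCH_ANY);
}

/* Hypothetical helper: absolute CLOCK_REALTIME time 'ms' milliseconds
   from now. */
static struct timespec
realtime_after_ms(long ms)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec++;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}
```

Omitting FUTEX_CLOCK_REALTIME from op would make the kernel measure the same deadline against CLOCK_MONOTONIC, so the deadline would then have to be computed from that clock instead.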
+.PP +The operation specified in +.I futex_op +is one of the following: +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_WAIT " (since Linux 2.6.0)" +.\" Strictly speaking, since some time in Linux 2.5.x +This operation tests that the value at the +futex word pointed to by the address +.I uaddr +still contains the expected value +.IR val , +and if so, then sleeps waiting for a +.B FUTEX_WAKE +operation on the futex word. +The load of the value of the futex word is an atomic memory +access (i.e., using atomic machine instructions of the respective +architecture). +This load, the comparison with the expected value, and +starting to sleep are performed atomically +.\" FIXME: Torvald, I think we may need to add some explanation of +.\" "totally ordered" here. +and totally ordered +with respect to other futex operations on the same futex word. +If the thread starts to sleep, +it is considered a waiter on this futex word. +If the futex value does not match +.IR val , +then the call fails immediately with the error +.BR EAGAIN . +.IP +The purpose of the comparison with the expected value is to prevent lost +wake-ups. +If another thread changed the value of the futex word after the +calling thread decided to block based on the prior value, +and if the other thread executed a +.B FUTEX_WAKE +operation (or similar wake-up) after the value change and before this +.B FUTEX_WAIT +operation, then the calling thread will observe the +value change and will not start to sleep. +.IP +If the +.I timeout +is not NULL, the structure it points to specifies a +timeout for the wait. +(This interval will be rounded up to the system clock granularity, +and is guaranteed not to expire early.) +The timeout is by default measured according to the +.B CLOCK_MONOTONIC +clock, but, since Linux 4.5, the +.B CLOCK_REALTIME +clock can be selected by specifying +.B FUTEX_CLOCK_REALTIME +in +.IR futex_op . 
+If +.I timeout +is NULL, the call blocks indefinitely. +.IP +.IR Note : +for +.BR FUTEX_WAIT , +.I timeout +is interpreted as a +.I relative +value. +This differs from other futex operations, where +.I timeout +is interpreted as an absolute value. +To obtain the equivalent of +.B FUTEX_WAIT +with an absolute timeout, employ +.B FUTEX_WAIT_BITSET +with +.I val3 +specified as +.BR FUTEX_BITSET_MATCH_ANY . +.IP +The arguments +.I uaddr2 +and +.I val3 +are ignored. +.\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to a +.\" different example. +.\" +.\" For +.\" .BR futex (7), +.\" this call is executed if decrementing the count gave a negative value +.\" (indicating contention), +.\" and will sleep until another process or thread releases +.\" the futex and executes the +.\" .B FUTEX_WAKE +.\" operation. +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_WAKE " (since Linux 2.6.0)" +.\" Strictly speaking, since Linux 2.5.x +This operation wakes at most +.I val +of the waiters that are waiting (e.g., inside +.BR FUTEX_WAIT ) +on the futex word at the address +.IR uaddr . +Most commonly, +.I val +is specified as either 1 (wake up a single waiter) or +.B INT_MAX +(wake up all waiters). +No guarantee is provided about which waiters are awoken +(e.g., a waiter with a higher scheduling priority is not guaranteed +to be awoken in preference to a waiter with a lower priority). +.IP +The arguments +.IR timeout , +.IR uaddr2 , +and +.I val3 +are ignored. +.\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to +.\" a different example. +.\" +.\" For +.\" .BR futex (7), +.\" this is executed if incrementing the count showed that +.\" there were waiters, +.\" once the futex value has been set to 1 +.\" (indicating that it is available). +.\" +.\" How does "incrementing the count show that there were waiters"? 
+.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_FD " (from Linux 2.6.0 up to and including Linux 2.6.25)" +.\" Strictly speaking, from Linux 2.5.x to Linux 2.6.25 +This operation creates a file descriptor that is associated with +the futex at +.IR uaddr . +The caller must close the returned file descriptor after use. +When another process or thread performs a +.B FUTEX_WAKE +on the futex word, the file descriptor is reported as readable by +.BR select (2), +.BR poll (2), +and +.BR epoll (7). +.IP +The file descriptor can be used to obtain asynchronous notifications: if +.I val +is nonzero, then, when another process or thread executes a +.BR FUTEX_WAKE , +the caller will receive the signal whose number was passed in +.IR val . +.IP +The arguments +.IR timeout , +.IR uaddr2 , +and +.I val3 +are ignored. +.IP +Because it was inherently racy, +.B FUTEX_FD +has been removed +.\" commit 82af7aca56c67061420d618cc5a30f0fd4106b80 +from Linux 2.6.26 onward. +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_REQUEUE " (since Linux 2.6.0)" +This operation performs the same task as +.B FUTEX_CMP_REQUEUE +(see below), except that no check is made using the value in +.IR val3 . +(The argument +.I val3 +is ignored.) +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)" +This operation first checks whether the location +.I uaddr +still contains the value +.IR val3 . +If not, the operation fails with the error +.BR EAGAIN . +Otherwise, the operation wakes up a maximum of +.I val +waiters that are waiting on the futex at +.IR uaddr . +If there are more than +.I val +waiters, then the remaining waiters are removed +from the wait queue of the source futex at +.I uaddr +and added to the wait queue of the target futex at +.IR uaddr2 .
+The +.I val2 +argument specifies an upper limit on the number of waiters +that are requeued to the futex at +.IR uaddr2 . +.IP +.\" FIXME(Torvald) Is the following correct? Or is just the decision +.\" which threads to wake or requeue part of the atomic operation? +The load from +.I uaddr +is an atomic memory access (i.e., using atomic machine instructions of +the respective architecture). +This load, the comparison with +.IR val3 , +and the requeueing of any waiters are performed atomically and totally +ordered with respect to other operations on the same futex word. +.\" Notes from a f2f conversation with Thomas Gleixner (Aug 2015): ### +.\" The operation is serialized with respect to operations on both +.\" source and target futex. No other waiter can enqueue itself +.\" for waiting and no other waiter can dequeue itself because of +.\" a timeout or signal. +.IP +Typical values to specify for +.I val +are 0 or 1. +(Specifying +.B INT_MAX +is not useful, because it would make the +.B FUTEX_CMP_REQUEUE +operation equivalent to +.BR FUTEX_WAKE .) +The limit value specified via +.I val2 +is typically either 1 or +.BR INT_MAX . +(Specifying the argument as 0 is not useful, because it would make the +.B FUTEX_CMP_REQUEUE +operation equivalent to +.BR FUTEX_WAIT .) +.IP +The +.B FUTEX_CMP_REQUEUE +operation was added as a replacement for the earlier +.BR FUTEX_REQUEUE . +The difference is that the check of the value at +.I uaddr +can be used to ensure that requeueing happens only under certain +conditions, which allows race conditions to be avoided in certain use cases. +.\" But, as Rich Felker points out, there remain valid use cases for +.\" FUTEX_REQUEUE, for example, when the calling thread is requeuing +.\" the target(s) to a lock that the calling thread owns +.\" From: Rich Felker <dalias@libc.org> +.\" Date: Wed, 29 Oct 2014 22:43:17 -0400 +.\" To: Darren Hart <dvhart@infradead.org> +.\" CC: libc-alpha@sourceware.org, ... 
+.\" Subject: Re: Add futex wrapper to glibc? +.IP +Both +.B FUTEX_REQUEUE +and +.B FUTEX_CMP_REQUEUE +can be used to avoid "thundering herd" wake-ups that could occur when using +.B FUTEX_WAKE +in cases where all of the waiters that are woken need to acquire +another futex. +Consider the following scenario, +where multiple waiter threads are waiting on B, +a wait queue implemented using a futex: +.IP +.in +4n +.EX +lock(A) +while (!check_value(V)) { + unlock(A); + block_on(B); + lock(A); +}; +unlock(A); +.EE +.in +.IP +If a waker thread used +.BR FUTEX_WAKE , +then all waiters waiting on B would be woken up, +and they would all try to acquire lock A. +However, waking all of the threads in this manner would be pointless because +all except one of the threads would immediately block on lock A again. +By contrast, a requeue operation wakes just one waiter and moves +the other waiters to lock A, +and when the woken waiter unlocks A then the next waiter can proceed. +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_WAKE_OP " (since Linux 2.6.14)" +.\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721 +.\" Author: Jakub Jelinek <jakub@redhat.com> +.\" Date: Tue Sep 6 15:16:25 2005 -0700 +.\" FIXME. (Torvald) The glibc condvar implementation is currently being +.\" revised (e.g., to not use an internal lock anymore). +.\" It is probably more future-proof to remove this paragraph. +.\" [Torvald, do you have an update here?] +This operation was added to support some user-space use cases +where more than one futex must be handled at the same time. +The most notable example is the implementation of +.BR pthread_cond_signal (3), +which requires operations on two futexes, +the one used to implement the mutex and the one used in the implementation +of the wait queue associated with the condition variable. +.B FUTEX_WAKE_OP +allows such cases to be implemented without leading to +high rates of contention and context switching. 
+.IP +The +.B FUTEX_WAKE_OP +operation is equivalent to executing the following code atomically +and totally ordered with respect to other futex operations on +any of the two supplied futex words: +.IP +.in +4n +.EX +uint32_t oldval = *(uint32_t *) uaddr2; +*(uint32_t *) uaddr2 = oldval \fIop\fP \fIoparg\fP; +futex(uaddr, FUTEX_WAKE, val, 0, 0, 0); +if (oldval \fIcmp\fP \fIcmparg\fP) + futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0); +.EE +.in +.IP +In other words, +.B FUTEX_WAKE_OP +does the following: +.RS +.IP \[bu] 3 +saves the original value of the futex word at +.I uaddr2 +and performs an operation to modify the value of the futex at +.IR uaddr2 ; +this is an atomic read-modify-write memory access (i.e., using atomic +machine instructions of the respective architecture) +.IP \[bu] +wakes up a maximum of +.I val +waiters on the futex for the futex word at +.IR uaddr ; +and +.IP \[bu] +dependent on the results of a test of the original value of the +futex word at +.IR uaddr2 , +wakes up a maximum of +.I val2 +waiters on the futex for the futex word at +.IR uaddr2 . +.RE +.IP +The operation and comparison that are to be performed are encoded +in the bits of the argument +.IR val3 . +Pictorially, the encoding is: +.IP +.in +4n +.EX ++---+---+-----------+-----------+ +|op |cmp| oparg | cmparg | ++---+---+-----------+-----------+ + 4 4 12 12 <== # of bits +.EE +.in +.IP +Expressed in code, the encoding is: +.IP +.in +4n +.EX +#define FUTEX_OP(op, oparg, cmp, cmparg) \e + (((op & 0xf) << 28) | \e + ((cmp & 0xf) << 24) | \e + ((oparg & 0xfff) << 12) | \e + (cmparg & 0xfff)) +.EE +.in +.IP +In the above, +.I op +and +.I cmp +are each one of the codes listed below. +The +.I oparg +and +.I cmparg +components are literal numeric values, except as noted below. 
+.IP +The +.I op +component has one of the following values: +.IP +.in +4n +.EX +FUTEX_OP_SET 0 /* uaddr2 = oparg; */ +FUTEX_OP_ADD 1 /* uaddr2 += oparg; */ +FUTEX_OP_OR 2 /* uaddr2 |= oparg; */ +FUTEX_OP_ANDN 3 /* uaddr2 &= \[ti]oparg; */ +FUTEX_OP_XOR 4 /* uaddr2 \[ha]= oparg; */ +.EE +.in +.IP +In addition, bitwise ORing the following value into +.I op +causes +.I (1\~<<\~oparg) +to be used as the operand: +.IP +.in +4n +.EX +FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */ +.EE +.in +.IP +The +.I cmp +field is one of the following: +.IP +.in +4n +.EX +FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */ +FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */ +FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */ +FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */ +FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */ +FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */ +.EE +.in +.IP +The return value of +.B FUTEX_WAKE_OP +is the sum of the number of waiters woken on the futex +.I uaddr +plus the number of waiters woken on the futex +.IR uaddr2 . +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_WAIT_BITSET " (since Linux 2.6.25)" +.\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d +This operation is like +.B FUTEX_WAIT +except that +.I val3 +is used to provide a 32-bit bit mask to the kernel. +This bit mask, in which at least one bit must be set, +is stored in the kernel-internal state of the waiter. +See the description of +.B FUTEX_WAKE_BITSET +for further details. +.IP +If +.I timeout +is not NULL, the structure it points to specifies +an absolute timeout for the wait operation. +If +.I timeout +is NULL, the operation can block indefinitely. +.IP +The +.I uaddr2 +argument is ignored. 
+.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_WAKE_BITSET " (since Linux 2.6.25)" +.\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d +This operation is the same as +.B FUTEX_WAKE +except that the +.I val3 +argument is used to provide a 32-bit bit mask to the kernel. +This bit mask, in which at least one bit must be set, +is used to select which waiters should be woken up. +The selection is done by a bitwise AND of the "wake" bit mask +(i.e., the value in +.IR val3 ) +and the bit mask which is stored in the kernel-internal +state of the waiter (the "wait" bit mask that is set using +.BR FUTEX_WAIT_BITSET ). +All of the waiters for which the result of the AND is nonzero are woken up; +the remaining waiters are left sleeping. +.IP +The effect of +.B FUTEX_WAIT_BITSET +and +.B FUTEX_WAKE_BITSET +is to allow selective wake-ups among multiple waiters that are blocked +on the same futex. +However, note that, depending on the use case, +employing this bit-mask multiplexing feature on a +futex can be less efficient than simply using multiple futexes, +because employing bit-mask multiplexing requires the kernel +to check all waiters on a futex, +including those that are not interested in being woken up +(i.e., they do not have the relevant bit set in their "wait" bit mask). +.\" According to http://locklessinc.com/articles/futex_cheat_sheet/: +.\" +.\" "The original reason for the addition of these extensions +.\" was to improve the performance of pthread read-write locks +.\" in glibc. However, the pthreads library no longer uses the +.\" same locking algorithm, and these extensions are not used +.\" without the bitset parameter being all ones. 
+.\" +.\" The page goes on to note that the FUTEX_WAIT_BITSET operation +.\" is nevertheless used (with a bit mask of all ones) in order to +.\" obtain the absolute timeout functionality that is useful +.\" for efficiently implementing Pthreads APIs (which use absolute +.\" timeouts); FUTEX_WAIT provides only relative timeouts. +.IP +The constant +.BR FUTEX_BITSET_MATCH_ANY , +which corresponds to all 32 bits set in the bit mask, can be used as the +.I val3 +argument for +.B FUTEX_WAIT_BITSET +and +.BR FUTEX_WAKE_BITSET . +Other than differences in the handling of the +.I timeout +argument, the +.B FUTEX_WAIT +operation is equivalent to +.B FUTEX_WAIT_BITSET +with +.I val3 +specified as +.BR FUTEX_BITSET_MATCH_ANY ; +that is, allow a wake-up by any waker. +The +.B FUTEX_WAKE +operation is equivalent to +.B FUTEX_WAKE_BITSET +with +.I val3 +specified as +.BR FUTEX_BITSET_MATCH_ANY ; +that is, wake up any waiter(s). +.IP +The +.I uaddr2 +and +.I timeout +arguments are ignored. +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.SS Priority-inheritance futexes +Linux supports priority-inheritance (PI) futexes in order to handle +priority-inversion problems that can be encountered with +normal futex locks. +Priority inversion is the problem that occurs when a high-priority +task is blocked waiting to acquire a lock held by a low-priority task, +while tasks at an intermediate priority continuously preempt +the low-priority task from the CPU. +Consequently, the low-priority task makes no progress toward +releasing the lock, and the high-priority task remains blocked. +.PP +Priority inheritance is a mechanism for dealing with +the priority-inversion problem. 
+With this mechanism, when a high-priority task becomes blocked +by a lock held by a low-priority task, +the priority of the low-priority task is temporarily raised +to that of the high-priority task, +so that it is not preempted by any intermediate level tasks, +and can thus make progress toward releasing the lock. +To be effective, priority inheritance must be transitive, +meaning that if a high-priority task blocks on a lock +held by a lower-priority task that is itself blocked by a lock +held by another intermediate-priority task +(and so on, for chains of arbitrary length), +then both of those tasks +(or more generally, all of the tasks in a lock chain) +have their priorities raised to be the same as the high-priority task. +.PP +From a user-space perspective, +what makes a futex PI-aware is a policy agreement (described below) +between user space and the kernel about the value of the futex word, +coupled with the use of the PI-futex operations described below. +(Unlike the other futex operations described above, +the PI-futex operations are designed +for the implementation of very specific IPC mechanisms.) +.\" +.\" Quoting Darren Hart: +.\" These opcodes paired with the PI futex value policy (described below) +.\" defines a "futex" as PI aware. These were created very specifically +.\" in support of PI pthread_mutexes, so it makes a lot more sense to +.\" talk about a PI aware pthread_mutex, than a PI aware futex, since +.\" there is a lot of policy and scaffolding that has to be built up +.\" around it to use it properly (this is what a PI pthread_mutex is). +.PP +.\" mtk: The following text is drawn from the Hart/Guniguntala paper +.\" (listed in SEE ALSO), but I have reworded some pieces +.\" significantly. +.\" +The PI-futex operations described below differ from the other +futex operations in that they impose policy on the use of the value of the +futex word: +.IP \[bu] 3 +If the lock is not acquired, the futex word's value shall be 0. 
+.IP \[bu] +If the lock is acquired, the futex word's value shall +be the thread ID (TID; +see +.BR gettid (2)) +of the owning thread. +.IP \[bu] +If the lock is owned and there are threads contending for the lock, +then the +.B FUTEX_WAITERS +bit shall be set in the futex word's value; in other words, this value is: +.IP +.in +4n +.EX +FUTEX_WAITERS | TID +.EE +.in +.IP +(Note that it is invalid for a PI futex word to have no owner and +.B FUTEX_WAITERS +set.) +.PP +With this policy in place, +a user-space application can acquire an unacquired +lock or release a lock using atomic instructions executed in user mode +(e.g., a compare-and-swap operation such as +.I cmpxchg +on the x86 architecture). +Acquiring a lock simply consists of using compare-and-swap to atomically +set the futex word's value to the caller's TID if its previous value was 0. +Releasing a lock requires using compare-and-swap to set the futex word's +value to 0 if the previous value was the expected TID. +.PP +If a futex is already acquired (i.e., has a nonzero value), +waiters must employ the +.B FUTEX_LOCK_PI +or +.B FUTEX_LOCK_PI2 +operations to acquire the lock. +If other threads are waiting for the lock, then the +.B FUTEX_WAITERS +bit is set in the futex value; +in this case, the lock owner must employ the +.B FUTEX_UNLOCK_PI +operation to release the lock. +.PP +In the cases where callers are forced into the kernel +(i.e., required to perform a +.BR futex () +call), +they then deal directly with a so-called RT-mutex, +a kernel locking mechanism which implements the required +priority-inheritance semantics. +After the RT-mutex is acquired, the futex value is updated accordingly, +before the calling thread returns to user space. +.PP +It is important to note +.\" tglx (July 2015): +.\" If there are multiple waiters on a pi futex then a wake pi operation +.\" will wake the first waiter and hand over the lock to this waiter.
This +.\" includes handing over the rtmutex which represents the futex in the +.\" kernel. The strict requirement is that the futex owner and the rtmutex +.\" owner must be the same, except for the update period which is +.\" serialized by the futex internal locking. That means the kernel must +.\" update the user-space value prior to returning to user space +that the kernel will update the futex word's value prior +to returning to user space. +(This prevents the possibility of the futex word's value ending +up in an invalid state, such as having an owner but the value being 0, +or having waiters but not having the +.B FUTEX_WAITERS +bit set.) +.PP +If a futex has an associated RT-mutex in the kernel +(i.e., there are blocked waiters) +and the owner of the futex/RT-mutex dies unexpectedly, +then the kernel cleans up the RT-mutex and hands it over to the next waiter. +This in turn requires that the user-space value is updated accordingly. +To indicate that this is required, the kernel sets the +.B FUTEX_OWNER_DIED +bit in the futex word along with the thread ID of the new owner. +User space can detect this situation via the presence of the +.B FUTEX_OWNER_DIED +bit and is then responsible for cleaning up the stale state left over by +the dead owner. +.\" tglx (July 2015): +.\" The FUTEX_OWNER_DIED bit can also be set on uncontended futexes, where +.\" the kernel has no state associated. This happens via the robust futex +.\" mechanism. In that case the futex value will be set to +.\" FUTEX_OWNER_DIED. The robust futex mechanism is also available for non +.\" PI futexes. +.PP +PI futexes are operated on by specifying one of the values listed below in +.IR futex_op . +Note that the PI futex operations must be used as paired operations +and are subject to some additional requirements: +.IP \[bu] 3 +.BR FUTEX_LOCK_PI , +.BR FUTEX_LOCK_PI2 , +and +.B FUTEX_TRYLOCK_PI +pair with +.BR FUTEX_UNLOCK_PI . 
+.B FUTEX_UNLOCK_PI +must be called only on a futex owned by the calling thread, +as defined by the value policy, otherwise the error +.B EPERM +results. +.IP \[bu] +.B FUTEX_WAIT_REQUEUE_PI +pairs with +.BR FUTEX_CMP_REQUEUE_PI . +This must be performed from a non-PI futex to a distinct PI futex +(or the error +.B EINVAL +results). +Additionally, +.I val +(the number of waiters to be woken) must be 1 +(or the error +.B EINVAL +results). +.PP +The PI futex operations are as follows: +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_LOCK_PI " (since Linux 2.6.18)" +.\" commit c87e2837be82df479a6bae9f155c43516d2feebc +This operation is used after an attempt to acquire +the lock via an atomic user-mode instruction failed +because the futex word has a nonzero value\[em]specifically, +because it contained the (PID-namespace-specific) TID of the lock owner. +.IP +The operation checks the value of the futex word at the address +.IR uaddr . +If the value is 0, then the kernel tries to atomically set +the futex value to the caller's TID. +If the futex word's value is nonzero, +the kernel atomically sets the +.B FUTEX_WAITERS +bit, which signals the futex owner that it cannot unlock the futex in +user space atomically by setting the futex value to 0. +.\" tglx (July 2015): +.\" The operation here is similar to the FUTEX_WAIT logic. When the user +.\" space atomic acquire does not succeed because the futex value was non +.\" zero, then the waiter goes into the kernel, takes the kernel internal +.\" lock and retries the acquisition under the lock. If the acquisition +.\" does not succeed either, then it sets the FUTEX_WAITERS bit, to signal +.\" the lock owner that it needs to go into the kernel. Here is the pseudo +.\" code: +.\" +.\" lock(kernel_lock); +.\" retry: +.\" +.\" /* +.\" * Owner might have unlocked in user space before we +.\" * were able to set the waiter bit. 
+.\" */ +.\" if (atomic_acquire(futex) == SUCCESS) { +.\" unlock(kernel_lock()); +.\" return 0; +.\" } +.\" +.\" /* +.\" * Owner might have unlocked after the above atomic_acquire() +.\" * attempt. +.\" */ +.\" if (atomic_set_waiters_bit(futex) != SUCCESS) +.\" goto retry; +.\" +.\" queue_waiter(); +.\" unlock(kernel_lock); +.\" block(); +.\" +After that, the kernel: +.RS +.IP (1) 5 +Tries to find the thread which is associated with the owner TID. +.IP (2) +Creates or reuses kernel state on behalf of the owner. +(If this is the first waiter, there is no kernel state for this +futex, so kernel state is created by locking the RT-mutex +and the futex owner is made the owner of the RT-mutex. +If there are existing waiters, then the existing state is reused.) +.IP (3) +Attaches the waiter to the futex +(i.e., the waiter is enqueued on the RT-mutex waiter list). +.RE +.IP +If more than one waiter exists, +the enqueueing of the waiter is in descending priority order. +(For information on priority ordering, see the discussion of the +.BR SCHED_DEADLINE , +.BR SCHED_FIFO , +and +.B SCHED_RR +scheduling policies in +.BR sched (7).) +The owner inherits either the waiter's CPU bandwidth +(if the waiter is scheduled under the +.B SCHED_DEADLINE +policy) or the waiter's priority (if the waiter is scheduled under the +.B SCHED_RR +or +.B SCHED_FIFO +policy). +.\" August 2015: +.\" mtk: If the realm is restricted purely to SCHED_OTHER (SCHED_NORMAL) +.\" processes, does the nice value come into play also? +.\" +.\" tglx: No. SCHED_OTHER/NORMAL tasks are handled in FIFO order +This inheritance follows the lock chain in the case of nested locking +.\" (i.e., task 1 blocks on lock A, held by task 2, +.\" while task 2 blocks on lock B, held by task 3) +and performs deadlock detection. +.IP +The +.I timeout +argument provides a timeout for the lock attempt. 
+If +.I timeout +is not NULL, the structure it points to specifies +an absolute timeout, measured against the +.B CLOCK_REALTIME +clock. +.\" 2016-07-07 response from Thomas Gleixner on LKML: +.\" From: Thomas Gleixner <tglx@linutronix.de> +.\" Date: 6 July 2016 at 20:57 +.\" Subject: Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op +.\" +.\" On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote: +.\" > On 06/23/2016 08:28 PM, Darren Hart wrote: +.\" > > And as a follow-on, what is the reason for FUTEX_LOCK_PI only using +.\" > > CLOCK_REALTIME? It seems reasonable to me that a user may want to wait a +.\" > > specific amount of time, regardless of wall time. +.\" > +.\" > Yes, that's another weird inconsistency. +.\" +.\" The reason is that phtread_mutex_timedlock() uses absolute timeouts based on +.\" CLOCK_REALTIME. glibc folks asked to make that the default behaviour back +.\" then when we added LOCK_PI. +If +.I timeout +is NULL, the operation will block indefinitely. +.IP +The +.IR uaddr2 , +.IR val , +and +.I val3 +arguments are ignored. +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_LOCK_PI2 " (since Linux 5.14)" +.\" commit bf22a6976897977b0a3f1aeba6823c959fc4fdae +This operation is the same as +.BR FUTEX_LOCK_PI , +except that the clock against which +.I timeout +is measured is selectable. +By default, the (absolute) timeout specified in +.I timeout +is measured against the +.B CLOCK_MONOTONIC +clock, but if the +.B FUTEX_CLOCK_REALTIME +flag is specified in +.IR futex_op , +then the timeout is measured against the +.B CLOCK_REALTIME +clock. +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_TRYLOCK_PI " (since Linux 2.6.18)" +.\" commit c87e2837be82df479a6bae9f155c43516d2feebc +This operation tries to acquire the lock at +.IR uaddr . +It is invoked when a user-space atomic acquire did not +succeed because the futex word was not 0. 
+.IP +Because the kernel has access to more state information than user space, +acquisition of the lock might succeed if performed by the +kernel in cases where the futex word +(i.e., the state information accessible to user space) contains stale state +.RB ( FUTEX_WAITERS +and/or +.BR FUTEX_OWNER_DIED ). +This can happen when the owner of the futex died. +User space cannot handle this condition in a race-free manner, +but the kernel can fix this up and acquire the futex. +.\" Paraphrasing a f2f conversation with Thomas Gleixner about the +.\" above point (Aug 2015): ### +.\" There is a rare possibility of a race condition involving an +.\" uncontended futex with no owner, but with waiters. The +.\" kernel-user-space contract is that if a futex is nonzero, you must +.\" go into kernel. The futex was owned by a task, and that task dies +.\" but there are no waiters, so the futex value is non zero. +.\" Therefore, the next locker has to go into the kernel, +.\" so that the kernel has a chance to clean up. (CMXCH on zero +.\" in user space would fail, so kernel has to clean up.) +.\" Darren Hart (Oct 2015): +.\" The trylock in the kernel has more state, so it can independently +.\" verify the flags that user space must trust implicitly. +.IP +The +.IR uaddr2 , +.IR val , +.IR timeout , +and +.I val3 +arguments are ignored. +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_UNLOCK_PI " (since Linux 2.6.18)" +.\" commit c87e2837be82df479a6bae9f155c43516d2feebc +This operation wakes the top priority waiter that is waiting in +.B FUTEX_LOCK_PI +or +.B FUTEX_LOCK_PI2 +on the futex address provided by the +.I uaddr +argument. +.IP +This is called when the user-space value at +.I uaddr +cannot be changed atomically from a TID (of the owner) to 0. +.IP +The +.IR uaddr2 , +.IR val , +.IR timeout , +and +.I val3 +arguments are ignored.
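The user-space fast path and the kernel slow path of the PI value policy described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the glibc implementation: the helper names self_tid(), pi_lock(), and pi_unlock() are hypothetical, the futex word is declared _Atomic so that the C11 atomics apply to it, no timeout is used, and FUTEX_OWNER_DIED handling is omitted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>    /* FUTEX_LOCK_PI, FUTEX_UNLOCK_PI, FUTEX_TID_MASK */
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: TID of the calling thread (see gettid(2)). */
static uint32_t
self_tid(void)
{
    return (uint32_t) syscall(SYS_gettid);
}

/* Hypothetical helper: acquire the PI futex at 'uaddr'.
   Returns 0 on success, or -1 with errno set. */
static int
pi_lock(_Atomic uint32_t *uaddr)
{
    uint32_t expected = 0;

    /* Fast path: change the word from 0 to our TID in user space. */
    if (atomic_compare_exchange_strong(uaddr, &expected, self_tid()))
        return 0;

    /* Slow path: the word was nonzero, so enter the kernel, which
       queues us on the RT-mutex.  A NULL timeout blocks indefinitely. */
    return (int) syscall(SYS_futex, uaddr, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

/* Hypothetical helper: release the PI futex at 'uaddr'.
   Returns 0 on success, or -1 with errno set. */
static int
pi_unlock(_Atomic uint32_t *uaddr)
{
    uint32_t expected = self_tid();

    /* Debug check: the caller must be the current owner. */
    assert((atomic_load(uaddr) & FUTEX_TID_MASK) == expected);

    /* Fast path: change the word from our bare TID to 0.  This fails
       if FUTEX_WAITERS (or any other bit) is set in the word. */
    if (atomic_compare_exchange_strong(uaddr, &expected, 0))
        return 0;

    /* Slow path: waiters exist, so the kernel must hand over the lock. */
    return (int) syscall(SYS_futex, uaddr, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```

In the uncontended case neither helper enters the kernel for the futex itself; a caller would declare a zero-initialized futex word and bracket the critical section with pi_lock(&word) and pi_unlock(&word).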
+.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_CMP_REQUEUE_PI " (since Linux 2.6.31)" +.\" commit 52400ba946759af28442dee6265c5c0180ac7122 +This operation is a PI-aware variant of +.BR FUTEX_CMP_REQUEUE . +It requeues waiters that are blocked via +.B FUTEX_WAIT_REQUEUE_PI +on +.I uaddr +from a non-PI source futex +.RI ( uaddr ) +to a PI target futex +.RI ( uaddr2 ). +.IP +As with +.BR FUTEX_CMP_REQUEUE , +this operation wakes up a maximum of +.I val +waiters that are waiting on the futex at +.IR uaddr . +However, for +.BR FUTEX_CMP_REQUEUE_PI , +.I val +is required to be 1 +(since the main point is to avoid a thundering herd). +The remaining waiters are removed from the wait queue of the source futex at +.I uaddr +and added to the wait queue of the target futex at +.IR uaddr2 . +.IP +The +.I val2 +.\" val2 is the cap on the number of requeued waiters. +.\" In the glibc pthread_cond_broadcast() implementation, this argument +.\" is specified as INT_MAX, and for pthread_cond_signal() it is 0. +and +.I val3 +arguments serve the same purposes as for +.BR FUTEX_CMP_REQUEUE . +.\" +.\" The page at http://locklessinc.com/articles/futex_cheat_sheet/ +.\" notes that "priority-inheritance Futex to priority-inheritance +.\" Futex requeues are currently unsupported". However, probably +.\" the page does not need to say nothing about this, since +.\" Thomas Gleixner commented (July 2015): "they never will be +.\" supported because they make no sense at all" +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.TP +.BR FUTEX_WAIT_REQUEUE_PI " (since Linux 2.6.31)" +.\" commit 52400ba946759af28442dee6265c5c0180ac7122 +.\" +Wait on a non-PI futex at +.I uaddr +and potentially be requeued (via a +.B FUTEX_CMP_REQUEUE_PI +operation in another task) onto a PI futex at +.IR uaddr2 . +The wait operation on +.I uaddr +is the same as for +.BR FUTEX_WAIT . 
+.IP +The waiter can be removed from the wait on +.I uaddr +without requeueing on +.I uaddr2 +via a +.B FUTEX_WAKE +operation in another task. +In this case, the +.B FUTEX_WAIT_REQUEUE_PI +operation fails with the error +.BR EAGAIN . +.IP +If +.I timeout +is not NULL, the structure it points to specifies +an absolute timeout for the wait operation. +If +.I timeout +is NULL, the operation can block indefinitely. +.IP +The +.I val3 +argument is ignored. +.IP +The +.B FUTEX_WAIT_REQUEUE_PI +and +.B FUTEX_CMP_REQUEUE_PI +were added to support a fairly specific use case: +support for priority-inheritance-aware POSIX threads condition variables. +The idea is that these operations should always be paired, +in order to ensure that user space and the kernel remain in sync. +Thus, in the +.B FUTEX_WAIT_REQUEUE_PI +operation, the user-space application pre-specifies the target +of the requeue that takes place in the +.B FUTEX_CMP_REQUEUE_PI +operation. +.\" +.\" Darren Hart notes that a patch to allow glibc to fully support +.\" PI-aware pthreads condition variables has not yet been accepted into +.\" glibc. The story is complex, and can be found at +.\" https://sourceware.org/bugzilla/show_bug.cgi?id=11588 +.\" Darren notes that in the meantime, the patch is shipped with various +.\" PREEMPT_RT-enabled Linux systems. +.\" +.\" Related to the preceding, Darren proposed that somewhere, man-pages +.\" should document the following point: +.\" +.\" While the Linux kernel, since Linux 2.6.31, supports requeueing of +.\" priority-inheritance (PI) aware mutexes via the +.\" FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex operations, +.\" the glibc implementation does not yet take full advantage of this. +.\" Specifically, the condvar internal data lock remains a non-PI aware +.\" mutex, regardless of the type of the pthread_mutex associated with +.\" the condvar. 
This can lead to an unbounded priority inversion on +.\" the internal data lock even when associating a PI aware +.\" pthread_mutex with a condvar during a pthread_cond*_wait +.\" operation. For this reason, it is not recommended to rely on +.\" priority inheritance when using pthread condition variables. +.\" +.\" The problem is that the obvious location for this text is +.\" the pthread_cond*wait(3) man page. However, such a man page +.\" does not currently exist. +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.SH RETURN VALUE +In the event of an error (and assuming that +.BR futex () +was invoked via +.BR syscall (2)), +all operations return \-1 and set +.I errno +to indicate the error. +.PP +The return value on success depends on the operation, +as described in the following list: +.TP +.B FUTEX_WAIT +Returns 0 if the caller was woken up. +Note that a wake-up can also be caused by common futex usage patterns +in unrelated code that happened to have previously used the futex word's +memory location (e.g., typical futex-based implementations of +Pthreads mutexes can cause this under some conditions). +Therefore, callers should always conservatively assume that a return +value of 0 can mean a spurious wake-up, and use the futex word's value +(i.e., the user-space synchronization scheme) +to decide whether to continue to block or not. +.TP +.B FUTEX_WAKE +Returns the number of waiters that were woken up. +.TP +.B FUTEX_FD +Returns the new file descriptor associated with the futex. +.TP +.B FUTEX_REQUEUE +Returns the number of waiters that were woken up. +.TP +.B FUTEX_CMP_REQUEUE +Returns the total number of waiters that were woken up or +requeued to the futex for the futex word at +.IR uaddr2 . +If this value is greater than +.IR val , +then the difference is the number of waiters requeued to the futex for the +futex word at +.IR uaddr2 . +.TP +.B FUTEX_WAKE_OP +Returns the total number of waiters that were woken up. 
+This is the sum of the woken waiters on the two futexes for +the futex words at +.I uaddr +and +.IR uaddr2 . +.TP +.B FUTEX_WAIT_BITSET +Returns 0 if the caller was woken up. +See +.B FUTEX_WAIT +for how to interpret this correctly in practice. +.TP +.B FUTEX_WAKE_BITSET +Returns the number of waiters that were woken up. +.TP +.B FUTEX_LOCK_PI +Returns 0 if the futex was successfully locked. +.TP +.B FUTEX_LOCK_PI2 +Returns 0 if the futex was successfully locked. +.TP +.B FUTEX_TRYLOCK_PI +Returns 0 if the futex was successfully locked. +.TP +.B FUTEX_UNLOCK_PI +Returns 0 if the futex was successfully unlocked. +.TP +.B FUTEX_CMP_REQUEUE_PI +Returns the total number of waiters that were woken up or +requeued to the futex for the futex word at +.IR uaddr2 . +If this value is greater than +.IR val , +then the difference is the number of waiters requeued to the futex for +the futex word at +.IR uaddr2 . +.TP +.B FUTEX_WAIT_REQUEUE_PI +Returns 0 if the caller was successfully requeued to the futex for +the futex word at +.IR uaddr2 . +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.SH ERRORS +.TP +.B EACCES +No read access to the memory of a futex word. +.TP +.B EAGAIN +.RB ( FUTEX_WAIT , +.BR FUTEX_WAIT_BITSET , +.BR FUTEX_WAIT_REQUEUE_PI ) +The value pointed to by +.I uaddr +was not equal to the expected value +.I val +at the time of the call. +.IP +.BR Note : +on Linux, the symbolic names +.B EAGAIN +and +.B EWOULDBLOCK +(both of which appear in different parts of the kernel futex code) +have the same value. +.TP +.B EAGAIN +.RB ( FUTEX_CMP_REQUEUE , +.BR FUTEX_CMP_REQUEUE_PI ) +The value pointed to by +.I uaddr +is not equal to the expected value +.IR val3 . +.TP +.B EAGAIN +.RB ( FUTEX_LOCK_PI , +.BR FUTEX_LOCK_PI2 , +.BR FUTEX_TRYLOCK_PI , +.BR FUTEX_CMP_REQUEUE_PI ) +The futex owner thread ID of +.I uaddr +(for +.BR FUTEX_CMP_REQUEUE_PI : +.IR uaddr2 ) +is about to exit, +but has not yet handled the internal state cleanup.
+Try again. +.TP +.B EDEADLK +.RB ( FUTEX_LOCK_PI , +.BR FUTEX_LOCK_PI2 , +.BR FUTEX_TRYLOCK_PI , +.BR FUTEX_CMP_REQUEUE_PI ) +The futex word at +.I uaddr +is already locked by the caller. +.TP +.B EDEADLK +.\" FIXME . I see that kernel/locking/rtmutex.c uses EDEADLK in some +.\" places, and EDEADLOCK in others. On almost all architectures +.\" these constants are synonymous. Is there a reason that both +.\" names are used? +.\" +.\" tglx (July 2015): "No. We should probably fix that." +.\" +.RB ( FUTEX_CMP_REQUEUE_PI ) +While requeueing a waiter to the PI futex for the futex word at +.IR uaddr2 , +the kernel detected a deadlock. +.TP +.B EFAULT +A required pointer argument (i.e., +.IR uaddr , +.IR uaddr2 , +or +.IR timeout ) +did not point to a valid user-space address. +.TP +.B EINTR +A +.B FUTEX_WAIT +or +.B FUTEX_WAIT_BITSET +operation was interrupted by a signal (see +.BR signal (7)). +Before Linux 2.6.22, this error could also be returned for +a spurious wakeup; since Linux 2.6.22, this no longer happens. +.TP +.B EINVAL +The operation in +.I futex_op +is one of those that employs a timeout, but the supplied +.I timeout +argument was invalid +.RI ( tv_sec +was less than zero, or +.I tv_nsec +was not less than 1,000,000,000). +.TP +.B EINVAL +The operation specified in +.I futex_op +employs one or both of the pointers +.I uaddr +and +.IR uaddr2 , +but one of these does not point to a valid object\[em]that is, +the address is not four-byte-aligned. +.TP +.B EINVAL +.RB ( FUTEX_WAIT_BITSET , +.BR FUTEX_WAKE_BITSET ) +The bit mask supplied in +.I val3 +is zero. +.TP +.B EINVAL +.RB ( FUTEX_CMP_REQUEUE_PI ) +.I uaddr +equals +.I uaddr2 +(i.e., an attempt was made to requeue to the same futex). +.TP +.B EINVAL +.RB ( FUTEX_FD ) +The signal number supplied in +.I val +is invalid. 
+.TP +.B EINVAL +.RB ( FUTEX_WAKE , +.BR FUTEX_WAKE_OP , +.BR FUTEX_WAKE_BITSET , +.BR FUTEX_REQUEUE , +.BR FUTEX_CMP_REQUEUE ) +The kernel detected an inconsistency between the user-space state at +.I uaddr +and the kernel state\[em]that is, it detected a waiter which waits in +.B FUTEX_LOCK_PI +or +.B FUTEX_LOCK_PI2 +on +.IR uaddr . +.TP +.B EINVAL +.RB ( FUTEX_LOCK_PI , +.BR FUTEX_LOCK_PI2 , +.BR FUTEX_TRYLOCK_PI , +.BR FUTEX_UNLOCK_PI ) +The kernel detected an inconsistency between the user-space state at +.I uaddr +and the kernel state. +This indicates either state corruption +or that the kernel found a waiter on +.I uaddr +which is waiting via +.B FUTEX_WAIT +or +.BR FUTEX_WAIT_BITSET . +.TP +.B EINVAL +.RB ( FUTEX_CMP_REQUEUE_PI ) +The kernel detected an inconsistency between the user-space state at +.I uaddr2 +and the kernel state; +.\" From a conversation with Thomas Gleixner (Aug 2015): ### +.\" The kernel sees: I have non PI state for a futex you tried to +.\" tell me was PI +that is, the kernel detected a waiter which waits via +.B FUTEX_WAIT +or +.B FUTEX_WAIT_BITSET +on +.IR uaddr2 . +.TP +.B EINVAL +.RB ( FUTEX_CMP_REQUEUE_PI ) +The kernel detected an inconsistency between the user-space state at +.I uaddr +and the kernel state; +that is, the kernel detected a waiter which waits via +.B FUTEX_WAIT +or +.B FUTEX_WAIT_BITSET +on +.IR uaddr . +.TP +.B EINVAL +.RB ( FUTEX_CMP_REQUEUE_PI ) +The kernel detected an inconsistency between the user-space state at +.I uaddr +and the kernel state; +that is, the kernel detected a waiter which waits on +.I uaddr +via +.B FUTEX_LOCK_PI +or +.B FUTEX_LOCK_PI2 +(instead of +.BR FUTEX_WAIT_REQUEUE_PI ). +.TP +.B EINVAL +.RB ( FUTEX_CMP_REQUEUE_PI ) +.\" This deals with the case: +.\" wait_requeue_pi(A, B); +.\" requeue_pi(A, C); +An attempt was made to requeue a waiter to a futex other than that +specified by the matching +.B FUTEX_WAIT_REQUEUE_PI +call for that waiter. 
+.TP +.B EINVAL +.RB ( FUTEX_CMP_REQUEUE_PI ) +The +.I val +argument is not 1. +.TP +.B EINVAL +Invalid argument. +.TP +.B ENFILE +.RB ( FUTEX_FD ) +The system-wide limit on the total number of open files has been reached. +.TP +.B ENOMEM +.RB ( FUTEX_LOCK_PI , +.BR FUTEX_LOCK_PI2 , +.BR FUTEX_TRYLOCK_PI , +.BR FUTEX_CMP_REQUEUE_PI ) +The kernel could not allocate memory to hold state information. +.TP +.B ENOSYS +Invalid operation specified in +.IR futex_op . +.TP +.B ENOSYS +The +.B FUTEX_CLOCK_REALTIME +option was specified in +.IR futex_op , +but the accompanying operation was neither +.BR FUTEX_WAIT , +.BR FUTEX_WAIT_BITSET , +.BR FUTEX_WAIT_REQUEUE_PI , +nor +.BR FUTEX_LOCK_PI2 . +.TP +.B ENOSYS +.RB ( FUTEX_LOCK_PI , +.BR FUTEX_LOCK_PI2 , +.BR FUTEX_TRYLOCK_PI , +.BR FUTEX_UNLOCK_PI , +.BR FUTEX_CMP_REQUEUE_PI , +.BR FUTEX_WAIT_REQUEUE_PI ) +A run-time check determined that the operation is not available. +The PI-futex operations are not implemented on all architectures and +are not supported on some CPU variants. +.TP +.B EPERM +.RB ( FUTEX_LOCK_PI , +.BR FUTEX_LOCK_PI2 , +.BR FUTEX_TRYLOCK_PI , +.BR FUTEX_CMP_REQUEUE_PI ) +The caller is not allowed to attach itself to the futex at +.I uaddr +(for +.BR FUTEX_CMP_REQUEUE_PI : +the futex at +.IR uaddr2 ). +(This may be caused by a state corruption in user space.) +.TP +.B EPERM +.RB ( FUTEX_UNLOCK_PI ) +The caller does not own the lock represented by the futex word. +.TP +.B ESRCH +.RB ( FUTEX_LOCK_PI , +.BR FUTEX_LOCK_PI2 , +.BR FUTEX_TRYLOCK_PI , +.BR FUTEX_CMP_REQUEUE_PI ) +The thread ID in the futex word at +.I uaddr +does not exist. +.TP +.B ESRCH +.RB ( FUTEX_CMP_REQUEUE_PI ) +The thread ID in the futex word at +.I uaddr2 +does not exist. +.TP +.B ETIMEDOUT +The operation in +.I futex_op +employed the timeout specified in +.IR timeout , +and the timeout expired before the operation completed. +.\" +.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" +.\" +.SH STANDARDS +Linux. 
+.SH HISTORY +Linux 2.6.0. +.PP +Initial futex support was merged in Linux 2.5.7 but with different +semantics from what was described above. +A four-argument system call with the semantics +described in this page was introduced in Linux 2.5.40. +A fifth argument was added in Linux 2.5.70, +and a sixth argument was added in Linux 2.6.7. +.SH EXAMPLES +The program below demonstrates use of futexes in a program where a parent +process and a child process use a pair of futexes located inside a +shared anonymous mapping to synchronize access to a shared resource: +the terminal. +The two processes each write +.I nloops +(a command-line argument that defaults to 5 if omitted) +messages to the terminal and employ a synchronization protocol +that ensures that they alternate in writing messages. +Upon running this program we see output such as the following: +.PP +.in +4n +.EX +$ \fB./futex_demo\fP +Parent (18534) 0 +Child (18535) 0 +Parent (18534) 1 +Child (18535) 1 +Parent (18534) 2 +Child (18535) 2 +Parent (18534) 3 +Child (18535) 3 +Parent (18534) 4 +Child (18535) 4 +.EE +.in +.SS Program source +\& +.\" SRC BEGIN (futex.c) +.EX +/* futex_demo.c +\& + Usage: futex_demo [nloops] + (Default: 5) +\& + Demonstrate the use of futexes in a program where parent and child + use a pair of futexes located inside a shared anonymous mapping to + synchronize access to a shared resource: the terminal. The two + processes each write \[aq]num\-loops\[aq] messages to the terminal and employ + a synchronization protocol that ensures that they alternate in + writing messages. 
+*/ +#define _GNU_SOURCE +#include <err.h> +#include <errno.h> +#include <linux/futex.h> +#include <stdatomic.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <sys/time.h> +#include <sys/wait.h> +#include <unistd.h> +\& +static uint32_t *futex1, *futex2, *iaddr; +\& +static int +futex(uint32_t *uaddr, int futex_op, uint32_t val, + const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3) +{ + return syscall(SYS_futex, uaddr, futex_op, val, + timeout, uaddr2, val3); +} +\& +/* Acquire the futex pointed to by \[aq]futexp\[aq]: wait for its value to + become 1, and then set the value to 0. */ +\& +static void +fwait(uint32_t *futexp) +{ + long s; + uint32_t one; +\& + /* atomic_compare_exchange_strong(ptr, oldval, newval) + atomically performs the equivalent of: +\& + if (*ptr == *oldval) + *ptr = newval; + else + *oldval = *ptr; +\& + It returns true if the test yielded true and *ptr was updated. */ +\& + while (1) { +\& + /* Is the futex available? Reset \[aq]one\[aq] on each iteration, + since a failed compare-and-exchange overwrites it. */ + one = 1; + if (atomic_compare_exchange_strong(futexp, &one, 0)) + break; /* Yes */ +\& + /* Futex is not available; wait. */ +\& + s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0); + if (s == \-1 && errno != EAGAIN) + err(EXIT_FAILURE, "futex\-FUTEX_WAIT"); + } +} +\& +/* Release the futex pointed to by \[aq]futexp\[aq]: if the futex currently + has the value 0, set its value to 1 and then wake any futex waiters, + so that if the peer is blocked in fwait(), it can proceed. */ +\& +static void +fpost(uint32_t *futexp) +{ + long s; + uint32_t zero = 0; +\& + /* atomic_compare_exchange_strong() was described + in comments above. */ +\& + if (atomic_compare_exchange_strong(futexp, &zero, 1)) { + s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0); + if (s == \-1) + err(EXIT_FAILURE, "futex\-FUTEX_WAKE"); + } +} +\& +int +main(int argc, char *argv[]) +{ + pid_t childPid; + unsigned int nloops; +\& + setbuf(stdout, NULL); +\& + nloops = (argc > 1) ?
atoi(argv[1]) : 5; +\& + /* Create a shared anonymous mapping that will hold the futexes. + Since the futexes are being shared between processes, we + subsequently use the "shared" futex operations (i.e., not the + ones suffixed "_PRIVATE"). */ +\& + iaddr = mmap(NULL, sizeof(*iaddr) * 2, PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_SHARED, \-1, 0); + if (iaddr == MAP_FAILED) + err(EXIT_FAILURE, "mmap"); +\& + futex1 = &iaddr[0]; + futex2 = &iaddr[1]; +\& + *futex1 = 0; /* State: unavailable */ + *futex2 = 1; /* State: available */ +\& + /* Create a child process that inherits the shared anonymous + mapping. */ +\& + childPid = fork(); + if (childPid == \-1) + err(EXIT_FAILURE, "fork"); +\& + if (childPid == 0) { /* Child */ + for (unsigned int j = 0; j < nloops; j++) { + fwait(futex1); + printf("Child (%jd) %u\en", (intmax_t) getpid(), j); + fpost(futex2); + } +\& + exit(EXIT_SUCCESS); + } +\& + /* Parent falls through to here. */ +\& + for (unsigned int j = 0; j < nloops; j++) { + fwait(futex2); + printf("Parent (%jd) %u\en", (intmax_t) getpid(), j); + fpost(futex1); + } +\& + wait(NULL); +\& + exit(EXIT_SUCCESS); +} +.EE +.\" SRC END +.SH SEE ALSO +.ad l +.BR get_robust_list (2), +.BR restart_syscall (2), +.BR pthread_mutexattr_getprotocol (3), +.BR futex (7), +.BR sched (7) +.PP +The following kernel source files: +.IP \[bu] 3 +.I Documentation/pi\-futex.txt +.IP \[bu] +.I Documentation/futex\-requeue\-pi.txt +.IP \[bu] +.I Documentation/locking/rt\-mutex.txt +.IP \[bu] +.I Documentation/locking/rt\-mutex\-design.txt +.IP \[bu] +.I Documentation/robust\-futex\-ABI.txt +.PP +Franke, H., Russell, R., and Kirkwood, M., 2002. +\fIFuss, Futexes and Furwocks: Fast Userlevel Locking in Linux\fP +(from proceedings of the Ottawa Linux Symposium 2002), +.br +.UR http://kernel.org\:/doc\:/ols\:/2002\:/ols2002\-pages\-479\-495.pdf +.UE +.PP +Hart, D., 2009.
\fIA futex overview and update\fP, +.UR http://lwn.net/Articles/360699/ +.UE +.PP +Hart, D.\& and Guniguntala, D., 2009. +\fIRequeue-PI: Making glibc Condvars PI-Aware\fP +(from proceedings of the 2009 Real-Time Linux Workshop), +.UR http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf +.UE +.PP +Drepper, U., 2011. \fIFutexes Are Tricky\fP, +.UR http://www.akkadia.org/drepper/futex.pdf +.UE +.PP +Futex example library, futex\-*.tar.bz2 at +.br +.UR https://mirrors.kernel.org\:/pub\:/linux\:/kernel\:/people\:/rusty/ +.UE +.\" +.\" FIXME(Torvald) We should probably refer to the glibc code here, in +.\" particular the glibc-internal futex wrapper functions that are +.\" WIP, and the generic pthread_mutex_t and perhaps condvar +.\" implementations.