=================================
A Tour Through RCU's Requirements
=================================

Copyright IBM Corporation, 2015

Author: Paul E. McKenney

The initial version of this document appeared on
`LWN <https://lwn.net/>`_ in the following articles:
`part 1 <https://lwn.net/Articles/652156/>`_,
`part 2 <https://lwn.net/Articles/652677/>`_, and
`part 3 <https://lwn.net/Articles/653326/>`_.

Introduction
------------

Read-copy update (RCU) is a synchronization mechanism that is often used
as a replacement for reader-writer locking. RCU is unusual in that
updaters do not block readers, which means that RCU's read-side
primitives can be exceedingly fast and scalable. In addition, updaters
can make useful forward progress concurrently with readers. However, all
this concurrency between RCU readers and updaters does raise the
question of exactly what RCU readers are doing, which in turn raises the
question of exactly what RCU's requirements are.

This document therefore summarizes RCU's requirements, and can be
thought of as an informal, high-level specification for RCU. It is
important to understand that RCU's specification is primarily empirical
in nature; in fact, I learned about many of these requirements the hard
way. This situation might cause some consternation; however, not only
has this learning process been a lot of fun, but it has also been a
great privilege to work with so many people willing to apply
technologies in interesting new ways.

All that aside, here are the categories of currently known RCU
requirements:

#. `Fundamental Requirements`_
#. `Fundamental Non-Requirements`_
#. `Parallelism Facts of Life`_
#. `Quality-of-Implementation Requirements`_
#. `Linux Kernel Complications`_
#. `Software-Engineering Requirements`_
#. `Other RCU Flavors`_
#. `Possible Future Changes`_

This is followed by a `summary <#Summary>`__. Note that the answer to
each Quick Quiz immediately follows the quiz.

Fundamental Requirements
------------------------

RCU's fundamental requirements are the closest thing RCU has to hard
mathematical requirements. These are:

#. `Grace-Period Guarantee`_
#. `Publish/Subscribe Guarantee`_
#. `Memory-Barrier Guarantees`_
#. `RCU Primitives Guaranteed to Execute Unconditionally`_
#. `Guaranteed Read-to-Write Upgrade`_

Grace-Period Guarantee
~~~~~~~~~~~~~~~~~~~~~~

RCU's grace-period guarantee is unusual in being premeditated: Jack
Slingwine and I had this guarantee firmly in mind when we started work
on RCU (then called “rclock”) in the early 1990s. That said, the past
two decades of experience with RCU have produced a much more detailed
understanding of this guarantee.

RCU's grace-period guarantee allows updaters to wait for the completion
of all pre-existing RCU read-side critical sections. An RCU read-side
critical section begins with the marker ``rcu_read_lock()`` and ends
with the marker ``rcu_read_unlock()``. These markers may be nested, and
RCU treats a nested set as one big RCU read-side critical section.
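For example, in the following minimal sketch (``do_a_read()`` and
``do_more_reads()`` are hypothetical placeholders), the critical section
ends only at the outermost ``rcu_read_unlock()``:

::

    /* Minimal sketch: nested markers form one big critical section. */
    static void nested_reader(void)
    {
        rcu_read_lock();      /* Outermost marker: the section begins. */
        rcu_read_lock();      /* Nested marker: still the same section. */
        do_a_read();          /* Hypothetical read-side work. */
        rcu_read_unlock();    /* Does NOT end the section... */
        do_more_reads();      /* ...this code is still inside it... */
        rcu_read_unlock();    /* ...which ends only here. */
    }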
+Production-quality implementations of ``rcu_read_lock()`` and +``rcu_read_unlock()`` are extremely lightweight, and in fact have +exactly zero overhead in Linux kernels built for production use with +``CONFIG_PREEMPT=n``. + +This guarantee allows ordering to be enforced with extremely low +overhead to readers, for example: + + :: + + 1 int x, y; + 2 + 3 void thread0(void) + 4 { + 5 rcu_read_lock(); + 6 r1 = READ_ONCE(x); + 7 r2 = READ_ONCE(y); + 8 rcu_read_unlock(); + 9 } + 10 + 11 void thread1(void) + 12 { + 13 WRITE_ONCE(x, 1); + 14 synchronize_rcu(); + 15 WRITE_ONCE(y, 1); + 16 } + +Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing +readers, any instance of ``thread0()`` that loads a value of zero from +``x`` must complete before ``thread1()`` stores to ``y``, so that +instance must also load a value of zero from ``y``. Similarly, any +instance of ``thread0()`` that loads a value of one from ``y`` must have +started after the ``synchronize_rcu()`` started, and must therefore also +load a value of one from ``x``. Therefore, the outcome: + + :: + + (r1 == 0 && r2 == 1) + +cannot happen. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! You said that updaters can make useful forward | +| progress concurrently with readers, but pre-existing readers will | +| block ``synchronize_rcu()``!!! | +| Just who are you trying to fool??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| First, if updaters do not wish to be blocked by readers, they can use | +| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later. | +| Second, even when using ``synchronize_rcu()``, the other update-side | +| code does run concurrently with readers, whether pre-existing or not. | ++-----------------------------------------------------------------------+ + +This scenario resembles one of the first uses of RCU in +`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a +distributed lock manager's transition into a state suitable for handling +recovery from node failure, more or less as follows: + + :: + + 1 #define STATE_NORMAL 0 + 2 #define STATE_WANT_RECOVERY 1 + 3 #define STATE_RECOVERING 2 + 4 #define STATE_WANT_NORMAL 3 + 5 + 6 int state = STATE_NORMAL; + 7 + 8 void do_something_dlm(void) + 9 { + 10 int state_snap; + 11 + 12 rcu_read_lock(); + 13 state_snap = READ_ONCE(state); + 14 if (state_snap == STATE_NORMAL) + 15 do_something(); + 16 else + 17 do_something_carefully(); + 18 rcu_read_unlock(); + 19 } + 20 + 21 void start_recovery(void) + 22 { + 23 WRITE_ONCE(state, STATE_WANT_RECOVERY); + 24 synchronize_rcu(); + 25 WRITE_ONCE(state, STATE_RECOVERING); + 26 recovery(); + 27 WRITE_ONCE(state, STATE_WANT_NORMAL); + 28 synchronize_rcu(); + 29 WRITE_ONCE(state, STATE_NORMAL); + 30 } + +The RCU read-side critical section in ``do_something_dlm()`` works with +the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that +``do_something()`` never runs concurrently with ``recovery()``, but with +little or no synchronization overhead in ``do_something_dlm()``. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why is the ``synchronize_rcu()`` on line 28 needed? 
| ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Without that extra grace period, memory reordering could result in | +| ``do_something_dlm()`` executing ``do_something()`` concurrently with | +| the last bits of ``recovery()``. | ++-----------------------------------------------------------------------+ + +In order to avoid fatal problems such as deadlocks, an RCU read-side +critical section must not contain calls to ``synchronize_rcu()``. +Similarly, an RCU read-side critical section must not contain anything +that waits, directly or indirectly, on completion of an invocation of +``synchronize_rcu()``. + +Although RCU's grace-period guarantee is useful in and of itself, with +`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would +be good to be able to use RCU to coordinate read-side access to linked +data structures. For this, the grace-period guarantee is not sufficient, +as can be seen in function ``add_gp_buggy()`` below. We will look at the +reader's code later, but in the meantime, just think of the reader as +locklessly picking up the ``gp`` pointer, and, if the value loaded is +non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields. + + :: + + 1 bool add_gp_buggy(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 p->a = a; + 12 p->b = a; + 13 gp = p; /* ORDERING BUG */ + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +The problem is that both the compiler and weakly ordered CPUs are within +their rights to reorder this code as follows: + + :: + + 1 bool add_gp_buggy_optimized(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 gp = p; /* ORDERING BUG */ + 12 p->a = a; + 13 p->b = a; + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized`` +executes line 11, it will see garbage in the ``->a`` and ``->b`` fields. +And this is but one of many ways in which compiler and hardware +optimizations could cause trouble. Therefore, we clearly need some way +to prevent the compiler and the CPU from reordering in this manner, +which brings us to the publish-subscribe guarantee discussed in the next +section. + +Publish/Subscribe Guarantee +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +RCU's publish-subscribe guarantee allows data to be inserted into a +linked data structure without disrupting RCU readers. The updater uses +``rcu_assign_pointer()`` to insert the new data, and readers use +``rcu_dereference()`` to access data, whether new or old. 
The following shows an example of insertion:

::

    1 bool add_gp(int a, int b)
    2 {
    3   p = kmalloc(sizeof(*p), GFP_KERNEL);
    4   if (!p)
    5     return false;
    6   spin_lock(&gp_lock);
    7   if (rcu_access_pointer(gp)) {
    8     spin_unlock(&gp_lock);
    9     return false;
   10   }
   11   p->a = a;
   12   p->b = b;
   13   rcu_assign_pointer(gp, p);
   14   spin_unlock(&gp_lock);
   15   return true;
   16 }

The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a
simple assignment statement, but also guarantees that its assignment
will happen after the two assignments in lines 11 and 12, similar to the
C11 ``memory_order_release`` store operation. It also prevents any
number of “interesting” compiler optimizations, for example, the use of
``gp`` as a scratch location immediately preceding the assignment.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But ``rcu_assign_pointer()`` does nothing to prevent the two          |
| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
| also cause problems?                                                  |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No, it cannot. The readers cannot see either of these two fields      |
| until the assignment to ``gp``, by which time both fields are fully   |
| initialized. So reordering the assignments to ``p->a`` and ``p->b``   |
| cannot possibly cause any problems.                                   |
+-----------------------------------------------------------------------+

It is tempting to assume that the reader need not do anything special to
control its accesses to the RCU-protected data, as shown in
``do_something_gp_buggy()`` below:

::

    1 bool do_something_gp_buggy(void)
    2 {
    3   rcu_read_lock();
    4   p = gp;  /* OPTIMIZATIONS GALORE!!! */
    5   if (p) {
    6     do_something(p->a, p->b);
    7     rcu_read_unlock();
    8     return true;
    9   }
   10   rcu_read_unlock();
   11   return false;
   12 }

However, this temptation must be resisted because there are a
surprisingly large number of ways that the compiler (to say nothing of
`DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__)
can trip this code up. For but one example, if the compiler were short
of registers, it might choose to refetch from ``gp`` rather than keeping
a separate copy in ``p`` as follows:

::

    1 bool do_something_gp_buggy_optimized(void)
    2 {
    3   rcu_read_lock();
    4   if (gp) { /* OPTIMIZATIONS GALORE!!! */
    5     do_something(gp->a, gp->b);
    6     rcu_read_unlock();
    7     return true;
    8   }
    9   rcu_read_unlock();
   10   return false;
   11 }

If this function ran concurrently with a series of updates that replaced
the current structure with a new one, the fetches of ``gp->a`` and
``gp->b`` might well come from two different structures, which could
cause serious confusion. To prevent this (and much else besides),
``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``:

::

    1 bool do_something_gp(void)
    2 {
    3   rcu_read_lock();
    4   p = rcu_dereference(gp);
    5   if (p) {
    6     do_something(p->a, p->b);
    7     rcu_read_unlock();
    8     return true;
    9   }
   10   rcu_read_unlock();
   11   return false;
   12 }

The ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory
barriers in the Linux kernel.
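As a rough conceptual sketch (``my_rcu_dereference()`` is a hypothetical
name, and the kernel's actual ``rcu_dereference()`` adds lockdep and
``sparse`` checking on top of this), the heart of the subscription
operation is simply a volatile load:

::

    /*
     * Conceptual sketch only, not the kernel's actual definition.
     * The volatile load defeats the refetching and load-tearing
     * optimizations shown above, and on DEC Alpha, READ_ONCE() also
     * supplies the dependency-ordering barrier discussed later in
     * this document.
     */
    #define my_rcu_dereference(p) READ_ONCE(p)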
Should a `high-quality implementation of +C11 ``memory_order_consume`` +[PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__ +ever appear, then ``rcu_dereference()`` could be implemented as a +``memory_order_consume`` load. Regardless of the exact implementation, a +pointer fetched by ``rcu_dereference()`` may not be used outside of the +outermost RCU read-side critical section containing that +``rcu_dereference()``, unless protection of the corresponding data +element has been passed from RCU to some other synchronization +mechanism, most commonly locking or `reference +counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__. + +In short, updaters use ``rcu_assign_pointer()`` and readers use +``rcu_dereference()``, and these two RCU API elements work together to +ensure that readers have a consistent view of newly added data elements. + +Of course, it is also necessary to remove elements from RCU-protected +data structures, for example, using the following process: + +#. Remove the data element from the enclosing structure. +#. Wait for all pre-existing RCU read-side critical sections to complete + (because only pre-existing readers can possibly have a reference to + the newly removed data element). +#. At this point, only the updater has a reference to the newly removed + data element, so it can safely reclaim the data element, for example, + by passing it to ``kfree()``. + +This process is implemented by ``remove_gp_synchronous()``: + + :: + + 1 bool remove_gp_synchronous(void) + 2 { + 3 struct foo *p; + 4 + 5 spin_lock(&gp_lock); + 6 p = rcu_access_pointer(gp); + 7 if (!p) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 rcu_assign_pointer(gp, NULL); + 12 spin_unlock(&gp_lock); + 13 synchronize_rcu(); + 14 kfree(p); + 15 return true; + 16 } + +This function is straightforward, with line 13 waiting for a grace +period before line 14 frees the old data element. This waiting ensures +that readers will reach line 7 of ``do_something_gp()`` before the data +element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on +line 6 is similar to ``rcu_dereference()``, except that: + +#. The value returned by ``rcu_access_pointer()`` cannot be + dereferenced. If you want to access the value pointed to as well as + the pointer itself, use ``rcu_dereference()`` instead of + ``rcu_access_pointer()``. +#. The call to ``rcu_access_pointer()`` need not be protected. In + contrast, ``rcu_dereference()`` must either be within an RCU + read-side critical section or in a code segment where the pointer + cannot change, for example, in code protected by the corresponding + update-side lock. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``, | +| what destructive optimizations might the compiler make use of? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Let's start with what happens to ``do_something_gp()`` if it fails to | +| use ``rcu_dereference()``. It could reuse a value formerly fetched | +| from this same pointer. It could also fetch the pointer from ``gp`` | +| in a byte-at-a-time manner, resulting in *load tearing*, in turn | +| resulting a bytewise mash-up of two distinct pointer values. 
It might | +| even use value-speculation optimizations, where it makes a wrong | +| guess, but by the time it gets around to checking the value, an | +| update has changed the pointer to match the wrong guess. Too bad | +| about any dereferences that returned pre-initialization garbage in | +| the meantime! | +| For ``remove_gp_synchronous()``, as long as all modifications to | +| ``gp`` are carried out while holding ``gp_lock``, the above | +| optimizations are harmless. However, ``sparse`` will complain if you | +| define ``gp`` with ``__rcu`` and then access it without using either | +| ``rcu_access_pointer()`` or ``rcu_dereference()``. | ++-----------------------------------------------------------------------+ + +In short, RCU's publish-subscribe guarantee is provided by the +combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This +guarantee allows data elements to be safely added to RCU-protected +linked data structures without disrupting RCU readers. This guarantee +can be used in combination with the grace-period guarantee to also allow +data elements to be removed from RCU-protected linked data structures, +again without disrupting RCU readers. + +This guarantee was only partially premeditated. DYNIX/ptx used an +explicit memory barrier for publication, but had nothing resembling +``rcu_dereference()`` for subscription, nor did it have anything +resembling the dependency-ordering barrier that was later subsumed +into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The +need for these operations made itself known quite suddenly at a +late-1990s meeting with the DEC Alpha architects, back in the days when +DEC was still a free-standing company. It took the Alpha architects a +good hour to convince me that any sort of barrier would ever be needed, +and it then took me a good *two* hours to convince them that their +documentation did not make this point clear. More recent work with the C +and C++ standards committees have provided much education on tricks and +traps from the compiler. In short, compilers were much less tricky in +the early 1990s, but in 2015, don't even think about omitting +``rcu_dereference()``! + +Memory-Barrier Guarantees +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The previous section's simple linked-data-structure scenario clearly +demonstrates the need for RCU's stringent memory-ordering guarantees on +systems with more than one CPU: + +#. Each CPU that has an RCU read-side critical section that begins + before ``synchronize_rcu()`` starts is guaranteed to execute a full + memory barrier between the time that the RCU read-side critical + section ends and the time that ``synchronize_rcu()`` returns. Without + this guarantee, a pre-existing RCU read-side critical section might + hold a reference to the newly removed ``struct foo`` after the + ``kfree()`` on line 14 of ``remove_gp_synchronous()``. +#. Each CPU that has an RCU read-side critical section that ends after + ``synchronize_rcu()`` returns is guaranteed to execute a full memory + barrier between the time that ``synchronize_rcu()`` begins and the + time that the RCU read-side critical section begins. Without this + guarantee, a later RCU read-side critical section running after the + ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run + ``do_something_gp()`` and find the newly deleted ``struct foo``. +#. 
If the task invoking ``synchronize_rcu()`` remains on a given CPU, + then that CPU is guaranteed to execute a full memory barrier sometime + during the execution of ``synchronize_rcu()``. This guarantee ensures + that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really + does execute after the removal on line 11. +#. If the task invoking ``synchronize_rcu()`` migrates among a group of + CPUs during that invocation, then each of the CPUs in that group is + guaranteed to execute a full memory barrier sometime during the + execution of ``synchronize_rcu()``. This guarantee also ensures that + the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does + execute after the removal on line 11, but also in the case where the + thread executing the ``synchronize_rcu()`` migrates in the meantime. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Given that multiple CPUs can start RCU read-side critical sections at | +| any time without any ordering whatsoever, how can RCU possibly tell | +| whether or not a given RCU read-side critical section starts before a | +| given instance of ``synchronize_rcu()``? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| If RCU cannot tell whether or not a given RCU read-side critical | +| section starts before a given instance of ``synchronize_rcu()``, then | +| it must assume that the RCU read-side critical section started first. | +| In other words, a given instance of ``synchronize_rcu()`` can avoid | +| waiting on a given RCU read-side critical section only if it can | +| prove that ``synchronize_rcu()`` started first. | +| A related question is “When ``rcu_read_lock()`` doesn't generate any | +| code, why does it matter how it relates to a grace period?” The | +| answer is that it is not the relationship of ``rcu_read_lock()`` | +| itself that is important, but rather the relationship of the code | +| within the enclosed RCU read-side critical section to the code | +| preceding and following the grace period. If we take this viewpoint, | +| then a given RCU read-side critical section begins before a given | +| grace period when some access preceding the grace period observes the | +| effect of some access within the critical section, in which case none | +| of the accesses within the critical section may observe the effects | +| of any access following the grace period. | +| | +| As of late 2016, mathematical models of RCU take this viewpoint, for | +| example, see slides 62 and 63 of the `2016 LinuxCon | +| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 | +| 6.10.04c.LCE.pdf>`__ | +| presentation. | ++-----------------------------------------------------------------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| The first and second guarantees require unbelievably strict ordering! | +| Are all these memory barriers *really* required? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Yes, they really are required. To see why the first guarantee is | +| required, consider the following sequence of events: | +| | +| #. 
CPU 1: ``rcu_read_lock()`` | +| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` | +| #. CPU 0: ``list_del_rcu(p);`` | +| #. CPU 0: ``synchronize_rcu()`` starts. | +| #. CPU 1: ``do_something_with(q->a);`` | +| ``/* No smp_mb(), so might happen after kfree(). */`` | +| #. CPU 1: ``rcu_read_unlock()`` | +| #. CPU 0: ``synchronize_rcu()`` returns. | +| #. CPU 0: ``kfree(p);`` | +| | +| Therefore, there absolutely must be a full memory barrier between the | +| end of the RCU read-side critical section and the end of the grace | +| period. | +| | +| The sequence of events demonstrating the necessity of the second rule | +| is roughly similar: | +| | +| #. CPU 0: ``list_del_rcu(p);`` | +| #. CPU 0: ``synchronize_rcu()`` starts. | +| #. CPU 1: ``rcu_read_lock()`` | +| #. CPU 1: ``q = rcu_dereference(gp);`` | +| ``/* Might return p if no memory barrier. */`` | +| #. CPU 0: ``synchronize_rcu()`` returns. | +| #. CPU 0: ``kfree(p);`` | +| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` | +| #. CPU 1: ``rcu_read_unlock()`` | +| | +| And similarly, without a memory barrier between the beginning of the | +| grace period and the beginning of the RCU read-side critical section, | +| CPU 1 might end up accessing the freelist. | +| | +| The “as if” rule of course applies, so that any implementation that | +| acts as if the appropriate memory barriers were in place is a correct | +| implementation. That said, it is much easier to fool yourself into | +| believing that you have adhered to the as-if rule than it is to | +| actually adhere to it! | ++-----------------------------------------------------------------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | +| absolutely no code in some kernel builds. This means that the | +| compiler might arbitrarily rearrange consecutive RCU read-side | +| critical sections. Given such rearrangement, if a given RCU read-side | +| critical section is done, how can you be sure that all prior RCU | +| read-side critical sections are done? Won't the compiler | +| rearrangements make that impossible to determine? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | +| absolutely no code, RCU infers quiescent states only at special | +| locations, for example, within the scheduler. Because calls to | +| ``schedule()`` had better prevent calling-code accesses to shared | +| variables from being rearranged across the call to ``schedule()``, if | +| RCU detects the end of a given RCU read-side critical section, it | +| will necessarily detect the end of all prior RCU read-side critical | +| sections, no matter how aggressively the compiler scrambles the code. | +| Again, this all assumes that the compiler cannot scramble code across | +| calls to the scheduler, out of interrupt handlers, into the idle | +| loop, into user-mode code, and so on. But if your kernel build allows | +| that sort of scrambling, you have broken far more than just RCU! 
| ++-----------------------------------------------------------------------+ + +Note that these memory-barrier requirements do not replace the +fundamental RCU requirement that a grace period wait for all +pre-existing readers. On the contrary, the memory barriers called out in +this section must operate in such a way as to *enforce* this fundamental +requirement. Of course, different implementations enforce this +requirement in different ways, but enforce it they must. + +RCU Primitives Guaranteed to Execute Unconditionally +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The common-case RCU primitives are unconditional. They are invoked, they +do their job, and they return, with no possibility of error, and no need +to retry. This is a key RCU design philosophy. + +However, this philosophy is pragmatic rather than pigheaded. If someone +comes up with a good justification for a particular conditional RCU +primitive, it might well be implemented and added. After all, this +guarantee was reverse-engineered, not premeditated. The unconditional +nature of the RCU primitives was initially an accident of +implementation, and later experience with synchronization primitives +with conditional primitives caused me to elevate this accident to a +guarantee. Therefore, the justification for adding a conditional +primitive to RCU would need to be based on detailed and compelling use +cases. + +Guaranteed Read-to-Write Upgrade +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As far as RCU is concerned, it is always possible to carry out an update +within an RCU read-side critical section. For example, that RCU +read-side critical section might search for a given data element, and +then might acquire the update-side spinlock in order to update that +element, all while remaining in that RCU read-side critical section. Of +course, it is necessary to exit the RCU read-side critical section +before invoking ``synchronize_rcu()``, however, this inconvenience can +be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API +members described later in this document. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But how does the upgrade-to-write operation exclude other readers? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| It doesn't, just like normal RCU updates, which also do not exclude | +| RCU readers. | ++-----------------------------------------------------------------------+ + +This guarantee allows lookup code to be shared between read-side and +update-side code, and was premeditated, appearing in the earliest +DYNIX/ptx RCU documentation. + +Fundamental Non-Requirements +---------------------------- + +RCU provides extremely lightweight readers, and its read-side +guarantees, though quite useful, are correspondingly lightweight. It is +therefore all too easy to assume that RCU is guaranteeing more than it +really is. Of course, the list of things that RCU does not guarantee is +infinitely long, however, the following sections list a few +non-guarantees that have caused confusion. Except where otherwise noted, +these non-guarantees were premeditated. + +#. `Readers Impose Minimal Ordering`_ +#. `Readers Do Not Exclude Updaters`_ +#. `Updaters Only Wait For Old Readers`_ +#. `Grace Periods Don't Partition Read-Side Critical Sections`_ +#. 
`Read-Side Critical Sections Don't Partition Grace Periods`_ + +Readers Impose Minimal Ordering +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Reader-side markers such as ``rcu_read_lock()`` and +``rcu_read_unlock()`` provide absolutely no ordering guarantees except +through their interaction with the grace-period APIs such as +``synchronize_rcu()``. To see this, consider the following pair of +threads: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(x, 1); + 5 rcu_read_unlock(); + 6 rcu_read_lock(); + 7 WRITE_ONCE(y, 1); + 8 rcu_read_unlock(); + 9 } + 10 + 11 void thread1(void) + 12 { + 13 rcu_read_lock(); + 14 r1 = READ_ONCE(y); + 15 rcu_read_unlock(); + 16 rcu_read_lock(); + 17 r2 = READ_ONCE(x); + 18 rcu_read_unlock(); + 19 } + +After ``thread0()`` and ``thread1()`` execute concurrently, it is quite +possible to have + + :: + + (r1 == 1 && r2 == 0) + +(that is, ``y`` appears to have been assigned before ``x``), which would +not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had +much in the way of ordering properties. But they do not, so the CPU is +within its rights to do significant reordering. This is by design: Any +significant ordering constraints would slow down these fast-path APIs. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Can't the compiler also reorder this code? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()`` | +| prevent the compiler from reordering in this particular case. | ++-----------------------------------------------------------------------+ + +Readers Do Not Exclude Updaters +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates. +All they do is to prevent grace periods from ending. The following +example illustrates this: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 r1 = READ_ONCE(y); + 5 if (r1) { + 6 do_something_with_nonzero_x(); + 7 r2 = READ_ONCE(x); + 8 WARN_ON(!r2); /* BUG!!! */ + 9 } + 10 rcu_read_unlock(); + 11 } + 12 + 13 void thread1(void) + 14 { + 15 spin_lock(&my_lock); + 16 WRITE_ONCE(x, 1); + 17 WRITE_ONCE(y, 1); + 18 spin_unlock(&my_lock); + 19 } + +If the ``thread0()`` function's ``rcu_read_lock()`` excluded the +``thread1()`` function's update, the ``WARN_ON()`` could never fire. But +the fact is that ``rcu_read_lock()`` does not exclude much of anything +aside from subsequent grace periods, of which ``thread1()`` has none, so +the ``WARN_ON()`` can and does fire. + +Updaters Only Wait For Old Readers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It might be tempting to assume that after ``synchronize_rcu()`` +completes, there are no readers executing. This temptation must be +avoided because new readers can start immediately after +``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no +obligation to wait for these new readers. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Suppose that synchronize_rcu() did wait until *all* readers had | +| completed instead of waiting only on pre-existing readers. For how | +| long would the updater be able to rely on there being no readers? 
| ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| For no time at all. Even if ``synchronize_rcu()`` were to wait until | +| all readers had completed, a new reader might start immediately after | +| ``synchronize_rcu()`` completed. Therefore, the code following | +| ``synchronize_rcu()`` can *never* rely on there being no readers. | ++-----------------------------------------------------------------------+ + +Grace Periods Don't Partition Read-Side Critical Sections +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is tempting to assume that if any part of one RCU read-side critical +section precedes a given grace period, and if any part of another RCU +read-side critical section follows that same grace period, then all of +the first RCU read-side critical section must precede all of the second. +However, this just isn't the case: A single grace period does not +partition the set of RCU read-side critical sections. An example of this +situation can be illustrated as follows, where ``x``, ``y``, and ``z`` +are initially all zero: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 rcu_read_lock(); + 19 r2 = READ_ONCE(b); + 20 r3 = READ_ONCE(c); + 21 rcu_read_unlock(); + 22 } + +It turns out that the outcome: + + :: + + (r1 == 1 && r2 == 0 && r3 == 1) + +is entirely possible. The following figure show how this can happen, +with each circled ``QS`` indicating the point at which RCU recorded a +*quiescent state* for each thread, that is, a state in which RCU knows +that the thread cannot be in the midst of an RCU read-side critical +section that started before the current grace period: + +.. kernel-figure:: GPpartitionReaders1.svg + +If it is necessary to partition RCU read-side critical sections in this +manner, it is necessary to use two grace periods, where the first grace +period is known to end before the second grace period starts: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 r2 = READ_ONCE(c); + 19 synchronize_rcu(); + 20 WRITE_ONCE(d, 1); + 21 } + 22 + 23 void thread3(void) + 24 { + 25 rcu_read_lock(); + 26 r3 = READ_ONCE(b); + 27 r4 = READ_ONCE(d); + 28 rcu_read_unlock(); + 29 } + +Here, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen +before the end of ``thread1()``'s grace period. If in addition +``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after +the beginning of ``thread2()``'s grace period. If it is also the case +that ``(r2 == 1)``, then the end of ``thread1()``'s grace period must +precede the beginning of ``thread2()``'s grace period. This mean that +the two RCU read-side critical sections cannot overlap, guaranteeing +that ``(r3 == 1)``. As a result, the outcome: + + :: + + (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) + +cannot happen. + +This non-requirement was also non-premeditated, but became apparent when +studying RCU's interaction with memory ordering. 
+ +Read-Side Critical Sections Don't Partition Grace Periods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is also tempting to assume that if an RCU read-side critical section +happens between a pair of grace periods, then those grace periods cannot +overlap. However, this temptation leads nowhere good, as can be +illustrated by the following, with all variables initially zero: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 rcu_read_lock(); + 19 WRITE_ONCE(d, 1); + 20 r2 = READ_ONCE(c); + 21 rcu_read_unlock(); + 22 } + 23 + 24 void thread3(void) + 25 { + 26 r3 = READ_ONCE(d); + 27 synchronize_rcu(); + 28 WRITE_ONCE(e, 1); + 29 } + 30 + 31 void thread4(void) + 32 { + 33 rcu_read_lock(); + 34 r4 = READ_ONCE(b); + 35 r5 = READ_ONCE(e); + 36 rcu_read_unlock(); + 37 } + +In this case, the outcome: + + :: + + (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) + +is entirely possible, as illustrated below: + +.. kernel-figure:: ReadersPartitionGP1.svg + +Again, an RCU read-side critical section can overlap almost all of a +given grace period, just so long as it does not overlap the entire grace +period. As a result, an RCU read-side critical section cannot partition +a pair of RCU grace periods. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| How long a sequence of grace periods, each separated by an RCU | +| read-side critical section, would be required to partition the RCU | +| read-side critical sections at the beginning and end of the chain? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In theory, an infinite number. In practice, an unknown number that is | +| sensitive to both implementation details and timing considerations. | +| Therefore, even in practice, RCU users must abide by the theoretical | +| rather than the practical answer. | ++-----------------------------------------------------------------------+ + +Parallelism Facts of Life +------------------------- + +These parallelism facts of life are by no means specific to RCU, but the +RCU implementation must abide by them. They therefore bear repeating: + +#. Any CPU or task may be delayed at any time, and any attempts to avoid + these delays by disabling preemption, interrupts, or whatever are + completely futile. This is most obvious in preemptible user-level + environments and in virtualized environments (where a given guest + OS's VCPUs can be preempted at any time by the underlying + hypervisor), but can also happen in bare-metal environments due to + ECC errors, NMIs, and other hardware events. Although a delay of more + than about 20 seconds can result in splats, the RCU implementation is + obligated to use algorithms that can tolerate extremely long delays, + but where “extremely long” is not long enough to allow wrap-around + when incrementing a 64-bit counter. +#. Both the compiler and the CPU can reorder memory accesses. Where it + matters, RCU must use compiler directives and memory-barrier + instructions to preserve ordering. +#. 
Conflicting writes to memory locations in any given cache line will
   result in expensive cache misses. Greater numbers of concurrent
   writes and more-frequent concurrent writes will result in more
   dramatic slowdowns. RCU is therefore obligated to use algorithms that
   have sufficient locality to avoid significant performance and
   scalability problems.
#. As a rough rule of thumb, only one CPU's worth of processing may be
   carried out under the protection of any given exclusive lock. RCU
   must therefore use scalable locking designs.
#. Counters are finite, especially on 32-bit systems. RCU's use of
   counters must therefore tolerate counter wrap, or be designed such
   that counter wrap would take way more time than a single system is
   likely to run. An uptime of ten years is quite possible, a runtime of
   a century much less so. As an example of the latter, RCU's
   dyntick-idle nesting counter allows 54 bits for interrupt nesting
   level (this counter is 64 bits even on a 32-bit system). Overflowing
   this counter requires 2\ :sup:`54` half-interrupts on a given CPU
   without that CPU ever going idle. If a half-interrupt happened every
   microsecond, it would take 570 years of runtime to overflow this
   counter, which is currently believed to be an acceptably long time.
#. Linux systems can have thousands of CPUs running a single Linux
   kernel in a single shared-memory environment. RCU must therefore pay
   close attention to high-end scalability.

This last parallelism fact of life means that RCU must pay special
attention to the preceding facts of life. The idea that Linux might
scale to systems with thousands of CPUs would have been met with some
skepticism in the 1990s, but these requirements would otherwise have
been unsurprising, even in the early 1990s.

Quality-of-Implementation Requirements
--------------------------------------

These sections list quality-of-implementation requirements. Although an
RCU implementation that ignores these requirements could still be used,
it would likely be subject to limitations that would make it
inappropriate for industrial-strength production use. Classes of
quality-of-implementation requirements are as follows:

#. `Specialization`_
#. `Performance and Scalability`_
#. `Forward Progress`_
#. `Composability`_
#. `Corner Cases`_

Each of these classes is covered in the following sections.

Specialization
~~~~~~~~~~~~~~

RCU is and always has been intended primarily for read-mostly
situations, which means that RCU's read-side primitives are optimized,
often at the expense of its update-side primitives. Experience thus far
is captured by the following list of situations:

#. Read-mostly data, where stale and inconsistent data is not a problem:
   RCU works great!
#. Read-mostly data, where data must be consistent: RCU works well.
#. Read-write data, where data must be consistent: RCU *might* work OK.
   Or not.
#. Write-mostly data, where data must be consistent: RCU is very
   unlikely to be the right tool for the job, with the following
   exceptions, where RCU can provide:

   a. Existence guarantees for update-friendly mechanisms.
   b. Wait-free read-side primitives for real-time use.

This focus on read-mostly situations means that RCU must interoperate
with other synchronization primitives. For example, the ``add_gp()`` and
``remove_gp_synchronous()`` examples discussed earlier use RCU to
protect readers and locking to coordinate updaters.
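As one common pattern, the following sketch (hypothetical ``struct
elem``, list head, and functions, not taken from any actual kernel
subsystem) combines an RCU-protected search with a per-element spinlock
that serializes updates to the element found:

::

    /* Hypothetical example: RCU-protected search, per-element locking. */
    struct elem {
        struct list_head list;
        spinlock_t lock;        /* Protects ->data. */
        int key;
        int data;
    };

    static LIST_HEAD(elem_head);    /* Updated by RCU-aware code. */

    static bool update_elem(int key, int new_data)
    {
        struct elem *p;

        rcu_read_lock();        /* RCU protects the search... */
        list_for_each_entry_rcu(p, &elem_head, list) {
            if (p->key == key) {
                spin_lock(&p->lock);    /* ...the lock, the update. */
                p->data = new_data;
                spin_unlock(&p->lock);
                rcu_read_unlock();
                return true;
            }
        }
        rcu_read_unlock();
        return false;
    }

This is also an instance of the read-to-write upgrade described earlier:
the update happens entirely within the RCU read-side critical section,
relying on the spinlock rather than RCU to exclude other updaters.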
However, the need extends much farther, requiring that a variety of
synchronization primitives be legal within RCU read-side critical
sections, including spinlocks, sequence locks, atomic operations,
reference counters, and memory barriers.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| What about sleeping locks?                                            |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| These are forbidden within Linux-kernel RCU read-side critical        |
| sections because it is not legal to place a quiescent state (in this  |
| case, voluntary context switch) within an RCU read-side critical      |
| section. However, sleeping locks may be used within userspace RCU     |
| read-side critical sections, and also within Linux-kernel sleepable   |
| RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In     |
| addition, the -rt patchset turns spinlocks into sleeping locks so     |
| that the corresponding critical sections can be preempted, which also |
| means that these sleeplockified spinlocks (but not other sleeping     |
| locks!) may be acquired within -rt-Linux-kernel RCU read-side         |
| critical sections.                                                    |
| Note that it *is* legal for a normal RCU read-side critical section   |
| to conditionally acquire a sleeping lock (as in                       |
| ``mutex_trylock()``), but only as long as it does not loop            |
| indefinitely attempting to conditionally acquire that sleeping lock.  |
| The key point is that things like ``mutex_trylock()`` either return   |
| with the mutex held, or return an error indication if the mutex was   |
| not immediately available. Either way, ``mutex_trylock()`` returns    |
| immediately without sleeping.                                         |
+-----------------------------------------------------------------------+

It often comes as a surprise that many algorithms do not require a
consistent view of data, with network routing being the poster child.
Internet routing algorithms take significant time to propagate updates,
so that by the time an update arrives at a given system, that system has
been sending network traffic the wrong way for a considerable length of
time. Having a few threads continue to send traffic the wrong way for a
few more milliseconds is clearly not a problem: In the worst case, TCP
retransmissions will eventually get the data where it needs to go. In
general, when tracking the state of the universe outside of the
computer, some level of inconsistency must be tolerated due to
speed-of-light delays if nothing else.

Furthermore, uncertainty about external state is inherent in many cases.
For example, a pair of veterinarians might use heartbeat to determine
whether or not a given cat was alive. But how long should they wait
after the last heartbeat to decide that the cat is in fact dead? Waiting
less than 400 milliseconds makes no sense because this would mean that a
relaxed cat would be considered to cycle between death and life more
than 100 times per minute. Moreover, just as with human beings, a cat's
heart might stop for some period of time, so the exact wait period is a
judgment call. One of our pair of veterinarians might wait 30 seconds
before pronouncing the cat dead, while the other might insist on waiting
a full minute.
The two veterinarians would then disagree on the state of
the cat during the final 30 seconds of the minute following the last
heartbeat.

Interestingly enough, this same situation applies to hardware. When push
comes to shove, how do we tell whether or not some external server has
failed? We send messages to it periodically, and declare it failed if we
don't receive a response within a given period of time. Policy decisions
can usually tolerate short periods of inconsistency. The policy was
decided some time ago, and is only now being put into effect, so a few
milliseconds of delay is normally inconsequential.

However, there are algorithms that absolutely must see consistent data.
For example, the translation between a user-level SystemV semaphore ID
to the corresponding in-kernel data structure is protected by RCU, but
it is absolutely forbidden to update a semaphore that has just been
removed. In the Linux kernel, this need for consistency is accommodated
by acquiring spinlocks located in the in-kernel data structure from
within the RCU read-side critical section. Many other techniques may be
used, and are in fact used within the Linux kernel.

In short, RCU is not required to maintain consistency, and other
mechanisms may be used in concert with RCU when consistency is required.
RCU's specialization allows it to do its job extremely well, and its
ability to interoperate with other synchronization mechanisms allows the
right mix of synchronization tools to be used for a given job.

Performance and Scalability
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Energy efficiency is a critical component of performance today, and
Linux-kernel RCU implementations must therefore avoid unnecessarily
awakening idle CPUs. I cannot claim that this requirement was
premeditated. In fact, I learned of it during a telephone conversation
in which I was given “frank and open” feedback on the importance of
energy efficiency in battery-powered systems and on specific
energy-efficiency shortcomings of the Linux-kernel RCU implementation.
In my experience, the battery-powered embedded community will consider
any unnecessary wakeups to be extremely unfriendly acts. So much so that
mere Linux-kernel-mailing-list posts are insufficient to vent their ire.

Memory consumption is not particularly important in most situations,
and has become decreasingly so as memory sizes have expanded and memory
costs have plummeted. However, as I learned from Matt Mackall's
`bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
footprint is critically important on single-CPU systems with
non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny
RCU <https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com>`__
was born. Josh Triplett has since taken over the small-memory banner
with his `Linux kernel tinification <https://tiny.wiki.kernel.org/>`__
project, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional
for those kernels not needing it.

The remaining performance requirements are, for the most part,
unsurprising. For example, in keeping with RCU's read-side
specialization, ``rcu_dereference()`` should have negligible overhead
(for example, suppression of a few minor compiler optimizations).
Similarly, in non-preemptible environments, ``rcu_read_lock()`` and
``rcu_read_unlock()`` should have exactly zero overhead.
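To see why exactly zero overhead is achievable, consider the following
simplified sketch (illustrative only; the kernel's actual definitions
add lockdep instrumentation and handle additional configurations):

::

    /*
     * Illustrative sketch for CONFIG_PREEMPT=n kernels only.
     * Disabling preemption in a non-preemptible kernel generates no
     * machine code, so the "lock" and "unlock" reduce to compile-time
     * barrier() directives constraining only the compiler.
     */
    static inline void sketch_rcu_read_lock(void)
    {
        preempt_disable();    /* Just barrier() when CONFIG_PREEMPT=n. */
    }

    static inline void sketch_rcu_read_unlock(void)
    {
        preempt_enable();     /* Likewise a no-op in machine code. */
    }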
+ +In preemptible environments, in the case where the RCU read-side +critical section was not preempted (as will be the case for the +highest-priority real-time process), ``rcu_read_lock()`` and +``rcu_read_unlock()`` should have minimal overhead. In particular, they +should not contain atomic read-modify-write operations, memory-barrier +instructions, preemption disabling, interrupt disabling, or backwards +branches. However, in the case where the RCU read-side critical section +was preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable +interrupts. This is why it is better to nest an RCU read-side critical +section within a preempt-disable region than vice versa, at least in +cases where that critical section is short enough to avoid unduly +degrading real-time latencies. + +The ``synchronize_rcu()`` grace-period-wait primitive is optimized for +throughput. It may therefore incur several milliseconds of latency in +addition to the duration of the longest RCU read-side critical section. +On the other hand, multiple concurrent invocations of +``synchronize_rcu()`` are required to use batching optimizations so that +they can be satisfied by a single underlying grace-period-wait +operation. For example, in the Linux kernel, it is not unusual for a +single grace-period-wait operation to serve more than `1,000 separate +invocations <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response>`__ +of ``synchronize_rcu()``, thus amortizing the per-invocation overhead +down to nearly zero. However, the grace-period optimization is also +required to avoid measurable degradation of real-time scheduling and +interrupt latencies. + +In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are +unacceptable. In these cases, ``synchronize_rcu_expedited()`` may be +used instead, reducing the grace-period latency down to a few tens of +microseconds on small systems, at least in cases where the RCU read-side +critical sections are short. There are currently no special latency +requirements for ``synchronize_rcu_expedited()`` on large systems, but, +consistent with the empirical nature of the RCU specification, that is +subject to change. However, there most definitely are scalability +requirements: A storm of ``synchronize_rcu_expedited()`` invocations on +4096 CPUs should at least make reasonable forward progress. In return +for its shorter latencies, ``synchronize_rcu_expedited()`` is permitted +to impose modest degradation of real-time latency on non-idle online +CPUs. Here, “modest” means roughly the same latency degradation as a +scheduling-clock interrupt. + +There are a number of situations where even +``synchronize_rcu_expedited()``'s reduced grace-period latency is +unacceptable. In these situations, the asynchronous ``call_rcu()`` can +be used in place of ``synchronize_rcu()`` as follows: + + :: + + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 static void remove_gp_cb(struct rcu_head *rhp) + 8 { + 9 struct foo *p = container_of(rhp, struct foo, rh); + 10 + 11 kfree(p); + 12 } + 13 + 14 bool remove_gp_asynchronous(void) + 15 { + 16 struct foo *p; + 17 + 18 spin_lock(&gp_lock); + 19 p = rcu_access_pointer(gp); + 20 if (!p) { + 21 spin_unlock(&gp_lock); + 22 return false; + 23 } + 24 rcu_assign_pointer(gp, NULL); + 25 call_rcu(&p->rh, remove_gp_cb); + 26 spin_unlock(&gp_lock); + 27 return true; + 28 } + +A definition of ``struct foo`` is finally needed, and appears on +lines 1-5. 
The function ``remove_gp_cb()`` is passed to ``call_rcu()``
on line 25, and will be invoked after the end of a subsequent grace
period. This gets the same effect as ``remove_gp_synchronous()``, but
without forcing the updater to wait for a grace period to elapse. The
``call_rcu()`` function may be used in a number of situations where
neither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would
be legal, including within preempt-disable code, ``local_bh_disable()``
code, interrupt-disable code, and interrupt handlers. However, even
``call_rcu()`` is illegal within NMI handlers and from idle and offline
CPUs. The callback function (``remove_gp_cb()`` in this case) will be
executed within a softirq (software interrupt) environment within the
Linux kernel, either within a real softirq handler or under the
protection of ``local_bh_disable()``. In both the Linux kernel and in
userspace, it is bad practice to write an RCU callback function that
takes too long. Long-running operations should be relegated to separate
threads or (in the Linux kernel) workqueues.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why does line 19 use ``rcu_access_pointer()``? After all,             |
| ``call_rcu()`` on line 25 stores into the structure, which would      |
| interact badly with concurrent insertions. Doesn't this mean that     |
| ``rcu_dereference()`` is required?                                    |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Presumably the ``gp_lock`` acquired on line 18 excludes any changes,  |
| including any insertions that ``rcu_dereference()`` would protect     |
| against. Therefore, any insertions will be delayed until after        |
| ``gp_lock`` is released on line 26, which in turn means that          |
| ``rcu_access_pointer()`` suffices.                                    |
+-----------------------------------------------------------------------+

However, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on
the data element. This is a common idiom, and is supported by
``kfree_rcu()``, which allows “fire and forget” operation as shown
below:

::

    1 struct foo {
    2   int a;
    3   int b;
    4   struct rcu_head rh;
    5 };
    6
    7 bool remove_gp_faf(void)
    8 {
    9   struct foo *p;
   10
   11   spin_lock(&gp_lock);
   12   p = rcu_dereference(gp);
   13   if (!p) {
   14     spin_unlock(&gp_lock);
   15     return false;
   16   }
   17   rcu_assign_pointer(gp, NULL);
   18   kfree_rcu(p, rh);
   19   spin_unlock(&gp_lock);
   20   return true;
   21 }

Note that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and
proceeds, without any need to pay any further attention to the
subsequent grace period and ``kfree()``. It is permissible to invoke
``kfree_rcu()`` from the same environments as for ``call_rcu()``.
Interestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()``
and ``kfree_rcu()``, but not ``synchronize_rcu()``. This was due to the
fact that RCU was not heavily used within DYNIX/ptx, so the very few
places that needed something like ``synchronize_rcu()`` simply
open-coded it.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()``        |
| allowed updaters to avoid being blocked by readers.
But how can that  |
+| be correct, given that the invocation of the callback and the freeing |
+| of the memory (respectively) must still wait for a grace period to    |
+| elapse?                                                               |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| We could define things this way, but keep in mind that this sort of   |
+| definition would say that updates in garbage-collected languages      |
+| cannot complete until the next time the garbage collector runs, which |
+| does not seem at all reasonable. The key point is that in most cases, |
+| an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed |
+| to the next update as soon as it has invoked ``call_rcu()`` or        |
+| ``kfree_rcu()``, without having to wait for a subsequent grace        |
+| period.                                                               |
++-----------------------------------------------------------------------+
+
+But what if the updater must wait for the completion of code to be
+executed after the end of the grace period, but has other tasks that can
+be carried out in the meantime? The polling-style
+``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions
+may be used for this purpose, as shown below:
+
+   ::
+
+       1 bool remove_gp_poll(void)
+       2 {
+       3   struct foo *p;
+       4   unsigned long s;
+       5
+       6   spin_lock(&gp_lock);
+       7   p = rcu_access_pointer(gp);
+       8   if (!p) {
+       9     spin_unlock(&gp_lock);
+      10     return false;
+      11   }
+      12   rcu_assign_pointer(gp, NULL);
+      13   spin_unlock(&gp_lock);
+      14   s = get_state_synchronize_rcu();
+      15   do_something_while_waiting();
+      16   cond_synchronize_rcu(s);
+      17   kfree(p);
+      18   return true;
+      19 }
+
+On line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU,
+then line 15 carries out other tasks, and finally, line 16 returns
+immediately if a grace period has elapsed in the meantime, but otherwise
+waits as required. The need for ``get_state_synchronize_rcu()`` and
+``cond_synchronize_rcu()`` has appeared quite recently, so it is too
+early to tell whether they will stand the test of time.
+
+RCU thus provides a range of tools to allow updaters to strike the
+required tradeoff between latency, flexibility, and CPU overhead.
+
+Forward Progress
+~~~~~~~~~~~~~~~~
+
+In theory, delaying grace-period completion and callback invocation is
+harmless. In practice, not only are memory sizes finite but also
+callbacks sometimes do wakeups, and sufficiently deferred wakeups can be
+difficult to distinguish from system hangs. Therefore, RCU must provide
+a number of mechanisms to promote forward progress.
+
+These mechanisms are not foolproof, nor can they be. For one simple
+example, an infinite loop in an RCU read-side critical section must by
+definition prevent later grace periods from ever completing. For a more
+involved example, consider a 64-CPU system built with
+``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
+CPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``. Even
+if these tight loops also contain calls to ``cond_resched()`` (thus
+allowing grace periods to complete), CPU 0 simply will not be able to
+invoke callbacks as fast as the other 63 CPUs can register them, at
+least not until the system runs out of memory. In both of these
+examples, the Spiderman principle applies: With great power comes great
+responsibility. However, short of this level of abuse, RCU is required
+to ensure timely completion of grace periods and timely invocation of
+callbacks.
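+
+To make the preceding abuse scenario concrete, each of CPUs 1 through 63
+might be executing something like the following sketch. The
+``struct flood`` and its two functions are hypothetical illustrations,
+not Linux-kernel code:
+
+   ::
+
+       1 struct flood {
+       2   struct rcu_head rh;
+       3 };
+       4
+       5 static void flood_cb(struct rcu_head *rhp)
+       6 {
+       7   kfree(container_of(rhp, struct flood, rh));
+       8 }
+       9
+      10 static void flood_loop(void)  /* Don't try this at home! */
+      11 {
+      12   struct flood *p;
+      13
+      14   for (;;) {
+      15     p = kmalloc(sizeof(*p), GFP_KERNEL);
+      16     if (!p)
+      17       continue;
+      18     call_rcu(&p->rh, flood_cb);
+      19     cond_resched(); /* Grace periods complete, but CPU 0 drowns. */
+      20   }
+      21 }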
+
+RCU takes the following steps to encourage timely completion of grace
+periods:
+
+#. If a grace period fails to complete within 100 milliseconds, RCU
+   causes future invocations of ``cond_resched()`` on the holdout CPUs
+   to provide an RCU quiescent state. RCU also causes those CPUs'
+   ``need_resched()`` invocations to return ``true``, but only after the
+   corresponding CPU's next scheduling-clock interrupt.
+#. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run
+   indefinitely in the kernel without scheduling-clock interrupts, which
+   defeats the above ``need_resched()`` stratagem. RCU will therefore
+   invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out
+   after 109 milliseconds.
+#. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that
+   has been preempted within an RCU read-side critical section is
+   holding out for more than 500 milliseconds, RCU will resort to
+   priority boosting.
+#. If a CPU is still holding out 10 seconds into the grace period, RCU
+   will invoke ``resched_cpu()`` on it regardless of its ``nohz_full``
+   state.
+
+The above values are defaults for systems running with ``HZ=1000``. They
+will vary as the value of ``HZ`` varies, and can also be changed using
+the relevant Kconfig options and kernel boot parameters. RCU currently
+does not do much sanity checking of these parameters, so please use
+caution when changing them. Note that these forward-progress measures
+are provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks
+RCU <#Tasks%20RCU>`__.
+
+RCU takes the following steps in ``call_rcu()`` to encourage timely
+invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has
+10,000 callbacks, or has 10,000 more callbacks than it had the last time
+encouragement was provided:
+
+#. Starts a grace period, if one is not already in progress.
+#. Forces immediate checking for quiescent states, rather than waiting
+   for three milliseconds to have elapsed since the beginning of the
+   grace period.
+#. Immediately tags the CPU's callbacks with their grace period
+   completion numbers, rather than waiting for the ``RCU_SOFTIRQ``
+   handler to get around to it.
+#. Lifts callback-execution batch limits, which speeds up callback
+   invocation at the expense of degrading realtime response.
+
+Again, these are default values when running at ``HZ=1000``, and can be
+overridden. Again, these forward-progress measures are provided only for
+RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks
+RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward
+progress for ``rcu_nocbs`` CPUs is much less well-developed, in part
+because workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke
+``call_rcu()`` relatively infrequently. If workloads emerge that need
+both ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then
+additional forward-progress work will be required.
+
+Composability
+~~~~~~~~~~~~~
+
+Composability has received much attention in recent years, perhaps in
+part due to the collision of multicore hardware with object-oriented
+techniques designed in single-threaded environments for single-threaded
+use. And in theory, RCU read-side critical sections may be composed, and
+in fact may be nested arbitrarily deeply. In practice, as with all
+real-world implementations of composable constructs, there are
+limitations.
+
+Implementations of RCU for which ``rcu_read_lock()`` and
+``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when
+``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. 
After all, there +is no overhead. Except that if all these instances of +``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the +compiler, compilation will eventually fail due to exhausting memory, +mass storage, or user patience, whichever comes first. If the nesting is +not visible to the compiler, as is the case with mutually recursive +functions each in its own translation unit, stack overflow will result. +If the nesting takes the form of loops, perhaps in the guise of tail +recursion, either the control variable will overflow or (in the Linux +kernel) you will get an RCU CPU stall warning. Nevertheless, this class +of RCU implementations is one of the most composable constructs in +existence. + +RCU implementations that explicitly track nesting depth are limited by +the nesting-depth counter. For example, the Linux kernel's preemptible +RCU limits nesting to ``INT_MAX``. This should suffice for almost all +practical purposes. That said, a consecutive pair of RCU read-side +critical sections between which there is an operation that waits for a +grace period cannot be enclosed in another RCU read-side critical +section. This is because it is not legal to wait for a grace period +within an RCU read-side critical section: To do so would result either +in deadlock or in RCU implicitly splitting the enclosing RCU read-side +critical section, neither of which is conducive to a long-lived and +prosperous kernel. + +It is worth noting that RCU is not alone in limiting composability. For +example, many transactional-memory implementations prohibit composing a +pair of transactions separated by an irrevocable operation (for example, +a network receive operation). For another example, lock-based critical +sections can be composed surprisingly freely, but only if deadlock is +avoided. + +In short, although RCU read-side critical sections are highly +composable, care is required in some situations, just as is the case for +any other composable synchronization mechanism. + +Corner Cases +~~~~~~~~~~~~ + +A given RCU workload might have an endless and intense stream of RCU +read-side critical sections, perhaps even so intense that there was +never a point in time during which there was not at least one RCU +read-side critical section in flight. RCU cannot allow this situation to +block grace periods: As long as all the RCU read-side critical sections +are finite, grace periods must also be finite. + +That said, preemptible RCU implementations could potentially result in +RCU read-side critical sections being preempted for long durations, +which has the effect of creating a long-duration RCU read-side critical +section. This situation can arise only in heavily loaded systems, but +systems using real-time priorities are of course more vulnerable. +Therefore, RCU priority boosting is provided to help deal with this +case. That said, the exact requirements on RCU priority boosting will +likely evolve as more experience accumulates. + +Other workloads might have very high update rates. Although one can +argue that such workloads should instead use something other than RCU, +the fact remains that RCU must handle such workloads gracefully. This +requirement is another factor driving batching of grace periods, but it +is also the driving force behind the checks for large numbers of queued +RCU callbacks in the ``call_rcu()`` code path. 
Finally, high update +rates should not delay RCU read-side critical sections, although some +small read-side delays can occur when using +``synchronize_rcu_expedited()``, courtesy of this function's use of +``smp_call_function_single()``. + +Although all three of these corner cases were understood in the early +1990s, a simple user-level test consisting of ``close(open(path))`` in a +tight loop in the early 2000s suddenly provided a much deeper +appreciation of the high-update-rate corner case. This test also +motivated addition of some RCU code to react to high update rates, for +example, if a given CPU finds itself with more than 10,000 RCU callbacks +queued, it will cause RCU to take evasive action by more aggressively +starting grace periods and more aggressively forcing completion of +grace-period processing. This evasive action causes the grace period to +complete more quickly, but at the cost of restricting RCU's batching +optimizations, thus increasing the CPU overhead incurred by that grace +period. + +Software-Engineering Requirements +--------------------------------- + +Between Murphy's Law and “To err is human”, it is necessary to guard +against mishaps and misuse: + +#. It is all too easy to forget to use ``rcu_read_lock()`` everywhere + that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will + splat if ``rcu_dereference()`` is used outside of an RCU read-side + critical section. Update-side code can use + ``rcu_dereference_protected()``, which takes a `lockdep + expression <https://lwn.net/Articles/371986/>`__ to indicate what is + providing the protection. If the indicated protection is not + provided, a lockdep splat is emitted. + Code shared between readers and updaters can use + ``rcu_dereference_check()``, which also takes a lockdep expression, + and emits a lockdep splat if neither ``rcu_read_lock()`` nor the + indicated protection is in place. In addition, + ``rcu_dereference_raw()`` is used in those (hopefully rare) cases + where the required protection cannot be easily described. Finally, + ``rcu_read_lock_held()`` is provided to allow a function to verify + that it has been invoked within an RCU read-side critical section. I + was made aware of this set of requirements shortly after Thomas + Gleixner audited a number of RCU uses. +#. A given function might wish to check for RCU-related preconditions + upon entry, before using any other RCU API. The + ``rcu_lockdep_assert()`` does this job, asserting the expression in + kernels having lockdep enabled and doing nothing otherwise. +#. It is also easy to forget to use ``rcu_assign_pointer()`` and + ``rcu_dereference()``, perhaps (incorrectly) substituting a simple + assignment. To catch this sort of error, a given RCU-protected + pointer may be tagged with ``__rcu``, after which sparse will + complain about simple-assignment accesses to that pointer. Arnd + Bergmann made me aware of this requirement, and also supplied the + needed `patch series <https://lwn.net/Articles/376011/>`__. +#. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if + a data element is passed to ``call_rcu()`` twice in a row, without a + grace period in between. (This error is similar to a double free.) + The corresponding ``rcu_head`` structures that are dynamically + allocated are automatically tracked, but ``rcu_head`` structures + allocated on the stack must be initialized with + ``init_rcu_head_on_stack()`` and cleaned up with + ``destroy_rcu_head_on_stack()``. 
Similarly, statically allocated
+   non-stack ``rcu_head`` structures must be initialized with
+   ``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``.
+   Mathieu Desnoyers made me aware of this requirement, and also
+   supplied the needed
+   `patch <https://lkml.kernel.org/g/20100319013024.GA28456@Krystal>`__.
+#. An infinite loop in an RCU read-side critical section will eventually
+   trigger an RCU CPU stall warning splat, with the duration of
+   “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT``
+   ``Kconfig`` option, or, alternatively, by the
+   ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU
+   is not obligated to produce this splat unless there is a grace period
+   waiting on that particular RCU read-side critical section.
+
+   Some extreme workloads might intentionally delay RCU grace periods,
+   and systems running those workloads can be booted with
+   ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This
+   kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU
+   stall warnings are counter-productive during sysrq dumps and during
+   panics. RCU therefore supplies the ``rcu_sysrq_start()`` and
+   ``rcu_sysrq_end()`` API members to be called before and after long
+   sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is
+   automatically invoked at the beginning of a panic to suppress further
+   RCU CPU stall warnings.
+
+   This requirement made itself known in the early 1990s, pretty much
+   the first time that it was necessary to debug a CPU stall. That said,
+   the initial implementation in DYNIX/ptx was quite generic in
+   comparison with that of Linux.
+
+#. Although it would be very good to detect pointers leaking out of RCU
+   read-side critical sections, there is currently no good way of doing
+   this. One complication is the need to distinguish between pointers
+   leaking and pointers that have been handed off from RCU to some other
+   synchronization mechanism, for example, reference counting.
+#. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
+   is provided via event tracing.
+#. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()``
+   to create typical linked data structures can be surprisingly
+   error-prone. Therefore, RCU-protected `linked
+   lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and,
+   more recently, RCU-protected `hash
+   tables <https://lwn.net/Articles/612100/>`__ are available. Many
+   other special-purpose RCU-protected data structures are available in
+   the Linux kernel and the userspace RCU library.
+#. Some linked structures are created at compile time, but still require
+   ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves
+   this purpose.
+#. It is not necessary to use ``rcu_assign_pointer()`` when creating
+   linked structures that are to be published via a single external
+   pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task
+   and also for assigning ``NULL`` pointers at runtime.
+
+This is not a hard-and-fast list: RCU's diagnostic capabilities will
+continue to be guided by the number and type of usage bugs found in
+real-world RCU usage.
+
+Linux Kernel Complications
+--------------------------
+
+The Linux kernel provides an interesting environment for all kinds of
+software, including RCU. Some of the relevant points of interest are as
+follows:
+
+#. `Configuration`_
+#. `Firmware Interface`_
+#. `Early Boot`_
+#. `Interrupts and NMIs`_
+#. `Loadable Modules`_
+#. `Hotplug CPU`_
+#. `Scheduler and RCU`_
+#. 
`Tracing and RCU`_
+#. `Accesses to User Memory and RCU`_
+#. `Energy Efficiency`_
+#. `Scheduling-Clock Interrupts and RCU`_
+#. `Memory Efficiency`_
+#. `Performance, Scalability, Response Time, and Reliability`_
+
+This list is probably incomplete, but it does give a feel for the most
+notable Linux-kernel complications. Each of the following sections
+covers one of the above topics.
+
+Configuration
+~~~~~~~~~~~~~
+
+RCU's goal is automatic configuration, so that almost nobody needs to
+worry about RCU's ``Kconfig`` options. And for almost all users, RCU
+does in fact work well “out of the box.”
+
+However, there are specialized use cases that are handled by kernel boot
+parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig``
+system will explicitly ask users about new ``Kconfig`` options, which
+requires that almost all of them be hidden behind a
+``CONFIG_RCU_EXPERT`` ``Kconfig`` option.
+
+This all should be quite obvious, but the fact remains that Linus
+Torvalds recently had to
+`remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com>`__
+me of this requirement.
+
+Firmware Interface
+~~~~~~~~~~~~~~~~~~
+
+In many cases, the kernel obtains information about the system from
+the firmware, and sometimes things are lost in translation. Or the
+translation is accurate, but the original message is bogus.
+
+For example, some systems' firmware overreports the number of CPUs,
+sometimes by a large factor. If RCU naively believed the firmware, as it
+used to do, it would create too many per-CPU kthreads. Although the
+resulting system will still run correctly, the extra kthreads needlessly
+consume memory and can cause confusion when they show up in ``ps``
+listings.
+
+RCU must therefore wait for a given CPU to actually come online before
+it can allow itself to believe that the CPU actually exists. The
+resulting “ghost CPUs” (which are never going to come online) cause a
+number of `interesting
+complications <https://paulmck.livejournal.com/37494.html>`__.
+
+Early Boot
+~~~~~~~~~~
+
+The Linux kernel's boot sequence is an interesting process, and RCU is
+used early, even before ``rcu_init()`` is invoked. In fact, a number of
+RCU's primitives can be used as soon as the initial task's
+``task_struct`` is available and the boot CPU's per-CPU variables are
+set up. The read-side primitives (``rcu_read_lock()``,
+``rcu_read_unlock()``, ``rcu_dereference()``, and
+``rcu_access_pointer()``) will operate normally very early on, as will
+``rcu_assign_pointer()``.
+
+Although ``call_rcu()`` may be invoked at any time during boot,
+callbacks are not guaranteed to be invoked until after all of RCU's
+kthreads have been spawned, which occurs at ``early_initcall()`` time.
+This delay in callback invocation is due to the fact that RCU does not
+invoke callbacks until it is fully initialized, and this full
+initialization cannot occur until after the scheduler has initialized
+itself to the point where RCU can spawn and run its kthreads. In theory,
+it would be possible to invoke callbacks earlier, however, this is not a
+panacea because there would be severe restrictions on what operations
+those callbacks could invoke.
+
+Perhaps surprisingly, ``synchronize_rcu()`` and
+``synchronize_rcu_expedited()`` will operate normally during very early
+boot, the reason being that there is only one CPU and preemption is
+disabled. 
This means that the call ``synchronize_rcu()`` (or friends) +itself is a quiescent state and thus a grace period, so the early-boot +implementation can be a no-op. + +However, once the scheduler has spawned its first kthread, this early +boot trick fails for ``synchronize_rcu()`` (as well as for +``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The +reason is that an RCU read-side critical section might be preempted, +which means that a subsequent ``synchronize_rcu()`` really does have to +wait for something, as opposed to simply returning immediately. +Unfortunately, ``synchronize_rcu()`` can't do this until all of its +kthreads are spawned, which doesn't happen until some time during +``early_initcalls()`` time. But this is no excuse: RCU is nevertheless +required to correctly handle synchronous grace periods during this time +period. Once all of its kthreads are up and running, RCU starts running +normally. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| How can RCU possibly handle grace periods before all of its kthreads | +| have been spawned??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Very carefully! | +| During the “dead zone” between the time that the scheduler spawns the | +| first task and the time that all of RCU's kthreads have been spawned, | +| all synchronous grace periods are handled by the expedited | +| grace-period mechanism. At runtime, this expedited mechanism relies | +| on workqueues, but during the dead zone the requesting task itself | +| drives the desired expedited grace period. Because dead-zone | +| execution takes place within task context, everything works. Once the | +| dead zone ends, expedited grace periods go back to using workqueues, | +| as is required to avoid problems that would otherwise occur when a | +| user task received a POSIX signal while driving an expedited grace | +| period. | +| | +| And yes, this does mean that it is unhelpful to send POSIX signals to | +| random tasks between the time that the scheduler spawns its first | +| kthread and the time that RCU's kthreads have all been spawned. If | +| there ever turns out to be a good reason for sending POSIX signals | +| during that time, appropriate adjustments will be made. (If it turns | +| out that POSIX signals are sent during this time for no good reason, | +| other adjustments will be made, appropriate or otherwise.) | ++-----------------------------------------------------------------------+ + +I learned of these boot-time requirements as a result of a series of +system hangs. + +Interrupts and NMIs +~~~~~~~~~~~~~~~~~~~ + +The Linux kernel has interrupts, and RCU read-side critical sections are +legal within interrupt handlers and within interrupt-disabled regions of +code, as are invocations of ``call_rcu()``. + +Some Linux-kernel architectures can enter an interrupt handler from +non-idle process context, and then just never leave it, instead +stealthily transitioning back to process context. This trick is +sometimes used to invoke system calls from inside the kernel. These +“half-interrupts” mean that RCU has to be very careful about how it +counts interrupt nesting levels. I learned of this requirement the hard +way during a rewrite of RCU's dyntick-idle code. 
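+
+For example, an RCU reader in a hardware interrupt handler might look
+something like the following sketch, which reuses ``gp`` and
+``struct foo`` from the earlier examples and assumes a hypothetical
+``do_something_quickly()`` helper:
+
+   ::
+
+       1 static irqreturn_t my_isr(int irq, void *dev_id)
+       2 {
+       3   struct foo *p;
+       4
+       5   rcu_read_lock();  /* Legal in hardirq context. */
+       6   p = rcu_dereference(gp);
+       7   if (p)
+       8     do_something_quickly(p->a);
+       9   rcu_read_unlock();
+      10   return IRQ_HANDLED;
+      11 }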
+
+The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
+critical sections are legal within NMI handlers. Thankfully, RCU
+update-side primitives, including ``call_rcu()``, are prohibited within
+NMI handlers.
+
+The name notwithstanding, some Linux-kernel architectures can have
+nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised
+me <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
+with this requirement; he also kindly surprised me with `an
+algorithm <https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com>`__
+that meets this requirement.
+
+Furthermore, NMI handlers can be interrupted by what appear to RCU to be
+normal interrupts. One way that this can happen is for code that
+directly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called
+from an NMI handler. This astonishing fact of life prompted the current
+code structure, which has ``rcu_irq_enter()`` invoking
+``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``.
+And yes, I also learned of this requirement the hard way.
+
+Loadable Modules
+~~~~~~~~~~~~~~~~
+
+The Linux kernel has loadable modules, and these modules can also be
+unloaded. After a given module has been unloaded, any attempt to call
+one of its functions results in a segmentation fault. The module-unload
+functions must therefore cancel any delayed calls to loadable-module
+functions, for example, any outstanding ``mod_timer()`` must be dealt
+with via ``del_timer_sync()`` or similar.
+
+Unfortunately, there is no way to cancel an RCU callback; once you
+invoke ``call_rcu()``, the callback function is eventually going to be
+invoked, unless the system goes down first. Because it is normally
+considered socially irresponsible to crash the system in response to a
+module unload request, we need some other way to deal with in-flight RCU
+callbacks.
+
+RCU therefore provides ``rcu_barrier()``, which waits until all
+in-flight RCU callbacks have been invoked. If a module uses
+``call_rcu()``, its exit function should therefore prevent any future
+invocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory,
+the underlying module-unload code could invoke ``rcu_barrier()``
+unconditionally, but in practice this would incur unacceptable
+latencies.
+
+Nikita Danilov noted this requirement for an analogous
+filesystem-unmount situation, and Dipankar Sarma incorporated
+``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module
+unloading became apparent later.
+
+.. important::
+
+   The ``rcu_barrier()`` function is not, repeat,
+   *not*, obligated to wait for a grace period. It is instead only required
+   to wait for RCU callbacks that have already been posted. Therefore, if
+   there are no RCU callbacks posted anywhere in the system,
+   ``rcu_barrier()`` is within its rights to return immediately. Even if
+   there are callbacks posted, ``rcu_barrier()`` does not necessarily need
+   to wait for a grace period.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| Wait a minute! Each RCU callback must wait for a grace period to      |
+| complete, and ``rcu_barrier()`` must wait for each pre-existing       |
+| callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to   |
+| wait for a full grace period if there is even one callback posted     |
+| anywhere in the system?                                               |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| Absolutely not!!!                                                     |
+| Yes, each RCU callback must wait for a grace period to complete, but  |
+| it might well be partly (or even completely) finished waiting by the  |
+| time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()``    |
+| need only wait for the remaining portion of the grace period to       |
+| elapse. So even if there are quite a few callbacks posted,            |
+| ``rcu_barrier()`` might well return quite quickly.                    |
+|                                                                       |
+| So if you need to wait for a grace period as well as for all          |
+| pre-existing callbacks, you will need to invoke both                  |
+| ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, |
+| you can always use workqueues to invoke them concurrently.            |
++-----------------------------------------------------------------------+
+
+Hotplug CPU
+~~~~~~~~~~~
+
+The Linux kernel supports CPU hotplug, which means that CPUs can come
+and go. It is of course illegal to use any RCU API member from an
+offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side
+critical sections. This requirement was present from day one in
+DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
+implementation is “interesting.”
+
+The Linux-kernel CPU-hotplug implementation has notifiers that are used
+to allow the various kernel subsystems (including RCU) to respond
+appropriately to a given CPU-hotplug operation. Most RCU operations may
+be invoked from CPU-hotplug notifiers, including even synchronous
+grace-period operations such as ``synchronize_rcu()`` and
+``synchronize_rcu_expedited()``.
+
+However, all-callback-wait operations such as ``rcu_barrier()`` are
+not supported, due to the fact that there are phases of CPU-hotplug
+operations where the outgoing CPU's callbacks will not be invoked until
+after the CPU-hotplug operation ends, which could result in deadlock.
+Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations
+during its execution, which results in another type of deadlock when
+invoked from a CPU-hotplug notifier.
+
+Scheduler and RCU
+~~~~~~~~~~~~~~~~~
+
+RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
+accumulation by these kthreads. This requirement was no surprise, but
+RCU's violation of it when running context-switch-heavy workloads on
+kernels built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
+[PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__.
+RCU has made good progress towards meeting this requirement, even for
+context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
+room for further improvement.
+
+There is no longer any prohibition against holding any of the
+scheduler's runqueue or priority-inheritance spinlocks across an
+``rcu_read_unlock()``, even if interrupts and preemption were enabled
+somewhere within the corresponding RCU read-side critical section.
+Therefore, it is now perfectly legal to execute ``rcu_read_lock()``
+with preemption enabled, acquire one of the scheduler locks, and hold
+that lock across the matching ``rcu_read_unlock()``.
+
+Similarly, the RCU flavor consolidation has removed the need for negative
+nesting. The fact that interrupt-disabled regions of code act as RCU
+read-side critical sections implicitly avoids earlier issues that used
+to result in destructive recursion via interrupt handlers' use of RCU.
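+
+For example, the following fragment is now perfectly legal, where
+``some_lock`` is a hypothetical stand-in for one of the scheduler's
+runqueue or priority-inheritance locks and ``do_something_with()`` is
+again a hypothetical helper:
+
+   ::
+
+       1 rcu_read_lock();  /* Preemption may be enabled here. */
+       2 p = rcu_dereference(gp);
+       3 raw_spin_lock_irqsave(&some_lock, flags);
+       4 if (p)
+       5   do_something_with(p->a);
+       6 rcu_read_unlock();  /* Legal while holding some_lock. */
+       7 raw_spin_unlock_irqrestore(&some_lock, flags);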
+
+Tracing and RCU
+~~~~~~~~~~~~~~~
+
+It is possible to use tracing on RCU code, but tracing itself uses RCU.
+For this reason, ``rcu_dereference_raw_check()`` is provided for use
+by tracing, which avoids the destructive recursion that could otherwise
+ensue. This API is also used by virtualization in some architectures,
+where RCU readers execute in environments in which tracing cannot be
+used. The tracing folks both located the requirement and provided the
+needed fix, so this surprise requirement was relatively painless.
+
+Accesses to User Memory and RCU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The kernel needs to access user-space memory, for example, to access data
+referenced by system-call parameters. The ``get_user()`` macro does this job.
+
+However, user-space memory might well be paged out, which means that
+``get_user()`` might well page-fault and thus block while waiting for the
+resulting I/O to complete. It would be a very bad thing for the compiler to
+reorder a ``get_user()`` invocation into an RCU read-side critical section.
+
+For example, suppose that the source code looked like this:
+
+   ::
+
+       1 rcu_read_lock();
+       2 p = rcu_dereference(gp);
+       3 v = p->value;
+       4 rcu_read_unlock();
+       5 get_user(user_v, user_p);
+       6 do_something_with(v, user_v);
+
+The compiler must not be permitted to transform this source code into
+the following:
+
+   ::
+
+       1 rcu_read_lock();
+       2 p = rcu_dereference(gp);
+       3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
+       4 v = p->value;
+       5 rcu_read_unlock();
+       6 do_something_with(v, user_v);
+
+If the compiler did make this transformation in a ``CONFIG_PREEMPT=n`` kernel
+build, and if ``get_user()`` did page fault, the result would be a quiescent
+state in the middle of an RCU read-side critical section. This misplaced
+quiescent state could result in line 4 being a use-after-free access,
+which could be bad for your kernel's actuarial statistics. Similar examples
+can be constructed with the call to ``get_user()`` preceding the
+``rcu_read_lock()``.
+
+Unfortunately, ``get_user()`` doesn't have any particular ordering properties,
+and in some architectures the underlying ``asm`` isn't even marked
+``volatile``. And even if it were marked ``volatile``, the above access to
+``p->value`` is not volatile, so the compiler would not have any reason to keep
+those two accesses in order.
+
+Therefore, the Linux-kernel definitions of ``rcu_read_lock()`` and
+``rcu_read_unlock()`` must act as compiler barriers, at least for outermost
+instances of ``rcu_read_lock()`` and ``rcu_read_unlock()`` within a nested set
+of RCU read-side critical sections.
+
+Energy Efficiency
+~~~~~~~~~~~~~~~~~
+
+Interrupting idle CPUs is considered socially unacceptable, especially
+by people with battery-powered embedded systems. RCU therefore conserves
+energy by detecting which CPUs are idle, including tracking CPUs that
+have been interrupted from idle. This is a large part of the
+energy-efficiency requirement, which I learned of via an irate phone
+call.
+
+Because RCU avoids interrupting idle CPUs, it is illegal to execute an
+RCU read-side critical section on an idle CPU. (Kernels built with
+``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()``
+macro and ``_rcuidle`` event tracing are provided to work around this
+restriction. In addition, ``rcu_is_watching()`` may be used to test
+whether or not it is currently legal to run RCU read-side critical
+sections on this CPU. 
I learned of the need for diagnostics on the one +hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code. +Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite +heavily in the idle loop. However, there are some restrictions on the +code placed within ``RCU_NONIDLE()``: + +#. Blocking is prohibited. In practice, this is not a serious + restriction given that idle tasks are prohibited from blocking to + begin with. +#. Although nesting ``RCU_NONIDLE()`` is permitted, they cannot nest + indefinitely deeply. However, given that they can be nested on the + order of a million deep, even on 32-bit systems, this should not be a + serious restriction. This nesting limit would probably be reached + long after the compiler OOMed or the stack overflowed. +#. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that + same ``RCU_NONIDLE()``. For example, the following is grossly + illegal: + + :: + + 1 RCU_NONIDLE({ + 2 do_something(); + 3 goto bad_idea; /* BUG!!! */ + 4 do_something_else();}); + 5 bad_idea: + + + It is just as illegal to transfer control into the middle of + ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in + as long as you also transferred out, but in practice you could also + expect to get sharply worded review comments. + +It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU +running in userspace. RCU must therefore track ``nohz_full`` userspace +execution. RCU must therefore be able to sample state at two points in +time, and be able to determine whether or not some other CPU spent any +time idle and/or executing in userspace. + +These energy-efficiency requirements have proven quite difficult to +understand and to meet, for example, there have been more than five +clean-sheet rewrites of RCU's energy-efficiency code, the last of which +was finally able to demonstrate `real energy savings running on real +hardware +[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__. +As noted earlier, I learned of many of these requirements via angry +phone calls: Flaming me on the Linux-kernel mailing list was apparently +not sufficient to fully vent their ire at RCU's energy-efficiency bugs! + +Scheduling-Clock Interrupts and RCU +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The kernel transitions between in-kernel non-idle execution, userspace +execution, and the idle loop. Depending on kernel configuration, RCU +handles these states differently: + ++-----------------+------------------+------------------+-----------------+ +| ``HZ`` Kconfig | In-Kernel | Usermode | Idle | ++=================+==================+==================+=================+ +| ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on | +| | scheduling-clock | scheduling-clock | RCU's | +| | interrupt. | interrupt and | dyntick-idle | +| | | its detection | detection. | +| | | of interrupt | | +| | | from usermode. | | ++-----------------+------------------+------------------+-----------------+ +| ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on | +| | scheduling-clock | scheduling-clock | RCU's | +| | interrupt. | interrupt and | dyntick-idle | +| | | its detection | detection. | +| | | of interrupt | | +| | | from usermode. | | ++-----------------+------------------+------------------+-----------------+ +| ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on | +| | sometimes rely | RCU's | RCU's | +| | on | dyntick-idle | dyntick-idle | +| | scheduling-clock | detection. | detection. 
| +| | interrupt. In | | | +| | other cases, it | | | +| | is necessary to | | | +| | bound kernel | | | +| | execution times | | | +| | and/or use | | | +| | IPIs. | | | ++-----------------+------------------+------------------+-----------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why can't ``NO_HZ_FULL`` in-kernel execution rely on the | +| scheduling-clock interrupt, just like ``HZ_PERIODIC`` and | +| ``NO_HZ_IDLE`` do? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because, as a performance optimization, ``NO_HZ_FULL`` does not | +| necessarily re-enable the scheduling-clock interrupt on entry to each | +| and every system call. | ++-----------------------------------------------------------------------+ + +However, RCU must be reliably informed as to whether any given CPU is +currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that +CPU is executing in usermode, as discussed +`earlier <#Energy%20Efficiency>`__. It also requires that the +scheduling-clock interrupt be enabled when RCU needs it to be: + +#. If a CPU is either idle or executing in usermode, and RCU believes it + is non-idle, the scheduling-clock tick had better be running. + Otherwise, you will get RCU CPU stall warnings. Or at best, very long + (11-second) grace periods, with a pointless IPI waking the CPU from + time to time. +#. If a CPU is in a portion of the kernel that executes RCU read-side + critical sections, and RCU believes this CPU to be idle, you will get + random memory corruption. **DON'T DO THIS!!!** + This is one reason to test with lockdep, which will complain about + this sort of thing. +#. If a CPU is in a portion of the kernel that is absolutely positively + no-joking guaranteed to never execute any RCU read-side critical + sections, and RCU believes this CPU to be idle, no problem. This + sort of thing is used by some architectures for light-weight + exception handlers, which can then avoid the overhead of + ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and + exit, respectively. Some go further and avoid the entireties of + ``irq_enter()`` and ``irq_exit()``. + Just make very sure you are running some of your tests with + ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in + fact joking about not doing RCU read-side critical sections. +#. If a CPU is executing in the kernel with the scheduling-clock + interrupt disabled and RCU believes this CPU to be non-idle, and if + the CPU goes idle (from an RCU perspective) every few jiffies, no + problem. It is usually OK for there to be the occasional gap between + idle periods of up to a second or so. + If the gap grows too long, you get RCU CPU stall warnings. +#. If a CPU is either idle or executing in usermode, and RCU believes it + to be idle, of course no problem. +#. If a CPU is executing in the kernel, the kernel code path is passing + through quiescent states at a reasonable frequency (preferably about + once per few jiffies, but the occasional excursion to a second or so + is usually OK) and the scheduling-clock interrupt is enabled, of + course no problem. + If the gap between a successive pair of quiescent states grows too + long, you get RCU CPU stall warnings. 
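+
+As an example of the last item above, a CPU-bound loop in the kernel
+might supply quiescent states something like the following sketch, in
+which ``do_unit_of_work()`` is a hypothetical helper:
+
+   ::
+
+       1 while (!done) {
+       2   done = do_unit_of_work();  /* Hypothetical helper. */
+       3   cond_resched();  /* Supplies an RCU quiescent state. */
+       4 }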
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But what if my driver has a hardware interrupt handler that can run   |
+| for many seconds? I cannot invoke ``schedule()`` from a hardware      |
+| interrupt handler, after all!                                         |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| One approach is to do ``rcu_irq_exit();rcu_irq_enter();`` every so    |
+| often. But given that long-running interrupt handlers can cause other |
+| problems, not least for response time, shouldn't you work to keep     |
+| your interrupt handler's runtime within reasonable bounds?            |
++-----------------------------------------------------------------------+
+
+But as long as RCU is properly informed of kernel state transitions
+between in-kernel execution, usermode execution, and idle, and as long
+as the scheduling-clock interrupt is enabled when RCU needs it to be,
+you can rest assured that the bugs you encounter will be in some other
+part of RCU or some other part of the kernel!
+
+Memory Efficiency
+~~~~~~~~~~~~~~~~~
+
+Although small-memory non-realtime systems can simply use Tiny RCU, code
+size is only one aspect of memory efficiency. Another aspect is the size
+of the ``rcu_head`` structure used by ``call_rcu()`` and
+``kfree_rcu()``. Although this structure contains nothing more than a
+pair of pointers, it does appear in many RCU-protected data structures,
+including some that are size critical. The ``page`` structure is a case
+in point, as evidenced by the many occurrences of the ``union`` keyword
+within that structure.
+
+This need for memory efficiency is one reason that RCU uses hand-crafted
+singly linked lists to track the ``rcu_head`` structures that are
+waiting for a grace period to elapse. It is also the reason why
+``rcu_head`` structures do not contain debug information, such as fields
+tracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that
+posted them. Although this information might appear in debug-only kernel
+builds at some point, in the meantime, the ``->func`` field will often
+provide the needed debug information.
+
+However, in some cases, the need for memory efficiency leads to even
+more extreme measures. Returning to the ``page`` structure, the
+``rcu_head`` field shares storage with a great many other structures
+that are used at various points in the corresponding page's lifetime. In
+order to correctly resolve certain `race
+conditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com>`__,
+the Linux kernel's memory-management subsystem needs a particular bit to
+remain zero during all phases of grace-period processing, and that bit
+happens to map to the bottom bit of the ``rcu_head`` structure's
+``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is
+used to post the callback, as opposed to ``kfree_rcu()`` or some future
+“lazy” variant of ``call_rcu()`` that might one day be created for
+energy-efficiency purposes.
+
+That said, there are limits. RCU requires that the ``rcu_head``
+structure be aligned to a two-byte boundary, and passing a misaligned
+``rcu_head`` structure to one of the ``call_rcu()`` family of functions
+will result in a splat. It is therefore necessary to exercise caution
+when packing structures containing fields of type ``rcu_head``. 
Why not +a four-byte or even eight-byte alignment requirement? Because the m68k +architecture provides only two-byte alignment, and thus acts as +alignment's least common denominator. + +The reason for reserving the bottom bit of pointers to ``rcu_head`` +structures is to leave the door open to “lazy” callbacks whose +invocations can safely be deferred. Deferring invocation could +potentially have energy-efficiency benefits, but only if the rate of +non-lazy callbacks decreases significantly for some important workload. +In the meantime, reserving the bottom bit keeps this option open in case +it one day becomes useful. + +Performance, Scalability, Response Time, and Reliability +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Expanding on the `earlier +discussion <#Performance%20and%20Scalability>`__, RCU is used heavily by +hot code paths in performance-critical portions of the Linux kernel's +networking, security, virtualization, and scheduling code paths. RCU +must therefore use efficient implementations, especially in its +read-side primitives. To that end, it would be good if preemptible RCU's +implementation of ``rcu_read_lock()`` could be inlined, however, doing +this requires resolving ``#include`` issues with the ``task_struct`` +structure. + +The Linux kernel supports hardware configurations with up to 4096 CPUs, +which means that RCU must be extremely scalable. Algorithms that involve +frequent acquisitions of global locks or frequent atomic operations on +global variables simply cannot be tolerated within the RCU +implementation. RCU therefore makes heavy use of a combining tree based +on the ``rcu_node`` structure. RCU is required to tolerate all CPUs +continuously invoking any combination of RCU's runtime primitives with +minimal per-operation overhead. In fact, in many cases, increasing load +must *decrease* the per-operation overhead, witness the batching +optimizations for ``synchronize_rcu()``, ``call_rcu()``, +``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general +rule, RCU must cheerfully accept whatever the rest of the Linux kernel +decides to throw at it. + +The Linux kernel is used for real-time workloads, especially in +conjunction with the `-rt +patchset <https://rt.wiki.kernel.org/index.php/Main_Page>`__. The +real-time-latency response requirements are such that the traditional +approach of disabling preemption across RCU read-side critical sections +is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use +an RCU implementation that allows RCU read-side critical sections to be +preempted. This requirement made its presence known after users made it +clear that an earlier `real-time +patch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in +conjunction with some `RCU +issues <https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com>`__ +encountered by a very early version of the -rt patchset. + +In addition, RCU must make do with a sub-100-microsecond real-time +latency budget. In fact, on smaller systems with the -rt patchset, the +Linux kernel provides sub-20-microsecond real-time latencies for the +whole kernel, including RCU. RCU's scalability and latency must +therefore be sufficient for these sorts of configurations. To my +surprise, the sub-100-microsecond real-time latency budget `applies to +even the largest systems +[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__, +up to and including systems with 4096 CPUs. 
This real-time requirement +motivated the grace-period kthread, which also simplified handling of a +number of race conditions. + +RCU must avoid degrading real-time response for CPU-bound threads, +whether executing in usermode (which is one use case for +``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in +the kernel must execute ``cond_resched()`` at least once per few tens of +milliseconds in order to avoid receiving an IPI from RCU. + +Finally, RCU's status as a synchronization primitive means that any RCU +failure can result in arbitrary memory corruption that can be extremely +difficult to debug. This means that RCU must be extremely reliable, +which in practice also means that RCU must have an aggressive +stress-test suite. This stress-test suite is called ``rcutorture``. + +Although the need for ``rcutorture`` was no surprise, the current +immense popularity of the Linux kernel is posing interesting—and perhaps +unprecedented—validation challenges. To see this, keep in mind that +there are well over one billion instances of the Linux kernel running +today, given Android smartphones, Linux-powered televisions, and +servers. This number can be expected to increase sharply with the advent +of the celebrated Internet of Things. + +Suppose that RCU contains a race condition that manifests on average +once per million years of runtime. This bug will be occurring about +three times per *day* across the installed base. RCU could simply hide +behind hardware error rates, given that no one should really expect +their smartphone to last for a million years. However, anyone taking too +much comfort from this thought should consider the fact that in most +jurisdictions, a successful multi-year test of a given mechanism, which +might include a Linux kernel, suffices for a number of types of +safety-critical certifications. In fact, rumor has it that the Linux +kernel is already being used in production for safety-critical +applications. I don't know about you, but I would feel quite bad if a +bug in RCU killed someone. Which might explain my recent focus on +validation and verification. + +Other RCU Flavors +----------------- + +One of the more surprising things about RCU is that there are now no +fewer than five *flavors*, or API families. In addition, the primary +flavor that has been the sole focus up to this point has two different +implementations, non-preemptible and preemptible. The other four flavors +are listed below, with requirements for each described in a separate +section. + +#. `Bottom-Half Flavor (Historical)`_ +#. `Sched Flavor (Historical)`_ +#. `Sleepable RCU`_ +#. `Tasks RCU`_ + +Bottom-Half Flavor (Historical) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The RCU-bh flavor of RCU has since been expressed in terms of the other +RCU flavors as part of a consolidation of the three flavors into a +single flavor. The read-side API remains, and continues to disable +softirq and to be accounted for by lockdep. Much of the material in this +section is therefore strictly historical in nature. + +The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations) +flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a +flavor of RCU that could withstand the network-based denial-of-service +attacks researched by Robert Olsson. 
These attacks placed so much
+networking load on the system that some of the CPUs never exited softirq
+execution, which in turn prevented those CPUs from ever executing a
+context switch, which, in the RCU implementation of that time, prevented
+grace periods from ever ending. The result was an out-of-memory
+condition and a system hang.
+
+The solution was the creation of RCU-bh, which does
+``local_bh_disable()`` across its read-side critical sections, and which
+uses the transition from one type of softirq processing to another as a
+quiescent state in addition to context switch, idle, user mode, and
+offline. This means that RCU-bh grace periods can complete even when
+some of the CPUs execute in softirq indefinitely, thus allowing
+algorithms based on RCU-bh to withstand network-based denial-of-service
+attacks.
+
+Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and
+re-enable softirq handlers, any attempt to start a softirq handler
+during the RCU-bh read-side critical section will be deferred. In this
+case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can
+take considerable time. One can of course argue that this softirq
+overhead should be associated with the code following the RCU-bh
+read-side critical section rather than ``rcu_read_unlock_bh()``, but the
+fact is that most profiling tools cannot be expected to make this sort
+of fine distinction. For example, suppose that a three-millisecond-long
+RCU-bh read-side critical section executes during a time of heavy
+networking load. There will very likely be an attempt to invoke at least
+one softirq handler during that three milliseconds, but any such
+invocation will be delayed until the time of the
+``rcu_read_unlock_bh()``. This can of course make it appear at first
+glance as if ``rcu_read_unlock_bh()`` was executing very slowly.
+
+The `RCU-bh
+API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
+includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``,
+``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``,
+``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``,
+``call_rcu_bh()``, ``rcu_barrier_bh()``, and
+``rcu_read_lock_bh_held()``. However, the update-side APIs are now
+simple wrappers for other RCU flavors, namely RCU-sched in
+CONFIG_PREEMPT=n kernels and RCU-preempt otherwise.
+
+Sched Flavor (Historical)
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The RCU-sched flavor of RCU has since been expressed in terms of the
+other RCU flavors as part of a consolidation of the three flavors into a
+single flavor. The read-side API remains, and continues to disable
+preemption and to be accounted for by lockdep. Much of the material in
+this section is therefore strictly historical in nature.
+
+Before preemptible RCU, waiting for an RCU grace period had the side
+effect of also waiting for all pre-existing interrupt and NMI handlers.
+However, there are legitimate preemptible-RCU implementations that do
+not have this property, given that any point in the code outside of an
+RCU read-side critical section can be a quiescent state. Therefore,
+*RCU-sched* was created, which follows “classic” RCU in that an
+RCU-sched grace period waits for pre-existing interrupt and NMI
+handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and
+RCU-sched APIs have identical implementations, while kernels built with
+``CONFIG_PREEMPT=y`` provide a separate implementation for each.
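+
+A minimal RCU-sched reader might therefore look something like the
+following sketch, again reusing ``gp`` and assuming a hypothetical
+``do_something_with()`` helper:
+
+   ::
+
+       1 rcu_read_lock_sched();  /* Disables preemption. */
+       2 p = rcu_dereference_sched(gp);
+       3 if (p)
+       4   do_something_with(p->a);
+       5 rcu_read_unlock_sched();  /* Re-enables preemption. */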
+ +Note well that in ``CONFIG_PREEMPT=y`` kernels, +``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and +re-enable preemption, respectively. This means that if there was a +preemption attempt during the RCU-sched read-side critical section, +``rcu_read_unlock_sched()`` will enter the scheduler, with all the +latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``, +this can make it look as if ``rcu_read_unlock_sched()`` was executing +very slowly. However, the highest-priority task won't be preempted, so +that task will enjoy low-overhead ``rcu_read_unlock_sched()`` +invocations. + +The `RCU-sched +API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ +includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``, +``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``, +``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``, +``synchronize_sched()``, ``synchronize_rcu_sched_expedited()``, +``call_rcu_sched()``, ``rcu_barrier_sched()``, and +``rcu_read_lock_sched_held()``. However, anything that disables +preemption also marks an RCU-sched read-side critical section, including +``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and +``local_irq_restore()``, and so on. + +Sleepable RCU +~~~~~~~~~~~~~ + +For well over a decade, someone saying “I need to block within an RCU +read-side critical section” was a reliable indication that this someone +did not understand RCU. After all, if you are always blocking in an RCU +read-side critical section, you can probably afford to use a +higher-overhead synchronization mechanism. However, that changed with +the advent of the Linux kernel's notifiers, whose RCU read-side critical +sections almost never sleep, but sometimes need to. This resulted in the +introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or +*SRCU*. + +SRCU allows different domains to be defined, with each such domain +defined by an instance of an ``srcu_struct`` structure. A pointer to +this structure must be passed in to each SRCU function, for example, +``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct`` +structure. The key benefit of these domains is that a slow SRCU reader +in one domain does not delay an SRCU grace period in some other domain. +That said, one consequence of these domains is that read-side code must +pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for +example, as follows: + + :: + + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 srcu_read_unlock(&ss, idx); + +As noted above, it is legal to block within SRCU read-side critical +sections, however, with great power comes great responsibility. If you +block forever in one of a given domain's SRCU read-side critical +sections, then that domain's grace periods will also be blocked forever. +Of course, one good way to block forever is to deadlock, which can +happen if any operation in a given domain's SRCU read-side critical +section can wait, either directly or indirectly, for that domain's grace +period to elapse. For example, this results in a self-deadlock: + + :: + + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 synchronize_srcu(&ss); + 6 srcu_read_unlock(&ss, idx); + +However, if line 5 acquired a mutex that was held across a +``synchronize_srcu()`` for domain ``ss``, deadlock would still be +possible. 
Furthermore, if line 5 acquired a mutex that was held across a
+``synchronize_srcu()`` for some other domain ``ss1``, and if an
+``ss1``-domain SRCU read-side critical section acquired another mutex
+that was held across an ``ss``-domain ``synchronize_srcu()``, deadlock
+would again be possible. Such a deadlock cycle could extend across an
+arbitrarily large number of different SRCU domains. Again, with great
+power comes great responsibility.
+
+Unlike the other RCU flavors, SRCU read-side critical sections can run
+on idle and even offline CPUs. This ability requires that
+``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers,
+which means that SRCU readers will run a bit slower than would RCU
+readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API,
+which, in combination with ``srcu_read_unlock()``, guarantees a full
+memory barrier.
+
+Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be
+invoked from CPU-hotplug notifiers, because SRCU grace periods make use
+of timers, and timers can be temporarily “stranded” on the outgoing CPU.
+This stranding of timers means that timers posted to the outgoing CPU
+will not fire until late in the CPU-hotplug process. The problem is that
+if a notifier is waiting on an SRCU grace period, that grace period is
+waiting on a timer, and that timer is stranded on the outgoing CPU, then
+the notifier will never be awakened; in other words, deadlock has
+occurred. This same situation of course also prohibits
+``srcu_barrier()`` from being invoked from CPU-hotplug notifiers.
+
+SRCU also differs from other RCU flavors in that SRCU's expedited and
+non-expedited grace periods are implemented by the same mechanism. This
+means that in the current SRCU implementation, expediting a future grace
+period has the side effect of expediting all prior grace periods that
+have not yet completed. (But please note that this is a property of the
+current implementation, not necessarily of future implementations.) In
+addition, if SRCU has been idle for longer than the interval specified
+by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds
+by default), and if a ``synchronize_srcu()`` invocation ends this idle
+period, that invocation will be automatically expedited.
+
+As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
+locking bottleneck present in prior kernel versions. Although this will
+allow users to put much heavier stress on ``call_srcu()``, it is
+important to note that SRCU does not yet take any special steps to deal
+with callback flooding. So if you are posting (say) 10,000 SRCU
+callbacks per second per CPU, you are probably totally OK, but if you
+intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please
+run some tests first. SRCU just might need a few adjustments to deal
+with that sort of load. Of course, your mileage may vary based on the
+speed of your CPUs and the size of your memory.
+
+The `SRCU
+API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
+includes ``srcu_read_lock()``, ``srcu_read_unlock()``,
+``srcu_dereference()``, ``srcu_dereference_check()``,
+``synchronize_srcu()``, ``synchronize_srcu_expedited()``,
+``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It
+also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and
+``init_srcu_struct()`` APIs for defining and initializing
+``srcu_struct`` structures.
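+
+To show how a few of these APIs might fit together, here is a hedged
+sketch of a reader and a ``call_srcu()``-based updater; the ``foo``
+structure, the domain name, and the update-side locking convention are
+hypothetical:
+
+   ::
+
+      struct foo {
+              int a;
+              struct rcu_head rh;
+      };
+      DEFINE_STATIC_SRCU(my_srcu);
+      static struct foo __rcu *foo_ptr;
+
+      int reader(void)
+      {
+              struct foo *p;
+              int idx, val = -1;
+
+              idx = srcu_read_lock(&my_srcu);
+              p = srcu_dereference(foo_ptr, &my_srcu);
+              if (p)
+                      val = p->a;  /* Sleeping would also be legal here. */
+              srcu_read_unlock(&my_srcu, idx);
+              return val;
+      }
+
+      static void free_foo(struct rcu_head *rhp)
+      {
+              kfree(container_of(rhp, struct foo, rh));
+      }
+
+      void updater(struct foo *newp)  /* Caller holds update-side lock. */
+      {
+              struct foo *oldp;
+
+              oldp = rcu_dereference_protected(foo_ptr, 1);
+              rcu_assign_pointer(foo_ptr, newp);
+              if (oldp)
+                      call_srcu(&my_srcu, &oldp->rh, free_foo);
+      }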
+
+Tasks RCU
+~~~~~~~~~
+
+Some forms of tracing use “trampolines” to handle the binary rewriting
+required to install different types of probes. It would be good to be
+able to free old trampolines, which sounds like a job for some form of
+RCU. However, because it is necessary to be able to install a trace
+anywhere in the code, it is not possible to use read-side markers such
+as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does
+not work to have these markers in the trampoline itself, because there
+would need to be instructions following ``rcu_read_unlock()``. Although
+``synchronize_rcu()`` would guarantee that execution reached the
+``rcu_read_unlock()``, it would not be able to guarantee that execution
+had completely left the trampoline. Worse yet, in some situations
+the trampoline's protection must extend a few instructions *prior* to
+execution reaching the trampoline. For example, these few instructions
+might calculate the address of the trampoline, so that entering the
+trampoline would be pre-ordained a surprisingly long time before
+execution actually reached the trampoline itself.
+
+The solution, in the form of `Tasks
+RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
+critical sections that are delimited by voluntary context switches, that
+is, calls to ``schedule()``, ``cond_resched()``, and
+``synchronize_rcu_tasks()``. In addition, transitions to and from
+userspace execution also delimit tasks-RCU read-side critical sections.
+
+The tasks-RCU API is quite compact, consisting only of
+``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and
+``rcu_barrier_tasks()``. In ``CONFIG_PREEMPT=n`` kernels, trampolines
+cannot be preempted, so these APIs map to ``call_rcu()``,
+``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In
+``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these
+three APIs are therefore implemented by separate functions that check
+for voluntary context switches.
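+
+For example, a tracer might use this API to defer freeing an old
+trampoline until every task has passed through a tasks-RCU quiescent
+state. Here is a hedged sketch; the ``tramp`` structure and
+``detach_trampoline()`` are hypothetical:
+
+   ::
+
+      struct tramp {              /* Hypothetical trampoline descriptor. */
+              void *insns;
+      };
+
+      void remove_probe(struct tramp *tp)
+      {
+              detach_trampoline(tp);   /* Hypothetical: no new entries. */
+              synchronize_rcu_tasks(); /* Wait for all tasks to pass
+                                        * through a voluntary context
+                                        * switch or a userspace
+                                        * transition. */
+              kfree(tp);               /* No task remains in trampoline. */
+      }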
+
+Possible Future Changes
+-----------------------
+
+One of the tricks that RCU uses to attain update-side scalability is to
+increase grace-period latency with increasing numbers of CPUs. If this
+becomes a serious problem, it will be necessary to rework the
+grace-period state machine so as to avoid the need for the additional
+latency.
+
+RCU disables CPU hotplug in a few places, perhaps most notably in the
+``rcu_barrier()`` operations. If there is a strong reason to use
+``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to
+avoid disabling CPU hotplug. This would introduce some complexity, so
+there had better be a *very* good reason.
+
+The tradeoff between grace-period latency on the one hand and
+interruptions of other CPUs on the other hand may need to be
+re-examined. The desire is of course for zero grace-period latency as
+well as zero interprocessor interrupts undertaken during an expedited
+grace period operation. While this ideal is unlikely to be achievable,
+it is quite possible that further improvements can be made.
+
+The multiprocessor implementations of RCU use a combining tree that
+groups CPUs so as to reduce lock contention and increase cache locality.
+However, this combining tree does not spread its memory across NUMA
+nodes nor does it align the CPU groups with hardware features such as
+sockets or cores. Such spreading and alignment is currently believed to
+be unnecessary because the hotpath read-side primitives do not access
+the combining tree, nor does ``call_rcu()`` in the common case. If you
+believe that your architecture needs such spreading and alignment, then
+your architecture should also benefit from the
+``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the
+number of CPUs in a socket, NUMA node, or whatever. If the number of
+CPUs is too large, use a fraction of the number of CPUs. If the number
+of CPUs is a large prime number, well, that certainly is an
+“interesting” architectural choice! More flexible arrangements might be
+considered, but only if ``rcutree.rcu_fanout_leaf`` has proven
+inadequate, and only if the inadequacy has been demonstrated by a
+carefully run and realistic system-level workload.
+
+Please note that arrangements that require RCU to remap CPU numbers will
+require extremely good demonstration of need and full exploration of
+alternatives.
+
+RCU's various kthreads are reasonably recent additions. It is quite
+likely that adjustments will be required to more gracefully handle
+extreme loads. It might also be necessary to be able to relate CPU
+utilization by RCU's kthreads and softirq handlers to the code that
+instigated this CPU utilization. For example, RCU callback overhead
+might be charged back to the originating ``call_rcu()`` instance, though
+probably not in production kernels.
+
+Additional work may be required to provide reasonable forward-progress
+guarantees under heavy load for grace periods and for callback
+invocation.
+
+Summary
+-------
+
+This document has presented more than two decades' worth of RCU
+requirements. Given that the requirements keep changing, this will not
+be the last word on this subject, but at least it serves to get an
+important subset of the requirements set forth.
+
+Acknowledgments
+---------------
+
+I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg
+Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy
+Lutomirski for their help in rendering this article human readable, and
+to Michelle Rankin for her support of this effort. Other contributions
+are acknowledged in the Linux kernel's git archive.