diff options
Diffstat (limited to 'Documentation/RCU/Design/Requirements/Requirements.html')
-rw-r--r-- | Documentation/RCU/Design/Requirements/Requirements.html | 3324 |
1 files changed, 3324 insertions, 0 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html new file mode 100644 index 000000000..49690228b --- /dev/null +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -0,0 +1,3324 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" + "http://www.w3.org/TR/html4/loose.dtd"> + <html> + <head><title>A Tour Through RCU's Requirements [LWN.net]</title> + <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"> + +<h1>A Tour Through RCU's Requirements</h1> + +<p>Copyright IBM Corporation, 2015</p> +<p>Author: Paul E. McKenney</p> +<p><i>The initial version of this document appeared in the +<a href="https://lwn.net/">LWN</a> articles +<a href="https://lwn.net/Articles/652156/">here</a>, +<a href="https://lwn.net/Articles/652677/">here</a>, and +<a href="https://lwn.net/Articles/653326/">here</a>.</i></p> + +<h2>Introduction</h2> + +<p> +Read-copy update (RCU) is a synchronization mechanism that is often +used as a replacement for reader-writer locking. +RCU is unusual in that updaters do not block readers, +which means that RCU's read-side primitives can be exceedingly fast +and scalable. +In addition, updaters can make useful forward progress concurrently +with readers. +However, all this concurrency between RCU readers and updaters does raise +the question of exactly what RCU readers are doing, which in turn +raises the question of exactly what RCU's requirements are. + +<p> +This document therefore summarizes RCU's requirements, and can be thought +of as an informal, high-level specification for RCU. +It is important to understand that RCU's specification is primarily +empirical in nature; +in fact, I learned about many of these requirements the hard way. +This situation might cause some consternation, however, not only +has this learning process been a lot of fun, but it has also been +a great privilege to work with so many people willing to apply +technologies in interesting new ways. + +<p> +All that aside, here are the categories of currently known RCU requirements: +</p> + +<ol> +<li> <a href="#Fundamental Requirements"> + Fundamental Requirements</a> +<li> <a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a> +<li> <a href="#Parallelism Facts of Life"> + Parallelism Facts of Life</a> +<li> <a href="#Quality-of-Implementation Requirements"> + Quality-of-Implementation Requirements</a> +<li> <a href="#Linux Kernel Complications"> + Linux Kernel Complications</a> +<li> <a href="#Software-Engineering Requirements"> + Software-Engineering Requirements</a> +<li> <a href="#Other RCU Flavors"> + Other RCU Flavors</a> +<li> <a href="#Possible Future Changes"> + Possible Future Changes</a> +</ol> + +<p> +This is followed by a <a href="#Summary">summary</a>, +however, the answers to each quick quiz immediately follows the quiz. +Select the big white space with your mouse to see the answer. + +<h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2> + +<p> +RCU's fundamental requirements are the closest thing RCU has to hard +mathematical requirements. +These are: + +<ol> +<li> <a href="#Grace-Period Guarantee"> + Grace-Period Guarantee</a> +<li> <a href="#Publish-Subscribe Guarantee"> + Publish-Subscribe Guarantee</a> +<li> <a href="#Memory-Barrier Guarantees"> + Memory-Barrier Guarantees</a> +<li> <a href="#RCU Primitives Guaranteed to Execute Unconditionally"> + RCU Primitives Guaranteed to Execute Unconditionally</a> +<li> <a href="#Guaranteed Read-to-Write Upgrade"> + Guaranteed Read-to-Write Upgrade</a> +</ol> + +<h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3> + +<p> +RCU's grace-period guarantee is unusual in being premeditated: +Jack Slingwine and I had this guarantee firmly in mind when we started +work on RCU (then called “rclock”) in the early 1990s. +That said, the past two decades of experience with RCU have produced +a much more detailed understanding of this guarantee. + +<p> +RCU's grace-period guarantee allows updaters to wait for the completion +of all pre-existing RCU read-side critical sections. +An RCU read-side critical section +begins with the marker <tt>rcu_read_lock()</tt> and ends with +the marker <tt>rcu_read_unlock()</tt>. +These markers may be nested, and RCU treats a nested set as one +big RCU read-side critical section. +Production-quality implementations of <tt>rcu_read_lock()</tt> and +<tt>rcu_read_unlock()</tt> are extremely lightweight, and in +fact have exactly zero overhead in Linux kernels built for production +use with <tt>CONFIG_PREEMPT=n</tt>. + +<p> +This guarantee allows ordering to be enforced with extremely low +overhead to readers, for example: + +<blockquote> +<pre> + 1 int x, y; + 2 + 3 void thread0(void) + 4 { + 5 rcu_read_lock(); + 6 r1 = READ_ONCE(x); + 7 r2 = READ_ONCE(y); + 8 rcu_read_unlock(); + 9 } +10 +11 void thread1(void) +12 { +13 WRITE_ONCE(x, 1); +14 synchronize_rcu(); +15 WRITE_ONCE(y, 1); +16 } +</pre> +</blockquote> + +<p> +Because the <tt>synchronize_rcu()</tt> on line 14 waits for +all pre-existing readers, any instance of <tt>thread0()</tt> that +loads a value of zero from <tt>x</tt> must complete before +<tt>thread1()</tt> stores to <tt>y</tt>, so that instance must +also load a value of zero from <tt>y</tt>. +Similarly, any instance of <tt>thread0()</tt> that loads a value of +one from <tt>y</tt> must have started after the +<tt>synchronize_rcu()</tt> started, and must therefore also load +a value of one from <tt>x</tt>. +Therefore, the outcome: +<blockquote> +<pre> +(r1 == 0 && r2 == 1) +</pre> +</blockquote> +cannot happen. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Wait a minute! + You said that updaters can make useful forward progress concurrently + with readers, but pre-existing readers will block + <tt>synchronize_rcu()</tt>!!! + Just who are you trying to fool??? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + First, if updaters do not wish to be blocked by readers, they can use + <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will + be discussed later. + Second, even when using <tt>synchronize_rcu()</tt>, the other + update-side code does run concurrently with readers, whether + pre-existing or not. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +This scenario resembles one of the first uses of RCU in +<a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>, +which managed a distributed lock manager's transition into +a state suitable for handling recovery from node failure, +more or less as follows: + +<blockquote> +<pre> + 1 #define STATE_NORMAL 0 + 2 #define STATE_WANT_RECOVERY 1 + 3 #define STATE_RECOVERING 2 + 4 #define STATE_WANT_NORMAL 3 + 5 + 6 int state = STATE_NORMAL; + 7 + 8 void do_something_dlm(void) + 9 { +10 int state_snap; +11 +12 rcu_read_lock(); +13 state_snap = READ_ONCE(state); +14 if (state_snap == STATE_NORMAL) +15 do_something(); +16 else +17 do_something_carefully(); +18 rcu_read_unlock(); +19 } +20 +21 void start_recovery(void) +22 { +23 WRITE_ONCE(state, STATE_WANT_RECOVERY); +24 synchronize_rcu(); +25 WRITE_ONCE(state, STATE_RECOVERING); +26 recovery(); +27 WRITE_ONCE(state, STATE_WANT_NORMAL); +28 synchronize_rcu(); +29 WRITE_ONCE(state, STATE_NORMAL); +30 } +</pre> +</blockquote> + +<p> +The RCU read-side critical section in <tt>do_something_dlm()</tt> +works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt> +to guarantee that <tt>do_something()</tt> never runs concurrently +with <tt>recovery()</tt>, but with little or no synchronization +overhead in <tt>do_something_dlm()</tt>. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Why is the <tt>synchronize_rcu()</tt> on line 28 needed? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + Without that extra grace period, memory reordering could result in + <tt>do_something_dlm()</tt> executing <tt>do_something()</tt> + concurrently with the last bits of <tt>recovery()</tt>. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +In order to avoid fatal problems such as deadlocks, +an RCU read-side critical section must not contain calls to +<tt>synchronize_rcu()</tt>. +Similarly, an RCU read-side critical section must not +contain anything that waits, directly or indirectly, on completion of +an invocation of <tt>synchronize_rcu()</tt>. + +<p> +Although RCU's grace-period guarantee is useful in and of itself, with +<a href="https://lwn.net/Articles/573497/">quite a few use cases</a>, +it would be good to be able to use RCU to coordinate read-side +access to linked data structures. +For this, the grace-period guarantee is not sufficient, as can +be seen in function <tt>add_gp_buggy()</tt> below. +We will look at the reader's code later, but in the meantime, just think of +the reader as locklessly picking up the <tt>gp</tt> pointer, +and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the +<tt>->a</tt> and <tt>->b</tt> fields. + +<blockquote> +<pre> + 1 bool add_gp_buggy(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; +10 } +11 p->a = a; +12 p->b = a; +13 gp = p; /* ORDERING BUG */ +14 spin_unlock(&gp_lock); +15 return true; +16 } +</pre> +</blockquote> + +<p> +The problem is that both the compiler and weakly ordered CPUs are within +their rights to reorder this code as follows: + +<blockquote> +<pre> + 1 bool add_gp_buggy_optimized(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; +10 } +<b>11 gp = p; /* ORDERING BUG */ +12 p->a = a; +13 p->b = a;</b> +14 spin_unlock(&gp_lock); +15 return true; +16 } +</pre> +</blockquote> + +<p> +If an RCU reader fetches <tt>gp</tt> just after +<tt>add_gp_buggy_optimized</tt> executes line 11, +it will see garbage in the <tt>->a</tt> and <tt>->b</tt> +fields. +And this is but one of many ways in which compiler and hardware optimizations +could cause trouble. +Therefore, we clearly need some way to prevent the compiler and the CPU from +reordering in this manner, which brings us to the publish-subscribe +guarantee discussed in the next section. + +<h3><a name="Publish-Subscribe Guarantee">Publish/Subscribe Guarantee</a></h3> + +<p> +RCU's publish-subscribe guarantee allows data to be inserted +into a linked data structure without disrupting RCU readers. +The updater uses <tt>rcu_assign_pointer()</tt> to insert the +new data, and readers use <tt>rcu_dereference()</tt> to +access data, whether new or old. +The following shows an example of insertion: + +<blockquote> +<pre> + 1 bool add_gp(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; +10 } +11 p->a = a; +12 p->b = a; +13 rcu_assign_pointer(gp, p); +14 spin_unlock(&gp_lock); +15 return true; +16 } +</pre> +</blockquote> + +<p> +The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually +equivalent to a simple assignment statement, but also guarantees +that its assignment will +happen after the two assignments in lines 11 and 12, +similar to the C11 <tt>memory_order_release</tt> store operation. +It also prevents any number of “interesting” compiler +optimizations, for example, the use of <tt>gp</tt> as a scratch +location immediately preceding the assignment. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + But <tt>rcu_assign_pointer()</tt> does nothing to prevent the + two assignments to <tt>p->a</tt> and <tt>p->b</tt> + from being reordered. + Can't that also cause problems? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + No, it cannot. + The readers cannot see either of these two fields until + the assignment to <tt>gp</tt>, by which time both fields are + fully initialized. + So reordering the assignments + to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly + cause any problems. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +It is tempting to assume that the reader need not do anything special +to control its accesses to the RCU-protected data, +as shown in <tt>do_something_gp_buggy()</tt> below: + +<blockquote> +<pre> + 1 bool do_something_gp_buggy(void) + 2 { + 3 rcu_read_lock(); + 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ + 5 if (p) { + 6 do_something(p->a, p->b); + 7 rcu_read_unlock(); + 8 return true; + 9 } +10 rcu_read_unlock(); +11 return false; +12 } +</pre> +</blockquote> + +<p> +However, this temptation must be resisted because there are a +surprisingly large number of ways that the compiler +(to say nothing of +<a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>) +can trip this code up. +For but one example, if the compiler were short of registers, it +might choose to refetch from <tt>gp</tt> rather than keeping +a separate copy in <tt>p</tt> as follows: + +<blockquote> +<pre> + 1 bool do_something_gp_buggy_optimized(void) + 2 { + 3 rcu_read_lock(); + 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ +<b> 5 do_something(gp->a, gp->b);</b> + 6 rcu_read_unlock(); + 7 return true; + 8 } + 9 rcu_read_unlock(); +10 return false; +11 } +</pre> +</blockquote> + +<p> +If this function ran concurrently with a series of updates that +replaced the current structure with a new one, +the fetches of <tt>gp->a</tt> +and <tt>gp->b</tt> might well come from two different structures, +which could cause serious confusion. +To prevent this (and much else besides), <tt>do_something_gp()</tt> uses +<tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>: + +<blockquote> +<pre> + 1 bool do_something_gp(void) + 2 { + 3 rcu_read_lock(); + 4 p = rcu_dereference(gp); + 5 if (p) { + 6 do_something(p->a, p->b); + 7 rcu_read_unlock(); + 8 return true; + 9 } +10 rcu_read_unlock(); +11 return false; +12 } +</pre> +</blockquote> + +<p> +The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha) +memory barriers in the Linux kernel. +Should a +<a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a> +ever appear, then <tt>rcu_dereference()</tt> could be implemented +as a <tt>memory_order_consume</tt> load. +Regardless of the exact implementation, a pointer fetched by +<tt>rcu_dereference()</tt> may not be used outside of the +outermost RCU read-side critical section containing that +<tt>rcu_dereference()</tt>, unless protection of +the corresponding data element has been passed from RCU to some +other synchronization mechanism, most commonly locking or +<a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>. + +<p> +In short, updaters use <tt>rcu_assign_pointer()</tt> and readers +use <tt>rcu_dereference()</tt>, and these two RCU API elements +work together to ensure that readers have a consistent view of +newly added data elements. + +<p> +Of course, it is also necessary to remove elements from RCU-protected +data structures, for example, using the following process: + +<ol> +<li> Remove the data element from the enclosing structure. +<li> Wait for all pre-existing RCU read-side critical sections + to complete (because only pre-existing readers can possibly have + a reference to the newly removed data element). +<li> At this point, only the updater has a reference to the + newly removed data element, so it can safely reclaim + the data element, for example, by passing it to <tt>kfree()</tt>. +</ol> + +This process is implemented by <tt>remove_gp_synchronous()</tt>: + +<blockquote> +<pre> + 1 bool remove_gp_synchronous(void) + 2 { + 3 struct foo *p; + 4 + 5 spin_lock(&gp_lock); + 6 p = rcu_access_pointer(gp); + 7 if (!p) { + 8 spin_unlock(&gp_lock); + 9 return false; +10 } +11 rcu_assign_pointer(gp, NULL); +12 spin_unlock(&gp_lock); +13 synchronize_rcu(); +14 kfree(p); +15 return true; +16 } +</pre> +</blockquote> + +<p> +This function is straightforward, with line 13 waiting for a grace +period before line 14 frees the old data element. +This waiting ensures that readers will reach line 7 of +<tt>do_something_gp()</tt> before the data element referenced by +<tt>p</tt> is freed. +The <tt>rcu_access_pointer()</tt> on line 6 is similar to +<tt>rcu_dereference()</tt>, except that: + +<ol> +<li> The value returned by <tt>rcu_access_pointer()</tt> + cannot be dereferenced. + If you want to access the value pointed to as well as + the pointer itself, use <tt>rcu_dereference()</tt> + instead of <tt>rcu_access_pointer()</tt>. +<li> The call to <tt>rcu_access_pointer()</tt> need not be + protected. + In contrast, <tt>rcu_dereference()</tt> must either be + within an RCU read-side critical section or in a code + segment where the pointer cannot change, for example, in + code protected by the corresponding update-side lock. +</ol> + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Without the <tt>rcu_dereference()</tt> or the + <tt>rcu_access_pointer()</tt>, what destructive optimizations + might the compiler make use of? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + Let's start with what happens to <tt>do_something_gp()</tt> + if it fails to use <tt>rcu_dereference()</tt>. + It could reuse a value formerly fetched from this same pointer. + It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time + manner, resulting in <i>load tearing</i>, in turn resulting a bytewise + mash-up of two distinct pointer values. + It might even use value-speculation optimizations, where it makes + a wrong guess, but by the time it gets around to checking the + value, an update has changed the pointer to match the wrong guess. + Too bad about any dereferences that returned pre-initialization garbage + in the meantime! + </font> + + <p><font color="ffffff"> + For <tt>remove_gp_synchronous()</tt>, as long as all modifications + to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>, + the above optimizations are harmless. + However, <tt>sparse</tt> will complain if you + define <tt>gp</tt> with <tt>__rcu</tt> and then + access it without using + either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +In short, RCU's publish-subscribe guarantee is provided by the combination +of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>. +This guarantee allows data elements to be safely added to RCU-protected +linked data structures without disrupting RCU readers. +This guarantee can be used in combination with the grace-period +guarantee to also allow data elements to be removed from RCU-protected +linked data structures, again without disrupting RCU readers. + +<p> +This guarantee was only partially premeditated. +DYNIX/ptx used an explicit memory barrier for publication, but had nothing +resembling <tt>rcu_dereference()</tt> for subscription, nor did it +have anything resembling the <tt>smp_read_barrier_depends()</tt> +that was later subsumed into <tt>rcu_dereference()</tt> and later +still into <tt>READ_ONCE()</tt>. +The need for these operations made itself known quite suddenly at a +late-1990s meeting with the DEC Alpha architects, back in the days when +DEC was still a free-standing company. +It took the Alpha architects a good hour to convince me that any sort +of barrier would ever be needed, and it then took me a good <i>two</i> hours +to convince them that their documentation did not make this point clear. +More recent work with the C and C++ standards committees have provided +much education on tricks and traps from the compiler. +In short, compilers were much less tricky in the early 1990s, but in +2015, don't even think about omitting <tt>rcu_dereference()</tt>! + +<h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3> + +<p> +The previous section's simple linked-data-structure scenario clearly +demonstrates the need for RCU's stringent memory-ordering guarantees on +systems with more than one CPU: + +<ol> +<li> Each CPU that has an RCU read-side critical section that + begins before <tt>synchronize_rcu()</tt> starts is + guaranteed to execute a full memory barrier between the time + that the RCU read-side critical section ends and the time that + <tt>synchronize_rcu()</tt> returns. + Without this guarantee, a pre-existing RCU read-side critical section + might hold a reference to the newly removed <tt>struct foo</tt> + after the <tt>kfree()</tt> on line 14 of + <tt>remove_gp_synchronous()</tt>. +<li> Each CPU that has an RCU read-side critical section that ends + after <tt>synchronize_rcu()</tt> returns is guaranteed + to execute a full memory barrier between the time that + <tt>synchronize_rcu()</tt> begins and the time that the RCU + read-side critical section begins. + Without this guarantee, a later RCU read-side critical section + running after the <tt>kfree()</tt> on line 14 of + <tt>remove_gp_synchronous()</tt> might + later run <tt>do_something_gp()</tt> and find the + newly deleted <tt>struct foo</tt>. +<li> If the task invoking <tt>synchronize_rcu()</tt> remains + on a given CPU, then that CPU is guaranteed to execute a full + memory barrier sometime during the execution of + <tt>synchronize_rcu()</tt>. + This guarantee ensures that the <tt>kfree()</tt> on + line 14 of <tt>remove_gp_synchronous()</tt> really does + execute after the removal on line 11. +<li> If the task invoking <tt>synchronize_rcu()</tt> migrates + among a group of CPUs during that invocation, then each of the + CPUs in that group is guaranteed to execute a full memory barrier + sometime during the execution of <tt>synchronize_rcu()</tt>. + This guarantee also ensures that the <tt>kfree()</tt> on + line 14 of <tt>remove_gp_synchronous()</tt> really does + execute after the removal on + line 11, but also in the case where the thread executing the + <tt>synchronize_rcu()</tt> migrates in the meantime. +</ol> + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Given that multiple CPUs can start RCU read-side critical sections + at any time without any ordering whatsoever, how can RCU possibly + tell whether or not a given RCU read-side critical section starts + before a given instance of <tt>synchronize_rcu()</tt>? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + If RCU cannot tell whether or not a given + RCU read-side critical section starts before a + given instance of <tt>synchronize_rcu()</tt>, + then it must assume that the RCU read-side critical section + started first. + In other words, a given instance of <tt>synchronize_rcu()</tt> + can avoid waiting on a given RCU read-side critical section only + if it can prove that <tt>synchronize_rcu()</tt> started first. + </font> + + <p><font color="ffffff"> + A related question is “When <tt>rcu_read_lock()</tt> + doesn't generate any code, why does it matter how it relates + to a grace period?” + The answer is that it is not the relationship of + <tt>rcu_read_lock()</tt> itself that is important, but rather + the relationship of the code within the enclosed RCU read-side + critical section to the code preceding and following the + grace period. + If we take this viewpoint, then a given RCU read-side critical + section begins before a given grace period when some access + preceding the grace period observes the effect of some access + within the critical section, in which case none of the accesses + within the critical section may observe the effects of any + access following the grace period. + </font> + + <p><font color="ffffff"> + As of late 2016, mathematical models of RCU take this + viewpoint, for example, see slides 62 and 63 + of the + <a href="http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.2016.10.04c.LCE.pdf">2016 LinuxCon EU</a> + presentation. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + The first and second guarantees require unbelievably strict ordering! + Are all these memory barriers <i> really</i> required? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + Yes, they really are required. + To see why the first guarantee is required, consider the following + sequence of events: + </font> + + <ol> + <li> <font color="ffffff"> + CPU 1: <tt>rcu_read_lock()</tt> + </font> + <li> <font color="ffffff"> + CPU 1: <tt>q = rcu_dereference(gp); + /* Very likely to return p. */</tt> + </font> + <li> <font color="ffffff"> + CPU 0: <tt>list_del_rcu(p);</tt> + </font> + <li> <font color="ffffff"> + CPU 0: <tt>synchronize_rcu()</tt> starts. + </font> + <li> <font color="ffffff"> + CPU 1: <tt>do_something_with(q->a); + /* No smp_mb(), so might happen after kfree(). */</tt> + </font> + <li> <font color="ffffff"> + CPU 1: <tt>rcu_read_unlock()</tt> + </font> + <li> <font color="ffffff"> + CPU 0: <tt>synchronize_rcu()</tt> returns. + </font> + <li> <font color="ffffff"> + CPU 0: <tt>kfree(p);</tt> + </font> + </ol> + + <p><font color="ffffff"> + Therefore, there absolutely must be a full memory barrier between the + end of the RCU read-side critical section and the end of the + grace period. + </font> + + <p><font color="ffffff"> + The sequence of events demonstrating the necessity of the second rule + is roughly similar: + </font> + + <ol> + <li> <font color="ffffff">CPU 0: <tt>list_del_rcu(p);</tt> + </font> + <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> starts. + </font> + <li> <font color="ffffff">CPU 1: <tt>rcu_read_lock()</tt> + </font> + <li> <font color="ffffff">CPU 1: <tt>q = rcu_dereference(gp); + /* Might return p if no memory barrier. */</tt> + </font> + <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> returns. + </font> + <li> <font color="ffffff">CPU 0: <tt>kfree(p);</tt> + </font> + <li> <font color="ffffff"> + CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt> + </font> + <li> <font color="ffffff">CPU 1: <tt>rcu_read_unlock()</tt> + </font> + </ol> + + <p><font color="ffffff"> + And similarly, without a memory barrier between the beginning of the + grace period and the beginning of the RCU read-side critical section, + CPU 1 might end up accessing the freelist. + </font> + + <p><font color="ffffff"> + The “as if” rule of course applies, so that any + implementation that acts as if the appropriate memory barriers + were in place is a correct implementation. + That said, it is much easier to fool yourself into believing + that you have adhered to the as-if rule than it is to actually + adhere to it! +</font></td></tr> +<tr><td> </td></tr> +</table> + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + You claim that <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> + generate absolutely no code in some kernel builds. + This means that the compiler might arbitrarily rearrange consecutive + RCU read-side critical sections. + Given such rearrangement, if a given RCU read-side critical section + is done, how can you be sure that all prior RCU read-side critical + sections are done? + Won't the compiler rearrangements make that impossible to determine? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + In cases where <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> + generate absolutely no code, RCU infers quiescent states only at + special locations, for example, within the scheduler. + Because calls to <tt>schedule()</tt> had better prevent calling-code + accesses to shared variables from being rearranged across the call to + <tt>schedule()</tt>, if RCU detects the end of a given RCU read-side + critical section, it will necessarily detect the end of all prior + RCU read-side critical sections, no matter how aggressively the + compiler scrambles the code. + </font> + + <p><font color="ffffff"> + Again, this all assumes that the compiler cannot scramble code across + calls to the scheduler, out of interrupt handlers, into the idle loop, + into user-mode code, and so on. + But if your kernel build allows that sort of scrambling, you have broken + far more than just RCU! +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +Note that these memory-barrier requirements do not replace the fundamental +RCU requirement that a grace period wait for all pre-existing readers. +On the contrary, the memory barriers called out in this section must operate in +such a way as to <i>enforce</i> this fundamental requirement. +Of course, different implementations enforce this requirement in different +ways, but enforce it they must. + +<h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3> + +<p> +The common-case RCU primitives are unconditional. +They are invoked, they do their job, and they return, with no possibility +of error, and no need to retry. +This is a key RCU design philosophy. + +<p> +However, this philosophy is pragmatic rather than pigheaded. +If someone comes up with a good justification for a particular conditional +RCU primitive, it might well be implemented and added. +After all, this guarantee was reverse-engineered, not premeditated. +The unconditional nature of the RCU primitives was initially an +accident of implementation, and later experience with synchronization +primitives with conditional primitives caused me to elevate this +accident to a guarantee. +Therefore, the justification for adding a conditional primitive to +RCU would need to be based on detailed and compelling use cases. + +<h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3> + +<p> +As far as RCU is concerned, it is always possible to carry out an +update within an RCU read-side critical section. +For example, that RCU read-side critical section might search for +a given data element, and then might acquire the update-side +spinlock in order to update that element, all while remaining +in that RCU read-side critical section. +Of course, it is necessary to exit the RCU read-side critical section +before invoking <tt>synchronize_rcu()</tt>, however, this +inconvenience can be avoided through use of the +<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members +described later in this document. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + But how does the upgrade-to-write operation exclude other readers? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + It doesn't, just like normal RCU updates, which also do not exclude + RCU readers. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +This guarantee allows lookup code to be shared between read-side +and update-side code, and was premeditated, appearing in the earliest +DYNIX/ptx RCU documentation. + +<h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2> + +<p> +RCU provides extremely lightweight readers, and its read-side guarantees, +though quite useful, are correspondingly lightweight. +It is therefore all too easy to assume that RCU is guaranteeing more +than it really is. +Of course, the list of things that RCU does not guarantee is infinitely +long, however, the following sections list a few non-guarantees that +have caused confusion. +Except where otherwise noted, these non-guarantees were premeditated. + +<ol> +<li> <a href="#Readers Impose Minimal Ordering"> + Readers Impose Minimal Ordering</a> +<li> <a href="#Readers Do Not Exclude Updaters"> + Readers Do Not Exclude Updaters</a> +<li> <a href="#Updaters Only Wait For Old Readers"> + Updaters Only Wait For Old Readers</a> +<li> <a href="#Grace Periods Don't Partition Read-Side Critical Sections"> + Grace Periods Don't Partition Read-Side Critical Sections</a> +<li> <a href="#Read-Side Critical Sections Don't Partition Grace Periods"> + Read-Side Critical Sections Don't Partition Grace Periods</a> +<li> <a href="#Disabling Preemption Does Not Block Grace Periods"> + Disabling Preemption Does Not Block Grace Periods</a> +</ol> + +<h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3> + +<p> +Reader-side markers such as <tt>rcu_read_lock()</tt> and +<tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees +except through their interaction with the grace-period APIs such as +<tt>synchronize_rcu()</tt>. +To see this, consider the following pair of threads: + +<blockquote> +<pre> + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(x, 1); + 5 rcu_read_unlock(); + 6 rcu_read_lock(); + 7 WRITE_ONCE(y, 1); + 8 rcu_read_unlock(); + 9 } +10 +11 void thread1(void) +12 { +13 rcu_read_lock(); +14 r1 = READ_ONCE(y); +15 rcu_read_unlock(); +16 rcu_read_lock(); +17 r2 = READ_ONCE(x); +18 rcu_read_unlock(); +19 } +</pre> +</blockquote> + +<p> +After <tt>thread0()</tt> and <tt>thread1()</tt> execute +concurrently, it is quite possible to have + +<blockquote> +<pre> +(r1 == 1 && r2 == 0) +</pre> +</blockquote> + +(that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>), +which would not be possible if <tt>rcu_read_lock()</tt> and +<tt>rcu_read_unlock()</tt> had much in the way of ordering +properties. +But they do not, so the CPU is within its rights +to do significant reordering. +This is by design: Any significant ordering constraints would slow down +these fast-path APIs. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Can't the compiler also reorder this code? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + No, the volatile casts in <tt>READ_ONCE()</tt> and + <tt>WRITE_ONCE()</tt> prevent the compiler from reordering in + this particular case. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3> + +<p> +Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt> +exclude updates. +All they do is to prevent grace periods from ending. +The following example illustrates this: + +<blockquote> +<pre> + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 r1 = READ_ONCE(y); + 5 if (r1) { + 6 do_something_with_nonzero_x(); + 7 r2 = READ_ONCE(x); + 8 WARN_ON(!r2); /* BUG!!! */ + 9 } +10 rcu_read_unlock(); +11 } +12 +13 void thread1(void) +14 { +15 spin_lock(&my_lock); +16 WRITE_ONCE(x, 1); +17 WRITE_ONCE(y, 1); +18 spin_unlock(&my_lock); +19 } +</pre> +</blockquote> + +<p> +If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt> +excluded the <tt>thread1()</tt> function's update, +the <tt>WARN_ON()</tt> could never fire. +But the fact is that <tt>rcu_read_lock()</tt> does not exclude +much of anything aside from subsequent grace periods, of which +<tt>thread1()</tt> has none, so the +<tt>WARN_ON()</tt> can and does fire. + +<h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3> + +<p> +It might be tempting to assume that after <tt>synchronize_rcu()</tt> +completes, there are no readers executing. +This temptation must be avoided because +new readers can start immediately after <tt>synchronize_rcu()</tt> +starts, and <tt>synchronize_rcu()</tt> is under no +obligation to wait for these new readers. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Suppose that synchronize_rcu() did wait until <i>all</i> + readers had completed instead of waiting only on + pre-existing readers. + For how long would the updater be able to rely on there + being no readers? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + For no time at all. + Even if <tt>synchronize_rcu()</tt> were to wait until + all readers had completed, a new reader might start immediately after + <tt>synchronize_rcu()</tt> completed. + Therefore, the code following + <tt>synchronize_rcu()</tt> can <i>never</i> rely on there being + no readers. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<h3><a name="Grace Periods Don't Partition Read-Side Critical Sections"> +Grace Periods Don't Partition Read-Side Critical Sections</a></h3> + +<p> +It is tempting to assume that if any part of one RCU read-side critical +section precedes a given grace period, and if any part of another RCU +read-side critical section follows that same grace period, then all of +the first RCU read-side critical section must precede all of the second. +However, this just isn't the case: A single grace period does not +partition the set of RCU read-side critical sections. +An example of this situation can be illustrated as follows, where +<tt>x</tt>, <tt>y</tt>, and <tt>z</tt> are initially all zero: + +<blockquote> +<pre> + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) +10 { +11 r1 = READ_ONCE(a); +12 synchronize_rcu(); +13 WRITE_ONCE(c, 1); +14 } +15 +16 void thread2(void) +17 { +18 rcu_read_lock(); +19 r2 = READ_ONCE(b); +20 r3 = READ_ONCE(c); +21 rcu_read_unlock(); +22 } +</pre> +</blockquote> + +<p> +It turns out that the outcome: + +<blockquote> +<pre> +(r1 == 1 && r2 == 0 && r3 == 1) +</pre> +</blockquote> + +is entirely possible. +The following figure show how this can happen, with each circled +<tt>QS</tt> indicating the point at which RCU recorded a +<i>quiescent state</i> for each thread, that is, a state in which +RCU knows that the thread cannot be in the midst of an RCU read-side +critical section that started before the current grace period: + +<p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p> + +<p> +If it is necessary to partition RCU read-side critical sections in this +manner, it is necessary to use two grace periods, where the first +grace period is known to end before the second grace period starts: + +<blockquote> +<pre> + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) +10 { +11 r1 = READ_ONCE(a); +12 synchronize_rcu(); +13 WRITE_ONCE(c, 1); +14 } +15 +16 void thread2(void) +17 { +18 r2 = READ_ONCE(c); +19 synchronize_rcu(); +20 WRITE_ONCE(d, 1); +21 } +22 +23 void thread3(void) +24 { +25 rcu_read_lock(); +26 r3 = READ_ONCE(b); +27 r4 = READ_ONCE(d); +28 rcu_read_unlock(); +29 } +</pre> +</blockquote> + +<p> +Here, if <tt>(r1 == 1)</tt>, then +<tt>thread0()</tt>'s write to <tt>b</tt> must happen +before the end of <tt>thread1()</tt>'s grace period. +If in addition <tt>(r4 == 1)</tt>, then +<tt>thread3()</tt>'s read from <tt>b</tt> must happen +after the beginning of <tt>thread2()</tt>'s grace period. +If it is also the case that <tt>(r2 == 1)</tt>, then the +end of <tt>thread1()</tt>'s grace period must precede the +beginning of <tt>thread2()</tt>'s grace period. +This mean that the two RCU read-side critical sections cannot overlap, +guaranteeing that <tt>(r3 == 1)</tt>. +As a result, the outcome: + +<blockquote> +<pre> +(r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) +</pre> +</blockquote> + +cannot happen. + +<p> +This non-requirement was also non-premeditated, but became apparent +when studying RCU's interaction with memory ordering. + +<h3><a name="Read-Side Critical Sections Don't Partition Grace Periods"> +Read-Side Critical Sections Don't Partition Grace Periods</a></h3> + +<p> +It is also tempting to assume that if an RCU read-side critical section +happens between a pair of grace periods, then those grace periods cannot +overlap. +However, this temptation leads nowhere good, as can be illustrated by +the following, with all variables initially zero: + +<blockquote> +<pre> + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) +10 { +11 r1 = READ_ONCE(a); +12 synchronize_rcu(); +13 WRITE_ONCE(c, 1); +14 } +15 +16 void thread2(void) +17 { +18 rcu_read_lock(); +19 WRITE_ONCE(d, 1); +20 r2 = READ_ONCE(c); +21 rcu_read_unlock(); +22 } +23 +24 void thread3(void) +25 { +26 r3 = READ_ONCE(d); +27 synchronize_rcu(); +28 WRITE_ONCE(e, 1); +29 } +30 +31 void thread4(void) +32 { +33 rcu_read_lock(); +34 r4 = READ_ONCE(b); +35 r5 = READ_ONCE(e); +36 rcu_read_unlock(); +37 } +</pre> +</blockquote> + +<p> +In this case, the outcome: + +<blockquote> +<pre> +(r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) +</pre> +</blockquote> + +is entirely possible, as illustrated below: + +<p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p> + +<p> +Again, an RCU read-side critical section can overlap almost all of a +given grace period, just so long as it does not overlap the entire +grace period. +As a result, an RCU read-side critical section cannot partition a pair +of RCU grace periods. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + How long a sequence of grace periods, each separated by an RCU + read-side critical section, would be required to partition the RCU + read-side critical sections at the beginning and end of the chain? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + In theory, an infinite number. + In practice, an unknown number that is sensitive to both implementation + details and timing considerations. + Therefore, even in practice, RCU users must abide by the + theoretical rather than the practical answer. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<h3><a name="Disabling Preemption Does Not Block Grace Periods"> +Disabling Preemption Does Not Block Grace Periods</a></h3> + +<p> +There was a time when disabling preemption on any given CPU would block +subsequent grace periods. +However, this was an accident of implementation and is not a requirement. +And in the current Linux-kernel implementation, disabling preemption +on a given CPU in fact does not block grace periods, as Oleg Nesterov +<a href="https://lkml.kernel.org/g/20150614193825.GA19582@redhat.com">demonstrated</a>. + +<p> +If you need a preempt-disable region to block grace periods, you need to add +<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>, for example +as follows: + +<blockquote> +<pre> + 1 preempt_disable(); + 2 rcu_read_lock(); + 3 do_something(); + 4 rcu_read_unlock(); + 5 preempt_enable(); + 6 + 7 /* Spinlocks implicitly disable preemption. */ + 8 spin_lock(&mylock); + 9 rcu_read_lock(); +10 do_something(); +11 rcu_read_unlock(); +12 spin_unlock(&mylock); +</pre> +</blockquote> + +<p> +In theory, you could enter the RCU read-side critical section first, +but it is more efficient to keep the entire RCU read-side critical +section contained in the preempt-disable region as shown above. +Of course, RCU read-side critical sections that extend outside of +preempt-disable regions will work correctly, but such critical sections +can be preempted, which forces <tt>rcu_read_unlock()</tt> to do +more work. +And no, this is <i>not</i> an invitation to enclose all of your RCU +read-side critical sections within preempt-disable regions, because +doing so would degrade real-time response. + +<p> +This non-requirement appeared with preemptible RCU. +If you need a grace period that waits on non-preemptible code regions, use +<a href="#Sched Flavor">RCU-sched</a>. + +<h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2> + +<p> +These parallelism facts of life are by no means specific to RCU, but +the RCU implementation must abide by them. +They therefore bear repeating: + +<ol> +<li> Any CPU or task may be delayed at any time, + and any attempts to avoid these delays by disabling + preemption, interrupts, or whatever are completely futile. + This is most obvious in preemptible user-level + environments and in virtualized environments (where + a given guest OS's VCPUs can be preempted at any time by + the underlying hypervisor), but can also happen in bare-metal + environments due to ECC errors, NMIs, and other hardware + events. + Although a delay of more than about 20 seconds can result + in splats, the RCU implementation is obligated to use + algorithms that can tolerate extremely long delays, but where + “extremely long” is not long enough to allow + wrap-around when incrementing a 64-bit counter. +<li> Both the compiler and the CPU can reorder memory accesses. + Where it matters, RCU must use compiler directives and + memory-barrier instructions to preserve ordering. +<li> Conflicting writes to memory locations in any given cache line + will result in expensive cache misses. + Greater numbers of concurrent writes and more-frequent + concurrent writes will result in more dramatic slowdowns. + RCU is therefore obligated to use algorithms that have + sufficient locality to avoid significant performance and + scalability problems. +<li> As a rough rule of thumb, only one CPU's worth of processing + may be carried out under the protection of any given exclusive + lock. + RCU must therefore use scalable locking designs. +<li> Counters are finite, especially on 32-bit systems. + RCU's use of counters must therefore tolerate counter wrap, + or be designed such that counter wrap would take way more + time than a single system is likely to run. + An uptime of ten years is quite possible, a runtime + of a century much less so. + As an example of the latter, RCU's dyntick-idle nesting counter + allows 54 bits for interrupt nesting level (this counter + is 64 bits even on a 32-bit system). + Overflowing this counter requires 2<sup>54</sup> + half-interrupts on a given CPU without that CPU ever going idle. + If a half-interrupt happened every microsecond, it would take + 570 years of runtime to overflow this counter, which is currently + believed to be an acceptably long time. +<li> Linux systems can have thousands of CPUs running a single + Linux kernel in a single shared-memory environment. + RCU must therefore pay close attention to high-end scalability. +</ol> + +<p> +This last parallelism fact of life means that RCU must pay special +attention to the preceding facts of life. +The idea that Linux might scale to systems with thousands of CPUs would +have been met with some skepticism in the 1990s, but these requirements +would have otherwise have been unsurprising, even in the early 1990s. + +<h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2> + +<p> +These sections list quality-of-implementation requirements. +Although an RCU implementation that ignores these requirements could +still be used, it would likely be subject to limitations that would +make it inappropriate for industrial-strength production use. +Classes of quality-of-implementation requirements are as follows: + +<ol> +<li> <a href="#Specialization">Specialization</a> +<li> <a href="#Performance and Scalability">Performance and Scalability</a> +<li> <a href="#Composability">Composability</a> +<li> <a href="#Corner Cases">Corner Cases</a> +</ol> + +<p> +These classes is covered in the following sections. + +<h3><a name="Specialization">Specialization</a></h3> + +<p> +RCU is and always has been intended primarily for read-mostly situations, +which means that RCU's read-side primitives are optimized, often at the +expense of its update-side primitives. +Experience thus far is captured by the following list of situations: + +<ol> +<li> Read-mostly data, where stale and inconsistent data is not + a problem: RCU works great! +<li> Read-mostly data, where data must be consistent: + RCU works well. +<li> Read-write data, where data must be consistent: + RCU <i>might</i> work OK. + Or not. +<li> Write-mostly data, where data must be consistent: + RCU is very unlikely to be the right tool for the job, + with the following exceptions, where RCU can provide: + <ol type=a> + <li> Existence guarantees for update-friendly mechanisms. + <li> Wait-free read-side primitives for real-time use. + </ol> +</ol> + +<p> +This focus on read-mostly situations means that RCU must interoperate +with other synchronization primitives. +For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt> +examples discussed earlier use RCU to protect readers and locking to +coordinate updaters. +However, the need extends much farther, requiring that a variety of +synchronization primitives be legal within RCU read-side critical sections, +including spinlocks, sequence locks, atomic operations, reference +counters, and memory barriers. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + What about sleeping locks? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + These are forbidden within Linux-kernel RCU read-side critical + sections because it is not legal to place a quiescent state + (in this case, voluntary context switch) within an RCU read-side + critical section. + However, sleeping locks may be used within userspace RCU read-side + critical sections, and also within Linux-kernel sleepable RCU + <a href="#Sleepable RCU"><font color="ffffff">(SRCU)</font></a> + read-side critical sections. + In addition, the -rt patchset turns spinlocks into a + sleeping locks so that the corresponding critical sections + can be preempted, which also means that these sleeplockified + spinlocks (but not other sleeping locks!) may be acquire within + -rt-Linux-kernel RCU read-side critical sections. + </font> + + <p><font color="ffffff"> + Note that it <i>is</i> legal for a normal RCU read-side + critical section to conditionally acquire a sleeping locks + (as in <tt>mutex_trylock()</tt>), but only as long as it does + not loop indefinitely attempting to conditionally acquire that + sleeping locks. + The key point is that things like <tt>mutex_trylock()</tt> + either return with the mutex held, or return an error indication if + the mutex was not immediately available. + Either way, <tt>mutex_trylock()</tt> returns immediately without + sleeping. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +It often comes as a surprise that many algorithms do not require a +consistent view of data, but many can function in that mode, +with network routing being the poster child. +Internet routing algorithms take significant time to propagate +updates, so that by the time an update arrives at a given system, +that system has been sending network traffic the wrong way for +a considerable length of time. +Having a few threads continue to send traffic the wrong way for a +few more milliseconds is clearly not a problem: In the worst case, +TCP retransmissions will eventually get the data where it needs to go. +In general, when tracking the state of the universe outside of the +computer, some level of inconsistency must be tolerated due to +speed-of-light delays if nothing else. + +<p> +Furthermore, uncertainty about external state is inherent in many cases. +For example, a pair of veterinarians might use heartbeat to determine +whether or not a given cat was alive. +But how long should they wait after the last heartbeat to decide that +the cat is in fact dead? +Waiting less than 400 milliseconds makes no sense because this would +mean that a relaxed cat would be considered to cycle between death +and life more than 100 times per minute. +Moreover, just as with human beings, a cat's heart might stop for +some period of time, so the exact wait period is a judgment call. +One of our pair of veterinarians might wait 30 seconds before pronouncing +the cat dead, while the other might insist on waiting a full minute. +The two veterinarians would then disagree on the state of the cat during +the final 30 seconds of the minute following the last heartbeat. + +<p> +Interestingly enough, this same situation applies to hardware. +When push comes to shove, how do we tell whether or not some +external server has failed? +We send messages to it periodically, and declare it failed if we +don't receive a response within a given period of time. +Policy decisions can usually tolerate short +periods of inconsistency. +The policy was decided some time ago, and is only now being put into +effect, so a few milliseconds of delay is normally inconsequential. + +<p> +However, there are algorithms that absolutely must see consistent data. +For example, the translation between a user-level SystemV semaphore +ID to the corresponding in-kernel data structure is protected by RCU, +but it is absolutely forbidden to update a semaphore that has just been +removed. +In the Linux kernel, this need for consistency is accommodated by acquiring +spinlocks located in the in-kernel data structure from within +the RCU read-side critical section, and this is indicated by the +green box in the figure above. +Many other techniques may be used, and are in fact used within the +Linux kernel. + +<p> +In short, RCU is not required to maintain consistency, and other +mechanisms may be used in concert with RCU when consistency is required. +RCU's specialization allows it to do its job extremely well, and its +ability to interoperate with other synchronization mechanisms allows +the right mix of synchronization tools to be used for a given job. + +<h3><a name="Performance and Scalability">Performance and Scalability</a></h3> + +<p> +Energy efficiency is a critical component of performance today, +and Linux-kernel RCU implementations must therefore avoid unnecessarily +awakening idle CPUs. +I cannot claim that this requirement was premeditated. +In fact, I learned of it during a telephone conversation in which I +was given “frank and open” feedback on the importance +of energy efficiency in battery-powered systems and on specific +energy-efficiency shortcomings of the Linux-kernel RCU implementation. +In my experience, the battery-powered embedded community will consider +any unnecessary wakeups to be extremely unfriendly acts. +So much so that mere Linux-kernel-mailing-list posts are +insufficient to vent their ire. + +<p> +Memory consumption is not particularly important for in most +situations, and has become decreasingly +so as memory sizes have expanded and memory +costs have plummeted. +However, as I learned from Matt Mackall's +<a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a> +efforts, memory footprint is critically important on single-CPU systems with +non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus +<a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a> +was born. +Josh Triplett has since taken over the small-memory banner with his +<a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a> +project, which resulted in +<a href="#Sleepable RCU">SRCU</a> +becoming optional for those kernels not needing it. + +<p> +The remaining performance requirements are, for the most part, +unsurprising. +For example, in keeping with RCU's read-side specialization, +<tt>rcu_dereference()</tt> should have negligible overhead (for +example, suppression of a few minor compiler optimizations). +Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and +<tt>rcu_read_unlock()</tt> should have exactly zero overhead. + +<p> +In preemptible environments, in the case where the RCU read-side +critical section was not preempted (as will be the case for the +highest-priority real-time process), <tt>rcu_read_lock()</tt> and +<tt>rcu_read_unlock()</tt> should have minimal overhead. +In particular, they should not contain atomic read-modify-write +operations, memory-barrier instructions, preemption disabling, +interrupt disabling, or backwards branches. +However, in the case where the RCU read-side critical section was preempted, +<tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts. +This is why it is better to nest an RCU read-side critical section +within a preempt-disable region than vice versa, at least in cases +where that critical section is short enough to avoid unduly degrading +real-time latencies. + +<p> +The <tt>synchronize_rcu()</tt> grace-period-wait primitive is +optimized for throughput. +It may therefore incur several milliseconds of latency in addition to +the duration of the longest RCU read-side critical section. +On the other hand, multiple concurrent invocations of +<tt>synchronize_rcu()</tt> are required to use batching optimizations +so that they can be satisfied by a single underlying grace-period-wait +operation. +For example, in the Linux kernel, it is not unusual for a single +grace-period-wait operation to serve more than +<a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a> +of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation +overhead down to nearly zero. +However, the grace-period optimization is also required to avoid +measurable degradation of real-time scheduling and interrupt latencies. + +<p> +In some cases, the multi-millisecond <tt>synchronize_rcu()</tt> +latencies are unacceptable. +In these cases, <tt>synchronize_rcu_expedited()</tt> may be used +instead, reducing the grace-period latency down to a few tens of +microseconds on small systems, at least in cases where the RCU read-side +critical sections are short. +There are currently no special latency requirements for +<tt>synchronize_rcu_expedited()</tt> on large systems, but, +consistent with the empirical nature of the RCU specification, +that is subject to change. +However, there most definitely are scalability requirements: +A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096 +CPUs should at least make reasonable forward progress. +In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt> +is permitted to impose modest degradation of real-time latency +on non-idle online CPUs. +Here, “modest” means roughly the same latency +degradation as a scheduling-clock interrupt. + +<p> +There are a number of situations where even +<tt>synchronize_rcu_expedited()</tt>'s reduced grace-period +latency is unacceptable. +In these situations, the asynchronous <tt>call_rcu()</tt> can be +used in place of <tt>synchronize_rcu()</tt> as follows: + +<blockquote> +<pre> + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 static void remove_gp_cb(struct rcu_head *rhp) + 8 { + 9 struct foo *p = container_of(rhp, struct foo, rh); +10 +11 kfree(p); +12 } +13 +14 bool remove_gp_asynchronous(void) +15 { +16 struct foo *p; +17 +18 spin_lock(&gp_lock); +19 p = rcu_dereference(gp); +20 if (!p) { +21 spin_unlock(&gp_lock); +22 return false; +23 } +24 rcu_assign_pointer(gp, NULL); +25 call_rcu(&p->rh, remove_gp_cb); +26 spin_unlock(&gp_lock); +27 return true; +28 } +</pre> +</blockquote> + +<p> +A definition of <tt>struct foo</tt> is finally needed, and appears +on lines 1-5. +The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt> +on line 25, and will be invoked after the end of a subsequent +grace period. +This gets the same effect as <tt>remove_gp_synchronous()</tt>, +but without forcing the updater to wait for a grace period to elapse. +The <tt>call_rcu()</tt> function may be used in a number of +situations where neither <tt>synchronize_rcu()</tt> nor +<tt>synchronize_rcu_expedited()</tt> would be legal, +including within preempt-disable code, <tt>local_bh_disable()</tt> code, +interrupt-disable code, and interrupt handlers. +However, even <tt>call_rcu()</tt> is illegal within NMI handlers +and from idle and offline CPUs. +The callback function (<tt>remove_gp_cb()</tt> in this case) will be +executed within softirq (software interrupt) environment within the +Linux kernel, +either within a real softirq handler or under the protection +of <tt>local_bh_disable()</tt>. +In both the Linux kernel and in userspace, it is bad practice to +write an RCU callback function that takes too long. +Long-running operations should be relegated to separate threads or +(in the Linux kernel) workqueues. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Why does line 19 use <tt>rcu_access_pointer()</tt>? + After all, <tt>call_rcu()</tt> on line 25 stores into the + structure, which would interact badly with concurrent insertions. + Doesn't this mean that <tt>rcu_dereference()</tt> is required? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + Presumably the <tt>->gp_lock</tt> acquired on line 18 excludes + any changes, including any insertions that <tt>rcu_dereference()</tt> + would protect against. + Therefore, any insertions will be delayed until after + <tt>->gp_lock</tt> + is released on line 25, which in turn means that + <tt>rcu_access_pointer()</tt> suffices. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +However, all that <tt>remove_gp_cb()</tt> is doing is +invoking <tt>kfree()</tt> on the data element. +This is a common idiom, and is supported by <tt>kfree_rcu()</tt>, +which allows “fire and forget” operation as shown below: + +<blockquote> +<pre> + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 bool remove_gp_faf(void) + 8 { + 9 struct foo *p; +10 +11 spin_lock(&gp_lock); +12 p = rcu_dereference(gp); +13 if (!p) { +14 spin_unlock(&gp_lock); +15 return false; +16 } +17 rcu_assign_pointer(gp, NULL); +18 kfree_rcu(p, rh); +19 spin_unlock(&gp_lock); +20 return true; +21 } +</pre> +</blockquote> + +<p> +Note that <tt>remove_gp_faf()</tt> simply invokes +<tt>kfree_rcu()</tt> and proceeds, without any need to pay any +further attention to the subsequent grace period and <tt>kfree()</tt>. +It is permissible to invoke <tt>kfree_rcu()</tt> from the same +environments as for <tt>call_rcu()</tt>. +Interestingly enough, DYNIX/ptx had the equivalents of +<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not +<tt>synchronize_rcu()</tt>. +This was due to the fact that RCU was not heavily used within DYNIX/ptx, +so the very few places that needed something like +<tt>synchronize_rcu()</tt> simply open-coded it. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Earlier it was claimed that <tt>call_rcu()</tt> and + <tt>kfree_rcu()</tt> allowed updaters to avoid being blocked + by readers. + But how can that be correct, given that the invocation of the callback + and the freeing of the memory (respectively) must still wait for + a grace period to elapse? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + We could define things this way, but keep in mind that this sort of + definition would say that updates in garbage-collected languages + cannot complete until the next time the garbage collector runs, + which does not seem at all reasonable. + The key point is that in most cases, an updater using either + <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the + next update as soon as it has invoked <tt>call_rcu()</tt> or + <tt>kfree_rcu()</tt>, without having to wait for a subsequent + grace period. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +But what if the updater must wait for the completion of code to be +executed after the end of the grace period, but has other tasks +that can be carried out in the meantime? +The polling-style <tt>get_state_synchronize_rcu()</tt> and +<tt>cond_synchronize_rcu()</tt> functions may be used for this +purpose, as shown below: + +<blockquote> +<pre> + 1 bool remove_gp_poll(void) + 2 { + 3 struct foo *p; + 4 unsigned long s; + 5 + 6 spin_lock(&gp_lock); + 7 p = rcu_access_pointer(gp); + 8 if (!p) { + 9 spin_unlock(&gp_lock); +10 return false; +11 } +12 rcu_assign_pointer(gp, NULL); +13 spin_unlock(&gp_lock); +14 s = get_state_synchronize_rcu(); +15 do_something_while_waiting(); +16 cond_synchronize_rcu(s); +17 kfree(p); +18 return true; +19 } +</pre> +</blockquote> + +<p> +On line 14, <tt>get_state_synchronize_rcu()</tt> obtains a +“cookie” from RCU, +then line 15 carries out other tasks, +and finally, line 16 returns immediately if a grace period has +elapsed in the meantime, but otherwise waits as required. +The need for <tt>get_state_synchronize_rcu</tt> and +<tt>cond_synchronize_rcu()</tt> has appeared quite recently, +so it is too early to tell whether they will stand the test of time. + +<p> +RCU thus provides a range of tools to allow updaters to strike the +required tradeoff between latency, flexibility and CPU overhead. + +<h3><a name="Composability">Composability</a></h3> + +<p> +Composability has received much attention in recent years, perhaps in part +due to the collision of multicore hardware with object-oriented techniques +designed in single-threaded environments for single-threaded use. +And in theory, RCU read-side critical sections may be composed, and in +fact may be nested arbitrarily deeply. +In practice, as with all real-world implementations of composable +constructs, there are limitations. + +<p> +Implementations of RCU for which <tt>rcu_read_lock()</tt> +and <tt>rcu_read_unlock()</tt> generate no code, such as +Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be +nested arbitrarily deeply. +After all, there is no overhead. +Except that if all these instances of <tt>rcu_read_lock()</tt> +and <tt>rcu_read_unlock()</tt> are visible to the compiler, +compilation will eventually fail due to exhausting memory, +mass storage, or user patience, whichever comes first. +If the nesting is not visible to the compiler, as is the case with +mutually recursive functions each in its own translation unit, +stack overflow will result. +If the nesting takes the form of loops, perhaps in the guise of tail +recursion, either the control variable +will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. +Nevertheless, this class of RCU implementations is one +of the most composable constructs in existence. + +<p> +RCU implementations that explicitly track nesting depth +are limited by the nesting-depth counter. +For example, the Linux kernel's preemptible RCU limits nesting to +<tt>INT_MAX</tt>. +This should suffice for almost all practical purposes. +That said, a consecutive pair of RCU read-side critical sections +between which there is an operation that waits for a grace period +cannot be enclosed in another RCU read-side critical section. +This is because it is not legal to wait for a grace period within +an RCU read-side critical section: To do so would result either +in deadlock or +in RCU implicitly splitting the enclosing RCU read-side critical +section, neither of which is conducive to a long-lived and prosperous +kernel. + +<p> +It is worth noting that RCU is not alone in limiting composability. +For example, many transactional-memory implementations prohibit +composing a pair of transactions separated by an irrevocable +operation (for example, a network receive operation). +For another example, lock-based critical sections can be composed +surprisingly freely, but only if deadlock is avoided. + +<p> +In short, although RCU read-side critical sections are highly composable, +care is required in some situations, just as is the case for any other +composable synchronization mechanism. + +<h3><a name="Corner Cases">Corner Cases</a></h3> + +<p> +A given RCU workload might have an endless and intense stream of +RCU read-side critical sections, perhaps even so intense that there +was never a point in time during which there was not at least one +RCU read-side critical section in flight. +RCU cannot allow this situation to block grace periods: As long as +all the RCU read-side critical sections are finite, grace periods +must also be finite. + +<p> +That said, preemptible RCU implementations could potentially result +in RCU read-side critical sections being preempted for long durations, +which has the effect of creating a long-duration RCU read-side +critical section. +This situation can arise only in heavily loaded systems, but systems using +real-time priorities are of course more vulnerable. +Therefore, RCU priority boosting is provided to help deal with this +case. +That said, the exact requirements on RCU priority boosting will likely +evolve as more experience accumulates. + +<p> +Other workloads might have very high update rates. +Although one can argue that such workloads should instead use +something other than RCU, the fact remains that RCU must +handle such workloads gracefully. +This requirement is another factor driving batching of grace periods, +but it is also the driving force behind the checks for large numbers +of queued RCU callbacks in the <tt>call_rcu()</tt> code path. +Finally, high update rates should not delay RCU read-side critical +sections, although some small read-side delays can occur when using +<tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use +of <tt>smp_call_function_single()</tt>. + +<p> +Although all three of these corner cases were understood in the early +1990s, a simple user-level test consisting of <tt>close(open(path))</tt> +in a tight loop +in the early 2000s suddenly provided a much deeper appreciation of the +high-update-rate corner case. +This test also motivated addition of some RCU code to react to high update +rates, for example, if a given CPU finds itself with more than 10,000 +RCU callbacks queued, it will cause RCU to take evasive action by +more aggressively starting grace periods and more aggressively forcing +completion of grace-period processing. +This evasive action causes the grace period to complete more quickly, +but at the cost of restricting RCU's batching optimizations, thus +increasing the CPU overhead incurred by that grace period. + +<h2><a name="Software-Engineering Requirements"> +Software-Engineering Requirements</a></h2> + +<p> +Between Murphy's Law and “To err is human”, it is necessary to +guard against mishaps and misuse: + +<ol> +<li> It is all too easy to forget to use <tt>rcu_read_lock()</tt> + everywhere that it is needed, so kernels built with + <tt>CONFIG_PROVE_RCU=y</tt> will splat if + <tt>rcu_dereference()</tt> is used outside of an + RCU read-side critical section. + Update-side code can use <tt>rcu_dereference_protected()</tt>, + which takes a + <a href="https://lwn.net/Articles/371986/">lockdep expression</a> + to indicate what is providing the protection. + If the indicated protection is not provided, a lockdep splat + is emitted. + + <p> + Code shared between readers and updaters can use + <tt>rcu_dereference_check()</tt>, which also takes a + lockdep expression, and emits a lockdep splat if neither + <tt>rcu_read_lock()</tt> nor the indicated protection + is in place. + In addition, <tt>rcu_dereference_raw()</tt> is used in those + (hopefully rare) cases where the required protection cannot + be easily described. + Finally, <tt>rcu_read_lock_held()</tt> is provided to + allow a function to verify that it has been invoked within + an RCU read-side critical section. + I was made aware of this set of requirements shortly after Thomas + Gleixner audited a number of RCU uses. +<li> A given function might wish to check for RCU-related preconditions + upon entry, before using any other RCU API. + The <tt>rcu_lockdep_assert()</tt> does this job, + asserting the expression in kernels having lockdep enabled + and doing nothing otherwise. +<li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt> + and <tt>rcu_dereference()</tt>, perhaps (incorrectly) + substituting a simple assignment. + To catch this sort of error, a given RCU-protected pointer may be + tagged with <tt>__rcu</tt>, after which sparse + will complain about simple-assignment accesses to that pointer. + Arnd Bergmann made me aware of this requirement, and also + supplied the needed + <a href="https://lwn.net/Articles/376011/">patch series</a>. +<li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt> + will splat if a data element is passed to <tt>call_rcu()</tt> + twice in a row, without a grace period in between. + (This error is similar to a double free.) + The corresponding <tt>rcu_head</tt> structures that are + dynamically allocated are automatically tracked, but + <tt>rcu_head</tt> structures allocated on the stack + must be initialized with <tt>init_rcu_head_on_stack()</tt> + and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>. + Similarly, statically allocated non-stack <tt>rcu_head</tt> + structures must be initialized with <tt>init_rcu_head()</tt> + and cleaned up with <tt>destroy_rcu_head()</tt>. + Mathieu Desnoyers made me aware of this requirement, and also + supplied the needed + <a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>. +<li> An infinite loop in an RCU read-side critical section will + eventually trigger an RCU CPU stall warning splat, with + the duration of “eventually” being controlled by the + <tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or, + alternatively, by the + <tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs + parameter. + However, RCU is not obligated to produce this splat + unless there is a grace period waiting on that particular + RCU read-side critical section. + <p> + Some extreme workloads might intentionally delay + RCU grace periods, and systems running those workloads can + be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt> + to suppress the splats. + This kernel parameter may also be set via <tt>sysfs</tt>. + Furthermore, RCU CPU stall warnings are counter-productive + during sysrq dumps and during panics. + RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and + <tt>rcu_sysrq_end()</tt> API members to be called before + and after long sysrq dumps. + RCU also supplies the <tt>rcu_panic()</tt> notifier that is + automatically invoked at the beginning of a panic to suppress + further RCU CPU stall warnings. + + <p> + This requirement made itself known in the early 1990s, pretty + much the first time that it was necessary to debug a CPU stall. + That said, the initial implementation in DYNIX/ptx was quite + generic in comparison with that of Linux. +<li> Although it would be very good to detect pointers leaking out + of RCU read-side critical sections, there is currently no + good way of doing this. + One complication is the need to distinguish between pointers + leaking and pointers that have been handed off from RCU to + some other synchronization mechanism, for example, reference + counting. +<li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related + information is provided via event tracing. +<li> Open-coded use of <tt>rcu_assign_pointer()</tt> and + <tt>rcu_dereference()</tt> to create typical linked + data structures can be surprisingly error-prone. + Therefore, RCU-protected + <a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a> + and, more recently, RCU-protected + <a href="https://lwn.net/Articles/612100/">hash tables</a> + are available. + Many other special-purpose RCU-protected data structures are + available in the Linux kernel and the userspace RCU library. +<li> Some linked structures are created at compile time, but still + require <tt>__rcu</tt> checking. + The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this + purpose. +<li> It is not necessary to use <tt>rcu_assign_pointer()</tt> + when creating linked structures that are to be published via + a single external pointer. + The <tt>RCU_INIT_POINTER()</tt> macro is provided for + this task and also for assigning <tt>NULL</tt> pointers + at runtime. +</ol> + +<p> +This not a hard-and-fast list: RCU's diagnostic capabilities will +continue to be guided by the number and type of usage bugs found +in real-world RCU usage. + +<h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2> + +<p> +The Linux kernel provides an interesting environment for all kinds of +software, including RCU. +Some of the relevant points of interest are as follows: + +<ol> +<li> <a href="#Configuration">Configuration</a>. +<li> <a href="#Firmware Interface">Firmware Interface</a>. +<li> <a href="#Early Boot">Early Boot</a>. +<li> <a href="#Interrupts and NMIs"> + Interrupts and non-maskable interrupts (NMIs)</a>. +<li> <a href="#Loadable Modules">Loadable Modules</a>. +<li> <a href="#Hotplug CPU">Hotplug CPU</a>. +<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. +<li> <a href="#Tracing and RCU">Tracing and RCU</a>. +<li> <a href="#Energy Efficiency">Energy Efficiency</a>. +<li> <a href="#Scheduling-Clock Interrupts and RCU"> + Scheduling-Clock Interrupts and RCU</a>. +<li> <a href="#Memory Efficiency">Memory Efficiency</a>. +<li> <a href="#Performance, Scalability, Response Time, and Reliability"> + Performance, Scalability, Response Time, and Reliability</a>. +</ol> + +<p> +This list is probably incomplete, but it does give a feel for the +most notable Linux-kernel complications. +Each of the following sections covers one of the above topics. + +<h3><a name="Configuration">Configuration</a></h3> + +<p> +RCU's goal is automatic configuration, so that almost nobody +needs to worry about RCU's <tt>Kconfig</tt> options. +And for almost all users, RCU does in fact work well +“out of the box.” + +<p> +However, there are specialized use cases that are handled by +kernel boot parameters and <tt>Kconfig</tt> options. +Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users +about new <tt>Kconfig</tt> options, which requires almost all of them +be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option. + +<p> +This all should be quite obvious, but the fact remains that +Linus Torvalds recently had to +<a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a> +me of this requirement. + +<h3><a name="Firmware Interface">Firmware Interface</a></h3> + +<p> +In many cases, kernel obtains information about the system from the +firmware, and sometimes things are lost in translation. +Or the translation is accurate, but the original message is bogus. + +<p> +For example, some systems' firmware overreports the number of CPUs, +sometimes by a large factor. +If RCU naively believed the firmware, as it used to do, +it would create too many per-CPU kthreads. +Although the resulting system will still run correctly, the extra +kthreads needlessly consume memory and can cause confusion +when they show up in <tt>ps</tt> listings. + +<p> +RCU must therefore wait for a given CPU to actually come online before +it can allow itself to believe that the CPU actually exists. +The resulting “ghost CPUs” (which are never going to +come online) cause a number of +<a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>. + +<h3><a name="Early Boot">Early Boot</a></h3> + +<p> +The Linux kernel's boot sequence is an interesting process, +and RCU is used early, even before <tt>rcu_init()</tt> +is invoked. +In fact, a number of RCU's primitives can be used as soon as the +initial task's <tt>task_struct</tt> is available and the +boot CPU's per-CPU variables are set up. +The read-side primitives (<tt>rcu_read_lock()</tt>, +<tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>, +and <tt>rcu_access_pointer()</tt>) will operate normally very early on, +as will <tt>rcu_assign_pointer()</tt>. + +<p> +Although <tt>call_rcu()</tt> may be invoked at any +time during boot, callbacks are not guaranteed to be invoked until after +all of RCU's kthreads have been spawned, which occurs at +<tt>early_initcall()</tt> time. +This delay in callback invocation is due to the fact that RCU does not +invoke callbacks until it is fully initialized, and this full initialization +cannot occur until after the scheduler has initialized itself to the +point where RCU can spawn and run its kthreads. +In theory, it would be possible to invoke callbacks earlier, +however, this is not a panacea because there would be severe restrictions +on what operations those callbacks could invoke. + +<p> +Perhaps surprisingly, <tt>synchronize_rcu()</tt>, +<a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a> +(<a href="#Bottom-Half Flavor">discussed below</a>), +<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>, +<tt>synchronize_rcu_expedited()</tt>, +<tt>synchronize_rcu_bh_expedited()</tt>, and +<tt>synchronize_sched_expedited()</tt> +will all operate normally +during very early boot, the reason being that there is only one CPU +and preemption is disabled. +This means that the call <tt>synchronize_rcu()</tt> (or friends) +itself is a quiescent +state and thus a grace period, so the early-boot implementation can +be a no-op. + +<p> +However, once the scheduler has spawned its first kthread, this early +boot trick fails for <tt>synchronize_rcu()</tt> (as well as for +<tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt> +kernels. +The reason is that an RCU read-side critical section might be preempted, +which means that a subsequent <tt>synchronize_rcu()</tt> really does have +to wait for something, as opposed to simply returning immediately. +Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of +its kthreads are spawned, which doesn't happen until some time during +<tt>early_initcalls()</tt> time. +But this is no excuse: RCU is nevertheless required to correctly handle +synchronous grace periods during this time period. +Once all of its kthreads are up and running, RCU starts running +normally. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + How can RCU possibly handle grace periods before all of its + kthreads have been spawned??? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + Very carefully! + </font> + + <p><font color="ffffff"> + During the “dead zone” between the time that the + scheduler spawns the first task and the time that all of RCU's + kthreads have been spawned, all synchronous grace periods are + handled by the expedited grace-period mechanism. + At runtime, this expedited mechanism relies on workqueues, but + during the dead zone the requesting task itself drives the + desired expedited grace period. + Because dead-zone execution takes place within task context, + everything works. + Once the dead zone ends, expedited grace periods go back to + using workqueues, as is required to avoid problems that would + otherwise occur when a user task received a POSIX signal while + driving an expedited grace period. + </font> + + <p><font color="ffffff"> + And yes, this does mean that it is unhelpful to send POSIX + signals to random tasks between the time that the scheduler + spawns its first kthread and the time that RCU's kthreads + have all been spawned. + If there ever turns out to be a good reason for sending POSIX + signals during that time, appropriate adjustments will be made. + (If it turns out that POSIX signals are sent during this time for + no good reason, other adjustments will be made, appropriate + or otherwise.) +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +I learned of these boot-time requirements as a result of a series of +system hangs. + +<h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3> + +<p> +The Linux kernel has interrupts, and RCU read-side critical sections are +legal within interrupt handlers and within interrupt-disabled regions +of code, as are invocations of <tt>call_rcu()</tt>. + +<p> +Some Linux-kernel architectures can enter an interrupt handler from +non-idle process context, and then just never leave it, instead stealthily +transitioning back to process context. +This trick is sometimes used to invoke system calls from inside the kernel. +These “half-interrupts” mean that RCU has to be very careful +about how it counts interrupt nesting levels. +I learned of this requirement the hard way during a rewrite +of RCU's dyntick-idle code. + +<p> +The Linux kernel has non-maskable interrupts (NMIs), and +RCU read-side critical sections are legal within NMI handlers. +Thankfully, RCU update-side primitives, including +<tt>call_rcu()</tt>, are prohibited within NMI handlers. + +<p> +The name notwithstanding, some Linux-kernel architectures +can have nested NMIs, which RCU must handle correctly. +Andy Lutomirski +<a href="https://lkml.kernel.org/g/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a> +with this requirement; +he also kindly surprised me with +<a href="https://lkml.kernel.org/g/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a> +that meets this requirement. + +<h3><a name="Loadable Modules">Loadable Modules</a></h3> + +<p> +The Linux kernel has loadable modules, and these modules can +also be unloaded. +After a given module has been unloaded, any attempt to call +one of its functions results in a segmentation fault. +The module-unload functions must therefore cancel any +delayed calls to loadable-module functions, for example, +any outstanding <tt>mod_timer()</tt> must be dealt with +via <tt>del_timer_sync()</tt> or similar. + +<p> +Unfortunately, there is no way to cancel an RCU callback; +once you invoke <tt>call_rcu()</tt>, the callback function is +going to eventually be invoked, unless the system goes down first. +Because it is normally considered socially irresponsible to crash the system +in response to a module unload request, we need some other way +to deal with in-flight RCU callbacks. + +<p> +RCU therefore provides +<tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>, +which waits until all in-flight RCU callbacks have been invoked. +If a module uses <tt>call_rcu()</tt>, its exit function should therefore +prevent any future invocation of <tt>call_rcu()</tt>, then invoke +<tt>rcu_barrier()</tt>. +In theory, the underlying module-unload code could invoke +<tt>rcu_barrier()</tt> unconditionally, but in practice this would +incur unacceptable latencies. + +<p> +Nikita Danilov noted this requirement for an analogous filesystem-unmount +situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU. +The need for <tt>rcu_barrier()</tt> for module unloading became +apparent later. + +<p> +<b>Important note</b>: The <tt>rcu_barrier()</tt> function is not, +repeat, <i>not</i>, obligated to wait for a grace period. +It is instead only required to wait for RCU callbacks that have +already been posted. +Therefore, if there are no RCU callbacks posted anywhere in the system, +<tt>rcu_barrier()</tt> is within its rights to return immediately. +Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not +necessarily need to wait for a grace period. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Wait a minute! + Each RCU callbacks must wait for a grace period to complete, + and <tt>rcu_barrier()</tt> must wait for each pre-existing + callback to be invoked. + Doesn't <tt>rcu_barrier()</tt> therefore need to wait for + a full grace period if there is even one callback posted anywhere + in the system? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + Absolutely not!!! + </font> + + <p><font color="ffffff"> + Yes, each RCU callbacks must wait for a grace period to complete, + but it might well be partly (or even completely) finished waiting + by the time <tt>rcu_barrier()</tt> is invoked. + In that case, <tt>rcu_barrier()</tt> need only wait for the + remaining portion of the grace period to elapse. + So even if there are quite a few callbacks posted, + <tt>rcu_barrier()</tt> might well return quite quickly. + </font> + + <p><font color="ffffff"> + So if you need to wait for a grace period as well as for all + pre-existing callbacks, you will need to invoke both + <tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>. + If latency is a concern, you can always use workqueues + to invoke them concurrently. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<h3><a name="Hotplug CPU">Hotplug CPU</a></h3> + +<p> +The Linux kernel supports CPU hotplug, which means that CPUs +can come and go. +It is of course illegal to use any RCU API member from an offline CPU, +with the exception of <a href="#Sleepable RCU">SRCU</a> read-side +critical sections. +This requirement was present from day one in DYNIX/ptx, but +on the other hand, the Linux kernel's CPU-hotplug implementation +is “interesting.” + +<p> +The Linux-kernel CPU-hotplug implementation has notifiers that +are used to allow the various kernel subsystems (including RCU) +to respond appropriately to a given CPU-hotplug operation. +Most RCU operations may be invoked from CPU-hotplug notifiers, +including even synchronous grace-period operations such as +<tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>. + +<p> +However, all-callback-wait operations such as +<tt>rcu_barrier()</tt> are also not supported, due to the +fact that there are phases of CPU-hotplug operations where +the outgoing CPU's callbacks will not be invoked until after +the CPU-hotplug operation ends, which could also result in deadlock. +Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations +during its execution, which results in another type of deadlock +when invoked from a CPU-hotplug notifier. + +<h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3> + +<p> +RCU depends on the scheduler, and the scheduler uses RCU to +protect some of its data structures. +This means the scheduler is forbidden from acquiring +the runqueue locks and the priority-inheritance locks +in the middle of an outermost RCU read-side critical section unless either +(1) it releases them before exiting that same +RCU read-side critical section, or +(2) interrupts are disabled across +that entire RCU read-side critical section. +This same prohibition also applies (recursively!) to any lock that is acquired +while holding any lock to which this prohibition applies. +Adhering to this rule prevents preemptible RCU from invoking +<tt>rcu_read_unlock_special()</tt> while either runqueue or +priority-inheritance locks are held, thus avoiding deadlock. + +<p> +Prior to v4.4, it was only necessary to disable preemption across +RCU read-side critical sections that acquired scheduler locks. +In v4.4, expedited grace periods started using IPIs, and these +IPIs could force a <tt>rcu_read_unlock()</tt> to take the slowpath. +Therefore, this expedited-grace-period change required disabling of +interrupts, not just preemption. + +<p> +For RCU's part, the preemptible-RCU <tt>rcu_read_unlock()</tt> +implementation must be written carefully to avoid similar deadlocks. +In particular, <tt>rcu_read_unlock()</tt> must tolerate an +interrupt where the interrupt handler invokes both +<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. +This possibility requires <tt>rcu_read_unlock()</tt> to use +negative nesting levels to avoid destructive recursion via +interrupt handler's use of RCU. + +<p> +This pair of mutual scheduler-RCU requirements came as a +<a href="https://lwn.net/Articles/453002/">complete surprise</a>. + +<p> +As noted above, RCU makes use of kthreads, and it is necessary to +avoid excessive CPU-time accumulation by these kthreads. +This requirement was no surprise, but RCU's violation of it +when running context-switch-heavy workloads when built with +<tt>CONFIG_NO_HZ_FULL=y</tt> +<a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>. +RCU has made good progress towards meeting this requirement, even +for context-switch-have <tt>CONFIG_NO_HZ_FULL=y</tt> workloads, +but there is room for further improvement. + +<h3><a name="Tracing and RCU">Tracing and RCU</a></h3> + +<p> +It is possible to use tracing on RCU code, but tracing itself +uses RCU. +For this reason, <tt>rcu_dereference_raw_notrace()</tt> +is provided for use by tracing, which avoids the destructive +recursion that could otherwise ensue. +This API is also used by virtualization in some architectures, +where RCU readers execute in environments in which tracing +cannot be used. +The tracing folks both located the requirement and provided the +needed fix, so this surprise requirement was relatively painless. + +<h3><a name="Energy Efficiency">Energy Efficiency</a></h3> + +<p> +Interrupting idle CPUs is considered socially unacceptable, +especially by people with battery-powered embedded systems. +RCU therefore conserves energy by detecting which CPUs are +idle, including tracking CPUs that have been interrupted from idle. +This is a large part of the energy-efficiency requirement, +so I learned of this via an irate phone call. + +<p> +Because RCU avoids interrupting idle CPUs, it is illegal to +execute an RCU read-side critical section on an idle CPU. +(Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat +if you try it.) +The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt> +event tracing is provided to work around this restriction. +In addition, <tt>rcu_is_watching()</tt> may be used to +test whether or not it is currently legal to run RCU read-side +critical sections on this CPU. +I learned of the need for diagnostics on the one hand +and <tt>RCU_NONIDLE()</tt> on the other while inspecting +idle-loop code. +Steven Rostedt supplied <tt>_rcuidle</tt> event tracing, +which is used quite heavily in the idle loop. +However, there are some restrictions on the code placed within +<tt>RCU_NONIDLE()</tt>: + +<ol> +<li> Blocking is prohibited. + In practice, this is not a serious restriction given that idle + tasks are prohibited from blocking to begin with. +<li> Although nesting <tt>RCU_NONIDLE()</tt> is permitted, they cannot + nest indefinitely deeply. + However, given that they can be nested on the order of a million + deep, even on 32-bit systems, this should not be a serious + restriction. + This nesting limit would probably be reached long after the + compiler OOMed or the stack overflowed. +<li> Any code path that enters <tt>RCU_NONIDLE()</tt> must sequence + out of that same <tt>RCU_NONIDLE()</tt>. + For example, the following is grossly illegal: + + <blockquote> + <pre> + 1 RCU_NONIDLE({ + 2 do_something(); + 3 goto bad_idea; /* BUG!!! */ + 4 do_something_else();}); + 5 bad_idea: + </pre> + </blockquote> + + <p> + It is just as illegal to transfer control into the middle of + <tt>RCU_NONIDLE()</tt>'s argument. + Yes, in theory, you could transfer in as long as you also + transferred out, but in practice you could also expect to get sharply + worded review comments. +</ol> + +<p> +It is similarly socially unacceptable to interrupt an +<tt>nohz_full</tt> CPU running in userspace. +RCU must therefore track <tt>nohz_full</tt> userspace +execution. +RCU must therefore be able to sample state at two points in +time, and be able to determine whether or not some other CPU spent +any time idle and/or executing in userspace. + +<p> +These energy-efficiency requirements have proven quite difficult to +understand and to meet, for example, there have been more than five +clean-sheet rewrites of RCU's energy-efficiency code, the last of +which was finally able to demonstrate +<a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>. +As noted earlier, +I learned of many of these requirements via angry phone calls: +Flaming me on the Linux-kernel mailing list was apparently not +sufficient to fully vent their ire at RCU's energy-efficiency bugs! + +<h3><a name="Scheduling-Clock Interrupts and RCU"> +Scheduling-Clock Interrupts and RCU</a></h3> + +<p> +The kernel transitions between in-kernel non-idle execution, userspace +execution, and the idle loop. +Depending on kernel configuration, RCU handles these states differently: + +<table border=3> +<tr><th><tt>HZ</tt> Kconfig</th> + <th>In-Kernel</th> + <th>Usermode</th> + <th>Idle</th></tr> +<tr><th align="left"><tt>HZ_PERIODIC</tt></th> + <td>Can rely on scheduling-clock interrupt.</td> + <td>Can rely on scheduling-clock interrupt and its + detection of interrupt from usermode.</td> + <td>Can rely on RCU's dyntick-idle detection.</td></tr> +<tr><th align="left"><tt>NO_HZ_IDLE</tt></th> + <td>Can rely on scheduling-clock interrupt.</td> + <td>Can rely on scheduling-clock interrupt and its + detection of interrupt from usermode.</td> + <td>Can rely on RCU's dyntick-idle detection.</td></tr> +<tr><th align="left"><tt>NO_HZ_FULL</tt></th> + <td>Can only sometimes rely on scheduling-clock interrupt. + In other cases, it is necessary to bound kernel execution + times and/or use IPIs.</td> + <td>Can rely on RCU's dyntick-idle detection.</td> + <td>Can rely on RCU's dyntick-idle detection.</td></tr> +</table> + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the + scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt> + and <tt>NO_HZ_IDLE</tt> do? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + Because, as a performance optimization, <tt>NO_HZ_FULL</tt> + does not necessarily re-enable the scheduling-clock interrupt + on entry to each and every system call. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +However, RCU must be reliably informed as to whether any given +CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>, +also whether that CPU is executing in usermode, as discussed +<a href="#Energy Efficiency">earlier</a>. +It also requires that the scheduling-clock interrupt be enabled when +RCU needs it to be: + +<ol> +<li> If a CPU is either idle or executing in usermode, and RCU believes + it is non-idle, the scheduling-clock tick had better be running. + Otherwise, you will get RCU CPU stall warnings. Or at best, + very long (11-second) grace periods, with a pointless IPI waking + the CPU from time to time. +<li> If a CPU is in a portion of the kernel that executes RCU read-side + critical sections, and RCU believes this CPU to be idle, you will get + random memory corruption. <b>DON'T DO THIS!!!</b> + + <br>This is one reason to test with lockdep, which will complain + about this sort of thing. +<li> If a CPU is in a portion of the kernel that is absolutely + positively no-joking guaranteed to never execute any RCU read-side + critical sections, and RCU believes this CPU to to be idle, + no problem. This sort of thing is used by some architectures + for light-weight exception handlers, which can then avoid the + overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> + at exception entry and exit, respectively. + Some go further and avoid the entireties of <tt>irq_enter()</tt> + and <tt>irq_exit()</tt>. + + <br>Just make very sure you are running some of your tests with + <tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths + was in fact joking about not doing RCU read-side critical sections. +<li> If a CPU is executing in the kernel with the scheduling-clock + interrupt disabled and RCU believes this CPU to be non-idle, + and if the CPU goes idle (from an RCU perspective) every few + jiffies, no problem. It is usually OK for there to be the + occasional gap between idle periods of up to a second or so. + + <br>If the gap grows too long, you get RCU CPU stall warnings. +<li> If a CPU is either idle or executing in usermode, and RCU believes + it to be idle, of course no problem. +<li> If a CPU is executing in the kernel, the kernel code + path is passing through quiescent states at a reasonable + frequency (preferably about once per few jiffies, but the + occasional excursion to a second or so is usually OK) and the + scheduling-clock interrupt is enabled, of course no problem. + + <br>If the gap between a successive pair of quiescent states grows + too long, you get RCU CPU stall warnings. +</ol> + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + But what if my driver has a hardware interrupt handler + that can run for many seconds? + I cannot invoke <tt>schedule()</tt> from an hardware + interrupt handler, after all! +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt> + every so often. + But given that long-running interrupt handlers can cause + other problems, not least for response time, shouldn't you + work to keep your interrupt handler's runtime within reasonable + bounds? +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +But as long as RCU is properly informed of kernel state transitions between +in-kernel execution, usermode execution, and idle, and as long as the +scheduling-clock interrupt is enabled when RCU needs it to be, you +can rest assured that the bugs you encounter will be in some other +part of RCU or some other part of the kernel! + +<h3><a name="Memory Efficiency">Memory Efficiency</a></h3> + +<p> +Although small-memory non-realtime systems can simply use Tiny RCU, +code size is only one aspect of memory efficiency. +Another aspect is the size of the <tt>rcu_head</tt> structure +used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>. +Although this structure contains nothing more than a pair of pointers, +it does appear in many RCU-protected data structures, including +some that are size critical. +The <tt>page</tt> structure is a case in point, as evidenced by +the many occurrences of the <tt>union</tt> keyword within that structure. + +<p> +This need for memory efficiency is one reason that RCU uses hand-crafted +singly linked lists to track the <tt>rcu_head</tt> structures that +are waiting for a grace period to elapse. +It is also the reason why <tt>rcu_head</tt> structures do not contain +debug information, such as fields tracking the file and line of the +<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them. +Although this information might appear in debug-only kernel builds at some +point, in the meantime, the <tt>->func</tt> field will often provide +the needed debug information. + +<p> +However, in some cases, the need for memory efficiency leads to even +more extreme measures. +Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field +shares storage with a great many other structures that are used at +various points in the corresponding page's lifetime. +In order to correctly resolve certain +<a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>, +the Linux kernel's memory-management subsystem needs a particular bit +to remain zero during all phases of grace-period processing, +and that bit happens to map to the bottom bit of the +<tt>rcu_head</tt> structure's <tt>->next</tt> field. +RCU makes this guarantee as long as <tt>call_rcu()</tt> +is used to post the callback, as opposed to <tt>kfree_rcu()</tt> +or some future “lazy” +variant of <tt>call_rcu()</tt> that might one day be created for +energy-efficiency purposes. + +<p> +That said, there are limits. +RCU requires that the <tt>rcu_head</tt> structure be aligned to a +two-byte boundary, and passing a misaligned <tt>rcu_head</tt> +structure to one of the <tt>call_rcu()</tt> family of functions +will result in a splat. +It is therefore necessary to exercise caution when packing +structures containing fields of type <tt>rcu_head</tt>. +Why not a four-byte or even eight-byte alignment requirement? +Because the m68k architecture provides only two-byte alignment, +and thus acts as alignment's least common denominator. + +<p> +The reason for reserving the bottom bit of pointers to +<tt>rcu_head</tt> structures is to leave the door open to +“lazy” callbacks whose invocations can safely be deferred. +Deferring invocation could potentially have energy-efficiency +benefits, but only if the rate of non-lazy callbacks decreases +significantly for some important workload. +In the meantime, reserving the bottom bit keeps this option open +in case it one day becomes useful. + +<h3><a name="Performance, Scalability, Response Time, and Reliability"> +Performance, Scalability, Response Time, and Reliability</a></h3> + +<p> +Expanding on the +<a href="#Performance and Scalability">earlier discussion</a>, +RCU is used heavily by hot code paths in performance-critical +portions of the Linux kernel's networking, security, virtualization, +and scheduling code paths. +RCU must therefore use efficient implementations, especially in its +read-side primitives. +To that end, it would be good if preemptible RCU's implementation +of <tt>rcu_read_lock()</tt> could be inlined, however, doing +this requires resolving <tt>#include</tt> issues with the +<tt>task_struct</tt> structure. + +<p> +The Linux kernel supports hardware configurations with up to +4096 CPUs, which means that RCU must be extremely scalable. +Algorithms that involve frequent acquisitions of global locks or +frequent atomic operations on global variables simply cannot be +tolerated within the RCU implementation. +RCU therefore makes heavy use of a combining tree based on the +<tt>rcu_node</tt> structure. +RCU is required to tolerate all CPUs continuously invoking any +combination of RCU's runtime primitives with minimal per-operation +overhead. +In fact, in many cases, increasing load must <i>decrease</i> the +per-operation overhead, witness the batching optimizations for +<tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>, +<tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>. +As a general rule, RCU must cheerfully accept whatever the +rest of the Linux kernel decides to throw at it. + +<p> +The Linux kernel is used for real-time workloads, especially +in conjunction with the +<a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>. +The real-time-latency response requirements are such that the +traditional approach of disabling preemption across RCU +read-side critical sections is inappropriate. +Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore +use an RCU implementation that allows RCU read-side critical +sections to be preempted. +This requirement made its presence known after users made it +clear that an earlier +<a href="https://lwn.net/Articles/107930/">real-time patch</a> +did not meet their needs, in conjunction with some +<a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a> +encountered by a very early version of the -rt patchset. + +<p> +In addition, RCU must make do with a sub-100-microsecond real-time latency +budget. +In fact, on smaller systems with the -rt patchset, the Linux kernel +provides sub-20-microsecond real-time latencies for the whole kernel, +including RCU. +RCU's scalability and latency must therefore be sufficient for +these sorts of configurations. +To my surprise, the sub-100-microsecond real-time latency budget +<a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf"> +applies to even the largest systems [PDF]</a>, +up to and including systems with 4096 CPUs. +This real-time requirement motivated the grace-period kthread, which +also simplified handling of a number of race conditions. + +<p> +RCU must avoid degrading real-time response for CPU-bound threads, whether +executing in usermode (which is one use case for +<tt>CONFIG_NO_HZ_FULL=y</tt>) or in the kernel. +That said, CPU-bound loops in the kernel must execute +<tt>cond_resched()</tt> at least once per few tens of milliseconds +in order to avoid receiving an IPI from RCU. + +<p> +Finally, RCU's status as a synchronization primitive means that +any RCU failure can result in arbitrary memory corruption that can be +extremely difficult to debug. +This means that RCU must be extremely reliable, which in +practice also means that RCU must have an aggressive stress-test +suite. +This stress-test suite is called <tt>rcutorture</tt>. + +<p> +Although the need for <tt>rcutorture</tt> was no surprise, +the current immense popularity of the Linux kernel is posing +interesting—and perhaps unprecedented—validation +challenges. +To see this, keep in mind that there are well over one billion +instances of the Linux kernel running today, given Android +smartphones, Linux-powered televisions, and servers. +This number can be expected to increase sharply with the advent of +the celebrated Internet of Things. + +<p> +Suppose that RCU contains a race condition that manifests on average +once per million years of runtime. +This bug will be occurring about three times per <i>day</i> across +the installed base. +RCU could simply hide behind hardware error rates, given that no one +should really expect their smartphone to last for a million years. +However, anyone taking too much comfort from this thought should +consider the fact that in most jurisdictions, a successful multi-year +test of a given mechanism, which might include a Linux kernel, +suffices for a number of types of safety-critical certifications. +In fact, rumor has it that the Linux kernel is already being used +in production for safety-critical applications. +I don't know about you, but I would feel quite bad if a bug in RCU +killed someone. +Which might explain my recent focus on validation and verification. + +<h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2> + +<p> +One of the more surprising things about RCU is that there are now +no fewer than five <i>flavors</i>, or API families. +In addition, the primary flavor that has been the sole focus up to +this point has two different implementations, non-preemptible and +preemptible. +The other four flavors are listed below, with requirements for each +described in a separate section. + +<ol> +<li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor</a> +<li> <a href="#Sched Flavor">Sched Flavor</a> +<li> <a href="#Sleepable RCU">Sleepable RCU</a> +<li> <a href="#Tasks RCU">Tasks RCU</a> +<li> <a href="#Waiting for Multiple Grace Periods"> + Waiting for Multiple Grace Periods</a> +</ol> + +<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor</a></h3> + +<p> +The softirq-disable (AKA “bottom-half”, +hence the “_bh” abbreviations) +flavor of RCU, or <i>RCU-bh</i>, was developed by +Dipankar Sarma to provide a flavor of RCU that could withstand the +network-based denial-of-service attacks researched by Robert +Olsson. +These attacks placed so much networking load on the system +that some of the CPUs never exited softirq execution, +which in turn prevented those CPUs from ever executing a context switch, +which, in the RCU implementation of that time, prevented grace periods +from ever ending. +The result was an out-of-memory condition and a system hang. + +<p> +The solution was the creation of RCU-bh, which does +<tt>local_bh_disable()</tt> +across its read-side critical sections, and which uses the transition +from one type of softirq processing to another as a quiescent state +in addition to context switch, idle, user mode, and offline. +This means that RCU-bh grace periods can complete even when some of +the CPUs execute in softirq indefinitely, thus allowing algorithms +based on RCU-bh to withstand network-based denial-of-service attacks. + +<p> +Because +<tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt> +disable and re-enable softirq handlers, any attempt to start a softirq +handlers during the +RCU-bh read-side critical section will be deferred. +In this case, <tt>rcu_read_unlock_bh()</tt> +will invoke softirq processing, which can take considerable time. +One can of course argue that this softirq overhead should be associated +with the code following the RCU-bh read-side critical section rather +than <tt>rcu_read_unlock_bh()</tt>, but the fact +is that most profiling tools cannot be expected to make this sort +of fine distinction. +For example, suppose that a three-millisecond-long RCU-bh read-side +critical section executes during a time of heavy networking load. +There will very likely be an attempt to invoke at least one softirq +handler during that three milliseconds, but any such invocation will +be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>. +This can of course make it appear at first glance as if +<tt>rcu_read_unlock_bh()</tt> was executing very slowly. + +<p> +The +<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a> +includes +<tt>rcu_read_lock_bh()</tt>, +<tt>rcu_read_unlock_bh()</tt>, +<tt>rcu_dereference_bh()</tt>, +<tt>rcu_dereference_bh_check()</tt>, +<tt>synchronize_rcu_bh()</tt>, +<tt>synchronize_rcu_bh_expedited()</tt>, +<tt>call_rcu_bh()</tt>, +<tt>rcu_barrier_bh()</tt>, and +<tt>rcu_read_lock_bh_held()</tt>. + +<h3><a name="Sched Flavor">Sched Flavor</a></h3> + +<p> +Before preemptible RCU, waiting for an RCU grace period had the +side effect of also waiting for all pre-existing interrupt +and NMI handlers. +However, there are legitimate preemptible-RCU implementations that +do not have this property, given that any point in the code outside +of an RCU read-side critical section can be a quiescent state. +Therefore, <i>RCU-sched</i> was created, which follows “classic” +RCU in that an RCU-sched grace period waits for for pre-existing +interrupt and NMI handlers. +In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched +APIs have identical implementations, while kernels built with +<tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each. + +<p> +Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels, +<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> +disable and re-enable preemption, respectively. +This means that if there was a preemption attempt during the +RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt> +will enter the scheduler, with all the latency and overhead entailed. +Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look +as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly. +However, the highest-priority task won't be preempted, so that task +will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations. + +<p> +The +<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a> +includes +<tt>rcu_read_lock_sched()</tt>, +<tt>rcu_read_unlock_sched()</tt>, +<tt>rcu_read_lock_sched_notrace()</tt>, +<tt>rcu_read_unlock_sched_notrace()</tt>, +<tt>rcu_dereference_sched()</tt>, +<tt>rcu_dereference_sched_check()</tt>, +<tt>synchronize_sched()</tt>, +<tt>synchronize_rcu_sched_expedited()</tt>, +<tt>call_rcu_sched()</tt>, +<tt>rcu_barrier_sched()</tt>, and +<tt>rcu_read_lock_sched_held()</tt>. +However, anything that disables preemption also marks an RCU-sched +read-side critical section, including +<tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>, +<tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>, +and so on. + +<h3><a name="Sleepable RCU">Sleepable RCU</a></h3> + +<p> +For well over a decade, someone saying “I need to block within +an RCU read-side critical section” was a reliable indication +that this someone did not understand RCU. +After all, if you are always blocking in an RCU read-side critical +section, you can probably afford to use a higher-overhead synchronization +mechanism. +However, that changed with the advent of the Linux kernel's notifiers, +whose RCU read-side critical +sections almost never sleep, but sometimes need to. +This resulted in the introduction of +<a href="https://lwn.net/Articles/202847/">sleepable RCU</a>, +or <i>SRCU</i>. + +<p> +SRCU allows different domains to be defined, with each such domain +defined by an instance of an <tt>srcu_struct</tt> structure. +A pointer to this structure must be passed in to each SRCU function, +for example, <tt>synchronize_srcu(&ss)</tt>, where +<tt>ss</tt> is the <tt>srcu_struct</tt> structure. +The key benefit of these domains is that a slow SRCU reader in one +domain does not delay an SRCU grace period in some other domain. +That said, one consequence of these domains is that read-side code +must pass a “cookie” from <tt>srcu_read_lock()</tt> +to <tt>srcu_read_unlock()</tt>, for example, as follows: + +<blockquote> +<pre> + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 srcu_read_unlock(&ss, idx); +</pre> +</blockquote> + +<p> +As noted above, it is legal to block within SRCU read-side critical sections, +however, with great power comes great responsibility. +If you block forever in one of a given domain's SRCU read-side critical +sections, then that domain's grace periods will also be blocked forever. +Of course, one good way to block forever is to deadlock, which can +happen if any operation in a given domain's SRCU read-side critical +section can block waiting, either directly or indirectly, for that domain's +grace period to elapse. +For example, this results in a self-deadlock: + +<blockquote> +<pre> + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 synchronize_srcu(&ss); + 6 srcu_read_unlock(&ss, idx); +</pre> +</blockquote> + +<p> +However, if line 5 acquired a mutex that was held across +a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>, +deadlock would still be possible. +Furthermore, if line 5 acquired a mutex that was held across +a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>, +and if an <tt>ss1</tt>-domain SRCU read-side critical section +acquired another mutex that was held across as <tt>ss</tt>-domain +<tt>synchronize_srcu()</tt>, +deadlock would again be possible. +Such a deadlock cycle could extend across an arbitrarily large number +of different SRCU domains. +Again, with great power comes great responsibility. + +<p> +Unlike the other RCU flavors, SRCU read-side critical sections can +run on idle and even offline CPUs. +This ability requires that <tt>srcu_read_lock()</tt> and +<tt>srcu_read_unlock()</tt> contain memory barriers, which means +that SRCU readers will run a bit slower than would RCU readers. +It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt> +API, which, in combination with <tt>srcu_read_unlock()</tt>, +guarantees a full memory barrier. + +<p> +Also unlike other RCU flavors, SRCU's callbacks-wait function +<tt>srcu_barrier()</tt> may be invoked from CPU-hotplug notifiers, +though this is not necessarily a good idea. +The reason that this is possible is that SRCU is insensitive +to whether or not a CPU is online, which means that <tt>srcu_barrier()</tt> +need not exclude CPU-hotplug operations. + +<p> +SRCU also differs from other RCU flavors in that SRCU's expedited and +non-expedited grace periods are implemented by the same mechanism. +This means that in the current SRCU implementation, expediting a +future grace period has the side effect of expediting all prior +grace periods that have not yet completed. +(But please note that this is a property of the current implementation, +not necessarily of future implementations.) +In addition, if SRCU has been idle for longer than the interval +specified by the <tt>srcutree.exp_holdoff</tt> kernel boot parameter +(25 microseconds by default), +and if a <tt>synchronize_srcu()</tt> invocation ends this idle period, +that invocation will be automatically expedited. + +<p> +As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating +a locking bottleneck present in prior kernel versions. +Although this will allow users to put much heavier stress on +<tt>call_srcu()</tt>, it is important to note that SRCU does not +yet take any special steps to deal with callback flooding. +So if you are posting (say) 10,000 SRCU callbacks per second per CPU, +you are probably totally OK, but if you intend to post (say) 1,000,000 +SRCU callbacks per second per CPU, please run some tests first. +SRCU just might need a few adjustment to deal with that sort of load. +Of course, your mileage may vary based on the speed of your CPUs and +the size of your memory. + +<p> +The +<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a> +includes +<tt>srcu_read_lock()</tt>, +<tt>srcu_read_unlock()</tt>, +<tt>srcu_dereference()</tt>, +<tt>srcu_dereference_check()</tt>, +<tt>synchronize_srcu()</tt>, +<tt>synchronize_srcu_expedited()</tt>, +<tt>call_srcu()</tt>, +<tt>srcu_barrier()</tt>, and +<tt>srcu_read_lock_held()</tt>. +It also includes +<tt>DEFINE_SRCU()</tt>, +<tt>DEFINE_STATIC_SRCU()</tt>, and +<tt>init_srcu_struct()</tt> +APIs for defining and initializing <tt>srcu_struct</tt> structures. + +<h3><a name="Tasks RCU">Tasks RCU</a></h3> + +<p> +Some forms of tracing use “trampolines” to handle the +binary rewriting required to install different types of probes. +It would be good to be able to free old trampolines, which sounds +like a job for some form of RCU. +However, because it is necessary to be able to install a trace +anywhere in the code, it is not possible to use read-side markers +such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. +In addition, it does not work to have these markers in the trampoline +itself, because there would need to be instructions following +<tt>rcu_read_unlock()</tt>. +Although <tt>synchronize_rcu()</tt> would guarantee that execution +reached the <tt>rcu_read_unlock()</tt>, it would not be able to +guarantee that execution had completely left the trampoline. + +<p> +The solution, in the form of +<a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>, +is to have implicit +read-side critical sections that are delimited by voluntary context +switches, that is, calls to <tt>schedule()</tt>, +<tt>cond_resched()</tt>, and +<tt>synchronize_rcu_tasks()</tt>. +In addition, transitions to and from userspace execution also delimit +tasks-RCU read-side critical sections. + +<p> +The tasks-RCU API is quite compact, consisting only of +<tt>call_rcu_tasks()</tt>, +<tt>synchronize_rcu_tasks()</tt>, and +<tt>rcu_barrier_tasks()</tt>. + +<h3><a name="Waiting for Multiple Grace Periods"> +Waiting for Multiple Grace Periods</a></h3> + +<p> +Perhaps you have an RCU protected data structure that is accessed from +RCU read-side critical sections, from softirq handlers, and from +hardware interrupt handlers. +That is three flavors of RCU, the normal flavor, the bottom-half flavor, +and the sched flavor. +How to wait for a compound grace period? + +<p> +The best approach is usually to “just say no!” and +insert <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> +around each RCU read-side critical section, regardless of what +environment it happens to be in. +But suppose that some of the RCU read-side critical sections are +on extremely hot code paths, and that use of <tt>CONFIG_PREEMPT=n</tt> +is not a viable option, so that <tt>rcu_read_lock()</tt> and +<tt>rcu_read_unlock()</tt> are not free. +What then? + +<p> +You <i>could</i> wait on all three grace periods in succession, as follows: + +<blockquote> +<pre> + 1 synchronize_rcu(); + 2 synchronize_rcu_bh(); + 3 synchronize_sched(); +</pre> +</blockquote> + +<p> +This works, but triples the update-side latency penalty. +In cases where this is not acceptable, <tt>synchronize_rcu_mult()</tt> +may be used to wait on all three flavors of grace period concurrently: + +<blockquote> +<pre> + 1 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched); +</pre> +</blockquote> + +<p> +But what if it is necessary to also wait on SRCU? +This can be done as follows: + +<blockquote> +<pre> + 1 static void call_my_srcu(struct rcu_head *head, + 2 void (*func)(struct rcu_head *head)) + 3 { + 4 call_srcu(&my_srcu, head, func); + 5 } + 6 + 7 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched, call_my_srcu); +</pre> +</blockquote> + +<p> +If you needed to wait on multiple different flavors of SRCU +(but why???), you would need to create a wrapper function resembling +<tt>call_my_srcu()</tt> for each SRCU flavor. + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + But what if I need to wait for multiple RCU flavors, but I also need + the grace periods to be expedited? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + If you are using expedited grace periods, there should be less penalty + for waiting on them in succession. + But if that is nevertheless a problem, you can use workqueues + or multiple kthreads to wait on the various expedited grace + periods concurrently. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +Again, it is usually better to adjust the RCU read-side critical sections +to use a single flavor of RCU, but when this is not feasible, you can use +<tt>synchronize_rcu_mult()</tt>. + +<h2><a name="Possible Future Changes">Possible Future Changes</a></h2> + +<p> +One of the tricks that RCU uses to attain update-side scalability is +to increase grace-period latency with increasing numbers of CPUs. +If this becomes a serious problem, it will be necessary to rework the +grace-period state machine so as to avoid the need for the additional +latency. + +<p> +Expedited grace periods scan the CPUs, so their latency and overhead +increases with increasing numbers of CPUs. +If this becomes a serious problem on large systems, it will be necessary +to do some redesign to avoid this scalability problem. + +<p> +RCU disables CPU hotplug in a few places, perhaps most notably in the +<tt>rcu_barrier()</tt> operations. +If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug +notifiers, it will be necessary to avoid disabling CPU hotplug. +This would introduce some complexity, so there had better be a <i>very</i> +good reason. + +<p> +The tradeoff between grace-period latency on the one hand and interruptions +of other CPUs on the other hand may need to be re-examined. +The desire is of course for zero grace-period latency as well as zero +interprocessor interrupts undertaken during an expedited grace period +operation. +While this ideal is unlikely to be achievable, it is quite possible that +further improvements can be made. + +<p> +The multiprocessor implementations of RCU use a combining tree that +groups CPUs so as to reduce lock contention and increase cache locality. +However, this combining tree does not spread its memory across NUMA +nodes nor does it align the CPU groups with hardware features such +as sockets or cores. +Such spreading and alignment is currently believed to be unnecessary +because the hotpath read-side primitives do not access the combining +tree, nor does <tt>call_rcu()</tt> in the common case. +If you believe that your architecture needs such spreading and alignment, +then your architecture should also benefit from the +<tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set +to the number of CPUs in a socket, NUMA node, or whatever. +If the number of CPUs is too large, use a fraction of the number of +CPUs. +If the number of CPUs is a large prime number, well, that certainly +is an “interesting” architectural choice! +More flexible arrangements might be considered, but only if +<tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only +if the inadequacy has been demonstrated by a carefully run and +realistic system-level workload. + +<p> +Please note that arrangements that require RCU to remap CPU numbers will +require extremely good demonstration of need and full exploration of +alternatives. + +<p> +There is an embarrassingly large number of flavors of RCU, and this +number has been increasing over time. +Perhaps it will be possible to combine some at some future date. + +<p> +RCU's various kthreads are reasonably recent additions. +It is quite likely that adjustments will be required to more gracefully +handle extreme loads. +It might also be necessary to be able to relate CPU utilization by +RCU's kthreads and softirq handlers to the code that instigated this +CPU utilization. +For example, RCU callback overhead might be charged back to the +originating <tt>call_rcu()</tt> instance, though probably not +in production kernels. + +<h2><a name="Summary">Summary</a></h2> + +<p> +This document has presented more than two decade's worth of RCU +requirements. +Given that the requirements keep changing, this will not be the last +word on this subject, but at least it serves to get an important +subset of the requirements set forth. + +<h2><a name="Acknowledgments">Acknowledgments</a></h2> + +I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, +Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and +Andy Lutomirski for their help in rendering +this article human readable, and to Michelle Rankin for her support +of this effort. +Other contributions are acknowledged in the Linux kernel's git archive. + +</body></html> |