summaryrefslogtreecommitdiffstats
path: root/Documentation/RCU/Design/Requirements/Requirements.html
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/RCU/Design/Requirements/Requirements.html')
-rw-r--r--Documentation/RCU/Design/Requirements/Requirements.html3324
1 files changed, 3324 insertions, 0 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
new file mode 100644
index 000000000..49690228b
--- /dev/null
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -0,0 +1,3324 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
+ "http://www.w3.org/TR/html4/loose.dtd">
+ <html>
+ <head><title>A Tour Through RCU's Requirements [LWN.net]</title>
+ <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
+
+<h1>A Tour Through RCU's Requirements</h1>
+
+<p>Copyright IBM Corporation, 2015</p>
+<p>Author: Paul E.&nbsp;McKenney</p>
+<p><i>The initial version of this document appeared in the
+<a href="https://lwn.net/">LWN</a> articles
+<a href="https://lwn.net/Articles/652156/">here</a>,
+<a href="https://lwn.net/Articles/652677/">here</a>, and
+<a href="https://lwn.net/Articles/653326/">here</a>.</i></p>
+
+<h2>Introduction</h2>
+
+<p>
+Read-copy update (RCU) is a synchronization mechanism that is often
+used as a replacement for reader-writer locking.
+RCU is unusual in that updaters do not block readers,
+which means that RCU's read-side primitives can be exceedingly fast
+and scalable.
+In addition, updaters can make useful forward progress concurrently
+with readers.
+However, all this concurrency between RCU readers and updaters does raise
+the question of exactly what RCU readers are doing, which in turn
+raises the question of exactly what RCU's requirements are.
+
+<p>
+This document therefore summarizes RCU's requirements, and can be thought
+of as an informal, high-level specification for RCU.
+It is important to understand that RCU's specification is primarily
+empirical in nature;
+in fact, I learned about many of these requirements the hard way.
+This situation might cause some consternation, however, not only
+has this learning process been a lot of fun, but it has also been
+a great privilege to work with so many people willing to apply
+technologies in interesting new ways.
+
+<p>
+All that aside, here are the categories of currently known RCU requirements:
+</p>
+
+<ol>
+<li> <a href="#Fundamental Requirements">
+ Fundamental Requirements</a>
+<li> <a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a>
+<li> <a href="#Parallelism Facts of Life">
+ Parallelism Facts of Life</a>
+<li> <a href="#Quality-of-Implementation Requirements">
+ Quality-of-Implementation Requirements</a>
+<li> <a href="#Linux Kernel Complications">
+ Linux Kernel Complications</a>
+<li> <a href="#Software-Engineering Requirements">
+ Software-Engineering Requirements</a>
+<li> <a href="#Other RCU Flavors">
+ Other RCU Flavors</a>
+<li> <a href="#Possible Future Changes">
+ Possible Future Changes</a>
+</ol>
+
+<p>
+This is followed by a <a href="#Summary">summary</a>,
+however, the answers to each quick quiz immediately follows the quiz.
+Select the big white space with your mouse to see the answer.
+
+<h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2>
+
+<p>
+RCU's fundamental requirements are the closest thing RCU has to hard
+mathematical requirements.
+These are:
+
+<ol>
+<li> <a href="#Grace-Period Guarantee">
+ Grace-Period Guarantee</a>
+<li> <a href="#Publish-Subscribe Guarantee">
+ Publish-Subscribe Guarantee</a>
+<li> <a href="#Memory-Barrier Guarantees">
+ Memory-Barrier Guarantees</a>
+<li> <a href="#RCU Primitives Guaranteed to Execute Unconditionally">
+ RCU Primitives Guaranteed to Execute Unconditionally</a>
+<li> <a href="#Guaranteed Read-to-Write Upgrade">
+ Guaranteed Read-to-Write Upgrade</a>
+</ol>
+
+<h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3>
+
+<p>
+RCU's grace-period guarantee is unusual in being premeditated:
+Jack Slingwine and I had this guarantee firmly in mind when we started
+work on RCU (then called &ldquo;rclock&rdquo;) in the early 1990s.
+That said, the past two decades of experience with RCU have produced
+a much more detailed understanding of this guarantee.
+
+<p>
+RCU's grace-period guarantee allows updaters to wait for the completion
+of all pre-existing RCU read-side critical sections.
+An RCU read-side critical section
+begins with the marker <tt>rcu_read_lock()</tt> and ends with
+the marker <tt>rcu_read_unlock()</tt>.
+These markers may be nested, and RCU treats a nested set as one
+big RCU read-side critical section.
+Production-quality implementations of <tt>rcu_read_lock()</tt> and
+<tt>rcu_read_unlock()</tt> are extremely lightweight, and in
+fact have exactly zero overhead in Linux kernels built for production
+use with <tt>CONFIG_PREEMPT=n</tt>.
+
+<p>
+This guarantee allows ordering to be enforced with extremely low
+overhead to readers, for example:
+
+<blockquote>
+<pre>
+ 1 int x, y;
+ 2
+ 3 void thread0(void)
+ 4 {
+ 5 rcu_read_lock();
+ 6 r1 = READ_ONCE(x);
+ 7 r2 = READ_ONCE(y);
+ 8 rcu_read_unlock();
+ 9 }
+10
+11 void thread1(void)
+12 {
+13 WRITE_ONCE(x, 1);
+14 synchronize_rcu();
+15 WRITE_ONCE(y, 1);
+16 }
+</pre>
+</blockquote>
+
+<p>
+Because the <tt>synchronize_rcu()</tt> on line&nbsp;14 waits for
+all pre-existing readers, any instance of <tt>thread0()</tt> that
+loads a value of zero from <tt>x</tt> must complete before
+<tt>thread1()</tt> stores to <tt>y</tt>, so that instance must
+also load a value of zero from <tt>y</tt>.
+Similarly, any instance of <tt>thread0()</tt> that loads a value of
+one from <tt>y</tt> must have started after the
+<tt>synchronize_rcu()</tt> started, and must therefore also load
+a value of one from <tt>x</tt>.
+Therefore, the outcome:
+<blockquote>
+<pre>
+(r1 == 0 &amp;&amp; r2 == 1)
+</pre>
+</blockquote>
+cannot happen.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Wait a minute!
+ You said that updaters can make useful forward progress concurrently
+ with readers, but pre-existing readers will block
+ <tt>synchronize_rcu()</tt>!!!
+ Just who are you trying to fool???
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ First, if updaters do not wish to be blocked by readers, they can use
+ <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will
+ be discussed later.
+ Second, even when using <tt>synchronize_rcu()</tt>, the other
+ update-side code does run concurrently with readers, whether
+ pre-existing or not.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+This scenario resembles one of the first uses of RCU in
+<a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>,
+which managed a distributed lock manager's transition into
+a state suitable for handling recovery from node failure,
+more or less as follows:
+
+<blockquote>
+<pre>
+ 1 #define STATE_NORMAL 0
+ 2 #define STATE_WANT_RECOVERY 1
+ 3 #define STATE_RECOVERING 2
+ 4 #define STATE_WANT_NORMAL 3
+ 5
+ 6 int state = STATE_NORMAL;
+ 7
+ 8 void do_something_dlm(void)
+ 9 {
+10 int state_snap;
+11
+12 rcu_read_lock();
+13 state_snap = READ_ONCE(state);
+14 if (state_snap == STATE_NORMAL)
+15 do_something();
+16 else
+17 do_something_carefully();
+18 rcu_read_unlock();
+19 }
+20
+21 void start_recovery(void)
+22 {
+23 WRITE_ONCE(state, STATE_WANT_RECOVERY);
+24 synchronize_rcu();
+25 WRITE_ONCE(state, STATE_RECOVERING);
+26 recovery();
+27 WRITE_ONCE(state, STATE_WANT_NORMAL);
+28 synchronize_rcu();
+29 WRITE_ONCE(state, STATE_NORMAL);
+30 }
+</pre>
+</blockquote>
+
+<p>
+The RCU read-side critical section in <tt>do_something_dlm()</tt>
+works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt>
+to guarantee that <tt>do_something()</tt> never runs concurrently
+with <tt>recovery()</tt>, but with little or no synchronization
+overhead in <tt>do_something_dlm()</tt>.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Why is the <tt>synchronize_rcu()</tt> on line&nbsp;28 needed?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ Without that extra grace period, memory reordering could result in
+ <tt>do_something_dlm()</tt> executing <tt>do_something()</tt>
+ concurrently with the last bits of <tt>recovery()</tt>.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+In order to avoid fatal problems such as deadlocks,
+an RCU read-side critical section must not contain calls to
+<tt>synchronize_rcu()</tt>.
+Similarly, an RCU read-side critical section must not
+contain anything that waits, directly or indirectly, on completion of
+an invocation of <tt>synchronize_rcu()</tt>.
+
+<p>
+Although RCU's grace-period guarantee is useful in and of itself, with
+<a href="https://lwn.net/Articles/573497/">quite a few use cases</a>,
+it would be good to be able to use RCU to coordinate read-side
+access to linked data structures.
+For this, the grace-period guarantee is not sufficient, as can
+be seen in function <tt>add_gp_buggy()</tt> below.
+We will look at the reader's code later, but in the meantime, just think of
+the reader as locklessly picking up the <tt>gp</tt> pointer,
+and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the
+<tt>-&gt;a</tt> and <tt>-&gt;b</tt> fields.
+
+<blockquote>
+<pre>
+ 1 bool add_gp_buggy(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+ 5 return -ENOMEM;
+ 6 spin_lock(&amp;gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&amp;gp_lock);
+ 9 return false;
+10 }
+11 p-&gt;a = a;
+12 p-&gt;b = a;
+13 gp = p; /* ORDERING BUG */
+14 spin_unlock(&amp;gp_lock);
+15 return true;
+16 }
+</pre>
+</blockquote>
+
+<p>
+The problem is that both the compiler and weakly ordered CPUs are within
+their rights to reorder this code as follows:
+
+<blockquote>
+<pre>
+ 1 bool add_gp_buggy_optimized(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+ 5 return -ENOMEM;
+ 6 spin_lock(&amp;gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&amp;gp_lock);
+ 9 return false;
+10 }
+<b>11 gp = p; /* ORDERING BUG */
+12 p-&gt;a = a;
+13 p-&gt;b = a;</b>
+14 spin_unlock(&amp;gp_lock);
+15 return true;
+16 }
+</pre>
+</blockquote>
+
+<p>
+If an RCU reader fetches <tt>gp</tt> just after
+<tt>add_gp_buggy_optimized</tt> executes line&nbsp;11,
+it will see garbage in the <tt>-&gt;a</tt> and <tt>-&gt;b</tt>
+fields.
+And this is but one of many ways in which compiler and hardware optimizations
+could cause trouble.
+Therefore, we clearly need some way to prevent the compiler and the CPU from
+reordering in this manner, which brings us to the publish-subscribe
+guarantee discussed in the next section.
+
+<h3><a name="Publish-Subscribe Guarantee">Publish/Subscribe Guarantee</a></h3>
+
+<p>
+RCU's publish-subscribe guarantee allows data to be inserted
+into a linked data structure without disrupting RCU readers.
+The updater uses <tt>rcu_assign_pointer()</tt> to insert the
+new data, and readers use <tt>rcu_dereference()</tt> to
+access data, whether new or old.
+The following shows an example of insertion:
+
+<blockquote>
+<pre>
+ 1 bool add_gp(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+ 5 return -ENOMEM;
+ 6 spin_lock(&amp;gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&amp;gp_lock);
+ 9 return false;
+10 }
+11 p-&gt;a = a;
+12 p-&gt;b = a;
+13 rcu_assign_pointer(gp, p);
+14 spin_unlock(&amp;gp_lock);
+15 return true;
+16 }
+</pre>
+</blockquote>
+
+<p>
+The <tt>rcu_assign_pointer()</tt> on line&nbsp;13 is conceptually
+equivalent to a simple assignment statement, but also guarantees
+that its assignment will
+happen after the two assignments in lines&nbsp;11 and&nbsp;12,
+similar to the C11 <tt>memory_order_release</tt> store operation.
+It also prevents any number of &ldquo;interesting&rdquo; compiler
+optimizations, for example, the use of <tt>gp</tt> as a scratch
+location immediately preceding the assignment.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ But <tt>rcu_assign_pointer()</tt> does nothing to prevent the
+ two assignments to <tt>p-&gt;a</tt> and <tt>p-&gt;b</tt>
+ from being reordered.
+ Can't that also cause problems?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ No, it cannot.
+ The readers cannot see either of these two fields until
+ the assignment to <tt>gp</tt>, by which time both fields are
+ fully initialized.
+ So reordering the assignments
+ to <tt>p-&gt;a</tt> and <tt>p-&gt;b</tt> cannot possibly
+ cause any problems.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+It is tempting to assume that the reader need not do anything special
+to control its accesses to the RCU-protected data,
+as shown in <tt>do_something_gp_buggy()</tt> below:
+
+<blockquote>
+<pre>
+ 1 bool do_something_gp_buggy(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 p = gp; /* OPTIMIZATIONS GALORE!!! */
+ 5 if (p) {
+ 6 do_something(p-&gt;a, p-&gt;b);
+ 7 rcu_read_unlock();
+ 8 return true;
+ 9 }
+10 rcu_read_unlock();
+11 return false;
+12 }
+</pre>
+</blockquote>
+
+<p>
+However, this temptation must be resisted because there are a
+surprisingly large number of ways that the compiler
+(to say nothing of
+<a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>)
+can trip this code up.
+For but one example, if the compiler were short of registers, it
+might choose to refetch from <tt>gp</tt> rather than keeping
+a separate copy in <tt>p</tt> as follows:
+
+<blockquote>
+<pre>
+ 1 bool do_something_gp_buggy_optimized(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */
+<b> 5 do_something(gp-&gt;a, gp-&gt;b);</b>
+ 6 rcu_read_unlock();
+ 7 return true;
+ 8 }
+ 9 rcu_read_unlock();
+10 return false;
+11 }
+</pre>
+</blockquote>
+
+<p>
+If this function ran concurrently with a series of updates that
+replaced the current structure with a new one,
+the fetches of <tt>gp-&gt;a</tt>
+and <tt>gp-&gt;b</tt> might well come from two different structures,
+which could cause serious confusion.
+To prevent this (and much else besides), <tt>do_something_gp()</tt> uses
+<tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>:
+
+<blockquote>
+<pre>
+ 1 bool do_something_gp(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 p = rcu_dereference(gp);
+ 5 if (p) {
+ 6 do_something(p-&gt;a, p-&gt;b);
+ 7 rcu_read_unlock();
+ 8 return true;
+ 9 }
+10 rcu_read_unlock();
+11 return false;
+12 }
+</pre>
+</blockquote>
+
+<p>
+The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha)
+memory barriers in the Linux kernel.
+Should a
+<a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a>
+ever appear, then <tt>rcu_dereference()</tt> could be implemented
+as a <tt>memory_order_consume</tt> load.
+Regardless of the exact implementation, a pointer fetched by
+<tt>rcu_dereference()</tt> may not be used outside of the
+outermost RCU read-side critical section containing that
+<tt>rcu_dereference()</tt>, unless protection of
+the corresponding data element has been passed from RCU to some
+other synchronization mechanism, most commonly locking or
+<a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>.
+
+<p>
+In short, updaters use <tt>rcu_assign_pointer()</tt> and readers
+use <tt>rcu_dereference()</tt>, and these two RCU API elements
+work together to ensure that readers have a consistent view of
+newly added data elements.
+
+<p>
+Of course, it is also necessary to remove elements from RCU-protected
+data structures, for example, using the following process:
+
+<ol>
+<li> Remove the data element from the enclosing structure.
+<li> Wait for all pre-existing RCU read-side critical sections
+ to complete (because only pre-existing readers can possibly have
+ a reference to the newly removed data element).
+<li> At this point, only the updater has a reference to the
+ newly removed data element, so it can safely reclaim
+ the data element, for example, by passing it to <tt>kfree()</tt>.
+</ol>
+
+This process is implemented by <tt>remove_gp_synchronous()</tt>:
+
+<blockquote>
+<pre>
+ 1 bool remove_gp_synchronous(void)
+ 2 {
+ 3 struct foo *p;
+ 4
+ 5 spin_lock(&amp;gp_lock);
+ 6 p = rcu_access_pointer(gp);
+ 7 if (!p) {
+ 8 spin_unlock(&amp;gp_lock);
+ 9 return false;
+10 }
+11 rcu_assign_pointer(gp, NULL);
+12 spin_unlock(&amp;gp_lock);
+13 synchronize_rcu();
+14 kfree(p);
+15 return true;
+16 }
+</pre>
+</blockquote>
+
+<p>
+This function is straightforward, with line&nbsp;13 waiting for a grace
+period before line&nbsp;14 frees the old data element.
+This waiting ensures that readers will reach line&nbsp;7 of
+<tt>do_something_gp()</tt> before the data element referenced by
+<tt>p</tt> is freed.
+The <tt>rcu_access_pointer()</tt> on line&nbsp;6 is similar to
+<tt>rcu_dereference()</tt>, except that:
+
+<ol>
+<li> The value returned by <tt>rcu_access_pointer()</tt>
+ cannot be dereferenced.
+ If you want to access the value pointed to as well as
+ the pointer itself, use <tt>rcu_dereference()</tt>
+ instead of <tt>rcu_access_pointer()</tt>.
+<li> The call to <tt>rcu_access_pointer()</tt> need not be
+ protected.
+ In contrast, <tt>rcu_dereference()</tt> must either be
+ within an RCU read-side critical section or in a code
+ segment where the pointer cannot change, for example, in
+ code protected by the corresponding update-side lock.
+</ol>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Without the <tt>rcu_dereference()</tt> or the
+ <tt>rcu_access_pointer()</tt>, what destructive optimizations
+ might the compiler make use of?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ Let's start with what happens to <tt>do_something_gp()</tt>
+ if it fails to use <tt>rcu_dereference()</tt>.
+ It could reuse a value formerly fetched from this same pointer.
+ It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time
+ manner, resulting in <i>load tearing</i>, in turn resulting a bytewise
+ mash-up of two distinct pointer values.
+ It might even use value-speculation optimizations, where it makes
+ a wrong guess, but by the time it gets around to checking the
+ value, an update has changed the pointer to match the wrong guess.
+ Too bad about any dereferences that returned pre-initialization garbage
+ in the meantime!
+ </font>
+
+ <p><font color="ffffff">
+ For <tt>remove_gp_synchronous()</tt>, as long as all modifications
+ to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>,
+ the above optimizations are harmless.
+ However, <tt>sparse</tt> will complain if you
+ define <tt>gp</tt> with <tt>__rcu</tt> and then
+ access it without using
+ either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+In short, RCU's publish-subscribe guarantee is provided by the combination
+of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>.
+This guarantee allows data elements to be safely added to RCU-protected
+linked data structures without disrupting RCU readers.
+This guarantee can be used in combination with the grace-period
+guarantee to also allow data elements to be removed from RCU-protected
+linked data structures, again without disrupting RCU readers.
+
+<p>
+This guarantee was only partially premeditated.
+DYNIX/ptx used an explicit memory barrier for publication, but had nothing
+resembling <tt>rcu_dereference()</tt> for subscription, nor did it
+have anything resembling the <tt>smp_read_barrier_depends()</tt>
+that was later subsumed into <tt>rcu_dereference()</tt> and later
+still into <tt>READ_ONCE()</tt>.
+The need for these operations made itself known quite suddenly at a
+late-1990s meeting with the DEC Alpha architects, back in the days when
+DEC was still a free-standing company.
+It took the Alpha architects a good hour to convince me that any sort
+of barrier would ever be needed, and it then took me a good <i>two</i> hours
+to convince them that their documentation did not make this point clear.
+More recent work with the C and C++ standards committees have provided
+much education on tricks and traps from the compiler.
+In short, compilers were much less tricky in the early 1990s, but in
+2015, don't even think about omitting <tt>rcu_dereference()</tt>!
+
+<h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3>
+
+<p>
+The previous section's simple linked-data-structure scenario clearly
+demonstrates the need for RCU's stringent memory-ordering guarantees on
+systems with more than one CPU:
+
+<ol>
+<li> Each CPU that has an RCU read-side critical section that
+ begins before <tt>synchronize_rcu()</tt> starts is
+ guaranteed to execute a full memory barrier between the time
+ that the RCU read-side critical section ends and the time that
+ <tt>synchronize_rcu()</tt> returns.
+ Without this guarantee, a pre-existing RCU read-side critical section
+ might hold a reference to the newly removed <tt>struct foo</tt>
+ after the <tt>kfree()</tt> on line&nbsp;14 of
+ <tt>remove_gp_synchronous()</tt>.
+<li> Each CPU that has an RCU read-side critical section that ends
+ after <tt>synchronize_rcu()</tt> returns is guaranteed
+ to execute a full memory barrier between the time that
+ <tt>synchronize_rcu()</tt> begins and the time that the RCU
+ read-side critical section begins.
+ Without this guarantee, a later RCU read-side critical section
+ running after the <tt>kfree()</tt> on line&nbsp;14 of
+ <tt>remove_gp_synchronous()</tt> might
+ later run <tt>do_something_gp()</tt> and find the
+ newly deleted <tt>struct foo</tt>.
+<li> If the task invoking <tt>synchronize_rcu()</tt> remains
+ on a given CPU, then that CPU is guaranteed to execute a full
+ memory barrier sometime during the execution of
+ <tt>synchronize_rcu()</tt>.
+ This guarantee ensures that the <tt>kfree()</tt> on
+ line&nbsp;14 of <tt>remove_gp_synchronous()</tt> really does
+ execute after the removal on line&nbsp;11.
+<li> If the task invoking <tt>synchronize_rcu()</tt> migrates
+ among a group of CPUs during that invocation, then each of the
+ CPUs in that group is guaranteed to execute a full memory barrier
+ sometime during the execution of <tt>synchronize_rcu()</tt>.
+ This guarantee also ensures that the <tt>kfree()</tt> on
+ line&nbsp;14 of <tt>remove_gp_synchronous()</tt> really does
+ execute after the removal on
+ line&nbsp;11, but also in the case where the thread executing the
+ <tt>synchronize_rcu()</tt> migrates in the meantime.
+</ol>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Given that multiple CPUs can start RCU read-side critical sections
+ at any time without any ordering whatsoever, how can RCU possibly
+ tell whether or not a given RCU read-side critical section starts
+ before a given instance of <tt>synchronize_rcu()</tt>?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ If RCU cannot tell whether or not a given
+ RCU read-side critical section starts before a
+ given instance of <tt>synchronize_rcu()</tt>,
+ then it must assume that the RCU read-side critical section
+ started first.
+ In other words, a given instance of <tt>synchronize_rcu()</tt>
+ can avoid waiting on a given RCU read-side critical section only
+ if it can prove that <tt>synchronize_rcu()</tt> started first.
+ </font>
+
+ <p><font color="ffffff">
+ A related question is &ldquo;When <tt>rcu_read_lock()</tt>
+ doesn't generate any code, why does it matter how it relates
+ to a grace period?&rdquo;
+ The answer is that it is not the relationship of
+ <tt>rcu_read_lock()</tt> itself that is important, but rather
+ the relationship of the code within the enclosed RCU read-side
+ critical section to the code preceding and following the
+ grace period.
+ If we take this viewpoint, then a given RCU read-side critical
+ section begins before a given grace period when some access
+ preceding the grace period observes the effect of some access
+ within the critical section, in which case none of the accesses
+ within the critical section may observe the effects of any
+ access following the grace period.
+ </font>
+
+ <p><font color="ffffff">
+ As of late 2016, mathematical models of RCU take this
+ viewpoint, for example, see slides&nbsp;62 and&nbsp;63
+ of the
+ <a href="http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.2016.10.04c.LCE.pdf">2016 LinuxCon EU</a>
+ presentation.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ The first and second guarantees require unbelievably strict ordering!
+ Are all these memory barriers <i> really</i> required?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ Yes, they really are required.
+ To see why the first guarantee is required, consider the following
+ sequence of events:
+ </font>
+
+ <ol>
+ <li> <font color="ffffff">
+ CPU 1: <tt>rcu_read_lock()</tt>
+ </font>
+ <li> <font color="ffffff">
+ CPU 1: <tt>q = rcu_dereference(gp);
+ /* Very likely to return p. */</tt>
+ </font>
+ <li> <font color="ffffff">
+ CPU 0: <tt>list_del_rcu(p);</tt>
+ </font>
+ <li> <font color="ffffff">
+ CPU 0: <tt>synchronize_rcu()</tt> starts.
+ </font>
+ <li> <font color="ffffff">
+ CPU 1: <tt>do_something_with(q-&gt;a);
+ /* No smp_mb(), so might happen after kfree(). */</tt>
+ </font>
+ <li> <font color="ffffff">
+ CPU 1: <tt>rcu_read_unlock()</tt>
+ </font>
+ <li> <font color="ffffff">
+ CPU 0: <tt>synchronize_rcu()</tt> returns.
+ </font>
+ <li> <font color="ffffff">
+ CPU 0: <tt>kfree(p);</tt>
+ </font>
+ </ol>
+
+ <p><font color="ffffff">
+ Therefore, there absolutely must be a full memory barrier between the
+ end of the RCU read-side critical section and the end of the
+ grace period.
+ </font>
+
+ <p><font color="ffffff">
+ The sequence of events demonstrating the necessity of the second rule
+ is roughly similar:
+ </font>
+
+ <ol>
+ <li> <font color="ffffff">CPU 0: <tt>list_del_rcu(p);</tt>
+ </font>
+ <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> starts.
+ </font>
+ <li> <font color="ffffff">CPU 1: <tt>rcu_read_lock()</tt>
+ </font>
+ <li> <font color="ffffff">CPU 1: <tt>q = rcu_dereference(gp);
+ /* Might return p if no memory barrier. */</tt>
+ </font>
+ <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> returns.
+ </font>
+ <li> <font color="ffffff">CPU 0: <tt>kfree(p);</tt>
+ </font>
+ <li> <font color="ffffff">
+ CPU 1: <tt>do_something_with(q-&gt;a); /* Boom!!! */</tt>
+ </font>
+ <li> <font color="ffffff">CPU 1: <tt>rcu_read_unlock()</tt>
+ </font>
+ </ol>
+
+ <p><font color="ffffff">
+ And similarly, without a memory barrier between the beginning of the
+ grace period and the beginning of the RCU read-side critical section,
+ CPU&nbsp;1 might end up accessing the freelist.
+ </font>
+
+ <p><font color="ffffff">
+ The &ldquo;as if&rdquo; rule of course applies, so that any
+ implementation that acts as if the appropriate memory barriers
+ were in place is a correct implementation.
+ That said, it is much easier to fool yourself into believing
+ that you have adhered to the as-if rule than it is to actually
+ adhere to it!
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ You claim that <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>
+ generate absolutely no code in some kernel builds.
+ This means that the compiler might arbitrarily rearrange consecutive
+ RCU read-side critical sections.
+ Given such rearrangement, if a given RCU read-side critical section
+ is done, how can you be sure that all prior RCU read-side critical
+ sections are done?
+ Won't the compiler rearrangements make that impossible to determine?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ In cases where <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>
+ generate absolutely no code, RCU infers quiescent states only at
+ special locations, for example, within the scheduler.
+ Because calls to <tt>schedule()</tt> had better prevent calling-code
+ accesses to shared variables from being rearranged across the call to
+ <tt>schedule()</tt>, if RCU detects the end of a given RCU read-side
+ critical section, it will necessarily detect the end of all prior
+ RCU read-side critical sections, no matter how aggressively the
+ compiler scrambles the code.
+ </font>
+
+ <p><font color="ffffff">
+ Again, this all assumes that the compiler cannot scramble code across
+ calls to the scheduler, out of interrupt handlers, into the idle loop,
+ into user-mode code, and so on.
+ But if your kernel build allows that sort of scrambling, you have broken
+ far more than just RCU!
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+Note that these memory-barrier requirements do not replace the fundamental
+RCU requirement that a grace period wait for all pre-existing readers.
+On the contrary, the memory barriers called out in this section must operate in
+such a way as to <i>enforce</i> this fundamental requirement.
+Of course, different implementations enforce this requirement in different
+ways, but enforce it they must.
+
+<h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3>
+
+<p>
+The common-case RCU primitives are unconditional.
+They are invoked, they do their job, and they return, with no possibility
+of error, and no need to retry.
+This is a key RCU design philosophy.
+
+<p>
+However, this philosophy is pragmatic rather than pigheaded.
+If someone comes up with a good justification for a particular conditional
+RCU primitive, it might well be implemented and added.
+After all, this guarantee was reverse-engineered, not premeditated.
+The unconditional nature of the RCU primitives was initially an
+accident of implementation, and later experience with synchronization
+primitives with conditional primitives caused me to elevate this
+accident to a guarantee.
+Therefore, the justification for adding a conditional primitive to
+RCU would need to be based on detailed and compelling use cases.
+
+<h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3>
+
+<p>
+As far as RCU is concerned, it is always possible to carry out an
+update within an RCU read-side critical section.
+For example, that RCU read-side critical section might search for
+a given data element, and then might acquire the update-side
+spinlock in order to update that element, all while remaining
+in that RCU read-side critical section.
+Of course, it is necessary to exit the RCU read-side critical section
+before invoking <tt>synchronize_rcu()</tt>, however, this
+inconvenience can be avoided through use of the
+<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members
+described later in this document.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ But how does the upgrade-to-write operation exclude other readers?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ It doesn't, just like normal RCU updates, which also do not exclude
+ RCU readers.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+This guarantee allows lookup code to be shared between read-side
+and update-side code, and was premeditated, appearing in the earliest
+DYNIX/ptx RCU documentation.
+
+<h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2>
+
+<p>
+RCU provides extremely lightweight readers, and its read-side guarantees,
+though quite useful, are correspondingly lightweight.
+It is therefore all too easy to assume that RCU is guaranteeing more
+than it really is.
+Of course, the list of things that RCU does not guarantee is infinitely
+long, however, the following sections list a few non-guarantees that
+have caused confusion.
+Except where otherwise noted, these non-guarantees were premeditated.
+
+<ol>
+<li> <a href="#Readers Impose Minimal Ordering">
+ Readers Impose Minimal Ordering</a>
+<li> <a href="#Readers Do Not Exclude Updaters">
+ Readers Do Not Exclude Updaters</a>
+<li> <a href="#Updaters Only Wait For Old Readers">
+ Updaters Only Wait For Old Readers</a>
+<li> <a href="#Grace Periods Don't Partition Read-Side Critical Sections">
+ Grace Periods Don't Partition Read-Side Critical Sections</a>
+<li> <a href="#Read-Side Critical Sections Don't Partition Grace Periods">
+ Read-Side Critical Sections Don't Partition Grace Periods</a>
+<li> <a href="#Disabling Preemption Does Not Block Grace Periods">
+ Disabling Preemption Does Not Block Grace Periods</a>
+</ol>
+
+<h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3>
+
+<p>
+Reader-side markers such as <tt>rcu_read_lock()</tt> and
+<tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees
+except through their interaction with the grace-period APIs such as
+<tt>synchronize_rcu()</tt>.
+To see this, consider the following pair of threads:
+
+<blockquote>
+<pre>
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(x, 1);
+ 5 rcu_read_unlock();
+ 6 rcu_read_lock();
+ 7 WRITE_ONCE(y, 1);
+ 8 rcu_read_unlock();
+ 9 }
+10
+11 void thread1(void)
+12 {
+13 rcu_read_lock();
+14 r1 = READ_ONCE(y);
+15 rcu_read_unlock();
+16 rcu_read_lock();
+17 r2 = READ_ONCE(x);
+18 rcu_read_unlock();
+19 }
+</pre>
+</blockquote>
+
+<p>
+After <tt>thread0()</tt> and <tt>thread1()</tt> execute
+concurrently, it is quite possible to have
+
+<blockquote>
+<pre>
+(r1 == 1 &amp;&amp; r2 == 0)
+</pre>
+</blockquote>
+
+(that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>),
+which would not be possible if <tt>rcu_read_lock()</tt> and
+<tt>rcu_read_unlock()</tt> had much in the way of ordering
+properties.
+But they do not, so the CPU is within its rights
+to do significant reordering.
+This is by design: Any significant ordering constraints would slow down
+these fast-path APIs.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Can't the compiler also reorder this code?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ No, the volatile casts in <tt>READ_ONCE()</tt> and
+ <tt>WRITE_ONCE()</tt> prevent the compiler from reordering in
+ this particular case.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3>
+
+<p>
+Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt>
+exclude updates.
+All they do is to prevent grace periods from ending.
+The following example illustrates this:
+
+<blockquote>
+<pre>
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 r1 = READ_ONCE(y);
+ 5 if (r1) {
+ 6 do_something_with_nonzero_x();
+ 7 r2 = READ_ONCE(x);
+ 8 WARN_ON(!r2); /* BUG!!! */
+ 9 }
+10 rcu_read_unlock();
+11 }
+12
+13 void thread1(void)
+14 {
+15 spin_lock(&amp;my_lock);
+16 WRITE_ONCE(x, 1);
+17 WRITE_ONCE(y, 1);
+18 spin_unlock(&amp;my_lock);
+19 }
+</pre>
+</blockquote>
+
+<p>
+If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt>
+excluded the <tt>thread1()</tt> function's update,
+the <tt>WARN_ON()</tt> could never fire.
+But the fact is that <tt>rcu_read_lock()</tt> does not exclude
+much of anything aside from subsequent grace periods, of which
+<tt>thread1()</tt> has none, so the
+<tt>WARN_ON()</tt> can and does fire.
+
+<h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3>
+
+<p>
+It might be tempting to assume that after <tt>synchronize_rcu()</tt>
+completes, there are no readers executing.
+This temptation must be avoided because
+new readers can start immediately after <tt>synchronize_rcu()</tt>
+starts, and <tt>synchronize_rcu()</tt> is under no
+obligation to wait for these new readers.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Suppose that synchronize_rcu() did wait until <i>all</i>
+ readers had completed instead of waiting only on
+ pre-existing readers.
+ For how long would the updater be able to rely on there
+ being no readers?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ For no time at all.
+ Even if <tt>synchronize_rcu()</tt> were to wait until
+ all readers had completed, a new reader might start immediately after
+ <tt>synchronize_rcu()</tt> completed.
+ Therefore, the code following
+ <tt>synchronize_rcu()</tt> can <i>never</i> rely on there being
+ no readers.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<h3><a name="Grace Periods Don't Partition Read-Side Critical Sections">
+Grace Periods Don't Partition Read-Side Critical Sections</a></h3>
+
+<p>
+It is tempting to assume that if any part of one RCU read-side critical
+section precedes a given grace period, and if any part of another RCU
+read-side critical section follows that same grace period, then all of
+the first RCU read-side critical section must precede all of the second.
+However, this just isn't the case: A single grace period does not
+partition the set of RCU read-side critical sections.
+An example of this situation can be illustrated as follows, where
+<tt>x</tt>, <tt>y</tt>, and <tt>z</tt> are initially all zero:
+
+<blockquote>
+<pre>
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(a, 1);
+ 5 WRITE_ONCE(b, 1);
+ 6 rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+10 {
+11 r1 = READ_ONCE(a);
+12 synchronize_rcu();
+13 WRITE_ONCE(c, 1);
+14 }
+15
+16 void thread2(void)
+17 {
+18 rcu_read_lock();
+19 r2 = READ_ONCE(b);
+20 r3 = READ_ONCE(c);
+21 rcu_read_unlock();
+22 }
+</pre>
+</blockquote>
+
+<p>
+It turns out that the outcome:
+
+<blockquote>
+<pre>
+(r1 == 1 &amp;&amp; r2 == 0 &amp;&amp; r3 == 1)
+</pre>
+</blockquote>
+
+is entirely possible.
+The following figure show how this can happen, with each circled
+<tt>QS</tt> indicating the point at which RCU recorded a
+<i>quiescent state</i> for each thread, that is, a state in which
+RCU knows that the thread cannot be in the midst of an RCU read-side
+critical section that started before the current grace period:
+
+<p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p>
+
+<p>
+If it is necessary to partition RCU read-side critical sections in this
+manner, it is necessary to use two grace periods, where the first
+grace period is known to end before the second grace period starts:
+
+<blockquote>
+<pre>
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(a, 1);
+ 5 WRITE_ONCE(b, 1);
+ 6 rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+10 {
+11 r1 = READ_ONCE(a);
+12 synchronize_rcu();
+13 WRITE_ONCE(c, 1);
+14 }
+15
+16 void thread2(void)
+17 {
+18 r2 = READ_ONCE(c);
+19 synchronize_rcu();
+20 WRITE_ONCE(d, 1);
+21 }
+22
+23 void thread3(void)
+24 {
+25 rcu_read_lock();
+26 r3 = READ_ONCE(b);
+27 r4 = READ_ONCE(d);
+28 rcu_read_unlock();
+29 }
+</pre>
+</blockquote>
+
+<p>
+Here, if <tt>(r1 == 1)</tt>, then
+<tt>thread0()</tt>'s write to <tt>b</tt> must happen
+before the end of <tt>thread1()</tt>'s grace period.
+If in addition <tt>(r4 == 1)</tt>, then
+<tt>thread3()</tt>'s read from <tt>b</tt> must happen
+after the beginning of <tt>thread2()</tt>'s grace period.
+If it is also the case that <tt>(r2 == 1)</tt>, then the
+end of <tt>thread1()</tt>'s grace period must precede the
+beginning of <tt>thread2()</tt>'s grace period.
+This mean that the two RCU read-side critical sections cannot overlap,
+guaranteeing that <tt>(r3 == 1)</tt>.
+As a result, the outcome:
+
+<blockquote>
+<pre>
+(r1 == 1 &amp;&amp; r2 == 1 &amp;&amp; r3 == 0 &amp;&amp; r4 == 1)
+</pre>
+</blockquote>
+
+cannot happen.
+
+<p>
+This non-requirement was also non-premeditated, but became apparent
+when studying RCU's interaction with memory ordering.
+
+<h3><a name="Read-Side Critical Sections Don't Partition Grace Periods">
+Read-Side Critical Sections Don't Partition Grace Periods</a></h3>
+
+<p>
+It is also tempting to assume that if an RCU read-side critical section
+happens between a pair of grace periods, then those grace periods cannot
+overlap.
+However, this temptation leads nowhere good, as can be illustrated by
+the following, with all variables initially zero:
+
+<blockquote>
+<pre>
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(a, 1);
+ 5 WRITE_ONCE(b, 1);
+ 6 rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+10 {
+11 r1 = READ_ONCE(a);
+12 synchronize_rcu();
+13 WRITE_ONCE(c, 1);
+14 }
+15
+16 void thread2(void)
+17 {
+18 rcu_read_lock();
+19 WRITE_ONCE(d, 1);
+20 r2 = READ_ONCE(c);
+21 rcu_read_unlock();
+22 }
+23
+24 void thread3(void)
+25 {
+26 r3 = READ_ONCE(d);
+27 synchronize_rcu();
+28 WRITE_ONCE(e, 1);
+29 }
+30
+31 void thread4(void)
+32 {
+33 rcu_read_lock();
+34 r4 = READ_ONCE(b);
+35 r5 = READ_ONCE(e);
+36 rcu_read_unlock();
+37 }
+</pre>
+</blockquote>
+
+<p>
+In this case, the outcome:
+
+<blockquote>
+<pre>
+(r1 == 1 &amp;&amp; r2 == 1 &amp;&amp; r3 == 1 &amp;&amp; r4 == 0 &amp&amp; r5 == 1)
+</pre>
+</blockquote>
+
+is entirely possible, as illustrated below:
+
+<p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p>
+
+<p>
+Again, an RCU read-side critical section can overlap almost all of a
+given grace period, just so long as it does not overlap the entire
+grace period.
+As a result, an RCU read-side critical section cannot partition a pair
+of RCU grace periods.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ How long a sequence of grace periods, each separated by an RCU
+ read-side critical section, would be required to partition the RCU
+ read-side critical sections at the beginning and end of the chain?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ In theory, an infinite number.
+ In practice, an unknown number that is sensitive to both implementation
+ details and timing considerations.
+ Therefore, even in practice, RCU users must abide by the
+ theoretical rather than the practical answer.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<h3><a name="Disabling Preemption Does Not Block Grace Periods">
+Disabling Preemption Does Not Block Grace Periods</a></h3>
+
+<p>
+There was a time when disabling preemption on any given CPU would block
+subsequent grace periods.
+However, this was an accident of implementation and is not a requirement.
+And in the current Linux-kernel implementation, disabling preemption
+on a given CPU in fact does not block grace periods, as Oleg Nesterov
+<a href="https://lkml.kernel.org/g/20150614193825.GA19582@redhat.com">demonstrated</a>.
+
+<p>
+If you need a preempt-disable region to block grace periods, you need to add
+<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>, for example
+as follows:
+
+<blockquote>
+<pre>
+ 1 preempt_disable();
+ 2 rcu_read_lock();
+ 3 do_something();
+ 4 rcu_read_unlock();
+ 5 preempt_enable();
+ 6
+ 7 /* Spinlocks implicitly disable preemption. */
+ 8 spin_lock(&amp;mylock);
+ 9 rcu_read_lock();
+10 do_something();
+11 rcu_read_unlock();
+12 spin_unlock(&amp;mylock);
+</pre>
+</blockquote>
+
+<p>
+In theory, you could enter the RCU read-side critical section first,
+but it is more efficient to keep the entire RCU read-side critical
+section contained in the preempt-disable region as shown above.
+Of course, RCU read-side critical sections that extend outside of
+preempt-disable regions will work correctly, but such critical sections
+can be preempted, which forces <tt>rcu_read_unlock()</tt> to do
+more work.
+And no, this is <i>not</i> an invitation to enclose all of your RCU
+read-side critical sections within preempt-disable regions, because
+doing so would degrade real-time response.
+
+<p>
+This non-requirement appeared with preemptible RCU.
+If you need a grace period that waits on non-preemptible code regions, use
+<a href="#Sched Flavor">RCU-sched</a>.
+
+<h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2>
+
+<p>
+These parallelism facts of life are by no means specific to RCU, but
+the RCU implementation must abide by them.
+They therefore bear repeating:
+
+<ol>
+<li> Any CPU or task may be delayed at any time,
+ and any attempts to avoid these delays by disabling
+ preemption, interrupts, or whatever are completely futile.
+ This is most obvious in preemptible user-level
+ environments and in virtualized environments (where
+ a given guest OS's VCPUs can be preempted at any time by
+ the underlying hypervisor), but can also happen in bare-metal
+ environments due to ECC errors, NMIs, and other hardware
+ events.
+ Although a delay of more than about 20 seconds can result
+ in splats, the RCU implementation is obligated to use
+ algorithms that can tolerate extremely long delays, but where
+ &ldquo;extremely long&rdquo; is not long enough to allow
+ wrap-around when incrementing a 64-bit counter.
+<li> Both the compiler and the CPU can reorder memory accesses.
+ Where it matters, RCU must use compiler directives and
+ memory-barrier instructions to preserve ordering.
+<li> Conflicting writes to memory locations in any given cache line
+ will result in expensive cache misses.
+ Greater numbers of concurrent writes and more-frequent
+ concurrent writes will result in more dramatic slowdowns.
+ RCU is therefore obligated to use algorithms that have
+ sufficient locality to avoid significant performance and
+ scalability problems.
+<li> As a rough rule of thumb, only one CPU's worth of processing
+ may be carried out under the protection of any given exclusive
+ lock.
+ RCU must therefore use scalable locking designs.
+<li> Counters are finite, especially on 32-bit systems.
+ RCU's use of counters must therefore tolerate counter wrap,
+ or be designed such that counter wrap would take way more
+ time than a single system is likely to run.
+ An uptime of ten years is quite possible, a runtime
+ of a century much less so.
+ As an example of the latter, RCU's dyntick-idle nesting counter
+ allows 54 bits for interrupt nesting level (this counter
+ is 64 bits even on a 32-bit system).
+ Overflowing this counter requires 2<sup>54</sup>
+ half-interrupts on a given CPU without that CPU ever going idle.
+ If a half-interrupt happened every microsecond, it would take
+ 570 years of runtime to overflow this counter, which is currently
+ believed to be an acceptably long time.
+<li> Linux systems can have thousands of CPUs running a single
+ Linux kernel in a single shared-memory environment.
+ RCU must therefore pay close attention to high-end scalability.
+</ol>
+
+<p>
+This last parallelism fact of life means that RCU must pay special
+attention to the preceding facts of life.
+The idea that Linux might scale to systems with thousands of CPUs would
+have been met with some skepticism in the 1990s, but these requirements
+would have otherwise have been unsurprising, even in the early 1990s.
+
+<h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2>
+
+<p>
+These sections list quality-of-implementation requirements.
+Although an RCU implementation that ignores these requirements could
+still be used, it would likely be subject to limitations that would
+make it inappropriate for industrial-strength production use.
+Classes of quality-of-implementation requirements are as follows:
+
+<ol>
+<li> <a href="#Specialization">Specialization</a>
+<li> <a href="#Performance and Scalability">Performance and Scalability</a>
+<li> <a href="#Composability">Composability</a>
+<li> <a href="#Corner Cases">Corner Cases</a>
+</ol>
+
+<p>
+These classes is covered in the following sections.
+
+<h3><a name="Specialization">Specialization</a></h3>
+
+<p>
+RCU is and always has been intended primarily for read-mostly situations,
+which means that RCU's read-side primitives are optimized, often at the
+expense of its update-side primitives.
+Experience thus far is captured by the following list of situations:
+
+<ol>
+<li> Read-mostly data, where stale and inconsistent data is not
+ a problem: RCU works great!
+<li> Read-mostly data, where data must be consistent:
+ RCU works well.
+<li> Read-write data, where data must be consistent:
+ RCU <i>might</i> work OK.
+ Or not.
+<li> Write-mostly data, where data must be consistent:
+ RCU is very unlikely to be the right tool for the job,
+ with the following exceptions, where RCU can provide:
+ <ol type=a>
+ <li> Existence guarantees for update-friendly mechanisms.
+ <li> Wait-free read-side primitives for real-time use.
+ </ol>
+</ol>
+
+<p>
+This focus on read-mostly situations means that RCU must interoperate
+with other synchronization primitives.
+For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt>
+examples discussed earlier use RCU to protect readers and locking to
+coordinate updaters.
+However, the need extends much farther, requiring that a variety of
+synchronization primitives be legal within RCU read-side critical sections,
+including spinlocks, sequence locks, atomic operations, reference
+counters, and memory barriers.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ What about sleeping locks?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ These are forbidden within Linux-kernel RCU read-side critical
+ sections because it is not legal to place a quiescent state
+ (in this case, voluntary context switch) within an RCU read-side
+ critical section.
+ However, sleeping locks may be used within userspace RCU read-side
+ critical sections, and also within Linux-kernel sleepable RCU
+ <a href="#Sleepable RCU"><font color="ffffff">(SRCU)</font></a>
+ read-side critical sections.
+ In addition, the -rt patchset turns spinlocks into a
+ sleeping locks so that the corresponding critical sections
+ can be preempted, which also means that these sleeplockified
+ spinlocks (but not other sleeping locks!) may be acquire within
+ -rt-Linux-kernel RCU read-side critical sections.
+ </font>
+
+ <p><font color="ffffff">
+ Note that it <i>is</i> legal for a normal RCU read-side
+ critical section to conditionally acquire a sleeping locks
+ (as in <tt>mutex_trylock()</tt>), but only as long as it does
+ not loop indefinitely attempting to conditionally acquire that
+ sleeping locks.
+ The key point is that things like <tt>mutex_trylock()</tt>
+ either return with the mutex held, or return an error indication if
+ the mutex was not immediately available.
+ Either way, <tt>mutex_trylock()</tt> returns immediately without
+ sleeping.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+It often comes as a surprise that many algorithms do not require a
+consistent view of data, but many can function in that mode,
+with network routing being the poster child.
+Internet routing algorithms take significant time to propagate
+updates, so that by the time an update arrives at a given system,
+that system has been sending network traffic the wrong way for
+a considerable length of time.
+Having a few threads continue to send traffic the wrong way for a
+few more milliseconds is clearly not a problem: In the worst case,
+TCP retransmissions will eventually get the data where it needs to go.
+In general, when tracking the state of the universe outside of the
+computer, some level of inconsistency must be tolerated due to
+speed-of-light delays if nothing else.
+
+<p>
+Furthermore, uncertainty about external state is inherent in many cases.
+For example, a pair of veterinarians might use heartbeat to determine
+whether or not a given cat was alive.
+But how long should they wait after the last heartbeat to decide that
+the cat is in fact dead?
+Waiting less than 400 milliseconds makes no sense because this would
+mean that a relaxed cat would be considered to cycle between death
+and life more than 100 times per minute.
+Moreover, just as with human beings, a cat's heart might stop for
+some period of time, so the exact wait period is a judgment call.
+One of our pair of veterinarians might wait 30 seconds before pronouncing
+the cat dead, while the other might insist on waiting a full minute.
+The two veterinarians would then disagree on the state of the cat during
+the final 30 seconds of the minute following the last heartbeat.
+
+<p>
+Interestingly enough, this same situation applies to hardware.
+When push comes to shove, how do we tell whether or not some
+external server has failed?
+We send messages to it periodically, and declare it failed if we
+don't receive a response within a given period of time.
+Policy decisions can usually tolerate short
+periods of inconsistency.
+The policy was decided some time ago, and is only now being put into
+effect, so a few milliseconds of delay is normally inconsequential.
+
+<p>
+However, there are algorithms that absolutely must see consistent data.
+For example, the translation between a user-level SystemV semaphore
+ID to the corresponding in-kernel data structure is protected by RCU,
+but it is absolutely forbidden to update a semaphore that has just been
+removed.
+In the Linux kernel, this need for consistency is accommodated by acquiring
+spinlocks located in the in-kernel data structure from within
+the RCU read-side critical section, and this is indicated by the
+green box in the figure above.
+Many other techniques may be used, and are in fact used within the
+Linux kernel.
+
+<p>
+In short, RCU is not required to maintain consistency, and other
+mechanisms may be used in concert with RCU when consistency is required.
+RCU's specialization allows it to do its job extremely well, and its
+ability to interoperate with other synchronization mechanisms allows
+the right mix of synchronization tools to be used for a given job.
+
+<h3><a name="Performance and Scalability">Performance and Scalability</a></h3>
+
+<p>
+Energy efficiency is a critical component of performance today,
+and Linux-kernel RCU implementations must therefore avoid unnecessarily
+awakening idle CPUs.
+I cannot claim that this requirement was premeditated.
+In fact, I learned of it during a telephone conversation in which I
+was given &ldquo;frank and open&rdquo; feedback on the importance
+of energy efficiency in battery-powered systems and on specific
+energy-efficiency shortcomings of the Linux-kernel RCU implementation.
+In my experience, the battery-powered embedded community will consider
+any unnecessary wakeups to be extremely unfriendly acts.
+So much so that mere Linux-kernel-mailing-list posts are
+insufficient to vent their ire.
+
+<p>
+Memory consumption is not particularly important for in most
+situations, and has become decreasingly
+so as memory sizes have expanded and memory
+costs have plummeted.
+However, as I learned from Matt Mackall's
+<a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a>
+efforts, memory footprint is critically important on single-CPU systems with
+non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus
+<a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a>
+was born.
+Josh Triplett has since taken over the small-memory banner with his
+<a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a>
+project, which resulted in
+<a href="#Sleepable RCU">SRCU</a>
+becoming optional for those kernels not needing it.
+
+<p>
+The remaining performance requirements are, for the most part,
+unsurprising.
+For example, in keeping with RCU's read-side specialization,
+<tt>rcu_dereference()</tt> should have negligible overhead (for
+example, suppression of a few minor compiler optimizations).
+Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and
+<tt>rcu_read_unlock()</tt> should have exactly zero overhead.
+
+<p>
+In preemptible environments, in the case where the RCU read-side
+critical section was not preempted (as will be the case for the
+highest-priority real-time process), <tt>rcu_read_lock()</tt> and
+<tt>rcu_read_unlock()</tt> should have minimal overhead.
+In particular, they should not contain atomic read-modify-write
+operations, memory-barrier instructions, preemption disabling,
+interrupt disabling, or backwards branches.
+However, in the case where the RCU read-side critical section was preempted,
+<tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts.
+This is why it is better to nest an RCU read-side critical section
+within a preempt-disable region than vice versa, at least in cases
+where that critical section is short enough to avoid unduly degrading
+real-time latencies.
+
+<p>
+The <tt>synchronize_rcu()</tt> grace-period-wait primitive is
+optimized for throughput.
+It may therefore incur several milliseconds of latency in addition to
+the duration of the longest RCU read-side critical section.
+On the other hand, multiple concurrent invocations of
+<tt>synchronize_rcu()</tt> are required to use batching optimizations
+so that they can be satisfied by a single underlying grace-period-wait
+operation.
+For example, in the Linux kernel, it is not unusual for a single
+grace-period-wait operation to serve more than
+<a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a>
+of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation
+overhead down to nearly zero.
+However, the grace-period optimization is also required to avoid
+measurable degradation of real-time scheduling and interrupt latencies.
+
+<p>
+In some cases, the multi-millisecond <tt>synchronize_rcu()</tt>
+latencies are unacceptable.
+In these cases, <tt>synchronize_rcu_expedited()</tt> may be used
+instead, reducing the grace-period latency down to a few tens of
+microseconds on small systems, at least in cases where the RCU read-side
+critical sections are short.
+There are currently no special latency requirements for
+<tt>synchronize_rcu_expedited()</tt> on large systems, but,
+consistent with the empirical nature of the RCU specification,
+that is subject to change.
+However, there most definitely are scalability requirements:
+A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096
+CPUs should at least make reasonable forward progress.
+In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt>
+is permitted to impose modest degradation of real-time latency
+on non-idle online CPUs.
+Here, &ldquo;modest&rdquo; means roughly the same latency
+degradation as a scheduling-clock interrupt.
+
+<p>
+There are a number of situations where even
+<tt>synchronize_rcu_expedited()</tt>'s reduced grace-period
+latency is unacceptable.
+In these situations, the asynchronous <tt>call_rcu()</tt> can be
+used in place of <tt>synchronize_rcu()</tt> as follows:
+
+<blockquote>
+<pre>
+ 1 struct foo {
+ 2 int a;
+ 3 int b;
+ 4 struct rcu_head rh;
+ 5 };
+ 6
+ 7 static void remove_gp_cb(struct rcu_head *rhp)
+ 8 {
+ 9 struct foo *p = container_of(rhp, struct foo, rh);
+10
+11 kfree(p);
+12 }
+13
+14 bool remove_gp_asynchronous(void)
+15 {
+16 struct foo *p;
+17
+18 spin_lock(&amp;gp_lock);
+19 p = rcu_dereference(gp);
+20 if (!p) {
+21 spin_unlock(&amp;gp_lock);
+22 return false;
+23 }
+24 rcu_assign_pointer(gp, NULL);
+25 call_rcu(&amp;p-&gt;rh, remove_gp_cb);
+26 spin_unlock(&amp;gp_lock);
+27 return true;
+28 }
+</pre>
+</blockquote>
+
+<p>
+A definition of <tt>struct foo</tt> is finally needed, and appears
+on lines&nbsp;1-5.
+The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt>
+on line&nbsp;25, and will be invoked after the end of a subsequent
+grace period.
+This gets the same effect as <tt>remove_gp_synchronous()</tt>,
+but without forcing the updater to wait for a grace period to elapse.
+The <tt>call_rcu()</tt> function may be used in a number of
+situations where neither <tt>synchronize_rcu()</tt> nor
+<tt>synchronize_rcu_expedited()</tt> would be legal,
+including within preempt-disable code, <tt>local_bh_disable()</tt> code,
+interrupt-disable code, and interrupt handlers.
+However, even <tt>call_rcu()</tt> is illegal within NMI handlers
+and from idle and offline CPUs.
+The callback function (<tt>remove_gp_cb()</tt> in this case) will be
+executed within softirq (software interrupt) environment within the
+Linux kernel,
+either within a real softirq handler or under the protection
+of <tt>local_bh_disable()</tt>.
+In both the Linux kernel and in userspace, it is bad practice to
+write an RCU callback function that takes too long.
+Long-running operations should be relegated to separate threads or
+(in the Linux kernel) workqueues.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Why does line&nbsp;19 use <tt>rcu_access_pointer()</tt>?
+ After all, <tt>call_rcu()</tt> on line&nbsp;25 stores into the
+ structure, which would interact badly with concurrent insertions.
+ Doesn't this mean that <tt>rcu_dereference()</tt> is required?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ Presumably the <tt>-&gt;gp_lock</tt> acquired on line&nbsp;18 excludes
+ any changes, including any insertions that <tt>rcu_dereference()</tt>
+ would protect against.
+ Therefore, any insertions will be delayed until after
+ <tt>-&gt;gp_lock</tt>
+ is released on line&nbsp;25, which in turn means that
+ <tt>rcu_access_pointer()</tt> suffices.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+However, all that <tt>remove_gp_cb()</tt> is doing is
+invoking <tt>kfree()</tt> on the data element.
+This is a common idiom, and is supported by <tt>kfree_rcu()</tt>,
+which allows &ldquo;fire and forget&rdquo; operation as shown below:
+
+<blockquote>
+<pre>
+ 1 struct foo {
+ 2 int a;
+ 3 int b;
+ 4 struct rcu_head rh;
+ 5 };
+ 6
+ 7 bool remove_gp_faf(void)
+ 8 {
+ 9 struct foo *p;
+10
+11 spin_lock(&amp;gp_lock);
+12 p = rcu_dereference(gp);
+13 if (!p) {
+14 spin_unlock(&amp;gp_lock);
+15 return false;
+16 }
+17 rcu_assign_pointer(gp, NULL);
+18 kfree_rcu(p, rh);
+19 spin_unlock(&amp;gp_lock);
+20 return true;
+21 }
+</pre>
+</blockquote>
+
+<p>
+Note that <tt>remove_gp_faf()</tt> simply invokes
+<tt>kfree_rcu()</tt> and proceeds, without any need to pay any
+further attention to the subsequent grace period and <tt>kfree()</tt>.
+It is permissible to invoke <tt>kfree_rcu()</tt> from the same
+environments as for <tt>call_rcu()</tt>.
+Interestingly enough, DYNIX/ptx had the equivalents of
+<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not
+<tt>synchronize_rcu()</tt>.
+This was due to the fact that RCU was not heavily used within DYNIX/ptx,
+so the very few places that needed something like
+<tt>synchronize_rcu()</tt> simply open-coded it.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Earlier it was claimed that <tt>call_rcu()</tt> and
+ <tt>kfree_rcu()</tt> allowed updaters to avoid being blocked
+ by readers.
+ But how can that be correct, given that the invocation of the callback
+ and the freeing of the memory (respectively) must still wait for
+ a grace period to elapse?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ We could define things this way, but keep in mind that this sort of
+ definition would say that updates in garbage-collected languages
+ cannot complete until the next time the garbage collector runs,
+ which does not seem at all reasonable.
+ The key point is that in most cases, an updater using either
+ <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the
+ next update as soon as it has invoked <tt>call_rcu()</tt> or
+ <tt>kfree_rcu()</tt>, without having to wait for a subsequent
+ grace period.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+But what if the updater must wait for the completion of code to be
+executed after the end of the grace period, but has other tasks
+that can be carried out in the meantime?
+The polling-style <tt>get_state_synchronize_rcu()</tt> and
+<tt>cond_synchronize_rcu()</tt> functions may be used for this
+purpose, as shown below:
+
+<blockquote>
+<pre>
+ 1 bool remove_gp_poll(void)
+ 2 {
+ 3 struct foo *p;
+ 4 unsigned long s;
+ 5
+ 6 spin_lock(&amp;gp_lock);
+ 7 p = rcu_access_pointer(gp);
+ 8 if (!p) {
+ 9 spin_unlock(&amp;gp_lock);
+10 return false;
+11 }
+12 rcu_assign_pointer(gp, NULL);
+13 spin_unlock(&amp;gp_lock);
+14 s = get_state_synchronize_rcu();
+15 do_something_while_waiting();
+16 cond_synchronize_rcu(s);
+17 kfree(p);
+18 return true;
+19 }
+</pre>
+</blockquote>
+
+<p>
+On line&nbsp;14, <tt>get_state_synchronize_rcu()</tt> obtains a
+&ldquo;cookie&rdquo; from RCU,
+then line&nbsp;15 carries out other tasks,
+and finally, line&nbsp;16 returns immediately if a grace period has
+elapsed in the meantime, but otherwise waits as required.
+The need for <tt>get_state_synchronize_rcu</tt> and
+<tt>cond_synchronize_rcu()</tt> has appeared quite recently,
+so it is too early to tell whether they will stand the test of time.
+
+<p>
+RCU thus provides a range of tools to allow updaters to strike the
+required tradeoff between latency, flexibility and CPU overhead.
+
+<h3><a name="Composability">Composability</a></h3>
+
+<p>
+Composability has received much attention in recent years, perhaps in part
+due to the collision of multicore hardware with object-oriented techniques
+designed in single-threaded environments for single-threaded use.
+And in theory, RCU read-side critical sections may be composed, and in
+fact may be nested arbitrarily deeply.
+In practice, as with all real-world implementations of composable
+constructs, there are limitations.
+
+<p>
+Implementations of RCU for which <tt>rcu_read_lock()</tt>
+and <tt>rcu_read_unlock()</tt> generate no code, such as
+Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be
+nested arbitrarily deeply.
+After all, there is no overhead.
+Except that if all these instances of <tt>rcu_read_lock()</tt>
+and <tt>rcu_read_unlock()</tt> are visible to the compiler,
+compilation will eventually fail due to exhausting memory,
+mass storage, or user patience, whichever comes first.
+If the nesting is not visible to the compiler, as is the case with
+mutually recursive functions each in its own translation unit,
+stack overflow will result.
+If the nesting takes the form of loops, perhaps in the guise of tail
+recursion, either the control variable
+will overflow or (in the Linux kernel) you will get an RCU CPU stall warning.
+Nevertheless, this class of RCU implementations is one
+of the most composable constructs in existence.
+
+<p>
+RCU implementations that explicitly track nesting depth
+are limited by the nesting-depth counter.
+For example, the Linux kernel's preemptible RCU limits nesting to
+<tt>INT_MAX</tt>.
+This should suffice for almost all practical purposes.
+That said, a consecutive pair of RCU read-side critical sections
+between which there is an operation that waits for a grace period
+cannot be enclosed in another RCU read-side critical section.
+This is because it is not legal to wait for a grace period within
+an RCU read-side critical section: To do so would result either
+in deadlock or
+in RCU implicitly splitting the enclosing RCU read-side critical
+section, neither of which is conducive to a long-lived and prosperous
+kernel.
+
+<p>
+It is worth noting that RCU is not alone in limiting composability.
+For example, many transactional-memory implementations prohibit
+composing a pair of transactions separated by an irrevocable
+operation (for example, a network receive operation).
+For another example, lock-based critical sections can be composed
+surprisingly freely, but only if deadlock is avoided.
+
+<p>
+In short, although RCU read-side critical sections are highly composable,
+care is required in some situations, just as is the case for any other
+composable synchronization mechanism.
+
+<h3><a name="Corner Cases">Corner Cases</a></h3>
+
+<p>
+A given RCU workload might have an endless and intense stream of
+RCU read-side critical sections, perhaps even so intense that there
+was never a point in time during which there was not at least one
+RCU read-side critical section in flight.
+RCU cannot allow this situation to block grace periods: As long as
+all the RCU read-side critical sections are finite, grace periods
+must also be finite.
+
+<p>
+That said, preemptible RCU implementations could potentially result
+in RCU read-side critical sections being preempted for long durations,
+which has the effect of creating a long-duration RCU read-side
+critical section.
+This situation can arise only in heavily loaded systems, but systems using
+real-time priorities are of course more vulnerable.
+Therefore, RCU priority boosting is provided to help deal with this
+case.
+That said, the exact requirements on RCU priority boosting will likely
+evolve as more experience accumulates.
+
+<p>
+Other workloads might have very high update rates.
+Although one can argue that such workloads should instead use
+something other than RCU, the fact remains that RCU must
+handle such workloads gracefully.
+This requirement is another factor driving batching of grace periods,
+but it is also the driving force behind the checks for large numbers
+of queued RCU callbacks in the <tt>call_rcu()</tt> code path.
+Finally, high update rates should not delay RCU read-side critical
+sections, although some small read-side delays can occur when using
+<tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use
+of <tt>smp_call_function_single()</tt>.
+
+<p>
+Although all three of these corner cases were understood in the early
+1990s, a simple user-level test consisting of <tt>close(open(path))</tt>
+in a tight loop
+in the early 2000s suddenly provided a much deeper appreciation of the
+high-update-rate corner case.
+This test also motivated addition of some RCU code to react to high update
+rates, for example, if a given CPU finds itself with more than 10,000
+RCU callbacks queued, it will cause RCU to take evasive action by
+more aggressively starting grace periods and more aggressively forcing
+completion of grace-period processing.
+This evasive action causes the grace period to complete more quickly,
+but at the cost of restricting RCU's batching optimizations, thus
+increasing the CPU overhead incurred by that grace period.
+
+<h2><a name="Software-Engineering Requirements">
+Software-Engineering Requirements</a></h2>
+
+<p>
+Between Murphy's Law and &ldquo;To err is human&rdquo;, it is necessary to
+guard against mishaps and misuse:
+
+<ol>
+<li> It is all too easy to forget to use <tt>rcu_read_lock()</tt>
+ everywhere that it is needed, so kernels built with
+ <tt>CONFIG_PROVE_RCU=y</tt> will splat if
+ <tt>rcu_dereference()</tt> is used outside of an
+ RCU read-side critical section.
+ Update-side code can use <tt>rcu_dereference_protected()</tt>,
+ which takes a
+ <a href="https://lwn.net/Articles/371986/">lockdep expression</a>
+ to indicate what is providing the protection.
+ If the indicated protection is not provided, a lockdep splat
+ is emitted.
+
+ <p>
+ Code shared between readers and updaters can use
+ <tt>rcu_dereference_check()</tt>, which also takes a
+ lockdep expression, and emits a lockdep splat if neither
+ <tt>rcu_read_lock()</tt> nor the indicated protection
+ is in place.
+ In addition, <tt>rcu_dereference_raw()</tt> is used in those
+ (hopefully rare) cases where the required protection cannot
+ be easily described.
+ Finally, <tt>rcu_read_lock_held()</tt> is provided to
+ allow a function to verify that it has been invoked within
+ an RCU read-side critical section.
+ I was made aware of this set of requirements shortly after Thomas
+ Gleixner audited a number of RCU uses.
+<li> A given function might wish to check for RCU-related preconditions
+ upon entry, before using any other RCU API.
+ The <tt>rcu_lockdep_assert()</tt> does this job,
+ asserting the expression in kernels having lockdep enabled
+ and doing nothing otherwise.
+<li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt>
+ and <tt>rcu_dereference()</tt>, perhaps (incorrectly)
+ substituting a simple assignment.
+ To catch this sort of error, a given RCU-protected pointer may be
+ tagged with <tt>__rcu</tt>, after which sparse
+ will complain about simple-assignment accesses to that pointer.
+ Arnd Bergmann made me aware of this requirement, and also
+ supplied the needed
+ <a href="https://lwn.net/Articles/376011/">patch series</a>.
+<li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt>
+ will splat if a data element is passed to <tt>call_rcu()</tt>
+ twice in a row, without a grace period in between.
+ (This error is similar to a double free.)
+ The corresponding <tt>rcu_head</tt> structures that are
+ dynamically allocated are automatically tracked, but
+ <tt>rcu_head</tt> structures allocated on the stack
+ must be initialized with <tt>init_rcu_head_on_stack()</tt>
+ and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>.
+ Similarly, statically allocated non-stack <tt>rcu_head</tt>
+ structures must be initialized with <tt>init_rcu_head()</tt>
+ and cleaned up with <tt>destroy_rcu_head()</tt>.
+ Mathieu Desnoyers made me aware of this requirement, and also
+ supplied the needed
+ <a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>.
+<li> An infinite loop in an RCU read-side critical section will
+ eventually trigger an RCU CPU stall warning splat, with
+ the duration of &ldquo;eventually&rdquo; being controlled by the
+ <tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or,
+ alternatively, by the
+ <tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs
+ parameter.
+ However, RCU is not obligated to produce this splat
+ unless there is a grace period waiting on that particular
+ RCU read-side critical section.
+ <p>
+ Some extreme workloads might intentionally delay
+ RCU grace periods, and systems running those workloads can
+ be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt>
+ to suppress the splats.
+ This kernel parameter may also be set via <tt>sysfs</tt>.
+ Furthermore, RCU CPU stall warnings are counter-productive
+ during sysrq dumps and during panics.
+ RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and
+ <tt>rcu_sysrq_end()</tt> API members to be called before
+ and after long sysrq dumps.
+ RCU also supplies the <tt>rcu_panic()</tt> notifier that is
+ automatically invoked at the beginning of a panic to suppress
+ further RCU CPU stall warnings.
+
+ <p>
+ This requirement made itself known in the early 1990s, pretty
+ much the first time that it was necessary to debug a CPU stall.
+ That said, the initial implementation in DYNIX/ptx was quite
+ generic in comparison with that of Linux.
+<li> Although it would be very good to detect pointers leaking out
+ of RCU read-side critical sections, there is currently no
+ good way of doing this.
+ One complication is the need to distinguish between pointers
+ leaking and pointers that have been handed off from RCU to
+ some other synchronization mechanism, for example, reference
+ counting.
+<li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related
+ information is provided via event tracing.
+<li> Open-coded use of <tt>rcu_assign_pointer()</tt> and
+ <tt>rcu_dereference()</tt> to create typical linked
+ data structures can be surprisingly error-prone.
+ Therefore, RCU-protected
+ <a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a>
+ and, more recently, RCU-protected
+ <a href="https://lwn.net/Articles/612100/">hash tables</a>
+ are available.
+ Many other special-purpose RCU-protected data structures are
+ available in the Linux kernel and the userspace RCU library.
+<li> Some linked structures are created at compile time, but still
+ require <tt>__rcu</tt> checking.
+ The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this
+ purpose.
+<li> It is not necessary to use <tt>rcu_assign_pointer()</tt>
+ when creating linked structures that are to be published via
+ a single external pointer.
+ The <tt>RCU_INIT_POINTER()</tt> macro is provided for
+ this task and also for assigning <tt>NULL</tt> pointers
+ at runtime.
+</ol>
+
+<p>
+This not a hard-and-fast list: RCU's diagnostic capabilities will
+continue to be guided by the number and type of usage bugs found
+in real-world RCU usage.
+
+<h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2>
+
+<p>
+The Linux kernel provides an interesting environment for all kinds of
+software, including RCU.
+Some of the relevant points of interest are as follows:
+
+<ol>
+<li> <a href="#Configuration">Configuration</a>.
+<li> <a href="#Firmware Interface">Firmware Interface</a>.
+<li> <a href="#Early Boot">Early Boot</a>.
+<li> <a href="#Interrupts and NMIs">
+ Interrupts and non-maskable interrupts (NMIs)</a>.
+<li> <a href="#Loadable Modules">Loadable Modules</a>.
+<li> <a href="#Hotplug CPU">Hotplug CPU</a>.
+<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>.
+<li> <a href="#Tracing and RCU">Tracing and RCU</a>.
+<li> <a href="#Energy Efficiency">Energy Efficiency</a>.
+<li> <a href="#Scheduling-Clock Interrupts and RCU">
+ Scheduling-Clock Interrupts and RCU</a>.
+<li> <a href="#Memory Efficiency">Memory Efficiency</a>.
+<li> <a href="#Performance, Scalability, Response Time, and Reliability">
+ Performance, Scalability, Response Time, and Reliability</a>.
+</ol>
+
+<p>
+This list is probably incomplete, but it does give a feel for the
+most notable Linux-kernel complications.
+Each of the following sections covers one of the above topics.
+
+<h3><a name="Configuration">Configuration</a></h3>
+
+<p>
+RCU's goal is automatic configuration, so that almost nobody
+needs to worry about RCU's <tt>Kconfig</tt> options.
+And for almost all users, RCU does in fact work well
+&ldquo;out of the box.&rdquo;
+
+<p>
+However, there are specialized use cases that are handled by
+kernel boot parameters and <tt>Kconfig</tt> options.
+Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users
+about new <tt>Kconfig</tt> options, which requires almost all of them
+be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option.
+
+<p>
+This all should be quite obvious, but the fact remains that
+Linus Torvalds recently had to
+<a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a>
+me of this requirement.
+
+<h3><a name="Firmware Interface">Firmware Interface</a></h3>
+
+<p>
+In many cases, kernel obtains information about the system from the
+firmware, and sometimes things are lost in translation.
+Or the translation is accurate, but the original message is bogus.
+
+<p>
+For example, some systems' firmware overreports the number of CPUs,
+sometimes by a large factor.
+If RCU naively believed the firmware, as it used to do,
+it would create too many per-CPU kthreads.
+Although the resulting system will still run correctly, the extra
+kthreads needlessly consume memory and can cause confusion
+when they show up in <tt>ps</tt> listings.
+
+<p>
+RCU must therefore wait for a given CPU to actually come online before
+it can allow itself to believe that the CPU actually exists.
+The resulting &ldquo;ghost CPUs&rdquo; (which are never going to
+come online) cause a number of
+<a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>.
+
+<h3><a name="Early Boot">Early Boot</a></h3>
+
+<p>
+The Linux kernel's boot sequence is an interesting process,
+and RCU is used early, even before <tt>rcu_init()</tt>
+is invoked.
+In fact, a number of RCU's primitives can be used as soon as the
+initial task's <tt>task_struct</tt> is available and the
+boot CPU's per-CPU variables are set up.
+The read-side primitives (<tt>rcu_read_lock()</tt>,
+<tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>,
+and <tt>rcu_access_pointer()</tt>) will operate normally very early on,
+as will <tt>rcu_assign_pointer()</tt>.
+
+<p>
+Although <tt>call_rcu()</tt> may be invoked at any
+time during boot, callbacks are not guaranteed to be invoked until after
+all of RCU's kthreads have been spawned, which occurs at
+<tt>early_initcall()</tt> time.
+This delay in callback invocation is due to the fact that RCU does not
+invoke callbacks until it is fully initialized, and this full initialization
+cannot occur until after the scheduler has initialized itself to the
+point where RCU can spawn and run its kthreads.
+In theory, it would be possible to invoke callbacks earlier,
+however, this is not a panacea because there would be severe restrictions
+on what operations those callbacks could invoke.
+
+<p>
+Perhaps surprisingly, <tt>synchronize_rcu()</tt>,
+<a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a>
+(<a href="#Bottom-Half Flavor">discussed below</a>),
+<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>,
+<tt>synchronize_rcu_expedited()</tt>,
+<tt>synchronize_rcu_bh_expedited()</tt>, and
+<tt>synchronize_sched_expedited()</tt>
+will all operate normally
+during very early boot, the reason being that there is only one CPU
+and preemption is disabled.
+This means that the call <tt>synchronize_rcu()</tt> (or friends)
+itself is a quiescent
+state and thus a grace period, so the early-boot implementation can
+be a no-op.
+
+<p>
+However, once the scheduler has spawned its first kthread, this early
+boot trick fails for <tt>synchronize_rcu()</tt> (as well as for
+<tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt>
+kernels.
+The reason is that an RCU read-side critical section might be preempted,
+which means that a subsequent <tt>synchronize_rcu()</tt> really does have
+to wait for something, as opposed to simply returning immediately.
+Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of
+its kthreads are spawned, which doesn't happen until some time during
+<tt>early_initcalls()</tt> time.
+But this is no excuse: RCU is nevertheless required to correctly handle
+synchronous grace periods during this time period.
+Once all of its kthreads are up and running, RCU starts running
+normally.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ How can RCU possibly handle grace periods before all of its
+ kthreads have been spawned???
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ Very carefully!
+ </font>
+
+ <p><font color="ffffff">
+ During the &ldquo;dead zone&rdquo; between the time that the
+ scheduler spawns the first task and the time that all of RCU's
+ kthreads have been spawned, all synchronous grace periods are
+ handled by the expedited grace-period mechanism.
+ At runtime, this expedited mechanism relies on workqueues, but
+ during the dead zone the requesting task itself drives the
+ desired expedited grace period.
+ Because dead-zone execution takes place within task context,
+ everything works.
+ Once the dead zone ends, expedited grace periods go back to
+ using workqueues, as is required to avoid problems that would
+ otherwise occur when a user task received a POSIX signal while
+ driving an expedited grace period.
+ </font>
+
+ <p><font color="ffffff">
+ And yes, this does mean that it is unhelpful to send POSIX
+ signals to random tasks between the time that the scheduler
+ spawns its first kthread and the time that RCU's kthreads
+ have all been spawned.
+ If there ever turns out to be a good reason for sending POSIX
+ signals during that time, appropriate adjustments will be made.
+ (If it turns out that POSIX signals are sent during this time for
+ no good reason, other adjustments will be made, appropriate
+ or otherwise.)
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+I learned of these boot-time requirements as a result of a series of
+system hangs.
+
+<h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3>
+
+<p>
+The Linux kernel has interrupts, and RCU read-side critical sections are
+legal within interrupt handlers and within interrupt-disabled regions
+of code, as are invocations of <tt>call_rcu()</tt>.
+
+<p>
+Some Linux-kernel architectures can enter an interrupt handler from
+non-idle process context, and then just never leave it, instead stealthily
+transitioning back to process context.
+This trick is sometimes used to invoke system calls from inside the kernel.
+These &ldquo;half-interrupts&rdquo; mean that RCU has to be very careful
+about how it counts interrupt nesting levels.
+I learned of this requirement the hard way during a rewrite
+of RCU's dyntick-idle code.
+
+<p>
+The Linux kernel has non-maskable interrupts (NMIs), and
+RCU read-side critical sections are legal within NMI handlers.
+Thankfully, RCU update-side primitives, including
+<tt>call_rcu()</tt>, are prohibited within NMI handlers.
+
+<p>
+The name notwithstanding, some Linux-kernel architectures
+can have nested NMIs, which RCU must handle correctly.
+Andy Lutomirski
+<a href="https://lkml.kernel.org/g/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a>
+with this requirement;
+he also kindly surprised me with
+<a href="https://lkml.kernel.org/g/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a>
+that meets this requirement.
+
+<h3><a name="Loadable Modules">Loadable Modules</a></h3>
+
+<p>
+The Linux kernel has loadable modules, and these modules can
+also be unloaded.
+After a given module has been unloaded, any attempt to call
+one of its functions results in a segmentation fault.
+The module-unload functions must therefore cancel any
+delayed calls to loadable-module functions, for example,
+any outstanding <tt>mod_timer()</tt> must be dealt with
+via <tt>del_timer_sync()</tt> or similar.
+
+<p>
+Unfortunately, there is no way to cancel an RCU callback;
+once you invoke <tt>call_rcu()</tt>, the callback function is
+going to eventually be invoked, unless the system goes down first.
+Because it is normally considered socially irresponsible to crash the system
+in response to a module unload request, we need some other way
+to deal with in-flight RCU callbacks.
+
+<p>
+RCU therefore provides
+<tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>,
+which waits until all in-flight RCU callbacks have been invoked.
+If a module uses <tt>call_rcu()</tt>, its exit function should therefore
+prevent any future invocation of <tt>call_rcu()</tt>, then invoke
+<tt>rcu_barrier()</tt>.
+In theory, the underlying module-unload code could invoke
+<tt>rcu_barrier()</tt> unconditionally, but in practice this would
+incur unacceptable latencies.
+
+<p>
+Nikita Danilov noted this requirement for an analogous filesystem-unmount
+situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU.
+The need for <tt>rcu_barrier()</tt> for module unloading became
+apparent later.
+
+<p>
+<b>Important note</b>: The <tt>rcu_barrier()</tt> function is not,
+repeat, <i>not</i>, obligated to wait for a grace period.
+It is instead only required to wait for RCU callbacks that have
+already been posted.
+Therefore, if there are no RCU callbacks posted anywhere in the system,
+<tt>rcu_barrier()</tt> is within its rights to return immediately.
+Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not
+necessarily need to wait for a grace period.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Wait a minute!
+ Each RCU callbacks must wait for a grace period to complete,
+ and <tt>rcu_barrier()</tt> must wait for each pre-existing
+ callback to be invoked.
+ Doesn't <tt>rcu_barrier()</tt> therefore need to wait for
+ a full grace period if there is even one callback posted anywhere
+ in the system?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ Absolutely not!!!
+ </font>
+
+ <p><font color="ffffff">
+ Yes, each RCU callbacks must wait for a grace period to complete,
+ but it might well be partly (or even completely) finished waiting
+ by the time <tt>rcu_barrier()</tt> is invoked.
+ In that case, <tt>rcu_barrier()</tt> need only wait for the
+ remaining portion of the grace period to elapse.
+ So even if there are quite a few callbacks posted,
+ <tt>rcu_barrier()</tt> might well return quite quickly.
+ </font>
+
+ <p><font color="ffffff">
+ So if you need to wait for a grace period as well as for all
+ pre-existing callbacks, you will need to invoke both
+ <tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>.
+ If latency is a concern, you can always use workqueues
+ to invoke them concurrently.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<h3><a name="Hotplug CPU">Hotplug CPU</a></h3>
+
+<p>
+The Linux kernel supports CPU hotplug, which means that CPUs
+can come and go.
+It is of course illegal to use any RCU API member from an offline CPU,
+with the exception of <a href="#Sleepable RCU">SRCU</a> read-side
+critical sections.
+This requirement was present from day one in DYNIX/ptx, but
+on the other hand, the Linux kernel's CPU-hotplug implementation
+is &ldquo;interesting.&rdquo;
+
+<p>
+The Linux-kernel CPU-hotplug implementation has notifiers that
+are used to allow the various kernel subsystems (including RCU)
+to respond appropriately to a given CPU-hotplug operation.
+Most RCU operations may be invoked from CPU-hotplug notifiers,
+including even synchronous grace-period operations such as
+<tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>.
+
+<p>
+However, all-callback-wait operations such as
+<tt>rcu_barrier()</tt> are also not supported, due to the
+fact that there are phases of CPU-hotplug operations where
+the outgoing CPU's callbacks will not be invoked until after
+the CPU-hotplug operation ends, which could also result in deadlock.
+Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations
+during its execution, which results in another type of deadlock
+when invoked from a CPU-hotplug notifier.
+
+<h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3>
+
+<p>
+RCU depends on the scheduler, and the scheduler uses RCU to
+protect some of its data structures.
+This means the scheduler is forbidden from acquiring
+the runqueue locks and the priority-inheritance locks
+in the middle of an outermost RCU read-side critical section unless either
+(1)&nbsp;it releases them before exiting that same
+RCU read-side critical section, or
+(2)&nbsp;interrupts are disabled across
+that entire RCU read-side critical section.
+This same prohibition also applies (recursively!) to any lock that is acquired
+while holding any lock to which this prohibition applies.
+Adhering to this rule prevents preemptible RCU from invoking
+<tt>rcu_read_unlock_special()</tt> while either runqueue or
+priority-inheritance locks are held, thus avoiding deadlock.
+
+<p>
+Prior to v4.4, it was only necessary to disable preemption across
+RCU read-side critical sections that acquired scheduler locks.
+In v4.4, expedited grace periods started using IPIs, and these
+IPIs could force a <tt>rcu_read_unlock()</tt> to take the slowpath.
+Therefore, this expedited-grace-period change required disabling of
+interrupts, not just preemption.
+
+<p>
+For RCU's part, the preemptible-RCU <tt>rcu_read_unlock()</tt>
+implementation must be written carefully to avoid similar deadlocks.
+In particular, <tt>rcu_read_unlock()</tt> must tolerate an
+interrupt where the interrupt handler invokes both
+<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>.
+This possibility requires <tt>rcu_read_unlock()</tt> to use
+negative nesting levels to avoid destructive recursion via
+interrupt handler's use of RCU.
+
+<p>
+This pair of mutual scheduler-RCU requirements came as a
+<a href="https://lwn.net/Articles/453002/">complete surprise</a>.
+
+<p>
+As noted above, RCU makes use of kthreads, and it is necessary to
+avoid excessive CPU-time accumulation by these kthreads.
+This requirement was no surprise, but RCU's violation of it
+when running context-switch-heavy workloads when built with
+<tt>CONFIG_NO_HZ_FULL=y</tt>
+<a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>.
+RCU has made good progress towards meeting this requirement, even
+for context-switch-have <tt>CONFIG_NO_HZ_FULL=y</tt> workloads,
+but there is room for further improvement.
+
+<h3><a name="Tracing and RCU">Tracing and RCU</a></h3>
+
+<p>
+It is possible to use tracing on RCU code, but tracing itself
+uses RCU.
+For this reason, <tt>rcu_dereference_raw_notrace()</tt>
+is provided for use by tracing, which avoids the destructive
+recursion that could otherwise ensue.
+This API is also used by virtualization in some architectures,
+where RCU readers execute in environments in which tracing
+cannot be used.
+The tracing folks both located the requirement and provided the
+needed fix, so this surprise requirement was relatively painless.
+
+<h3><a name="Energy Efficiency">Energy Efficiency</a></h3>
+
+<p>
+Interrupting idle CPUs is considered socially unacceptable,
+especially by people with battery-powered embedded systems.
+RCU therefore conserves energy by detecting which CPUs are
+idle, including tracking CPUs that have been interrupted from idle.
+This is a large part of the energy-efficiency requirement,
+so I learned of this via an irate phone call.
+
+<p>
+Because RCU avoids interrupting idle CPUs, it is illegal to
+execute an RCU read-side critical section on an idle CPU.
+(Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat
+if you try it.)
+The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt>
+event tracing is provided to work around this restriction.
+In addition, <tt>rcu_is_watching()</tt> may be used to
+test whether or not it is currently legal to run RCU read-side
+critical sections on this CPU.
+I learned of the need for diagnostics on the one hand
+and <tt>RCU_NONIDLE()</tt> on the other while inspecting
+idle-loop code.
+Steven Rostedt supplied <tt>_rcuidle</tt> event tracing,
+which is used quite heavily in the idle loop.
+However, there are some restrictions on the code placed within
+<tt>RCU_NONIDLE()</tt>:
+
+<ol>
+<li> Blocking is prohibited.
+ In practice, this is not a serious restriction given that idle
+ tasks are prohibited from blocking to begin with.
+<li> Although nesting <tt>RCU_NONIDLE()</tt> is permitted, they cannot
+ nest indefinitely deeply.
+ However, given that they can be nested on the order of a million
+ deep, even on 32-bit systems, this should not be a serious
+ restriction.
+ This nesting limit would probably be reached long after the
+ compiler OOMed or the stack overflowed.
+<li> Any code path that enters <tt>RCU_NONIDLE()</tt> must sequence
+ out of that same <tt>RCU_NONIDLE()</tt>.
+ For example, the following is grossly illegal:
+
+ <blockquote>
+ <pre>
+ 1 RCU_NONIDLE({
+ 2 do_something();
+ 3 goto bad_idea; /* BUG!!! */
+ 4 do_something_else();});
+ 5 bad_idea:
+ </pre>
+ </blockquote>
+
+ <p>
+ It is just as illegal to transfer control into the middle of
+ <tt>RCU_NONIDLE()</tt>'s argument.
+ Yes, in theory, you could transfer in as long as you also
+ transferred out, but in practice you could also expect to get sharply
+ worded review comments.
+</ol>
+
+<p>
+It is similarly socially unacceptable to interrupt an
+<tt>nohz_full</tt> CPU running in userspace.
+RCU must therefore track <tt>nohz_full</tt> userspace
+execution.
+RCU must therefore be able to sample state at two points in
+time, and be able to determine whether or not some other CPU spent
+any time idle and/or executing in userspace.
+
+<p>
+These energy-efficiency requirements have proven quite difficult to
+understand and to meet, for example, there have been more than five
+clean-sheet rewrites of RCU's energy-efficiency code, the last of
+which was finally able to demonstrate
+<a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>.
+As noted earlier,
+I learned of many of these requirements via angry phone calls:
+Flaming me on the Linux-kernel mailing list was apparently not
+sufficient to fully vent their ire at RCU's energy-efficiency bugs!
+
+<h3><a name="Scheduling-Clock Interrupts and RCU">
+Scheduling-Clock Interrupts and RCU</a></h3>
+
+<p>
+The kernel transitions between in-kernel non-idle execution, userspace
+execution, and the idle loop.
+Depending on kernel configuration, RCU handles these states differently:
+
+<table border=3>
+<tr><th><tt>HZ</tt> Kconfig</th>
+ <th>In-Kernel</th>
+ <th>Usermode</th>
+ <th>Idle</th></tr>
+<tr><th align="left"><tt>HZ_PERIODIC</tt></th>
+ <td>Can rely on scheduling-clock interrupt.</td>
+ <td>Can rely on scheduling-clock interrupt and its
+ detection of interrupt from usermode.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td></tr>
+<tr><th align="left"><tt>NO_HZ_IDLE</tt></th>
+ <td>Can rely on scheduling-clock interrupt.</td>
+ <td>Can rely on scheduling-clock interrupt and its
+ detection of interrupt from usermode.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td></tr>
+<tr><th align="left"><tt>NO_HZ_FULL</tt></th>
+ <td>Can only sometimes rely on scheduling-clock interrupt.
+ In other cases, it is necessary to bound kernel execution
+ times and/or use IPIs.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td></tr>
+</table>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the
+ scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt>
+ and <tt>NO_HZ_IDLE</tt> do?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ Because, as a performance optimization, <tt>NO_HZ_FULL</tt>
+ does not necessarily re-enable the scheduling-clock interrupt
+ on entry to each and every system call.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+However, RCU must be reliably informed as to whether any given
+CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>,
+also whether that CPU is executing in usermode, as discussed
+<a href="#Energy Efficiency">earlier</a>.
+It also requires that the scheduling-clock interrupt be enabled when
+RCU needs it to be:
+
+<ol>
+<li> If a CPU is either idle or executing in usermode, and RCU believes
+ it is non-idle, the scheduling-clock tick had better be running.
+ Otherwise, you will get RCU CPU stall warnings. Or at best,
+ very long (11-second) grace periods, with a pointless IPI waking
+ the CPU from time to time.
+<li> If a CPU is in a portion of the kernel that executes RCU read-side
+ critical sections, and RCU believes this CPU to be idle, you will get
+ random memory corruption. <b>DON'T DO THIS!!!</b>
+
+ <br>This is one reason to test with lockdep, which will complain
+ about this sort of thing.
+<li> If a CPU is in a portion of the kernel that is absolutely
+ positively no-joking guaranteed to never execute any RCU read-side
+ critical sections, and RCU believes this CPU to to be idle,
+ no problem. This sort of thing is used by some architectures
+ for light-weight exception handlers, which can then avoid the
+ overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt>
+ at exception entry and exit, respectively.
+ Some go further and avoid the entireties of <tt>irq_enter()</tt>
+ and <tt>irq_exit()</tt>.
+
+ <br>Just make very sure you are running some of your tests with
+ <tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths
+ was in fact joking about not doing RCU read-side critical sections.
+<li> If a CPU is executing in the kernel with the scheduling-clock
+ interrupt disabled and RCU believes this CPU to be non-idle,
+ and if the CPU goes idle (from an RCU perspective) every few
+ jiffies, no problem. It is usually OK for there to be the
+ occasional gap between idle periods of up to a second or so.
+
+ <br>If the gap grows too long, you get RCU CPU stall warnings.
+<li> If a CPU is either idle or executing in usermode, and RCU believes
+ it to be idle, of course no problem.
+<li> If a CPU is executing in the kernel, the kernel code
+ path is passing through quiescent states at a reasonable
+ frequency (preferably about once per few jiffies, but the
+ occasional excursion to a second or so is usually OK) and the
+ scheduling-clock interrupt is enabled, of course no problem.
+
+ <br>If the gap between a successive pair of quiescent states grows
+ too long, you get RCU CPU stall warnings.
+</ol>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ But what if my driver has a hardware interrupt handler
+ that can run for many seconds?
+ I cannot invoke <tt>schedule()</tt> from an hardware
+ interrupt handler, after all!
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt>
+ every so often.
+ But given that long-running interrupt handlers can cause
+ other problems, not least for response time, shouldn't you
+ work to keep your interrupt handler's runtime within reasonable
+ bounds?
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+But as long as RCU is properly informed of kernel state transitions between
+in-kernel execution, usermode execution, and idle, and as long as the
+scheduling-clock interrupt is enabled when RCU needs it to be, you
+can rest assured that the bugs you encounter will be in some other
+part of RCU or some other part of the kernel!
+
+<h3><a name="Memory Efficiency">Memory Efficiency</a></h3>
+
+<p>
+Although small-memory non-realtime systems can simply use Tiny RCU,
+code size is only one aspect of memory efficiency.
+Another aspect is the size of the <tt>rcu_head</tt> structure
+used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>.
+Although this structure contains nothing more than a pair of pointers,
+it does appear in many RCU-protected data structures, including
+some that are size critical.
+The <tt>page</tt> structure is a case in point, as evidenced by
+the many occurrences of the <tt>union</tt> keyword within that structure.
+
+<p>
+This need for memory efficiency is one reason that RCU uses hand-crafted
+singly linked lists to track the <tt>rcu_head</tt> structures that
+are waiting for a grace period to elapse.
+It is also the reason why <tt>rcu_head</tt> structures do not contain
+debug information, such as fields tracking the file and line of the
+<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them.
+Although this information might appear in debug-only kernel builds at some
+point, in the meantime, the <tt>-&gt;func</tt> field will often provide
+the needed debug information.
+
+<p>
+However, in some cases, the need for memory efficiency leads to even
+more extreme measures.
+Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field
+shares storage with a great many other structures that are used at
+various points in the corresponding page's lifetime.
+In order to correctly resolve certain
+<a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>,
+the Linux kernel's memory-management subsystem needs a particular bit
+to remain zero during all phases of grace-period processing,
+and that bit happens to map to the bottom bit of the
+<tt>rcu_head</tt> structure's <tt>-&gt;next</tt> field.
+RCU makes this guarantee as long as <tt>call_rcu()</tt>
+is used to post the callback, as opposed to <tt>kfree_rcu()</tt>
+or some future &ldquo;lazy&rdquo;
+variant of <tt>call_rcu()</tt> that might one day be created for
+energy-efficiency purposes.
+
+<p>
+That said, there are limits.
+RCU requires that the <tt>rcu_head</tt> structure be aligned to a
+two-byte boundary, and passing a misaligned <tt>rcu_head</tt>
+structure to one of the <tt>call_rcu()</tt> family of functions
+will result in a splat.
+It is therefore necessary to exercise caution when packing
+structures containing fields of type <tt>rcu_head</tt>.
+Why not a four-byte or even eight-byte alignment requirement?
+Because the m68k architecture provides only two-byte alignment,
+and thus acts as alignment's least common denominator.
+
+<p>
+The reason for reserving the bottom bit of pointers to
+<tt>rcu_head</tt> structures is to leave the door open to
+&ldquo;lazy&rdquo; callbacks whose invocations can safely be deferred.
+Deferring invocation could potentially have energy-efficiency
+benefits, but only if the rate of non-lazy callbacks decreases
+significantly for some important workload.
+In the meantime, reserving the bottom bit keeps this option open
+in case it one day becomes useful.
+
+<h3><a name="Performance, Scalability, Response Time, and Reliability">
+Performance, Scalability, Response Time, and Reliability</a></h3>
+
+<p>
+Expanding on the
+<a href="#Performance and Scalability">earlier discussion</a>,
+RCU is used heavily by hot code paths in performance-critical
+portions of the Linux kernel's networking, security, virtualization,
+and scheduling code paths.
+RCU must therefore use efficient implementations, especially in its
+read-side primitives.
+To that end, it would be good if preemptible RCU's implementation
+of <tt>rcu_read_lock()</tt> could be inlined, however, doing
+this requires resolving <tt>#include</tt> issues with the
+<tt>task_struct</tt> structure.
+
+<p>
+The Linux kernel supports hardware configurations with up to
+4096 CPUs, which means that RCU must be extremely scalable.
+Algorithms that involve frequent acquisitions of global locks or
+frequent atomic operations on global variables simply cannot be
+tolerated within the RCU implementation.
+RCU therefore makes heavy use of a combining tree based on the
+<tt>rcu_node</tt> structure.
+RCU is required to tolerate all CPUs continuously invoking any
+combination of RCU's runtime primitives with minimal per-operation
+overhead.
+In fact, in many cases, increasing load must <i>decrease</i> the
+per-operation overhead, witness the batching optimizations for
+<tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>,
+<tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>.
+As a general rule, RCU must cheerfully accept whatever the
+rest of the Linux kernel decides to throw at it.
+
+<p>
+The Linux kernel is used for real-time workloads, especially
+in conjunction with the
+<a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>.
+The real-time-latency response requirements are such that the
+traditional approach of disabling preemption across RCU
+read-side critical sections is inappropriate.
+Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore
+use an RCU implementation that allows RCU read-side critical
+sections to be preempted.
+This requirement made its presence known after users made it
+clear that an earlier
+<a href="https://lwn.net/Articles/107930/">real-time patch</a>
+did not meet their needs, in conjunction with some
+<a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a>
+encountered by a very early version of the -rt patchset.
+
+<p>
+In addition, RCU must make do with a sub-100-microsecond real-time latency
+budget.
+In fact, on smaller systems with the -rt patchset, the Linux kernel
+provides sub-20-microsecond real-time latencies for the whole kernel,
+including RCU.
+RCU's scalability and latency must therefore be sufficient for
+these sorts of configurations.
+To my surprise, the sub-100-microsecond real-time latency budget
+<a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf">
+applies to even the largest systems [PDF]</a>,
+up to and including systems with 4096 CPUs.
+This real-time requirement motivated the grace-period kthread, which
+also simplified handling of a number of race conditions.
+
+<p>
+RCU must avoid degrading real-time response for CPU-bound threads, whether
+executing in usermode (which is one use case for
+<tt>CONFIG_NO_HZ_FULL=y</tt>) or in the kernel.
+That said, CPU-bound loops in the kernel must execute
+<tt>cond_resched()</tt> at least once per few tens of milliseconds
+in order to avoid receiving an IPI from RCU.
+
+<p>
+Finally, RCU's status as a synchronization primitive means that
+any RCU failure can result in arbitrary memory corruption that can be
+extremely difficult to debug.
+This means that RCU must be extremely reliable, which in
+practice also means that RCU must have an aggressive stress-test
+suite.
+This stress-test suite is called <tt>rcutorture</tt>.
+
+<p>
+Although the need for <tt>rcutorture</tt> was no surprise,
+the current immense popularity of the Linux kernel is posing
+interesting&mdash;and perhaps unprecedented&mdash;validation
+challenges.
+To see this, keep in mind that there are well over one billion
+instances of the Linux kernel running today, given Android
+smartphones, Linux-powered televisions, and servers.
+This number can be expected to increase sharply with the advent of
+the celebrated Internet of Things.
+
+<p>
+Suppose that RCU contains a race condition that manifests on average
+once per million years of runtime.
+This bug will be occurring about three times per <i>day</i> across
+the installed base.
+RCU could simply hide behind hardware error rates, given that no one
+should really expect their smartphone to last for a million years.
+However, anyone taking too much comfort from this thought should
+consider the fact that in most jurisdictions, a successful multi-year
+test of a given mechanism, which might include a Linux kernel,
+suffices for a number of types of safety-critical certifications.
+In fact, rumor has it that the Linux kernel is already being used
+in production for safety-critical applications.
+I don't know about you, but I would feel quite bad if a bug in RCU
+killed someone.
+Which might explain my recent focus on validation and verification.
+
+<h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2>
+
+<p>
+One of the more surprising things about RCU is that there are now
+no fewer than five <i>flavors</i>, or API families.
+In addition, the primary flavor that has been the sole focus up to
+this point has two different implementations, non-preemptible and
+preemptible.
+The other four flavors are listed below, with requirements for each
+described in a separate section.
+
+<ol>
+<li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor</a>
+<li> <a href="#Sched Flavor">Sched Flavor</a>
+<li> <a href="#Sleepable RCU">Sleepable RCU</a>
+<li> <a href="#Tasks RCU">Tasks RCU</a>
+<li> <a href="#Waiting for Multiple Grace Periods">
+ Waiting for Multiple Grace Periods</a>
+</ol>
+
+<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor</a></h3>
+
+<p>
+The softirq-disable (AKA &ldquo;bottom-half&rdquo;,
+hence the &ldquo;_bh&rdquo; abbreviations)
+flavor of RCU, or <i>RCU-bh</i>, was developed by
+Dipankar Sarma to provide a flavor of RCU that could withstand the
+network-based denial-of-service attacks researched by Robert
+Olsson.
+These attacks placed so much networking load on the system
+that some of the CPUs never exited softirq execution,
+which in turn prevented those CPUs from ever executing a context switch,
+which, in the RCU implementation of that time, prevented grace periods
+from ever ending.
+The result was an out-of-memory condition and a system hang.
+
+<p>
+The solution was the creation of RCU-bh, which does
+<tt>local_bh_disable()</tt>
+across its read-side critical sections, and which uses the transition
+from one type of softirq processing to another as a quiescent state
+in addition to context switch, idle, user mode, and offline.
+This means that RCU-bh grace periods can complete even when some of
+the CPUs execute in softirq indefinitely, thus allowing algorithms
+based on RCU-bh to withstand network-based denial-of-service attacks.
+
+<p>
+Because
+<tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt>
+disable and re-enable softirq handlers, any attempt to start a softirq
+handlers during the
+RCU-bh read-side critical section will be deferred.
+In this case, <tt>rcu_read_unlock_bh()</tt>
+will invoke softirq processing, which can take considerable time.
+One can of course argue that this softirq overhead should be associated
+with the code following the RCU-bh read-side critical section rather
+than <tt>rcu_read_unlock_bh()</tt>, but the fact
+is that most profiling tools cannot be expected to make this sort
+of fine distinction.
+For example, suppose that a three-millisecond-long RCU-bh read-side
+critical section executes during a time of heavy networking load.
+There will very likely be an attempt to invoke at least one softirq
+handler during that three milliseconds, but any such invocation will
+be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>.
+This can of course make it appear at first glance as if
+<tt>rcu_read_unlock_bh()</tt> was executing very slowly.
+
+<p>
+The
+<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a>
+includes
+<tt>rcu_read_lock_bh()</tt>,
+<tt>rcu_read_unlock_bh()</tt>,
+<tt>rcu_dereference_bh()</tt>,
+<tt>rcu_dereference_bh_check()</tt>,
+<tt>synchronize_rcu_bh()</tt>,
+<tt>synchronize_rcu_bh_expedited()</tt>,
+<tt>call_rcu_bh()</tt>,
+<tt>rcu_barrier_bh()</tt>, and
+<tt>rcu_read_lock_bh_held()</tt>.
+
+<h3><a name="Sched Flavor">Sched Flavor</a></h3>
+
+<p>
+Before preemptible RCU, waiting for an RCU grace period had the
+side effect of also waiting for all pre-existing interrupt
+and NMI handlers.
+However, there are legitimate preemptible-RCU implementations that
+do not have this property, given that any point in the code outside
+of an RCU read-side critical section can be a quiescent state.
+Therefore, <i>RCU-sched</i> was created, which follows &ldquo;classic&rdquo;
+RCU in that an RCU-sched grace period waits for for pre-existing
+interrupt and NMI handlers.
+In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched
+APIs have identical implementations, while kernels built with
+<tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each.
+
+<p>
+Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels,
+<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt>
+disable and re-enable preemption, respectively.
+This means that if there was a preemption attempt during the
+RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt>
+will enter the scheduler, with all the latency and overhead entailed.
+Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look
+as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly.
+However, the highest-priority task won't be preempted, so that task
+will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations.
+
+<p>
+The
+<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a>
+includes
+<tt>rcu_read_lock_sched()</tt>,
+<tt>rcu_read_unlock_sched()</tt>,
+<tt>rcu_read_lock_sched_notrace()</tt>,
+<tt>rcu_read_unlock_sched_notrace()</tt>,
+<tt>rcu_dereference_sched()</tt>,
+<tt>rcu_dereference_sched_check()</tt>,
+<tt>synchronize_sched()</tt>,
+<tt>synchronize_rcu_sched_expedited()</tt>,
+<tt>call_rcu_sched()</tt>,
+<tt>rcu_barrier_sched()</tt>, and
+<tt>rcu_read_lock_sched_held()</tt>.
+However, anything that disables preemption also marks an RCU-sched
+read-side critical section, including
+<tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>,
+<tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>,
+and so on.
+
+<h3><a name="Sleepable RCU">Sleepable RCU</a></h3>
+
+<p>
+For well over a decade, someone saying &ldquo;I need to block within
+an RCU read-side critical section&rdquo; was a reliable indication
+that this someone did not understand RCU.
+After all, if you are always blocking in an RCU read-side critical
+section, you can probably afford to use a higher-overhead synchronization
+mechanism.
+However, that changed with the advent of the Linux kernel's notifiers,
+whose RCU read-side critical
+sections almost never sleep, but sometimes need to.
+This resulted in the introduction of
+<a href="https://lwn.net/Articles/202847/">sleepable RCU</a>,
+or <i>SRCU</i>.
+
+<p>
+SRCU allows different domains to be defined, with each such domain
+defined by an instance of an <tt>srcu_struct</tt> structure.
+A pointer to this structure must be passed in to each SRCU function,
+for example, <tt>synchronize_srcu(&amp;ss)</tt>, where
+<tt>ss</tt> is the <tt>srcu_struct</tt> structure.
+The key benefit of these domains is that a slow SRCU reader in one
+domain does not delay an SRCU grace period in some other domain.
+That said, one consequence of these domains is that read-side code
+must pass a &ldquo;cookie&rdquo; from <tt>srcu_read_lock()</tt>
+to <tt>srcu_read_unlock()</tt>, for example, as follows:
+
+<blockquote>
+<pre>
+ 1 int idx;
+ 2
+ 3 idx = srcu_read_lock(&amp;ss);
+ 4 do_something();
+ 5 srcu_read_unlock(&amp;ss, idx);
+</pre>
+</blockquote>
+
+<p>
+As noted above, it is legal to block within SRCU read-side critical sections,
+however, with great power comes great responsibility.
+If you block forever in one of a given domain's SRCU read-side critical
+sections, then that domain's grace periods will also be blocked forever.
+Of course, one good way to block forever is to deadlock, which can
+happen if any operation in a given domain's SRCU read-side critical
+section can block waiting, either directly or indirectly, for that domain's
+grace period to elapse.
+For example, this results in a self-deadlock:
+
+<blockquote>
+<pre>
+ 1 int idx;
+ 2
+ 3 idx = srcu_read_lock(&amp;ss);
+ 4 do_something();
+ 5 synchronize_srcu(&amp;ss);
+ 6 srcu_read_unlock(&amp;ss, idx);
+</pre>
+</blockquote>
+
+<p>
+However, if line&nbsp;5 acquired a mutex that was held across
+a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>,
+deadlock would still be possible.
+Furthermore, if line&nbsp;5 acquired a mutex that was held across
+a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>,
+and if an <tt>ss1</tt>-domain SRCU read-side critical section
+acquired another mutex that was held across as <tt>ss</tt>-domain
+<tt>synchronize_srcu()</tt>,
+deadlock would again be possible.
+Such a deadlock cycle could extend across an arbitrarily large number
+of different SRCU domains.
+Again, with great power comes great responsibility.
+
+<p>
+Unlike the other RCU flavors, SRCU read-side critical sections can
+run on idle and even offline CPUs.
+This ability requires that <tt>srcu_read_lock()</tt> and
+<tt>srcu_read_unlock()</tt> contain memory barriers, which means
+that SRCU readers will run a bit slower than would RCU readers.
+It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt>
+API, which, in combination with <tt>srcu_read_unlock()</tt>,
+guarantees a full memory barrier.
+
+<p>
+Also unlike other RCU flavors, SRCU's callbacks-wait function
+<tt>srcu_barrier()</tt> may be invoked from CPU-hotplug notifiers,
+though this is not necessarily a good idea.
+The reason that this is possible is that SRCU is insensitive
+to whether or not a CPU is online, which means that <tt>srcu_barrier()</tt>
+need not exclude CPU-hotplug operations.
+
+<p>
+SRCU also differs from other RCU flavors in that SRCU's expedited and
+non-expedited grace periods are implemented by the same mechanism.
+This means that in the current SRCU implementation, expediting a
+future grace period has the side effect of expediting all prior
+grace periods that have not yet completed.
+(But please note that this is a property of the current implementation,
+not necessarily of future implementations.)
+In addition, if SRCU has been idle for longer than the interval
+specified by the <tt>srcutree.exp_holdoff</tt> kernel boot parameter
+(25&nbsp;microseconds by default),
+and if a <tt>synchronize_srcu()</tt> invocation ends this idle period,
+that invocation will be automatically expedited.
+
+<p>
+As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating
+a locking bottleneck present in prior kernel versions.
+Although this will allow users to put much heavier stress on
+<tt>call_srcu()</tt>, it is important to note that SRCU does not
+yet take any special steps to deal with callback flooding.
+So if you are posting (say) 10,000 SRCU callbacks per second per CPU,
+you are probably totally OK, but if you intend to post (say) 1,000,000
+SRCU callbacks per second per CPU, please run some tests first.
+SRCU just might need a few adjustment to deal with that sort of load.
+Of course, your mileage may vary based on the speed of your CPUs and
+the size of your memory.
+
+<p>
+The
+<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a>
+includes
+<tt>srcu_read_lock()</tt>,
+<tt>srcu_read_unlock()</tt>,
+<tt>srcu_dereference()</tt>,
+<tt>srcu_dereference_check()</tt>,
+<tt>synchronize_srcu()</tt>,
+<tt>synchronize_srcu_expedited()</tt>,
+<tt>call_srcu()</tt>,
+<tt>srcu_barrier()</tt>, and
+<tt>srcu_read_lock_held()</tt>.
+It also includes
+<tt>DEFINE_SRCU()</tt>,
+<tt>DEFINE_STATIC_SRCU()</tt>, and
+<tt>init_srcu_struct()</tt>
+APIs for defining and initializing <tt>srcu_struct</tt> structures.
+
+<h3><a name="Tasks RCU">Tasks RCU</a></h3>
+
+<p>
+Some forms of tracing use &ldquo;trampolines&rdquo; to handle the
+binary rewriting required to install different types of probes.
+It would be good to be able to free old trampolines, which sounds
+like a job for some form of RCU.
+However, because it is necessary to be able to install a trace
+anywhere in the code, it is not possible to use read-side markers
+such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>.
+In addition, it does not work to have these markers in the trampoline
+itself, because there would need to be instructions following
+<tt>rcu_read_unlock()</tt>.
+Although <tt>synchronize_rcu()</tt> would guarantee that execution
+reached the <tt>rcu_read_unlock()</tt>, it would not be able to
+guarantee that execution had completely left the trampoline.
+
+<p>
+The solution, in the form of
+<a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>,
+is to have implicit
+read-side critical sections that are delimited by voluntary context
+switches, that is, calls to <tt>schedule()</tt>,
+<tt>cond_resched()</tt>, and
+<tt>synchronize_rcu_tasks()</tt>.
+In addition, transitions to and from userspace execution also delimit
+tasks-RCU read-side critical sections.
+
+<p>
+The tasks-RCU API is quite compact, consisting only of
+<tt>call_rcu_tasks()</tt>,
+<tt>synchronize_rcu_tasks()</tt>, and
+<tt>rcu_barrier_tasks()</tt>.
+
+<h3><a name="Waiting for Multiple Grace Periods">
+Waiting for Multiple Grace Periods</a></h3>
+
+<p>
+Perhaps you have an RCU protected data structure that is accessed from
+RCU read-side critical sections, from softirq handlers, and from
+hardware interrupt handlers.
+That is three flavors of RCU, the normal flavor, the bottom-half flavor,
+and the sched flavor.
+How to wait for a compound grace period?
+
+<p>
+The best approach is usually to &ldquo;just say no!&rdquo; and
+insert <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>
+around each RCU read-side critical section, regardless of what
+environment it happens to be in.
+But suppose that some of the RCU read-side critical sections are
+on extremely hot code paths, and that use of <tt>CONFIG_PREEMPT=n</tt>
+is not a viable option, so that <tt>rcu_read_lock()</tt> and
+<tt>rcu_read_unlock()</tt> are not free.
+What then?
+
+<p>
+You <i>could</i> wait on all three grace periods in succession, as follows:
+
+<blockquote>
+<pre>
+ 1 synchronize_rcu();
+ 2 synchronize_rcu_bh();
+ 3 synchronize_sched();
+</pre>
+</blockquote>
+
+<p>
+This works, but triples the update-side latency penalty.
+In cases where this is not acceptable, <tt>synchronize_rcu_mult()</tt>
+may be used to wait on all three flavors of grace period concurrently:
+
+<blockquote>
+<pre>
+ 1 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched);
+</pre>
+</blockquote>
+
+<p>
+But what if it is necessary to also wait on SRCU?
+This can be done as follows:
+
+<blockquote>
+<pre>
+ 1 static void call_my_srcu(struct rcu_head *head,
+ 2 void (*func)(struct rcu_head *head))
+ 3 {
+ 4 call_srcu(&amp;my_srcu, head, func);
+ 5 }
+ 6
+ 7 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched, call_my_srcu);
+</pre>
+</blockquote>
+
+<p>
+If you needed to wait on multiple different flavors of SRCU
+(but why???), you would need to create a wrapper function resembling
+<tt>call_my_srcu()</tt> for each SRCU flavor.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ But what if I need to wait for multiple RCU flavors, but I also need
+ the grace periods to be expedited?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ If you are using expedited grace periods, there should be less penalty
+ for waiting on them in succession.
+ But if that is nevertheless a problem, you can use workqueues
+ or multiple kthreads to wait on the various expedited grace
+ periods concurrently.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+Again, it is usually better to adjust the RCU read-side critical sections
+to use a single flavor of RCU, but when this is not feasible, you can use
+<tt>synchronize_rcu_mult()</tt>.
+
+<h2><a name="Possible Future Changes">Possible Future Changes</a></h2>
+
+<p>
+One of the tricks that RCU uses to attain update-side scalability is
+to increase grace-period latency with increasing numbers of CPUs.
+If this becomes a serious problem, it will be necessary to rework the
+grace-period state machine so as to avoid the need for the additional
+latency.
+
+<p>
+Expedited grace periods scan the CPUs, so their latency and overhead
+increases with increasing numbers of CPUs.
+If this becomes a serious problem on large systems, it will be necessary
+to do some redesign to avoid this scalability problem.
+
+<p>
+RCU disables CPU hotplug in a few places, perhaps most notably in the
+<tt>rcu_barrier()</tt> operations.
+If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug
+notifiers, it will be necessary to avoid disabling CPU hotplug.
+This would introduce some complexity, so there had better be a <i>very</i>
+good reason.
+
+<p>
+The tradeoff between grace-period latency on the one hand and interruptions
+of other CPUs on the other hand may need to be re-examined.
+The desire is of course for zero grace-period latency as well as zero
+interprocessor interrupts undertaken during an expedited grace period
+operation.
+While this ideal is unlikely to be achievable, it is quite possible that
+further improvements can be made.
+
+<p>
+The multiprocessor implementations of RCU use a combining tree that
+groups CPUs so as to reduce lock contention and increase cache locality.
+However, this combining tree does not spread its memory across NUMA
+nodes nor does it align the CPU groups with hardware features such
+as sockets or cores.
+Such spreading and alignment is currently believed to be unnecessary
+because the hotpath read-side primitives do not access the combining
+tree, nor does <tt>call_rcu()</tt> in the common case.
+If you believe that your architecture needs such spreading and alignment,
+then your architecture should also benefit from the
+<tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set
+to the number of CPUs in a socket, NUMA node, or whatever.
+If the number of CPUs is too large, use a fraction of the number of
+CPUs.
+If the number of CPUs is a large prime number, well, that certainly
+is an &ldquo;interesting&rdquo; architectural choice!
+More flexible arrangements might be considered, but only if
+<tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only
+if the inadequacy has been demonstrated by a carefully run and
+realistic system-level workload.
+
+<p>
+Please note that arrangements that require RCU to remap CPU numbers will
+require extremely good demonstration of need and full exploration of
+alternatives.
+
+<p>
+There is an embarrassingly large number of flavors of RCU, and this
+number has been increasing over time.
+Perhaps it will be possible to combine some at some future date.
+
+<p>
+RCU's various kthreads are reasonably recent additions.
+It is quite likely that adjustments will be required to more gracefully
+handle extreme loads.
+It might also be necessary to be able to relate CPU utilization by
+RCU's kthreads and softirq handlers to the code that instigated this
+CPU utilization.
+For example, RCU callback overhead might be charged back to the
+originating <tt>call_rcu()</tt> instance, though probably not
+in production kernels.
+
+<h2><a name="Summary">Summary</a></h2>
+
+<p>
+This document has presented more than two decade's worth of RCU
+requirements.
+Given that the requirements keep changing, this will not be the last
+word on this subject, but at least it serves to get an important
+subset of the requirements set forth.
+
+<h2><a name="Acknowledgments">Acknowledgments</a></h2>
+
+I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar,
+Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and
+Andy Lutomirski for their help in rendering
+this article human readable, and to Michelle Rankin for her support
+of this effort.
+Other contributions are acknowledged in the Linux kernel's git archive.
+
+</body></html>