summaryrefslogtreecommitdiffstats
path: root/Documentation/RCU/Design/Requirements/Requirements.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/RCU/Design/Requirements/Requirements.rst')
-rw-r--r--Documentation/RCU/Design/Requirements/Requirements.rst2680
1 files changed, 2680 insertions, 0 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
new file mode 100644
index 000000000..1ae79a10a
--- /dev/null
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -0,0 +1,2680 @@
+=================================
+A Tour Through RCU's Requirements
+=================================
+
+Copyright IBM Corporation, 2015
+
+Author: Paul E. McKenney
+
+The initial version of this document appeared in the
+`LWN <https://lwn.net/>`_ on those articles:
+`part 1 <https://lwn.net/Articles/652156/>`_,
+`part 2 <https://lwn.net/Articles/652677/>`_, and
+`part 3 <https://lwn.net/Articles/653326/>`_.
+
+Introduction
+------------
+
+Read-copy update (RCU) is a synchronization mechanism that is often used
+as a replacement for reader-writer locking. RCU is unusual in that
+updaters do not block readers, which means that RCU's read-side
+primitives can be exceedingly fast and scalable. In addition, updaters
+can make useful forward progress concurrently with readers. However, all
+this concurrency between RCU readers and updaters does raise the
+question of exactly what RCU readers are doing, which in turn raises the
+question of exactly what RCU's requirements are.
+
+This document therefore summarizes RCU's requirements, and can be
+thought of as an informal, high-level specification for RCU. It is
+important to understand that RCU's specification is primarily empirical
+in nature; in fact, I learned about many of these requirements the hard
+way. This situation might cause some consternation, however, not only
+has this learning process been a lot of fun, but it has also been a
+great privilege to work with so many people willing to apply
+technologies in interesting new ways.
+
+All that aside, here are the categories of currently known RCU
+requirements:
+
+#. `Fundamental Requirements`_
+#. `Fundamental Non-Requirements`_
+#. `Parallelism Facts of Life`_
+#. `Quality-of-Implementation Requirements`_
+#. `Linux Kernel Complications`_
+#. `Software-Engineering Requirements`_
+#. `Other RCU Flavors`_
+#. `Possible Future Changes`_
+
+This is followed by a `summary <#Summary>`__, however, the answers to
+each quick quiz immediately follows the quiz. Select the big white space
+with your mouse to see the answer.
+
+Fundamental Requirements
+------------------------
+
+RCU's fundamental requirements are the closest thing RCU has to hard
+mathematical requirements. These are:
+
+#. `Grace-Period Guarantee`_
+#. `Publish/Subscribe Guarantee`_
+#. `Memory-Barrier Guarantees`_
+#. `RCU Primitives Guaranteed to Execute Unconditionally`_
+#. `Guaranteed Read-to-Write Upgrade`_
+
+Grace-Period Guarantee
+~~~~~~~~~~~~~~~~~~~~~~
+
+RCU's grace-period guarantee is unusual in being premeditated: Jack
+Slingwine and I had this guarantee firmly in mind when we started work
+on RCU (then called “rclock”) in the early 1990s. That said, the past
+two decades of experience with RCU have produced a much more detailed
+understanding of this guarantee.
+
+RCU's grace-period guarantee allows updaters to wait for the completion
+of all pre-existing RCU read-side critical sections. An RCU read-side
+critical section begins with the marker ``rcu_read_lock()`` and ends
+with the marker ``rcu_read_unlock()``. These markers may be nested, and
+RCU treats a nested set as one big RCU read-side critical section.
+Production-quality implementations of ``rcu_read_lock()`` and
+``rcu_read_unlock()`` are extremely lightweight, and in fact have
+exactly zero overhead in Linux kernels built for production use with
+``CONFIG_PREEMPT=n``.
+
+This guarantee allows ordering to be enforced with extremely low
+overhead to readers, for example:
+
+ ::
+
+ 1 int x, y;
+ 2
+ 3 void thread0(void)
+ 4 {
+ 5 rcu_read_lock();
+ 6 r1 = READ_ONCE(x);
+ 7 r2 = READ_ONCE(y);
+ 8 rcu_read_unlock();
+ 9 }
+ 10
+ 11 void thread1(void)
+ 12 {
+ 13 WRITE_ONCE(x, 1);
+ 14 synchronize_rcu();
+ 15 WRITE_ONCE(y, 1);
+ 16 }
+
+Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing
+readers, any instance of ``thread0()`` that loads a value of zero from
+``x`` must complete before ``thread1()`` stores to ``y``, so that
+instance must also load a value of zero from ``y``. Similarly, any
+instance of ``thread0()`` that loads a value of one from ``y`` must have
+started after the ``synchronize_rcu()`` started, and must therefore also
+load a value of one from ``x``. Therefore, the outcome:
+
+ ::
+
+ (r1 == 0 && r2 == 1)
+
+cannot happen.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Wait a minute! You said that updaters can make useful forward |
+| progress concurrently with readers, but pre-existing readers will |
+| block ``synchronize_rcu()``!!! |
+| Just who are you trying to fool??? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| First, if updaters do not wish to be blocked by readers, they can use |
+| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later. |
+| Second, even when using ``synchronize_rcu()``, the other update-side |
+| code does run concurrently with readers, whether pre-existing or not. |
++-----------------------------------------------------------------------+
+
+This scenario resembles one of the first uses of RCU in
+`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a
+distributed lock manager's transition into a state suitable for handling
+recovery from node failure, more or less as follows:
+
+ ::
+
+ 1 #define STATE_NORMAL 0
+ 2 #define STATE_WANT_RECOVERY 1
+ 3 #define STATE_RECOVERING 2
+ 4 #define STATE_WANT_NORMAL 3
+ 5
+ 6 int state = STATE_NORMAL;
+ 7
+ 8 void do_something_dlm(void)
+ 9 {
+ 10 int state_snap;
+ 11
+ 12 rcu_read_lock();
+ 13 state_snap = READ_ONCE(state);
+ 14 if (state_snap == STATE_NORMAL)
+ 15 do_something();
+ 16 else
+ 17 do_something_carefully();
+ 18 rcu_read_unlock();
+ 19 }
+ 20
+ 21 void start_recovery(void)
+ 22 {
+ 23 WRITE_ONCE(state, STATE_WANT_RECOVERY);
+ 24 synchronize_rcu();
+ 25 WRITE_ONCE(state, STATE_RECOVERING);
+ 26 recovery();
+ 27 WRITE_ONCE(state, STATE_WANT_NORMAL);
+ 28 synchronize_rcu();
+ 29 WRITE_ONCE(state, STATE_NORMAL);
+ 30 }
+
+The RCU read-side critical section in ``do_something_dlm()`` works with
+the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that
+``do_something()`` never runs concurrently with ``recovery()``, but with
+little or no synchronization overhead in ``do_something_dlm()``.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Why is the ``synchronize_rcu()`` on line 28 needed? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Without that extra grace period, memory reordering could result in |
+| ``do_something_dlm()`` executing ``do_something()`` concurrently with |
+| the last bits of ``recovery()``. |
++-----------------------------------------------------------------------+
+
+In order to avoid fatal problems such as deadlocks, an RCU read-side
+critical section must not contain calls to ``synchronize_rcu()``.
+Similarly, an RCU read-side critical section must not contain anything
+that waits, directly or indirectly, on completion of an invocation of
+``synchronize_rcu()``.
+
+Although RCU's grace-period guarantee is useful in and of itself, with
+`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would
+be good to be able to use RCU to coordinate read-side access to linked
+data structures. For this, the grace-period guarantee is not sufficient,
+as can be seen in function ``add_gp_buggy()`` below. We will look at the
+reader's code later, but in the meantime, just think of the reader as
+locklessly picking up the ``gp`` pointer, and, if the value loaded is
+non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.
+
+ ::
+
+ 1 bool add_gp_buggy(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+ 5 return -ENOMEM;
+ 6 spin_lock(&gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 p->a = a;
+ 12 p->b = a;
+ 13 gp = p; /* ORDERING BUG */
+ 14 spin_unlock(&gp_lock);
+ 15 return true;
+ 16 }
+
+The problem is that both the compiler and weakly ordered CPUs are within
+their rights to reorder this code as follows:
+
+ ::
+
+ 1 bool add_gp_buggy_optimized(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+ 5 return -ENOMEM;
+ 6 spin_lock(&gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 gp = p; /* ORDERING BUG */
+ 12 p->a = a;
+ 13 p->b = a;
+ 14 spin_unlock(&gp_lock);
+ 15 return true;
+ 16 }
+
+If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized``
+executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
+And this is but one of many ways in which compiler and hardware
+optimizations could cause trouble. Therefore, we clearly need some way
+to prevent the compiler and the CPU from reordering in this manner,
+which brings us to the publish-subscribe guarantee discussed in the next
+section.
+
+Publish/Subscribe Guarantee
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RCU's publish-subscribe guarantee allows data to be inserted into a
+linked data structure without disrupting RCU readers. The updater uses
+``rcu_assign_pointer()`` to insert the new data, and readers use
+``rcu_dereference()`` to access data, whether new or old. The following
+shows an example of insertion:
+
+ ::
+
+ 1 bool add_gp(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+ 5 return -ENOMEM;
+ 6 spin_lock(&gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 p->a = a;
+ 12 p->b = a;
+ 13 rcu_assign_pointer(gp, p);
+ 14 spin_unlock(&gp_lock);
+ 15 return true;
+ 16 }
+
+The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a
+simple assignment statement, but also guarantees that its assignment
+will happen after the two assignments in lines 11 and 12, similar to the
+C11 ``memory_order_release`` store operation. It also prevents any
+number of “interesting” compiler optimizations, for example, the use of
+``gp`` as a scratch location immediately preceding the assignment.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But ``rcu_assign_pointer()`` does nothing to prevent the two |
+| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
+| also cause problems? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| No, it cannot. The readers cannot see either of these two fields |
+| until the assignment to ``gp``, by which time both fields are fully |
+| initialized. So reordering the assignments to ``p->a`` and ``p->b`` |
+| cannot possibly cause any problems. |
++-----------------------------------------------------------------------+
+
+It is tempting to assume that the reader need not do anything special to
+control its accesses to the RCU-protected data, as shown in
+``do_something_gp_buggy()`` below:
+
+ ::
+
+ 1 bool do_something_gp_buggy(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 p = gp; /* OPTIMIZATIONS GALORE!!! */
+ 5 if (p) {
+ 6 do_something(p->a, p->b);
+ 7 rcu_read_unlock();
+ 8 return true;
+ 9 }
+ 10 rcu_read_unlock();
+ 11 return false;
+ 12 }
+
+However, this temptation must be resisted because there are a
+surprisingly large number of ways that the compiler (to say nothing of
+`DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__)
+can trip this code up. For but one example, if the compiler were short
+of registers, it might choose to refetch from ``gp`` rather than keeping
+a separate copy in ``p`` as follows:
+
+ ::
+
+ 1 bool do_something_gp_buggy_optimized(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */
+ 5 do_something(gp->a, gp->b);
+ 6 rcu_read_unlock();
+ 7 return true;
+ 8 }
+ 9 rcu_read_unlock();
+ 10 return false;
+ 11 }
+
+If this function ran concurrently with a series of updates that replaced
+the current structure with a new one, the fetches of ``gp->a`` and
+``gp->b`` might well come from two different structures, which could
+cause serious confusion. To prevent this (and much else besides),
+``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``:
+
+ ::
+
+ 1 bool do_something_gp(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 p = rcu_dereference(gp);
+ 5 if (p) {
+ 6 do_something(p->a, p->b);
+ 7 rcu_read_unlock();
+ 8 return true;
+ 9 }
+ 10 rcu_read_unlock();
+ 11 return false;
+ 12 }
+
+The ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory
+barriers in the Linux kernel. Should a `high-quality implementation of
+C11 ``memory_order_consume``
+[PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__
+ever appear, then ``rcu_dereference()`` could be implemented as a
+``memory_order_consume`` load. Regardless of the exact implementation, a
+pointer fetched by ``rcu_dereference()`` may not be used outside of the
+outermost RCU read-side critical section containing that
+``rcu_dereference()``, unless protection of the corresponding data
+element has been passed from RCU to some other synchronization
+mechanism, most commonly locking or `reference
+counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__.
+
+In short, updaters use ``rcu_assign_pointer()`` and readers use
+``rcu_dereference()``, and these two RCU API elements work together to
+ensure that readers have a consistent view of newly added data elements.
+
+Of course, it is also necessary to remove elements from RCU-protected
+data structures, for example, using the following process:
+
+#. Remove the data element from the enclosing structure.
+#. Wait for all pre-existing RCU read-side critical sections to complete
+ (because only pre-existing readers can possibly have a reference to
+ the newly removed data element).
+#. At this point, only the updater has a reference to the newly removed
+ data element, so it can safely reclaim the data element, for example,
+ by passing it to ``kfree()``.
+
+This process is implemented by ``remove_gp_synchronous()``:
+
+ ::
+
+ 1 bool remove_gp_synchronous(void)
+ 2 {
+ 3 struct foo *p;
+ 4
+ 5 spin_lock(&gp_lock);
+ 6 p = rcu_access_pointer(gp);
+ 7 if (!p) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 rcu_assign_pointer(gp, NULL);
+ 12 spin_unlock(&gp_lock);
+ 13 synchronize_rcu();
+ 14 kfree(p);
+ 15 return true;
+ 16 }
+
+This function is straightforward, with line 13 waiting for a grace
+period before line 14 frees the old data element. This waiting ensures
+that readers will reach line 7 of ``do_something_gp()`` before the data
+element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on
+line 6 is similar to ``rcu_dereference()``, except that:
+
+#. The value returned by ``rcu_access_pointer()`` cannot be
+ dereferenced. If you want to access the value pointed to as well as
+ the pointer itself, use ``rcu_dereference()`` instead of
+ ``rcu_access_pointer()``.
+#. The call to ``rcu_access_pointer()`` need not be protected. In
+ contrast, ``rcu_dereference()`` must either be within an RCU
+ read-side critical section or in a code segment where the pointer
+ cannot change, for example, in code protected by the corresponding
+ update-side lock.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``, |
+| what destructive optimizations might the compiler make use of? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Let's start with what happens to ``do_something_gp()`` if it fails to |
+| use ``rcu_dereference()``. It could reuse a value formerly fetched |
+| from this same pointer. It could also fetch the pointer from ``gp`` |
+| in a byte-at-a-time manner, resulting in *load tearing*, in turn |
+| resulting a bytewise mash-up of two distinct pointer values. It might |
+| even use value-speculation optimizations, where it makes a wrong |
+| guess, but by the time it gets around to checking the value, an |
+| update has changed the pointer to match the wrong guess. Too bad |
+| about any dereferences that returned pre-initialization garbage in |
+| the meantime! |
+| For ``remove_gp_synchronous()``, as long as all modifications to |
+| ``gp`` are carried out while holding ``gp_lock``, the above |
+| optimizations are harmless. However, ``sparse`` will complain if you |
+| define ``gp`` with ``__rcu`` and then access it without using either |
+| ``rcu_access_pointer()`` or ``rcu_dereference()``. |
++-----------------------------------------------------------------------+
+
+In short, RCU's publish-subscribe guarantee is provided by the
+combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This
+guarantee allows data elements to be safely added to RCU-protected
+linked data structures without disrupting RCU readers. This guarantee
+can be used in combination with the grace-period guarantee to also allow
+data elements to be removed from RCU-protected linked data structures,
+again without disrupting RCU readers.
+
+This guarantee was only partially premeditated. DYNIX/ptx used an
+explicit memory barrier for publication, but had nothing resembling
+``rcu_dereference()`` for subscription, nor did it have anything
+resembling the dependency-ordering barrier that was later subsumed
+into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The
+need for these operations made itself known quite suddenly at a
+late-1990s meeting with the DEC Alpha architects, back in the days when
+DEC was still a free-standing company. It took the Alpha architects a
+good hour to convince me that any sort of barrier would ever be needed,
+and it then took me a good *two* hours to convince them that their
+documentation did not make this point clear. More recent work with the C
+and C++ standards committees have provided much education on tricks and
+traps from the compiler. In short, compilers were much less tricky in
+the early 1990s, but in 2015, don't even think about omitting
+``rcu_dereference()``!
+
+Memory-Barrier Guarantees
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The previous section's simple linked-data-structure scenario clearly
+demonstrates the need for RCU's stringent memory-ordering guarantees on
+systems with more than one CPU:
+
+#. Each CPU that has an RCU read-side critical section that begins
+ before ``synchronize_rcu()`` starts is guaranteed to execute a full
+ memory barrier between the time that the RCU read-side critical
+ section ends and the time that ``synchronize_rcu()`` returns. Without
+ this guarantee, a pre-existing RCU read-side critical section might
+ hold a reference to the newly removed ``struct foo`` after the
+ ``kfree()`` on line 14 of ``remove_gp_synchronous()``.
+#. Each CPU that has an RCU read-side critical section that ends after
+ ``synchronize_rcu()`` returns is guaranteed to execute a full memory
+ barrier between the time that ``synchronize_rcu()`` begins and the
+ time that the RCU read-side critical section begins. Without this
+ guarantee, a later RCU read-side critical section running after the
+ ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run
+ ``do_something_gp()`` and find the newly deleted ``struct foo``.
+#. If the task invoking ``synchronize_rcu()`` remains on a given CPU,
+ then that CPU is guaranteed to execute a full memory barrier sometime
+ during the execution of ``synchronize_rcu()``. This guarantee ensures
+ that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really
+ does execute after the removal on line 11.
+#. If the task invoking ``synchronize_rcu()`` migrates among a group of
+ CPUs during that invocation, then each of the CPUs in that group is
+ guaranteed to execute a full memory barrier sometime during the
+ execution of ``synchronize_rcu()``. This guarantee also ensures that
+ the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does
+ execute after the removal on line 11, but also in the case where the
+ thread executing the ``synchronize_rcu()`` migrates in the meantime.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Given that multiple CPUs can start RCU read-side critical sections at |
+| any time without any ordering whatsoever, how can RCU possibly tell |
+| whether or not a given RCU read-side critical section starts before a |
+| given instance of ``synchronize_rcu()``? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| If RCU cannot tell whether or not a given RCU read-side critical |
+| section starts before a given instance of ``synchronize_rcu()``, then |
+| it must assume that the RCU read-side critical section started first. |
+| In other words, a given instance of ``synchronize_rcu()`` can avoid |
+| waiting on a given RCU read-side critical section only if it can |
+| prove that ``synchronize_rcu()`` started first. |
+| A related question is “When ``rcu_read_lock()`` doesn't generate any |
+| code, why does it matter how it relates to a grace period?” The |
+| answer is that it is not the relationship of ``rcu_read_lock()`` |
+| itself that is important, but rather the relationship of the code |
+| within the enclosed RCU read-side critical section to the code |
+| preceding and following the grace period. If we take this viewpoint, |
+| then a given RCU read-side critical section begins before a given |
+| grace period when some access preceding the grace period observes the |
+| effect of some access within the critical section, in which case none |
+| of the accesses within the critical section may observe the effects |
+| of any access following the grace period. |
+| |
+| As of late 2016, mathematical models of RCU take this viewpoint, for |
+| example, see slides 62 and 63 of the `2016 LinuxCon |
+| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 |
+| 6.10.04c.LCE.pdf>`__ |
+| presentation. |
++-----------------------------------------------------------------------+
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| The first and second guarantees require unbelievably strict ordering! |
+| Are all these memory barriers *really* required? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Yes, they really are required. To see why the first guarantee is |
+| required, consider the following sequence of events: |
+| |
+| #. CPU 1: ``rcu_read_lock()`` |
+| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` |
+| #. CPU 0: ``list_del_rcu(p);`` |
+| #. CPU 0: ``synchronize_rcu()`` starts. |
+| #. CPU 1: ``do_something_with(q->a);`` |
+| ``/* No smp_mb(), so might happen after kfree(). */`` |
+| #. CPU 1: ``rcu_read_unlock()`` |
+| #. CPU 0: ``synchronize_rcu()`` returns. |
+| #. CPU 0: ``kfree(p);`` |
+| |
+| Therefore, there absolutely must be a full memory barrier between the |
+| end of the RCU read-side critical section and the end of the grace |
+| period. |
+| |
+| The sequence of events demonstrating the necessity of the second rule |
+| is roughly similar: |
+| |
+| #. CPU 0: ``list_del_rcu(p);`` |
+| #. CPU 0: ``synchronize_rcu()`` starts. |
+| #. CPU 1: ``rcu_read_lock()`` |
+| #. CPU 1: ``q = rcu_dereference(gp);`` |
+| ``/* Might return p if no memory barrier. */`` |
+| #. CPU 0: ``synchronize_rcu()`` returns. |
+| #. CPU 0: ``kfree(p);`` |
+| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` |
+| #. CPU 1: ``rcu_read_unlock()`` |
+| |
+| And similarly, without a memory barrier between the beginning of the |
+| grace period and the beginning of the RCU read-side critical section, |
+| CPU 1 might end up accessing the freelist. |
+| |
+| The “as if” rule of course applies, so that any implementation that |
+| acts as if the appropriate memory barriers were in place is a correct |
+| implementation. That said, it is much easier to fool yourself into |
+| believing that you have adhered to the as-if rule than it is to |
+| actually adhere to it! |
++-----------------------------------------------------------------------+
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
+| absolutely no code in some kernel builds. This means that the |
+| compiler might arbitrarily rearrange consecutive RCU read-side |
+| critical sections. Given such rearrangement, if a given RCU read-side |
+| critical section is done, how can you be sure that all prior RCU |
+| read-side critical sections are done? Won't the compiler |
+| rearrangements make that impossible to determine? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
+| absolutely no code, RCU infers quiescent states only at special |
+| locations, for example, within the scheduler. Because calls to |
+| ``schedule()`` had better prevent calling-code accesses to shared |
+| variables from being rearranged across the call to ``schedule()``, if |
+| RCU detects the end of a given RCU read-side critical section, it |
+| will necessarily detect the end of all prior RCU read-side critical |
+| sections, no matter how aggressively the compiler scrambles the code. |
+| Again, this all assumes that the compiler cannot scramble code across |
+| calls to the scheduler, out of interrupt handlers, into the idle |
+| loop, into user-mode code, and so on. But if your kernel build allows |
+| that sort of scrambling, you have broken far more than just RCU! |
++-----------------------------------------------------------------------+
+
+Note that these memory-barrier requirements do not replace the
+fundamental RCU requirement that a grace period wait for all
+pre-existing readers. On the contrary, the memory barriers called out in
+this section must operate in such a way as to *enforce* this fundamental
+requirement. Of course, different implementations enforce this
+requirement in different ways, but enforce it they must.
+
+RCU Primitives Guaranteed to Execute Unconditionally
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The common-case RCU primitives are unconditional. They are invoked, they
+do their job, and they return, with no possibility of error, and no need
+to retry. This is a key RCU design philosophy.
+
+However, this philosophy is pragmatic rather than pigheaded. If someone
+comes up with a good justification for a particular conditional RCU
+primitive, it might well be implemented and added. After all, this
+guarantee was reverse-engineered, not premeditated. The unconditional
+nature of the RCU primitives was initially an accident of
+implementation, and later experience with synchronization primitives
+with conditional primitives caused me to elevate this accident to a
+guarantee. Therefore, the justification for adding a conditional
+primitive to RCU would need to be based on detailed and compelling use
+cases.
+
+Guaranteed Read-to-Write Upgrade
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As far as RCU is concerned, it is always possible to carry out an update
+within an RCU read-side critical section. For example, that RCU
+read-side critical section might search for a given data element, and
+then might acquire the update-side spinlock in order to update that
+element, all while remaining in that RCU read-side critical section. Of
+course, it is necessary to exit the RCU read-side critical section
+before invoking ``synchronize_rcu()``, however, this inconvenience can
+be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API
+members described later in this document.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But how does the upgrade-to-write operation exclude other readers? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| It doesn't, just like normal RCU updates, which also do not exclude |
+| RCU readers. |
++-----------------------------------------------------------------------+
+
+This guarantee allows lookup code to be shared between read-side and
+update-side code, and was premeditated, appearing in the earliest
+DYNIX/ptx RCU documentation.
+
+Fundamental Non-Requirements
+----------------------------
+
+RCU provides extremely lightweight readers, and its read-side
+guarantees, though quite useful, are correspondingly lightweight. It is
+therefore all too easy to assume that RCU is guaranteeing more than it
+really is. Of course, the list of things that RCU does not guarantee is
+infinitely long, however, the following sections list a few
+non-guarantees that have caused confusion. Except where otherwise noted,
+these non-guarantees were premeditated.
+
+#. `Readers Impose Minimal Ordering`_
+#. `Readers Do Not Exclude Updaters`_
+#. `Updaters Only Wait For Old Readers`_
+#. `Grace Periods Don't Partition Read-Side Critical Sections`_
+#. `Read-Side Critical Sections Don't Partition Grace Periods`_
+
+Readers Impose Minimal Ordering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Reader-side markers such as ``rcu_read_lock()`` and
+``rcu_read_unlock()`` provide absolutely no ordering guarantees except
+through their interaction with the grace-period APIs such as
+``synchronize_rcu()``. To see this, consider the following pair of
+threads:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(x, 1);
+ 5 rcu_read_unlock();
+ 6 rcu_read_lock();
+ 7 WRITE_ONCE(y, 1);
+ 8 rcu_read_unlock();
+ 9 }
+ 10
+ 11 void thread1(void)
+ 12 {
+ 13 rcu_read_lock();
+ 14 r1 = READ_ONCE(y);
+ 15 rcu_read_unlock();
+ 16 rcu_read_lock();
+ 17 r2 = READ_ONCE(x);
+ 18 rcu_read_unlock();
+ 19 }
+
+After ``thread0()`` and ``thread1()`` execute concurrently, it is quite
+possible to have
+
+ ::
+
+ (r1 == 1 && r2 == 0)
+
+(that is, ``y`` appears to have been assigned before ``x``), which would
+not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had
+much in the way of ordering properties. But they do not, so the CPU is
+within its rights to do significant reordering. This is by design: Any
+significant ordering constraints would slow down these fast-path APIs.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Can't the compiler also reorder this code? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()`` |
+| prevent the compiler from reordering in this particular case. |
++-----------------------------------------------------------------------+
+
+Readers Do Not Exclude Updaters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates.
+All they do is to prevent grace periods from ending. The following
+example illustrates this:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 r1 = READ_ONCE(y);
+ 5 if (r1) {
+ 6 do_something_with_nonzero_x();
+ 7 r2 = READ_ONCE(x);
+ 8 WARN_ON(!r2); /* BUG!!! */
+ 9 }
+ 10 rcu_read_unlock();
+ 11 }
+ 12
+ 13 void thread1(void)
+ 14 {
+ 15 spin_lock(&my_lock);
+ 16 WRITE_ONCE(x, 1);
+ 17 WRITE_ONCE(y, 1);
+ 18 spin_unlock(&my_lock);
+ 19 }
+
+If the ``thread0()`` function's ``rcu_read_lock()`` excluded the
+``thread1()`` function's update, the ``WARN_ON()`` could never fire. But
+the fact is that ``rcu_read_lock()`` does not exclude much of anything
+aside from subsequent grace periods, of which ``thread1()`` has none, so
+the ``WARN_ON()`` can and does fire.
+
+Updaters Only Wait For Old Readers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It might be tempting to assume that after ``synchronize_rcu()``
+completes, there are no readers executing. This temptation must be
+avoided because new readers can start immediately after
+``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no
+obligation to wait for these new readers.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Suppose that synchronize_rcu() did wait until *all* readers had |
+| completed instead of waiting only on pre-existing readers. For how |
+| long would the updater be able to rely on there being no readers? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| For no time at all. Even if ``synchronize_rcu()`` were to wait until |
+| all readers had completed, a new reader might start immediately after |
+| ``synchronize_rcu()`` completed. Therefore, the code following |
+| ``synchronize_rcu()`` can *never* rely on there being no readers. |
++-----------------------------------------------------------------------+
+
+Grace Periods Don't Partition Read-Side Critical Sections
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is tempting to assume that if any part of one RCU read-side critical
+section precedes a given grace period, and if any part of another RCU
+read-side critical section follows that same grace period, then all of
+the first RCU read-side critical section must precede all of the second.
+However, this just isn't the case: A single grace period does not
+partition the set of RCU read-side critical sections. An example of this
+situation can be illustrated as follows, where ``x``, ``y``, and ``z``
+are initially all zero:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(a, 1);
+ 5 WRITE_ONCE(b, 1);
+ 6 rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+ 10 {
+ 11 r1 = READ_ONCE(a);
+ 12 synchronize_rcu();
+ 13 WRITE_ONCE(c, 1);
+ 14 }
+ 15
+ 16 void thread2(void)
+ 17 {
+ 18 rcu_read_lock();
+ 19 r2 = READ_ONCE(b);
+ 20 r3 = READ_ONCE(c);
+ 21 rcu_read_unlock();
+ 22 }
+
+It turns out that the outcome:
+
+ ::
+
+ (r1 == 1 && r2 == 0 && r3 == 1)
+
+is entirely possible. The following figure show how this can happen,
+with each circled ``QS`` indicating the point at which RCU recorded a
+*quiescent state* for each thread, that is, a state in which RCU knows
+that the thread cannot be in the midst of an RCU read-side critical
+section that started before the current grace period:
+
+.. kernel-figure:: GPpartitionReaders1.svg
+
+If it is necessary to partition RCU read-side critical sections in this
+manner, it is necessary to use two grace periods, where the first grace
+period is known to end before the second grace period starts:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(a, 1);
+ 5 WRITE_ONCE(b, 1);
+ 6 rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+ 10 {
+ 11 r1 = READ_ONCE(a);
+ 12 synchronize_rcu();
+ 13 WRITE_ONCE(c, 1);
+ 14 }
+ 15
+ 16 void thread2(void)
+ 17 {
+ 18 r2 = READ_ONCE(c);
+ 19 synchronize_rcu();
+ 20 WRITE_ONCE(d, 1);
+ 21 }
+ 22
+ 23 void thread3(void)
+ 24 {
+ 25 rcu_read_lock();
+ 26 r3 = READ_ONCE(b);
+ 27 r4 = READ_ONCE(d);
+ 28 rcu_read_unlock();
+ 29 }
+
+Here, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen
+before the end of ``thread1()``'s grace period. If in addition
+``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after
+the beginning of ``thread2()``'s grace period. If it is also the case
+that ``(r2 == 1)``, then the end of ``thread1()``'s grace period must
+precede the beginning of ``thread2()``'s grace period. This mean that
+the two RCU read-side critical sections cannot overlap, guaranteeing
+that ``(r3 == 1)``. As a result, the outcome:
+
+ ::
+
+ (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)
+
+cannot happen.
+
+This non-requirement was also non-premeditated, but became apparent when
+studying RCU's interaction with memory ordering.
+
+Read-Side Critical Sections Don't Partition Grace Periods
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is also tempting to assume that if an RCU read-side critical section
+happens between a pair of grace periods, then those grace periods cannot
+overlap. However, this temptation leads nowhere good, as can be
+illustrated by the following, with all variables initially zero:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(a, 1);
+ 5 WRITE_ONCE(b, 1);
+ 6 rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+ 10 {
+ 11 r1 = READ_ONCE(a);
+ 12 synchronize_rcu();
+ 13 WRITE_ONCE(c, 1);
+ 14 }
+ 15
+ 16 void thread2(void)
+ 17 {
+ 18 rcu_read_lock();
+ 19 WRITE_ONCE(d, 1);
+ 20 r2 = READ_ONCE(c);
+ 21 rcu_read_unlock();
+ 22 }
+ 23
+ 24 void thread3(void)
+ 25 {
+ 26 r3 = READ_ONCE(d);
+ 27 synchronize_rcu();
+ 28 WRITE_ONCE(e, 1);
+ 29 }
+ 30
+ 31 void thread4(void)
+ 32 {
+ 33 rcu_read_lock();
+ 34 r4 = READ_ONCE(b);
+ 35 r5 = READ_ONCE(e);
+ 36 rcu_read_unlock();
+ 37 }
+
+In this case, the outcome:
+
+ ::
+
+ (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)
+
+is entirely possible, as illustrated below:
+
+.. kernel-figure:: ReadersPartitionGP1.svg
+
+Again, an RCU read-side critical section can overlap almost all of a
+given grace period, just so long as it does not overlap the entire grace
+period. As a result, an RCU read-side critical section cannot partition
+a pair of RCU grace periods.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| How long a sequence of grace periods, each separated by an RCU |
+| read-side critical section, would be required to partition the RCU |
+| read-side critical sections at the beginning and end of the chain? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| In theory, an infinite number. In practice, an unknown number that is |
+| sensitive to both implementation details and timing considerations. |
+| Therefore, even in practice, RCU users must abide by the theoretical |
+| rather than the practical answer. |
++-----------------------------------------------------------------------+
+
+Parallelism Facts of Life
+-------------------------
+
+These parallelism facts of life are by no means specific to RCU, but the
+RCU implementation must abide by them. They therefore bear repeating:
+
+#. Any CPU or task may be delayed at any time, and any attempts to avoid
+ these delays by disabling preemption, interrupts, or whatever are
+ completely futile. This is most obvious in preemptible user-level
+ environments and in virtualized environments (where a given guest
+ OS's VCPUs can be preempted at any time by the underlying
+ hypervisor), but can also happen in bare-metal environments due to
+ ECC errors, NMIs, and other hardware events. Although a delay of more
+ than about 20 seconds can result in splats, the RCU implementation is
+ obligated to use algorithms that can tolerate extremely long delays,
+ but where “extremely long” is not long enough to allow wrap-around
+ when incrementing a 64-bit counter.
+#. Both the compiler and the CPU can reorder memory accesses. Where it
+ matters, RCU must use compiler directives and memory-barrier
+ instructions to preserve ordering.
+#. Conflicting writes to memory locations in any given cache line will
+ result in expensive cache misses. Greater numbers of concurrent
+ writes and more-frequent concurrent writes will result in more
+ dramatic slowdowns. RCU is therefore obligated to use algorithms that
+ have sufficient locality to avoid significant performance and
+ scalability problems.
+#. As a rough rule of thumb, only one CPU's worth of processing may be
+ carried out under the protection of any given exclusive lock. RCU
+ must therefore use scalable locking designs.
+#. Counters are finite, especially on 32-bit systems. RCU's use of
+ counters must therefore tolerate counter wrap, or be designed such
+ that counter wrap would take way more time than a single system is
+ likely to run. An uptime of ten years is quite possible, a runtime of
+ a century much less so. As an example of the latter, RCU's
+ dyntick-idle nesting counter allows 54 bits for interrupt nesting
+ level (this counter is 64 bits even on a 32-bit system). Overflowing
+ this counter requires 2\ :sup:`54` half-interrupts on a given CPU
+ without that CPU ever going idle. If a half-interrupt happened every
+ microsecond, it would take 570 years of runtime to overflow this
+ counter, which is currently believed to be an acceptably long time.
+#. Linux systems can have thousands of CPUs running a single Linux
+ kernel in a single shared-memory environment. RCU must therefore pay
+ close attention to high-end scalability.
+
+This last parallelism fact of life means that RCU must pay special
+attention to the preceding facts of life. The idea that Linux might
+scale to systems with thousands of CPUs would have been met with some
+skepticism in the 1990s, but these requirements would have otherwise
+have been unsurprising, even in the early 1990s.
+
+Quality-of-Implementation Requirements
+--------------------------------------
+
+These sections list quality-of-implementation requirements. Although an
+RCU implementation that ignores these requirements could still be used,
+it would likely be subject to limitations that would make it
+inappropriate for industrial-strength production use. Classes of
+quality-of-implementation requirements are as follows:
+
+#. `Specialization`_
+#. `Performance and Scalability`_
+#. `Forward Progress`_
+#. `Composability`_
+#. `Corner Cases`_
+
+These classes is covered in the following sections.
+
+Specialization
+~~~~~~~~~~~~~~
+
+RCU is and always has been intended primarily for read-mostly
+situations, which means that RCU's read-side primitives are optimized,
+often at the expense of its update-side primitives. Experience thus far
+is captured by the following list of situations:
+
+#. Read-mostly data, where stale and inconsistent data is not a problem:
+ RCU works great!
+#. Read-mostly data, where data must be consistent: RCU works well.
+#. Read-write data, where data must be consistent: RCU *might* work OK.
+ Or not.
+#. Write-mostly data, where data must be consistent: RCU is very
+ unlikely to be the right tool for the job, with the following
+ exceptions, where RCU can provide:
+
+ a. Existence guarantees for update-friendly mechanisms.
+ b. Wait-free read-side primitives for real-time use.
+
+This focus on read-mostly situations means that RCU must interoperate
+with other synchronization primitives. For example, the ``add_gp()`` and
+``remove_gp_synchronous()`` examples discussed earlier use RCU to
+protect readers and locking to coordinate updaters. However, the need
+extends much farther, requiring that a variety of synchronization
+primitives be legal within RCU read-side critical sections, including
+spinlocks, sequence locks, atomic operations, reference counters, and
+memory barriers.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| What about sleeping locks? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| These are forbidden within Linux-kernel RCU read-side critical |
+| sections because it is not legal to place a quiescent state (in this |
+| case, voluntary context switch) within an RCU read-side critical |
+| section. However, sleeping locks may be used within userspace RCU |
+| read-side critical sections, and also within Linux-kernel sleepable |
+| RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In |
+| addition, the -rt patchset turns spinlocks into a sleeping locks so |
+| that the corresponding critical sections can be preempted, which also |
+| means that these sleeplockified spinlocks (but not other sleeping |
+| locks!) may be acquire within -rt-Linux-kernel RCU read-side critical |
+| sections. |
+| Note that it *is* legal for a normal RCU read-side critical section |
+| to conditionally acquire a sleeping locks (as in |
+| ``mutex_trylock()``), but only as long as it does not loop |
+| indefinitely attempting to conditionally acquire that sleeping locks. |
+| The key point is that things like ``mutex_trylock()`` either return |
+| with the mutex held, or return an error indication if the mutex was |
+| not immediately available. Either way, ``mutex_trylock()`` returns |
+| immediately without sleeping. |
++-----------------------------------------------------------------------+
+
+It often comes as a surprise that many algorithms do not require a
+consistent view of data, but many can function in that mode, with
+network routing being the poster child. Internet routing algorithms take
+significant time to propagate updates, so that by the time an update
+arrives at a given system, that system has been sending network traffic
+the wrong way for a considerable length of time. Having a few threads
+continue to send traffic the wrong way for a few more milliseconds is
+clearly not a problem: In the worst case, TCP retransmissions will
+eventually get the data where it needs to go. In general, when tracking
+the state of the universe outside of the computer, some level of
+inconsistency must be tolerated due to speed-of-light delays if nothing
+else.
+
+Furthermore, uncertainty about external state is inherent in many cases.
+For example, a pair of veterinarians might use heartbeat to determine
+whether or not a given cat was alive. But how long should they wait
+after the last heartbeat to decide that the cat is in fact dead? Waiting
+less than 400 milliseconds makes no sense because this would mean that a
+relaxed cat would be considered to cycle between death and life more
+than 100 times per minute. Moreover, just as with human beings, a cat's
+heart might stop for some period of time, so the exact wait period is a
+judgment call. One of our pair of veterinarians might wait 30 seconds
+before pronouncing the cat dead, while the other might insist on waiting
+a full minute. The two veterinarians would then disagree on the state of
+the cat during the final 30 seconds of the minute following the last
+heartbeat.
+
+Interestingly enough, this same situation applies to hardware. When push
+comes to shove, how do we tell whether or not some external server has
+failed? We send messages to it periodically, and declare it failed if we
+don't receive a response within a given period of time. Policy decisions
+can usually tolerate short periods of inconsistency. The policy was
+decided some time ago, and is only now being put into effect, so a few
+milliseconds of delay is normally inconsequential.
+
+However, there are algorithms that absolutely must see consistent data.
+For example, the translation between a user-level SystemV semaphore ID
+to the corresponding in-kernel data structure is protected by RCU, but
+it is absolutely forbidden to update a semaphore that has just been
+removed. In the Linux kernel, this need for consistency is accommodated
+by acquiring spinlocks located in the in-kernel data structure from
+within the RCU read-side critical section, and this is indicated by the
+green box in the figure above. Many other techniques may be used, and
+are in fact used within the Linux kernel.
+
+In short, RCU is not required to maintain consistency, and other
+mechanisms may be used in concert with RCU when consistency is required.
+RCU's specialization allows it to do its job extremely well, and its
+ability to interoperate with other synchronization mechanisms allows the
+right mix of synchronization tools to be used for a given job.
+
+Performance and Scalability
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Energy efficiency is a critical component of performance today, and
+Linux-kernel RCU implementations must therefore avoid unnecessarily
+awakening idle CPUs. I cannot claim that this requirement was
+premeditated. In fact, I learned of it during a telephone conversation
+in which I was given “frank and open” feedback on the importance of
+energy efficiency in battery-powered systems and on specific
+energy-efficiency shortcomings of the Linux-kernel RCU implementation.
+In my experience, the battery-powered embedded community will consider
+any unnecessary wakeups to be extremely unfriendly acts. So much so that
+mere Linux-kernel-mailing-list posts are insufficient to vent their ire.
+
+Memory consumption is not particularly important for in most situations,
+and has become decreasingly so as memory sizes have expanded and memory
+costs have plummeted. However, as I learned from Matt Mackall's
+`bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
+footprint is critically important on single-CPU systems with
+non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny
+RCU <https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com>`__
+was born. Josh Triplett has since taken over the small-memory banner
+with his `Linux kernel tinification <https://tiny.wiki.kernel.org/>`__
+project, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional
+for those kernels not needing it.
+
+The remaining performance requirements are, for the most part,
+unsurprising. For example, in keeping with RCU's read-side
+specialization, ``rcu_dereference()`` should have negligible overhead
+(for example, suppression of a few minor compiler optimizations).
+Similarly, in non-preemptible environments, ``rcu_read_lock()`` and
+``rcu_read_unlock()`` should have exactly zero overhead.
+
+In preemptible environments, in the case where the RCU read-side
+critical section was not preempted (as will be the case for the
+highest-priority real-time process), ``rcu_read_lock()`` and
+``rcu_read_unlock()`` should have minimal overhead. In particular, they
+should not contain atomic read-modify-write operations, memory-barrier
+instructions, preemption disabling, interrupt disabling, or backwards
+branches. However, in the case where the RCU read-side critical section
+was preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable
+interrupts. This is why it is better to nest an RCU read-side critical
+section within a preempt-disable region than vice versa, at least in
+cases where that critical section is short enough to avoid unduly
+degrading real-time latencies.
+
+The ``synchronize_rcu()`` grace-period-wait primitive is optimized for
+throughput. It may therefore incur several milliseconds of latency in
+addition to the duration of the longest RCU read-side critical section.
+On the other hand, multiple concurrent invocations of
+``synchronize_rcu()`` are required to use batching optimizations so that
+they can be satisfied by a single underlying grace-period-wait
+operation. For example, in the Linux kernel, it is not unusual for a
+single grace-period-wait operation to serve more than `1,000 separate
+invocations <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response>`__
+of ``synchronize_rcu()``, thus amortizing the per-invocation overhead
+down to nearly zero. However, the grace-period optimization is also
+required to avoid measurable degradation of real-time scheduling and
+interrupt latencies.
+
+In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are
+unacceptable. In these cases, ``synchronize_rcu_expedited()`` may be
+used instead, reducing the grace-period latency down to a few tens of
+microseconds on small systems, at least in cases where the RCU read-side
+critical sections are short. There are currently no special latency
+requirements for ``synchronize_rcu_expedited()`` on large systems, but,
+consistent with the empirical nature of the RCU specification, that is
+subject to change. However, there most definitely are scalability
+requirements: A storm of ``synchronize_rcu_expedited()`` invocations on
+4096 CPUs should at least make reasonable forward progress. In return
+for its shorter latencies, ``synchronize_rcu_expedited()`` is permitted
+to impose modest degradation of real-time latency on non-idle online
+CPUs. Here, “modest” means roughly the same latency degradation as a
+scheduling-clock interrupt.
+
+There are a number of situations where even
+``synchronize_rcu_expedited()``'s reduced grace-period latency is
+unacceptable. In these situations, the asynchronous ``call_rcu()`` can
+be used in place of ``synchronize_rcu()`` as follows:
+
+ ::
+
+ 1 struct foo {
+ 2 int a;
+ 3 int b;
+ 4 struct rcu_head rh;
+ 5 };
+ 6
+ 7 static void remove_gp_cb(struct rcu_head *rhp)
+ 8 {
+ 9 struct foo *p = container_of(rhp, struct foo, rh);
+ 10
+ 11 kfree(p);
+ 12 }
+ 13
+ 14 bool remove_gp_asynchronous(void)
+ 15 {
+ 16 struct foo *p;
+ 17
+ 18 spin_lock(&gp_lock);
+ 19 p = rcu_access_pointer(gp);
+ 20 if (!p) {
+ 21 spin_unlock(&gp_lock);
+ 22 return false;
+ 23 }
+ 24 rcu_assign_pointer(gp, NULL);
+ 25 call_rcu(&p->rh, remove_gp_cb);
+ 26 spin_unlock(&gp_lock);
+ 27 return true;
+ 28 }
+
+A definition of ``struct foo`` is finally needed, and appears on
+lines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()``
+on line 25, and will be invoked after the end of a subsequent grace
+period. This gets the same effect as ``remove_gp_synchronous()``, but
+without forcing the updater to wait for a grace period to elapse. The
+``call_rcu()`` function may be used in a number of situations where
+neither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would
+be legal, including within preempt-disable code, ``local_bh_disable()``
+code, interrupt-disable code, and interrupt handlers. However, even
+``call_rcu()`` is illegal within NMI handlers and from idle and offline
+CPUs. The callback function (``remove_gp_cb()`` in this case) will be
+executed within softirq (software interrupt) environment within the
+Linux kernel, either within a real softirq handler or under the
+protection of ``local_bh_disable()``. In both the Linux kernel and in
+userspace, it is bad practice to write an RCU callback function that
+takes too long. Long-running operations should be relegated to separate
+threads or (in the Linux kernel) workqueues.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Why does line 19 use ``rcu_access_pointer()``? After all, |
+| ``call_rcu()`` on line 25 stores into the structure, which would |
+| interact badly with concurrent insertions. Doesn't this mean that |
+| ``rcu_dereference()`` is required? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Presumably the ``->gp_lock`` acquired on line 18 excludes any |
+| changes, including any insertions that ``rcu_dereference()`` would |
+| protect against. Therefore, any insertions will be delayed until |
+| after ``->gp_lock`` is released on line 25, which in turn means that |
+| ``rcu_access_pointer()`` suffices. |
++-----------------------------------------------------------------------+
+
+However, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on
+the data element. This is a common idiom, and is supported by
+``kfree_rcu()``, which allows “fire and forget” operation as shown
+below:
+
+ ::
+
+ 1 struct foo {
+ 2 int a;
+ 3 int b;
+ 4 struct rcu_head rh;
+ 5 };
+ 6
+ 7 bool remove_gp_faf(void)
+ 8 {
+ 9 struct foo *p;
+ 10
+ 11 spin_lock(&gp_lock);
+ 12 p = rcu_dereference(gp);
+ 13 if (!p) {
+ 14 spin_unlock(&gp_lock);
+ 15 return false;
+ 16 }
+ 17 rcu_assign_pointer(gp, NULL);
+ 18 kfree_rcu(p, rh);
+ 19 spin_unlock(&gp_lock);
+ 20 return true;
+ 21 }
+
+Note that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and
+proceeds, without any need to pay any further attention to the
+subsequent grace period and ``kfree()``. It is permissible to invoke
+``kfree_rcu()`` from the same environments as for ``call_rcu()``.
+Interestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()``
+and ``kfree_rcu()``, but not ``synchronize_rcu()``. This was due to the
+fact that RCU was not heavily used within DYNIX/ptx, so the very few
+places that needed something like ``synchronize_rcu()`` simply
+open-coded it.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()`` |
+| allowed updaters to avoid being blocked by readers. But how can that |
+| be correct, given that the invocation of the callback and the freeing |
+| of the memory (respectively) must still wait for a grace period to |
+| elapse? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| We could define things this way, but keep in mind that this sort of |
+| definition would say that updates in garbage-collected languages |
+| cannot complete until the next time the garbage collector runs, which |
+| does not seem at all reasonable. The key point is that in most cases, |
+| an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed |
+| to the next update as soon as it has invoked ``call_rcu()`` or |
+| ``kfree_rcu()``, without having to wait for a subsequent grace |
+| period. |
++-----------------------------------------------------------------------+
+
+But what if the updater must wait for the completion of code to be
+executed after the end of the grace period, but has other tasks that can
+be carried out in the meantime? The polling-style
+``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions
+may be used for this purpose, as shown below:
+
+ ::
+
+ 1 bool remove_gp_poll(void)
+ 2 {
+ 3 struct foo *p;
+ 4 unsigned long s;
+ 5
+ 6 spin_lock(&gp_lock);
+ 7 p = rcu_access_pointer(gp);
+ 8 if (!p) {
+ 9 spin_unlock(&gp_lock);
+ 10 return false;
+ 11 }
+ 12 rcu_assign_pointer(gp, NULL);
+ 13 spin_unlock(&gp_lock);
+ 14 s = get_state_synchronize_rcu();
+ 15 do_something_while_waiting();
+ 16 cond_synchronize_rcu(s);
+ 17 kfree(p);
+ 18 return true;
+ 19 }
+
+On line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU,
+then line 15 carries out other tasks, and finally, line 16 returns
+immediately if a grace period has elapsed in the meantime, but otherwise
+waits as required. The need for ``get_state_synchronize_rcu`` and
+``cond_synchronize_rcu()`` has appeared quite recently, so it is too
+early to tell whether they will stand the test of time.
+
+RCU thus provides a range of tools to allow updaters to strike the
+required tradeoff between latency, flexibility and CPU overhead.
+
+Forward Progress
+~~~~~~~~~~~~~~~~
+
+In theory, delaying grace-period completion and callback invocation is
+harmless. In practice, not only are memory sizes finite but also
+callbacks sometimes do wakeups, and sufficiently deferred wakeups can be
+difficult to distinguish from system hangs. Therefore, RCU must provide
+a number of mechanisms to promote forward progress.
+
+These mechanisms are not foolproof, nor can they be. For one simple
+example, an infinite loop in an RCU read-side critical section must by
+definition prevent later grace periods from ever completing. For a more
+involved example, consider a 64-CPU system built with
+``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where
+CPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``. Even
+if these tight loops also contain calls to ``cond_resched()`` (thus
+allowing grace periods to complete), CPU 0 simply will not be able to
+invoke callbacks as fast as the other 63 CPUs can register them, at
+least not until the system runs out of memory. In both of these
+examples, the Spiderman principle applies: With great power comes great
+responsibility. However, short of this level of abuse, RCU is required
+to ensure timely completion of grace periods and timely invocation of
+callbacks.
+
+RCU takes the following steps to encourage timely completion of grace
+periods:
+
+#. If a grace period fails to complete within 100 milliseconds, RCU
+ causes future invocations of ``cond_resched()`` on the holdout CPUs
+ to provide an RCU quiescent state. RCU also causes those CPUs'
+ ``need_resched()`` invocations to return ``true``, but only after the
+ corresponding CPU's next scheduling-clock.
+#. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run
+ indefinitely in the kernel without scheduling-clock interrupts, which
+ defeats the above ``need_resched()`` strategem. RCU will therefore
+ invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out
+ after 109 milliseconds.
+#. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that
+ has been preempted within an RCU read-side critical section is
+ holding out for more than 500 milliseconds, RCU will resort to
+ priority boosting.
+#. If a CPU is still holding out 10 seconds into the grace period, RCU
+ will invoke ``resched_cpu()`` on it regardless of its ``nohz_full``
+ state.
+
+The above values are defaults for systems running with ``HZ=1000``. They
+will vary as the value of ``HZ`` varies, and can also be changed using
+the relevant Kconfig options and kernel boot parameters. RCU currently
+does not do much sanity checking of these parameters, so please use
+caution when changing them. Note that these forward-progress measures
+are provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks
+RCU <#Tasks%20RCU>`__.
+
+RCU takes the following steps in ``call_rcu()`` to encourage timely
+invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has
+10,000 callbacks, or has 10,000 more callbacks than it had the last time
+encouragement was provided:
+
+#. Starts a grace period, if one is not already in progress.
+#. Forces immediate checking for quiescent states, rather than waiting
+ for three milliseconds to have elapsed since the beginning of the
+ grace period.
+#. Immediately tags the CPU's callbacks with their grace period
+ completion numbers, rather than waiting for the ``RCU_SOFTIRQ``
+ handler to get around to it.
+#. Lifts callback-execution batch limits, which speeds up callback
+ invocation at the expense of degrading realtime response.
+
+Again, these are default values when running at ``HZ=1000``, and can be
+overridden. Again, these forward-progress measures are provided only for
+RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks
+RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward
+progress for ``rcu_nocbs`` CPUs is much less well-developed, in part
+because workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke
+``call_rcu()`` relatively infrequently. If workloads emerge that need
+both ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then
+additional forward-progress work will be required.
+
+Composability
+~~~~~~~~~~~~~
+
+Composability has received much attention in recent years, perhaps in
+part due to the collision of multicore hardware with object-oriented
+techniques designed in single-threaded environments for single-threaded
+use. And in theory, RCU read-side critical sections may be composed, and
+in fact may be nested arbitrarily deeply. In practice, as with all
+real-world implementations of composable constructs, there are
+limitations.
+
+Implementations of RCU for which ``rcu_read_lock()`` and
+``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when
+``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. After all, there
+is no overhead. Except that if all these instances of
+``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the
+compiler, compilation will eventually fail due to exhausting memory,
+mass storage, or user patience, whichever comes first. If the nesting is
+not visible to the compiler, as is the case with mutually recursive
+functions each in its own translation unit, stack overflow will result.
+If the nesting takes the form of loops, perhaps in the guise of tail
+recursion, either the control variable will overflow or (in the Linux
+kernel) you will get an RCU CPU stall warning. Nevertheless, this class
+of RCU implementations is one of the most composable constructs in
+existence.
+
+RCU implementations that explicitly track nesting depth are limited by
+the nesting-depth counter. For example, the Linux kernel's preemptible
+RCU limits nesting to ``INT_MAX``. This should suffice for almost all
+practical purposes. That said, a consecutive pair of RCU read-side
+critical sections between which there is an operation that waits for a
+grace period cannot be enclosed in another RCU read-side critical
+section. This is because it is not legal to wait for a grace period
+within an RCU read-side critical section: To do so would result either
+in deadlock or in RCU implicitly splitting the enclosing RCU read-side
+critical section, neither of which is conducive to a long-lived and
+prosperous kernel.
+
+It is worth noting that RCU is not alone in limiting composability. For
+example, many transactional-memory implementations prohibit composing a
+pair of transactions separated by an irrevocable operation (for example,
+a network receive operation). For another example, lock-based critical
+sections can be composed surprisingly freely, but only if deadlock is
+avoided.
+
+In short, although RCU read-side critical sections are highly
+composable, care is required in some situations, just as is the case for
+any other composable synchronization mechanism.
+
+Corner Cases
+~~~~~~~~~~~~
+
+A given RCU workload might have an endless and intense stream of RCU
+read-side critical sections, perhaps even so intense that there was
+never a point in time during which there was not at least one RCU
+read-side critical section in flight. RCU cannot allow this situation to
+block grace periods: As long as all the RCU read-side critical sections
+are finite, grace periods must also be finite.
+
+That said, preemptible RCU implementations could potentially result in
+RCU read-side critical sections being preempted for long durations,
+which has the effect of creating a long-duration RCU read-side critical
+section. This situation can arise only in heavily loaded systems, but
+systems using real-time priorities are of course more vulnerable.
+Therefore, RCU priority boosting is provided to help deal with this
+case. That said, the exact requirements on RCU priority boosting will
+likely evolve as more experience accumulates.
+
+Other workloads might have very high update rates. Although one can
+argue that such workloads should instead use something other than RCU,
+the fact remains that RCU must handle such workloads gracefully. This
+requirement is another factor driving batching of grace periods, but it
+is also the driving force behind the checks for large numbers of queued
+RCU callbacks in the ``call_rcu()`` code path. Finally, high update
+rates should not delay RCU read-side critical sections, although some
+small read-side delays can occur when using
+``synchronize_rcu_expedited()``, courtesy of this function's use of
+``smp_call_function_single()``.
+
+Although all three of these corner cases were understood in the early
+1990s, a simple user-level test consisting of ``close(open(path))`` in a
+tight loop in the early 2000s suddenly provided a much deeper
+appreciation of the high-update-rate corner case. This test also
+motivated addition of some RCU code to react to high update rates, for
+example, if a given CPU finds itself with more than 10,000 RCU callbacks
+queued, it will cause RCU to take evasive action by more aggressively
+starting grace periods and more aggressively forcing completion of
+grace-period processing. This evasive action causes the grace period to
+complete more quickly, but at the cost of restricting RCU's batching
+optimizations, thus increasing the CPU overhead incurred by that grace
+period.
+
+Software-Engineering Requirements
+---------------------------------
+
+Between Murphy's Law and “To err is human”, it is necessary to guard
+against mishaps and misuse:
+
+#. It is all too easy to forget to use ``rcu_read_lock()`` everywhere
+ that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will
+ splat if ``rcu_dereference()`` is used outside of an RCU read-side
+ critical section. Update-side code can use
+ ``rcu_dereference_protected()``, which takes a `lockdep
+ expression <https://lwn.net/Articles/371986/>`__ to indicate what is
+ providing the protection. If the indicated protection is not
+ provided, a lockdep splat is emitted.
+ Code shared between readers and updaters can use
+ ``rcu_dereference_check()``, which also takes a lockdep expression,
+ and emits a lockdep splat if neither ``rcu_read_lock()`` nor the
+ indicated protection is in place. In addition,
+ ``rcu_dereference_raw()`` is used in those (hopefully rare) cases
+ where the required protection cannot be easily described. Finally,
+ ``rcu_read_lock_held()`` is provided to allow a function to verify
+ that it has been invoked within an RCU read-side critical section. I
+ was made aware of this set of requirements shortly after Thomas
+ Gleixner audited a number of RCU uses.
+#. A given function might wish to check for RCU-related preconditions
+ upon entry, before using any other RCU API. The
+ ``rcu_lockdep_assert()`` does this job, asserting the expression in
+ kernels having lockdep enabled and doing nothing otherwise.
+#. It is also easy to forget to use ``rcu_assign_pointer()`` and
+ ``rcu_dereference()``, perhaps (incorrectly) substituting a simple
+ assignment. To catch this sort of error, a given RCU-protected
+ pointer may be tagged with ``__rcu``, after which sparse will
+ complain about simple-assignment accesses to that pointer. Arnd
+ Bergmann made me aware of this requirement, and also supplied the
+ needed `patch series <https://lwn.net/Articles/376011/>`__.
+#. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if
+ a data element is passed to ``call_rcu()`` twice in a row, without a
+ grace period in between. (This error is similar to a double free.)
+ The corresponding ``rcu_head`` structures that are dynamically
+ allocated are automatically tracked, but ``rcu_head`` structures
+ allocated on the stack must be initialized with
+ ``init_rcu_head_on_stack()`` and cleaned up with
+ ``destroy_rcu_head_on_stack()``. Similarly, statically allocated
+ non-stack ``rcu_head`` structures must be initialized with
+ ``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``.
+ Mathieu Desnoyers made me aware of this requirement, and also
+ supplied the needed
+ `patch <https://lkml.kernel.org/g/20100319013024.GA28456@Krystal>`__.
+#. An infinite loop in an RCU read-side critical section will eventually
+ trigger an RCU CPU stall warning splat, with the duration of
+ “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT``
+ ``Kconfig`` option, or, alternatively, by the
+ ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU
+ is not obligated to produce this splat unless there is a grace period
+ waiting on that particular RCU read-side critical section.
+
+ Some extreme workloads might intentionally delay RCU grace periods,
+ and systems running those workloads can be booted with
+ ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This
+ kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU
+ stall warnings are counter-productive during sysrq dumps and during
+ panics. RCU therefore supplies the ``rcu_sysrq_start()`` and
+ ``rcu_sysrq_end()`` API members to be called before and after long
+ sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is
+ automatically invoked at the beginning of a panic to suppress further
+ RCU CPU stall warnings.
+
+ This requirement made itself known in the early 1990s, pretty much
+ the first time that it was necessary to debug a CPU stall. That said,
+ the initial implementation in DYNIX/ptx was quite generic in
+ comparison with that of Linux.
+
+#. Although it would be very good to detect pointers leaking out of RCU
+ read-side critical sections, there is currently no good way of doing
+ this. One complication is the need to distinguish between pointers
+ leaking and pointers that have been handed off from RCU to some other
+ synchronization mechanism, for example, reference counting.
+#. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information
+ is provided via event tracing.
+#. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()``
+ to create typical linked data structures can be surprisingly
+ error-prone. Therefore, RCU-protected `linked
+ lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and,
+ more recently, RCU-protected `hash
+ tables <https://lwn.net/Articles/612100/>`__ are available. Many
+ other special-purpose RCU-protected data structures are available in
+ the Linux kernel and the userspace RCU library.
+#. Some linked structures are created at compile time, but still require
+ ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves
+ this purpose.
+#. It is not necessary to use ``rcu_assign_pointer()`` when creating
+ linked structures that are to be published via a single external
+ pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task
+ and also for assigning ``NULL`` pointers at runtime.
+
+This not a hard-and-fast list: RCU's diagnostic capabilities will
+continue to be guided by the number and type of usage bugs found in
+real-world RCU usage.
+
+Linux Kernel Complications
+--------------------------
+
+The Linux kernel provides an interesting environment for all kinds of
+software, including RCU. Some of the relevant points of interest are as
+follows:
+
+#. `Configuration`_
+#. `Firmware Interface`_
+#. `Early Boot`_
+#. `Interrupts and NMIs`_
+#. `Loadable Modules`_
+#. `Hotplug CPU`_
+#. `Scheduler and RCU`_
+#. `Tracing and RCU`_
+#. `Accesses to User Memory and RCU`_
+#. `Energy Efficiency`_
+#. `Scheduling-Clock Interrupts and RCU`_
+#. `Memory Efficiency`_
+#. `Performance, Scalability, Response Time, and Reliability`_
+
+This list is probably incomplete, but it does give a feel for the most
+notable Linux-kernel complications. Each of the following sections
+covers one of the above topics.
+
+Configuration
+~~~~~~~~~~~~~
+
+RCU's goal is automatic configuration, so that almost nobody needs to
+worry about RCU's ``Kconfig`` options. And for almost all users, RCU
+does in fact work well “out of the box.”
+
+However, there are specialized use cases that are handled by kernel boot
+parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig``
+system will explicitly ask users about new ``Kconfig`` options, which
+requires almost all of them be hidden behind a ``CONFIG_RCU_EXPERT``
+``Kconfig`` option.
+
+This all should be quite obvious, but the fact remains that Linus
+Torvalds recently had to
+`remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com>`__
+me of this requirement.
+
+Firmware Interface
+~~~~~~~~~~~~~~~~~~
+
+In many cases, kernel obtains information about the system from the
+firmware, and sometimes things are lost in translation. Or the
+translation is accurate, but the original message is bogus.
+
+For example, some systems' firmware overreports the number of CPUs,
+sometimes by a large factor. If RCU naively believed the firmware, as it
+used to do, it would create too many per-CPU kthreads. Although the
+resulting system will still run correctly, the extra kthreads needlessly
+consume memory and can cause confusion when they show up in ``ps``
+listings.
+
+RCU must therefore wait for a given CPU to actually come online before
+it can allow itself to believe that the CPU actually exists. The
+resulting “ghost CPUs” (which are never going to come online) cause a
+number of `interesting
+complications <https://paulmck.livejournal.com/37494.html>`__.
+
+Early Boot
+~~~~~~~~~~
+
+The Linux kernel's boot sequence is an interesting process, and RCU is
+used early, even before ``rcu_init()`` is invoked. In fact, a number of
+RCU's primitives can be used as soon as the initial task's
+``task_struct`` is available and the boot CPU's per-CPU variables are
+set up. The read-side primitives (``rcu_read_lock()``,
+``rcu_read_unlock()``, ``rcu_dereference()``, and
+``rcu_access_pointer()``) will operate normally very early on, as will
+``rcu_assign_pointer()``.
+
+Although ``call_rcu()`` may be invoked at any time during boot,
+callbacks are not guaranteed to be invoked until after all of RCU's
+kthreads have been spawned, which occurs at ``early_initcall()`` time.
+This delay in callback invocation is due to the fact that RCU does not
+invoke callbacks until it is fully initialized, and this full
+initialization cannot occur until after the scheduler has initialized
+itself to the point where RCU can spawn and run its kthreads. In theory,
+it would be possible to invoke callbacks earlier, however, this is not a
+panacea because there would be severe restrictions on what operations
+those callbacks could invoke.
+
+Perhaps surprisingly, ``synchronize_rcu()`` and
+``synchronize_rcu_expedited()``, will operate normally during very early
+boot, the reason being that there is only one CPU and preemption is
+disabled. This means that the call ``synchronize_rcu()`` (or friends)
+itself is a quiescent state and thus a grace period, so the early-boot
+implementation can be a no-op.
+
+However, once the scheduler has spawned its first kthread, this early
+boot trick fails for ``synchronize_rcu()`` (as well as for
+``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The
+reason is that an RCU read-side critical section might be preempted,
+which means that a subsequent ``synchronize_rcu()`` really does have to
+wait for something, as opposed to simply returning immediately.
+Unfortunately, ``synchronize_rcu()`` can't do this until all of its
+kthreads are spawned, which doesn't happen until some time during
+``early_initcalls()`` time. But this is no excuse: RCU is nevertheless
+required to correctly handle synchronous grace periods during this time
+period. Once all of its kthreads are up and running, RCU starts running
+normally.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| How can RCU possibly handle grace periods before all of its kthreads |
+| have been spawned??? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Very carefully! |
+| During the “dead zone” between the time that the scheduler spawns the |
+| first task and the time that all of RCU's kthreads have been spawned, |
+| all synchronous grace periods are handled by the expedited |
+| grace-period mechanism. At runtime, this expedited mechanism relies |
+| on workqueues, but during the dead zone the requesting task itself |
+| drives the desired expedited grace period. Because dead-zone |
+| execution takes place within task context, everything works. Once the |
+| dead zone ends, expedited grace periods go back to using workqueues, |
+| as is required to avoid problems that would otherwise occur when a |
+| user task received a POSIX signal while driving an expedited grace |
+| period. |
+| |
+| And yes, this does mean that it is unhelpful to send POSIX signals to |
+| random tasks between the time that the scheduler spawns its first |
+| kthread and the time that RCU's kthreads have all been spawned. If |
+| there ever turns out to be a good reason for sending POSIX signals |
+| during that time, appropriate adjustments will be made. (If it turns |
+| out that POSIX signals are sent during this time for no good reason, |
+| other adjustments will be made, appropriate or otherwise.) |
++-----------------------------------------------------------------------+
+
+I learned of these boot-time requirements as a result of a series of
+system hangs.
+
+Interrupts and NMIs
+~~~~~~~~~~~~~~~~~~~
+
+The Linux kernel has interrupts, and RCU read-side critical sections are
+legal within interrupt handlers and within interrupt-disabled regions of
+code, as are invocations of ``call_rcu()``.
+
+Some Linux-kernel architectures can enter an interrupt handler from
+non-idle process context, and then just never leave it, instead
+stealthily transitioning back to process context. This trick is
+sometimes used to invoke system calls from inside the kernel. These
+“half-interrupts” mean that RCU has to be very careful about how it
+counts interrupt nesting levels. I learned of this requirement the hard
+way during a rewrite of RCU's dyntick-idle code.
+
+The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side
+critical sections are legal within NMI handlers. Thankfully, RCU
+update-side primitives, including ``call_rcu()``, are prohibited within
+NMI handlers.
+
+The name notwithstanding, some Linux-kernel architectures can have
+nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised
+me <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
+with this requirement; he also kindly surprised me with `an
+algorithm <https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com>`__
+that meets this requirement.
+
+Furthermore, NMI handlers can be interrupted by what appear to RCU to be
+normal interrupts. One way that this can happen is for code that
+directly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called
+from an NMI handler. This astonishing fact of life prompted the current
+code structure, which has ``rcu_irq_enter()`` invoking
+``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``.
+And yes, I also learned of this requirement the hard way.
+
+Loadable Modules
+~~~~~~~~~~~~~~~~
+
+The Linux kernel has loadable modules, and these modules can also be
+unloaded. After a given module has been unloaded, any attempt to call
+one of its functions results in a segmentation fault. The module-unload
+functions must therefore cancel any delayed calls to loadable-module
+functions, for example, any outstanding ``mod_timer()`` must be dealt
+with via ``del_timer_sync()`` or similar.
+
+Unfortunately, there is no way to cancel an RCU callback; once you
+invoke ``call_rcu()``, the callback function is eventually going to be
+invoked, unless the system goes down first. Because it is normally
+considered socially irresponsible to crash the system in response to a
+module unload request, we need some other way to deal with in-flight RCU
+callbacks.
+
+RCU therefore provides ``rcu_barrier()``, which waits until all
+in-flight RCU callbacks have been invoked. If a module uses
+``call_rcu()``, its exit function should therefore prevent any future
+invocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory,
+the underlying module-unload code could invoke ``rcu_barrier()``
+unconditionally, but in practice this would incur unacceptable
+latencies.
+
+Nikita Danilov noted this requirement for an analogous
+filesystem-unmount situation, and Dipankar Sarma incorporated
+``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module
+unloading became apparent later.
+
+.. important::
+
+ The ``rcu_barrier()`` function is not, repeat,
+ *not*, obligated to wait for a grace period. It is instead only required
+ to wait for RCU callbacks that have already been posted. Therefore, if
+ there are no RCU callbacks posted anywhere in the system,
+ ``rcu_barrier()`` is within its rights to return immediately. Even if
+ there are callbacks posted, ``rcu_barrier()`` does not necessarily need
+ to wait for a grace period.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Wait a minute! Each RCU callbacks must wait for a grace period to |
+| complete, and ``rcu_barrier()`` must wait for each pre-existing |
+| callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to |
+| wait for a full grace period if there is even one callback posted |
+| anywhere in the system? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Absolutely not!!! |
+| Yes, each RCU callbacks must wait for a grace period to complete, but |
+| it might well be partly (or even completely) finished waiting by the |
+| time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()`` |
+| need only wait for the remaining portion of the grace period to |
+| elapse. So even if there are quite a few callbacks posted, |
+| ``rcu_barrier()`` might well return quite quickly. |
+| |
+| So if you need to wait for a grace period as well as for all |
+| pre-existing callbacks, you will need to invoke both |
+| ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, |
+| you can always use workqueues to invoke them concurrently. |
++-----------------------------------------------------------------------+
+
+Hotplug CPU
+~~~~~~~~~~~
+
+The Linux kernel supports CPU hotplug, which means that CPUs can come
+and go. It is of course illegal to use any RCU API member from an
+offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side
+critical sections. This requirement was present from day one in
+DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug
+implementation is “interesting.”
+
+The Linux-kernel CPU-hotplug implementation has notifiers that are used
+to allow the various kernel subsystems (including RCU) to respond
+appropriately to a given CPU-hotplug operation. Most RCU operations may
+be invoked from CPU-hotplug notifiers, including even synchronous
+grace-period operations such as ``synchronize_rcu()`` and
+``synchronize_rcu_expedited()``.
+
+However, all-callback-wait operations such as ``rcu_barrier()`` are also
+not supported, due to the fact that there are phases of CPU-hotplug
+operations where the outgoing CPU's callbacks will not be invoked until
+after the CPU-hotplug operation ends, which could also result in
+deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations
+during its execution, which results in another type of deadlock when
+invoked from a CPU-hotplug notifier.
+
+Scheduler and RCU
+~~~~~~~~~~~~~~~~~
+
+RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
+accumulation by these kthreads. This requirement was no surprise, but
+RCU's violation of it when running context-switch-heavy workloads when
+built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
+[PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__.
+RCU has made good progress towards meeting this requirement, even for
+context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
+room for further improvement.
+
+There is no longer any prohibition against holding any of
+scheduler's runqueue or priority-inheritance spinlocks across an
+``rcu_read_unlock()``, even if interrupts and preemption were enabled
+somewhere within the corresponding RCU read-side critical section.
+Therefore, it is now perfectly legal to execute ``rcu_read_lock()``
+with preemption enabled, acquire one of the scheduler locks, and hold
+that lock across the matching ``rcu_read_unlock()``.
+
+Similarly, the RCU flavor consolidation has removed the need for negative
+nesting. The fact that interrupt-disabled regions of code act as RCU
+read-side critical sections implicitly avoids earlier issues that used
+to result in destructive recursion via interrupt handler's use of RCU.
+
+Tracing and RCU
+~~~~~~~~~~~~~~~
+
+It is possible to use tracing on RCU code, but tracing itself uses RCU.
+For this reason, ``rcu_dereference_raw_check()`` is provided for use
+by tracing, which avoids the destructive recursion that could otherwise
+ensue. This API is also used by virtualization in some architectures,
+where RCU readers execute in environments in which tracing cannot be
+used. The tracing folks both located the requirement and provided the
+needed fix, so this surprise requirement was relatively painless.
+
+Accesses to User Memory and RCU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The kernel needs to access user-space memory, for example, to access data
+referenced by system-call parameters. The ``get_user()`` macro does this job.
+
+However, user-space memory might well be paged out, which means that
+``get_user()`` might well page-fault and thus block while waiting for the
+resulting I/O to complete. It would be a very bad thing for the compiler to
+reorder a ``get_user()`` invocation into an RCU read-side critical section.
+
+For example, suppose that the source code looked like this:
+
+ ::
+
+ 1 rcu_read_lock();
+ 2 p = rcu_dereference(gp);
+ 3 v = p->value;
+ 4 rcu_read_unlock();
+ 5 get_user(user_v, user_p);
+ 6 do_something_with(v, user_v);
+
+The compiler must not be permitted to transform this source code into
+the following:
+
+ ::
+
+ 1 rcu_read_lock();
+ 2 p = rcu_dereference(gp);
+ 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
+ 4 v = p->value;
+ 5 rcu_read_unlock();
+ 6 do_something_with(v, user_v);
+
+If the compiler did make this transformation in a ``CONFIG_PREEMPT=n`` kernel
+build, and if ``get_user()`` did page fault, the result would be a quiescent
+state in the middle of an RCU read-side critical section. This misplaced
+quiescent state could result in line 4 being a use-after-free access,
+which could be bad for your kernel's actuarial statistics. Similar examples
+can be constructed with the call to ``get_user()`` preceding the
+``rcu_read_lock()``.
+
+Unfortunately, ``get_user()`` doesn't have any particular ordering properties,
+and in some architectures the underlying ``asm`` isn't even marked
+``volatile``. And even if it was marked ``volatile``, the above access to
+``p->value`` is not volatile, so the compiler would not have any reason to keep
+those two accesses in order.
+
+Therefore, the Linux-kernel definitions of ``rcu_read_lock()`` and
+``rcu_read_unlock()`` must act as compiler barriers, at least for outermost
+instances of ``rcu_read_lock()`` and ``rcu_read_unlock()`` within a nested set
+of RCU read-side critical sections.
+
+Energy Efficiency
+~~~~~~~~~~~~~~~~~
+
+Interrupting idle CPUs is considered socially unacceptable, especially
+by people with battery-powered embedded systems. RCU therefore conserves
+energy by detecting which CPUs are idle, including tracking CPUs that
+have been interrupted from idle. This is a large part of the
+energy-efficiency requirement, so I learned of this via an irate phone
+call.
+
+Because RCU avoids interrupting idle CPUs, it is illegal to execute an
+RCU read-side critical section on an idle CPU. (Kernels built with
+``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()``
+macro and ``_rcuidle`` event tracing is provided to work around this
+restriction. In addition, ``rcu_is_watching()`` may be used to test
+whether or not it is currently legal to run RCU read-side critical
+sections on this CPU. I learned of the need for diagnostics on the one
+hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code.
+Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite
+heavily in the idle loop. However, there are some restrictions on the
+code placed within ``RCU_NONIDLE()``:
+
+#. Blocking is prohibited. In practice, this is not a serious
+ restriction given that idle tasks are prohibited from blocking to
+ begin with.
+#. Although nesting ``RCU_NONIDLE()`` is permitted, they cannot nest
+ indefinitely deeply. However, given that they can be nested on the
+ order of a million deep, even on 32-bit systems, this should not be a
+ serious restriction. This nesting limit would probably be reached
+ long after the compiler OOMed or the stack overflowed.
+#. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that
+ same ``RCU_NONIDLE()``. For example, the following is grossly
+ illegal:
+
+ ::
+
+ 1 RCU_NONIDLE({
+ 2 do_something();
+ 3 goto bad_idea; /* BUG!!! */
+ 4 do_something_else();});
+ 5 bad_idea:
+
+
+ It is just as illegal to transfer control into the middle of
+ ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in
+ as long as you also transferred out, but in practice you could also
+ expect to get sharply worded review comments.
+
+It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU
+running in userspace. RCU must therefore track ``nohz_full`` userspace
+execution. RCU must therefore be able to sample state at two points in
+time, and be able to determine whether or not some other CPU spent any
+time idle and/or executing in userspace.
+
+These energy-efficiency requirements have proven quite difficult to
+understand and to meet, for example, there have been more than five
+clean-sheet rewrites of RCU's energy-efficiency code, the last of which
+was finally able to demonstrate `real energy savings running on real
+hardware
+[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__.
+As noted earlier, I learned of many of these requirements via angry
+phone calls: Flaming me on the Linux-kernel mailing list was apparently
+not sufficient to fully vent their ire at RCU's energy-efficiency bugs!
+
+Scheduling-Clock Interrupts and RCU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The kernel transitions between in-kernel non-idle execution, userspace
+execution, and the idle loop. Depending on kernel configuration, RCU
+handles these states differently:
+
++-----------------+------------------+------------------+-----------------+
+| ``HZ`` Kconfig | In-Kernel | Usermode | Idle |
++=================+==================+==================+=================+
+| ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on |
+| | scheduling-clock | scheduling-clock | RCU's |
+| | interrupt. | interrupt and | dyntick-idle |
+| | | its detection | detection. |
+| | | of interrupt | |
+| | | from usermode. | |
++-----------------+------------------+------------------+-----------------+
+| ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on |
+| | scheduling-clock | scheduling-clock | RCU's |
+| | interrupt. | interrupt and | dyntick-idle |
+| | | its detection | detection. |
+| | | of interrupt | |
+| | | from usermode. | |
++-----------------+------------------+------------------+-----------------+
+| ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on |
+| | sometimes rely | RCU's | RCU's |
+| | on | dyntick-idle | dyntick-idle |
+| | scheduling-clock | detection. | detection. |
+| | interrupt. In | | |
+| | other cases, it | | |
+| | is necessary to | | |
+| | bound kernel | | |
+| | execution times | | |
+| | and/or use | | |
+| | IPIs. | | |
++-----------------+------------------+------------------+-----------------+
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Why can't ``NO_HZ_FULL`` in-kernel execution rely on the |
+| scheduling-clock interrupt, just like ``HZ_PERIODIC`` and |
+| ``NO_HZ_IDLE`` do? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Because, as a performance optimization, ``NO_HZ_FULL`` does not |
+| necessarily re-enable the scheduling-clock interrupt on entry to each |
+| and every system call. |
++-----------------------------------------------------------------------+
+
+However, RCU must be reliably informed as to whether any given CPU is
+currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that
+CPU is executing in usermode, as discussed
+`earlier <#Energy%20Efficiency>`__. It also requires that the
+scheduling-clock interrupt be enabled when RCU needs it to be:
+
+#. If a CPU is either idle or executing in usermode, and RCU believes it
+ is non-idle, the scheduling-clock tick had better be running.
+ Otherwise, you will get RCU CPU stall warnings. Or at best, very long
+ (11-second) grace periods, with a pointless IPI waking the CPU from
+ time to time.
+#. If a CPU is in a portion of the kernel that executes RCU read-side
+ critical sections, and RCU believes this CPU to be idle, you will get
+ random memory corruption. **DON'T DO THIS!!!**
+ This is one reason to test with lockdep, which will complain about
+ this sort of thing.
+#. If a CPU is in a portion of the kernel that is absolutely positively
+ no-joking guaranteed to never execute any RCU read-side critical
+ sections, and RCU believes this CPU to be idle, no problem. This
+ sort of thing is used by some architectures for light-weight
+ exception handlers, which can then avoid the overhead of
+ ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and
+ exit, respectively. Some go further and avoid the entireties of
+ ``irq_enter()`` and ``irq_exit()``.
+ Just make very sure you are running some of your tests with
+ ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in
+ fact joking about not doing RCU read-side critical sections.
+#. If a CPU is executing in the kernel with the scheduling-clock
+ interrupt disabled and RCU believes this CPU to be non-idle, and if
+ the CPU goes idle (from an RCU perspective) every few jiffies, no
+ problem. It is usually OK for there to be the occasional gap between
+ idle periods of up to a second or so.
+ If the gap grows too long, you get RCU CPU stall warnings.
+#. If a CPU is either idle or executing in usermode, and RCU believes it
+ to be idle, of course no problem.
+#. If a CPU is executing in the kernel, the kernel code path is passing
+ through quiescent states at a reasonable frequency (preferably about
+ once per few jiffies, but the occasional excursion to a second or so
+ is usually OK) and the scheduling-clock interrupt is enabled, of
+ course no problem.
+ If the gap between a successive pair of quiescent states grows too
+ long, you get RCU CPU stall warnings.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But what if my driver has a hardware interrupt handler that can run |
+| for many seconds? I cannot invoke ``schedule()`` from an hardware |
+| interrupt handler, after all! |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| One approach is to do ``rcu_irq_exit();rcu_irq_enter();`` every so |
+| often. But given that long-running interrupt handlers can cause other |
+| problems, not least for response time, shouldn't you work to keep |
+| your interrupt handler's runtime within reasonable bounds? |
++-----------------------------------------------------------------------+
+
+But as long as RCU is properly informed of kernel state transitions
+between in-kernel execution, usermode execution, and idle, and as long
+as the scheduling-clock interrupt is enabled when RCU needs it to be,
+you can rest assured that the bugs you encounter will be in some other
+part of RCU or some other part of the kernel!
+
+Memory Efficiency
+~~~~~~~~~~~~~~~~~
+
+Although small-memory non-realtime systems can simply use Tiny RCU, code
+size is only one aspect of memory efficiency. Another aspect is the size
+of the ``rcu_head`` structure used by ``call_rcu()`` and
+``kfree_rcu()``. Although this structure contains nothing more than a
+pair of pointers, it does appear in many RCU-protected data structures,
+including some that are size critical. The ``page`` structure is a case
+in point, as evidenced by the many occurrences of the ``union`` keyword
+within that structure.
+
+This need for memory efficiency is one reason that RCU uses hand-crafted
+singly linked lists to track the ``rcu_head`` structures that are
+waiting for a grace period to elapse. It is also the reason why
+``rcu_head`` structures do not contain debug information, such as fields
+tracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that
+posted them. Although this information might appear in debug-only kernel
+builds at some point, in the meantime, the ``->func`` field will often
+provide the needed debug information.
+
+However, in some cases, the need for memory efficiency leads to even
+more extreme measures. Returning to the ``page`` structure, the
+``rcu_head`` field shares storage with a great many other structures
+that are used at various points in the corresponding page's lifetime. In
+order to correctly resolve certain `race
+conditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com>`__,
+the Linux kernel's memory-management subsystem needs a particular bit to
+remain zero during all phases of grace-period processing, and that bit
+happens to map to the bottom bit of the ``rcu_head`` structure's
+``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is
+used to post the callback, as opposed to ``kfree_rcu()`` or some future
+“lazy” variant of ``call_rcu()`` that might one day be created for
+energy-efficiency purposes.
+
+That said, there are limits. RCU requires that the ``rcu_head``
+structure be aligned to a two-byte boundary, and passing a misaligned
+``rcu_head`` structure to one of the ``call_rcu()`` family of functions
+will result in a splat. It is therefore necessary to exercise caution
+when packing structures containing fields of type ``rcu_head``. Why not
+a four-byte or even eight-byte alignment requirement? Because the m68k
+architecture provides only two-byte alignment, and thus acts as
+alignment's least common denominator.
+
+The reason for reserving the bottom bit of pointers to ``rcu_head``
+structures is to leave the door open to “lazy” callbacks whose
+invocations can safely be deferred. Deferring invocation could
+potentially have energy-efficiency benefits, but only if the rate of
+non-lazy callbacks decreases significantly for some important workload.
+In the meantime, reserving the bottom bit keeps this option open in case
+it one day becomes useful.
+
+Performance, Scalability, Response Time, and Reliability
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Expanding on the `earlier
+discussion <#Performance%20and%20Scalability>`__, RCU is used heavily by
+hot code paths in performance-critical portions of the Linux kernel's
+networking, security, virtualization, and scheduling code paths. RCU
+must therefore use efficient implementations, especially in its
+read-side primitives. To that end, it would be good if preemptible RCU's
+implementation of ``rcu_read_lock()`` could be inlined, however, doing
+this requires resolving ``#include`` issues with the ``task_struct``
+structure.
+
+The Linux kernel supports hardware configurations with up to 4096 CPUs,
+which means that RCU must be extremely scalable. Algorithms that involve
+frequent acquisitions of global locks or frequent atomic operations on
+global variables simply cannot be tolerated within the RCU
+implementation. RCU therefore makes heavy use of a combining tree based
+on the ``rcu_node`` structure. RCU is required to tolerate all CPUs
+continuously invoking any combination of RCU's runtime primitives with
+minimal per-operation overhead. In fact, in many cases, increasing load
+must *decrease* the per-operation overhead, witness the batching
+optimizations for ``synchronize_rcu()``, ``call_rcu()``,
+``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general
+rule, RCU must cheerfully accept whatever the rest of the Linux kernel
+decides to throw at it.
+
+The Linux kernel is used for real-time workloads, especially in
+conjunction with the `-rt
+patchset <https://rt.wiki.kernel.org/index.php/Main_Page>`__. The
+real-time-latency response requirements are such that the traditional
+approach of disabling preemption across RCU read-side critical sections
+is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use
+an RCU implementation that allows RCU read-side critical sections to be
+preempted. This requirement made its presence known after users made it
+clear that an earlier `real-time
+patch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in
+conjunction with some `RCU
+issues <https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com>`__
+encountered by a very early version of the -rt patchset.
+
+In addition, RCU must make do with a sub-100-microsecond real-time
+latency budget. In fact, on smaller systems with the -rt patchset, the
+Linux kernel provides sub-20-microsecond real-time latencies for the
+whole kernel, including RCU. RCU's scalability and latency must
+therefore be sufficient for these sorts of configurations. To my
+surprise, the sub-100-microsecond real-time latency budget `applies to
+even the largest systems
+[PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__,
+up to and including systems with 4096 CPUs. This real-time requirement
+motivated the grace-period kthread, which also simplified handling of a
+number of race conditions.
+
+RCU must avoid degrading real-time response for CPU-bound threads,
+whether executing in usermode (which is one use case for
+``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
+the kernel must execute ``cond_resched()`` at least once per few tens of
+milliseconds in order to avoid receiving an IPI from RCU.
+
+Finally, RCU's status as a synchronization primitive means that any RCU
+failure can result in arbitrary memory corruption that can be extremely
+difficult to debug. This means that RCU must be extremely reliable,
+which in practice also means that RCU must have an aggressive
+stress-test suite. This stress-test suite is called ``rcutorture``.
+
+Although the need for ``rcutorture`` was no surprise, the current
+immense popularity of the Linux kernel is posing interesting—and perhaps
+unprecedented—validation challenges. To see this, keep in mind that
+there are well over one billion instances of the Linux kernel running
+today, given Android smartphones, Linux-powered televisions, and
+servers. This number can be expected to increase sharply with the advent
+of the celebrated Internet of Things.
+
+Suppose that RCU contains a race condition that manifests on average
+once per million years of runtime. This bug will be occurring about
+three times per *day* across the installed base. RCU could simply hide
+behind hardware error rates, given that no one should really expect
+their smartphone to last for a million years. However, anyone taking too
+much comfort from this thought should consider the fact that in most
+jurisdictions, a successful multi-year test of a given mechanism, which
+might include a Linux kernel, suffices for a number of types of
+safety-critical certifications. In fact, rumor has it that the Linux
+kernel is already being used in production for safety-critical
+applications. I don't know about you, but I would feel quite bad if a
+bug in RCU killed someone. Which might explain my recent focus on
+validation and verification.
+
+Other RCU Flavors
+-----------------
+
+One of the more surprising things about RCU is that there are now no
+fewer than five *flavors*, or API families. In addition, the primary
+flavor that has been the sole focus up to this point has two different
+implementations, non-preemptible and preemptible. The other four flavors
+are listed below, with requirements for each described in a separate
+section.
+
+#. `Bottom-Half Flavor (Historical)`_
+#. `Sched Flavor (Historical)`_
+#. `Sleepable RCU`_
+#. `Tasks RCU`_
+
+Bottom-Half Flavor (Historical)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The RCU-bh flavor of RCU has since been expressed in terms of the other
+RCU flavors as part of a consolidation of the three flavors into a
+single flavor. The read-side API remains, and continues to disable
+softirq and to be accounted for by lockdep. Much of the material in this
+section is therefore strictly historical in nature.
+
+The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
+flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
+flavor of RCU that could withstand the network-based denial-of-service
+attacks researched by Robert Olsson. These attacks placed so much
+networking load on the system that some of the CPUs never exited softirq
+execution, which in turn prevented those CPUs from ever executing a
+context switch, which, in the RCU implementation of that time, prevented
+grace periods from ever ending. The result was an out-of-memory
+condition and a system hang.
+
+The solution was the creation of RCU-bh, which does
+``local_bh_disable()`` across its read-side critical sections, and which
+uses the transition from one type of softirq processing to another as a
+quiescent state in addition to context switch, idle, user mode, and
+offline. This means that RCU-bh grace periods can complete even when
+some of the CPUs execute in softirq indefinitely, thus allowing
+algorithms based on RCU-bh to withstand network-based denial-of-service
+attacks.
+
+Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and
+re-enable softirq handlers, any attempt to start a softirq handlers
+during the RCU-bh read-side critical section will be deferred. In this
+case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can
+take considerable time. One can of course argue that this softirq
+overhead should be associated with the code following the RCU-bh
+read-side critical section rather than ``rcu_read_unlock_bh()``, but the
+fact is that most profiling tools cannot be expected to make this sort
+of fine distinction. For example, suppose that a three-millisecond-long
+RCU-bh read-side critical section executes during a time of heavy
+networking load. There will very likely be an attempt to invoke at least
+one softirq handler during that three milliseconds, but any such
+invocation will be delayed until the time of the
+``rcu_read_unlock_bh()``. This can of course make it appear at first
+glance as if ``rcu_read_unlock_bh()`` was executing very slowly.
+
+The `RCU-bh
+API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
+includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``,
+``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``,
+``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``,
+``call_rcu_bh()``, ``rcu_barrier_bh()``, and
+``rcu_read_lock_bh_held()``. However, the update-side APIs are now
+simple wrappers for other RCU flavors, namely RCU-sched in
+CONFIG_PREEMPT=n kernels and RCU-preempt otherwise.
+
+Sched Flavor (Historical)
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The RCU-sched flavor of RCU has since been expressed in terms of the
+other RCU flavors as part of a consolidation of the three flavors into a
+single flavor. The read-side API remains, and continues to disable
+preemption and to be accounted for by lockdep. Much of the material in
+this section is therefore strictly historical in nature.
+
+Before preemptible RCU, waiting for an RCU grace period had the side
+effect of also waiting for all pre-existing interrupt and NMI handlers.
+However, there are legitimate preemptible-RCU implementations that do
+not have this property, given that any point in the code outside of an
+RCU read-side critical section can be a quiescent state. Therefore,
+*RCU-sched* was created, which follows “classic” RCU in that an
+RCU-sched grace period waits for pre-existing interrupt and NMI
+handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and
+RCU-sched APIs have identical implementations, while kernels built with
+``CONFIG_PREEMPT=y`` provide a separate implementation for each.
+
+Note well that in ``CONFIG_PREEMPT=y`` kernels,
+``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and
+re-enable preemption, respectively. This means that if there was a
+preemption attempt during the RCU-sched read-side critical section,
+``rcu_read_unlock_sched()`` will enter the scheduler, with all the
+latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``,
+this can make it look as if ``rcu_read_unlock_sched()`` was executing
+very slowly. However, the highest-priority task won't be preempted, so
+that task will enjoy low-overhead ``rcu_read_unlock_sched()``
+invocations.
+
+The `RCU-sched
+API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
+includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``,
+``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``,
+``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``,
+``synchronize_sched()``, ``synchronize_rcu_sched_expedited()``,
+``call_rcu_sched()``, ``rcu_barrier_sched()``, and
+``rcu_read_lock_sched_held()``. However, anything that disables
+preemption also marks an RCU-sched read-side critical section, including
+``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and
+``local_irq_restore()``, and so on.
+
+Sleepable RCU
+~~~~~~~~~~~~~
+
+For well over a decade, someone saying “I need to block within an RCU
+read-side critical section” was a reliable indication that this someone
+did not understand RCU. After all, if you are always blocking in an RCU
+read-side critical section, you can probably afford to use a
+higher-overhead synchronization mechanism. However, that changed with
+the advent of the Linux kernel's notifiers, whose RCU read-side critical
+sections almost never sleep, but sometimes need to. This resulted in the
+introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or
+*SRCU*.
+
+SRCU allows different domains to be defined, with each such domain
+defined by an instance of an ``srcu_struct`` structure. A pointer to
+this structure must be passed in to each SRCU function, for example,
+``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct``
+structure. The key benefit of these domains is that a slow SRCU reader
+in one domain does not delay an SRCU grace period in some other domain.
+That said, one consequence of these domains is that read-side code must
+pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for
+example, as follows:
+
+ ::
+
+ 1 int idx;
+ 2
+ 3 idx = srcu_read_lock(&ss);
+ 4 do_something();
+ 5 srcu_read_unlock(&ss, idx);
+
+As noted above, it is legal to block within SRCU read-side critical
+sections, however, with great power comes great responsibility. If you
+block forever in one of a given domain's SRCU read-side critical
+sections, then that domain's grace periods will also be blocked forever.
+Of course, one good way to block forever is to deadlock, which can
+happen if any operation in a given domain's SRCU read-side critical
+section can wait, either directly or indirectly, for that domain's grace
+period to elapse. For example, this results in a self-deadlock:
+
+ ::
+
+ 1 int idx;
+ 2
+ 3 idx = srcu_read_lock(&ss);
+ 4 do_something();
+ 5 synchronize_srcu(&ss);
+ 6 srcu_read_unlock(&ss, idx);
+
+However, if line 5 acquired a mutex that was held across a
+``synchronize_srcu()`` for domain ``ss``, deadlock would still be
+possible. Furthermore, if line 5 acquired a mutex that was held across a
+``synchronize_srcu()`` for some other domain ``ss1``, and if an
+``ss1``-domain SRCU read-side critical section acquired another mutex
+that was held across as ``ss``-domain ``synchronize_srcu()``, deadlock
+would again be possible. Such a deadlock cycle could extend across an
+arbitrarily large number of different SRCU domains. Again, with great
+power comes great responsibility.
+
+Unlike the other RCU flavors, SRCU read-side critical sections can run
+on idle and even offline CPUs. This ability requires that
+``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers,
+which means that SRCU readers will run a bit slower than would RCU
+readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API,
+which, in combination with ``srcu_read_unlock()``, guarantees a full
+memory barrier.
+
+Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be
+invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
+periods make use of timers and the possibility of timers being
+temporarily “stranded” on the outgoing CPU. This stranding of timers
+means that timers posted to the outgoing CPU will not fire until late in
+the CPU-hotplug process. The problem is that if a notifier is waiting on
+an SRCU grace period, that grace period is waiting on a timer, and that
+timer is stranded on the outgoing CPU, then the notifier will never be
+awakened, in other words, deadlock has occurred. This same situation of
+course also prohibits ``srcu_barrier()`` from being invoked from
+CPU-hotplug notifiers.
+
+SRCU also differs from other RCU flavors in that SRCU's expedited and
+non-expedited grace periods are implemented by the same mechanism. This
+means that in the current SRCU implementation, expediting a future grace
+period has the side effect of expediting all prior grace periods that
+have not yet completed. (But please note that this is a property of the
+current implementation, not necessarily of future implementations.) In
+addition, if SRCU has been idle for longer than the interval specified
+by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds
+by default), and if a ``synchronize_srcu()`` invocation ends this idle
+period, that invocation will be automatically expedited.
+
+As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
+locking bottleneck present in prior kernel versions. Although this will
+allow users to put much heavier stress on ``call_srcu()``, it is
+important to note that SRCU does not yet take any special steps to deal
+with callback flooding. So if you are posting (say) 10,000 SRCU
+callbacks per second per CPU, you are probably totally OK, but if you
+intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please
+run some tests first. SRCU just might need a few adjustment to deal with
+that sort of load. Of course, your mileage may vary based on the speed
+of your CPUs and the size of your memory.
+
+The `SRCU
+API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
+includes ``srcu_read_lock()``, ``srcu_read_unlock()``,
+``srcu_dereference()``, ``srcu_dereference_check()``,
+``synchronize_srcu()``, ``synchronize_srcu_expedited()``,
+``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It
+also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and
+``init_srcu_struct()`` APIs for defining and initializing
+``srcu_struct`` structures.
+
+Tasks RCU
+~~~~~~~~~
+
+Some forms of tracing use “trampolines” to handle the binary rewriting
+required to install different types of probes. It would be good to be
+able to free old trampolines, which sounds like a job for some form of
+RCU. However, because it is necessary to be able to install a trace
+anywhere in the code, it is not possible to use read-side markers such
+as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does
+not work to have these markers in the trampoline itself, because there
+would need to be instructions following ``rcu_read_unlock()``. Although
+``synchronize_rcu()`` would guarantee that execution reached the
+``rcu_read_unlock()``, it would not be able to guarantee that execution
+had completely left the trampoline. Worse yet, in some situations
+the trampoline's protection must extend a few instructions *prior* to
+execution reaching the trampoline. For example, these few instructions
+might calculate the address of the trampoline, so that entering the
+trampoline would be pre-ordained a surprisingly long time before execution
+actually reached the trampoline itself.
+
+The solution, in the form of `Tasks
+RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
+critical sections that are delimited by voluntary context switches, that
+is, calls to ``schedule()``, ``cond_resched()``, and
+``synchronize_rcu_tasks()``. In addition, transitions to and from
+userspace execution also delimit tasks-RCU read-side critical sections.
+
+The tasks-RCU API is quite compact, consisting only of
+``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and
+``rcu_barrier_tasks()``. In ``CONFIG_PREEMPT=n`` kernels, trampolines
+cannot be preempted, so these APIs map to ``call_rcu()``,
+``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In
+``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these
+three APIs are therefore implemented by separate functions that check
+for voluntary context switches.
+
+Possible Future Changes
+-----------------------
+
+One of the tricks that RCU uses to attain update-side scalability is to
+increase grace-period latency with increasing numbers of CPUs. If this
+becomes a serious problem, it will be necessary to rework the
+grace-period state machine so as to avoid the need for the additional
+latency.
+
+RCU disables CPU hotplug in a few places, perhaps most notably in the
+``rcu_barrier()`` operations. If there is a strong reason to use
+``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to
+avoid disabling CPU hotplug. This would introduce some complexity, so
+there had better be a *very* good reason.
+
+The tradeoff between grace-period latency on the one hand and
+interruptions of other CPUs on the other hand may need to be
+re-examined. The desire is of course for zero grace-period latency as
+well as zero interprocessor interrupts undertaken during an expedited
+grace period operation. While this ideal is unlikely to be achievable,
+it is quite possible that further improvements can be made.
+
+The multiprocessor implementations of RCU use a combining tree that
+groups CPUs so as to reduce lock contention and increase cache locality.
+However, this combining tree does not spread its memory across NUMA
+nodes nor does it align the CPU groups with hardware features such as
+sockets or cores. Such spreading and alignment is currently believed to
+be unnecessary because the hotpath read-side primitives do not access
+the combining tree, nor does ``call_rcu()`` in the common case. If you
+believe that your architecture needs such spreading and alignment, then
+your architecture should also benefit from the
+``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the
+number of CPUs in a socket, NUMA node, or whatever. If the number of
+CPUs is too large, use a fraction of the number of CPUs. If the number
+of CPUs is a large prime number, well, that certainly is an
+“interesting” architectural choice! More flexible arrangements might be
+considered, but only if ``rcutree.rcu_fanout_leaf`` has proven
+inadequate, and only if the inadequacy has been demonstrated by a
+carefully run and realistic system-level workload.
+
+Please note that arrangements that require RCU to remap CPU numbers will
+require extremely good demonstration of need and full exploration of
+alternatives.
+
+RCU's various kthreads are reasonably recent additions. It is quite
+likely that adjustments will be required to more gracefully handle
+extreme loads. It might also be necessary to be able to relate CPU
+utilization by RCU's kthreads and softirq handlers to the code that
+instigated this CPU utilization. For example, RCU callback overhead
+might be charged back to the originating ``call_rcu()`` instance, though
+probably not in production kernels.
+
+Additional work may be required to provide reasonable forward-progress
+guarantees under heavy load for grace periods and for callback
+invocation.
+
+Summary
+-------
+
+This document has presented more than two decade's worth of RCU
+requirements. Given that the requirements keep changing, this will not
+be the last word on this subject, but at least it serves to get an
+important subset of the requirements set forth.
+
+Acknowledgments
+---------------
+
+I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg
+Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy
+Lutomirski for their help in rendering this article human readable, and
+to Michelle Rankin for her support of this effort. Other contributions
+are acknowledged in the Linux kernel's git archive.