From 76cb841cb886eef6b3bee341a2266c76578724ad Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Mon, 6 May 2024 03:02:30 +0200
Subject: Adding upstream version 4.19.249.

Signed-off-by: Daniel Baumann
---
 .../Design/Data-Structures/BigTreeClassicRCU.svg   |  474 ++
 .../Design/Data-Structures/BigTreeClassicRCUBH.svg |  499 ++
 .../Data-Structures/BigTreeClassicRCUBHdyntick.svg |  695 +++
 .../Data-Structures/BigTreePreemptRCUBHdyntick.svg |  741 +++
 .../BigTreePreemptRCUBHdyntickCB.svg               |  858 ++++
 .../Design/Data-Structures/Data-Structures.html    | 1462 ++++++
 .../Design/Data-Structures/HugeTreeClassicRCU.svg  |  939 ++++
 .../RCU/Design/Data-Structures/TreeLevel.svg       |  828 ++++
 .../RCU/Design/Data-Structures/TreeMapping.svg     |  305 ++
 .../Design/Data-Structures/TreeMappingLevel.svg    |  380 ++
 .../RCU/Design/Data-Structures/blkd_task.svg       |  843 ++++
 .../RCU/Design/Data-Structures/nxtlist.svg         |  386 ++
 .../Design/Expedited-Grace-Periods/ExpRCUFlow.svg  |  830 ++++
 .../Expedited-Grace-Periods/ExpSchedFlow.svg       |  826 ++++
 .../Expedited-Grace-Periods.html                   |  667 +++
 .../RCU/Design/Expedited-Grace-Periods/Funnel0.svg |  275 ++
 .../RCU/Design/Expedited-Grace-Periods/Funnel1.svg |  275 ++
 .../RCU/Design/Expedited-Grace-Periods/Funnel2.svg |  287 ++
 .../RCU/Design/Expedited-Grace-Periods/Funnel3.svg |  323 ++
 .../RCU/Design/Expedited-Grace-Periods/Funnel4.svg |  323 ++
 .../RCU/Design/Expedited-Grace-Periods/Funnel5.svg |  335 ++
 .../RCU/Design/Expedited-Grace-Periods/Funnel6.svg |  335 ++
 .../RCU/Design/Expedited-Grace-Periods/Funnel7.svg |  347 ++
 .../RCU/Design/Expedited-Grace-Periods/Funnel8.svg |  311 ++
 .../Design/Memory-Ordering/Tree-RCU-Diagram.html   |    9 +
 .../Memory-Ordering/Tree-RCU-Memory-Ordering.html  |  705 +++
 .../TreeRCU-callback-invocation.svg                |  486 ++
 .../Memory-Ordering/TreeRCU-callback-registry.svg  |  655 +++
 .../RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg |  700 +++
 .../Design/Memory-Ordering/TreeRCU-gp-cleanup.svg  | 1133 +++++
 .../RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg  | 1309 +++++
 .../Design/Memory-Ordering/TreeRCU-gp-init-1.svg   |  658 +++
 .../Design/Memory-Ordering/TreeRCU-gp-init-2.svg   |  656 +++
 .../Design/Memory-Ordering/TreeRCU-gp-init-3.svg   |  636 +++
 .../RCU/Design/Memory-Ordering/TreeRCU-gp.svg      | 5144 ++++++++++++++++++++
 .../RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg |  775 +++
 .../RCU/Design/Memory-Ordering/TreeRCU-qs.svg      | 1095 +++++
 .../RCU/Design/Memory-Ordering/rcu_node-lock.svg   |  229 +
 .../Design/Requirements/GPpartitionReaders1.svg    |  374 ++
 .../Design/Requirements/ReadersPartitionGP1.svg    |  639 +++
 .../RCU/Design/Requirements/Requirements.html      | 3324 +++++++++++++
 41 files changed, 32071 insertions(+)

diff --git a/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg
new file mode 100644
index 000000000..727e270b1
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg
@@ -0,0 +1,474 @@
[SVG markup omitted: diagram of a struct rcu_state containing a tree of struct rcu_node structures, with struct rcu_data structures for CPU 0, CPU 15, CPU 1007, and CPU 1023 attached to the leaves.]

diff --git a/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBH.svg b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBH.svg
new file mode 100644
index 000000000..9bbb1944f
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBH.svg
@@ -0,0 +1,499 @@
[SVG markup omitted: diagram of the rcu_sched and rcu_bh rcu_state structures, each with its own rcu_node tree and per-CPU rcu_data structures.]

diff --git a/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBHdyntick.svg b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBHdyntick.svg
new file mode 100644
index 000000000..21ba78234
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCUBHdyntick.svg
@@ -0,0 +1,695 @@
[SVG markup omitted: as above, plus the per-CPU struct rcu_dynticks structures shared by the rcu_sched and rcu_bh rcu_data structures.]

diff --git a/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntick.svg b/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntick.svg
new file mode 100644
index 000000000..15adcac03
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntick.svg
@@ -0,0 +1,741 @@
[SVG markup omitted: as above, with the rcu_preempt flavor added alongside rcu_sched and rcu_bh.]

diff --git a/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg b/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg
new file mode 100644
index 000000000..bbc380147
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg
@@ -0,0 +1,858 @@
[SVG markup omitted: as above, with lists of struct rcu_head callbacks queued on the rcu_data structures.]

diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html
new file mode 100644
index 000000000..f5120a00f
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html
@@ -0,0 +1,1462 @@
A Tour Through TREE_RCU's Data Structures [LWN.net]

December 18, 2016

+

This article was contributed by Paul E. McKenney

+ +

Introduction

+ +This document describes RCU's major data structures and their relationship +to each other. + +
    +
  1. + Data-Structure Relationships +
  2. + The rcu_state Structure +
  3. + The rcu_node Structure +
  4. + The rcu_segcblist Structure +
  5. + The rcu_data Structure +
  6. + The rcu_dynticks Structure +
  7. + The rcu_head Structure +
  8. + RCU-Specific Fields in the task_struct Structure +
  9. + Accessor Functions +
+ +

Data-Structure Relationships

+ +

RCU is for all intents and purposes a large state machine, and its +data structures maintain the state in such a way as to allow RCU readers +to execute extremely quickly, while also processing the RCU grace periods +requested by updaters in an efficient and extremely scalable fashion. +The efficiency and scalability of RCU updaters is provided primarily +by a combining tree, as shown below: + +

BigTreeClassicRCU.svg + +

This diagram shows an enclosing rcu_state structure containing a tree of rcu_node structures.
Each leaf node of the rcu_node tree has up to 16 rcu_data structures associated with it, so that
there are NR_CPUS rcu_data structures, one for each possible CPU.
This structure is adjusted at boot time, if needed, to handle the common case where nr_cpu_ids
is much less than NR_CPUS.
For example, a number of Linux distributions set NR_CPUS=4096, which results in a three-level
rcu_node tree.
If the actual hardware has only 16 CPUs, RCU will adjust itself at boot time, resulting in an
rcu_node tree with only a single node.

The purpose of this combining tree is to allow per-CPU events +such as quiescent states, dyntick-idle transitions, +and CPU hotplug operations to be processed efficiently +and scalably. +Quiescent states are recorded by the per-CPU rcu_data structures, +and other events are recorded by the leaf-level rcu_node +structures. +All of these events are combined at each level of the tree until finally +grace periods are completed at the tree's root rcu_node +structure. +A grace period can be completed at the root once every CPU +(or, in the case of CONFIG_PREEMPT_RCU, task) +has passed through a quiescent state. +Once a grace period has completed, record of that fact is propagated +back down the tree. + +

As can be seen from the diagram, on a 64-bit system +a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout +of 64 at the root and a fanout of 16 at the leaves. + + + + + + + + +
 
Quick Quiz:
+ Why isn't the fanout at the leaves also 64? +
Answer:
+ Because there are more types of events that affect the leaf-level
 rcu_node structures than further up the tree.
 Therefore, if the leaf rcu_node structures have a fanout of 64,
 the contention on these structures' ->lock fields becomes excessive.
 Experimentation on a wide variety of systems has shown that a fanout
 of 16 works well for the leaves of the rcu_node tree.

Of course, further experience with + systems having hundreds or thousands of CPUs may demonstrate + that the fanout for the non-leaf rcu_node structures + must also be reduced. + Such reduction can be easily carried out when and if it proves + necessary. + In the meantime, if you are using such a system and running into + contention problems on the non-leaf rcu_node structures, + you may use the CONFIG_RCU_FANOUT kernel configuration + parameter to reduce the non-leaf fanout as needed. + + +

Kernels built for systems with + strong NUMA characteristics might also need to adjust + CONFIG_RCU_FANOUT so that the domains of the + rcu_node structures align with hardware boundaries. + However, there has thus far been no need for this. +

 
+ +

If your system has more than 1,024 CPUs (or more than 512 CPUs on +a 32-bit system), then RCU will automatically add more levels to the +tree. +For example, if you are crazy enough to build a 64-bit system with 65,536 +CPUs, RCU would configure the rcu_node tree as follows: + +

HugeTreeClassicRCU.svg + +

RCU currently permits up to a four-level tree, which on a 64-bit system +accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for +32-bit systems. +On the other hand, you can set CONFIG_RCU_FANOUT to be +as small as 2 if you wish, which would permit only 16 CPUs, which +is useful for testing. + +
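As a sanity check on these numbers, the capacity of a tree is simply the leaf fanout multiplied
by the interior fanout once for each additional level.
The following standalone program (an illustration of the arithmetic only, not kernel code; all
names in it are made up) reproduces the figures quoted above for a 64-bit system's default fanouts:

  #include <stdio.h>

  #define FANOUT      64  /* Interior-node fanout (64-bit default). */
  #define FANOUT_LEAF 16  /* Leaf-node fanout (default). */

  /* Maximum number of CPUs covered by a tree with this many levels. */
  static unsigned long max_cpus(int levels)
  {
          unsigned long n = FANOUT_LEAF;

          while (--levels > 0)
                  n *= FANOUT;
          return n;
  }

  int main(void)
  {
          printf("two levels:   %lu CPUs\n", max_cpus(2)); /* 1,024 */
          printf("three levels: %lu CPUs\n", max_cpus(3)); /* 65,536 */
          printf("four levels:  %lu CPUs\n", max_cpus(4)); /* 4,194,304 */
          return 0;
  }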

This multi-level combining tree allows us to get most of the +performance and scalability +benefits of partitioning, even though RCU grace-period detection is +inherently a global operation. +The trick here is that only the last CPU to report a quiescent state +into a given rcu_node structure need advance to the rcu_node +structure at the next level up the tree. +This means that at the leaf-level rcu_node structure, only +one access out of sixteen will progress up the tree. +For the internal rcu_node structures, the situation is even +more extreme: Only one access out of sixty-four will progress up +the tree. +Because the vast majority of the CPUs do not progress up the tree, +the lock contention remains roughly constant up the tree. +No matter how many CPUs there are in the system, at most 64 quiescent-state +reports per grace period will progress all the way to the root +rcu_node structure, thus ensuring that the lock contention +on that root rcu_node structure remains acceptably low. + +

In effect, the combining tree acts like a big shock absorber, +keeping lock contention under control at all tree levels regardless +of the level of loading on the system. + +
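For readers who prefer code to metaphors, the following greatly simplified sketch shows how a
quiescent-state report climbs the tree.
The kernel's real code (rcu_report_qs_rnp() and friends) must also acquire each rcu_node
structure's ->lock and verify grace-period numbers, so treat this purely as an illustration of
the funneling:

  /* Illustrative sketch only: report the quiescent state(s) in "mask"
   * to leaf rcu_node "rnp", pushing the report toward the root only
   * when the last outstanding bit at the current level is cleared. */
  static void report_qs_sketch(struct rcu_node *rnp, unsigned long mask)
  {
          while (rnp) {
                  rnp->qsmask &= ~mask;
                  if (rnp->qsmask)
                          return;         /* Other children still owe reports. */
                  mask = rnp->grpmask;    /* This node's bit in its parent. */
                  rnp = rnp->parent;
          }
          /* All bits clear at the root: the grace period can end. */
  }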

The Linux kernel actually supports multiple flavors of RCU +running concurrently, so RCU builds separate data structures for each +flavor. +For example, for CONFIG_TREE_RCU=y kernels, RCU provides +rcu_sched and rcu_bh, as shown below: + +

BigTreeClassicRCUBH.svg + +

Energy efficiency is increasingly important, and for that +reason the Linux kernel provides CONFIG_NO_HZ_IDLE, which +turns off the scheduling-clock interrupts on idle CPUs, which in +turn allows those CPUs to attain deeper sleep states and to consume +less energy. +CPUs whose scheduling-clock interrupts have been turned off are +said to be in dyntick-idle mode. +RCU must handle dyntick-idle CPUs specially +because RCU would otherwise wake up each CPU on every grace period, +which would defeat the whole purpose of CONFIG_NO_HZ_IDLE. +RCU uses the rcu_dynticks structure to track +which CPUs are in dyntick idle mode, as shown below: + +

BigTreeClassicRCUBHdyntick.svg + +

However, if a CPU is in dyntick-idle mode, it is in that mode +for all flavors of RCU. +Therefore, a single rcu_dynticks structure is allocated per +CPU, and all of a given CPU's rcu_data structures share +that rcu_dynticks, as shown in the figure. + +

Kernels built with CONFIG_PREEMPT_RCU support +rcu_preempt in addition to rcu_sched and rcu_bh, as shown below: + +

BigTreePreemptRCUBHdyntick.svg + +

RCU updaters wait for normal grace periods by registering RCU callbacks, either directly via
call_rcu() and friends (namely call_rcu_bh() and call_rcu_sched(), there being a separate
interface per flavor of RCU) or indirectly via synchronize_rcu() and friends.
RCU callbacks are represented by rcu_head structures, which are queued on rcu_data structures
while they are waiting for a grace period to elapse, as shown in the following figure:

BigTreePreemptRCUBHdyntickCB.svg + +

This figure shows how TREE_RCU's and +PREEMPT_RCU's major data structures are related. +Lesser data structures will be introduced with the algorithms that +make use of them. + +

Note that each of the data structures in the above figure has +its own synchronization: + +

    +
  1. Each rcu_state structure has a lock and a mutex,
 and some fields are protected by the corresponding root
 rcu_node structure's lock.
  2. Each rcu_node structure has a spinlock. +
  3. The fields in rcu_data are private to the corresponding + CPU, although a few can be read and written by other CPUs. +
  4. Similarly, the fields in rcu_dynticks are private + to the corresponding CPU, although a few can be read by + other CPUs. +
+ +

It is important to note that different data structures can have +very different ideas about the state of RCU at any given time. +For but one example, awareness of the start or end of a given RCU +grace period propagates slowly through the data structures. +This slow propagation is absolutely necessary for RCU to have good +read-side performance. +If this balkanized implementation seems foreign to you, one useful +trick is to consider each instance of these data structures to be +a different person, each having the usual slightly different +view of reality. + +

The general role of each of these data structures is as +follows: + +

    +
  1. rcu_state:
 This structure forms the interconnection between the
 rcu_node and rcu_data structures,
 tracks grace periods, serves as a short-term repository
 for callbacks orphaned by CPU-hotplug events,
 maintains rcu_barrier() state,
 tracks expedited grace-period state,
 and maintains state used to force quiescent states when
 grace periods extend too long.
  2. rcu_node: This structure forms the combining + tree that propagates quiescent-state + information from the leaves to the root, and also propagates + grace-period information from the root to the leaves. + It provides local copies of the grace-period state in order + to allow this information to be accessed in a synchronized + manner without suffering the scalability limitations that + would otherwise be imposed by global locking. + In CONFIG_PREEMPT_RCU kernels, it manages the lists + of tasks that have blocked while in their current + RCU read-side critical section. + In CONFIG_PREEMPT_RCU with + CONFIG_RCU_BOOST, it manages the + per-rcu_node priority-boosting + kernel threads (kthreads) and state. + Finally, it records CPU-hotplug state in order to determine + which CPUs should be ignored during a given grace period. +
  3. rcu_data: This per-CPU structure is the + focus of quiescent-state detection and RCU callback queuing. + It also tracks its relationship to the corresponding leaf + rcu_node structure to allow more-efficient + propagation of quiescent states up the rcu_node + combining tree. + Like the rcu_node structure, it provides a local + copy of the grace-period information to allow for-free + synchronized + access to this information from the corresponding CPU. + Finally, this structure records past dyntick-idle state + for the corresponding CPU and also tracks statistics. +
  4. rcu_dynticks: + This per-CPU structure tracks the current dyntick-idle + state for the corresponding CPU. + Unlike the other three structures, the rcu_dynticks + structure is not replicated per RCU flavor. +
  5. rcu_head: + This structure represents RCU callbacks, and is the + only structure allocated and managed by RCU users. + The rcu_head structure is normally embedded + within the RCU-protected data structure. +
+ +

If all you wanted from this article was a general notion of how +RCU's data structures are related, you are done. +Otherwise, each of the following sections give more details on +the rcu_state, rcu_node, rcu_data, +and rcu_dynticks data structures. + +

+The rcu_state Structure

+ +

The rcu_state structure is the base structure that represents a flavor of RCU.
This structure forms the interconnection between the rcu_node and rcu_data structures,
tracks grace periods, contains the lock used to synchronize with CPU-hotplug events,
and maintains state used to force quiescent states when grace periods extend too long.

A few of the rcu_state structure's fields are discussed, +singly and in groups, in the following sections. +The more specialized fields are covered in the discussion of their +use. + +

Relationship to rcu_node and rcu_data Structures
+ +This portion of the rcu_state structure is declared +as follows: + +
+  1   struct rcu_node node[NUM_RCU_NODES];
+  2   struct rcu_node *level[RCU_NUM_LVLS + 1];
+  3   struct rcu_data __percpu *rda;
+
+ + + + + + + + +
 
Quick Quiz:
+ Wait a minute! + You said that the rcu_node structures formed a tree, + but they are declared as a flat array! + What gives? +
Answer:
+ The tree is laid out in the array.
 The first node in the array is the head, the next set of nodes in the
 array are children of the head node, and so on until the last set of
 nodes in the array are the leaves.

See the following diagrams to see how + this works. +

 
+ +

The rcu_node tree is embedded into the +->node[] array as shown in the following figure: + +

TreeMapping.svg + +

One interesting consequence of this mapping is that a +breadth-first traversal of the tree is implemented as a simple +linear scan of the array, which is in fact what the +rcu_for_each_node_breadth_first() macro does. +This macro is used at the beginning and ends of grace periods. + +
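Because the tree lives in a single contiguous array, such a breadth-first traversal is nothing
more than a pointer walk.
The following is a sketch in the spirit of, but not identical to, the kernel's macro, together
with a hypothetical use (the real grace-period code also takes each node's ->lock):

  /* Sketch: visit every rcu_node, root first, leaves last. */
  #define for_each_rcu_node_bf(rsp, rnp) \
          for ((rnp) = &(rsp)->node[0]; \
               (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)

  /* Example: push the new grace-period number down the tree. */
  static void propagate_gp_seq_sketch(struct rcu_state *rsp)
  {
          struct rcu_node *rnp;

          for_each_rcu_node_bf(rsp, rnp)
                  rnp->gp_seq = rsp->gp_seq;
  }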

Each entry of the ->level array references +the first rcu_node structure on the corresponding level +of the tree, for example, as shown below: + +

TreeMappingLevel.svg + +

The zeroth element of the array references the root +rcu_node structure, the first element references the +first child of the root rcu_node, and finally the second +element references the first leaf rcu_node structure. + +

For whatever it is worth, if you draw the tree to be tree-shaped +rather than array-shaped, it is easy to draw a planar representation: + +

TreeLevel.svg + +

Finally, the ->rda field references a per-CPU +pointer to the corresponding CPU's rcu_data structure. + +

All of these fields are constant once initialization is complete, +and therefore need no protection. + +

Grace-Period Tracking
+ +

This portion of the rcu_state structure is declared +as follows: + +

+  1   unsigned long gp_seq;
+
+ +

RCU grace periods are numbered, and the ->gp_seq field contains the current grace-period
sequence number.
The bottom two bits are the state of the current grace period, which can be zero for not yet
started or one for in progress.
In other words, if the bottom two bits of ->gp_seq are zero, the corresponding flavor of RCU is
idle, and if they are one, a grace period is in progress.
Any other value in the bottom two bits indicates that something is broken.
This field is protected by the root rcu_node structure's ->lock field.

There are ->gp_seq fields +in the rcu_node and rcu_data structures +as well. +The fields in the rcu_state structure represent the +most current value, and those of the other structures are compared +in order to detect the beginnings and ends of grace periods in a distributed +fashion. +The values flow from rcu_state to rcu_node +(down the tree from the root to the leaves) to rcu_data. + +
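Because the low-order bits of ->gp_seq carry state and the upper bits carry the count, code that
compares these fields typically goes through small helpers rather than raw arithmetic.
The following is a hedged sketch of such helpers (the kernel keeps similar ones in
kernel/rcu/rcu.h, but the names below are illustrative):

  #define SEQ_CTR_SHIFT   2
  #define SEQ_STATE_MASK  ((1UL << SEQ_CTR_SHIFT) - 1)

  /* Number of grace periods, ignoring the state bits. */
  static unsigned long seq_ctr(unsigned long s)
  {
          return s >> SEQ_CTR_SHIFT;
  }

  /* Bottom bits: zero means idle, one means a grace period is in progress. */
  static unsigned long seq_state(unsigned long s)
  {
          return s & SEQ_STATE_MASK;
  }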

Miscellaneous
+ +

This portion of the rcu_state structure is declared +as follows: + +

+  1   unsigned long gp_max;
+  2   char abbr;
+  3   char *name;
+
+ +

The ->gp_max field tracks the duration of the longest +grace period in jiffies. +It is protected by the root rcu_node's ->lock. + +

The ->name field points to the name of the RCU flavor +(for example, “rcu_sched”), and is constant. +The ->abbr field contains a one-character abbreviation, +for example, “s” for RCU-sched. + +

+The rcu_node Structure

+ +

The rcu_node structures form the combining tree that propagates quiescent-state information from
the leaves to the root and that also propagates grace-period information from the root down to
the leaves.
They provide local copies of the grace-period state in order to allow this information to be
accessed in a synchronized manner without suffering the scalability limitations that would
otherwise be imposed by global locking.
In CONFIG_PREEMPT_RCU kernels, they manage the lists of tasks that have blocked while in their
current RCU read-side critical section.
In CONFIG_PREEMPT_RCU with CONFIG_RCU_BOOST, they manage the per-rcu_node priority-boosting
kernel threads (kthreads) and state.
Finally, they record CPU-hotplug state in order to determine which CPUs should be ignored during
a given grace period.

The rcu_node structure's fields are discussed, +singly and in groups, in the following sections. + +

Connection to Combining Tree
+ +

This portion of the rcu_node structure is declared +as follows: + +

+  1   struct rcu_node *parent;
+  2   u8 level;
+  3   u8 grpnum;
+  4   unsigned long grpmask;
+  5   int grplo;
+  6   int grphi;
+
+ +

The ->parent pointer references the rcu_node +one level up in the tree, and is NULL for the root +rcu_node. +The RCU implementation makes heavy use of this field to push quiescent +states up the tree. +The ->level field gives the level in the tree, with +the root being at level zero, its children at level one, and so on. +The ->grpnum field gives this node's position within +the children of its parent, so this number can range between 0 and 31 +on 32-bit systems and between 0 and 63 on 64-bit systems. +The ->level and ->grpnum fields are +used only during initialization and for tracing. +The ->grpmask field is the bitmask counterpart of +->grpnum, and therefore always has exactly one bit set. +This mask is used to clear the bit corresponding to this rcu_node +structure in its parent's bitmasks, which are described later. +Finally, the ->grplo and ->grphi fields +contain the lowest and highest numbered CPU served by this +rcu_node structure, respectively. + +

All of these fields are constant, and thus do not require any +synchronization. + +
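The way these fields fit together can be seen in a short initialization sketch.
This is not the kernel's setup code, just an illustration of the invariant that ->grpmask is
always the single bit corresponding to ->grpnum in the parent's masks:

  /* Sketch: record where child "rnp" sits beneath "parent". */
  static void link_child_sketch(struct rcu_node *rnp,
                                struct rcu_node *parent, int grpnum)
  {
          rnp->parent  = parent;
          rnp->grpnum  = grpnum;          /* Index among siblings. */
          rnp->grpmask = 1UL << grpnum;   /* Exactly one bit set. */
  }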

Synchronization
+ +

This field of the rcu_node structure is declared +as follows: + +

+  1   raw_spinlock_t lock;
+
+ +

This field is used to protect the remaining fields in this structure, +unless otherwise stated. +That said, all of the fields in this structure can be accessed without +locking for tracing purposes. +Yes, this can result in confusing traces, but better some tracing confusion +than to be heisenbugged out of existence. + +

Grace-Period Tracking
+ +

This portion of the rcu_node structure is declared +as follows: + +

+  1   unsigned long gp_seq;
+  2   unsigned long gp_seq_needed;
+
+ +

The rcu_node structures' ->gp_seq fields are the counterparts of the field of the same name in
the rcu_state structure.
They each may lag up to one step behind their rcu_state counterpart.
If the bottom two bits of a given rcu_node structure's ->gp_seq field are zero, then this
rcu_node structure believes that RCU is idle.

The ->gp_seq field of each rcu_node structure is updated at the beginning and the end of each
grace period.

The ->gp_seq_needed fields record the +furthest-in-the-future grace period request seen by the corresponding +rcu_node structure. The request is considered fulfilled when +the value of the ->gp_seq field equals or exceeds that of +the ->gp_seq_needed field. + + + + + + + + +
 
Quick Quiz:
+ Suppose that this rcu_node structure doesn't see + a request for a very long time. + Won't wrapping of the ->gp_seq field cause + problems? +
Answer:
+ No, because if the ->gp_seq_needed field lags behind the + ->gp_seq field, the ->gp_seq_needed field + will be updated at the end of the grace period. + Modulo-arithmetic comparisons therefore will always get the + correct answer, even with wrapping. +
 
+ +
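The modulo-arithmetic comparison mentioned in the answer boils down to unsigned subtraction, as
in this sketch (the kernel uses similar ULONG_CMP_GE()/ULONG_CMP_LT() macros; the code below is
illustrative rather than a copy):

  /* Wrap-safe "a is at or beyond b" for sequence numbers: true as long
   * as a has not fallen more than half the counter space behind b. */
  #define seq_cmp_ge(a, b)  ((~0UL) / 2 >= (a) - (b))

  /* A grace-period request is fulfilled once ->gp_seq catches up. */
  static bool gp_request_fulfilled_sketch(struct rcu_node *rnp)
  {
          return seq_cmp_ge(rnp->gp_seq, rnp->gp_seq_needed);
  }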

Quiescent-State Tracking
+ +

These fields manage the propagation of quiescent states up the +combining tree. + +

This portion of the rcu_node structure has fields +as follows: + +

+  1   unsigned long qsmask;
+  2   unsigned long expmask;
+  3   unsigned long qsmaskinit;
+  4   unsigned long expmaskinit;
+
+ +

The ->qsmask field tracks which of this rcu_node structure's children still need to report
quiescent states for the current normal grace period.
Such children will have a value of 1 in their corresponding bit.
Note that the leaf rcu_node structures should be thought of as having rcu_data structures as
their children.
Similarly, the ->expmask field tracks which of this rcu_node structure's children still need to
report quiescent states for the current expedited grace period.
An expedited grace period has the same conceptual properties as a normal grace period, but the
expedited implementation accepts extreme CPU overhead to obtain much lower grace-period latency,
for example, consuming a few tens of microseconds worth of CPU time to reduce grace-period
duration from milliseconds to tens of microseconds.
The ->qsmaskinit field tracks which of this rcu_node structure's children cover for at least one
online CPU.
This mask is used to initialize ->qsmask, and ->expmaskinit is used to initialize ->expmask at
the beginning of the normal and expedited grace periods, respectively.
 
Quick Quiz:
+ Why are these bitmasks protected by locking? + Come on, haven't you heard of atomic instructions??? +
Answer:
+ Lockless grace-period computation! Such a tantalizing possibility! + + +

But consider the following sequence of events: + + +

    +
  1. CPU 0 has been in dyntick-idle + mode for quite some time. + When it wakes up, it notices that the current RCU + grace period needs it to report in, so it sets a + flag where the scheduling clock interrupt will find it. +

    +

  2. Meanwhile, CPU 1 is running + force_quiescent_state(), + and notices that CPU 0 has been in dyntick idle mode, + which qualifies as an extended quiescent state. +

    +

  3. CPU 0's scheduling clock + interrupt fires in the + middle of an RCU read-side critical section, and notices + that the RCU core needs something, so commences RCU softirq + processing. + +

    +

  4. CPU 0's softirq handler + executes and is just about ready + to report its quiescent state up the rcu_node + tree. +

    +

  5. But CPU 1 beats it to the punch, + completing the current + grace period and starting a new one. +

    +

  6. CPU 0 now reports its quiescent + state for the wrong + grace period. + That grace period might now end before the RCU read-side + critical section. + If that happens, disaster will ensue. + +
+ +

So the locking is absolutely required in + order to coordinate clearing of the bits with updating of the + grace-period sequence number in ->gp_seq. +

 
+ +

Blocked-Task Management
+ +

PREEMPT_RCU allows tasks to be preempted in the +midst of their RCU read-side critical sections, and these tasks +must be tracked explicitly. +The details of exactly why and how they are tracked will be covered +in a separate article on RCU read-side processing. +For now, it is enough to know that the rcu_node +structure tracks them. + +

+  1   struct list_head blkd_tasks;
+  2   struct list_head *gp_tasks;
+  3   struct list_head *exp_tasks;
+  4   bool wait_blkd_tasks;
+
+ +

The ->blkd_tasks field is a list header for +the list of blocked and preempted tasks. +As tasks undergo context switches within RCU read-side critical +sections, their task_struct structures are enqueued +(via the task_struct's ->rcu_node_entry +field) onto the head of the ->blkd_tasks list for the +leaf rcu_node structure corresponding to the CPU +on which the outgoing context switch executed. +As these tasks later exit their RCU read-side critical sections, +they remove themselves from the list. +This list is therefore in reverse time order, so that if one of the tasks +is blocking the current grace period, all subsequent tasks must +also be blocking that same grace period. +Therefore, a single pointer into this list suffices to track +all tasks blocking a given grace period. +That pointer is stored in ->gp_tasks for normal +grace periods and in ->exp_tasks for expedited +grace periods. +These last two fields are NULL if either there is +no grace period in flight or if there are no blocked tasks +preventing that grace period from completing. +If either of these two pointers is referencing a task that +removes itself from the ->blkd_tasks list, +then that task must advance the pointer to the next task on +the list, or set the pointer to NULL if there +are no subsequent tasks on the list. + +
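The pointer-advancement rule in the preceding paragraph can be written down in just a few lines.
This is only a sketch of what the rcu_read_unlock() slow path must do; the kernel's real code
also handles locking, quiescent-state reporting, and priority boosting:

  /* Sketch: task "t" is removing itself from rnp->blkd_tasks. */
  static void remove_blocked_task_sketch(struct rcu_node *rnp,
                                         struct task_struct *t)
  {
          struct list_head *np = t->rcu_node_entry.next;

          if (np == &rnp->blkd_tasks)
                  np = NULL;              /* No older task remains on the list. */
          if (rnp->gp_tasks == &t->rcu_node_entry)
                  rnp->gp_tasks = np;     /* Advance normal-GP pointer. */
          if (rnp->exp_tasks == &t->rcu_node_entry)
                  rnp->exp_tasks = np;    /* Advance expedited-GP pointer. */
          list_del_init(&t->rcu_node_entry);
  }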

For example, suppose that tasks T1, T2, and T3 are +all hard-affinitied to the largest-numbered CPU in the system. +Then if task T1 blocked in an RCU read-side +critical section, then an expedited grace period started, +then task T2 blocked in an RCU read-side critical section, +then a normal grace period started, and finally task 3 blocked +in an RCU read-side critical section, then the state of the +last leaf rcu_node structure's blocked-task list +would be as shown below: + +

blkd_task.svg + +

Task T1 is blocking both grace periods, task T2 is +blocking only the normal grace period, and task T3 is blocking +neither grace period. +Note that these tasks will not remove themselves from this list +immediately upon resuming execution. +They will instead remain on the list until they execute the outermost +rcu_read_unlock() that ends their RCU read-side critical +section. + +

+The ->wait_blkd_tasks field indicates whether or not +the current grace period is waiting on a blocked task. + +

Sizing the rcu_node Array
+ +

The rcu_node array is sized via a series of +C-preprocessor expressions as follows: + +

+ 1 #ifdef CONFIG_RCU_FANOUT
+ 2 #define RCU_FANOUT CONFIG_RCU_FANOUT
+ 3 #else
+ 4 # ifdef CONFIG_64BIT
+ 5 # define RCU_FANOUT 64
+ 6 # else
+ 7 # define RCU_FANOUT 32
+ 8 # endif
+ 9 #endif
+10
+11 #ifdef CONFIG_RCU_FANOUT_LEAF
+12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
+13 #else
+14 # ifdef CONFIG_64BIT
+15 # define RCU_FANOUT_LEAF 64
+16 # else
+17 # define RCU_FANOUT_LEAF 32
+18 # endif
+19 #endif
+20
+21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF)
+22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT)
+23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT)
+24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT)
+25
+26 #if NR_CPUS <= RCU_FANOUT_1
+27 #  define RCU_NUM_LVLS        1
+28 #  define NUM_RCU_LVL_0        1
+29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0
+30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 }
+31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" }
+32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" }
+33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" }
+34 #elif NR_CPUS <= RCU_FANOUT_2
+35 #  define RCU_NUM_LVLS        2
+36 #  define NUM_RCU_LVL_0        1
+37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
+39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
+40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" }
+41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" }
+42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" }
+43 #elif NR_CPUS <= RCU_FANOUT_3
+44 #  define RCU_NUM_LVLS        3
+45 #  define NUM_RCU_LVL_0        1
+46 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
+49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
+50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
+51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
+52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
+53 #elif NR_CPUS <= RCU_FANOUT_4
+54 #  define RCU_NUM_LVLS        4
+55 #  define NUM_RCU_LVL_0        1
+56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
+57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
+58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
+59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
+60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
+61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
+62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
+63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
+64 #else
+65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
+66 #endif
+
+ +

The maximum number of levels in the rcu_node structure +is currently limited to four, as specified by lines 21-24 +and the structure of the subsequent “if” statement. +For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which +should be sufficient for the next few years at least. +For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which +should see us through the next decade or so. +This four-level tree also allows kernels built with +CONFIG_RCU_FANOUT=8 to support up to 4096 CPUs, +which might be useful in very large systems having eight CPUs per +socket (but please note that no one has yet shown any measurable +performance degradation due to misaligned socket and rcu_node +boundaries). +In addition, building kernels with a full four levels of rcu_node +tree permits better testing of RCU's combining-tree code. + +

The RCU_FANOUT symbol controls how many children +are permitted at each non-leaf level of the rcu_node tree. +If the CONFIG_RCU_FANOUT Kconfig option is not specified, +it is set based on the word size of the system, which is also +the Kconfig default. + +

The RCU_FANOUT_LEAF symbol controls how many CPUs are +handled by each leaf rcu_node structure. +Experience has shown that allowing a given leaf rcu_node +structure to handle 64 CPUs, as permitted by the number of bits in +the ->qsmask field on a 64-bit system, results in +excessive contention for the leaf rcu_node structures' +->lock fields. +The number of CPUs per leaf rcu_node structure is therefore +limited to 16 given the default value of CONFIG_RCU_FANOUT_LEAF. +If CONFIG_RCU_FANOUT_LEAF is unspecified, the value +selected is based on the word size of the system, just as for +CONFIG_RCU_FANOUT. +Lines 11-19 perform this computation. + +

Lines 21-24 compute the maximum number of CPUs supported by +a single-level (which contains a single rcu_node structure), +two-level, three-level, and four-level rcu_node tree, +respectively, given the fanout specified by RCU_FANOUT +and RCU_FANOUT_LEAF. +These numbers of CPUs are retained in the +RCU_FANOUT_1, +RCU_FANOUT_2, +RCU_FANOUT_3, and +RCU_FANOUT_4 +C-preprocessor variables, respectively. + +

These variables are used to control the C-preprocessor #if statement spanning lines 26-66 that
computes the number of rcu_node structures required for each level of the tree, as well as the
number of levels required.
The number of levels is placed in the RCU_NUM_LVLS C-preprocessor variable by lines 27, 35, 44,
and 54.
The number of rcu_node structures for the topmost level of the tree is always exactly one, and
this value is unconditionally placed into NUM_RCU_LVL_0 by lines 28, 36, 45, and 55.
The rest of the levels (if any) of the rcu_node tree are computed by dividing the maximum number
of CPUs by the fanout supported by the number of levels from the current level down, rounding
up.  This computation is performed by lines 37, 46-47, and 56-58.
Lines 31-33, 40-42, 50-52, and 61-63 create initializers for lockdep lock-class names.
Finally, lines 64-66 produce an error if the maximum number of CPUs is too large for the
specified fanout.
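As a worked example, a 64-bit kernel built with NR_CPUS=4096 and the default fanouts
(RCU_FANOUT=64, RCU_FANOUT_LEAF=16) takes the NR_CPUS <= RCU_FANOUT_3 branch and ends up with
the following values, worked by hand from the listing above:

  RCU_FANOUT_1  =    16   /* CPUs covered by one leaf. */
  RCU_FANOUT_2  =  1024   /* 16 * 64 */
  RCU_FANOUT_3  = 65536   /* 16 * 64 * 64, so three levels suffice. */

  RCU_NUM_LVLS  = 3
  NUM_RCU_LVL_0 = 1                              /* Root. */
  NUM_RCU_LVL_1 = DIV_ROUND_UP(4096, 1024) =   4
  NUM_RCU_LVL_2 = DIV_ROUND_UP(4096, 16)   = 256 /* Leaves. */
  NUM_RCU_NODES = 1 + 4 + 256 = 261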

+The rcu_segcblist Structure

+ +The rcu_segcblist structure maintains a segmented list of +callbacks as follows: + +
+ 1 #define RCU_DONE_TAIL        0
+ 2 #define RCU_WAIT_TAIL        1
+ 3 #define RCU_NEXT_READY_TAIL  2
+ 4 #define RCU_NEXT_TAIL        3
+ 5 #define RCU_CBLIST_NSEGS     4
+ 6
+ 7 struct rcu_segcblist {
+ 8   struct rcu_head *head;
+ 9   struct rcu_head **tails[RCU_CBLIST_NSEGS];
+10   unsigned long gp_seq[RCU_CBLIST_NSEGS];
+11   long len;
+12   long len_lazy;
+13 };
+
+ +

+The segments are as follows: + +

    +
  1. RCU_DONE_TAIL: Callbacks whose grace periods have elapsed. + These callbacks are ready to be invoked. +
  2. RCU_WAIT_TAIL: Callbacks that are waiting for the + current grace period. + Note that different CPUs can have different ideas about which + grace period is current, hence the ->gp_seq field. +
  3. RCU_NEXT_READY_TAIL: Callbacks waiting for the next + grace period to start. +
  4. RCU_NEXT_TAIL: Callbacks that have not yet been + associated with a grace period. +
+ +

+The ->head pointer references the first callback or +is NULL if the list contains no callbacks (which is +not the same as being empty). +Each element of the ->tails[] array references the +->next pointer of the last callback in the corresponding +segment of the list, or the list's ->head pointer if +that segment and all previous segments are empty. +If the corresponding segment is empty but some previous segment is +not empty, then the array element is identical to its predecessor. +Older callbacks are closer to the head of the list, and new callbacks +are added at the tail. +This relationship between the ->head pointer, the +->tails[] array, and the callbacks is shown in this +diagram: + +

nxtlist.svg + +

In this figure, the ->head pointer references the +first +RCU callback in the list. +The ->tails[RCU_DONE_TAIL] array element references +the ->head pointer itself, indicating that none +of the callbacks is ready to invoke. +The ->tails[RCU_WAIT_TAIL] array element references callback +CB 2's ->next pointer, which indicates that +CB 1 and CB 2 are both waiting on the current grace period, +give or take possible disagreements about exactly which grace period +is the current one. +The ->tails[RCU_NEXT_READY_TAIL] array element +references the same RCU callback that ->tails[RCU_WAIT_TAIL] +does, which indicates that there are no callbacks waiting on the next +RCU grace period. +The ->tails[RCU_NEXT_TAIL] array element references +CB 4's ->next pointer, indicating that all the +remaining RCU callbacks have not yet been assigned to an RCU grace +period. +Note that the ->tails[RCU_NEXT_TAIL] array element +always references the last RCU callback's ->next pointer +unless the callback list is empty, in which case it references +the ->head pointer. + +
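One payoff of this tail-pointer arrangement is that enqueuing a new callback is a constant-time
operation that never needs to walk the list.
The kernel's enqueue code is similar in spirit to the following sketch, though not identical to
it, and the sketch ignores the disabled-list special case described next:

  /* Sketch: append callback "rhp" to the RCU_NEXT_TAIL segment. */
  static void segcblist_enqueue_sketch(struct rcu_segcblist *rsclp,
                                       struct rcu_head *rhp)
  {
          rsclp->len++;                       /* ->len, not ->head, tracks emptiness. */
          rhp->next = NULL;
          /* Link after the current last callback, or into ->head if empty. */
          *rsclp->tails[RCU_NEXT_TAIL] = rhp;
          /* The new callback is now the last one on the list. */
          rsclp->tails[RCU_NEXT_TAIL] = &rhp->next;
  }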

+There is one additional important special case for the +->tails[RCU_NEXT_TAIL] array element: It can be NULL +when this list is disabled. +Lists are disabled when the corresponding CPU is offline or when +the corresponding CPU's callbacks are offloaded to a kthread, +both of which are described elsewhere. + +

CPUs advance their callbacks from the +RCU_NEXT_TAIL to the RCU_NEXT_READY_TAIL to the +RCU_WAIT_TAIL to the RCU_DONE_TAIL list segments +as grace periods advance. + +

The ->gp_seq[] array records grace-period +numbers corresponding to the list segments. +This is what allows different CPUs to have different ideas as to +which is the current grace period while still avoiding premature +invocation of their callbacks. +In particular, this allows CPUs that go idle for extended periods +to determine which of their callbacks are ready to be invoked after +reawakening. + +

The ->len counter contains the number of +callbacks in ->head, and the +->len_lazy contains the number of those callbacks that +are known to only free memory, and whose invocation can therefore +be safely deferred. + +

Important note: It is the ->len field that +determines whether or not there are callbacks associated with +this rcu_segcblist structure, not the ->head +pointer. +The reason for this is that all the ready-to-invoke callbacks +(that is, those in the RCU_DONE_TAIL segment) are extracted +all at once at callback-invocation time. +If callback invocation must be postponed, for example, because a +high-priority process just woke up on this CPU, then the remaining +callbacks are placed back on the RCU_DONE_TAIL segment. +Either way, the ->len and ->len_lazy counts +are adjusted after the corresponding callbacks have been invoked, and so +again it is the ->len count that accurately reflects whether +or not there are callbacks associated with this rcu_segcblist +structure. +Of course, off-CPU sampling of the ->len count requires +the use of appropriate synchronization, for example, memory barriers. +This synchronization can be a bit subtle, particularly in the case +of rcu_barrier(). + +
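Putting that rule into code, a cross-CPU check for queued callbacks needs only a suitably
synchronized read of ->len.
The following is a minimal sketch rather than the kernel's actual rcu_segcblist interface:

  /* Sketch: does this rcu_segcblist have callbacks queued?  Off-CPU
   * callers need READ_ONCE() plus whatever ordering the use case
   * requires, rcu_barrier() being the subtle example in the text. */
  static bool segcblist_has_cbs_sketch(struct rcu_segcblist *rsclp)
  {
          return READ_ONCE(rsclp->len) != 0;
  }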

+The rcu_data Structure

+ +

The rcu_data maintains the per-CPU state for the +corresponding flavor of RCU. +The fields in this structure may be accessed only from the corresponding +CPU (and from tracing) unless otherwise stated. +This structure is the +focus of quiescent-state detection and RCU callback queuing. +It also tracks its relationship to the corresponding leaf +rcu_node structure to allow more-efficient +propagation of quiescent states up the rcu_node +combining tree. +Like the rcu_node structure, it provides a local +copy of the grace-period information to allow for-free +synchronized +access to this information from the corresponding CPU. +Finally, this structure records past dyntick-idle state +for the corresponding CPU and also tracks statistics. + +

The rcu_data structure's fields are discussed, +singly and in groups, in the following sections. + +

Connection to Other Data Structures
+ +

This portion of the rcu_data structure is declared +as follows: + +

+  1   int cpu;
+  2   struct rcu_state *rsp;
+  3   struct rcu_node *mynode;
+  4   struct rcu_dynticks *dynticks;
+  5   unsigned long grpmask;
+  6   bool beenonline;
+
+ +

The ->cpu field contains the number of the +corresponding CPU, the ->rsp pointer references +the corresponding rcu_state structure (and is most frequently +used to locate the name of the corresponding flavor of RCU for tracing), +and the ->mynode field references the corresponding +rcu_node structure. +The ->mynode is used to propagate quiescent states +up the combining tree. +

The ->dynticks pointer references the rcu_dynticks structure corresponding to this CPU.
Recall that a single per-CPU instance of the rcu_dynticks structure is shared among all flavors
of RCU.
These first four fields are constant and therefore require no synchronization.

The ->grpmask field indicates the bit in +the ->mynode->qsmask corresponding to this +rcu_data structure, and is also used when propagating +quiescent states. +The ->beenonline flag is set whenever the corresponding +CPU comes online, which means that the debugfs tracing need not dump +out any rcu_data structure for which this flag is not set. + +

Quiescent-State and Grace-Period Tracking
+ +

This portion of the rcu_data structure is declared +as follows: + +

+  1   unsigned long gp_seq;
+  2   unsigned long gp_seq_needed;
+  3   bool cpu_no_qs;
+  4   bool core_needs_qs;
+  5   bool gpwrap;
+  6   unsigned long rcu_qs_ctr_snap;
+
+ +

The ->gp_seq and ->gp_seq_needed +fields are the counterparts of the fields of the same name +in the rcu_state and rcu_node structures. +They may each lag up to one behind their rcu_node +counterparts, but in CONFIG_NO_HZ_IDLE and +CONFIG_NO_HZ_FULL kernels can lag +arbitrarily far behind for CPUs in dyntick-idle mode (but these counters +will catch up upon exit from dyntick-idle mode). +If the lower two bits of a given rcu_data structure's +->gp_seq are zero, then this rcu_data +structure believes that RCU is idle. + + + + + + + + +
 
Quick Quiz:
+ All this replication of the grace period numbers can only cause + massive confusion. + Why not just keep a global sequence number and be done with it??? +
Answer:
+ Because if there were only a single global sequence
 number, there would need to be a single global lock to allow
 safely accessing and updating it.
 And if we are not going to have a single global lock, we need
 to carefully manage the numbers on a per-node basis.
 Recall from the answer to a previous Quick Quiz that the consequences
 of applying a previously sampled quiescent state to the wrong
 grace period are quite severe.
 
+ +

The ->cpu_no_qs flag indicates that the +CPU has not yet passed through a quiescent state, +while the ->core_needs_qs flag indicates that the +RCU core needs a quiescent state from the corresponding CPU. +The ->gpwrap field indicates that the corresponding +CPU has remained idle for so long that the +gp_seq counter is in danger of overflow, which +will cause the CPU to disregard the values of its counters on +its next exit from idle. +Finally, the rcu_qs_ctr_snap field is used to detect +cases where a given operation has resulted in a quiescent state +for all flavors of RCU, for example, cond_resched() +when RCU has indicated a need for quiescent states. + +

RCU Callback Handling
+ +

In the absence of CPU-hotplug events, RCU callbacks are invoked by +the same CPU that registered them. +This is strictly a cache-locality optimization: callbacks can and +do get invoked on CPUs other than the one that registered them. +After all, if the CPU that registered a given callback has gone +offline before the callback can be invoked, there really is no other +choice. + +

This portion of the rcu_data structure is declared +as follows: + +

+ 1 struct rcu_segcblist cblist;
+ 2 long qlen_last_fqs_check;
+ 3 unsigned long n_cbs_invoked;
+ 4 unsigned long n_nocbs_invoked;
+ 5 unsigned long n_cbs_orphaned;
+ 6 unsigned long n_cbs_adopted;
+ 7 unsigned long n_force_qs_snap;
+ 8 long blimit;
+
+ +

The ->cblist structure is the segmented callback list +described earlier. +The CPU advances the callbacks in its rcu_data structure +whenever it notices that another RCU grace period has completed. +The CPU detects the completion of an RCU grace period by noticing +that the value of its rcu_data structure's +->gp_seq field differs from that of its leaf +rcu_node structure. +Recall that each rcu_node structure's +->gp_seq field is updated at the beginnings and ends of each +grace period. + +

+The ->qlen_last_fqs_check and +->n_force_qs_snap coordinate the forcing of quiescent +states from call_rcu() and friends when callback +lists grow excessively long. + +

The ->n_cbs_invoked, +->n_cbs_orphaned, and ->n_cbs_adopted +fields count the number of callbacks invoked, +sent to other CPUs when this CPU goes offline, +and received from other CPUs when those other CPUs go offline. +The ->n_nocbs_invoked is used when the CPU's callbacks +are offloaded to a kthread. + +

+Finally, the ->blimit counter is the maximum number of +RCU callbacks that may be invoked at a given time. + +

Dyntick-Idle Handling
+ +

This portion of the rcu_data structure is declared +as follows: + +

+  1   int dynticks_snap;
+  2   unsigned long dynticks_fqs;
+
+ +The ->dynticks_snap field is used to take a snapshot +of the corresponding CPU's dyntick-idle state when forcing +quiescent states, and is therefore accessed from other CPUs. +Finally, the ->dynticks_fqs field is used to +count the number of times this CPU is determined to be in +dyntick-idle state, and is used for tracing and debugging purposes. + +

+The rcu_dynticks Structure

+ +

The rcu_dynticks maintains the per-CPU dyntick-idle state +for the corresponding CPU. +Unlike the other structures, rcu_dynticks is not +replicated over the different flavors of RCU. +The fields in this structure may be accessed only from the corresponding +CPU (and from tracing) unless otherwise stated. +Its fields are as follows: + +

+  1   long dynticks_nesting;
+  2   long dynticks_nmi_nesting;
+  3   atomic_t dynticks;
+  4   bool rcu_need_heavy_qs;
+  5   unsigned long rcu_qs_ctr;
+  6   bool rcu_urgent_qs;
+
+ +

The ->dynticks_nesting field counts the +nesting depth of process execution, so that in normal circumstances +this counter has value zero or one. +NMIs, irqs, and tracers are counted by the ->dynticks_nmi_nesting +field. +Because NMIs cannot be masked, changes to this variable have to be +undertaken carefully using an algorithm provided by Andy Lutomirski. +The initial transition from idle adds one, and nested transitions +add two, so that a nesting level of five is represented by a +->dynticks_nmi_nesting value of nine. +This counter can therefore be thought of as counting the number +of reasons why this CPU cannot be permitted to enter dyntick-idle +mode, aside from process-level transitions. + +

However, it turns out that when running in non-idle kernel context, the Linux kernel is fully
capable of entering interrupt handlers that never exit and perhaps also vice versa.
Therefore, whenever the ->dynticks_nesting field is incremented up from zero, the
->dynticks_nmi_nesting field is set to a large positive number, and whenever the
->dynticks_nesting field is decremented down to zero, the ->dynticks_nmi_nesting field is set
to zero.
Assuming that the number of misnested interrupts is not sufficient to overflow the counter,
this approach corrects the ->dynticks_nmi_nesting field every time the corresponding CPU enters
the idle loop from process context.

The ->dynticks field counts the corresponding +CPU's transitions to and from dyntick-idle mode, so that this counter +has an even value when the CPU is in dyntick-idle mode and an odd +value otherwise. + +
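This even/odd encoding is what lets one CPU judge another CPU's dyntick-idle state without
waking it: sample the counter, and an even value means the CPU was idle at the time of the
sample, while any change between two samples means the CPU passed through idle in the interim.
The following is a sketch of that idea, not the kernel's dyntick-idle code:

  /* Sketch: snapshot another CPU's ->dynticks with full ordering. */
  static int dynticks_snap_sketch(struct rcu_dynticks *rdtp)
  {
          return atomic_add_return(0, &rdtp->dynticks);
  }

  /* Even value: the CPU was in dyntick-idle mode when sampled. */
  static bool in_dynticks_idle_sketch(int snap)
  {
          return !(snap & 0x1);
  }

  /* Has the CPU been through an extended quiescent state since "snap"? */
  static bool hit_idle_since_sketch(struct rcu_dynticks *rdtp, int snap)
  {
          return in_dynticks_idle_sketch(snap) ||
                 dynticks_snap_sketch(rdtp) != snap;
  }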

The ->rcu_need_heavy_qs field is used +to record the fact that the RCU core code would really like to +see a quiescent state from the corresponding CPU, so much so that +it is willing to call for heavy-weight dyntick-counter operations. +This flag is checked by RCU's context-switch and cond_resched() +code, which provide a momentary idle sojourn in response. + +

The ->rcu_qs_ctr field is used to record +quiescent states from cond_resched(). +Because cond_resched() can execute quite frequently, this +must be quite lightweight, as in a non-atomic increment of this +per-CPU field. + +

Finally, the ->rcu_urgent_qs field is used to record +the fact that the RCU core code would really like to see a quiescent +state from the corresponding CPU, with the various other fields indicating +just how badly RCU wants this quiescent state. +This flag is checked by RCU's context-switch and cond_resched() +code, which, if nothing else, non-atomically increment ->rcu_qs_ctr +in response. + + + + + + + + +
 
Quick Quiz:
+ Why not simply combine the ->dynticks_nesting + and ->dynticks_nmi_nesting counters into a + single counter that just counts the number of reasons that + the corresponding CPU is non-idle? +
Answer:
+ Because this would fail in the presence of interrupts whose + handlers never return and of handlers that manage to return + from a made-up interrupt. +
 
+ +

Additional fields are present for some special-purpose +builds, and are discussed separately. + +

+The rcu_head Structure

+ +

Each rcu_head structure represents an RCU callback. +These structures are normally embedded within RCU-protected data +structures whose algorithms use asynchronous grace periods. +In contrast, when using algorithms that block waiting for RCU grace periods, +RCU users need not provide rcu_head structures. + +

The rcu_head structure has fields as follows: + +

+  1   struct rcu_head *next;
+  2   void (*func)(struct rcu_head *head);
+
+ +

The ->next field is used +to link the rcu_head structures together in the +lists within the rcu_data structures. +The ->func field is a pointer to the function +to be called when the callback is ready to be invoked, and +this function is passed a pointer to the rcu_head +structure. +However, kfree_rcu() uses the ->func +field to record the offset of the rcu_head +structure within the enclosing RCU-protected data structure. + +

Both of these fields are used internally by RCU. +From the viewpoint of RCU users, this structure is an +opaque “cookie”. + + + + + + + + +
 
Quick Quiz:
+ Given that the callback function ->func + is passed a pointer to the rcu_head structure, + how is that function supposed to find the beginning of the + enclosing RCU-protected data structure? +
Answer:
+ In actual practice, there is a separate callback function per + type of RCU-protected data structure. + The callback function can therefore use the container_of() + macro in the Linux kernel (or other pointer-manipulation facilities + in other software environments) to find the beginning of the + enclosing structure. +
 
+ +
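For example, an asynchronous-grace-period user typically embeds the rcu_head in its own structure and applies container_of() in the callback. The sketch below is illustrative only; struct foo and the function names are made up, and the caller is assumed to have already removed the structure from any reader-visible paths:

  #include <linux/kernel.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  struct foo {
          int data;
          struct rcu_head rh;
  };

  static void foo_release(struct rcu_head *head)
  {
          struct foo *fp = container_of(head, struct foo, rh);

          kfree(fp);      /* Safe: a grace period has elapsed. */
  }

  static void foo_free_deferred(struct foo *fp)
  {
          call_rcu(&fp->rh, foo_release);   /* foo_release() runs after a grace period. */
  }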

+RCU-Specific Fields in the task_struct Structure

+ +

The CONFIG_PREEMPT_RCU implementation uses some +additional fields in the task_struct structure: + +

+ 1 #ifdef CONFIG_PREEMPT_RCU
+ 2   int rcu_read_lock_nesting;
+ 3   union rcu_special rcu_read_unlock_special;
+ 4   struct list_head rcu_node_entry;
+ 5   struct rcu_node *rcu_blocked_node;
+ 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
+ 7 #ifdef CONFIG_TASKS_RCU
+ 8   unsigned long rcu_tasks_nvcsw;
+ 9   bool rcu_tasks_holdout;
+10   struct list_head rcu_tasks_holdout_list;
+11   int rcu_tasks_idle_cpu;
+12 #endif /* #ifdef CONFIG_TASKS_RCU */
+
+ +

The ->rcu_read_lock_nesting field records the +nesting level for RCU read-side critical sections, and +the ->rcu_read_unlock_special field is a bitmask +that records special conditions that require rcu_read_unlock() +to do additional work. +The ->rcu_node_entry field is used to form lists of +tasks that have blocked within preemptible-RCU read-side critical +sections and the ->rcu_blocked_node field references +the rcu_node structure whose list this task is a member of, +or NULL if it is not blocked within a preemptible-RCU +read-side critical section. + +

The ->rcu_tasks_nvcsw field tracks the number of voluntary context switches that this task had undergone at the beginning of the current tasks-RCU grace period, ->rcu_tasks_holdout is set if the current tasks-RCU grace period is waiting on this task, ->rcu_tasks_holdout_list is a list element enqueuing this task on the holdout list, and ->rcu_tasks_idle_cpu tracks which CPU this idle task is running on, but only if the task is currently running, that is, if the CPU is currently idle.

+Accessor Functions

+ +

The following listing shows the rcu_get_root() function and the rcu_for_each_node_breadth_first(), rcu_for_each_nonleaf_node_breadth_first(), and rcu_for_each_leaf_node() macros:

+  1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
+  2 {
+  3   return &rsp->node[0];
+  4 }
+  5
+  6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
+  7   for ((rnp) = &(rsp)->node[0]; \
+  8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
+  9
+ 10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
+ 11   for ((rnp) = &(rsp)->node[0]; \
+ 12        (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++)
+ 13
+ 14 #define rcu_for_each_leaf_node(rsp, rnp) \
+ 15   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
+ 16        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
+
+ +

The rcu_get_root() function simply returns a pointer to the first element of the specified rcu_state structure's ->node[] array, which is the root rcu_node structure.

As noted earlier, the rcu_for_each_node_breadth_first() +macro takes advantage of the layout of the rcu_node +structures in the rcu_state structure's +->node[] array, performing a breadth-first traversal by +simply traversing the array in order. +The rcu_for_each_nonleaf_node_breadth_first() macro operates +similarly, but traverses only the first part of the array, thus excluding +the leaf rcu_node structures. +Finally, the rcu_for_each_leaf_node() macro traverses only +the last part of the array, thus traversing only the leaf +rcu_node structures. + + + + + + + + +
 
Quick Quiz:
+ What do rcu_for_each_nonleaf_node_breadth_first() and + rcu_for_each_leaf_node() do if the rcu_node tree + contains only a single node? +
Answer:
+ In the single-node case, + rcu_for_each_nonleaf_node_breadth_first() is a no-op + and rcu_for_each_leaf_node() traverses the single node. +
 
+ +
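As a trivial illustration of these accessors, the following sketch counts the leaf rcu_node structures of a given flavor; it assumes the RCU-internal definitions shown above are in scope, and the count_leaf_nodes() helper is made up for this example:

  static int count_leaf_nodes(struct rcu_state *rsp)
  {
          struct rcu_node *rnp;
          int n = 0;

          /* Visit only the last part of the ->node[] array. */
          rcu_for_each_leaf_node(rsp, rnp)
                  n++;
          return n;
  }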

+Summary

+ +So each flavor of RCU is represented by an rcu_state structure, +which contains a combining tree of rcu_node and +rcu_data structures. +Finally, in CONFIG_NO_HZ_IDLE kernels, each CPU's dyntick-idle +state is tracked by an rcu_dynticks structure. + +If you made it this far, you are well prepared to read the code +walkthroughs in the other articles in this series. + +

+Acknowledgments

+ +I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul +Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn +for helping me get this document into a more human-readable state. + +

+Legal Statement

+ +

This work represents the view of the author and does not necessarily +represent the view of IBM. + +

Linux is a registered trademark of Linus Torvalds. + +

Other company, product, and service names may be trademarks or service marks of others.

diff --git a/Documentation/RCU/Design/Data-Structures/HugeTreeClassicRCU.svg (new file; only the diagram's text labels survive extraction: a struct rcu_state containing a multi-level tree of struct rcu_node, with leaf rcu_node structures fanning out to per-CPU struct rcu_data for CPU 0 through CPU 65535)
diff --git a/Documentation/RCU/Design/Data-Structures/TreeLevel.svg (new file; labels: the rcu_state structure's ->level[0], ->level[1], and ->level[2] pointers into the rcu_node tree covering CPU 0 through CPU 65535)
diff --git a/Documentation/RCU/Design/Data-Structures/TreeMapping.svg (new file; labels: rcu_node structures covering CPU ranges 0:7, 0:3, 4:7, 0:1, 2:3, 4:5, and 6:7 within struct rcu_state)
diff --git a/Documentation/RCU/Design/Data-Structures/TreeMappingLevel.svg (new file; labels: the same 0:7 mapping annotated with ->level[0], ->level[1], and ->level[2])
diff --git a/Documentation/RCU/Design/Data-Structures/blkd_task.svg (new file; labels: rcu_sched and rcu_bh rcu_state structures with their rcu_node, rcu_data, and rcu_dynticks structures, plus a leaf rcu_node structure's blkd_tasks list holding tasks T1, T2, and T3 and the gp_tasks and exp_tasks pointers)
diff --git a/Documentation/RCU/Design/Data-Structures/nxtlist.svg (new file; labels: ->head and the ->tails[RCU_DONE_TAIL], ->tails[RCU_WAIT_TAIL], ->tails[RCU_NEXT_READY_TAIL], and ->tails[RCU_NEXT_TAIL] pointers into a list of callbacks CB 1 through CB 4)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/ExpRCUFlow.svg (new file; flowchart labels: CPU N Start; Idle or offline?; Send IPI to CPU N; IPI Handler; In RCU read-side critical section?; rcu_read_unlock(); Context Switch; Enqueue Task; Report CPU Quiescent State; Report CPU and Task Quiescent States; Done)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg (new file; flowchart labels: CPU N Start; Idle or offline?; Send IPI to CPU N; IPI Handler; CPU idle?; resched_cpu(); Context Switch; CPU Offline; Report CPU Quiescent State; Done)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html (new file)

A Tour Through TREE_RCU's Expedited Grace Periods

Introduction

+ +This document describes RCU's expedited grace periods. +Unlike RCU's normal grace periods, which accept long latencies to attain +high efficiency and minimal disturbance, expedited grace periods accept +lower efficiency and significant disturbance to attain shorter latencies. + +

+There are three flavors of RCU (RCU-bh, RCU-preempt, and RCU-sched), +but only two flavors of expedited grace periods because the RCU-bh +expedited grace period maps onto the RCU-sched expedited grace period. +Each of the remaining two implementations is covered in its own section. + +

    +
  1. + Expedited Grace Period Design +
  2. + RCU-preempt Expedited Grace Periods +
  3. + RCU-sched Expedited Grace Periods +
  4. + Expedited Grace Period and CPU Hotplug +
  5. + Expedited Grace Period Refinements +
+ +

+Expedited Grace Period Design

+ +

+The expedited RCU grace periods cannot be accused of being subtle, +given that they for all intents and purposes hammer every CPU that +has not yet provided a quiescent state for the current expedited +grace period. +The one saving grace is that the hammer has grown a bit smaller +over time: The old call to try_stop_cpus() has been +replaced with a set of calls to smp_call_function_single(), +each of which results in an IPI to the target CPU. +The corresponding handler function checks the CPU's state, motivating +a faster quiescent state where possible, and triggering a report +of that quiescent state. +As always for RCU, once everything has spent some time in a quiescent +state, the expedited grace period has completed. + +

+The details of the smp_call_function_single() handler's +operation depend on the RCU flavor, as described in the following +sections. + +

+RCU-preempt Expedited Grace Periods

+ +

+The overall flow of the handling of a given CPU by an RCU-preempt +expedited grace period is shown in the following diagram: + +

ExpRCUFlow.svg + +

+The solid arrows denote direct action, for example, a function call. +The dotted arrows denote indirect action, for example, an IPI +or a state that is reached after some time. + +

+If a given CPU is offline or idle, synchronize_rcu_expedited() +will ignore it because idle and offline CPUs are already residing +in quiescent states. +Otherwise, the expedited grace period will use +smp_call_function_single() to send the CPU an IPI, which +is handled by sync_rcu_exp_handler(). + +

+However, because this is preemptible RCU, sync_rcu_exp_handler() +can check to see if the CPU is currently running in an RCU read-side +critical section. +If not, the handler can immediately report a quiescent state. +Otherwise, it sets flags so that the outermost rcu_read_unlock() +invocation will provide the needed quiescent-state report. +This flag-setting avoids the previous forced preemption of all +CPUs that might have RCU read-side critical sections. +In addition, this flag-setting is done so as to avoid increasing +the overhead of the common-case fastpath through the scheduler. + +

+Again because this is preemptible RCU, an RCU read-side critical section can be preempted. When that happens, RCU will enqueue the task, which will then continue to block the current expedited grace period until it resumes and finds its outermost rcu_read_unlock(). The CPU will report a quiescent state just after enqueuing the task because the CPU is no longer blocking the grace period. It is instead the preempted task doing the blocking. The list of blocked tasks is managed by rcu_preempt_ctxt_queue(), which is called from rcu_preempt_note_context_switch(), which in turn is called from rcu_note_context_switch(), which in turn is called from the scheduler.
 
Quick Quiz:
+ Why not just have the expedited grace period check the + state of all the CPUs? + After all, that would avoid all those real-time-unfriendly IPIs. +
Answer:
+ Because we want the RCU read-side critical sections to run fast, + which means no memory barriers. + Therefore, it is not possible to safely check the state from some + other CPU. + And even if it was possible to safely check the state, it would + still be necessary to IPI the CPU to safely interact with the + upcoming rcu_read_unlock() invocation, which means that + the remote state testing would not help the worst-case + latency that real-time applications care about. + +

One way to prevent your real-time + application from getting hit with these IPIs is to + build your kernel with CONFIG_NO_HZ_FULL=y. + RCU would then perceive the CPU running your application + as being idle, and it would be able to safely detect that + state without needing to IPI the CPU. +

 
+ +

+Please note that this is just the overall flow: +Additional complications can arise due to races with CPUs going idle +or offline, among other things. + +

+RCU-sched Expedited Grace Periods

+ +

+The overall flow of the handling of a given CPU by an RCU-sched +expedited grace period is shown in the following diagram: + +

ExpSchedFlow.svg + +

+As with RCU-preempt's synchronize_rcu_expedited(), synchronize_sched_expedited() ignores offline and idle CPUs, again because they are in remotely detectable quiescent states. However, the synchronize_sched_expedited() handler is sync_sched_exp_handler(), and because the rcu_read_lock_sched() and rcu_read_unlock_sched() leave no trace of their invocation, in general it is not possible to tell whether or not the current CPU is in an RCU read-side critical section. The best that sync_sched_exp_handler() can do is to check for idle, on the off-chance that the CPU went idle while the IPI was in flight. If the CPU is idle, then sync_sched_exp_handler() reports the quiescent state.

+Otherwise, the handler invokes resched_cpu(), which forces +a future context switch. +At the time of the context switch, the CPU reports the quiescent state. +Should the CPU go offline first, it will report the quiescent state +at that time. + +

+Expedited Grace Period and CPU Hotplug

+ +

+The expedited nature of expedited grace periods requires a much tighter interaction with CPU hotplug operations than is required for normal grace periods. In addition, attempting to IPI offline CPUs will result in splats, but failing to IPI online CPUs can result in too-short grace periods. Neither option is acceptable in production kernels.

+The interaction between expedited grace periods and CPU hotplug operations +is carried out at several levels: + +

    +
  1. The number of CPUs that have ever been online is tracked + by the rcu_state structure's ->ncpus + field. + The rcu_state structure's ->ncpus_snap + field tracks the number of CPUs that have ever been online + at the beginning of an RCU expedited grace period. + Note that this number never decreases, at least in the absence + of a time machine. +
  2. The identities of the CPUs that have ever been online are tracked by the rcu_node structure's ->expmaskinitnext field. The rcu_node structure's ->expmaskinit field tracks the identities of the CPUs that were online at least once at the beginning of the most recent RCU expedited grace period. The rcu_state structure's ->ncpus and ->ncpus_snap fields are used to detect when new CPUs have come online for the first time, that is, when the rcu_node structure's ->expmaskinitnext field has changed since the beginning of the last RCU expedited grace period, which triggers an update of each rcu_node structure's ->expmaskinit field from its ->expmaskinitnext field.
  3. Each rcu_node structure's ->expmaskinit + field is used to initialize that structure's + ->expmask at the beginning of each RCU + expedited grace period. + This means that only those CPUs that have been online at least + once will be considered for a given grace period. +
  4. Any CPU that goes offline will clear its bit in its leaf + rcu_node structure's ->qsmaskinitnext + field, so any CPU with that bit clear can safely be ignored. + However, it is possible for a CPU coming online or going offline + to have this bit set for some time while cpu_online + returns false. +
  5. For each non-idle CPU that RCU believes is currently online, the grace + period invokes smp_call_function_single(). + If this succeeds, the CPU was fully online. + Failure indicates that the CPU is in the process of coming online + or going offline, in which case it is necessary to wait for a + short time period and try again. + The purpose of this wait (or series of waits, as the case may be) + is to permit a concurrent CPU-hotplug operation to complete. +
  6. In the case of RCU-sched, one of the last acts of an outgoing CPU + is to invoke rcu_report_dead(), which + reports a quiescent state for that CPU. + However, this is likely paranoia-induced redundancy. +
+ + + + + + + + +
 
Quick Quiz:
+ Why all the dancing around with multiple counters and masks + tracking CPUs that were once online? + Why not just have a single set of masks tracking the currently + online CPUs and be done with it? +
Answer:
+ Maintaining a single set of masks tracking the online CPUs sounds easier, at least until you try working out all the race conditions between grace-period initialization and CPU-hotplug operations. For example, suppose initialization is progressing down the tree while a CPU-offline operation is progressing up the tree. This situation can result in bits set at the top of the tree that have no counterparts at the bottom of the tree. Those bits will never be cleared, which will result in grace-period hangs. In short, that way lies madness, to say nothing of a great many bugs, hangs, and deadlocks.

+ In contrast, the current multi-mask multi-counter scheme ensures + that grace-period initialization will always see consistent masks + up and down the tree, which brings significant simplifications + over the single-mask method. + +

+ This is an instance of + + deferring work in order to avoid synchronization. + Lazily recording CPU-hotplug events at the beginning of the next + grace period greatly simplifies maintenance of the CPU-tracking + bitmasks in the rcu_node tree. +

 
+ +

+Expedited Grace Period Refinements

+ +
    +
  1. Idle-CPU checks. +
  2. + Batching via sequence counter. +
  3. + Funnel locking and wait/wakeup. +
  4. Use of Workqueues. +
  5. Stall warnings. +
  6. Mid-boot operation. +
+ +

Idle-CPU Checks

+ +

+Each expedited grace period checks for idle CPUs when initially forming +the mask of CPUs to be IPIed and again just before IPIing a CPU +(both checks are carried out by sync_rcu_exp_select_cpus()). +If the CPU is idle at any time between those two times, the CPU will +not be IPIed. +Instead, the task pushing the grace period forward will include the +idle CPUs in the mask passed to rcu_report_exp_cpu_mult(). + +

+For RCU-sched, there is an additional check for idle in the IPI +handler, sync_sched_exp_handler(). +If the IPI has interrupted the idle loop, then +sync_sched_exp_handler() invokes rcu_report_exp_rdp() +to report the corresponding quiescent state. + +

+For RCU-preempt, there is no specific check for idle in the IPI handler (sync_rcu_exp_handler()), but because RCU read-side critical sections are not permitted within the idle loop, if sync_rcu_exp_handler() sees that the CPU is within an RCU read-side critical section, the CPU cannot possibly be idle. Otherwise, sync_rcu_exp_handler() invokes rcu_report_exp_rdp() to report the corresponding quiescent state, regardless of whether or not that quiescent state was due to the CPU being idle.

+In summary, RCU expedited grace periods check for idle when building +the bitmask of CPUs that must be IPIed, just before sending each IPI, +and (either explicitly or implicitly) within the IPI handler. + +

+Batching via Sequence Counter

+ +

+If each grace-period request was carried out separately, expedited +grace periods would have abysmal scalability and +problematic high-load characteristics. +Because each grace-period operation can serve an unlimited number of +updates, it is important to batch requests, so that a single +expedited grace-period operation will cover all requests in the +corresponding batch. + +

+This batching is controlled by a sequence counter named +->expedited_sequence in the rcu_state structure. +This counter has an odd value when there is an expedited grace period +in progress and an even value otherwise, so that dividing the counter +value by two gives the number of completed grace periods. +During any given update request, the counter must transition from +even to odd and then back to even, thus indicating that a grace +period has elapsed. +Therefore, if the initial value of the counter is s, +the updater must wait until the counter reaches at least the +value (s+3)&~0x1. +This counter is managed by the following access functions: + +

    +
  1. rcu_exp_gp_seq_start(), which marks the start of + an expedited grace period. +
  2. rcu_exp_gp_seq_end(), which marks the end of an + expedited grace period. +
  3. rcu_exp_gp_seq_snap(), which obtains a snapshot of + the counter. +
  4. rcu_exp_gp_seq_done(), which returns true + if a full expedited grace period has elapsed since the + corresponding call to rcu_exp_gp_seq_snap(). +
+ +
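The arithmetic described above can be modeled in a few lines of code. The following userspace sketch mirrors the roles of these four functions as described in this document; the model_ prefix emphasizes that this is an illustration rather than the kernel's implementation, and it omits the locking and atomic operations that the real code needs:

  static unsigned long exp_seq;

  static void model_gp_seq_start(void)   /* Counter becomes odd: GP in progress. */
  {
          exp_seq++;
  }

  static void model_gp_seq_end(void)     /* Counter becomes even: GP complete. */
  {
          exp_seq++;
  }

  /* Smallest counter value proving that a full GP elapsed after the snapshot. */
  static unsigned long model_gp_seq_snap(void)
  {
          return (exp_seq + 3) & ~0x1UL;
  }

  static int model_gp_seq_done(unsigned long snap)
  {
          return (long)(exp_seq - snap) >= 0;     /* Wraparound-safe comparison. */
  }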

+Again, only one request in a given batch need actually carry out a grace-period operation, which means there must be an efficient way to identify which of many concurrent requests will initiate the grace period, and an efficient way for the remaining requests to wait for that grace period to complete. However, that is the topic of the next section.

+Funnel Locking and Wait/Wakeup

+ +

+The natural way to sort out which of a batch of updaters will initiate the expedited grace period is to use the rcu_node combining tree, as implemented by the exp_funnel_lock() function. The first updater corresponding to a given grace period arriving at a given rcu_node structure records its desired grace-period sequence number in the ->exp_seq_rq field and moves up to the next level in the tree. Otherwise, if the ->exp_seq_rq field already contains the sequence number for the desired grace period or some later one, the updater blocks on one of four wait queues in the ->exp_wq[] array, using the second-from-bottom and third-from-bottom bits as an index. An ->exp_lock field in the rcu_node structure synchronizes access to these fields.

+An empty rcu_node tree is shown in the following diagram, +with the white cells representing the ->exp_seq_rq field +and the red cells representing the elements of the +->exp_wq[] array. + +

Funnel0.svg + +

+The next diagram shows the situation after the arrival of Task A +and Task B at the leftmost and rightmost leaf rcu_node +structures, respectively. +The current value of the rcu_state structure's +->expedited_sequence field is zero, so adding three and +clearing the bottom bit results in the value two, which both tasks +record in the ->exp_seq_rq field of their respective +rcu_node structures: + +

Funnel1.svg + +

+Each of Tasks A and B will move up to the root +rcu_node structure. +Suppose that Task A wins, recording its desired grace-period sequence +number and resulting in the state shown below: + +

Funnel2.svg + +

+Task A now advances to initiate a new grace period, while Task B +moves up to the root rcu_node structure, and, seeing that +its desired sequence number is already recorded, blocks on +->exp_wq[1]. + + + + + + + + +
 
Quick Quiz:
+ Why ->exp_wq[1]? Given that the value of these tasks' desired sequence number is two, shouldn't they instead block on ->exp_wq[2]?
Answer:
+ No. + +

+ Recall that the bottom bit of the desired sequence number indicates + whether or not a grace period is currently in progress. + It is therefore necessary to shift the sequence number right one + bit position to obtain the number of the grace period. + This results in ->exp_wq[1]. +

 
+ +
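The indexing described in this Quick Quiz can be written out explicitly. The helper below is a sketch of the arithmetic as described in this document (the kernel's own helpers may differ in detail):

  /* Drop the bottom "grace period in progress" bit, then use the next
   * two bits to select one of the four ->exp_wq[] wait queues. */
  static int exp_wq_index(unsigned long seq)
  {
          return (seq >> 1) & 0x3;
  }

For the desired sequence number of two discussed above, this yields (2 >> 1) & 0x3 == 1, hence ->exp_wq[1].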

+If Tasks C and D also arrive at this point, they will compute the +same desired grace-period sequence number, and see that both leaf +rcu_node structures already have that value recorded. +They will therefore block on their respective rcu_node +structures' ->exp_wq[1] fields, as shown below: + +

Funnel3.svg + +

+Task A now acquires the rcu_state structure's +->exp_mutex and initiates the grace period, which +increments ->expedited_sequence. +Therefore, if Tasks E and F arrive, they will compute +a desired sequence number of 4 and will record this value as +shown below: + +

Funnel4.svg + +

+Tasks E and F will propagate up the rcu_node combining tree, with Task F blocking on the root rcu_node structure and Task E waiting for Task A to finish so that it can start the next grace period. The resulting state is as shown below:

Funnel5.svg + +

+Once the grace period completes, Task A +starts waking up the tasks waiting for this grace period to complete, +increments the ->expedited_sequence, +acquires the ->exp_wake_mutex and then releases the +->exp_mutex. +This results in the following state: + +

Funnel6.svg + +

+Task E can then acquire ->exp_mutex and increment ->expedited_sequence to the value three. If new tasks G and H arrive and move up the combining tree at the same time, the state will be as follows:

Funnel7.svg + +

+Note that three of the root rcu_node structure's +waitqueues are now occupied. +However, at some point, Task A will wake up the +tasks blocked on the ->exp_wq waitqueues, resulting +in the following state: + +

Funnel8.svg + +

+Execution will continue with Tasks E and H completing +their grace periods and carrying out their wakeups. + + + + + + + + +
 
Quick Quiz:
+ What happens if Task A takes so long to do its wakeups + that Task E's grace period completes? +
Answer:
+ Then Task E will block on the ->exp_wake_mutex, + which will also prevent it from releasing ->exp_mutex, + which in turn will prevent the next grace period from starting. + This last is important in preventing overflow of the + ->exp_wq[] array. +
 
+ +

Use of Workqueues

+ +

+In earlier implementations, the task requesting the expedited grace period also drove it to completion. This straightforward approach had the disadvantage of needing to account for POSIX signals sent to user tasks, so more recent implementations use the Linux kernel's workqueues.

+The requesting task still does counter snapshotting and funnel-lock processing, but the task reaching the top of the funnel lock does a schedule_work() (from _synchronize_rcu_expedited()) so that a workqueue kthread does the actual grace-period processing. Because workqueue kthreads do not accept POSIX signals, grace-period-wait processing need not allow for POSIX signals.

In addition, this approach allows wakeups for the previous expedited grace period to be overlapped with processing for the next expedited grace period. Because there are only four sets of waitqueues, it is necessary to ensure that the previous grace period's wakeups complete before the next grace period's wakeups start. This is handled by having the ->exp_mutex guard expedited grace-period processing and the ->exp_wake_mutex guard wakeups. The key point is that the ->exp_mutex is not released until the first wakeup is complete, which means that the ->exp_wake_mutex has already been acquired at that point. This approach ensures that the previous grace period's wakeups can be carried out while the current grace period is in progress, but that these wakeups will complete before the next grace period starts. This means that only three waitqueues are required, guaranteeing that the four that are provided are sufficient.
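The general pattern described above, in which the requester hands the heavy lifting to a workqueue item and sleeps on a completion, can be sketched as follows. The names and structure here are illustrative only and are not the actual expedited-grace-period code:

  #include <linux/completion.h>
  #include <linux/kernel.h>
  #include <linux/workqueue.h>

  struct exp_request {
          struct work_struct work;
          struct completion done;
  };

  static void exp_gp_workfn(struct work_struct *work)
  {
          struct exp_request *req = container_of(work, struct exp_request, work);

          /* ... drive the grace period to completion here ... */
          complete(&req->done);
  }

  static void wait_for_exp_gp(void)
  {
          struct exp_request req;

          INIT_WORK_ONSTACK(&req.work, exp_gp_workfn);
          init_completion(&req.done);
          schedule_work(&req.work);        /* A workqueue kthread runs exp_gp_workfn(). */
          wait_for_completion(&req.done);  /* Requester sleeps; no signal handling needed. */
          destroy_work_on_stack(&req.work);
  }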

Stall Warnings

+ +

+Expediting grace periods does nothing to speed things up when RCU +readers take too long, and therefore expedited grace periods check +for stalls just as normal grace periods do. + + + + + + + + +
 
Quick Quiz:
+ But why not just let the normal grace-period machinery + detect the stalls, given that a given reader must block + both normal and expedited grace periods? +
Answer:
+ Because it is quite possible that at a given time there + is no normal grace period in progress, in which case the + normal grace period cannot emit a stall warning. +
 
+ +The synchronize_sched_expedited_wait() function loops waiting +for the expedited grace period to end, but with a timeout set to the +current RCU CPU stall-warning time. +If this time is exceeded, any CPUs or rcu_node structures +blocking the current grace period are printed. +Each stall warning results in another pass through the loop, but the +second and subsequent passes use longer stall times. + +

Mid-boot operation

+ +

+The use of workqueues has the advantage that the expedited grace-period code need not worry about POSIX signals. Unfortunately, it has the corresponding disadvantage that workqueues cannot be used until they are initialized, which does not happen until some time after the scheduler spawns the first task. Given that there are parts of the kernel that really do want to execute grace periods during this mid-boot “dead zone”, expedited grace periods must do something else during this time.

+What they do is to fall back to the old practice of requiring that the +requesting task drive the expedited grace period, as was the case +before the use of workqueues. +However, the requesting task is only required to drive the grace period +during the mid-boot dead zone. +Before mid-boot, a synchronous grace period is a no-op. +Some time after mid-boot, workqueues are used. + +

+Non-expedited non-SRCU synchronous grace periods must also operate +normally during mid-boot. +This is handled by causing non-expedited grace periods to take the +expedited code path during mid-boot. + +

+The current code assumes that there are no POSIX signals during +the mid-boot dead zone. +However, if an overwhelming need for POSIX signals somehow arises, +appropriate adjustments can be made to the expedited stall-warning code. +One such adjustment would reinstate the pre-workqueue stall-warning +checks, but only during the mid-boot dead zone. + +

+With this refinement, synchronous grace periods can now be used from +task context pretty much any time during the life of the kernel. + +

+Summary

+ +

+Expedited grace periods use a sequence-number approach to promote +batching, so that a single grace-period operation can serve numerous +requests. +A funnel lock is used to efficiently identify the one task out of +a concurrent group that will request the grace period. +All members of the group will block on waitqueues provided in +the rcu_node structure. +The actual grace-period processing is carried out by a workqueue. + +

+CPU-hotplug operations are noted lazily in order to prevent the need +for tight synchronization between expedited grace periods and +CPU-hotplug operations. +The dyntick-idle counters are used to avoid sending IPIs to idle CPUs, +at least in the common case. +RCU-preempt and RCU-sched use different IPI handlers and different +code to respond to the state changes carried out by those handlers, +but otherwise use common code. + +

+Quiescent states are tracked using the rcu_node tree, +and once all necessary quiescent states have been reported, +all tasks waiting on this expedited grace period are awakened. +A pair of mutexes are used to allow one grace period's wakeups +to proceed concurrently with the next grace period's processing. + +

+This combination of mechanisms allows expedited grace periods to run reasonably efficiently. However, for non-time-critical tasks, normal grace periods should be used instead because their longer duration permits much higher degrees of batching, and thus much lower per-request overheads.

diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg (new file; labels: ->expedited_sequence: 0; :0; :0)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg (new file; labels: ->expedited_sequence: 0; A:2; B:2)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg (new file; labels: ->expedited_sequence: 0; :2; B:2; A:2)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg (new file; labels: ->expedited_sequence: 0 GP: A; :2 C; :2 D; :2 B)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg (new file; labels: ->expedited_sequence: 1 GP: A; E:4 C; F:4 D; :2 B)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg (new file; labels: ->expedited_sequence: 1 GP: A,E; :4 C; :4 D; :4 B F)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg (new file; labels: ->expedited_sequence: 2 GP: E Wakeup: A; :4 C; :4 D; :4 B F)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg (new file; labels: ->expedited_sequence: 3 GP: E,H Wakeup: A; :4 C; :6 D; :6 B F G)
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg (new file; labels: ->expedited_sequence: 3 GP: E,H; :4; :6; :6 F G)
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html (new file)

A Diagram of TREE_RCU's Grace-Period Memory Ordering

TreeRCU-gp.svg + + diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html new file mode 100644 index 000000000..a346ce011 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html @@ -0,0 +1,705 @@ + + + A Tour Through TREE_RCU's Grace-Period Memory Ordering + + +

August 8, 2017

+

This article was contributed by Paul E. McKenney

+ +

Introduction

+ +

This document gives a rough visual overview of how Tree RCU's +grace-period memory ordering guarantee is provided. + +

    +
  1. + What Is Tree RCU's Grace Period Memory Ordering Guarantee? +
  2. + Tree RCU Grace Period Memory Ordering Building Blocks +
  3. + Tree RCU Grace Period Memory Ordering Components +
  4. Putting It All Together +
+ +

+What Is Tree RCU's Grace Period Memory Ordering Guarantee?

+ +

RCU grace periods provide extremely strong memory-ordering guarantees +for non-idle non-offline code. +Any code that happens after the end of a given RCU grace period is guaranteed +to see the effects of all accesses prior to the beginning of that grace +period that are within RCU read-side critical sections. +Similarly, any code that happens before the beginning of a given RCU grace +period is guaranteed to see the effects of all accesses following the end +of that grace period that are within RCU read-side critical sections. + +

This guarantee is particularly pervasive for synchronize_sched(), +for which RCU-sched read-side critical sections include any region +of code for which preemption is disabled. +Given that each individual machine instruction can be thought of as +an extremely small region of preemption-disabled code, one can think of +synchronize_sched() as smp_mb() on steroids. + +

RCU updaters use this guarantee by splitting their updates into +two phases, one of which is executed before the grace period and +the other of which is executed after the grace period. +In the most common use case, phase one removes an element from +a linked RCU-protected data structure, and phase two frees that element. +For this to work, any readers that have witnessed state prior to the +phase-one update (in the common case, removal) must not witness state +following the phase-two update (in the common case, freeing). + +
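For example, the canonical two-phase update of an RCU-protected linked list looks something like the sketch below, in which struct foo and its fields are hypothetical and the caller is assumed to hold whatever lock serializes updaters:

  #include <linux/rculist.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  struct foo {
          struct list_head list;
          int key;
  };

  static void remove_foo(struct foo *fp)
  {
          list_del_rcu(&fp->list);  /* Phase one: unlink, before the grace period. */
          synchronize_rcu();        /* Wait for pre-existing readers to finish. */
          kfree(fp);                /* Phase two: free, after the grace period. */
  }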

The RCU implementation provides this guarantee using a network +of lock-based critical sections, memory barriers, and per-CPU +processing, as is described in the following sections. + +

+Tree RCU Grace Period Memory Ordering Building Blocks

+ +

The workhorse for RCU's grace-period memory ordering is the +critical section for the rcu_node structure's +->lock. +These critical sections use helper functions for lock acquisition, including +raw_spin_lock_rcu_node(), +raw_spin_lock_irq_rcu_node(), and +raw_spin_lock_irqsave_rcu_node(). +Their lock-release counterparts are +raw_spin_unlock_rcu_node(), +raw_spin_unlock_irq_rcu_node(), and +raw_spin_unlock_irqrestore_rcu_node(), +respectively. +For completeness, a +raw_spin_trylock_rcu_node() +is also provided. +The key point is that the lock-acquisition functions, including +raw_spin_trylock_rcu_node(), all invoke +smp_mb__after_unlock_lock() immediately after successful +acquisition of the lock. + +

Therefore, for any given rcu_node structure, any access happening before one of the above lock-release functions will be seen by all CPUs as happening before any access happening after a later one of the above lock-acquisition functions. Furthermore, any access happening before one of the above lock-release functions on any given CPU will be seen by all CPUs as happening before any access happening after a later one of the above lock-acquisition functions executing on that same CPU, even if the lock-release and lock-acquisition functions are operating on different rcu_node structures. Tree RCU uses these two ordering guarantees to form an ordering network among all CPUs that were in any way involved in the grace period, including any CPUs that came online or went offline during the grace period in question.

The following litmus test exhibits the ordering effects of these +lock-acquisition and lock-release functions: + +

+ 1 int x, y, z;
+ 2
+ 3 void task0(void)
+ 4 {
+ 5   raw_spin_lock_rcu_node(rnp);
+ 6   WRITE_ONCE(x, 1);
+ 7   r1 = READ_ONCE(y);
+ 8   raw_spin_unlock_rcu_node(rnp);
+ 9 }
+10
+11 void task1(void)
+12 {
+13   raw_spin_lock_rcu_node(rnp);
+14   WRITE_ONCE(y, 1);
+15   r2 = READ_ONCE(z);
+16   raw_spin_unlock_rcu_node(rnp);
+17 }
+18
+19 void task2(void)
+20 {
+21   WRITE_ONCE(z, 1);
+22   smp_mb();
+23   r3 = READ_ONCE(x);
+24 }
+25
+26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
+
+ +

The WARN_ON() is evaluated at “the end of time”, +after all changes have propagated throughout the system. +Without the smp_mb__after_unlock_lock() provided by the +acquisition functions, this WARN_ON() could trigger, for example +on PowerPC. +The smp_mb__after_unlock_lock() invocations prevent this +WARN_ON() from triggering. + +

This approach must be extended to include idle CPUs, which need +RCU's grace-period memory ordering guarantee to extend to any +RCU read-side critical sections preceding and following the current +idle sojourn. +This case is handled by calls to the strongly ordered +atomic_add_return() read-modify-write atomic operation that +is invoked within rcu_dynticks_eqs_enter() at idle-entry +time and within rcu_dynticks_eqs_exit() at idle-exit time. +The grace-period kthread invokes rcu_dynticks_snap() and +rcu_dynticks_in_eqs_since() (both of which invoke +an atomic_add_return() of zero) to detect idle CPUs. + + + + + + + + +
 
Quick Quiz:
+ But what about CPUs that remain offline for the entire + grace period? +
Answer:
+ Such CPUs will be offline at the beginning of the grace period, + so the grace period won't expect quiescent states from them. + Races between grace-period start and CPU-hotplug operations + are mediated by the CPU's leaf rcu_node structure's + ->lock as described above. +
 
+ +

The approach must be extended to handle one final case, that +of waking a task blocked in synchronize_rcu(). +This task might be affinitied to a CPU that is not yet aware that +the grace period has ended, and thus might not yet be subject to +the grace period's memory ordering. +Therefore, there is an smp_mb() after the return from +wait_for_completion() in the synchronize_rcu() +code path. + + + + + + + + +
 
Quick Quiz:
+ What? Where??? + I don't see any smp_mb() after the return from + wait_for_completion()!!! +
Answer:
+ That would be because I spotted the need for that + smp_mb() during the creation of this documentation, + and it is therefore unlikely to hit mainline before v4.14. + Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and + Jonathan Cameron for asking questions that sensitized me + to the rather elaborate sequence of events that demonstrate + the need for this memory barrier. +
 
+ +

Tree RCU's grace-period memory-ordering guarantees rely most heavily on the rcu_node structure's ->lock field, so much so that it is necessary to abbreviate this pattern in the diagrams in the next section. For example, consider the rcu_prepare_for_idle() function shown below, which is one of several functions that enforce ordering of newly arrived RCU callbacks against future grace periods:

+ 1 static void rcu_prepare_for_idle(void)
+ 2 {
+ 3   bool needwake;
+ 4   struct rcu_data *rdp;
+ 5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ 6   struct rcu_node *rnp;
+ 7   struct rcu_state *rsp;
+ 8   int tne;
+ 9
+10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
+11       rcu_is_nocb_cpu(smp_processor_id()))
+12     return;
+13   tne = READ_ONCE(tick_nohz_active);
+14   if (tne != rdtp->tick_nohz_enabled_snap) {
+15     if (rcu_cpu_has_callbacks(NULL))
+16       invoke_rcu_core();
+17     rdtp->tick_nohz_enabled_snap = tne;
+18     return;
+19   }
+20   if (!tne)
+21     return;
+22   if (rdtp->all_lazy &&
+23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
+24     rdtp->all_lazy = false;
+25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
+26     invoke_rcu_core();
+27     return;
+28   }
+29   if (rdtp->last_accelerate == jiffies)
+30     return;
+31   rdtp->last_accelerate = jiffies;
+32   for_each_rcu_flavor(rsp) {
+33     rdp = this_cpu_ptr(rsp->rda);
+34     if (rcu_segcblist_pend_cbs(&rdp->cblist))
+35       continue;
+36     rnp = rdp->mynode;
+37     raw_spin_lock_rcu_node(rnp);
+38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+39     raw_spin_unlock_rcu_node(rnp);
+40     if (needwake)
+41       rcu_gp_kthread_wake(rsp);
+42   }
+43 }
+
+ +

But the only part of rcu_prepare_for_idle() that really matters for this discussion is lines 37–39. We will therefore abbreviate this function as follows:

rcu_node-lock.svg + +

The box represents the rcu_node structure's ->lock +critical section, with the double line on top representing the additional +smp_mb__after_unlock_lock(). + +

+Tree RCU Grace Period Memory Ordering Components

+ +

Tree RCU's grace-period memory-ordering guarantee is provided by +a number of RCU components: + +

    +
  1. Callback Registry +
  2. Grace-Period Initialization +
  3. + Self-Reported Quiescent States +
  4. Dynamic Tick Interface +
  5. CPU-Hotplug Interface +
  6. Forcing Quiescent States +
  7. Grace-Period Cleanup +
  8. Callback Invocation +
+ +

Each of the following sections looks at the corresponding component in detail.

Callback Registry

+ +

If RCU's grace-period guarantee is to mean anything at all, any +access that happens before a given invocation of call_rcu() +must also happen before the corresponding grace period. +The implementation of this portion of RCU's grace period guarantee +is shown in the following figure: + +

TreeRCU-callback-registry.svg + +

Because call_rcu() normally acts only on CPU-local state, +it provides no ordering guarantees, either for itself or for +phase one of the update (which again will usually be removal of +an element from an RCU-protected data structure). +It simply enqueues the rcu_head structure on a per-CPU list, +which cannot become associated with a grace period until a later +call to rcu_accelerate_cbs(), as shown in the diagram above. + +

One set of code paths shown on the left invokes rcu_accelerate_cbs() via note_gp_changes(), either directly from call_rcu() (if the current CPU is inundated with queued rcu_head structures) or more likely from an RCU_SOFTIRQ handler. Another code path in the middle is taken only in kernels built with CONFIG_RCU_FAST_NO_HZ=y, which invokes rcu_accelerate_cbs() via rcu_prepare_for_idle(). The final code path on the right is taken only in kernels built with CONFIG_HOTPLUG_CPU=y, which invokes rcu_accelerate_cbs() via rcu_advance_cbs(), rcu_migrate_callbacks(), rcutree_migrate_callbacks(), and takedown_cpu(), which in turn is invoked on a surviving CPU after the outgoing CPU has been completely offlined.

There are a few other code paths within grace-period processing that opportunistically invoke rcu_accelerate_cbs(). However, either way, all of the CPU's recently queued rcu_head structures are associated with a future grace-period number under the protection of the CPU's leaf rcu_node structure's ->lock. In all cases, there is full ordering against any prior critical section for that same rcu_node structure's ->lock, and also full ordering against any of the current task's or CPU's prior critical sections for any rcu_node structure's ->lock.

The next section will show how this ordering ensures that any +accesses prior to the call_rcu() (particularly including phase +one of the update) +happen before the start of the corresponding grace period. + + + + + + + + +
 
Quick Quiz:
+ But what about synchronize_rcu()? +
Answer:
+ The synchronize_rcu() passes call_rcu() + to wait_rcu_gp(), which invokes it. + So either way, it eventually comes down to call_rcu(). +
 
+ +

Grace-Period Initialization

+ +

Grace-period initialization is carried out by the grace-period kernel thread, which makes several passes over the rcu_node tree within the rcu_gp_init() function. This means that showing the full flow of ordering through the grace-period computation will require duplicating this tree. If you find this confusing, please note that the state of the rcu_node tree changes over time, just like Heraclitus's river. However, to keep the rcu_node river tractable, the grace-period kernel thread's traversals are presented in multiple parts, starting in this section with the various phases of grace-period initialization.

The first ordering-related grace-period initialization action is to +advance the rcu_state structure's ->gp_seq +grace-period-number counter, as shown below: + +

TreeRCU-gp-init-1.svg + +

The actual increment is carried out using smp_store_release(), +which helps reject false-positive RCU CPU stall detection. +Note that only the root rcu_node structure is touched. + +

The first pass through the rcu_node tree updates bitmasks +based on CPUs having come online or gone offline since the start of +the previous grace period. +In the common case where the number of online CPUs for this rcu_node +structure has not transitioned to or from zero, +this pass will scan only the leaf rcu_node structures. +However, if the number of online CPUs for a given leaf rcu_node +structure has transitioned from zero, +rcu_init_new_rnp() will be invoked for the first incoming CPU. +Similarly, if the number of online CPUs for a given leaf rcu_node +structure has transitioned to zero, +rcu_cleanup_dead_rnp() will be invoked for the last outgoing CPU. +The diagram below shows the path of ordering if the leftmost +rcu_node structure onlines its first CPU and if the next +rcu_node structure has no online CPUs +(or, alternatively if the leftmost rcu_node structure offlines +its last CPU and if the next rcu_node structure has no online CPUs). + +

TreeRCU-gp-init-2.svg

The final rcu_gp_init() pass through the rcu_node +tree traverses breadth-first, setting each rcu_node structure's +->gp_seq field to the newly advanced value from the +rcu_state structure, as shown in the following diagram. + +

TreeRCU-gp-init-3.svg

This change will also cause each CPU's next call to +__note_gp_changes() +to notice that a new grace period has started, as described in the next +section. +But because the grace-period kthread started the grace period at the +root (with the advancing of the rcu_state structure's +->gp_seq field) before setting each leaf rcu_node +structure's ->gp_seq field, each CPU's observation of +the start of the grace period will happen after the actual start +of the grace period. + + + + + + + + +
 
Quick Quiz:
+ But what about the CPU that started the grace period? + Why wouldn't it see the start of the grace period right when + it started that grace period? +
Answer:
+ In some deep philosophical and overly anthropomorphized sense, yes, the CPU starting the grace period is immediately aware of having done so. However, if we instead assume that RCU is not self-aware, then even the CPU starting the grace period does not really become aware of the start of this grace period until its first call to __note_gp_changes(). On the other hand, this CPU potentially gets early notification because it invokes __note_gp_changes() during its last rcu_gp_init() pass through its leaf rcu_node structure.
 
+ +

+Self-Reported Quiescent States

+ +

When all entities that might block the grace period have reported +quiescent states (or as described in a later section, had quiescent +states reported on their behalf), the grace period can end. +Online non-idle CPUs report their own quiescent states, as shown +in the following diagram: + +

TreeRCU-qs.svg + +

This is for the last CPU to report a quiescent state, which signals +the end of the grace period. +Earlier quiescent states would push up the rcu_node tree +only until they encountered an rcu_node structure that +is waiting for additional quiescent states. +However, ordering is nevertheless preserved because some later quiescent +state will acquire that rcu_node structure's ->lock. + +

Any number of events can lead up to a CPU invoking +note_gp_changes (or alternatively, directly invoking +__note_gp_changes()), at which point that CPU will notice +the start of a new grace period while holding its leaf +rcu_node lock. +Therefore, all execution shown in this diagram happens after the +start of the grace period. +In addition, this CPU will consider any RCU read-side critical +section that started before the invocation of __note_gp_changes() +to have started before the grace period, and thus a critical +section that the grace period must wait on. + + + + + + + + +
 
Quick Quiz:
+ But an RCU read-side critical section might have started + after the beginning of the grace period + (the advancing of ->gp_seq from earlier), so why should + the grace period wait on such a critical section? 
Answer:
+ It is indeed not necessary for the grace period to wait on such + a critical section. + However, it is permissible to wait on it. + And it is furthermore important to wait on it, as this + lazy approach is far more scalable than a “big bang” + all-at-once grace-period start could possibly be. +
 
+ +

If the CPU does a context switch, a quiescent state will be +noted by rcu_note_context_switch() on the left. +On the other hand, if the CPU takes a scheduler-clock interrupt +while executing in usermode, a quiescent state will be noted by +rcu_check_callbacks() on the right. +Either way, the passage through a quiescent state will be noted +in a per-CPU variable. + +

The next time an RCU_SOFTIRQ handler executes on +this CPU (for example, after the next scheduler-clock +interrupt), __rcu_process_callbacks() will invoke +rcu_check_quiescent_state(), which will notice the +recorded quiescent state, and invoke +rcu_report_qs_rdp(). +If rcu_report_qs_rdp() verifies that the quiescent state +really does apply to the current grace period, it invokes +rcu_report_qs_rnp() which traverses up the rcu_node +tree as shown at the bottom of the diagram, clearing bits from +each rcu_node structure's ->qsmask field, +and propagating up the tree when the result is zero. + +

Note that traversal passes upwards out of a given rcu_node +structure only if the current CPU is reporting the last quiescent +state for the subtree headed by that rcu_node structure. +A key point is that if a CPU's traversal stops at a given rcu_node +structure, then there will be a later traversal by another CPU +(or perhaps the same one) that proceeds upwards +from that point, and the rcu_node ->lock +guarantees that the first CPU's quiescent state happens before the +remainder of the second CPU's traversal. +Applying this line of thought repeatedly shows that all CPUs' +quiescent states happen before the last CPU traverses through +the root rcu_node structure, the “last CPU” +being the one that clears the last bit in the root rcu_node +structure's ->qsmask field. + +
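The clear-and-propagate step can be modeled in a few lines of ordinary C. The sketch below is a minimal userspace model, not the kernel's rcu_node code; it omits the ->lock acquisitions that provide the ordering just described and shows only how a bit is cleared at the leaf and how the traversal moves upward only when a node's mask reaches zero:

	#include <stdio.h>

	struct node {
		struct node *parent;
		unsigned long qsmask;	/* children that still owe a quiescent state */
		unsigned long grpmask;	/* this node's bit in parent->qsmask */
	};

	/* Returns 1 if this report cleared the root's last bit (grace period ends). */
	static int report_qs(struct node *np, unsigned long mask)
	{
		for (; np; np = np->parent) {
			np->qsmask &= ~mask;
			if (np->qsmask)
				return 0;	/* Others still pending: stop here. */
			mask = np->grpmask;	/* Propagate one level up. */
		}
		return 1;
	}

	int main(void)
	{
		struct node root  = { .parent = NULL,  .qsmask = 0x3 };
		struct node leaf0 = { .parent = &root, .qsmask = 0x3, .grpmask = 0x1 };
		struct node leaf1 = { .parent = &root, .qsmask = 0x1, .grpmask = 0x2 };

		printf("%d\n", report_qs(&leaf0, 0x1)); /* 0: leaf0 still waits on its other CPU */
		printf("%d\n", report_qs(&leaf0, 0x2)); /* 0: root still waits on leaf1 */
		printf("%d\n", report_qs(&leaf1, 0x1)); /* 1: last bit cleared at the root */
		return 0;
	}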

Dynamic Tick Interface

+ +

Due to energy-efficiency considerations, RCU is forbidden from +disturbing idle CPUs. +CPUs are therefore required to notify RCU when entering or leaving idle +state, which they do via fully ordered value-returning atomic operations +on a per-CPU variable. +The ordering effects are as shown below: + +

TreeRCU-dyntick.svg + +

The RCU grace-period kernel thread samples the per-CPU idleness +variable while holding the corresponding CPU's leaf rcu_node +structure's ->lock. +This means that any RCU read-side critical sections that precede the +idle period (the oval near the top of the diagram above) will happen +before the end of the current grace period. +Similarly, the beginning of the current grace period will happen before +any RCU read-side critical sections that follow the +idle period (the oval near the bottom of the diagram above). + +
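The sampling idea can be sketched with C11 atomics. The model below deliberately simplifies the kernel's actual counter encoding; it shows only why a fully ordered increment on each idle transition lets a remote sampler conclude that a CPU either was idle at snapshot time or has passed through idle since:

	#include <stdatomic.h>
	#include <stdbool.h>

	/* Model only: odd means idle, even means non-idle; the CPU starts non-idle. */
	static _Atomic unsigned long dynticks;

	static void model_idle_enter(void)
	{
		atomic_fetch_add(&dynticks, 1);	/* Fully ordered: prior accesses complete first. */
	}

	static void model_idle_exit(void)
	{
		atomic_fetch_add(&dynticks, 1);	/* Fully ordered: later accesses start afterwards. */
	}

	/* Grace-period thread: take a snapshot now... */
	static unsigned long model_snapshot(void)
	{
		return atomic_load(&dynticks);
	}

	/* ...and decide later whether this CPU can be credited with a quiescent state. */
	static bool model_was_quiescent(unsigned long snap)
	{
		return (snap & 1) || atomic_load(&dynticks) != snap;
	}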

Plumbing this into the full grace-period execution is described +below. + +

CPU-Hotplug Interface

+ +

RCU is also forbidden from disturbing offline CPUs, which might well +be powered off and removed from the system completely. +CPUs are therefore required to notify RCU of their comings and goings +as part of the corresponding CPU hotplug operations. +The ordering effects are shown below: + +

TreeRCU-hotplug.svg + +

Because CPU hotplug operations are much less frequent than idle transitions, +they are heavier weight, and thus acquire the CPU's leaf rcu_node +structure's ->lock and update this structure's +->qsmaskinitnext. +The RCU grace-period kernel thread samples this mask to detect CPUs +having gone offline since the beginning of this grace period. + +

Plumbing this into the full grace-period execution is described +below. + +

Forcing Quiescent States

+ +

As noted above, idle and offline CPUs cannot report their own +quiescent states, and therefore the grace-period kernel thread +must do the reporting on their behalf. +This process, which is called “forcing quiescent states”, is +repeated every few jiffies, and its ordering effects are shown below: + +

TreeRCU-gp-fqs.svg + +

Each pass of quiescent state forcing is guaranteed to traverse the +leaf rcu_node structures, and if there are no new quiescent +states due to recently idled and/or offlined CPUs, then only the +leaves are traversed. +However, if there is a newly offlined CPU as illustrated on the left +or a newly idled CPU as illustrated on the right, the corresponding +quiescent state will be driven up towards the root. +As with self-reported quiescent states, the upwards driving stops +once it reaches an rcu_node structure that has quiescent +states outstanding from other CPUs. + + + + + + + + +
 
Quick Quiz:
+ The leftmost drive to root stopped before it reached + the root rcu_node structure, which means that + there are still CPUs subordinate to that structure on + which the current grace period is waiting. + Given that, how is it possible that the rightmost drive + to root ended the grace period? +
Answer:
+ Good analysis! + It is in fact impossible in the absence of bugs in RCU. + But this diagram is complex enough as it is, so simplicity + overrode accuracy. + You can think of it as poetic license, or you can think of + it as misdirection that is resolved in the + stitched-together diagram. +
 
+ +

Grace-Period Cleanup

+ +

Grace-period cleanup first scans the rcu_node tree +breadth-first advancing all the ->gp_seq fields, then it +advances the rcu_state structure's ->gp_seq field. +The ordering effects are shown below: + +

TreeRCU-gp-cleanup.svg + +

As indicated by the oval at the bottom of the diagram, once +grace-period cleanup is complete, the next grace period can begin. + + + + + + + + +
 
Quick Quiz:
+ But when precisely does the grace period end? +
Answer:
+ There is no useful single point at which the grace period + can be said to end. + The earliest reasonable candidate is as soon as the last + CPU has reported its quiescent state, but it may be some + milliseconds before RCU becomes aware of this. + The latest reasonable candidate is once the rcu_state + structure's ->gp_seq field has been updated, + but it is quite possible that some CPUs have already completed + phase two of their updates by that time. + In short, if you are going to work with RCU, you need to + learn to embrace uncertainty. +
 
+ + +

Callback Invocation

+ +

Once a given CPU's leaf rcu_node structure's +->gp_seq field has been updated, that CPU can begin +invoking its RCU callbacks that were waiting for this grace period +to end. +These callbacks are identified by rcu_advance_cbs(), +which is usually invoked by __note_gp_changes(). +As shown in the diagram below, this invocation can be triggered by +the scheduling-clock interrupt (rcu_check_callbacks() on +the left) or by idle entry (rcu_cleanup_after_idle() on +the right, but only for kernels built with +CONFIG_RCU_FAST_NO_HZ=y). +Either way, RCU_SOFTIRQ is raised, which results in +rcu_do_batch() invoking the callbacks, which in turn +allows those callbacks to carry out (either directly or indirectly +via wakeup) the needed phase-two processing for each update. + +

TreeRCU-callback-invocation.svg + +

Please note that callback invocation can also be prompted by any +number of corner-case code paths, for example, when a CPU notes that +it has excessive numbers of callbacks queued. +In all cases, the CPU acquires its leaf rcu_node structure's +->lock before invoking callbacks, which preserves the +required ordering against the newly completed grace period. + +

However, if the callback function communicates to other CPUs, +for example, doing a wakeup, then it is that function's responsibility +to maintain ordering. +For example, if the callback function wakes up a task that runs on +some other CPU, proper ordering must be in place in both the callback +function and the task being awakened. +To see why this is important, consider the top half of the +grace-period cleanup diagram. +The callback might be running on a CPU corresponding to the leftmost +leaf rcu_node structure, and awaken a task that is to run on +a CPU corresponding to the rightmost leaf rcu_node structure, +and the grace-period kernel thread might not yet have reached the +rightmost leaf. +In this case, the grace period's memory ordering might not yet have +reached that CPU, so again the callback function and the awakened +task must supply proper ordering. + +
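One common way to supply that ordering is to have the callback hand off to a completion, whose internal locking provides the needed barriers (this is, in essence, how synchronize_rcu() itself waits). The sketch below is illustrative only; struct foo is a hypothetical type:

	struct foo {
		struct rcu_head rh;
		struct completion done;
	};

	static void foo_gp_done(struct rcu_head *rhp)
	{
		struct foo *fp = container_of(rhp, struct foo, rh);

		/* The completion's locking orders the grace period before the waiter. */
		complete(&fp->done);
	}

	/* Updater side, on whatever CPU the waiter happens to be running:
	 *	init_completion(&fp->done);
	 *	call_rcu(&fp->rh, foo_gp_done);
	 *	wait_for_completion(&fp->done);
	 */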

Putting It All Together

+ +

A stitched-together diagram is +here. + +

+Legal Statement

+ +

This work represents the view of the author and does not necessarily +represent the view of IBM. + +

Linux is a registered trademark of Linus Torvalds. + +

Other company, product, and service names may be trademarks or +service marks of others. + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg new file mode 100644 index 000000000..832408313 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg @@ -0,0 +1,486 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_check_callbacks() + + rcu_cleanup_after_idle() + + rcu_advance_cbs() + + + Leaf + __note_gp_changes() + + + + Phase Two + of Update + + + RCU_SOFTIRQ + rcu_do_batch() + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg new file mode 100644 index 000000000..7ac6f9269 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg @@ -0,0 +1,655 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_accelerate_cbs() + + + + + rcu_prepare_for_idle() + + rcu_accelerate_cbs() + + + + + note_gp_changes() + rcu_advance_cbs() + __note_gp_changes() + + call_rcu() + + Wake up + grace-period + kernel thread + + rcu_accelerate_cbs() + + + + + takedown_cpu() + rcutree_migrate_callbacks() + rcu_migrate_callbacks() + rcu_advance_cbs() + Leaf + Leaf + Leaf + + Phase One + of Update + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg new file mode 100644 index 000000000..423df00c4 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg @@ -0,0 +1,700 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->qsmask &= ~->grpmask + Leaf + dyntick_save_progress_counter() + rcu_implicit_dynticks_qs() + + + + RCU + read-side + critical section + + + + rcu_dynticks_eqs_enter() + atomic_add_return() + + + + rcu_dynticks_eqs_exit() + atomic_add_return() + + + + RCU + read-side + critical section + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg new file mode 100644 index 000000000..bf84fbab2 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg @@ -0,0 +1,1133 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + Root + + + rcu_gp_cleanup() + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + + + Leaf + + + + + + + Root + rcu_seq_end(&rsp->gp_seq) + + + + + + + + + + + + Leaf + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + 
rcu_seq_end(&rnp->gp_seq) + + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + + + Leaf + + + + + + + Leaf + rcu_seq_end(&rnp->gp_seq) + + + + + + + Leaf + rcu_seq_end(&rnp->gp_seq) + + + + + + + + + + Start of + Next Grace + Period + + + rcu_seq_end(&rnp->gp_seq) + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg new file mode 100644 index 000000000..7ddc094d7 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg @@ -0,0 +1,1309 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_gp_fqs() + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + force_qs_rnp() + dyntick_save_progress_counter() + + + + + + Root + ->qsmask &= ~->grpmask + + rcu_implicit_dynticks_qs() + ->qsmask &= ~->grpmask + + + RCU + read-side + critical section + + + + rcu_dynticks_eqs_enter() + atomic_add_return() + + + + rcu_dynticks_eqs_exit() + atomic_add_return() + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + rcu_report_dead() + rcu_cleanup_dying_idle_cpu() + + + + + ->qsmaskinitnext + Leaf + + + + RCU + read-side + critical section + + + + rcu_cpu_starting() + + + + + ->qsmaskinitnext + Leaf + + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg new file mode 100644 index 000000000..8c2075508 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg @@ -0,0 +1,658 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_seq_start(rsp->gp_seq) + + + + + Root + + + rcu_gp_init() + + + + + + + + + + + + Leaf + + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + + + End of + Last Grace + Period + + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg new file mode 100644 index 000000000..4d956a732 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg @@ -0,0 +1,656 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_gp_init() + + + + + + + + + + + + Leaf + + + + + + + ->qsmaskinit + ->qsmaskinitnext + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmaskinit + + + + + + + + + rcu_init_new_rnp() or + rcu_cleanup_dead_rnp() + (optional) + + ->qsmaskinit + + + + + Root + ->qsmaskinitnext + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg new file mode 100644 index 000000000..d24d7d555 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg @@ -0,0 +1,636 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->gp_seq = rsp->gp_seq + + + + + Root + + + rcu_gp_init() + + + + + + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + 
->gp_seq = rsp->gp_seq + + + + + + + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + + ->gp_seq = rsp->gp_seq + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg new file mode 100644 index 000000000..acd73c7ad --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg @@ -0,0 +1,5144 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_accelerate_cbs() + + + + + rcu_prepare_for_idle() + + rcu_accelerate_cbs() + + + + + note_gp_changes() + rcu_advance_cbs() + __note_gp_changes() + + call_rcu() + + Wake up + grace-period + kernel thread + + rcu_accelerate_cbs() + + + + + takedown_cpu() + rcutree_migrate_callbacks() + rcu_migrate_callbacks() + rcu_advance_cbs() + Leaf + Leaf + Leaf + + Phase One + of Update + + + + + + + Root + rcu_seq_start(rsp->gp_seq) + + + rcu_gp_init() + + + + + + + + + + + + Leaf + + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + + + End of + Last Grace + Period + + + + Grace-period + kernel thread + awakened + + + + + + + + + + + + + + Leaf + + + + + + + ->qsmaskinit + ->qsmaskinitnext + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmaskinit + + + + + + + + + rcu_init_new_rnp() or + rcu_cleanup_dead_rnp() + (optional) + + ->qsmaskinit + + + + + Root + ->qsmaskinitnext + + + + + + + + Root + ->gp_seq = rsp->gp_seq + + + + + + + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + + + + + + + + rcu_gp_fqs() + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + force_qs_rnp() + dyntick_save_progress_counter() + + + + + + Root + ->qsmask &= ~->grpmask + + rcu_implicit_dynticks_qs() + ->qsmask &= ~->grpmask + + + RCU + read-side + critical section + + + + rcu_dynticks_eqs_enter() + atomic_add_return() + + + + rcu_dynticks_eqs_exit() + atomic_add_return() + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + rcu_report_dead() + rcu_cleanup_dying_idle_cpu() + + + + + ->qsmaskinitnext + Leaf + + + + RCU + read-side + critical section + + + + rcu_cpu_starting() + + + + + ->qsmaskinitnext + Leaf + + + + + ->qsmask &= ~->grpmask + + + + + Root + + + rcu_report_rnp() + + + + + + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + + + + + + + + + note_gp_changes() + rdp->gp_seq + __note_gp_changes() + Leaf + + + + rcu_node_context_switch() + + + + 
rcu_check_callbacks() + + + + rcu_process_callbacks() + rcu_check_quiescent_state()) + rcu__report_qs_rdp()) + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + + + Wake up + grace-period + kernel thread + + + + rcu_report_qs_rsp() + + + Grace-period + kernel thread + awakened + + + + + + + + Root + rcu_seq_end(&rnp->gp_seq) + + + rcu_gp_cleanup() + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + + Leaf + + + + + + Root + + + + + + + + + + + Leaf + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + + + Leaf + + + + + + + Leaf + rcu_seq_end(&rnp->gp_seq) + + + + + + + Leaf + rcu_seq_end(&rnp->gp_seq) + + + + + + + + + + Start of + Next Grace + Period + + + + + + rcu_check_callbacks() + + rcu_cleanup_after_idle() + rcu_advance_cbs() + + + Leaf + __note_gp_changes() + + + Phase Two + of Update + + + RCU_SOFTIRQ + rcu_do_batch() + rcu_seq_end(&rnp->gp_seq) + rcu_seq_end(&rnp->gp_seq) + rcu_seq_end(&rsp->gp_seq) + ->gp_seq = rsp->gp_seq + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg new file mode 100644 index 000000000..2c9310ba2 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg @@ -0,0 +1,775 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->qsmask &= ~->grpmask + dyntick_save_progress_counter() + rcu_implicit_dynticks_qs() + Leaf + + + + RCU + read-side + critical section + + + + rcu_report_dead() + rcu_cleanup_dying_idle_cpu() + + + + + ->qsmaskinitnext + Leaf + + + + RCU + read-side + critical section + + + + rcu_cpu_starting() + + + + + ->qsmaskinitnext + Leaf + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg new file mode 100644 index 000000000..149bec2a4 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg @@ -0,0 +1,1095 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->qsmask &= ~->grpmask + + + + + Root + + + rcu_report_rnp() + + + + + + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + + + + + + + + + + note_gp_changes() + rdp->gp_seq + __note_gp_changes() + Leaf + + + + rcu_node_context_switch() + + + + rcu_check_callbacks() + + + + rcu_process_callbacks() + rcu_check_quiescent_state()) + rcu__report_qs_rdp()) + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + + + + Wake up + grace-period + kernel thread + + + + rcu_report_qs_rsp() + + diff --git a/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg b/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg new file mode 100644 index 000000000..94c96c595 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg @@ -0,0 +1,229 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_accelerate_cbs() + + + + + + + rcu_prepare_for_idle() + + diff --git a/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg b/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg new file mode 100644 index 000000000..4b4014fda --- /dev/null +++ b/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg @@ -0,0 +1,374 @@ + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + + synchronize_rcu() + + + + + + + WRITE_ONCE(a, 1); + WRITE_ONCE(b, 1); + r1 = READ_ONCE(a); + WRITE_ONCE(c, 1); + r2 = READ_ONCE(b); + r3 = READ_ONCE(c); + thread0() + thread1() + thread2() + + + + rcu_read_lock(); + rcu_read_lock(); + rcu_read_unlock(); + rcu_read_unlock(); + + QS + + QS + + + QS + + diff --git a/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg b/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg new file mode 100644 index 000000000..48cd1623d --- /dev/null +++ b/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg @@ -0,0 +1,639 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + + synchronize_rcu() + + + + + + + WRITE_ONCE(a, 1); + WRITE_ONCE(b, 1); + r1 = READ_ONCE(a); + WRITE_ONCE(c, 1); + WRITE_ONCE(d, 1); + r2 = READ_ONCE(c); + thread0() + thread1() + thread2() + + + + rcu_read_lock(); + rcu_read_lock(); + rcu_read_unlock(); + rcu_read_unlock(); + + QS + + QS + + + QS + + + + synchronize_rcu() + + + + + + + r3 = READ_ONCE(d); + WRITE_ONCE(e, 1); + + QS + r4 = READ_ONCE(b); + r5 = READ_ONCE(e); + rcu_read_lock(); + rcu_read_unlock(); + QS + + QS + + QS + + thread3() + thread4() + + diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html new file mode 100644 index 000000000..49690228b --- /dev/null +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -0,0 +1,3324 @@ + + + A Tour Through RCU's Requirements [LWN.net] + + +

A Tour Through RCU's Requirements

+ +

Copyright IBM Corporation, 2015

+

Author: Paul E. McKenney

+

The initial version of this document appeared in the +LWN articles +here, +here, and +here.

+ +

Introduction

+ +

+Read-copy update (RCU) is a synchronization mechanism that is often +used as a replacement for reader-writer locking. +RCU is unusual in that updaters do not block readers, +which means that RCU's read-side primitives can be exceedingly fast +and scalable. +In addition, updaters can make useful forward progress concurrently +with readers. +However, all this concurrency between RCU readers and updaters does raise +the question of exactly what RCU readers are doing, which in turn +raises the question of exactly what RCU's requirements are. + +

+This document therefore summarizes RCU's requirements, and can be thought +of as an informal, high-level specification for RCU. +It is important to understand that RCU's specification is primarily +empirical in nature; +in fact, I learned about many of these requirements the hard way. +This situation might cause some consternation, however, not only +has this learning process been a lot of fun, but it has also been +a great privilege to work with so many people willing to apply +technologies in interesting new ways. + +

+All that aside, here are the categories of currently known RCU requirements: +

+ +
    +
  1. + Fundamental Requirements +
  2. Fundamental Non-Requirements +
  3. + Parallelism Facts of Life +
  4. + Quality-of-Implementation Requirements +
  5. + Linux Kernel Complications +
  6. + Software-Engineering Requirements +
  7. + Other RCU Flavors +
  8. + Possible Future Changes +
+ +

+This is followed by a summary; +however, the answer to each quick quiz immediately follows the quiz. +Select the big white space with your mouse to see the answer. + +

Fundamental Requirements

+ +

+RCU's fundamental requirements are the closest thing RCU has to hard +mathematical requirements. +These are: + +

    +
  1. + Grace-Period Guarantee +
  2. + Publish-Subscribe Guarantee +
  3. + Memory-Barrier Guarantees +
  4. + RCU Primitives Guaranteed to Execute Unconditionally +
  5. + Guaranteed Read-to-Write Upgrade +
+ +

Grace-Period Guarantee

+ +

+RCU's grace-period guarantee is unusual in being premeditated: +Jack Slingwine and I had this guarantee firmly in mind when we started +work on RCU (then called “rclock”) in the early 1990s. +That said, the past two decades of experience with RCU have produced +a much more detailed understanding of this guarantee. + +

+RCU's grace-period guarantee allows updaters to wait for the completion +of all pre-existing RCU read-side critical sections. +An RCU read-side critical section +begins with the marker rcu_read_lock() and ends with +the marker rcu_read_unlock(). +These markers may be nested, and RCU treats a nested set as one +big RCU read-side critical section. +Production-quality implementations of rcu_read_lock() and +rcu_read_unlock() are extremely lightweight, and in +fact have exactly zero overhead in Linux kernels built for production +use with CONFIG_PREEMPT=n. + +

+This guarantee allows ordering to be enforced with extremely low +overhead to readers, for example: + +

+
+ 1 int x, y;
+ 2
+ 3 void thread0(void)
+ 4 {
+ 5   rcu_read_lock();
+ 6   r1 = READ_ONCE(x);
+ 7   r2 = READ_ONCE(y);
+ 8   rcu_read_unlock();
+ 9 }
+10
+11 void thread1(void)
+12 {
+13   WRITE_ONCE(x, 1);
+14   synchronize_rcu();
+15   WRITE_ONCE(y, 1);
+16 }
+
+
+ +

+Because the synchronize_rcu() on line 14 waits for +all pre-existing readers, any instance of thread0() that +loads a value of zero from x must complete before +thread1() stores to y, so that instance must +also load a value of zero from y. +Similarly, any instance of thread0() that loads a value of +one from y must have started after the +synchronize_rcu() started, and must therefore also load +a value of one from x. +Therefore, the outcome: +

+
+(r1 == 0 && r2 == 1)
+
+
+cannot happen. + + + + + + + + +
 
Quick Quiz:
+ Wait a minute! + You said that updaters can make useful forward progress concurrently + with readers, but pre-existing readers will block + synchronize_rcu()!!! + Just who are you trying to fool??? +
Answer:
+ First, if updaters do not wish to be blocked by readers, they can use + call_rcu() or kfree_rcu(), which will + be discussed later. + Second, even when using synchronize_rcu(), the other + update-side code does run concurrently with readers, whether + pre-existing or not. +
 
+ +

+This scenario resembles one of the first uses of RCU in +DYNIX/ptx, +which managed a distributed lock manager's transition into +a state suitable for handling recovery from node failure, +more or less as follows: + +

+
+ 1 #define STATE_NORMAL        0
+ 2 #define STATE_WANT_RECOVERY 1
+ 3 #define STATE_RECOVERING    2
+ 4 #define STATE_WANT_NORMAL   3
+ 5
+ 6 int state = STATE_NORMAL;
+ 7
+ 8 void do_something_dlm(void)
+ 9 {
+10   int state_snap;
+11
+12   rcu_read_lock();
+13   state_snap = READ_ONCE(state);
+14   if (state_snap == STATE_NORMAL)
+15     do_something();
+16   else
+17     do_something_carefully();
+18   rcu_read_unlock();
+19 }
+20
+21 void start_recovery(void)
+22 {
+23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
+24   synchronize_rcu();
+25   WRITE_ONCE(state, STATE_RECOVERING);
+26   recovery();
+27   WRITE_ONCE(state, STATE_WANT_NORMAL);
+28   synchronize_rcu();
+29   WRITE_ONCE(state, STATE_NORMAL);
+30 }
+
+
+ +

+The RCU read-side critical section in do_something_dlm() +works with the synchronize_rcu() in start_recovery() +to guarantee that do_something() never runs concurrently +with recovery(), but with little or no synchronization +overhead in do_something_dlm(). + + + + + + + + +
 
Quick Quiz:
+ Why is the synchronize_rcu() on line 28 needed? +
Answer:
+ Without that extra grace period, memory reordering could result in + do_something_dlm() executing do_something() + concurrently with the last bits of recovery(). +
 
+ +

+In order to avoid fatal problems such as deadlocks, +an RCU read-side critical section must not contain calls to +synchronize_rcu(). +Similarly, an RCU read-side critical section must not +contain anything that waits, directly or indirectly, on completion of +an invocation of synchronize_rcu(). + +
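For example, the following deliberately broken sketch (do_something() is a placeholder) violates this rule; the synchronize_rcu() would be waiting on a grace period that cannot complete until the very reader containing it completes:

	void buggy_reader(void)
	{
		rcu_read_lock();
		do_something();
		synchronize_rcu();	/* BUG: waits for a grace period from within
					 * an RCU read-side critical section. */
		rcu_read_unlock();
	}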

+Although RCU's grace-period guarantee is useful in and of itself, with +quite a few use cases, +it would be good to be able to use RCU to coordinate read-side +access to linked data structures. +For this, the grace-period guarantee is not sufficient, as can +be seen in function add_gp_buggy() below. +We will look at the reader's code later, but in the meantime, just think of +the reader as locklessly picking up the gp pointer, +and, if the value loaded is non-NULL, locklessly accessing the +->a and ->b fields. + +

+
+ 1 bool add_gp_buggy(int a, int b)
+ 2 {
+ 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4   if (!p)
+ 5     return -ENOMEM;
+ 6   spin_lock(&gp_lock);
+ 7   if (rcu_access_pointer(gp)) {
+ 8     spin_unlock(&gp_lock);
+ 9     return false;
+10   }
+11   p->a = a;
+12   p->b = b;
+13   gp = p; /* ORDERING BUG */
+14   spin_unlock(&gp_lock);
+15   return true;
+16 }
+
+
+ +

+The problem is that both the compiler and weakly ordered CPUs are within +their rights to reorder this code as follows: + +

+
+ 1 bool add_gp_buggy_optimized(int a, int b)
+ 2 {
+ 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4   if (!p)
+ 5     return -ENOMEM;
+ 6   spin_lock(&gp_lock);
+ 7   if (rcu_access_pointer(gp)) {
+ 8     spin_unlock(&gp_lock);
+ 9     return false;
+10   }
+11   gp = p; /* ORDERING BUG */
+12   p->a = a;
+13   p->b = b;
+14   spin_unlock(&gp_lock);
+15   return true;
+16 }
+
+
+ +

+If an RCU reader fetches gp just after +add_gp_buggy_optimized executes line 11, +it will see garbage in the ->a and ->b +fields. +And this is but one of many ways in which compiler and hardware optimizations +could cause trouble. +Therefore, we clearly need some way to prevent the compiler and the CPU from +reordering in this manner, which brings us to the publish-subscribe +guarantee discussed in the next section. + +

Publish/Subscribe Guarantee

+ +

+RCU's publish-subscribe guarantee allows data to be inserted +into a linked data structure without disrupting RCU readers. +The updater uses rcu_assign_pointer() to insert the +new data, and readers use rcu_dereference() to +access data, whether new or old. +The following shows an example of insertion: + +

+
+ 1 bool add_gp(int a, int b)
+ 2 {
+ 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4   if (!p)
+ 5     return -ENOMEM;
+ 6   spin_lock(&gp_lock);
+ 7   if (rcu_access_pointer(gp)) {
+ 8     spin_unlock(&gp_lock);
+ 9     return false;
+10   }
+11   p->a = a;
+12   p->b = b;
+13   rcu_assign_pointer(gp, p);
+14   spin_unlock(&gp_lock);
+15   return true;
+16 }
+
+
+ +

+The rcu_assign_pointer() on line 13 is conceptually +equivalent to a simple assignment statement, but also guarantees +that its assignment will +happen after the two assignments in lines 11 and 12, +similar to the C11 memory_order_release store operation. +It also prevents any number of “interesting” compiler +optimizations, for example, the use of gp as a scratch +location immediately preceding the assignment. + + + + + + + + +
 
Quick Quiz:
+ But rcu_assign_pointer() does nothing to prevent the + two assignments to p->a and p->b + from being reordered. + Can't that also cause problems? +
Answer:
+ No, it cannot. + The readers cannot see either of these two fields until + the assignment to gp, by which time both fields are + fully initialized. + So reordering the assignments + to p->a and p->b cannot possibly + cause any problems. +
 
+ +
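Setting aside type checking and other details, rcu_assign_pointer() can be thought of as a release store. A rough conceptual rendering, not the kernel's actual definition, might look like this:

	/* Conceptual model only, not the real macro: */
	#define my_rcu_assign_pointer(p, v) smp_store_release(&(p), (v))

	/* Or, in C11 terms, roughly:
	 *	atomic_store_explicit(&p, v, memory_order_release);
	 */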

+It is tempting to assume that the reader need not do anything special +to control its accesses to the RCU-protected data, +as shown in do_something_gp_buggy() below: + +

+
+ 1 bool do_something_gp_buggy(void)
+ 2 {
+ 3   rcu_read_lock();
+ 4   p = gp;  /* OPTIMIZATIONS GALORE!!! */
+ 5   if (p) {
+ 6     do_something(p->a, p->b);
+ 7     rcu_read_unlock();
+ 8     return true;
+ 9   }
+10   rcu_read_unlock();
+11   return false;
+12 }
+
+
+ +

+However, this temptation must be resisted because there are a +surprisingly large number of ways that the compiler +(to say nothing of +DEC Alpha CPUs) +can trip this code up. +For but one example, if the compiler were short of registers, it +might choose to refetch from gp rather than keeping +a separate copy in p as follows: + +

+
+ 1 bool do_something_gp_buggy_optimized(void)
+ 2 {
+ 3   rcu_read_lock();
+ 4   if (gp) { /* OPTIMIZATIONS GALORE!!! */
+ 5     do_something(gp->a, gp->b);
+ 6     rcu_read_unlock();
+ 7     return true;
+ 8   }
+ 9   rcu_read_unlock();
+10   return false;
+11 }
+
+
+ +

+If this function ran concurrently with a series of updates that +replaced the current structure with a new one, +the fetches of gp->a +and gp->b might well come from two different structures, +which could cause serious confusion. +To prevent this (and much else besides), do_something_gp() uses +rcu_dereference() to fetch from gp: + +

+
+ 1 bool do_something_gp(void)
+ 2 {
+ 3   rcu_read_lock();
+ 4   p = rcu_dereference(gp);
+ 5   if (p) {
+ 6     do_something(p->a, p->b);
+ 7     rcu_read_unlock();
+ 8     return true;
+ 9   }
+10   rcu_read_unlock();
+11   return false;
+12 }
+
+
+ +

+The rcu_dereference() uses volatile casts and (for DEC Alpha) +memory barriers in the Linux kernel. +Should a +high-quality implementation of C11 memory_order_consume [PDF] +ever appear, then rcu_dereference() could be implemented +as a memory_order_consume load. +Regardless of the exact implementation, a pointer fetched by +rcu_dereference() may not be used outside of the +outermost RCU read-side critical section containing that +rcu_dereference(), unless protection of +the corresponding data element has been passed from RCU to some +other synchronization mechanism, most commonly locking or +reference counting. + +
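For example, a reader can hand protection off from RCU to a reference count before leaving the critical section. The sketch below assumes a hypothetical struct foo containing a refcount_t field named refcnt; neither appears in the examples above:

	struct foo *get_foo(void)
	{
		struct foo *p;

		rcu_read_lock();
		p = rcu_dereference(gp);
		if (p && !refcount_inc_not_zero(&p->refcnt))
			p = NULL;	/* Object already being freed: treat as absent. */
		rcu_read_unlock();
		return p;		/* Now protected by the reference, not by RCU. */
	}

The caller must of course drop the reference once it is done with the object.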

+In short, updaters use rcu_assign_pointer() and readers +use rcu_dereference(), and these two RCU API elements +work together to ensure that readers have a consistent view of +newly added data elements. + +

+Of course, it is also necessary to remove elements from RCU-protected +data structures, for example, using the following process: + +

    +
  1. Remove the data element from the enclosing structure. +
  2. Wait for all pre-existing RCU read-side critical sections + to complete (because only pre-existing readers can possibly have + a reference to the newly removed data element). +
  3. At this point, only the updater has a reference to the + newly removed data element, so it can safely reclaim + the data element, for example, by passing it to kfree(). +
+ +This process is implemented by remove_gp_synchronous(): + +
+
+ 1 bool remove_gp_synchronous(void)
+ 2 {
+ 3   struct foo *p;
+ 4
+ 5   spin_lock(&gp_lock);
+ 6   p = rcu_access_pointer(gp);
+ 7   if (!p) {
+ 8     spin_unlock(&gp_lock);
+ 9     return false;
+10   }
+11   rcu_assign_pointer(gp, NULL);
+12   spin_unlock(&gp_lock);
+13   synchronize_rcu();
+14   kfree(p);
+15   return true;
+16 }
+
+
+ +

+This function is straightforward, with line 13 waiting for a grace +period before line 14 frees the old data element. +This waiting ensures that readers will reach line 7 of +do_something_gp() before the data element referenced by +p is freed. +The rcu_access_pointer() on line 6 is similar to +rcu_dereference(), except that: + +

    +
  1. The value returned by rcu_access_pointer() + cannot be dereferenced. + If you want to access the value pointed to as well as + the pointer itself, use rcu_dereference() + instead of rcu_access_pointer(). +
  2. The call to rcu_access_pointer() need not be + protected. + In contrast, rcu_dereference() must either be + within an RCU read-side critical section or in a code + segment where the pointer cannot change, for example, in + code protected by the corresponding update-side lock. +
+ + + + + + + + +
 
Quick Quiz:
+ Without the rcu_dereference() or the + rcu_access_pointer(), what destructive optimizations + might the compiler make use of? +
Answer:
+ Let's start with what happens to do_something_gp() + if it fails to use rcu_dereference(). + It could reuse a value formerly fetched from this same pointer. + It could also fetch the pointer from gp in a byte-at-a-time + manner, resulting in load tearing, in turn resulting in a bytewise + mash-up of two distinct pointer values. + It might even use value-speculation optimizations, where it makes + a wrong guess, but by the time it gets around to checking the + value, an update has changed the pointer to match the wrong guess. + Too bad about any dereferences that returned pre-initialization garbage + in the meantime! + +

+ For remove_gp_synchronous(), as long as all modifications + to gp are carried out while holding gp_lock, + the above optimizations are harmless. + However, sparse will complain if you + define gp with __rcu and then + access it without using + either rcu_access_pointer() or rcu_dereference(). +

 
+ +

+In short, RCU's publish-subscribe guarantee is provided by the combination +of rcu_assign_pointer() and rcu_dereference(). +This guarantee allows data elements to be safely added to RCU-protected +linked data structures without disrupting RCU readers. +This guarantee can be used in combination with the grace-period +guarantee to also allow data elements to be removed from RCU-protected +linked data structures, again without disrupting RCU readers. + +

+This guarantee was only partially premeditated. +DYNIX/ptx used an explicit memory barrier for publication, but had nothing +resembling rcu_dereference() for subscription, nor did it +have anything resembling the smp_read_barrier_depends() +that was later subsumed into rcu_dereference() and later +still into READ_ONCE(). +The need for these operations made itself known quite suddenly at a +late-1990s meeting with the DEC Alpha architects, back in the days when +DEC was still a free-standing company. +It took the Alpha architects a good hour to convince me that any sort +of barrier would ever be needed, and it then took me a good two hours +to convince them that their documentation did not make this point clear. +More recent work with the C and C++ standards committees has provided +much education on tricks and traps from the compiler. +In short, compilers were much less tricky in the early 1990s, but in +2015, don't even think about omitting rcu_dereference()! + +

Memory-Barrier Guarantees

+ +

+The previous section's simple linked-data-structure scenario clearly +demonstrates the need for RCU's stringent memory-ordering guarantees on +systems with more than one CPU: + +

    +
  1. Each CPU that has an RCU read-side critical section that + begins before synchronize_rcu() starts is + guaranteed to execute a full memory barrier between the time + that the RCU read-side critical section ends and the time that + synchronize_rcu() returns. + Without this guarantee, a pre-existing RCU read-side critical section + might hold a reference to the newly removed struct foo + after the kfree() on line 14 of + remove_gp_synchronous(). +
  2. Each CPU that has an RCU read-side critical section that ends + after synchronize_rcu() returns is guaranteed + to execute a full memory barrier between the time that + synchronize_rcu() begins and the time that the RCU + read-side critical section begins. + Without this guarantee, a later RCU read-side critical section + running after the kfree() on line 14 of + remove_gp_synchronous() might + later run do_something_gp() and find the + newly deleted struct foo. +
  3. If the task invoking synchronize_rcu() remains + on a given CPU, then that CPU is guaranteed to execute a full + memory barrier sometime during the execution of + synchronize_rcu(). + This guarantee ensures that the kfree() on + line 14 of remove_gp_synchronous() really does + execute after the removal on line 11. +
  4. If the task invoking synchronize_rcu() migrates + among a group of CPUs during that invocation, then each of the + CPUs in that group is guaranteed to execute a full memory barrier + sometime during the execution of synchronize_rcu(). + This guarantee also ensures that the kfree() on + line 14 of remove_gp_synchronous() really does + execute after the removal on + line 11, but also in the case where the thread executing the + synchronize_rcu() migrates in the meantime. +
+ + + + + + + + +
 
Quick Quiz:
+ Given that multiple CPUs can start RCU read-side critical sections + at any time without any ordering whatsoever, how can RCU possibly + tell whether or not a given RCU read-side critical section starts + before a given instance of synchronize_rcu()? +
Answer:
+ If RCU cannot tell whether or not a given + RCU read-side critical section starts before a + given instance of synchronize_rcu(), + then it must assume that the RCU read-side critical section + started first. + In other words, a given instance of synchronize_rcu() + can avoid waiting on a given RCU read-side critical section only + if it can prove that synchronize_rcu() started first. + + +

+ A related question is “When rcu_read_lock() + doesn't generate any code, why does it matter how it relates + to a grace period?” + The answer is that it is not the relationship of + rcu_read_lock() itself that is important, but rather + the relationship of the code within the enclosed RCU read-side + critical section to the code preceding and following the + grace period. + If we take this viewpoint, then a given RCU read-side critical + section begins before a given grace period when some access + preceding the grace period observes the effect of some access + within the critical section, in which case none of the accesses + within the critical section may observe the effects of any + access following the grace period. + + +

+ As of late 2016, mathematical models of RCU take this + viewpoint, for example, see slides 62 and 63 + of the + 2016 LinuxCon EU + presentation. +

 
+ + + + + + + + +
 
Quick Quiz:
+ The first and second guarantees require unbelievably strict ordering! + Are all these memory barriers really required? +
Answer:
+ Yes, they really are required. + To see why the first guarantee is required, consider the following + sequence of events: + + +
    +
  1. + CPU 1: rcu_read_lock() + +
  2. + CPU 1: q = rcu_dereference(gp); + /* Very likely to return p. */ + +
  3. + CPU 0: list_del_rcu(p); + +
  4. + CPU 0: synchronize_rcu() starts. + +
  5. + CPU 1: do_something_with(q->a); + /* No smp_mb(), so might happen after kfree(). */ + +
  6. + CPU 1: rcu_read_unlock() + +
  7. + CPU 0: synchronize_rcu() returns. + +
  8. + CPU 0: kfree(p); + +
+ +

+ Therefore, there absolutely must be a full memory barrier between the + end of the RCU read-side critical section and the end of the + grace period. + + +

+ The sequence of events demonstrating the necessity of the second rule + is roughly similar: + + +

    +
  1. CPU 0: list_del_rcu(p); + +
  2. CPU 0: synchronize_rcu() starts. + +
  3. CPU 1: rcu_read_lock() + +
  4. CPU 1: q = rcu_dereference(gp); + /* Might return p if no memory barrier. */ + +
  5. CPU 0: synchronize_rcu() returns. + +
  6. CPU 0: kfree(p); + +
  7. + CPU 1: do_something_with(q->a); /* Boom!!! */ + +
  8. CPU 1: rcu_read_unlock() + +
+ +

+ And similarly, without a memory barrier between the beginning of the + grace period and the beginning of the RCU read-side critical section, + CPU 1 might end up accessing the freelist. + + +

+ The “as if” rule of course applies, so that any + implementation that acts as if the appropriate memory barriers + were in place is a correct implementation. + That said, it is much easier to fool yourself into believing + that you have adhered to the as-if rule than it is to actually + adhere to it! +

 
+ + + + + + + + +
 
Quick Quiz:
+ You claim that rcu_read_lock() and rcu_read_unlock() + generate absolutely no code in some kernel builds. + This means that the compiler might arbitrarily rearrange consecutive + RCU read-side critical sections. + Given such rearrangement, if a given RCU read-side critical section + is done, how can you be sure that all prior RCU read-side critical + sections are done? + Won't the compiler rearrangements make that impossible to determine? +
Answer:
+ In cases where rcu_read_lock() and rcu_read_unlock() + generate absolutely no code, RCU infers quiescent states only at + special locations, for example, within the scheduler. + Because calls to schedule() had better prevent calling-code + accesses to shared variables from being rearranged across the call to + schedule(), if RCU detects the end of a given RCU read-side + critical section, it will necessarily detect the end of all prior + RCU read-side critical sections, no matter how aggressively the + compiler scrambles the code. + + +

+ Again, this all assumes that the compiler cannot scramble code across + calls to the scheduler, out of interrupt handlers, into the idle loop, + into user-mode code, and so on. + But if your kernel build allows that sort of scrambling, you have broken + far more than just RCU! +

 
+ +

+Note that these memory-barrier requirements do not replace the fundamental +RCU requirement that a grace period wait for all pre-existing readers. +On the contrary, the memory barriers called out in this section must operate in +such a way as to enforce this fundamental requirement. +Of course, different implementations enforce this requirement in different +ways, but enforce it they must. + +

RCU Primitives Guaranteed to Execute Unconditionally

+ +

+The common-case RCU primitives are unconditional. +They are invoked, they do their job, and they return, with no possibility +of error, and no need to retry. +This is a key RCU design philosophy. + +

+However, this philosophy is pragmatic rather than pigheaded. +If someone comes up with a good justification for a particular conditional +RCU primitive, it might well be implemented and added. +After all, this guarantee was reverse-engineered, not premeditated. +The unconditional nature of the RCU primitives was initially an +accident of implementation, and later experience with synchronization +primitives with conditional primitives caused me to elevate this +accident to a guarantee. +Therefore, the justification for adding a conditional primitive to +RCU would need to be based on detailed and compelling use cases. + +

Guaranteed Read-to-Write Upgrade

+ +

+As far as RCU is concerned, it is always possible to carry out an +update within an RCU read-side critical section. +For example, that RCU read-side critical section might search for +a given data element, and then might acquire the update-side +spinlock in order to update that element, all while remaining +in that RCU read-side critical section. +Of course, it is necessary to exit the RCU read-side critical section +before invoking synchronize_rcu(), however, this +inconvenience can be avoided through use of the +call_rcu() and kfree_rcu() API members +described later in this document. + + + + + + + + +
 
Quick Quiz:
+ But how does the upgrade-to-write operation exclude other readers? +
Answer:
+ It doesn't, just like normal RCU updates, which also do not exclude + RCU readers. +
 
+ +

+This guarantee allows lookup code to be shared between read-side +and update-side code, and was premeditated, appearing in the earliest +DYNIX/ptx RCU documentation. + +
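A sketch of such shared lookup code follows; struct foo, foo_list, and the per-element lock are hypothetical and do not appear elsewhere in this document:

	struct foo {
		struct list_head list;
		spinlock_t lock;
		int key;
		int val;
	};

	static LIST_HEAD(foo_list);

	/* Caller must be within an RCU read-side critical section. */
	static struct foo *foo_lookup(int key)
	{
		struct foo *p;

		list_for_each_entry_rcu(p, &foo_list, list)
			if (p->key == key)
				return p;
		return NULL;
	}

	static void foo_update(int key, int new_val)
	{
		struct foo *p;

		rcu_read_lock();
		p = foo_lookup(key);
		if (p) {
			spin_lock(&p->lock);	/* Read-to-write upgrade, still  */
			p->val = new_val;	/* within the RCU read-side      */
			spin_unlock(&p->lock);	/* critical section.             */
		}
		rcu_read_unlock();
	}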

Fundamental Non-Requirements

+ +

+RCU provides extremely lightweight readers, and its read-side guarantees, +though quite useful, are correspondingly lightweight. +It is therefore all too easy to assume that RCU is guaranteeing more +than it really is. +Of course, the list of things that RCU does not guarantee is infinitely +long, however, the following sections list a few non-guarantees that +have caused confusion. +Except where otherwise noted, these non-guarantees were premeditated. + +

    +
  1. + Readers Impose Minimal Ordering +
  2. + Readers Do Not Exclude Updaters +
  3. + Updaters Only Wait For Old Readers +
  4. + Grace Periods Don't Partition Read-Side Critical Sections +
  5. + Read-Side Critical Sections Don't Partition Grace Periods +
  6. + Disabling Preemption Does Not Block Grace Periods +
+ +

Readers Impose Minimal Ordering

+ +

+Reader-side markers such as rcu_read_lock() and +rcu_read_unlock() provide absolutely no ordering guarantees +except through their interaction with the grace-period APIs such as +synchronize_rcu(). +To see this, consider the following pair of threads: + +

+
+ 1 void thread0(void)
+ 2 {
+ 3   rcu_read_lock();
+ 4   WRITE_ONCE(x, 1);
+ 5   rcu_read_unlock();
+ 6   rcu_read_lock();
+ 7   WRITE_ONCE(y, 1);
+ 8   rcu_read_unlock();
+ 9 }
+10
+11 void thread1(void)
+12 {
+13   rcu_read_lock();
+14   r1 = READ_ONCE(y);
+15   rcu_read_unlock();
+16   rcu_read_lock();
+17   r2 = READ_ONCE(x);
+18   rcu_read_unlock();
+19 }
+
+
+ +

+After thread0() and thread1() execute +concurrently, it is quite possible to have + +

+
+(r1 == 1 && r2 == 0)
+
+
+ +(that is, y appears to have been assigned before x), +which would not be possible if rcu_read_lock() and +rcu_read_unlock() had much in the way of ordering +properties. +But they do not, so the CPU is within its rights +to do significant reordering. +This is by design: Any significant ordering constraints would slow down +these fast-path APIs. + + + + + + + + +
 
Quick Quiz:
+ Can't the compiler also reorder this code? +
Answer:
+ No, the volatile casts in READ_ONCE() and + WRITE_ONCE() prevent the compiler from reordering in + this particular case. +
 
+ +

Readers Do Not Exclude Updaters

+ +

+Neither rcu_read_lock() nor rcu_read_unlock() +exclude updates. +All they do is to prevent grace periods from ending. +The following example illustrates this: + +

+
+ 1 void thread0(void)
+ 2 {
+ 3   rcu_read_lock();
+ 4   r1 = READ_ONCE(y);
+ 5   if (r1) {
+ 6     do_something_with_nonzero_x();
+ 7     r2 = READ_ONCE(x);
+ 8     WARN_ON(!r2); /* BUG!!! */
+ 9   }
+10   rcu_read_unlock();
+11 }
+12
+13 void thread1(void)
+14 {
+15   spin_lock(&my_lock);
+16   WRITE_ONCE(x, 1);
+17   WRITE_ONCE(y, 1);
+18   spin_unlock(&my_lock);
+19 }
+
+
+ +

+If the thread0() function's rcu_read_lock() +excluded the thread1() function's update, +the WARN_ON() could never fire. +But the fact is that rcu_read_lock() does not exclude +much of anything aside from subsequent grace periods, of which +thread1() has none, so the +WARN_ON() can and does fire. + +

Updaters Only Wait For Old Readers

+ +

+It might be tempting to assume that after synchronize_rcu() +completes, there are no readers executing. +This temptation must be avoided because +new readers can start immediately after synchronize_rcu() +starts, and synchronize_rcu() is under no +obligation to wait for these new readers. + + + + + + + + +
 
Quick Quiz:
+ Suppose that synchronize_rcu() did wait until all + readers had completed instead of waiting only on + pre-existing readers. + For how long would the updater be able to rely on there + being no readers? +
Answer:
+ For no time at all. + Even if synchronize_rcu() were to wait until + all readers had completed, a new reader might start immediately after + synchronize_rcu() completed. + Therefore, the code following + synchronize_rcu() can never rely on there being + no readers. +
 
+ +

+Grace Periods Don't Partition Read-Side Critical Sections

+ +

+It is tempting to assume that if any part of one RCU read-side critical +section precedes a given grace period, and if any part of another RCU +read-side critical section follows that same grace period, then all of +the first RCU read-side critical section must precede all of the second. +However, this just isn't the case: A single grace period does not +partition the set of RCU read-side critical sections. +An example of this situation can be illustrated as follows, where +a, b, and c are initially all zero: + +

+
+ 1 void thread0(void)
+ 2 {
+ 3   rcu_read_lock();
+ 4   WRITE_ONCE(a, 1);
+ 5   WRITE_ONCE(b, 1);
+ 6   rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+10 {
+11   r1 = READ_ONCE(a);
+12   synchronize_rcu();
+13   WRITE_ONCE(c, 1);
+14 }
+15
+16 void thread2(void)
+17 {
+18   rcu_read_lock();
+19   r2 = READ_ONCE(b);
+20   r3 = READ_ONCE(c);
+21   rcu_read_unlock();
+22 }
+
+
+ +

+It turns out that the outcome: + +

+
+(r1 == 1 && r2 == 0 && r3 == 1)
+
+
+ +is entirely possible. +The following figure show how this can happen, with each circled +QS indicating the point at which RCU recorded a +quiescent state for each thread, that is, a state in which +RCU knows that the thread cannot be in the midst of an RCU read-side +critical section that started before the current grace period: + +

GPpartitionReaders1.svg

+ +

+If it is necessary to partition RCU read-side critical sections in this +manner, it is necessary to use two grace periods, where the first +grace period is known to end before the second grace period starts: + +

+
+ 1 void thread0(void)
+ 2 {
+ 3   rcu_read_lock();
+ 4   WRITE_ONCE(a, 1);
+ 5   WRITE_ONCE(b, 1);
+ 6   rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+10 {
+11   r1 = READ_ONCE(a);
+12   synchronize_rcu();
+13   WRITE_ONCE(c, 1);
+14 }
+15
+16 void thread2(void)
+17 {
+18   r2 = READ_ONCE(c);
+19   synchronize_rcu();
+20   WRITE_ONCE(d, 1);
+21 }
+22
+23 void thread3(void)
+24 {
+25   rcu_read_lock();
+26   r3 = READ_ONCE(b);
+27   r4 = READ_ONCE(d);
+28   rcu_read_unlock();
+29 }
+
+
+ +

+Here, if (r1 == 1), then
+thread0()'s write to b must happen
+before the end of thread1()'s grace period.
+If in addition (r4 == 1), then
+thread3()'s read from b must happen
+after the beginning of thread2()'s grace period.
+If it is also the case that (r2 == 1), then the
+end of thread1()'s grace period must precede the
+beginning of thread2()'s grace period.
+This means that the two RCU read-side critical sections cannot overlap,
+guaranteeing that (r3 == 1).
+As a result, the outcome:
+

+
+(r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)
+
+
+ +cannot happen. + +

+This non-requirement was also non-premeditated, but became apparent +when studying RCU's interaction with memory ordering. + +

+Read-Side Critical Sections Don't Partition Grace Periods

+ +

+It is also tempting to assume that if an RCU read-side critical section +happens between a pair of grace periods, then those grace periods cannot +overlap. +However, this temptation leads nowhere good, as can be illustrated by +the following, with all variables initially zero: + +

+
+ 1 void thread0(void)
+ 2 {
+ 3   rcu_read_lock();
+ 4   WRITE_ONCE(a, 1);
+ 5   WRITE_ONCE(b, 1);
+ 6   rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+10 {
+11   r1 = READ_ONCE(a);
+12   synchronize_rcu();
+13   WRITE_ONCE(c, 1);
+14 }
+15
+16 void thread2(void)
+17 {
+18   rcu_read_lock();
+19   WRITE_ONCE(d, 1);
+20   r2 = READ_ONCE(c);
+21   rcu_read_unlock();
+22 }
+23
+24 void thread3(void)
+25 {
+26   r3 = READ_ONCE(d);
+27   synchronize_rcu();
+28   WRITE_ONCE(e, 1);
+29 }
+30
+31 void thread4(void)
+32 {
+33   rcu_read_lock();
+34   r4 = READ_ONCE(b);
+35   r5 = READ_ONCE(e);
+36   rcu_read_unlock();
+37 }
+
+
+ +

+In this case, the outcome: + +

+
+(r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)
+
+
+ +is entirely possible, as illustrated below: + +

ReadersPartitionGP1.svg

+ +

+Again, an RCU read-side critical section can overlap almost all of a +given grace period, just so long as it does not overlap the entire +grace period. +As a result, an RCU read-side critical section cannot partition a pair +of RCU grace periods. + + + + + + + + +
 
Quick Quiz:
+ How long a sequence of grace periods, each separated by an RCU + read-side critical section, would be required to partition the RCU + read-side critical sections at the beginning and end of the chain? +
Answer:
+ In theory, an infinite number. + In practice, an unknown number that is sensitive to both implementation + details and timing considerations. + Therefore, even in practice, RCU users must abide by the + theoretical rather than the practical answer. +
 
+ +

+Disabling Preemption Does Not Block Grace Periods

+ +

+There was a time when disabling preemption on any given CPU would block +subsequent grace periods. +However, this was an accident of implementation and is not a requirement. +And in the current Linux-kernel implementation, disabling preemption +on a given CPU in fact does not block grace periods, as Oleg Nesterov +demonstrated. + +

+If you need a preempt-disable region to block grace periods, you need to add +rcu_read_lock() and rcu_read_unlock(), for example +as follows: + +

+
+ 1 preempt_disable();
+ 2 rcu_read_lock();
+ 3 do_something();
+ 4 rcu_read_unlock();
+ 5 preempt_enable();
+ 6
+ 7 /* Spinlocks implicitly disable preemption. */
+ 8 spin_lock(&mylock);
+ 9 rcu_read_lock();
+10 do_something();
+11 rcu_read_unlock();
+12 spin_unlock(&mylock);
+
+
+ +

+In theory, you could enter the RCU read-side critical section first, +but it is more efficient to keep the entire RCU read-side critical +section contained in the preempt-disable region as shown above. +Of course, RCU read-side critical sections that extend outside of +preempt-disable regions will work correctly, but such critical sections +can be preempted, which forces rcu_read_unlock() to do +more work. +And no, this is not an invitation to enclose all of your RCU +read-side critical sections within preempt-disable regions, because +doing so would degrade real-time response. + +

+This non-requirement appeared with preemptible RCU. +If you need a grace period that waits on non-preemptible code regions, use +RCU-sched. + +

Parallelism Facts of Life

+ +

+These parallelism facts of life are by no means specific to RCU, but +the RCU implementation must abide by them. +They therefore bear repeating: + +

    +
  1. Any CPU or task may be delayed at any time, + and any attempts to avoid these delays by disabling + preemption, interrupts, or whatever are completely futile. + This is most obvious in preemptible user-level + environments and in virtualized environments (where + a given guest OS's VCPUs can be preempted at any time by + the underlying hypervisor), but can also happen in bare-metal + environments due to ECC errors, NMIs, and other hardware + events. + Although a delay of more than about 20 seconds can result + in splats, the RCU implementation is obligated to use + algorithms that can tolerate extremely long delays, but where + “extremely long” is not long enough to allow + wrap-around when incrementing a 64-bit counter. +
  2. Both the compiler and the CPU can reorder memory accesses. + Where it matters, RCU must use compiler directives and + memory-barrier instructions to preserve ordering. +
  3. Conflicting writes to memory locations in any given cache line + will result in expensive cache misses. + Greater numbers of concurrent writes and more-frequent + concurrent writes will result in more dramatic slowdowns. + RCU is therefore obligated to use algorithms that have + sufficient locality to avoid significant performance and + scalability problems. +
  4. As a rough rule of thumb, only one CPU's worth of processing + may be carried out under the protection of any given exclusive + lock. + RCU must therefore use scalable locking designs. +
  5. Counters are finite, especially on 32-bit systems.
+ RCU's use of counters must therefore tolerate counter wrap,
+ or be designed such that counter wrap would take way more
+ time than a single system is likely to run.
+ An uptime of ten years is quite possible, a runtime
+ of a century much less so.
+ As an example of the latter, RCU's dyntick-idle nesting counter
+ allows 54 bits for interrupt nesting level (this counter
+ is 64 bits even on a 32-bit system).
+ Overflowing this counter requires 2^54
+ half-interrupts on a given CPU without that CPU ever going idle.
+ If a half-interrupt happened every microsecond, it would take
+ 570 years of runtime to overflow this counter, which is currently
+ believed to be an acceptably long time.
+ (A quick back-of-the-envelope check of this figure appears after
+ this list.)
  6. Linux systems can have thousands of CPUs running a single + Linux kernel in a single shared-memory environment. + RCU must therefore pay close attention to high-end scalability. +
+ +
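+
+As promised in item 5 above, here is a back-of-the-envelope check of the overflow time; this is a standalone userspace sketch, not kernel code:
+
+#include <stdio.h>
+
+int main(void)
+{
+  /* 2^54 half-interrupts at a rate of one per microsecond. */
+  double microseconds = (double)(1ULL << 54);
+  double seconds = microseconds / 1e6;
+  double years = seconds / (3600.0 * 24.0 * 365.0);
+
+  printf("%.0f years\n", years);  /* Prints roughly 571 years. */
+  return 0;
+}
+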

+This last parallelism fact of life means that RCU must pay special
+attention to the preceding facts of life.
+The idea that Linux might scale to systems with thousands of CPUs would
+have been met with some skepticism in the 1990s, but these requirements
+would otherwise have been unsurprising, even in the early 1990s.
+

Quality-of-Implementation Requirements

+ +

+These sections list quality-of-implementation requirements. +Although an RCU implementation that ignores these requirements could +still be used, it would likely be subject to limitations that would +make it inappropriate for industrial-strength production use. +Classes of quality-of-implementation requirements are as follows: + +

    +
  1. Specialization +
  2. Performance and Scalability +
  3. Composability +
  4. Corner Cases +
+ +

+These classes are covered in the following sections.
+

Specialization

+ +

+RCU is and always has been intended primarily for read-mostly situations, +which means that RCU's read-side primitives are optimized, often at the +expense of its update-side primitives. +Experience thus far is captured by the following list of situations: + +

    +
  1. Read-mostly data, where stale and inconsistent data is not + a problem: RCU works great! +
  2. Read-mostly data, where data must be consistent: + RCU works well. +
  3. Read-write data, where data must be consistent: + RCU might work OK. + Or not. +
  4. Write-mostly data, where data must be consistent: + RCU is very unlikely to be the right tool for the job, + with the following exceptions, where RCU can provide: +
      +
    1. Existence guarantees for update-friendly mechanisms. +
    2. Wait-free read-side primitives for real-time use. +
    +
+ +

+This focus on read-mostly situations means that RCU must interoperate +with other synchronization primitives. +For example, the add_gp() and remove_gp_synchronous() +examples discussed earlier use RCU to protect readers and locking to +coordinate updaters. +However, the need extends much farther, requiring that a variety of +synchronization primitives be legal within RCU read-side critical sections, +including spinlocks, sequence locks, atomic operations, reference +counters, and memory barriers. + + + + + + + + +
 
Quick Quiz:
+ What about sleeping locks? +
Answer:
+ These are forbidden within Linux-kernel RCU read-side critical
+ sections because it is not legal to place a quiescent state
+ (in this case, voluntary context switch) within an RCU read-side
+ critical section.
+ However, sleeping locks may be used within userspace RCU read-side
+ critical sections, and also within Linux-kernel sleepable RCU
+ (SRCU)
+ read-side critical sections.
+ In addition, the -rt patchset turns spinlocks into
+ sleeping locks so that the corresponding critical sections
+ can be preempted, which also means that these sleeplockified
+ spinlocks (but not other sleeping locks!) may be acquired within
+ -rt-Linux-kernel RCU read-side critical sections.
+

+ Note that it is legal for a normal RCU read-side
+ critical section to conditionally acquire a sleeping lock
+ (as in mutex_trylock()), but only as long as it does
+ not loop indefinitely attempting to conditionally acquire that
+ sleeping lock.
+ The key point is that things like mutex_trylock()
+ either return with the mutex held, or return an error indication if
+ the mutex was not immediately available.
+ Either way, mutex_trylock() returns immediately without
+ sleeping.

 
+ +

+It often comes as a surprise that many algorithms do not require a +consistent view of data, but many can function in that mode, +with network routing being the poster child. +Internet routing algorithms take significant time to propagate +updates, so that by the time an update arrives at a given system, +that system has been sending network traffic the wrong way for +a considerable length of time. +Having a few threads continue to send traffic the wrong way for a +few more milliseconds is clearly not a problem: In the worst case, +TCP retransmissions will eventually get the data where it needs to go. +In general, when tracking the state of the universe outside of the +computer, some level of inconsistency must be tolerated due to +speed-of-light delays if nothing else. + +

+Furthermore, uncertainty about external state is inherent in many cases. +For example, a pair of veterinarians might use heartbeat to determine +whether or not a given cat was alive. +But how long should they wait after the last heartbeat to decide that +the cat is in fact dead? +Waiting less than 400 milliseconds makes no sense because this would +mean that a relaxed cat would be considered to cycle between death +and life more than 100 times per minute. +Moreover, just as with human beings, a cat's heart might stop for +some period of time, so the exact wait period is a judgment call. +One of our pair of veterinarians might wait 30 seconds before pronouncing +the cat dead, while the other might insist on waiting a full minute. +The two veterinarians would then disagree on the state of the cat during +the final 30 seconds of the minute following the last heartbeat. + +

+Interestingly enough, this same situation applies to hardware. +When push comes to shove, how do we tell whether or not some +external server has failed? +We send messages to it periodically, and declare it failed if we +don't receive a response within a given period of time. +Policy decisions can usually tolerate short +periods of inconsistency. +The policy was decided some time ago, and is only now being put into +effect, so a few milliseconds of delay is normally inconsequential. + +

+However, there are algorithms that absolutely must see consistent data. +For example, the translation between a user-level SystemV semaphore +ID to the corresponding in-kernel data structure is protected by RCU, +but it is absolutely forbidden to update a semaphore that has just been +removed. +In the Linux kernel, this need for consistency is accommodated by acquiring +spinlocks located in the in-kernel data structure from within +the RCU read-side critical section, and this is indicated by the +green box in the figure above. +Many other techniques may be used, and are in fact used within the +Linux kernel. + +
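+
+As a sketch of this pattern (the names are purely illustrative and are not the actual SystemV semaphore code), the reader locates the object under RCU protection, then acquires a per-object spinlock and rechecks a deletion flag before updating, which provides the required consistency:
+
+struct mysem {
+  spinlock_t lock;   /* Protects ->deleted and ->value. */
+  bool deleted;      /* Set by the updater before removal and freeing. */
+  int value;
+};
+
+struct mysem *mysem_lookup(int id);  /* Hypothetical RCU-protected lookup. */
+
+bool mysem_adjust(int id, int delta)
+{
+  struct mysem *sem;
+  bool ok = false;
+
+  rcu_read_lock();
+  sem = mysem_lookup(id);
+  if (sem) {
+    spin_lock(&sem->lock);    /* Consistency point inside the reader. */
+    if (!sem->deleted) {      /* Recheck now that the lock is held. */
+      sem->value += delta;
+      ok = true;
+    }
+    spin_unlock(&sem->lock);
+  }
+  rcu_read_unlock();
+  return ok;
+}
+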

+In short, RCU is not required to maintain consistency, and other +mechanisms may be used in concert with RCU when consistency is required. +RCU's specialization allows it to do its job extremely well, and its +ability to interoperate with other synchronization mechanisms allows +the right mix of synchronization tools to be used for a given job. + +

Performance and Scalability

+ +

+Energy efficiency is a critical component of performance today, +and Linux-kernel RCU implementations must therefore avoid unnecessarily +awakening idle CPUs. +I cannot claim that this requirement was premeditated. +In fact, I learned of it during a telephone conversation in which I +was given “frank and open” feedback on the importance +of energy efficiency in battery-powered systems and on specific +energy-efficiency shortcomings of the Linux-kernel RCU implementation. +In my experience, the battery-powered embedded community will consider +any unnecessary wakeups to be extremely unfriendly acts. +So much so that mere Linux-kernel-mailing-list posts are +insufficient to vent their ire. + +

+Memory consumption is not particularly important in most
+situations, and has become decreasingly
+so as memory sizes have expanded and memory
+costs have plummeted.
+However, as I learned from Matt Mackall's
+bloatwatch
+efforts, memory footprint is critically important on single-CPU systems with
+non-preemptible (CONFIG_PREEMPT=n) kernels, and thus
+tiny RCU
+was born.
+Josh Triplett has since taken over the small-memory banner with his
+Linux kernel tinification
+project, which resulted in
+SRCU
+becoming optional for those kernels not needing it.
+

+The remaining performance requirements are, for the most part, +unsurprising. +For example, in keeping with RCU's read-side specialization, +rcu_dereference() should have negligible overhead (for +example, suppression of a few minor compiler optimizations). +Similarly, in non-preemptible environments, rcu_read_lock() and +rcu_read_unlock() should have exactly zero overhead. + +

+In preemptible environments, in the case where the RCU read-side +critical section was not preempted (as will be the case for the +highest-priority real-time process), rcu_read_lock() and +rcu_read_unlock() should have minimal overhead. +In particular, they should not contain atomic read-modify-write +operations, memory-barrier instructions, preemption disabling, +interrupt disabling, or backwards branches. +However, in the case where the RCU read-side critical section was preempted, +rcu_read_unlock() may acquire spinlocks and disable interrupts. +This is why it is better to nest an RCU read-side critical section +within a preempt-disable region than vice versa, at least in cases +where that critical section is short enough to avoid unduly degrading +real-time latencies. + +

+The synchronize_rcu() grace-period-wait primitive is +optimized for throughput. +It may therefore incur several milliseconds of latency in addition to +the duration of the longest RCU read-side critical section. +On the other hand, multiple concurrent invocations of +synchronize_rcu() are required to use batching optimizations +so that they can be satisfied by a single underlying grace-period-wait +operation. +For example, in the Linux kernel, it is not unusual for a single +grace-period-wait operation to serve more than +1,000 separate invocations +of synchronize_rcu(), thus amortizing the per-invocation +overhead down to nearly zero. +However, the grace-period optimization is also required to avoid +measurable degradation of real-time scheduling and interrupt latencies. + +

+In some cases, the multi-millisecond synchronize_rcu() +latencies are unacceptable. +In these cases, synchronize_rcu_expedited() may be used +instead, reducing the grace-period latency down to a few tens of +microseconds on small systems, at least in cases where the RCU read-side +critical sections are short. +There are currently no special latency requirements for +synchronize_rcu_expedited() on large systems, but, +consistent with the empirical nature of the RCU specification, +that is subject to change. +However, there most definitely are scalability requirements: +A storm of synchronize_rcu_expedited() invocations on 4096 +CPUs should at least make reasonable forward progress. +In return for its shorter latencies, synchronize_rcu_expedited() +is permitted to impose modest degradation of real-time latency +on non-idle online CPUs. +Here, “modest” means roughly the same latency +degradation as a scheduling-clock interrupt. + +

+There are a number of situations where even +synchronize_rcu_expedited()'s reduced grace-period +latency is unacceptable. +In these situations, the asynchronous call_rcu() can be +used in place of synchronize_rcu() as follows: + +

+
+ 1 struct foo {
+ 2   int a;
+ 3   int b;
+ 4   struct rcu_head rh;
+ 5 };
+ 6
+ 7 static void remove_gp_cb(struct rcu_head *rhp)
+ 8 {
+ 9   struct foo *p = container_of(rhp, struct foo, rh);
+10
+11   kfree(p);
+12 }
+13
+14 bool remove_gp_asynchronous(void)
+15 {
+16   struct foo *p;
+17
+18   spin_lock(&gp_lock);
+19   p = rcu_access_pointer(gp);
+20   if (!p) {
+21     spin_unlock(&gp_lock);
+22     return false;
+23   }
+24   rcu_assign_pointer(gp, NULL);
+25   call_rcu(&p->rh, remove_gp_cb);
+26   spin_unlock(&gp_lock);
+27   return true;
+28 }
+
+
+ +

+A definition of struct foo is finally needed, and appears
+on lines 1-5.
+The function remove_gp_cb() is passed to call_rcu()
+on line 25, and will be invoked after the end of a subsequent
+grace period.
+This gets the same effect as remove_gp_synchronous(),
+but without forcing the updater to wait for a grace period to elapse.
+The call_rcu() function may be used in a number of
+situations where neither synchronize_rcu() nor
+synchronize_rcu_expedited() would be legal,
+including within preempt-disable code, local_bh_disable() code,
+interrupt-disable code, and interrupt handlers.
+However, even call_rcu() is illegal within NMI handlers
+and from idle and offline CPUs.
+The callback function (remove_gp_cb() in this case) will be
+executed within a softirq (software interrupt) environment within the
+Linux kernel,
+either within a real softirq handler or under the protection
+of local_bh_disable().
+In both the Linux kernel and userspace, it is bad practice to
+write an RCU callback function that takes too long.
+Long-running operations should be relegated to separate threads or
+(in the Linux kernel) workqueues.
+
 
Quick Quiz:
+ Why does line 19 use rcu_access_pointer()? + After all, call_rcu() on line 25 stores into the + structure, which would interact badly with concurrent insertions. + Doesn't this mean that rcu_dereference() is required? +
Answer:
+ Presumably the ->gp_lock acquired on line 18 excludes
+ any changes, including any insertions that rcu_dereference()
+ would protect against.
+ Therefore, any insertions will be delayed until after
+ ->gp_lock
+ is released on line 26, which in turn means that
+ rcu_access_pointer() suffices.
 
+ +

+However, all that remove_gp_cb() is doing is +invoking kfree() on the data element. +This is a common idiom, and is supported by kfree_rcu(), +which allows “fire and forget” operation as shown below: + +

+
+ 1 struct foo {
+ 2   int a;
+ 3   int b;
+ 4   struct rcu_head rh;
+ 5 };
+ 6
+ 7 bool remove_gp_faf(void)
+ 8 {
+ 9   struct foo *p;
+10
+11   spin_lock(&gp_lock);
+12   p = rcu_dereference(gp);
+13   if (!p) {
+14     spin_unlock(&gp_lock);
+15     return false;
+16   }
+17   rcu_assign_pointer(gp, NULL);
+18   kfree_rcu(p, rh);
+19   spin_unlock(&gp_lock);
+20   return true;
+21 }
+
+
+ +

+Note that remove_gp_faf() simply invokes +kfree_rcu() and proceeds, without any need to pay any +further attention to the subsequent grace period and kfree(). +It is permissible to invoke kfree_rcu() from the same +environments as for call_rcu(). +Interestingly enough, DYNIX/ptx had the equivalents of +call_rcu() and kfree_rcu(), but not +synchronize_rcu(). +This was due to the fact that RCU was not heavily used within DYNIX/ptx, +so the very few places that needed something like +synchronize_rcu() simply open-coded it. + + + + + + + + +
 
Quick Quiz:
+ Earlier it was claimed that call_rcu() and + kfree_rcu() allowed updaters to avoid being blocked + by readers. + But how can that be correct, given that the invocation of the callback + and the freeing of the memory (respectively) must still wait for + a grace period to elapse? +
Answer:
+ We could define things this way, but keep in mind that this sort of + definition would say that updates in garbage-collected languages + cannot complete until the next time the garbage collector runs, + which does not seem at all reasonable. + The key point is that in most cases, an updater using either + call_rcu() or kfree_rcu() can proceed to the + next update as soon as it has invoked call_rcu() or + kfree_rcu(), without having to wait for a subsequent + grace period. +
 
+ +

+But what if the updater must wait for the completion of code to be +executed after the end of the grace period, but has other tasks +that can be carried out in the meantime? +The polling-style get_state_synchronize_rcu() and +cond_synchronize_rcu() functions may be used for this +purpose, as shown below: + +

+
+ 1 bool remove_gp_poll(void)
+ 2 {
+ 3   struct foo *p;
+ 4   unsigned long s;
+ 5
+ 6   spin_lock(&gp_lock);
+ 7   p = rcu_access_pointer(gp);
+ 8   if (!p) {
+ 9     spin_unlock(&gp_lock);
+10     return false;
+11   }
+12   rcu_assign_pointer(gp, NULL);
+13   spin_unlock(&gp_lock);
+14   s = get_state_synchronize_rcu();
+15   do_something_while_waiting();
+16   cond_synchronize_rcu(s);
+17   kfree(p);
+18   return true;
+19 }
+
+
+ +

+On line 14, get_state_synchronize_rcu() obtains a
+“cookie” from RCU,
+then line 15 carries out other tasks,
+and finally, line 16 returns immediately if a grace period has
+elapsed in the meantime, but otherwise waits as required.
+The need for get_state_synchronize_rcu() and
+cond_synchronize_rcu() has appeared quite recently,
+so it is too early to tell whether they will stand the test of time.
+

+RCU thus provides a range of tools to allow updaters to strike the +required tradeoff between latency, flexibility and CPU overhead. + +

Composability

+ +

+Composability has received much attention in recent years, perhaps in part +due to the collision of multicore hardware with object-oriented techniques +designed in single-threaded environments for single-threaded use. +And in theory, RCU read-side critical sections may be composed, and in +fact may be nested arbitrarily deeply. +In practice, as with all real-world implementations of composable +constructs, there are limitations. + +

+Implementations of RCU for which rcu_read_lock() +and rcu_read_unlock() generate no code, such as +Linux-kernel RCU when CONFIG_PREEMPT=n, can be +nested arbitrarily deeply. +After all, there is no overhead. +Except that if all these instances of rcu_read_lock() +and rcu_read_unlock() are visible to the compiler, +compilation will eventually fail due to exhausting memory, +mass storage, or user patience, whichever comes first. +If the nesting is not visible to the compiler, as is the case with +mutually recursive functions each in its own translation unit, +stack overflow will result. +If the nesting takes the form of loops, perhaps in the guise of tail +recursion, either the control variable +will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. +Nevertheless, this class of RCU implementations is one +of the most composable constructs in existence. + +
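+
+As a trivial sketch of this composability (the list and helper names are illustrative), a helper that takes its own RCU read-side critical section may be invoked from within another one:
+
+struct item {
+  struct list_head link;
+  int value;
+};
+static LIST_HEAD(item_list);     /* Updated elsewhere, read under RCU. */
+
+static int count_items(void)     /* Has its own read-side critical section. */
+{
+  struct item *p;
+  int n = 0;
+
+  rcu_read_lock();
+  list_for_each_entry_rcu(p, &item_list, link)
+    n++;
+  rcu_read_unlock();
+  return n;
+}
+
+static int caller(void)
+{
+  int n;
+
+  rcu_read_lock();               /* Outer critical section... */
+  n = count_items();             /* ...composes with the nested one. */
+  rcu_read_unlock();
+  return n;
+}
+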

+RCU implementations that explicitly track nesting depth +are limited by the nesting-depth counter. +For example, the Linux kernel's preemptible RCU limits nesting to +INT_MAX. +This should suffice for almost all practical purposes. +That said, a consecutive pair of RCU read-side critical sections +between which there is an operation that waits for a grace period +cannot be enclosed in another RCU read-side critical section. +This is because it is not legal to wait for a grace period within +an RCU read-side critical section: To do so would result either +in deadlock or +in RCU implicitly splitting the enclosing RCU read-side critical +section, neither of which is conducive to a long-lived and prosperous +kernel. + +
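+
+To make the prohibition concrete, the following sketch shows the forbidden pattern, with do_something() and do_something_else() standing in for arbitrary work as in the earlier examples:
+
+void buggy(void)
+{
+  rcu_read_lock();
+  do_something();
+  synchronize_rcu();   /* BUG: waits for a grace period inside a reader. */
+  do_something_else();
+  rcu_read_unlock();
+}
+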

+It is worth noting that RCU is not alone in limiting composability. +For example, many transactional-memory implementations prohibit +composing a pair of transactions separated by an irrevocable +operation (for example, a network receive operation). +For another example, lock-based critical sections can be composed +surprisingly freely, but only if deadlock is avoided. + +

+In short, although RCU read-side critical sections are highly composable, +care is required in some situations, just as is the case for any other +composable synchronization mechanism. + +

Corner Cases

+ +

+A given RCU workload might have an endless and intense stream of +RCU read-side critical sections, perhaps even so intense that there +was never a point in time during which there was not at least one +RCU read-side critical section in flight. +RCU cannot allow this situation to block grace periods: As long as +all the RCU read-side critical sections are finite, grace periods +must also be finite. + +

+That said, preemptible RCU implementations could potentially result +in RCU read-side critical sections being preempted for long durations, +which has the effect of creating a long-duration RCU read-side +critical section. +This situation can arise only in heavily loaded systems, but systems using +real-time priorities are of course more vulnerable. +Therefore, RCU priority boosting is provided to help deal with this +case. +That said, the exact requirements on RCU priority boosting will likely +evolve as more experience accumulates. + +

+Other workloads might have very high update rates. +Although one can argue that such workloads should instead use +something other than RCU, the fact remains that RCU must +handle such workloads gracefully. +This requirement is another factor driving batching of grace periods, +but it is also the driving force behind the checks for large numbers +of queued RCU callbacks in the call_rcu() code path. +Finally, high update rates should not delay RCU read-side critical +sections, although some small read-side delays can occur when using +synchronize_rcu_expedited(), courtesy of this function's use +of smp_call_function_single(). + +

+Although all three of these corner cases were understood in the early +1990s, a simple user-level test consisting of close(open(path)) +in a tight loop +in the early 2000s suddenly provided a much deeper appreciation of the +high-update-rate corner case. +This test also motivated addition of some RCU code to react to high update +rates, for example, if a given CPU finds itself with more than 10,000 +RCU callbacks queued, it will cause RCU to take evasive action by +more aggressively starting grace periods and more aggressively forcing +completion of grace-period processing. +This evasive action causes the grace period to complete more quickly, +but at the cost of restricting RCU's batching optimizations, thus +increasing the CPU overhead incurred by that grace period. + +

+Software-Engineering Requirements

+ +

+Between Murphy's Law and “To err is human”, it is necessary to +guard against mishaps and misuse: + +

    +
  1. It is all too easy to forget to use rcu_read_lock() + everywhere that it is needed, so kernels built with + CONFIG_PROVE_RCU=y will splat if + rcu_dereference() is used outside of an + RCU read-side critical section. + Update-side code can use rcu_dereference_protected(), + which takes a + lockdep expression + to indicate what is providing the protection. + If the indicated protection is not provided, a lockdep splat + is emitted. + +

    + Code shared between readers and updaters can use
+ rcu_dereference_check(), which also takes a
+ lockdep expression, and emits a lockdep splat if neither
+ rcu_read_lock() nor the indicated protection
+ is in place.
+ In addition, rcu_dereference_raw() is used in those
+ (hopefully rare) cases where the required protection cannot
+ be easily described.
+ Finally, rcu_read_lock_held() is provided to
+ allow a function to verify that it has been invoked within
+ an RCU read-side critical section.
+ I was made aware of this set of requirements shortly after Thomas
+ Gleixner audited a number of RCU uses.
+ (A combined sketch of these annotations appears after this list.)

  2. A given function might wish to check for RCU-related preconditions + upon entry, before using any other RCU API. + The rcu_lockdep_assert() does this job, + asserting the expression in kernels having lockdep enabled + and doing nothing otherwise. +
  3. It is also easy to forget to use rcu_assign_pointer() + and rcu_dereference(), perhaps (incorrectly) + substituting a simple assignment. + To catch this sort of error, a given RCU-protected pointer may be + tagged with __rcu, after which sparse + will complain about simple-assignment accesses to that pointer. + Arnd Bergmann made me aware of this requirement, and also + supplied the needed + patch series. +
  4. Kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y + will splat if a data element is passed to call_rcu() + twice in a row, without a grace period in between. + (This error is similar to a double free.) + The corresponding rcu_head structures that are + dynamically allocated are automatically tracked, but + rcu_head structures allocated on the stack + must be initialized with init_rcu_head_on_stack() + and cleaned up with destroy_rcu_head_on_stack(). + Similarly, statically allocated non-stack rcu_head + structures must be initialized with init_rcu_head() + and cleaned up with destroy_rcu_head(). + Mathieu Desnoyers made me aware of this requirement, and also + supplied the needed + patch. +
  5. An infinite loop in an RCU read-side critical section will + eventually trigger an RCU CPU stall warning splat, with + the duration of “eventually” being controlled by the + RCU_CPU_STALL_TIMEOUT Kconfig option, or, + alternatively, by the + rcupdate.rcu_cpu_stall_timeout boot/sysfs + parameter. + However, RCU is not obligated to produce this splat + unless there is a grace period waiting on that particular + RCU read-side critical section. +

    + Some extreme workloads might intentionally delay + RCU grace periods, and systems running those workloads can + be booted with rcupdate.rcu_cpu_stall_suppress + to suppress the splats. + This kernel parameter may also be set via sysfs. + Furthermore, RCU CPU stall warnings are counter-productive + during sysrq dumps and during panics. + RCU therefore supplies the rcu_sysrq_start() and + rcu_sysrq_end() API members to be called before + and after long sysrq dumps. + RCU also supplies the rcu_panic() notifier that is + automatically invoked at the beginning of a panic to suppress + further RCU CPU stall warnings. + +

    + This requirement made itself known in the early 1990s, pretty + much the first time that it was necessary to debug a CPU stall. + That said, the initial implementation in DYNIX/ptx was quite + generic in comparison with that of Linux. +

  6. Although it would be very good to detect pointers leaking out + of RCU read-side critical sections, there is currently no + good way of doing this. + One complication is the need to distinguish between pointers + leaking and pointers that have been handed off from RCU to + some other synchronization mechanism, for example, reference + counting. +
  7. In kernels built with CONFIG_RCU_TRACE=y, RCU-related + information is provided via event tracing. +
  8. Open-coded use of rcu_assign_pointer() and + rcu_dereference() to create typical linked + data structures can be surprisingly error-prone. + Therefore, RCU-protected + linked lists + and, more recently, RCU-protected + hash tables + are available. + Many other special-purpose RCU-protected data structures are + available in the Linux kernel and the userspace RCU library. +
  9. Some linked structures are created at compile time, but still + require __rcu checking. + The RCU_POINTER_INITIALIZER() macro serves this + purpose. +
  10. It is not necessary to use rcu_assign_pointer() + when creating linked structures that are to be published via + a single external pointer. + The RCU_INIT_POINTER() macro is provided for + this task and also for assigning NULL pointers + at runtime. +
+ +
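+
+As promised in item 1 above, here is a combined sketch of these annotations; gp and gp_lock are illustrative names in the style of this document's other examples, and this is merely one way of wiring the diagnostics together:
+
+struct foo { int a; };
+static struct foo __rcu *gp;         /* __rcu: sparse checks all accesses. */
+static DEFINE_SPINLOCK(gp_lock);
+
+/* Shared accessor: legal with either rcu_read_lock() or gp_lock held. */
+static struct foo *get_gp(void)
+{
+  return rcu_dereference_check(gp, lockdep_is_held(&gp_lock));
+}
+
+static int read_a(void)
+{
+  struct foo *p;
+  int a = -1;
+
+  rcu_read_lock();                   /* Reader-side protection. */
+  p = get_gp();
+  if (p)
+    a = p->a;
+  rcu_read_unlock();
+  return a;
+}
+
+static void update_a(int a)
+{
+  struct foo *p;
+
+  spin_lock(&gp_lock);               /* Update-side protection. */
+  p = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
+  if (p)
+    p->a = a;
+  spin_unlock(&gp_lock);
+}
+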

+This is not a hard-and-fast list: RCU's diagnostic capabilities will
+continue to be guided by the number and type of usage bugs found
+in real-world RCU usage.
+

Linux Kernel Complications

+ +

+The Linux kernel provides an interesting environment for all kinds of +software, including RCU. +Some of the relevant points of interest are as follows: + +

    +
  1. Configuration. +
  2. Firmware Interface. +
  3. Early Boot. +
  4. + Interrupts and non-maskable interrupts (NMIs). +
  5. Loadable Modules. +
  6. Hotplug CPU. +
  7. Scheduler and RCU. +
  8. Tracing and RCU. +
  9. Energy Efficiency. +
  10. + Scheduling-Clock Interrupts and RCU. +
  11. Memory Efficiency. +
  12. + Performance, Scalability, Response Time, and Reliability. +
+ +

+This list is probably incomplete, but it does give a feel for the +most notable Linux-kernel complications. +Each of the following sections covers one of the above topics. + +

Configuration

+ +

+RCU's goal is automatic configuration, so that almost nobody +needs to worry about RCU's Kconfig options. +And for almost all users, RCU does in fact work well +“out of the box.” + +

+However, there are specialized use cases that are handled by +kernel boot parameters and Kconfig options. +Unfortunately, the Kconfig system will explicitly ask users +about new Kconfig options, which requires almost all of them +be hidden behind a CONFIG_RCU_EXPERT Kconfig option. + +

+This all should be quite obvious, but the fact remains that +Linus Torvalds recently had to +remind +me of this requirement. + +

Firmware Interface

+ +

+In many cases, the kernel obtains information about the system from the
+firmware, and sometimes things are lost in translation.
+Or the translation is accurate, but the original message is bogus.
+

+For example, some systems' firmware overreports the number of CPUs, +sometimes by a large factor. +If RCU naively believed the firmware, as it used to do, +it would create too many per-CPU kthreads. +Although the resulting system will still run correctly, the extra +kthreads needlessly consume memory and can cause confusion +when they show up in ps listings. + +

+RCU must therefore wait for a given CPU to actually come online before +it can allow itself to believe that the CPU actually exists. +The resulting “ghost CPUs” (which are never going to +come online) cause a number of +interesting complications. + +

Early Boot

+ +

+The Linux kernel's boot sequence is an interesting process, +and RCU is used early, even before rcu_init() +is invoked. +In fact, a number of RCU's primitives can be used as soon as the +initial task's task_struct is available and the +boot CPU's per-CPU variables are set up. +The read-side primitives (rcu_read_lock(), +rcu_read_unlock(), rcu_dereference(), +and rcu_access_pointer()) will operate normally very early on, +as will rcu_assign_pointer(). + +

+Although call_rcu() may be invoked at any +time during boot, callbacks are not guaranteed to be invoked until after +all of RCU's kthreads have been spawned, which occurs at +early_initcall() time. +This delay in callback invocation is due to the fact that RCU does not +invoke callbacks until it is fully initialized, and this full initialization +cannot occur until after the scheduler has initialized itself to the +point where RCU can spawn and run its kthreads. +In theory, it would be possible to invoke callbacks earlier, +however, this is not a panacea because there would be severe restrictions +on what operations those callbacks could invoke. + +

+Perhaps surprisingly, synchronize_rcu(),
+synchronize_rcu_bh()
+(discussed below),
+synchronize_sched(),
+synchronize_rcu_expedited(),
+synchronize_rcu_bh_expedited(), and
+synchronize_sched_expedited()
+will all operate normally
+during very early boot, the reason being that there is only one CPU
+and preemption is disabled.
+This means that the call to synchronize_rcu() (or friends)
+is itself a quiescent
+state and thus a grace period, so the early-boot implementation can
+be a no-op.
+

+However, once the scheduler has spawned its first kthread, this early +boot trick fails for synchronize_rcu() (as well as for +synchronize_rcu_expedited()) in CONFIG_PREEMPT=y +kernels. +The reason is that an RCU read-side critical section might be preempted, +which means that a subsequent synchronize_rcu() really does have +to wait for something, as opposed to simply returning immediately. +Unfortunately, synchronize_rcu() can't do this until all of +its kthreads are spawned, which doesn't happen until some time during +early_initcalls() time. +But this is no excuse: RCU is nevertheless required to correctly handle +synchronous grace periods during this time period. +Once all of its kthreads are up and running, RCU starts running +normally. + + + + + + + + +
 
Quick Quiz:
+ How can RCU possibly handle grace periods before all of its + kthreads have been spawned??? +
Answer:
+ Very carefully! + + +

+ During the “dead zone” between the time that the + scheduler spawns the first task and the time that all of RCU's + kthreads have been spawned, all synchronous grace periods are + handled by the expedited grace-period mechanism. + At runtime, this expedited mechanism relies on workqueues, but + during the dead zone the requesting task itself drives the + desired expedited grace period. + Because dead-zone execution takes place within task context, + everything works. + Once the dead zone ends, expedited grace periods go back to + using workqueues, as is required to avoid problems that would + otherwise occur when a user task received a POSIX signal while + driving an expedited grace period. + + +

+ And yes, this does mean that it is unhelpful to send POSIX + signals to random tasks between the time that the scheduler + spawns its first kthread and the time that RCU's kthreads + have all been spawned. + If there ever turns out to be a good reason for sending POSIX + signals during that time, appropriate adjustments will be made. + (If it turns out that POSIX signals are sent during this time for + no good reason, other adjustments will be made, appropriate + or otherwise.) +

 
+ +

+I learned of these boot-time requirements as a result of a series of +system hangs. + +

Interrupts and NMIs

+ +

+The Linux kernel has interrupts, and RCU read-side critical sections are +legal within interrupt handlers and within interrupt-disabled regions +of code, as are invocations of call_rcu(). + +
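+
+For example (a sketch with hypothetical device names), an interrupt handler may traverse RCU-protected data directly:
+
+struct mydev_config { int threshold; };       /* Hypothetical. */
+static struct mydev_config __rcu *mydev_cfg;  /* Hypothetical global. */
+void mydev_react(int threshold);              /* Hypothetical device work. */
+
+static irqreturn_t mydev_irq(int irq, void *dev_id)
+{
+  struct mydev_config *cfg;
+
+  rcu_read_lock();                 /* Legal in interrupt context. */
+  cfg = rcu_dereference(mydev_cfg);
+  if (cfg)
+    mydev_react(cfg->threshold);
+  rcu_read_unlock();
+  return IRQ_HANDLED;
+}
+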

+Some Linux-kernel architectures can enter an interrupt handler from +non-idle process context, and then just never leave it, instead stealthily +transitioning back to process context. +This trick is sometimes used to invoke system calls from inside the kernel. +These “half-interrupts” mean that RCU has to be very careful +about how it counts interrupt nesting levels. +I learned of this requirement the hard way during a rewrite +of RCU's dyntick-idle code. + +

+The Linux kernel has non-maskable interrupts (NMIs), and +RCU read-side critical sections are legal within NMI handlers. +Thankfully, RCU update-side primitives, including +call_rcu(), are prohibited within NMI handlers. + +

+The name notwithstanding, some Linux-kernel architectures +can have nested NMIs, which RCU must handle correctly. +Andy Lutomirski +surprised me +with this requirement; +he also kindly surprised me with +an algorithm +that meets this requirement. + +

Loadable Modules

+ +

+The Linux kernel has loadable modules, and these modules can +also be unloaded. +After a given module has been unloaded, any attempt to call +one of its functions results in a segmentation fault. +The module-unload functions must therefore cancel any +delayed calls to loadable-module functions, for example, +any outstanding mod_timer() must be dealt with +via del_timer_sync() or similar. + +

+Unfortunately, there is no way to cancel an RCU callback; +once you invoke call_rcu(), the callback function is +going to eventually be invoked, unless the system goes down first. +Because it is normally considered socially irresponsible to crash the system +in response to a module unload request, we need some other way +to deal with in-flight RCU callbacks. + +

+RCU therefore provides +rcu_barrier(), +which waits until all in-flight RCU callbacks have been invoked. +If a module uses call_rcu(), its exit function should therefore +prevent any future invocation of call_rcu(), then invoke +rcu_barrier(). +In theory, the underlying module-unload code could invoke +rcu_barrier() unconditionally, but in practice this would +incur unacceptable latencies. + +
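+
+A sketch of such an exit path follows; the timer and state pointer are hypothetical stand-ins for whatever in the module posts RCU callbacks:
+
+static struct timer_list mymod_timer;  /* Hypothetical source of call_rcu(). */
+static void *mymod_state;              /* Hypothetical module-private data. */
+
+static void __exit mymod_exit(void)
+{
+  /* First, stop everything that might post new RCU callbacks. */
+  del_timer_sync(&mymod_timer);
+
+  /* Then wait for all already-posted callbacks to be invoked. */
+  rcu_barrier();
+
+  /* Only now is it safe to free the data and unload the module text. */
+  kfree(mymod_state);
+}
+module_exit(mymod_exit);
+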

+Nikita Danilov noted this requirement for an analogous filesystem-unmount +situation, and Dipankar Sarma incorporated rcu_barrier() into RCU. +The need for rcu_barrier() for module unloading became +apparent later. + +

+Important note: The rcu_barrier() function is not, +repeat, not, obligated to wait for a grace period. +It is instead only required to wait for RCU callbacks that have +already been posted. +Therefore, if there are no RCU callbacks posted anywhere in the system, +rcu_barrier() is within its rights to return immediately. +Even if there are callbacks posted, rcu_barrier() does not +necessarily need to wait for a grace period. + + + + + + + + +
 
Quick Quiz:
+ Wait a minute!
+ Each RCU callback must wait for a grace period to complete,
+ and rcu_barrier() must wait for each pre-existing
+ callback to be invoked.
+ Doesn't rcu_barrier() therefore need to wait for
+ a full grace period if there is even one callback posted anywhere
+ in the system?
+
Answer:
+ Absolutely not!!! + + +

+ Yes, each RCU callback must wait for a grace period to complete,
+ but it might well be partly (or even completely) finished waiting
+ by the time rcu_barrier() is invoked.
+ In that case, rcu_barrier() need only wait for the
+ remaining portion of the grace period to elapse.
+ So even if there are quite a few callbacks posted,
+ rcu_barrier() might well return quite quickly.
+

+ So if you need to wait for a grace period as well as for all + pre-existing callbacks, you will need to invoke both + synchronize_rcu() and rcu_barrier(). + If latency is a concern, you can always use workqueues + to invoke them concurrently. +

 
+ +

Hotplug CPU

+ +

+The Linux kernel supports CPU hotplug, which means that CPUs +can come and go. +It is of course illegal to use any RCU API member from an offline CPU, +with the exception of SRCU read-side +critical sections. +This requirement was present from day one in DYNIX/ptx, but +on the other hand, the Linux kernel's CPU-hotplug implementation +is “interesting.” + +

+The Linux-kernel CPU-hotplug implementation has notifiers that +are used to allow the various kernel subsystems (including RCU) +to respond appropriately to a given CPU-hotplug operation. +Most RCU operations may be invoked from CPU-hotplug notifiers, +including even synchronous grace-period operations such as +synchronize_rcu() and synchronize_rcu_expedited(). + +

+However, all-callback-wait operations such as
+rcu_barrier() are not supported, due to the
+fact that there are phases of CPU-hotplug operations where
+the outgoing CPU's callbacks will not be invoked until after
+the CPU-hotplug operation ends, which could result in deadlock.
+Furthermore, rcu_barrier() blocks CPU-hotplug operations
+during its execution, which results in another type of deadlock
+when invoked from a CPU-hotplug notifier.
+

Scheduler and RCU

+ +

+RCU depends on the scheduler, and the scheduler uses RCU to +protect some of its data structures. +This means the scheduler is forbidden from acquiring +the runqueue locks and the priority-inheritance locks +in the middle of an outermost RCU read-side critical section unless either +(1) it releases them before exiting that same +RCU read-side critical section, or +(2) interrupts are disabled across +that entire RCU read-side critical section. +This same prohibition also applies (recursively!) to any lock that is acquired +while holding any lock to which this prohibition applies. +Adhering to this rule prevents preemptible RCU from invoking +rcu_read_unlock_special() while either runqueue or +priority-inheritance locks are held, thus avoiding deadlock. + +

+Prior to v4.4, it was only necessary to disable preemption across +RCU read-side critical sections that acquired scheduler locks. +In v4.4, expedited grace periods started using IPIs, and these +IPIs could force a rcu_read_unlock() to take the slowpath. +Therefore, this expedited-grace-period change required disabling of +interrupts, not just preemption. + +

+For RCU's part, the preemptible-RCU rcu_read_unlock() +implementation must be written carefully to avoid similar deadlocks. +In particular, rcu_read_unlock() must tolerate an +interrupt where the interrupt handler invokes both +rcu_read_lock() and rcu_read_unlock(). +This possibility requires rcu_read_unlock() to use +negative nesting levels to avoid destructive recursion via +interrupt handler's use of RCU. + +

+This pair of mutual scheduler-RCU requirements came as a +complete surprise. + +

+As noted above, RCU makes use of kthreads, and it is necessary to
+avoid excessive CPU-time accumulation by these kthreads.
+This requirement was no surprise, but RCU's violation of it
+when running context-switch-heavy workloads when built with
+CONFIG_NO_HZ_FULL=y
+did come as a surprise [PDF].
+RCU has made good progress towards meeting this requirement, even
+for context-switch-heavy CONFIG_NO_HZ_FULL=y workloads,
+but there is room for further improvement.
+

Tracing and RCU

+ +

+It is possible to use tracing on RCU code, but tracing itself +uses RCU. +For this reason, rcu_dereference_raw_notrace() +is provided for use by tracing, which avoids the destructive +recursion that could otherwise ensue. +This API is also used by virtualization in some architectures, +where RCU readers execute in environments in which tracing +cannot be used. +The tracing folks both located the requirement and provided the +needed fix, so this surprise requirement was relatively painless. + +

Energy Efficiency

+ +

+Interrupting idle CPUs is considered socially unacceptable, +especially by people with battery-powered embedded systems. +RCU therefore conserves energy by detecting which CPUs are +idle, including tracking CPUs that have been interrupted from idle. +This is a large part of the energy-efficiency requirement, +so I learned of this via an irate phone call. + +

+Because RCU avoids interrupting idle CPUs, it is illegal to
+execute an RCU read-side critical section on an idle CPU.
+(Kernels built with CONFIG_PROVE_RCU=y will splat
+if you try it.)
+The RCU_NONIDLE() macro and _rcuidle
+event tracing are provided to work around this restriction.
+In addition, rcu_is_watching() may be used to
+test whether or not it is currently legal to run RCU read-side
+critical sections on this CPU.
+I learned of the need for diagnostics on the one hand
+and RCU_NONIDLE() on the other while inspecting
+idle-loop code.
+Steven Rostedt supplied _rcuidle event tracing,
+which is used quite heavily in the idle loop.
+However, there are some restrictions on the code placed within
+RCU_NONIDLE():
+

    +
  1. Blocking is prohibited. + In practice, this is not a serious restriction given that idle + tasks are prohibited from blocking to begin with. +
  2. Although nesting RCU_NONIDLE() is permitted, they cannot + nest indefinitely deeply. + However, given that they can be nested on the order of a million + deep, even on 32-bit systems, this should not be a serious + restriction. + This nesting limit would probably be reached long after the + compiler OOMed or the stack overflowed. +
  3. Any code path that enters RCU_NONIDLE() must sequence + out of that same RCU_NONIDLE(). + For example, the following is grossly illegal: + +
    +
    + 1     RCU_NONIDLE({
    + 2       do_something();
    + 3       goto bad_idea;  /* BUG!!! */
    + 4       do_something_else();});
    + 5   bad_idea:
    +	
    +
    + +

    + It is just as illegal to transfer control into the middle of + RCU_NONIDLE()'s argument. + Yes, in theory, you could transfer in as long as you also + transferred out, but in practice you could also expect to get sharply + worded review comments. +

+ +
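+
+By way of contrast, a legal use keeps a complete statement inside the macro; do_something_with_rcu() is a hypothetical helper:
+
+static void idle_adjacent_work(void)
+{
+  /* Make RCU watch for just the duration of the enclosed statement. */
+  RCU_NONIDLE(do_something_with_rcu());
+
+  /* Alternatively, do RCU work only when RCU is already watching. */
+  if (rcu_is_watching())
+    do_something_with_rcu();
+}
+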

+It is similarly socially unacceptable to interrupt a
+nohz_full CPU running in userspace.
+RCU must therefore track nohz_full userspace
+execution.
+RCU must also be able to sample state at two points in
+time, and be able to determine whether or not some other CPU spent
+any time idle and/or executing in userspace.
+

+These energy-efficiency requirements have proven quite difficult to +understand and to meet, for example, there have been more than five +clean-sheet rewrites of RCU's energy-efficiency code, the last of +which was finally able to demonstrate +real energy savings running on real hardware [PDF]. +As noted earlier, +I learned of many of these requirements via angry phone calls: +Flaming me on the Linux-kernel mailing list was apparently not +sufficient to fully vent their ire at RCU's energy-efficiency bugs! + +

+Scheduling-Clock Interrupts and RCU

+ +

+The kernel transitions between in-kernel non-idle execution, userspace +execution, and the idle loop. +Depending on kernel configuration, RCU handles these states differently: + + + + + + + + + + + + + + + + + + +
HZ Kconfig | In-Kernel | Usermode | Idle
HZ_PERIODIC | Can rely on scheduling-clock interrupt. | Can rely on scheduling-clock interrupt and its detection of interrupt from usermode. | Can rely on RCU's dyntick-idle detection.
NO_HZ_IDLE | Can rely on scheduling-clock interrupt. | Can rely on scheduling-clock interrupt and its detection of interrupt from usermode. | Can rely on RCU's dyntick-idle detection.
NO_HZ_FULL | Can only sometimes rely on scheduling-clock interrupt. In other cases, it is necessary to bound kernel execution times and/or use IPIs. | Can rely on RCU's dyntick-idle detection. | Can rely on RCU's dyntick-idle detection.
+ + + + + + + + +
 
Quick Quiz:
+ Why can't NO_HZ_FULL in-kernel execution rely on the + scheduling-clock interrupt, just like HZ_PERIODIC + and NO_HZ_IDLE do? +
Answer:
+ Because, as a performance optimization, NO_HZ_FULL + does not necessarily re-enable the scheduling-clock interrupt + on entry to each and every system call. +
 
+ +

+However, RCU must be reliably informed as to whether any given +CPU is currently in the idle loop, and, for NO_HZ_FULL, +also whether that CPU is executing in usermode, as discussed +earlier. +It also requires that the scheduling-clock interrupt be enabled when +RCU needs it to be: + +

    +
  1. If a CPU is either idle or executing in usermode, and RCU believes + it is non-idle, the scheduling-clock tick had better be running. + Otherwise, you will get RCU CPU stall warnings. Or at best, + very long (11-second) grace periods, with a pointless IPI waking + the CPU from time to time. +
  2. If a CPU is in a portion of the kernel that executes RCU read-side + critical sections, and RCU believes this CPU to be idle, you will get + random memory corruption. DON'T DO THIS!!! + +
    This is one reason to test with lockdep, which will complain + about this sort of thing. +
  3. If a CPU is in a portion of the kernel that is absolutely
+ positively no-joking guaranteed to never execute any RCU read-side
+ critical sections, and RCU believes this CPU to be idle,
+ no problem. This sort of thing is used by some architectures
+ for light-weight exception handlers, which can then avoid the
+ overhead of rcu_irq_enter() and rcu_irq_exit()
+ at exception entry and exit, respectively.
+ Some go further and avoid the entireties of irq_enter()
+ and irq_exit().
+
    Just make very sure you are running some of your tests with + CONFIG_PROVE_RCU=y, just in case one of your code paths + was in fact joking about not doing RCU read-side critical sections. +
  4. If a CPU is executing in the kernel with the scheduling-clock + interrupt disabled and RCU believes this CPU to be non-idle, + and if the CPU goes idle (from an RCU perspective) every few + jiffies, no problem. It is usually OK for there to be the + occasional gap between idle periods of up to a second or so. + +
    If the gap grows too long, you get RCU CPU stall warnings. +
  5. If a CPU is either idle or executing in usermode, and RCU believes + it to be idle, of course no problem. +
  6. If a CPU is executing in the kernel, the kernel code + path is passing through quiescent states at a reasonable + frequency (preferably about once per few jiffies, but the + occasional excursion to a second or so is usually OK) and the + scheduling-clock interrupt is enabled, of course no problem. + +
    If the gap between a successive pair of quiescent states grows + too long, you get RCU CPU stall warnings. +
+ + + + + + + + +
 
Quick Quiz:
+ But what if my driver has a hardware interrupt handler
+ that can run for many seconds?
+ I cannot invoke schedule() from a hardware
+ interrupt handler, after all!
+
Answer:
+ One approach is to do rcu_irq_exit();rcu_irq_enter(); + every so often. + But given that long-running interrupt handlers can cause + other problems, not least for response time, shouldn't you + work to keep your interrupt handler's runtime within reasonable + bounds? +
 
+ +
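+
+As a sketch of the approach suggested in the answer above (the device-specific helpers are hypothetical, and keeping the handler short remains the better fix):
+
+static irqreturn_t longrunning_irq(int irq, void *dev_id)
+{
+  while (device_has_more_work(dev_id)) {   /* Hypothetical. */
+    process_one_work_unit(dev_id);         /* Hypothetical. */
+
+    /* Momentarily tell RCU that this handler is "idle". */
+    rcu_irq_exit();
+    rcu_irq_enter();
+  }
+  return IRQ_HANDLED;
+}
+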

+But as long as RCU is properly informed of kernel state transitions between +in-kernel execution, usermode execution, and idle, and as long as the +scheduling-clock interrupt is enabled when RCU needs it to be, you +can rest assured that the bugs you encounter will be in some other +part of RCU or some other part of the kernel! + +

Memory Efficiency

+ +

+Although small-memory non-realtime systems can simply use Tiny RCU, +code size is only one aspect of memory efficiency. +Another aspect is the size of the rcu_head structure +used by call_rcu() and kfree_rcu(). +Although this structure contains nothing more than a pair of pointers, +it does appear in many RCU-protected data structures, including +some that are size critical. +The page structure is a case in point, as evidenced by +the many occurrences of the union keyword within that structure. + +

+This need for memory efficiency is one reason that RCU uses hand-crafted +singly linked lists to track the rcu_head structures that +are waiting for a grace period to elapse. +It is also the reason why rcu_head structures do not contain +debug information, such as fields tracking the file and line of the +call_rcu() or kfree_rcu() that posted them. +Although this information might appear in debug-only kernel builds at some +point, in the meantime, the ->func field will often provide +the needed debug information. + +

+However, in some cases, the need for memory efficiency leads to even +more extreme measures. +Returning to the page structure, the rcu_head field +shares storage with a great many other structures that are used at +various points in the corresponding page's lifetime. +In order to correctly resolve certain +race conditions, +the Linux kernel's memory-management subsystem needs a particular bit +to remain zero during all phases of grace-period processing, +and that bit happens to map to the bottom bit of the +rcu_head structure's ->next field. +RCU makes this guarantee as long as call_rcu() +is used to post the callback, as opposed to kfree_rcu() +or some future “lazy” +variant of call_rcu() that might one day be created for +energy-efficiency purposes. + +

+That said, there are limits. +RCU requires that the rcu_head structure be aligned to a +two-byte boundary, and passing a misaligned rcu_head +structure to one of the call_rcu() family of functions +will result in a splat. +It is therefore necessary to exercise caution when packing +structures containing fields of type rcu_head. +Why not a four-byte or even eight-byte alignment requirement? +Because the m68k architecture provides only two-byte alignment, +and thus acts as alignment's least common denominator. + +

+The reason for reserving the bottom bit of pointers to +rcu_head structures is to leave the door open to +“lazy” callbacks whose invocations can safely be deferred. +Deferring invocation could potentially have energy-efficiency +benefits, but only if the rate of non-lazy callbacks decreases +significantly for some important workload. +In the meantime, reserving the bottom bit keeps this option open +in case it one day becomes useful. + +

+Performance, Scalability, Response Time, and Reliability

+ +

+Expanding on the
+earlier discussion,
+RCU is used heavily by hot code paths in performance-critical
+portions of the Linux kernel's networking, security, virtualization,
+and scheduling code paths.
+RCU must therefore use efficient implementations, especially in its
+read-side primitives.
+To that end, it would be good if preemptible RCU's implementation
+of rcu_read_lock() could be inlined; however, doing
+so requires resolving #include issues with the
+task_struct structure.
+

+The Linux kernel supports hardware configurations with up to +4096 CPUs, which means that RCU must be extremely scalable. +Algorithms that involve frequent acquisitions of global locks or +frequent atomic operations on global variables simply cannot be +tolerated within the RCU implementation. +RCU therefore makes heavy use of a combining tree based on the +rcu_node structure. +RCU is required to tolerate all CPUs continuously invoking any +combination of RCU's runtime primitives with minimal per-operation +overhead. +In fact, in many cases, increasing load must decrease the +per-operation overhead, witness the batching optimizations for +synchronize_rcu(), call_rcu(), +synchronize_rcu_expedited(), and rcu_barrier(). +As a general rule, RCU must cheerfully accept whatever the +rest of the Linux kernel decides to throw at it. + +

+The Linux kernel is used for real-time workloads, especially +in conjunction with the +-rt patchset. +The real-time-latency response requirements are such that the +traditional approach of disabling preemption across RCU +read-side critical sections is inappropriate. +Kernels built with CONFIG_PREEMPT=y therefore +use an RCU implementation that allows RCU read-side critical +sections to be preempted. +This requirement made its presence known after users made it +clear that an earlier +real-time patch +did not meet their needs, in conjunction with some +RCU issues +encountered by a very early version of the -rt patchset. + +

+In addition, RCU must make do with a sub-100-microsecond real-time latency +budget. +In fact, on smaller systems with the -rt patchset, the Linux kernel +provides sub-20-microsecond real-time latencies for the whole kernel, +including RCU. +RCU's scalability and latency must therefore be sufficient for +these sorts of configurations. +To my surprise, the sub-100-microsecond real-time latency budget + +applies to even the largest systems [PDF], +up to and including systems with 4096 CPUs. +This real-time requirement motivated the grace-period kthread, which +also simplified handling of a number of race conditions. + +

+RCU must avoid degrading real-time response for CPU-bound threads, whether +executing in usermode (which is one use case for +CONFIG_NO_HZ_FULL=y) or in the kernel. +That said, CPU-bound loops in the kernel must execute +cond_resched() at least once per few tens of milliseconds +in order to avoid receiving an IPI from RCU. + +
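+
+For example, a CPU-bound in-kernel loop might be structured as follows,
+with the item-processing helper being hypothetical:
+
+ 1 for (i = 0; i < nr_items; i++) {
+ 2   process_item(i);
+ 3   if (!(i % 1000))
+ 4     cond_resched();  /* Periodically supply RCU with a quiescent state. */
+ 5 }
+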

+Finally, RCU's status as a synchronization primitive means that +any RCU failure can result in arbitrary memory corruption that can be +extremely difficult to debug. +This means that RCU must be extremely reliable, which in +practice also means that RCU must have an aggressive stress-test +suite. +This stress-test suite is called rcutorture. + +

+Although the need for rcutorture was no surprise, +the current immense popularity of the Linux kernel is posing +interesting—and perhaps unprecedented—validation +challenges. +To see this, keep in mind that there are well over one billion +instances of the Linux kernel running today, given Android +smartphones, Linux-powered televisions, and servers. +This number can be expected to increase sharply with the advent of +the celebrated Internet of Things. + +

+Suppose that RCU contains a race condition that manifests on average +once per million years of runtime. +This bug will be occurring about three times per day across +the installed base. +RCU could simply hide behind hardware error rates, given that no one +should really expect their smartphone to last for a million years. +However, anyone taking too much comfort from this thought should +consider the fact that in most jurisdictions, a successful multi-year +test of a given mechanism, which might include a Linux kernel, +suffices for a number of types of safety-critical certifications. +In fact, rumor has it that the Linux kernel is already being used +in production for safety-critical applications. +I don't know about you, but I would feel quite bad if a bug in RCU +killed someone. +Which might explain my recent focus on validation and verification. + +

Other RCU Flavors

+ +

+One of the more surprising things about RCU is that there are now +no fewer than five flavors, or API families. +In addition, the primary flavor that has been the sole focus up to +this point has two different implementations, non-preemptible and +preemptible. +The other four flavors are listed below, with requirements for each +described in a separate section. + +

    +
  1. Bottom-Half Flavor +
  2. Sched Flavor +
  3. Sleepable RCU +
  4. Tasks RCU +
  5. + Waiting for Multiple Grace Periods +
+ +

Bottom-Half Flavor

+ +

+The softirq-disable (AKA “bottom-half”, +hence the “_bh” abbreviations) +flavor of RCU, or RCU-bh, was developed by +Dipankar Sarma to provide a flavor of RCU that could withstand the +network-based denial-of-service attacks researched by Robert +Olsson. +These attacks placed so much networking load on the system +that some of the CPUs never exited softirq execution, +which in turn prevented those CPUs from ever executing a context switch, +which, in the RCU implementation of that time, prevented grace periods +from ever ending. +The result was an out-of-memory condition and a system hang. + +

+The solution was the creation of RCU-bh, which does +local_bh_disable() +across its read-side critical sections, and which uses the transition +from one type of softirq processing to another as a quiescent state +in addition to context switch, idle, user mode, and offline. +This means that RCU-bh grace periods can complete even when some of +the CPUs execute in softirq indefinitely, thus allowing algorithms +based on RCU-bh to withstand network-based denial-of-service attacks. + +

+Because
+rcu_read_lock_bh() and rcu_read_unlock_bh()
+disable and re-enable softirq handlers, any attempt to start a softirq
+handler during the
+RCU-bh read-side critical section will be deferred.
+In this case, rcu_read_unlock_bh()
+will invoke softirq processing, which can take considerable time.
+One can of course argue that this softirq overhead should be associated
+with the code following the RCU-bh read-side critical section rather
+than rcu_read_unlock_bh(), but the fact
+is that most profiling tools cannot be expected to make this sort
+of fine distinction.
+For example, suppose that a three-millisecond-long RCU-bh read-side
+critical section executes during a time of heavy networking load.
+There will very likely be an attempt to invoke at least one softirq
+handler during that three milliseconds, but any such invocation will
+be delayed until the time of the rcu_read_unlock_bh().
+This can of course make it appear at first glance as if
+rcu_read_unlock_bh() was executing very slowly.
+

+The +RCU-bh API +includes +rcu_read_lock_bh(), +rcu_read_unlock_bh(), +rcu_dereference_bh(), +rcu_dereference_bh_check(), +synchronize_rcu_bh(), +synchronize_rcu_bh_expedited(), +call_rcu_bh(), +rcu_barrier_bh(), and +rcu_read_lock_bh_held(). + +
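+
+A minimal RCU-bh read-side critical section might therefore look as
+follows, assuming a hypothetical RCU-protected global pointer gp and
+a hypothetical do_something_with() function:
+
+ 1 rcu_read_lock_bh();
+ 2 p = rcu_dereference_bh(gp);
+ 3 if (p)
+ 4   do_something_with(p);  /* Softirqs remain disabled throughout. */
+ 5 rcu_read_unlock_bh();
+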

Sched Flavor

+ +

+Before preemptible RCU, waiting for an RCU grace period had the
+side effect of also waiting for all pre-existing interrupt
+and NMI handlers.
+However, there are legitimate preemptible-RCU implementations that
+do not have this property, given that any point in the code outside
+of an RCU read-side critical section can be a quiescent state.
+Therefore, RCU-sched was created, which follows “classic”
+RCU in that an RCU-sched grace period waits for pre-existing
+interrupt and NMI handlers.
+In kernels built with CONFIG_PREEMPT=n, the RCU and RCU-sched
+APIs have identical implementations, while kernels built with
+CONFIG_PREEMPT=y provide a separate implementation for each.
+

+Note well that in CONFIG_PREEMPT=y kernels, +rcu_read_lock_sched() and rcu_read_unlock_sched() +disable and re-enable preemption, respectively. +This means that if there was a preemption attempt during the +RCU-sched read-side critical section, rcu_read_unlock_sched() +will enter the scheduler, with all the latency and overhead entailed. +Just as with rcu_read_unlock_bh(), this can make it look +as if rcu_read_unlock_sched() was executing very slowly. +However, the highest-priority task won't be preempted, so that task +will enjoy low-overhead rcu_read_unlock_sched() invocations. + +

+The
+RCU-sched API
+includes
+rcu_read_lock_sched(),
+rcu_read_unlock_sched(),
+rcu_read_lock_sched_notrace(),
+rcu_read_unlock_sched_notrace(),
+rcu_dereference_sched(),
+rcu_dereference_sched_check(),
+synchronize_sched(),
+synchronize_sched_expedited(),
+call_rcu_sched(),
+rcu_barrier_sched(), and
+rcu_read_lock_sched_held().
+However, anything that disables preemption also marks an RCU-sched
+read-side critical section, including
+preempt_disable() and preempt_enable(),
+local_irq_save() and local_irq_restore(),
+and so on.
+
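+
+For example, both of the following mark RCU-sched read-side critical
+sections, again assuming a hypothetical protected pointer gp and a
+hypothetical do_something_with() function:
+
+ 1 rcu_read_lock_sched();
+ 2 p = rcu_dereference_sched(gp);
+ 3 do_something_with(p);
+ 4 rcu_read_unlock_sched();
+ 5
+ 6 preempt_disable();  /* Equally an RCU-sched read-side critical section. */
+ 7 p = rcu_dereference_sched(gp);
+ 8 do_something_with(p);
+ 9 preempt_enable();
+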

Sleepable RCU

+ +

+For well over a decade, someone saying “I need to block within +an RCU read-side critical section” was a reliable indication +that this someone did not understand RCU. +After all, if you are always blocking in an RCU read-side critical +section, you can probably afford to use a higher-overhead synchronization +mechanism. +However, that changed with the advent of the Linux kernel's notifiers, +whose RCU read-side critical +sections almost never sleep, but sometimes need to. +This resulted in the introduction of +sleepable RCU, +or SRCU. + +

+SRCU allows different domains to be defined, with each such domain +defined by an instance of an srcu_struct structure. +A pointer to this structure must be passed in to each SRCU function, +for example, synchronize_srcu(&ss), where +ss is the srcu_struct structure. +The key benefit of these domains is that a slow SRCU reader in one +domain does not delay an SRCU grace period in some other domain. +That said, one consequence of these domains is that read-side code +must pass a “cookie” from srcu_read_lock() +to srcu_read_unlock(), for example, as follows: + +

+
+ 1 int idx;
+ 2
+ 3 idx = srcu_read_lock(&ss);
+ 4 do_something();
+ 5 srcu_read_unlock(&ss, idx);
+
+
+ +

+As noted above, it is legal to block within SRCU read-side critical sections;
+however, with great power comes great responsibility.
+If you block forever in one of a given domain's SRCU read-side critical
+sections, then that domain's grace periods will also be blocked forever.
+Of course, one good way to block forever is to deadlock, which can
+happen if any operation in a given domain's SRCU read-side critical
+section can block waiting, either directly or indirectly, for that domain's
+grace period to elapse.
+For example, this results in a self-deadlock:
+

+
+ 1 int idx;
+ 2
+ 3 idx = srcu_read_lock(&ss);
+ 4 do_something();
+ 5 synchronize_srcu(&ss);
+ 6 srcu_read_unlock(&ss, idx);
+
+
+ +

+However, if line 5 acquired a mutex that was held across
+a synchronize_srcu() for domain ss,
+deadlock would still be possible.
+Furthermore, if line 5 acquired a mutex that was held across
+a synchronize_srcu() for some other domain ss1,
+and if an ss1-domain SRCU read-side critical section
+acquired another mutex that was held across an ss-domain
+synchronize_srcu(),
+deadlock would again be possible.
+Such a deadlock cycle could extend across an arbitrarily large number
+of different SRCU domains.
+Again, with great power comes great responsibility.
+

+Unlike the other RCU flavors, SRCU read-side critical sections can +run on idle and even offline CPUs. +This ability requires that srcu_read_lock() and +srcu_read_unlock() contain memory barriers, which means +that SRCU readers will run a bit slower than would RCU readers. +It also motivates the smp_mb__after_srcu_read_unlock() +API, which, in combination with srcu_read_unlock(), +guarantees a full memory barrier. + +
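+
+For example, the following hedged sketch, reusing the ss domain and
+do_something() from the earlier examples, obtains full-memory-barrier
+semantics at the end of an SRCU read-side critical section:
+
+ 1 idx = srcu_read_lock(&ss);
+ 2 do_something();
+ 3 srcu_read_unlock(&ss, idx);
+ 4 smp_mb__after_srcu_read_unlock(); /* With line 3, a full memory barrier. */
+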

+Also unlike other RCU flavors, SRCU's callbacks-wait function +srcu_barrier() may be invoked from CPU-hotplug notifiers, +though this is not necessarily a good idea. +The reason that this is possible is that SRCU is insensitive +to whether or not a CPU is online, which means that srcu_barrier() +need not exclude CPU-hotplug operations. + +

+SRCU also differs from other RCU flavors in that SRCU's expedited and +non-expedited grace periods are implemented by the same mechanism. +This means that in the current SRCU implementation, expediting a +future grace period has the side effect of expediting all prior +grace periods that have not yet completed. +(But please note that this is a property of the current implementation, +not necessarily of future implementations.) +In addition, if SRCU has been idle for longer than the interval +specified by the srcutree.exp_holdoff kernel boot parameter +(25 microseconds by default), +and if a synchronize_srcu() invocation ends this idle period, +that invocation will be automatically expedited. + +

+As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating
+a locking bottleneck present in prior kernel versions.
+Although this will allow users to put much heavier stress on
+call_srcu(), it is important to note that SRCU does not
+yet take any special steps to deal with callback flooding.
+So if you are posting (say) 10,000 SRCU callbacks per second per CPU,
+you are probably totally OK, but if you intend to post (say) 1,000,000
+SRCU callbacks per second per CPU, please run some tests first.
+SRCU just might need a few adjustments to deal with that sort of load.
+Of course, your mileage may vary based on the speed of your CPUs and
+the size of your memory.
+

+The +SRCU API +includes +srcu_read_lock(), +srcu_read_unlock(), +srcu_dereference(), +srcu_dereference_check(), +synchronize_srcu(), +synchronize_srcu_expedited(), +call_srcu(), +srcu_barrier(), and +srcu_read_lock_held(). +It also includes +DEFINE_SRCU(), +DEFINE_STATIC_SRCU(), and +init_srcu_struct() +APIs for defining and initializing srcu_struct structures. + +
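+
+For example, an SRCU domain may be created either statically or
+dynamically (my_srcu is of course a made-up name, and
+cleanup_srcu_struct() is init_srcu_struct()'s counterpart for the
+dynamic case):
+
+ 1 DEFINE_SRCU(my_srcu);            /* Statically defined and initialized. */
+ 2
+ 3 /* Or, for a dynamically initialized domain: */
+ 4 struct srcu_struct my_srcu;
+ 5
+ 6 init_srcu_struct(&my_srcu);      /* Before first use. */
+ 7 /* ... readers, call_srcu(), synchronize_srcu() ... */
+ 8 cleanup_srcu_struct(&my_srcu);   /* After all readers and callbacks are done. */
+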

Tasks RCU

+ +

+Some forms of tracing use “trampolines” to handle the +binary rewriting required to install different types of probes. +It would be good to be able to free old trampolines, which sounds +like a job for some form of RCU. +However, because it is necessary to be able to install a trace +anywhere in the code, it is not possible to use read-side markers +such as rcu_read_lock() and rcu_read_unlock(). +In addition, it does not work to have these markers in the trampoline +itself, because there would need to be instructions following +rcu_read_unlock(). +Although synchronize_rcu() would guarantee that execution +reached the rcu_read_unlock(), it would not be able to +guarantee that execution had completely left the trampoline. + +

+The solution, in the form of +Tasks RCU, +is to have implicit +read-side critical sections that are delimited by voluntary context +switches, that is, calls to schedule(), +cond_resched(), and +synchronize_rcu_tasks(). +In addition, transitions to and from userspace execution also delimit +tasks-RCU read-side critical sections. + +

+The tasks-RCU API is quite compact, consisting only of +call_rcu_tasks(), +synchronize_rcu_tasks(), and +rcu_barrier_tasks(). + +
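+
+A hedged sketch of the trampoline use case might therefore look as
+follows, with the probe-removal and trampoline-freeing helpers being
+hypothetical:
+
+ 1 unregister_probe(old_trampoline); /* No new calls enter the trampoline. */
+ 2 synchronize_rcu_tasks();          /* Wait until every task has voluntarily */
+ 3                                   /* context-switched or run in userspace, */
+ 4                                   /* so none can still be in the trampoline. */
+ 5 free_trampoline(old_trampoline);  /* Now safe to free. */
+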

+Waiting for Multiple Grace Periods

+ +

+Perhaps you have an RCU-protected data structure that is accessed from
+RCU read-side critical sections, from softirq handlers, and from
+hardware interrupt handlers.
+That is three flavors of RCU: the normal flavor, the bottom-half flavor,
+and the sched flavor.
+How can you wait for a compound grace period?
+

+The best approach is usually to “just say no!” and +insert rcu_read_lock() and rcu_read_unlock() +around each RCU read-side critical section, regardless of what +environment it happens to be in. +But suppose that some of the RCU read-side critical sections are +on extremely hot code paths, and that use of CONFIG_PREEMPT=n +is not a viable option, so that rcu_read_lock() and +rcu_read_unlock() are not free. +What then? + +

+You could wait on all three grace periods in succession, as follows: + +

+
+ 1 synchronize_rcu();
+ 2 synchronize_rcu_bh();
+ 3 synchronize_sched();
+
+
+ +

+This works, but triples the update-side latency penalty. +In cases where this is not acceptable, synchronize_rcu_mult() +may be used to wait on all three flavors of grace period concurrently: + +

+
+ 1 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched);
+
+
+ +

+But what if it is necessary to also wait on SRCU? +This can be done as follows: + +

+
+ 1 static void call_my_srcu(struct rcu_head *head,
+ 2        void (*func)(struct rcu_head *head))
+ 3 {
+ 4   call_srcu(&my_srcu, head, func);
+ 5 }
+ 6
+ 7 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched, call_my_srcu);
+
+
+ +

+If you needed to wait on multiple different flavors of SRCU +(but why???), you would need to create a wrapper function resembling +call_my_srcu() for each SRCU flavor. + + + + + + + + +
 
Quick Quiz:
+ But what if I need to wait for multiple RCU flavors, but I also need + the grace periods to be expedited? +
Answer:
+ If you are using expedited grace periods, there should be less penalty + for waiting on them in succession. + But if that is nevertheless a problem, you can use workqueues + or multiple kthreads to wait on the various expedited grace + periods concurrently. +
 
+ +
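+
+As a rough sketch of the workqueue approach mentioned in the answer
+above, an expedited RCU grace period might be overlapped with an
+expedited SRCU grace period as follows (my_srcu is again a hypothetical
+domain):
+
+ 1 static void exp_gp_work_func(struct work_struct *work)
+ 2 {
+ 3   synchronize_rcu_expedited();
+ 4 }
+ 5 static DECLARE_WORK(exp_gp_work, exp_gp_work_func);
+ 6
+ 7 schedule_work(&exp_gp_work);          /* Expedited RCU grace period in the background... */
+ 8 synchronize_srcu_expedited(&my_srcu); /* ...while waiting for expedited SRCU here. */
+ 9 flush_work(&exp_gp_work);             /* Then wait for the background wait to complete. */
+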

+Again, it is usually better to adjust the RCU read-side critical sections +to use a single flavor of RCU, but when this is not feasible, you can use +synchronize_rcu_mult(). + +

Possible Future Changes

+ +

+One of the tricks that RCU uses to attain update-side scalability is +to increase grace-period latency with increasing numbers of CPUs. +If this becomes a serious problem, it will be necessary to rework the +grace-period state machine so as to avoid the need for the additional +latency. + +

+Expedited grace periods scan the CPUs, so their latency and overhead +increases with increasing numbers of CPUs. +If this becomes a serious problem on large systems, it will be necessary +to do some redesign to avoid this scalability problem. + +

+RCU disables CPU hotplug in a few places, perhaps most notably in the +rcu_barrier() operations. +If there is a strong reason to use rcu_barrier() in CPU-hotplug +notifiers, it will be necessary to avoid disabling CPU hotplug. +This would introduce some complexity, so there had better be a very +good reason. + +

+The tradeoff between grace-period latency on the one hand and interruptions +of other CPUs on the other hand may need to be re-examined. +The desire is of course for zero grace-period latency as well as zero +interprocessor interrupts undertaken during an expedited grace period +operation. +While this ideal is unlikely to be achievable, it is quite possible that +further improvements can be made. + +

+The multiprocessor implementations of RCU use a combining tree that +groups CPUs so as to reduce lock contention and increase cache locality. +However, this combining tree does not spread its memory across NUMA +nodes nor does it align the CPU groups with hardware features such +as sockets or cores. +Such spreading and alignment is currently believed to be unnecessary +because the hotpath read-side primitives do not access the combining +tree, nor does call_rcu() in the common case. +If you believe that your architecture needs such spreading and alignment, +then your architecture should also benefit from the +rcutree.rcu_fanout_leaf boot parameter, which can be set +to the number of CPUs in a socket, NUMA node, or whatever. +If the number of CPUs is too large, use a fraction of the number of +CPUs. +If the number of CPUs is a large prime number, well, that certainly +is an “interesting” architectural choice! +More flexible arrangements might be considered, but only if +rcutree.rcu_fanout_leaf has proven inadequate, and only +if the inadequacy has been demonstrated by a carefully run and +realistic system-level workload. + +
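+
+For example, on a hypothetical system having 16 CPUs per socket, the
+corresponding kernel boot parameter would be:
+
+ rcutree.rcu_fanout_leaf=16
+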

+Please note that arrangements that require RCU to remap CPU numbers will +require extremely good demonstration of need and full exploration of +alternatives. + +

+There is an embarrassingly large number of flavors of RCU, and this +number has been increasing over time. +Perhaps it will be possible to combine some at some future date. + +

+RCU's various kthreads are reasonably recent additions. +It is quite likely that adjustments will be required to more gracefully +handle extreme loads. +It might also be necessary to be able to relate CPU utilization by +RCU's kthreads and softirq handlers to the code that instigated this +CPU utilization. +For example, RCU callback overhead might be charged back to the +originating call_rcu() instance, though probably not +in production kernels. + +

Summary

+ +

+This document has presented more than two decades' worth of RCU
+requirements.
+Given that the requirements keep changing, this will not be the last
+word on this subject, but at least it serves to get an important
+subset of the requirements set forth.
+

Acknowledgments

+ +I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, +Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and +Andy Lutomirski for their help in rendering +this article human readable, and to Michelle Rankin for her support +of this effort. +Other contributions are acknowledged in the Linux kernel's git archive. + + -- cgit v1.2.3