From 5d1646d90e1f2cceb9f0828f4b28318cd0ec7744 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sat, 27 Apr 2024 12:05:51 +0200 Subject: Adding upstream version 5.10.209. Signed-off-by: Daniel Baumann --- .../Design/Data-Structures/BigTreeClassicRCU.svg | 474 ++ .../BigTreePreemptRCUBHdyntickCB.svg | 662 +++ .../RCU/Design/Data-Structures/Data-Structures.rst | 1163 +++++ .../Design/Data-Structures/HugeTreeClassicRCU.svg | 939 ++++ .../RCU/Design/Data-Structures/TreeLevel.svg | 828 ++++ .../RCU/Design/Data-Structures/TreeMapping.svg | 305 ++ .../Design/Data-Structures/TreeMappingLevel.svg | 380 ++ .../RCU/Design/Data-Structures/blkd_task.svg | 631 +++ .../RCU/Design/Data-Structures/nxtlist.svg | 386 ++ .../Design/Expedited-Grace-Periods/ExpRCUFlow.svg | 830 ++++ .../Expedited-Grace-Periods/ExpSchedFlow.svg | 830 ++++ .../Expedited-Grace-Periods.rst | 521 ++ .../RCU/Design/Expedited-Grace-Periods/Funnel0.svg | 275 ++ .../RCU/Design/Expedited-Grace-Periods/Funnel1.svg | 275 ++ .../RCU/Design/Expedited-Grace-Periods/Funnel2.svg | 287 ++ .../RCU/Design/Expedited-Grace-Periods/Funnel3.svg | 323 ++ .../RCU/Design/Expedited-Grace-Periods/Funnel4.svg | 323 ++ .../RCU/Design/Expedited-Grace-Periods/Funnel5.svg | 335 ++ .../RCU/Design/Expedited-Grace-Periods/Funnel6.svg | 335 ++ .../RCU/Design/Expedited-Grace-Periods/Funnel7.svg | 347 ++ .../RCU/Design/Expedited-Grace-Periods/Funnel8.svg | 311 ++ .../Memory-Ordering/Tree-RCU-Memory-Ordering.rst | 624 +++ .../TreeRCU-callback-invocation.svg | 486 ++ .../Memory-Ordering/TreeRCU-callback-registry.svg | 655 +++ .../RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg | 700 +++ .../Design/Memory-Ordering/TreeRCU-gp-cleanup.svg | 1133 +++++ .../RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg | 1309 +++++ .../Design/Memory-Ordering/TreeRCU-gp-init-1.svg | 658 +++ .../Design/Memory-Ordering/TreeRCU-gp-init-2.svg | 656 +++ .../Design/Memory-Ordering/TreeRCU-gp-init-3.svg | 636 +++ .../RCU/Design/Memory-Ordering/TreeRCU-gp.svg | 5144 ++++++++++++++++++++ .../RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg | 775 +++ .../RCU/Design/Memory-Ordering/TreeRCU-qs.svg | 1095 +++++ .../RCU/Design/Memory-Ordering/rcu_node-lock.svg | 229 + .../Design/Requirements/GPpartitionReaders1.svg | 374 ++ .../Design/Requirements/ReadersPartitionGP1.svg | 639 +++ .../RCU/Design/Requirements/Requirements.rst | 2680 ++++++++++ 37 files changed, 28553 insertions(+) create mode 100644 Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg create mode 100644 Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg create mode 100644 Documentation/RCU/Design/Data-Structures/Data-Structures.rst create mode 100644 Documentation/RCU/Design/Data-Structures/HugeTreeClassicRCU.svg create mode 100644 Documentation/RCU/Design/Data-Structures/TreeLevel.svg create mode 100644 Documentation/RCU/Design/Data-Structures/TreeMapping.svg create mode 100644 Documentation/RCU/Design/Data-Structures/TreeMappingLevel.svg create mode 100644 Documentation/RCU/Design/Data-Structures/blkd_task.svg create mode 100644 Documentation/RCU/Design/Data-Structures/nxtlist.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/ExpRCUFlow.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg create mode 100644 Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg create mode 100644 Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg create mode 100644 Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg create mode 100644 Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg create mode 100644 Documentation/RCU/Design/Requirements/Requirements.rst (limited to 'Documentation/RCU/Design') diff --git a/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg new file mode 100644 index 000000000..727e270b1 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/BigTreeClassicRCU.svg @@ -0,0 +1,474 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + struct + + rcu_data + + CPU 0 + + struct + + rcu_data + + CPU 15 + + struct + + rcu_data + + CPU 1007 + + struct + + rcu_data + + CPU 1023 + + struct rcu_state + + struct + + rcu_node + + rcu_node + + struct + + struct + + rcu_node + + + + + + + + diff --git a/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg b/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg new file mode 100644 index 000000000..3a1a4f85d --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/BigTreePreemptRCUBHdyntickCB.svg @@ -0,0 +1,662 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + struct + + rcu_head + + struct + + rcu_head + + struct + + rcu_head + + rcu_state + + + struct + + rcu_node + + struct + + rcu_node + + rcu_node + + struct + + struct + + rcu_data + + struct + + rcu_data + + struct + + rcu_data + + struct + + rcu_data + + struct rcu_state + + + + + + + + + + + + + + + diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst new file mode 100644 index 000000000..f4efd6897 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst @@ -0,0 +1,1163 @@ +=================================================== +A Tour Through TREE_RCU's Data Structures [LWN.net] +=================================================== + +December 18, 2016 + +This article was contributed by Paul E. McKenney + +Introduction +============ + +This document describes RCU's major data structures and their relationship +to each other. + +Data-Structure Relationships +============================ + +RCU is for all intents and purposes a large state machine, and its +data structures maintain the state in such a way as to allow RCU readers +to execute extremely quickly, while also processing the RCU grace periods +requested by updaters in an efficient and extremely scalable fashion. +The efficiency and scalability of RCU updaters is provided primarily +by a combining tree, as shown below: + +.. kernel-figure:: BigTreeClassicRCU.svg + +This diagram shows an enclosing ``rcu_state`` structure containing a tree +of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up +to 16 ``rcu_data`` structures associated with it, so that there are +``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU. +This structure is adjusted at boot time, if needed, to handle the common +case where ``nr_cpu_ids`` is much less than ``NR_CPUs``. +For example, a number of Linux distributions set ``NR_CPUs=4096``, +which results in a three-level ``rcu_node`` tree. +If the actual hardware has only 16 CPUs, RCU will adjust itself +at boot time, resulting in an ``rcu_node`` tree with only a single node. + +The purpose of this combining tree is to allow per-CPU events +such as quiescent states, dyntick-idle transitions, +and CPU hotplug operations to be processed efficiently +and scalably. +Quiescent states are recorded by the per-CPU ``rcu_data`` structures, +and other events are recorded by the leaf-level ``rcu_node`` +structures. +All of these events are combined at each level of the tree until finally +grace periods are completed at the tree's root ``rcu_node`` +structure. +A grace period can be completed at the root once every CPU +(or, in the case of ``CONFIG_PREEMPT_RCU``, task) +has passed through a quiescent state. +Once a grace period has completed, record of that fact is propagated +back down the tree. + +As can be seen from the diagram, on a 64-bit system +a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout +of 64 at the root and a fanout of 16 at the leaves. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why isn't the fanout at the leaves also 64? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because there are more types of events that affect the leaf-level | +| ``rcu_node`` structures than further up the tree. Therefore, if the | +| leaf ``rcu_node`` structures have fanout of 64, the contention on | +| these structures' ``->structures`` becomes excessive. Experimentation | +| on a wide variety of systems has shown that a fanout of 16 works well | +| for the leaves of the ``rcu_node`` tree. | +| | +| Of course, further experience with systems having hundreds or | +| thousands of CPUs may demonstrate that the fanout for the non-leaf | +| ``rcu_node`` structures must also be reduced. Such reduction can be | +| easily carried out when and if it proves necessary. In the meantime, | +| if you are using such a system and running into contention problems | +| on the non-leaf ``rcu_node`` structures, you may use the | +| ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the | +| non-leaf fanout as needed. | +| | +| Kernels built for systems with strong NUMA characteristics might | +| also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of | +| the ``rcu_node`` structures align with hardware boundaries. | +| However, there has thus far been no need for this. | ++-----------------------------------------------------------------------+ + +If your system has more than 1,024 CPUs (or more than 512 CPUs on a +32-bit system), then RCU will automatically add more levels to the tree. +For example, if you are crazy enough to build a 64-bit system with +65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows: + +.. kernel-figure:: HugeTreeClassicRCU.svg + +RCU currently permits up to a four-level tree, which on a 64-bit system +accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for +32-bit systems. On the other hand, you can set both +``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as +2, which would result in a 16-CPU test using a 4-level tree. This can be +useful for testing large-system capabilities on small test machines. + +This multi-level combining tree allows us to get most of the performance +and scalability benefits of partitioning, even though RCU grace-period +detection is inherently a global operation. The trick here is that only +the last CPU to report a quiescent state into a given ``rcu_node`` +structure need advance to the ``rcu_node`` structure at the next level +up the tree. This means that at the leaf-level ``rcu_node`` structure, +only one access out of sixteen will progress up the tree. For the +internal ``rcu_node`` structures, the situation is even more extreme: +Only one access out of sixty-four will progress up the tree. Because the +vast majority of the CPUs do not progress up the tree, the lock +contention remains roughly constant up the tree. No matter how many CPUs +there are in the system, at most 64 quiescent-state reports per grace +period will progress all the way to the root ``rcu_node`` structure, +thus ensuring that the lock contention on that root ``rcu_node`` +structure remains acceptably low. + +In effect, the combining tree acts like a big shock absorber, keeping +lock contention under control at all tree levels regardless of the level +of loading on the system. + +RCU updaters wait for normal grace periods by registering RCU callbacks, +either directly via ``call_rcu()`` or indirectly via +``synchronize_rcu()`` and friends. RCU callbacks are represented by +``rcu_head`` structures, which are queued on ``rcu_data`` structures +while they are waiting for a grace period to elapse, as shown in the +following figure: + +.. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg + +This figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data +structures are related. Lesser data structures will be introduced with +the algorithms that make use of them. + +Note that each of the data structures in the above figure has its own +synchronization: + +#. Each ``rcu_state`` structures has a lock and a mutex, and some fields + are protected by the corresponding root ``rcu_node`` structure's lock. +#. Each ``rcu_node`` structure has a spinlock. +#. The fields in ``rcu_data`` are private to the corresponding CPU, + although a few can be read and written by other CPUs. + +It is important to note that different data structures can have very +different ideas about the state of RCU at any given time. For but one +example, awareness of the start or end of a given RCU grace period +propagates slowly through the data structures. This slow propagation is +absolutely necessary for RCU to have good read-side performance. If this +balkanized implementation seems foreign to you, one useful trick is to +consider each instance of these data structures to be a different +person, each having the usual slightly different view of reality. + +The general role of each of these data structures is as follows: + +#. ``rcu_state``: This structure forms the interconnection between the + ``rcu_node`` and ``rcu_data`` structures, tracks grace periods, + serves as short-term repository for callbacks orphaned by CPU-hotplug + events, maintains ``rcu_barrier()`` state, tracks expedited + grace-period state, and maintains state used to force quiescent + states when grace periods extend too long, +#. ``rcu_node``: This structure forms the combining tree that propagates + quiescent-state information from the leaves to the root, and also + propagates grace-period information from the root to the leaves. It + provides local copies of the grace-period state in order to allow + this information to be accessed in a synchronized manner without + suffering the scalability limitations that would otherwise be imposed + by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the + lists of tasks that have blocked while in their current RCU read-side + critical section. In ``CONFIG_PREEMPT_RCU`` with + ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node`` + priority-boosting kernel threads (kthreads) and state. Finally, it + records CPU-hotplug state in order to determine which CPUs should be + ignored during a given grace period. +#. ``rcu_data``: This per-CPU structure is the focus of quiescent-state + detection and RCU callback queuing. It also tracks its relationship + to the corresponding leaf ``rcu_node`` structure to allow + more-efficient propagation of quiescent states up the ``rcu_node`` + combining tree. Like the ``rcu_node`` structure, it provides a local + copy of the grace-period information to allow for-free synchronized + access to this information from the corresponding CPU. Finally, this + structure records past dyntick-idle state for the corresponding CPU + and also tracks statistics. +#. ``rcu_head``: This structure represents RCU callbacks, and is the + only structure allocated and managed by RCU users. The ``rcu_head`` + structure is normally embedded within the RCU-protected data + structure. + +If all you wanted from this article was a general notion of how RCU's +data structures are related, you are done. Otherwise, each of the +following sections give more details on the ``rcu_state``, ``rcu_node`` +and ``rcu_data`` data structures. + +The ``rcu_state`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_state`` structure is the base structure that represents the +state of RCU in the system. This structure forms the interconnection +between the ``rcu_node`` and ``rcu_data`` structures, tracks grace +periods, contains the lock used to synchronize with CPU-hotplug events, +and maintains state used to force quiescent states when grace periods +extend too long, + +A few of the ``rcu_state`` structure's fields are discussed, singly and +in groups, in the following sections. The more specialized fields are +covered in the discussion of their use. + +Relationship to rcu_node and rcu_data Structures +'''''''''''''''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 struct rcu_node node[NUM_RCU_NODES]; + 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; + 3 struct rcu_data __percpu *rda; + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! You said that the ``rcu_node`` structures formed a | +| tree, but they are declared as a flat array! What gives? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| The tree is laid out in the array. The first node In the array is the | +| head, the next set of nodes in the array are children of the head | +| node, and so on until the last set of nodes in the array are the | +| leaves. | +| See the following diagrams to see how this works. | ++-----------------------------------------------------------------------+ + +The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown +in the following figure: + +.. kernel-figure:: TreeMapping.svg + +One interesting consequence of this mapping is that a breadth-first +traversal of the tree is implemented as a simple linear scan of the +array, which is in fact what the ``rcu_for_each_node_breadth_first()`` +macro does. This macro is used at the beginning and ends of grace +periods. + +Each entry of the ``->level`` array references the first ``rcu_node`` +structure on the corresponding level of the tree, for example, as shown +below: + +.. kernel-figure:: TreeMappingLevel.svg + +The zero\ :sup:`th` element of the array references the root +``rcu_node`` structure, the first element references the first child of +the root ``rcu_node``, and finally the second element references the +first leaf ``rcu_node`` structure. + +For whatever it is worth, if you draw the tree to be tree-shaped rather +than array-shaped, it is easy to draw a planar representation: + +.. kernel-figure:: TreeLevel.svg + +Finally, the ``->rda`` field references a per-CPU pointer to the +corresponding CPU's ``rcu_data`` structure. + +All of these fields are constant once initialization is complete, and +therefore need no protection. + +Grace-Period Tracking +''''''''''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + +RCU grace periods are numbered, and the ``->gp_seq`` field contains the +current grace-period sequence number. The bottom two bits are the state +of the current grace period, which can be zero for not yet started or +one for in progress. In other words, if the bottom two bits of +``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom +two bits indicates that something is broken. This field is protected by +the root ``rcu_node`` structure's ``->lock`` field. + +There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data`` +structures as well. The fields in the ``rcu_state`` structure represent +the most current value, and those of the other structures are compared +in order to detect the beginnings and ends of grace periods in a +distributed fashion. The values flow from ``rcu_state`` to ``rcu_node`` +(down the tree from the root to the leaves) to ``rcu_data``. + +Miscellaneous +''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 unsigned long gp_max; + 2 char abbr; + 3 char *name; + +The ``->gp_max`` field tracks the duration of the longest grace period +in jiffies. It is protected by the root ``rcu_node``'s ``->lock``. + +The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU +(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”). +These fields are used for diagnostic and tracing purposes. + +The ``rcu_node`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_node`` structures form the combining tree that propagates +quiescent-state information from the leaves to the root and also that +propagates grace-period information from the root down to the leaves. +They provides local copies of the grace-period state in order to allow +this information to be accessed in a synchronized manner without +suffering the scalability limitations that would otherwise be imposed by +global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists +of tasks that have blocked while in their current RCU read-side critical +section. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they +manage the per-\ ``rcu_node`` priority-boosting kernel threads +(kthreads) and state. Finally, they record CPU-hotplug state in order to +determine which CPUs should be ignored during a given grace period. + +The ``rcu_node`` structure's fields are discussed, singly and in groups, +in the following sections. + +Connection to Combining Tree +'''''''''''''''''''''''''''' + +This portion of the ``rcu_node`` structure is declared as follows: + +:: + + 1 struct rcu_node *parent; + 2 u8 level; + 3 u8 grpnum; + 4 unsigned long grpmask; + 5 int grplo; + 6 int grphi; + +The ``->parent`` pointer references the ``rcu_node`` one level up in the +tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation +makes heavy use of this field to push quiescent states up the tree. The +``->level`` field gives the level in the tree, with the root being at +level zero, its children at level one, and so on. The ``->grpnum`` field +gives this node's position within the children of its parent, so this +number can range between 0 and 31 on 32-bit systems and between 0 and 63 +on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only +during initialization and for tracing. The ``->grpmask`` field is the +bitmask counterpart of ``->grpnum``, and therefore always has exactly +one bit set. This mask is used to clear the bit corresponding to this +``rcu_node`` structure in its parent's bitmasks, which are described +later. Finally, the ``->grplo`` and ``->grphi`` fields contain the +lowest and highest numbered CPU served by this ``rcu_node`` structure, +respectively. + +All of these fields are constant, and thus do not require any +synchronization. + +Synchronization +''''''''''''''' + +This field of the ``rcu_node`` structure is declared as follows: + +:: + + 1 raw_spinlock_t lock; + +This field is used to protect the remaining fields in this structure, +unless otherwise stated. That said, all of the fields in this structure +can be accessed without locking for tracing purposes. Yes, this can +result in confusing traces, but better some tracing confusion than to be +heisenbugged out of existence. + +.. _grace-period-tracking-1: + +Grace-Period Tracking +''''''''''''''''''''' + +This portion of the ``rcu_node`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + 2 unsigned long gp_seq_needed; + +The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of +the field of the same name in the ``rcu_state`` structure. They each may +lag up to one step behind their ``rcu_state`` counterpart. If the bottom +two bits of a given ``rcu_node`` structure's ``->gp_seq`` field is zero, +then this ``rcu_node`` structure believes that RCU is idle. + +The ``>gp_seq`` field of each ``rcu_node`` structure is updated at the +beginning and the end of each grace period. + +The ``->gp_seq_needed`` fields record the furthest-in-the-future grace +period request seen by the corresponding ``rcu_node`` structure. The +request is considered fulfilled when the value of the ``->gp_seq`` field +equals or exceeds that of the ``->gp_seq_needed`` field. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Suppose that this ``rcu_node`` structure doesn't see a request for a | +| very long time. Won't wrapping of the ``->gp_seq`` field cause | +| problems? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, because if the ``->gp_seq_needed`` field lags behind the | +| ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at | +| the end of the grace period. Modulo-arithmetic comparisons therefore | +| will always get the correct answer, even with wrapping. | ++-----------------------------------------------------------------------+ + +Quiescent-State Tracking +'''''''''''''''''''''''' + +These fields manage the propagation of quiescent states up the combining +tree. + +This portion of the ``rcu_node`` structure has fields as follows: + +:: + + 1 unsigned long qsmask; + 2 unsigned long expmask; + 3 unsigned long qsmaskinit; + 4 unsigned long expmaskinit; + +The ``->qsmask`` field tracks which of this ``rcu_node`` structure's +children still need to report quiescent states for the current normal +grace period. Such children will have a value of 1 in their +corresponding bit. Note that the leaf ``rcu_node`` structures should be +thought of as having ``rcu_data`` structures as their children. +Similarly, the ``->expmask`` field tracks which of this ``rcu_node`` +structure's children still need to report quiescent states for the +current expedited grace period. An expedited grace period has the same +conceptual properties as a normal grace period, but the expedited +implementation accepts extreme CPU overhead to obtain much lower +grace-period latency, for example, consuming a few tens of microseconds +worth of CPU time to reduce grace-period duration from milliseconds to +tens of microseconds. The ``->qsmaskinit`` field tracks which of this +``rcu_node`` structure's children cover for at least one online CPU. +This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is +used to initialize ``->expmask`` and the beginning of the normal and +expedited grace periods, respectively. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why are these bitmasks protected by locking? Come on, haven't you | +| heard of atomic instructions??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Lockless grace-period computation! Such a tantalizing possibility! | +| But consider the following sequence of events: | +| | +| #. CPU 0 has been in dyntick-idle mode for quite some time. When it | +| wakes up, it notices that the current RCU grace period needs it to | +| report in, so it sets a flag where the scheduling clock interrupt | +| will find it. | +| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and | +| notices that CPU 0 has been in dyntick idle mode, which qualifies | +| as an extended quiescent state. | +| #. CPU 0's scheduling clock interrupt fires in the middle of an RCU | +| read-side critical section, and notices that the RCU core needs | +| something, so commences RCU softirq processing. | +| #. CPU 0's softirq handler executes and is just about ready to report | +| its quiescent state up the ``rcu_node`` tree. | +| #. But CPU 1 beats it to the punch, completing the current grace | +| period and starting a new one. | +| #. CPU 0 now reports its quiescent state for the wrong grace period. | +| That grace period might now end before the RCU read-side critical | +| section. If that happens, disaster will ensue. | +| | +| So the locking is absolutely required in order to coordinate clearing | +| of the bits with updating of the grace-period sequence number in | +| ``->gp_seq``. | ++-----------------------------------------------------------------------+ + +Blocked-Task Management +''''''''''''''''''''''' + +``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU +read-side critical sections, and these tasks must be tracked explicitly. +The details of exactly why and how they are tracked will be covered in a +separate article on RCU read-side processing. For now, it is enough to +know that the ``rcu_node`` structure tracks them. + +:: + + 1 struct list_head blkd_tasks; + 2 struct list_head *gp_tasks; + 3 struct list_head *exp_tasks; + 4 bool wait_blkd_tasks; + +The ``->blkd_tasks`` field is a list header for the list of blocked and +preempted tasks. As tasks undergo context switches within RCU read-side +critical sections, their ``task_struct`` structures are enqueued (via +the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the +``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding +to the CPU on which the outgoing context switch executed. As these tasks +later exit their RCU read-side critical sections, they remove themselves +from the list. This list is therefore in reverse time order, so that if +one of the tasks is blocking the current grace period, all subsequent +tasks must also be blocking that same grace period. Therefore, a single +pointer into this list suffices to track all tasks blocking a given +grace period. That pointer is stored in ``->gp_tasks`` for normal grace +periods and in ``->exp_tasks`` for expedited grace periods. These last +two fields are ``NULL`` if either there is no grace period in flight or +if there are no blocked tasks preventing that grace period from +completing. If either of these two pointers is referencing a task that +removes itself from the ``->blkd_tasks`` list, then that task must +advance the pointer to the next task on the list, or set the pointer to +``NULL`` if there are no subsequent tasks on the list. + +For example, suppose that tasks T1, T2, and T3 are all hard-affinitied +to the largest-numbered CPU in the system. Then if task T1 blocked in an +RCU read-side critical section, then an expedited grace period started, +then task T2 blocked in an RCU read-side critical section, then a normal +grace period started, and finally task 3 blocked in an RCU read-side +critical section, then the state of the last leaf ``rcu_node`` +structure's blocked-task list would be as shown below: + +.. kernel-figure:: blkd_task.svg + +Task T1 is blocking both grace periods, task T2 is blocking only the +normal grace period, and task T3 is blocking neither grace period. Note +that these tasks will not remove themselves from this list immediately +upon resuming execution. They will instead remain on the list until they +execute the outermost ``rcu_read_unlock()`` that ends their RCU +read-side critical section. + +The ``->wait_blkd_tasks`` field indicates whether or not the current +grace period is waiting on a blocked task. + +Sizing the ``rcu_node`` Array +''''''''''''''''''''''''''''' + +The ``rcu_node`` array is sized via a series of C-preprocessor +expressions as follows: + +:: + + 1 #ifdef CONFIG_RCU_FANOUT + 2 #define RCU_FANOUT CONFIG_RCU_FANOUT + 3 #else + 4 # ifdef CONFIG_64BIT + 5 # define RCU_FANOUT 64 + 6 # else + 7 # define RCU_FANOUT 32 + 8 # endif + 9 #endif + 10 + 11 #ifdef CONFIG_RCU_FANOUT_LEAF + 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF + 13 #else + 14 # ifdef CONFIG_64BIT + 15 # define RCU_FANOUT_LEAF 64 + 16 # else + 17 # define RCU_FANOUT_LEAF 32 + 18 # endif + 19 #endif + 20 + 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) + 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) + 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) + 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) + 25 + 26 #if NR_CPUS <= RCU_FANOUT_1 + 27 # define RCU_NUM_LVLS 1 + 28 # define NUM_RCU_LVL_0 1 + 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 + 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } + 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } + 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } + 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } + 34 #elif NR_CPUS <= RCU_FANOUT_2 + 35 # define RCU_NUM_LVLS 2 + 36 # define NUM_RCU_LVL_0 1 + 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) + 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } + 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } + 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } + 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } + 43 #elif NR_CPUS <= RCU_FANOUT_3 + 44 # define RCU_NUM_LVLS 3 + 45 # define NUM_RCU_LVL_0 1 + 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) + 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) + 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } + 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } + 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } + 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } + 53 #elif NR_CPUS <= RCU_FANOUT_4 + 54 # define RCU_NUM_LVLS 4 + 55 # define NUM_RCU_LVL_0 1 + 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) + 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) + 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) + 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } + 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } + 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } + 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } + 64 #else + 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" + 66 #endif + +The maximum number of levels in the ``rcu_node`` structure is currently +limited to four, as specified by lines 21-24 and the structure of the +subsequent “if” statement. For 32-bit systems, this allows +16*32*32*32=524,288 CPUs, which should be sufficient for the next few +years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is +allowed, which should see us through the next decade or so. This +four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8`` +to support up to 4096 CPUs, which might be useful in very large systems +having eight CPUs per socket (but please note that no one has yet shown +any measurable performance degradation due to misaligned socket and +``rcu_node`` boundaries). In addition, building kernels with a full four +levels of ``rcu_node`` tree permits better testing of RCU's +combining-tree code. + +The ``RCU_FANOUT`` symbol controls how many children are permitted at +each non-leaf level of the ``rcu_node`` tree. If the +``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based +on the word size of the system, which is also the Kconfig default. + +The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by +each leaf ``rcu_node`` structure. Experience has shown that allowing a +given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the +number of bits in the ``->qsmask`` field on a 64-bit system, results in +excessive contention for the leaf ``rcu_node`` structures' ``->lock`` +fields. The number of CPUs per leaf ``rcu_node`` structure is therefore +limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If +``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based +on the word size of the system, just as for ``CONFIG_RCU_FANOUT``. +Lines 11-19 perform this computation. + +Lines 21-24 compute the maximum number of CPUs supported by a +single-level (which contains a single ``rcu_node`` structure), +two-level, three-level, and four-level ``rcu_node`` tree, respectively, +given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``. +These numbers of CPUs are retained in the ``RCU_FANOUT_1``, +``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor +variables, respectively. + +These variables are used to control the C-preprocessor ``#if`` statement +spanning lines 26-66 that computes the number of ``rcu_node`` structures +required for each level of the tree, as well as the number of levels +required. The number of levels is placed in the ``NUM_RCU_LVLS`` +C-preprocessor variable by lines 27, 35, 44, and 54. The number of +``rcu_node`` structures for the topmost level of the tree is always +exactly one, and this value is unconditionally placed into +``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels +(if any) of the ``rcu_node`` tree are computed by dividing the maximum +number of CPUs by the fanout supported by the number of levels from the +current level down, rounding up. This computation is performed by +lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create +initializers for lockdep lock-class names. Finally, lines 64-66 produce +an error if the maximum number of CPUs is too large for the specified +fanout. + +The ``rcu_segcblist`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_segcblist`` structure maintains a segmented list of callbacks +as follows: + +:: + + 1 #define RCU_DONE_TAIL 0 + 2 #define RCU_WAIT_TAIL 1 + 3 #define RCU_NEXT_READY_TAIL 2 + 4 #define RCU_NEXT_TAIL 3 + 5 #define RCU_CBLIST_NSEGS 4 + 6 + 7 struct rcu_segcblist { + 8 struct rcu_head *head; + 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; + 10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; + 11 long len; + 12 long len_lazy; + 13 }; + +The segments are as follows: + +#. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These + callbacks are ready to be invoked. +#. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace + period. Note that different CPUs can have different ideas about which + grace period is current, hence the ``->gp_seq`` field. +#. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period + to start. +#. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a + grace period. + +The ``->head`` pointer references the first callback or is ``NULL`` if +the list contains no callbacks (which is *not* the same as being empty). +Each element of the ``->tails[]`` array references the ``->next`` +pointer of the last callback in the corresponding segment of the list, +or the list's ``->head`` pointer if that segment and all previous +segments are empty. If the corresponding segment is empty but some +previous segment is not empty, then the array element is identical to +its predecessor. Older callbacks are closer to the head of the list, and +new callbacks are added at the tail. This relationship between the +``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown +in this diagram: + +.. kernel-figure:: nxtlist.svg + +In this figure, the ``->head`` pointer references the first RCU callback +in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the +``->head`` pointer itself, indicating that none of the callbacks is +ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references +callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2 +are both waiting on the current grace period, give or take possible +disagreements about exactly which grace period is the current one. The +``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU +callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that +there are no callbacks waiting on the next RCU grace period. The +``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next`` +pointer, indicating that all the remaining RCU callbacks have not yet +been assigned to an RCU grace period. Note that the +``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU +callback's ``->next`` pointer unless the callback list is empty, in +which case it references the ``->head`` pointer. + +There is one additional important special case for the +``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this +list is *disabled*. Lists are disabled when the corresponding CPU is +offline or when the corresponding CPU's callbacks are offloaded to a +kthread, both of which are described elsewhere. + +CPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the +``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the +``RCU_DONE_TAIL`` list segments as grace periods advance. + +The ``->gp_seq[]`` array records grace-period numbers corresponding to +the list segments. This is what allows different CPUs to have different +ideas as to which is the current grace period while still avoiding +premature invocation of their callbacks. In particular, this allows CPUs +that go idle for extended periods to determine which of their callbacks +are ready to be invoked after reawakening. + +The ``->len`` counter contains the number of callbacks in ``->head``, +and the ``->len_lazy`` contains the number of those callbacks that are +known to only free memory, and whose invocation can therefore be safely +deferred. + +.. important:: + + It is the ``->len`` field that determines whether or + not there are callbacks associated with this ``rcu_segcblist`` + structure, *not* the ``->head`` pointer. The reason for this is that all + the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL`` + segment) are extracted all at once at callback-invocation time + (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there + are no not-done callbacks remaining in the ``rcu_segcblist``. If + callback invocation must be postponed, for example, because a + high-priority process just woke up on this CPU, then the remaining + callbacks are placed back on the ``RCU_DONE_TAIL`` segment and + ``->head`` once again points to the start of the segment. In short, the + head field can briefly be ``NULL`` even though the CPU has callbacks + present the entire time. Therefore, it is not appropriate to test the + ``->head`` pointer for ``NULL``. + +In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only +after the corresponding callbacks have been invoked. This means that the +``->len`` count is zero only if the ``rcu_segcblist`` structure really +is devoid of callbacks. Of course, off-CPU sampling of the ``->len`` +count requires careful use of appropriate synchronization, for example, +memory barriers. This synchronization can be a bit subtle, particularly +in the case of ``rcu_barrier()``. + +The ``rcu_data`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The +fields in this structure may be accessed only from the corresponding CPU +(and from tracing) unless otherwise stated. This structure is the focus +of quiescent-state detection and RCU callback queuing. It also tracks +its relationship to the corresponding leaf ``rcu_node`` structure to +allow more-efficient propagation of quiescent states up the ``rcu_node`` +combining tree. Like the ``rcu_node`` structure, it provides a local +copy of the grace-period information to allow for-free synchronized +access to this information from the corresponding CPU. Finally, this +structure records past dyntick-idle state for the corresponding CPU and +also tracks statistics. + +The ``rcu_data`` structure's fields are discussed, singly and in groups, +in the following sections. + +Connection to Other Data Structures +''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 int cpu; + 2 struct rcu_node *mynode; + 3 unsigned long grpmask; + 4 bool beenonline; + +The ``->cpu`` field contains the number of the corresponding CPU and the +``->mynode`` field references the corresponding ``rcu_node`` structure. +The ``->mynode`` is used to propagate quiescent states up the combining +tree. These two fields are constant and therefore do not require +synchronization. + +The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask`` +corresponding to this ``rcu_data`` structure, and is also used when +propagating quiescent states. The ``->beenonline`` flag is set whenever +the corresponding CPU comes online, which means that the debugfs tracing +need not dump out any ``rcu_data`` structure for which this flag is not +set. + +Quiescent-State and Grace-Period Tracking +''''''''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + 2 unsigned long gp_seq_needed; + 3 bool cpu_no_qs; + 4 bool core_needs_qs; + 5 bool gpwrap; + +The ``->gp_seq`` field is the counterpart of the field of the same name +in the ``rcu_state`` and ``rcu_node`` structures. The +``->gp_seq_needed`` field is the counterpart of the field of the same +name in the rcu_node structure. They may each lag up to one behind their +``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and +``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in +dyntick-idle mode (but these counters will catch up upon exit from +dyntick-idle mode). If the lower two bits of a given ``rcu_data`` +structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure +believes that RCU is idle. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| All this replication of the grace period numbers can only cause | +| massive confusion. Why not just keep a global sequence number and be | +| done with it??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because if there was only a single global sequence numbers, there | +| would need to be a single global lock to allow safely accessing and | +| updating it. And if we are not going to have a single global lock, we | +| need to carefully manage the numbers on a per-node basis. Recall from | +| the answer to a previous Quick Quiz that the consequences of applying | +| a previously sampled quiescent state to the wrong grace period are | +| quite severe. | ++-----------------------------------------------------------------------+ + +The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed +through a quiescent state, while the ``->core_needs_qs`` flag indicates +that the RCU core needs a quiescent state from the corresponding CPU. +The ``->gpwrap`` field indicates that the corresponding CPU has remained +idle for so long that the ``gp_seq`` counter is in danger of overflow, +which will cause the CPU to disregard the values of its counters on its +next exit from idle. + +RCU Callback Handling +''''''''''''''''''''' + +In the absence of CPU-hotplug events, RCU callbacks are invoked by the +same CPU that registered them. This is strictly a cache-locality +optimization: callbacks can and do get invoked on CPUs other than the +one that registered them. After all, if the CPU that registered a given +callback has gone offline before the callback can be invoked, there +really is no other choice. + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 struct rcu_segcblist cblist; + 2 long qlen_last_fqs_check; + 3 unsigned long n_cbs_invoked; + 4 unsigned long n_nocbs_invoked; + 5 unsigned long n_cbs_orphaned; + 6 unsigned long n_cbs_adopted; + 7 unsigned long n_force_qs_snap; + 8 long blimit; + +The ``->cblist`` structure is the segmented callback list described +earlier. The CPU advances the callbacks in its ``rcu_data`` structure +whenever it notices that another RCU grace period has completed. The CPU +detects the completion of an RCU grace period by noticing that the value +of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of +its leaf ``rcu_node`` structure. Recall that each ``rcu_node`` +structure's ``->gp_seq`` field is updated at the beginnings and ends of +each grace period. + +The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the +forcing of quiescent states from ``call_rcu()`` and friends when +callback lists grow excessively long. + +The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted`` +fields count the number of callbacks invoked, sent to other CPUs when +this CPU goes offline, and received from other CPUs when those other +CPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's +callbacks are offloaded to a kthread. + +Finally, the ``->blimit`` counter is the maximum number of RCU callbacks +that may be invoked at a given time. + +Dyntick-Idle Handling +''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 int dynticks_snap; + 2 unsigned long dynticks_fqs; + +The ``->dynticks_snap`` field is used to take a snapshot of the +corresponding CPU's dyntick-idle state when forcing quiescent states, +and is therefore accessed from other CPUs. Finally, the +``->dynticks_fqs`` field is used to count the number of times this CPU +is determined to be in dyntick-idle state, and is used for tracing and +debugging purposes. + +This portion of the rcu_data structure is declared as follows: + +:: + + 1 long dynticks_nesting; + 2 long dynticks_nmi_nesting; + 3 atomic_t dynticks; + 4 bool rcu_need_heavy_qs; + 5 bool rcu_urgent_qs; + +These fields in the rcu_data structure maintain the per-CPU dyntick-idle +state for the corresponding CPU. The fields may be accessed only from +the corresponding CPU (and from tracing) unless otherwise stated. + +The ``->dynticks_nesting`` field counts the nesting depth of process +execution, so that in normal circumstances this counter has value zero +or one. NMIs, irqs, and tracers are counted by the +``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes +to this variable have to be undertaken carefully using an algorithm +provided by Andy Lutomirski. The initial transition from idle adds one, +and nested transitions add two, so that a nesting level of five is +represented by a ``->dynticks_nmi_nesting`` value of nine. This counter +can therefore be thought of as counting the number of reasons why this +CPU cannot be permitted to enter dyntick-idle mode, aside from +process-level transitions. + +However, it turns out that when running in non-idle kernel context, the +Linux kernel is fully capable of entering interrupt handlers that never +exit and perhaps also vice versa. Therefore, whenever the +``->dynticks_nesting`` field is incremented up from zero, the +``->dynticks_nmi_nesting`` field is set to a large positive number, and +whenever the ``->dynticks_nesting`` field is decremented down to zero, +the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that +the number of misnested interrupts is not sufficient to overflow the +counter, this approach corrects the ``->dynticks_nmi_nesting`` field +every time the corresponding CPU enters the idle loop from process +context. + +The ``->dynticks`` field counts the corresponding CPU's transitions to +and from either dyntick-idle or user mode, so that this counter has an +even value when the CPU is in dyntick-idle mode or user mode and an odd +value otherwise. The transitions to/from user mode need to be counted +for user mode adaptive-ticks support (see timers/NO_HZ.txt). + +The ``->rcu_need_heavy_qs`` field is used to record the fact that the +RCU core code would really like to see a quiescent state from the +corresponding CPU, so much so that it is willing to call for +heavy-weight dyntick-counter operations. This flag is checked by RCU's +context-switch and ``cond_resched()`` code, which provide a momentary +idle sojourn in response. + +Finally, the ``->rcu_urgent_qs`` field is used to record the fact that +the RCU core code would really like to see a quiescent state from the +corresponding CPU, with the various other fields indicating just how +badly RCU wants this quiescent state. This flag is checked by RCU's +context-switch path (``rcu_note_context_switch``) and the cond_resched +code. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why not simply combine the ``->dynticks_nesting`` and | +| ``->dynticks_nmi_nesting`` counters into a single counter that just | +| counts the number of reasons that the corresponding CPU is non-idle? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because this would fail in the presence of interrupts whose handlers | +| never return and of handlers that manage to return from a made-up | +| interrupt. | ++-----------------------------------------------------------------------+ + +Additional fields are present for some special-purpose builds, and are +discussed separately. + +The ``rcu_head`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Each ``rcu_head`` structure represents an RCU callback. These structures +are normally embedded within RCU-protected data structures whose +algorithms use asynchronous grace periods. In contrast, when using +algorithms that block waiting for RCU grace periods, RCU users need not +provide ``rcu_head`` structures. + +The ``rcu_head`` structure has fields as follows: + +:: + + 1 struct rcu_head *next; + 2 void (*func)(struct rcu_head *head); + +The ``->next`` field is used to link the ``rcu_head`` structures +together in the lists within the ``rcu_data`` structures. The ``->func`` +field is a pointer to the function to be called when the callback is +ready to be invoked, and this function is passed a pointer to the +``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func`` +field to record the offset of the ``rcu_head`` structure within the +enclosing RCU-protected data structure. + +Both of these fields are used internally by RCU. From the viewpoint of +RCU users, this structure is an opaque “cookie”. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Given that the callback function ``->func`` is passed a pointer to | +| the ``rcu_head`` structure, how is that function supposed to find the | +| beginning of the enclosing RCU-protected data structure? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In actual practice, there is a separate callback function per type of | +| RCU-protected data structure. The callback function can therefore use | +| the ``container_of()`` macro in the Linux kernel (or other | +| pointer-manipulation facilities in other software environments) to | +| find the beginning of the enclosing structure. | ++-----------------------------------------------------------------------+ + +RCU-Specific Fields in the ``task_struct`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in +the ``task_struct`` structure: + +:: + + 1 #ifdef CONFIG_PREEMPT_RCU + 2 int rcu_read_lock_nesting; + 3 union rcu_special rcu_read_unlock_special; + 4 struct list_head rcu_node_entry; + 5 struct rcu_node *rcu_blocked_node; + 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ + 7 #ifdef CONFIG_TASKS_RCU + 8 unsigned long rcu_tasks_nvcsw; + 9 bool rcu_tasks_holdout; + 10 struct list_head rcu_tasks_holdout_list; + 11 int rcu_tasks_idle_cpu; + 12 #endif /* #ifdef CONFIG_TASKS_RCU */ + +The ``->rcu_read_lock_nesting`` field records the nesting level for RCU +read-side critical sections, and the ``->rcu_read_unlock_special`` field +is a bitmask that records special conditions that require +``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry`` +field is used to form lists of tasks that have blocked within +preemptible-RCU read-side critical sections and the +``->rcu_blocked_node`` field references the ``rcu_node`` structure whose +list this task is a member of, or ``NULL`` if it is not blocked within a +preemptible-RCU read-side critical section. + +The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context +switches that this task had undergone at the beginning of the current +tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current +tasks-RCU grace period is waiting on this task, +``->rcu_tasks_holdout_list`` is a list element enqueuing this task on +the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this +idle task is running, but only if the task is currently running, that +is, if the CPU is currently idle. + +Accessor Functions +~~~~~~~~~~~~~~~~~~ + +The following listing shows the ``rcu_get_root()``, +``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()`` +function and macros: + +:: + + 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) + 2 { + 3 return &rsp->node[0]; + 4 } + 5 + 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ + 7 for ((rnp) = &(rsp)->node[0]; \ + 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) + 9 + 10 #define rcu_for_each_leaf_node(rsp, rnp) \ + 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ + 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) + +The ``rcu_get_root()`` simply returns a pointer to the first element of +the specified ``rcu_state`` structure's ``->node[]`` array, which is the +root ``rcu_node`` structure. + +As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes +advantage of the layout of the ``rcu_node`` structures in the +``rcu_state`` structure's ``->node[]`` array, performing a breadth-first +traversal by simply traversing the array in order. Similarly, the +``rcu_for_each_leaf_node()`` macro traverses only the last part of the +array, thus traversing only the leaf ``rcu_node`` structures. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree | +| contains only a single node? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In the single-node case, ``rcu_for_each_leaf_node()`` traverses the | +| single node. | ++-----------------------------------------------------------------------+ + +Summary +~~~~~~~ + +So the state of RCU is represented by an ``rcu_state`` structure, which +contains a combining tree of ``rcu_node`` and ``rcu_data`` structures. +Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state +is tracked by dynticks-related fields in the ``rcu_data`` structure. If +you made it this far, you are well prepared to read the code +walkthroughs in the other articles in this series. + +Acknowledgments +~~~~~~~~~~~~~~~ + +I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul +Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for +helping me get this document into a more human-readable state. + +Legal Statement +~~~~~~~~~~~~~~~ + +This work represents the view of the author and does not necessarily +represent the view of IBM. + +Linux is a registered trademark of Linus Torvalds. + +Other company, product, and service names may be trademarks or service +marks of others. diff --git a/Documentation/RCU/Design/Data-Structures/HugeTreeClassicRCU.svg b/Documentation/RCU/Design/Data-Structures/HugeTreeClassicRCU.svg new file mode 100644 index 000000000..2bf12b468 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/HugeTreeClassicRCU.svg @@ -0,0 +1,939 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_node + + struct + + struct + + rcu_node + + struct + + rcu_node + + rcu_node + + struct + + rcu_node + + struct + + struct + + rcu_node + + CPU 0 + + struct + + rcu_data + + CPU 15 + + struct + + rcu_data + + struct + + rcu_data + + CPU 21823 + + CPU 21839 + + rcu_data + + struct + + struct + + rcu_data + + CPU 43679 + + CPU 43695 + + rcu_data + + struct + + struct + + rcu_data + + CPU 65519 + + CPU 65535 + + rcu_data + + struct + + struct rcu_state + + struct + + rcu_node + + diff --git a/Documentation/RCU/Design/Data-Structures/TreeLevel.svg b/Documentation/RCU/Design/Data-Structures/TreeLevel.svg new file mode 100644 index 000000000..7a7eb3bac --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/TreeLevel.svg @@ -0,0 +1,828 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_node + + struct + + struct + + rcu_node + + struct + + rcu_node + + rcu_node + + struct + + rcu_node + + struct + + struct + + rcu_node + + ->level[0] + + ->level[1] + + ->level[2] + + struct + + rcu_node + + CPU 15 + + CPU 0 + + CPU 65535 + + CPU 65519 + + CPU 43695 + + CPU 43679 + + CPU 21839 + + CPU 21823 + + struct rcu_state + + diff --git a/Documentation/RCU/Design/Data-Structures/TreeMapping.svg b/Documentation/RCU/Design/Data-Structures/TreeMapping.svg new file mode 100644 index 000000000..729cfa9e6 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/TreeMapping.svg @@ -0,0 +1,305 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 0:7 + + 4:7 + + 0:1 + + 2:3 + + 4:5 + + 6:7 + + 0:3 + + struct rcu_state + + diff --git a/Documentation/RCU/Design/Data-Structures/TreeMappingLevel.svg b/Documentation/RCU/Design/Data-Structures/TreeMappingLevel.svg new file mode 100644 index 000000000..5b416a4b8 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/TreeMappingLevel.svg @@ -0,0 +1,380 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->level[0] + + ->level[1] + + ->level[2] + + 0:7 + + 4:7 + + 0:1 + + 2:3 + + 4:5 + + 6:7 + + 0:3 + + struct rcu_state + + + + + + + + + diff --git a/Documentation/RCU/Design/Data-Structures/blkd_task.svg b/Documentation/RCU/Design/Data-Structures/blkd_task.svg new file mode 100644 index 000000000..bed13e9ec --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/blkd_task.svg @@ -0,0 +1,631 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + struct + + rcu_node + + struct + + rcu_node + + struct + + rcu_data + + struct + + rcu_data + + struct + + rcu_data + + struct + + rcu_data + + struct rcu_state + + + + + + + + + + rcu_state + + T3 + + T2 + + T1 + + + + + + + + + + + + + rcu_node + + struct + + blkd_tasks + + gp_tasks + + exp_tasks + + diff --git a/Documentation/RCU/Design/Data-Structures/nxtlist.svg b/Documentation/RCU/Design/Data-Structures/nxtlist.svg new file mode 100644 index 000000000..0223e79c3 --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/nxtlist.svg @@ -0,0 +1,386 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->head + + ->tails[RCU_DONE_TAIL] + + ->tails[RCU_WAIT_TAIL] + + ->tails[RCU_NEXT_READY_TAIL] + + ->tails[RCU_NEXT_TAIL] + + CB 1 + + next + + CB 3 + + next + + CB 4 + + next + + CB 2 + + next + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/ExpRCUFlow.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpRCUFlow.svg new file mode 100644 index 000000000..7c6c90bd0 --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpRCUFlow.svg @@ -0,0 +1,830 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Idle oroffline? + + + + CPU N Start + + + + Y + + + Done + + + + Send IPI to CPU N + + + + + In RCUread-sidecriticalsection? + + N + + + + IPI Handler + + + + N + + + Report CPUQuiescentState + + + Y + + + rcu_read_unlock() + + + + + + Context Switch + + + + + Report CPUand TaskQuiescent States + + + + Enqueue Task + + + + + + Report CPUQuiescentState + + + + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg new file mode 100644 index 000000000..6189ffcc6 --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/ExpSchedFlow.svg @@ -0,0 +1,830 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Idle oroffline? + + + + CPU N Start + + + + Y + + + Done + + + + Send IPI to CPU N + + + + + CPU idle? + + N + + + + IPI Handler + + + + Y + + + Report CPUQuiescentState + + + N + + + Context Switch + + + + + + CPU Offline + + + + + Report CPUQuiescentState + + + + + Report CPUQuiescentState + + + + + + Requestcontext switch + + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst new file mode 100644 index 000000000..72f0f6fbd --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst @@ -0,0 +1,521 @@ +================================================= +A Tour Through TREE_RCU's Expedited Grace Periods +================================================= + +Introduction +============ + +This document describes RCU's expedited grace periods. +Unlike RCU's normal grace periods, which accept long latencies to attain +high efficiency and minimal disturbance, expedited grace periods accept +lower efficiency and significant disturbance to attain shorter latencies. + +There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier +third RCU-bh flavor having been implemented in terms of the other two. +Each of the two implementations is covered in its own section. + +Expedited Grace Period Design +============================= + +The expedited RCU grace periods cannot be accused of being subtle, +given that they for all intents and purposes hammer every CPU that +has not yet provided a quiescent state for the current expedited +grace period. +The one saving grace is that the hammer has grown a bit smaller +over time: The old call to ``try_stop_cpus()`` has been +replaced with a set of calls to ``smp_call_function_single()``, +each of which results in an IPI to the target CPU. +The corresponding handler function checks the CPU's state, motivating +a faster quiescent state where possible, and triggering a report +of that quiescent state. +As always for RCU, once everything has spent some time in a quiescent +state, the expedited grace period has completed. + +The details of the ``smp_call_function_single()`` handler's +operation depend on the RCU flavor, as described in the following +sections. + +RCU-preempt Expedited Grace Periods +=================================== + +``CONFIG_PREEMPT=y`` kernels implement RCU-preempt. +The overall flow of the handling of a given CPU by an RCU-preempt +expedited grace period is shown in the following diagram: + +.. kernel-figure:: ExpRCUFlow.svg + +The solid arrows denote direct action, for example, a function call. +The dotted arrows denote indirect action, for example, an IPI +or a state that is reached after some time. + +If a given CPU is offline or idle, ``synchronize_rcu_expedited()`` +will ignore it because idle and offline CPUs are already residing +in quiescent states. +Otherwise, the expedited grace period will use +``smp_call_function_single()`` to send the CPU an IPI, which +is handled by ``rcu_exp_handler()``. + +However, because this is preemptible RCU, ``rcu_exp_handler()`` +can check to see if the CPU is currently running in an RCU read-side +critical section. +If not, the handler can immediately report a quiescent state. +Otherwise, it sets flags so that the outermost ``rcu_read_unlock()`` +invocation will provide the needed quiescent-state report. +This flag-setting avoids the previous forced preemption of all +CPUs that might have RCU read-side critical sections. +In addition, this flag-setting is done so as to avoid increasing +the overhead of the common-case fastpath through the scheduler. + +Again because this is preemptible RCU, an RCU read-side critical section +can be preempted. +When that happens, RCU will enqueue the task, which will the continue to +block the current expedited grace period until it resumes and finds its +outermost ``rcu_read_unlock()``. +The CPU will report a quiescent state just after enqueuing the task because +the CPU is no longer blocking the grace period. +It is instead the preempted task doing the blocking. +The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``, +which is called from ``rcu_preempt_note_context_switch()``, which +in turn is called from ``rcu_note_context_switch()``, which in +turn is called from the scheduler. + + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why not just have the expedited grace period check the state of all | +| the CPUs? After all, that would avoid all those real-time-unfriendly | +| IPIs. | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because we want the RCU read-side critical sections to run fast, | +| which means no memory barriers. Therefore, it is not possible to | +| safely check the state from some other CPU. And even if it was | +| possible to safely check the state, it would still be necessary to | +| IPI the CPU to safely interact with the upcoming | +| ``rcu_read_unlock()`` invocation, which means that the remote state | +| testing would not help the worst-case latency that real-time | +| applications care about. | +| | +| One way to prevent your real-time application from getting hit with | +| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU | +| would then perceive the CPU running your application as being idle, | +| and it would be able to safely detect that state without needing to | +| IPI the CPU. | ++-----------------------------------------------------------------------+ + +Please note that this is just the overall flow: Additional complications +can arise due to races with CPUs going idle or offline, among other +things. + +RCU-sched Expedited Grace Periods +--------------------------------- + +``CONFIG_PREEMPT=n`` kernels implement RCU-sched. The overall flow of +the handling of a given CPU by an RCU-sched expedited grace period is +shown in the following diagram: + +.. kernel-figure:: ExpSchedFlow.svg + +As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores +offline and idle CPUs, again because they are in remotely detectable +quiescent states. However, because the ``rcu_read_lock_sched()`` and +``rcu_read_unlock_sched()`` leave no trace of their invocation, in +general it is not possible to tell whether or not the current CPU is in +an RCU read-side critical section. The best that RCU-sched's +``rcu_exp_handler()`` can do is to check for idle, on the off-chance +that the CPU went idle while the IPI was in flight. If the CPU is idle, +then ``rcu_exp_handler()`` reports the quiescent state. + +Otherwise, the handler forces a future context switch by setting the +NEED_RESCHED flag of the current task's thread flag and the CPU preempt +counter. At the time of the context switch, the CPU reports the +quiescent state. Should the CPU go offline first, it will report the +quiescent state at that time. + +Expedited Grace Period and CPU Hotplug +-------------------------------------- + +The expedited nature of expedited grace periods require a much tighter +interaction with CPU hotplug operations than is required for normal +grace periods. In addition, attempting to IPI offline CPUs will result +in splats, but failing to IPI online CPUs can result in too-short grace +periods. Neither option is acceptable in production kernels. + +The interaction between expedited grace periods and CPU hotplug +operations is carried out at several levels: + +#. The number of CPUs that have ever been online is tracked by the + ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state`` + structure's ``->ncpus_snap`` field tracks the number of CPUs that + have ever been online at the beginning of an RCU expedited grace + period. Note that this number never decreases, at least in the + absence of a time machine. +#. The identities of the CPUs that have ever been online is tracked by + the ``rcu_node`` structure's ``->expmaskinitnext`` field. The + ``rcu_node`` structure's ``->expmaskinit`` field tracks the + identities of the CPUs that were online at least once at the + beginning of the most recent RCU expedited grace period. The + ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are + used to detect when new CPUs have come online for the first time, + that is, when the ``rcu_node`` structure's ``->expmaskinitnext`` + field has changed since the beginning of the last RCU expedited grace + period, which triggers an update of each ``rcu_node`` structure's + ``->expmaskinit`` field from its ``->expmaskinitnext`` field. +#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to + initialize that structure's ``->expmask`` at the beginning of each + RCU expedited grace period. This means that only those CPUs that have + been online at least once will be considered for a given grace + period. +#. Any CPU that goes offline will clear its bit in its leaf ``rcu_node`` + structure's ``->qsmaskinitnext`` field, so any CPU with that bit + clear can safely be ignored. However, it is possible for a CPU coming + online or going offline to have this bit set for some time while + ``cpu_online`` returns ``false``. +#. For each non-idle CPU that RCU believes is currently online, the + grace period invokes ``smp_call_function_single()``. If this + succeeds, the CPU was fully online. Failure indicates that the CPU is + in the process of coming online or going offline, in which case it is + necessary to wait for a short time period and try again. The purpose + of this wait (or series of waits, as the case may be) is to permit a + concurrent CPU-hotplug operation to complete. +#. In the case of RCU-sched, one of the last acts of an outgoing CPU is + to invoke ``rcu_report_dead()``, which reports a quiescent state for + that CPU. However, this is likely paranoia-induced redundancy. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why all the dancing around with multiple counters and masks tracking | +| CPUs that were once online? Why not just have a single set of masks | +| tracking the currently online CPUs and be done with it? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Maintaining single set of masks tracking the online CPUs *sounds* | +| easier, at least until you try working out all the race conditions | +| between grace-period initialization and CPU-hotplug operations. For | +| example, suppose initialization is progressing down the tree while a | +| CPU-offline operation is progressing up the tree. This situation can | +| result in bits set at the top of the tree that have no counterparts | +| at the bottom of the tree. Those bits will never be cleared, which | +| will result in grace-period hangs. In short, that way lies madness, | +| to say nothing of a great many bugs, hangs, and deadlocks. | +| In contrast, the current multi-mask multi-counter scheme ensures that | +| grace-period initialization will always see consistent masks up and | +| down the tree, which brings significant simplifications over the | +| single-mask method. | +| | +| This is an instance of `deferring work in order to avoid | +| synchronization `__. | +| Lazily recording CPU-hotplug events at the beginning of the next | +| grace period greatly simplifies maintenance of the CPU-tracking | +| bitmasks in the ``rcu_node`` tree. | ++-----------------------------------------------------------------------+ + +Expedited Grace Period Refinements +---------------------------------- + +Idle-CPU Checks +~~~~~~~~~~~~~~~ + +Each expedited grace period checks for idle CPUs when initially forming +the mask of CPUs to be IPIed and again just before IPIing a CPU (both +checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is +idle at any time between those two times, the CPU will not be IPIed. +Instead, the task pushing the grace period forward will include the idle +CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``. + +For RCU-sched, there is an additional check: If the IPI has interrupted +the idle loop, then ``rcu_exp_handler()`` invokes +``rcu_report_exp_rdp()`` to report the corresponding quiescent state. + +For RCU-preempt, there is no specific check for idle in the IPI handler +(``rcu_exp_handler()``), but because RCU read-side critical sections are +not permitted within the idle loop, if ``rcu_exp_handler()`` sees that +the CPU is within RCU read-side critical section, the CPU cannot +possibly be idle. Otherwise, ``rcu_exp_handler()`` invokes +``rcu_report_exp_rdp()`` to report the corresponding quiescent state, +regardless of whether or not that quiescent state was due to the CPU +being idle. + +In summary, RCU expedited grace periods check for idle when building the +bitmask of CPUs that must be IPIed, just before sending each IPI, and +(either explicitly or implicitly) within the IPI handler. + +Batching via Sequence Counter +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If each grace-period request was carried out separately, expedited grace +periods would have abysmal scalability and problematic high-load +characteristics. Because each grace-period operation can serve an +unlimited number of updates, it is important to *batch* requests, so +that a single expedited grace-period operation will cover all requests +in the corresponding batch. + +This batching is controlled by a sequence counter named +``->expedited_sequence`` in the ``rcu_state`` structure. This counter +has an odd value when there is an expedited grace period in progress and +an even value otherwise, so that dividing the counter value by two gives +the number of completed grace periods. During any given update request, +the counter must transition from even to odd and then back to even, thus +indicating that a grace period has elapsed. Therefore, if the initial +value of the counter is ``s``, the updater must wait until the counter +reaches at least the value ``(s+3)&~0x1``. This counter is managed by +the following access functions: + +#. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited + grace period. +#. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace + period. +#. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter. +#. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited + grace period has elapsed since the corresponding call to + ``rcu_exp_gp_seq_snap()``. + +Again, only one request in a given batch need actually carry out a +grace-period operation, which means there must be an efficient way to +identify which of many concurrent reqeusts will initiate the grace +period, and that there be an efficient way for the remaining requests to +wait for that grace period to complete. However, that is the topic of +the next section. + +Funnel Locking and Wait/Wakeup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The natural way to sort out which of a batch of updaters will initiate +the expedited grace period is to use the ``rcu_node`` combining tree, as +implemented by the ``exp_funnel_lock()`` function. The first updater +corresponding to a given grace period arriving at a given ``rcu_node`` +structure records its desired grace-period sequence number in the +``->exp_seq_rq`` field and moves up to the next level in the tree. +Otherwise, if the ``->exp_seq_rq`` field already contains the sequence +number for the desired grace period or some later one, the updater +blocks on one of four wait queues in the ``->exp_wq[]`` array, using the +second-from-bottom and third-from bottom bits as an index. An +``->exp_lock`` field in the ``rcu_node`` structure synchronizes access +to these fields. + +An empty ``rcu_node`` tree is shown in the following diagram, with the +white cells representing the ``->exp_seq_rq`` field and the red cells +representing the elements of the ``->exp_wq[]`` array. + +.. kernel-figure:: Funnel0.svg + +The next diagram shows the situation after the arrival of Task A and +Task B at the leftmost and rightmost leaf ``rcu_node`` structures, +respectively. The current value of the ``rcu_state`` structure's +``->expedited_sequence`` field is zero, so adding three and clearing the +bottom bit results in the value two, which both tasks record in the +``->exp_seq_rq`` field of their respective ``rcu_node`` structures: + +.. kernel-figure:: Funnel1.svg + +Each of Tasks A and B will move up to the root ``rcu_node`` structure. +Suppose that Task A wins, recording its desired grace-period sequence +number and resulting in the state shown below: + +.. kernel-figure:: Funnel2.svg + +Task A now advances to initiate a new grace period, while Task B moves +up to the root ``rcu_node`` structure, and, seeing that its desired +sequence number is already recorded, blocks on ``->exp_wq[1]``. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why ``->exp_wq[1]``? Given that the value of these tasks' desired | +| sequence number is two, so shouldn't they instead block on | +| ``->exp_wq[2]``? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No. | +| Recall that the bottom bit of the desired sequence number indicates | +| whether or not a grace period is currently in progress. It is | +| therefore necessary to shift the sequence number right one bit | +| position to obtain the number of the grace period. This results in | +| ``->exp_wq[1]``. | ++-----------------------------------------------------------------------+ + +If Tasks C and D also arrive at this point, they will compute the same +desired grace-period sequence number, and see that both leaf +``rcu_node`` structures already have that value recorded. They will +therefore block on their respective ``rcu_node`` structures' +``->exp_wq[1]`` fields, as shown below: + +.. kernel-figure:: Funnel3.svg + +Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and +initiates the grace period, which increments ``->expedited_sequence``. +Therefore, if Tasks E and F arrive, they will compute a desired sequence +number of 4 and will record this value as shown below: + +.. kernel-figure:: Funnel4.svg + +Tasks E and F will propagate up the ``rcu_node`` combining tree, with +Task F blocking on the root ``rcu_node`` structure and Task E wait for +Task A to finish so that it can start the next grace period. The +resulting state is as shown below: + +.. kernel-figure:: Funnel5.svg + +Once the grace period completes, Task A starts waking up the tasks +waiting for this grace period to complete, increments the +``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then +releases the ``->exp_mutex``. This results in the following state: + +.. kernel-figure:: Funnel6.svg + +Task E can then acquire ``->exp_mutex`` and increment +``->expedited_sequence`` to the value three. If new tasks G and H arrive +and moves up the combining tree at the same time, the state will be as +follows: + +.. kernel-figure:: Funnel7.svg + +Note that three of the root ``rcu_node`` structure's waitqueues are now +occupied. However, at some point, Task A will wake up the tasks blocked +on the ``->exp_wq`` waitqueues, resulting in the following state: + +.. kernel-figure:: Funnel8.svg + +Execution will continue with Tasks E and H completing their grace +periods and carrying out their wakeups. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What happens if Task A takes so long to do its wakeups that Task E's | +| grace period completes? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Then Task E will block on the ``->exp_wake_mutex``, which will also | +| prevent it from releasing ``->exp_mutex``, which in turn will prevent | +| the next grace period from starting. This last is important in | +| preventing overflow of the ``->exp_wq[]`` array. | ++-----------------------------------------------------------------------+ + +Use of Workqueues +~~~~~~~~~~~~~~~~~ + +In earlier implementations, the task requesting the expedited grace +period also drove it to completion. This straightforward approach had +the disadvantage of needing to account for POSIX signals sent to user +tasks, so more recent implemementations use the Linux kernel's +`workqueues `__. + +The requesting task still does counter snapshotting and funnel-lock +processing, but the task reaching the top of the funnel lock does a +``schedule_work()`` (from ``_synchronize_rcu_expedited()`` so that a +workqueue kthread does the actual grace-period processing. Because +workqueue kthreads do not accept POSIX signals, grace-period-wait +processing need not allow for POSIX signals. In addition, this approach +allows wakeups for the previous expedited grace period to be overlapped +with processing for the next expedited grace period. Because there are +only four sets of waitqueues, it is necessary to ensure that the +previous grace period's wakeups complete before the next grace period's +wakeups start. This is handled by having the ``->exp_mutex`` guard +expedited grace-period processing and the ``->exp_wake_mutex`` guard +wakeups. The key point is that the ``->exp_mutex`` is not released until +the first wakeup is complete, which means that the ``->exp_wake_mutex`` +has already been acquired at that point. This approach ensures that the +previous grace period's wakeups can be carried out while the current +grace period is in process, but that these wakeups will complete before +the next grace period starts. This means that only three waitqueues are +required, guaranteeing that the four that are provided are sufficient. + +Stall Warnings +~~~~~~~~~~~~~~ + +Expediting grace periods does nothing to speed things up when RCU +readers take too long, and therefore expedited grace periods check for +stalls just as normal grace periods do. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But why not just let the normal grace-period machinery detect the | +| stalls, given that a given reader must block both normal and | +| expedited grace periods? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because it is quite possible that at a given time there is no normal | +| grace period in progress, in which case the normal grace period | +| cannot emit a stall warning. | ++-----------------------------------------------------------------------+ + +The ``synchronize_sched_expedited_wait()`` function loops waiting for +the expedited grace period to end, but with a timeout set to the current +RCU CPU stall-warning time. If this time is exceeded, any CPUs or +``rcu_node`` structures blocking the current grace period are printed. +Each stall warning results in another pass through the loop, but the +second and subsequent passes use longer stall times. + +Mid-boot operation +~~~~~~~~~~~~~~~~~~ + +The use of workqueues has the advantage that the expedited grace-period +code need not worry about POSIX signals. Unfortunately, it has the +corresponding disadvantage that workqueues cannot be used until they are +initialized, which does not happen until some time after the scheduler +spawns the first task. Given that there are parts of the kernel that +really do want to execute grace periods during this mid-boot “dead +zone”, expedited grace periods must do something else during thie time. + +What they do is to fall back to the old practice of requiring that the +requesting task drive the expedited grace period, as was the case before +the use of workqueues. However, the requesting task is only required to +drive the grace period during the mid-boot dead zone. Before mid-boot, a +synchronous grace period is a no-op. Some time after mid-boot, +workqueues are used. + +Non-expedited non-SRCU synchronous grace periods must also operate +normally during mid-boot. This is handled by causing non-expedited grace +periods to take the expedited code path during mid-boot. + +The current code assumes that there are no POSIX signals during the +mid-boot dead zone. However, if an overwhelming need for POSIX signals +somehow arises, appropriate adjustments can be made to the expedited +stall-warning code. One such adjustment would reinstate the +pre-workqueue stall-warning checks, but only during the mid-boot dead +zone. + +With this refinement, synchronous grace periods can now be used from +task context pretty much any time during the life of the kernel. That +is, aside from some points in the suspend, hibernate, or shutdown code +path. + +Summary +~~~~~~~ + +Expedited grace periods use a sequence-number approach to promote +batching, so that a single grace-period operation can serve numerous +requests. A funnel lock is used to efficiently identify the one task out +of a concurrent group that will request the grace period. All members of +the group will block on waitqueues provided in the ``rcu_node`` +structure. The actual grace-period processing is carried out by a +workqueue. + +CPU-hotplug operations are noted lazily in order to prevent the need for +tight synchronization between expedited grace periods and CPU-hotplug +operations. The dyntick-idle counters are used to avoid sending IPIs to +idle CPUs, at least in the common case. RCU-preempt and RCU-sched use +different IPI handlers and different code to respond to the state +changes carried out by those handlers, but otherwise use common code. + +Quiescent states are tracked using the ``rcu_node`` tree, and once all +necessary quiescent states have been reported, all tasks waiting on this +expedited grace period are awakened. A pair of mutexes are used to allow +one grace period's wakeups to proceed concurrently with the next grace +period's processing. + +This combination of mechanisms allows expedited grace periods to run +reasonably efficiently. However, for non-time-critical tasks, normal +grace periods should be used instead because their longer duration +permits much higher degrees of batching, and thus much lower per-request +overheads. diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg new file mode 100644 index 000000000..98af66557 --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg @@ -0,0 +1,275 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + ->expedited_sequence: 0 + + + + + + + + + + + + :0 + + + + + + + + :0 + + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg new file mode 100644 index 000000000..e0184a37a --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg @@ -0,0 +1,275 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + ->expedited_sequence: 0 + + + + + + + + + + + + A:2 + + + + + + + + B:2 + + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg new file mode 100644 index 000000000..1bc3fed54 --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg @@ -0,0 +1,287 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + ->expedited_sequence: 0 + + + + + + + + + + + + :2 + + + + + + + + B:2 + + A:2 + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg new file mode 100644 index 000000000..6d8a1bffb --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg @@ -0,0 +1,323 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + ->expedited_sequence: 0 GP: A + + + + + + + + + + + + :2 + C + + + + + + + + :2 + D + + :2 + B + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg new file mode 100644 index 000000000..44018fd63 --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg @@ -0,0 +1,323 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + ->expedited_sequence: 1 GP: A + + + + + + + + + + + + E:4 + C + + + + + + + + F:4 + D + + :2 + B + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg new file mode 100644 index 000000000..e5eef5045 --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg @@ -0,0 +1,335 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + ->expedited_sequence: 1 GP: A,E + + + + + + + + + + + + :4 + C + + + + + + + + :4 + D + + :4 + B + F + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg new file mode 100644 index 000000000..fbd2c1892 --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg @@ -0,0 +1,335 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + ->expedited_sequence: 2 GP: E Wakeup: A + + + + + + + + + + + + :4 + C + + + + + + + + :4 + D + + :4 + B + F + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg new file mode 100644 index 000000000..502e159ed --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg @@ -0,0 +1,347 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + ->expedited_sequence: 3 GP: E,H Wakeup: A + + + + + + + + + + + + :4 + C + + + + + + + + :6 + D + + :6 + B + F + G + + diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg new file mode 100644 index 000000000..677401551 --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg @@ -0,0 +1,311 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + ->expedited_sequence: 3 GP: E,H + + + + + + + + + + + + :4 + + + + + + + + :6 + + :6 + F + G + + diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst new file mode 100644 index 000000000..83ae3b79a --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst @@ -0,0 +1,624 @@ +====================================================== +A Tour Through TREE_RCU's Grace-Period Memory Ordering +====================================================== + +August 8, 2017 + +This article was contributed by Paul E. McKenney + +Introduction +============ + +This document gives a rough visual overview of how Tree RCU's +grace-period memory ordering guarantee is provided. + +What Is Tree RCU's Grace Period Memory Ordering Guarantee? +========================================================== + +RCU grace periods provide extremely strong memory-ordering guarantees +for non-idle non-offline code. +Any code that happens after the end of a given RCU grace period is guaranteed +to see the effects of all accesses prior to the beginning of that grace +period that are within RCU read-side critical sections. +Similarly, any code that happens before the beginning of a given RCU grace +period is guaranteed to see the effects of all accesses following the end +of that grace period that are within RCU read-side critical sections. + +Note well that RCU-sched read-side critical sections include any region +of code for which preemption is disabled. +Given that each individual machine instruction can be thought of as +an extremely small region of preemption-disabled code, one can think of +``synchronize_rcu()`` as ``smp_mb()`` on steroids. + +RCU updaters use this guarantee by splitting their updates into +two phases, one of which is executed before the grace period and +the other of which is executed after the grace period. +In the most common use case, phase one removes an element from +a linked RCU-protected data structure, and phase two frees that element. +For this to work, any readers that have witnessed state prior to the +phase-one update (in the common case, removal) must not witness state +following the phase-two update (in the common case, freeing). + +The RCU implementation provides this guarantee using a network +of lock-based critical sections, memory barriers, and per-CPU +processing, as is described in the following sections. + +Tree RCU Grace Period Memory Ordering Building Blocks +===================================================== + +The workhorse for RCU's grace-period memory ordering is the +critical section for the ``rcu_node`` structure's +``->lock``. These critical sections use helper functions for lock +acquisition, including ``raw_spin_lock_rcu_node()``, +``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``. +Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``, +``raw_spin_unlock_irq_rcu_node()``, and +``raw_spin_unlock_irqrestore_rcu_node()``, respectively. +For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided. +The key point is that the lock-acquisition functions, including +``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()`` +immediately after successful acquisition of the lock. + +Therefore, for any given ``rcu_node`` structure, any access +happening before one of the above lock-release functions will be seen +by all CPUs as happening before any access happening after a later +one of the above lock-acquisition functions. +Furthermore, any access happening before one of the +above lock-release function on any given CPU will be seen by all +CPUs as happening before any access happening after a later one +of the above lock-acquisition functions executing on that same CPU, +even if the lock-release and lock-acquisition functions are operating +on different ``rcu_node`` structures. +Tree RCU uses these two ordering guarantees to form an ordering +network among all CPUs that were in any way involved in the grace +period, including any CPUs that came online or went offline during +the grace period in question. + +The following litmus test exhibits the ordering effects of these +lock-acquisition and lock-release functions:: + + 1 int x, y, z; + 2 + 3 void task0(void) + 4 { + 5 raw_spin_lock_rcu_node(rnp); + 6 WRITE_ONCE(x, 1); + 7 r1 = READ_ONCE(y); + 8 raw_spin_unlock_rcu_node(rnp); + 9 } + 10 + 11 void task1(void) + 12 { + 13 raw_spin_lock_rcu_node(rnp); + 14 WRITE_ONCE(y, 1); + 15 r2 = READ_ONCE(z); + 16 raw_spin_unlock_rcu_node(rnp); + 17 } + 18 + 19 void task2(void) + 20 { + 21 WRITE_ONCE(z, 1); + 22 smp_mb(); + 23 r3 = READ_ONCE(x); + 24 } + 25 + 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); + +The ``WARN_ON()`` is evaluated at "the end of time", +after all changes have propagated throughout the system. +Without the ``smp_mb__after_unlock_lock()`` provided by the +acquisition functions, this ``WARN_ON()`` could trigger, for example +on PowerPC. +The ``smp_mb__after_unlock_lock()`` invocations prevent this +``WARN_ON()`` from triggering. + +This approach must be extended to include idle CPUs, which need +RCU's grace-period memory ordering guarantee to extend to any +RCU read-side critical sections preceding and following the current +idle sojourn. +This case is handled by calls to the strongly ordered +``atomic_add_return()`` read-modify-write atomic operation that +is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry +time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time. +The grace-period kthread invokes ``rcu_dynticks_snap()`` and +``rcu_dynticks_in_eqs_since()`` (both of which invoke +an ``atomic_add_return()`` of zero) to detect idle CPUs. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But what about CPUs that remain offline for the entire grace period? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Such CPUs will be offline at the beginning of the grace period, so | +| the grace period won't expect quiescent states from them. Races | +| between grace-period start and CPU-hotplug operations are mediated | +| by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described | +| above. | ++-----------------------------------------------------------------------+ + +The approach must be extended to handle one final case, that of waking a +task blocked in ``synchronize_rcu()``. This task might be affinitied to +a CPU that is not yet aware that the grace period has ended, and thus +might not yet be subject to the grace period's memory ordering. +Therefore, there is an ``smp_mb()`` after the return from +``wait_for_completion()`` in the ``synchronize_rcu()`` code path. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What? Where??? I don't see any ``smp_mb()`` after the return from | +| ``wait_for_completion()``!!! | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| That would be because I spotted the need for that ``smp_mb()`` during | +| the creation of this documentation, and it is therefore unlikely to | +| hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter | +| Zijlstra, and Jonathan Cameron for asking questions that sensitized | +| me to the rather elaborate sequence of events that demonstrate the | +| need for this memory barrier. | ++-----------------------------------------------------------------------+ + +Tree RCU's grace--period memory-ordering guarantees rely most heavily on +the ``rcu_node`` structure's ``->lock`` field, so much so that it is +necessary to abbreviate this pattern in the diagrams in the next +section. For example, consider the ``rcu_prepare_for_idle()`` function +shown below, which is one of several functions that enforce ordering of +newly arrived RCU callbacks against future grace periods: + +:: + + 1 static void rcu_prepare_for_idle(void) + 2 { + 3 bool needwake; + 4 struct rcu_data *rdp; + 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); + 6 struct rcu_node *rnp; + 7 struct rcu_state *rsp; + 8 int tne; + 9 + 10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || + 11 rcu_is_nocb_cpu(smp_processor_id())) + 12 return; + 13 tne = READ_ONCE(tick_nohz_active); + 14 if (tne != rdtp->tick_nohz_enabled_snap) { + 15 if (rcu_cpu_has_callbacks(NULL)) + 16 invoke_rcu_core(); + 17 rdtp->tick_nohz_enabled_snap = tne; + 18 return; + 19 } + 20 if (!tne) + 21 return; + 22 if (rdtp->all_lazy && + 23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { + 24 rdtp->all_lazy = false; + 25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; + 26 invoke_rcu_core(); + 27 return; + 28 } + 29 if (rdtp->last_accelerate == jiffies) + 30 return; + 31 rdtp->last_accelerate = jiffies; + 32 for_each_rcu_flavor(rsp) { + 33 rdp = this_cpu_ptr(rsp->rda); + 34 if (rcu_segcblist_pend_cbs(&rdp->cblist)) + 35 continue; + 36 rnp = rdp->mynode; + 37 raw_spin_lock_rcu_node(rnp); + 38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp); + 39 raw_spin_unlock_rcu_node(rnp); + 40 if (needwake) + 41 rcu_gp_kthread_wake(rsp); + 42 } + 43 } + +But the only part of ``rcu_prepare_for_idle()`` that really matters for +this discussion are lines 37–39. We will therefore abbreviate this +function as follows: + +.. kernel-figure:: rcu_node-lock.svg + +The box represents the ``rcu_node`` structure's ``->lock`` critical +section, with the double line on top representing the additional +``smp_mb__after_unlock_lock()``. + +Tree RCU Grace Period Memory Ordering Components +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Tree RCU's grace-period memory-ordering guarantee is provided by a +number of RCU components: + +#. `Callback Registry`_ +#. `Grace-Period Initialization`_ +#. `Self-Reported Quiescent States`_ +#. `Dynamic Tick Interface`_ +#. `CPU-Hotplug Interface`_ +#. `Forcing Quiescent States`_ +#. `Grace-Period Cleanup`_ +#. `Callback Invocation`_ + +Each of the following section looks at the corresponding component in +detail. + +Callback Registry +^^^^^^^^^^^^^^^^^ + +If RCU's grace-period guarantee is to mean anything at all, any access +that happens before a given invocation of ``call_rcu()`` must also +happen before the corresponding grace period. The implementation of this +portion of RCU's grace period guarantee is shown in the following +figure: + +.. kernel-figure:: TreeRCU-callback-registry.svg + +Because ``call_rcu()`` normally acts only on CPU-local state, it +provides no ordering guarantees, either for itself or for phase one of +the update (which again will usually be removal of an element from an +RCU-protected data structure). It simply enqueues the ``rcu_head`` +structure on a per-CPU list, which cannot become associated with a grace +period until a later call to ``rcu_accelerate_cbs()``, as shown in the +diagram above. + +One set of code paths shown on the left invokes ``rcu_accelerate_cbs()`` +via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the +current CPU is inundated with queued ``rcu_head`` structures) or more +likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle +is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which +invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The +final code path on the right is taken only in kernels built with +``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via +``rcu_advance_cbs()``, ``rcu_migrate_callbacks``, +``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn +is invoked on a surviving CPU after the outgoing CPU has been completely +offlined. + +There are a few other code paths within grace-period processing that +opportunistically invoke ``rcu_accelerate_cbs()``. However, either way, +all of the CPU's recently queued ``rcu_head`` structures are associated +with a future grace-period number under the protection of the CPU's lead +``rcu_node`` structure's ``->lock``. In all cases, there is full +ordering against any prior critical section for that same ``rcu_node`` +structure's ``->lock``, and also full ordering against any of the +current task's or CPU's prior critical sections for any ``rcu_node`` +structure's ``->lock``. + +The next section will show how this ordering ensures that any accesses +prior to the ``call_rcu()`` (particularly including phase one of the +update) happen before the start of the corresponding grace period. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But what about ``synchronize_rcu()``? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, | +| which invokes it. So either way, it eventually comes down to | +| ``call_rcu()``. | ++-----------------------------------------------------------------------+ + +Grace-Period Initialization +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Grace-period initialization is carried out by the grace-period kernel +thread, which makes several passes over the ``rcu_node`` tree within the +``rcu_gp_init()`` function. This means that showing the full flow of +ordering through the grace-period computation will require duplicating +this tree. If you find this confusing, please note that the state of the +``rcu_node`` changes over time, just like Heraclitus's river. However, +to keep the ``rcu_node`` river tractable, the grace-period kernel +thread's traversals are presented in multiple parts, starting in this +section with the various phases of grace-period initialization. + +The first ordering-related grace-period initialization action is to +advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number +counter, as shown below: + +.. kernel-figure:: TreeRCU-gp-init-1.svg + +The actual increment is carried out using ``smp_store_release()``, which +helps reject false-positive RCU CPU stall detection. Note that only the +root ``rcu_node`` structure is touched. + +The first pass through the ``rcu_node`` tree updates bitmasks based on +CPUs having come online or gone offline since the start of the previous +grace period. In the common case where the number of online CPUs for +this ``rcu_node`` structure has not transitioned to or from zero, this +pass will scan only the leaf ``rcu_node`` structures. However, if the +number of online CPUs for a given leaf ``rcu_node`` structure has +transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the +first incoming CPU. Similarly, if the number of online CPUs for a given +leaf ``rcu_node`` structure has transitioned to zero, +``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU. +The diagram below shows the path of ordering if the leftmost +``rcu_node`` structure onlines its first CPU and if the next +``rcu_node`` structure has no online CPUs (or, alternatively if the +leftmost ``rcu_node`` structure offlines its last CPU and if the next +``rcu_node`` structure has no online CPUs). + +.. kernel-figure:: TreeRCU-gp-init-1.svg + +The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses +breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field +to the newly advanced value from the ``rcu_state`` structure, as shown +in the following diagram. + +.. kernel-figure:: TreeRCU-gp-init-1.svg + +This change will also cause each CPU's next call to +``__note_gp_changes()`` to notice that a new grace period has started, +as described in the next section. But because the grace-period kthread +started the grace period at the root (with the advancing of the +``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf +``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of +the start of the grace period will happen after the actual start of the +grace period. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But what about the CPU that started the grace period? Why wouldn't it | +| see the start of the grace period right when it started that grace | +| period? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In some deep philosophical and overly anthromorphized sense, yes, the | +| CPU starting the grace period is immediately aware of having done so. | +| However, if we instead assume that RCU is not self-aware, then even | +| the CPU starting the grace period does not really become aware of the | +| start of this grace period until its first call to | +| ``__note_gp_changes()``. On the other hand, this CPU potentially gets | +| early notification because it invokes ``__note_gp_changes()`` during | +| its last ``rcu_gp_init()`` pass through its leaf ``rcu_node`` | +| structure. | ++-----------------------------------------------------------------------+ + +Self-Reported Quiescent States +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +When all entities that might block the grace period have reported +quiescent states (or as described in a later section, had quiescent +states reported on their behalf), the grace period can end. Online +non-idle CPUs report their own quiescent states, as shown in the +following diagram: + +.. kernel-figure:: TreeRCU-qs.svg + +This is for the last CPU to report a quiescent state, which signals the +end of the grace period. Earlier quiescent states would push up the +``rcu_node`` tree only until they encountered an ``rcu_node`` structure +that is waiting for additional quiescent states. However, ordering is +nevertheless preserved because some later quiescent state will acquire +that ``rcu_node`` structure's ``->lock``. + +Any number of events can lead up to a CPU invoking ``note_gp_changes`` +(or alternatively, directly invoking ``__note_gp_changes()``), at which +point that CPU will notice the start of a new grace period while holding +its leaf ``rcu_node`` lock. Therefore, all execution shown in this +diagram happens after the start of the grace period. In addition, this +CPU will consider any RCU read-side critical section that started before +the invocation of ``__note_gp_changes()`` to have started before the +grace period, and thus a critical section that the grace period must +wait on. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But a RCU read-side critical section might have started after the | +| beginning of the grace period (the advancing of ``->gp_seq`` from | +| earlier), so why should the grace period wait on such a critical | +| section? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| It is indeed not necessary for the grace period to wait on such a | +| critical section. However, it is permissible to wait on it. And it is | +| furthermore important to wait on it, as this lazy approach is far | +| more scalable than a “big bang” all-at-once grace-period start could | +| possibly be. | ++-----------------------------------------------------------------------+ + +If the CPU does a context switch, a quiescent state will be noted by +``rcu_note_context_switch()`` on the left. On the other hand, if the CPU +takes a scheduler-clock interrupt while executing in usermode, a +quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right. +Either way, the passage through a quiescent state will be noted in a +per-CPU variable. + +The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for +example, after the next scheduler-clock interrupt), ``rcu_core()`` will +invoke ``rcu_check_quiescent_state()``, which will notice the recorded +quiescent state, and invoke ``rcu_report_qs_rdp()``. If +``rcu_report_qs_rdp()`` verifies that the quiescent state really does +apply to the current grace period, it invokes ``rcu_report_rnp()`` which +traverses up the ``rcu_node`` tree as shown at the bottom of the +diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask`` +field, and propagating up the tree when the result is zero. + +Note that traversal passes upwards out of a given ``rcu_node`` structure +only if the current CPU is reporting the last quiescent state for the +subtree headed by that ``rcu_node`` structure. A key point is that if a +CPU's traversal stops at a given ``rcu_node`` structure, then there will +be a later traversal by another CPU (or perhaps the same one) that +proceeds upwards from that point, and the ``rcu_node`` ``->lock`` +guarantees that the first CPU's quiescent state happens before the +remainder of the second CPU's traversal. Applying this line of thought +repeatedly shows that all CPUs' quiescent states happen before the last +CPU traverses through the root ``rcu_node`` structure, the “last CPU” +being the one that clears the last bit in the root ``rcu_node`` +structure's ``->qsmask`` field. + +Dynamic Tick Interface +^^^^^^^^^^^^^^^^^^^^^^ + +Due to energy-efficiency considerations, RCU is forbidden from +disturbing idle CPUs. CPUs are therefore required to notify RCU when +entering or leaving idle state, which they do via fully ordered +value-returning atomic operations on a per-CPU variable. The ordering +effects are as shown below: + +.. kernel-figure:: TreeRCU-dyntick.svg + +The RCU grace-period kernel thread samples the per-CPU idleness variable +while holding the corresponding CPU's leaf ``rcu_node`` structure's +``->lock``. This means that any RCU read-side critical sections that +precede the idle period (the oval near the top of the diagram above) +will happen before the end of the current grace period. Similarly, the +beginning of the current grace period will happen before any RCU +read-side critical sections that follow the idle period (the oval near +the bottom of the diagram above). + +Plumbing this into the full grace-period execution is described +`below <#Forcing%20Quiescent%20States>`__. + +CPU-Hotplug Interface +^^^^^^^^^^^^^^^^^^^^^ + +RCU is also forbidden from disturbing offline CPUs, which might well be +powered off and removed from the system completely. CPUs are therefore +required to notify RCU of their comings and goings as part of the +corresponding CPU hotplug operations. The ordering effects are shown +below: + +.. kernel-figure:: TreeRCU-hotplug.svg + +Because CPU hotplug operations are much less frequent than idle +transitions, they are heavier weight, and thus acquire the CPU's leaf +``rcu_node`` structure's ``->lock`` and update this structure's +``->qsmaskinitnext``. The RCU grace-period kernel thread samples this +mask to detect CPUs having gone offline since the beginning of this +grace period. + +Plumbing this into the full grace-period execution is described +`below <#Forcing%20Quiescent%20States>`__. + +Forcing Quiescent States +^^^^^^^^^^^^^^^^^^^^^^^^ + +As noted above, idle and offline CPUs cannot report their own quiescent +states, and therefore the grace-period kernel thread must do the +reporting on their behalf. This process is called “forcing quiescent +states”, it is repeated every few jiffies, and its ordering effects are +shown below: + +.. kernel-figure:: TreeRCU-gp-fqs.svg + +Each pass of quiescent state forcing is guaranteed to traverse the leaf +``rcu_node`` structures, and if there are no new quiescent states due to +recently idled and/or offlined CPUs, then only the leaves are traversed. +However, if there is a newly offlined CPU as illustrated on the left or +a newly idled CPU as illustrated on the right, the corresponding +quiescent state will be driven up towards the root. As with +self-reported quiescent states, the upwards driving stops once it +reaches an ``rcu_node`` structure that has quiescent states outstanding +from other CPUs. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| The leftmost drive to root stopped before it reached the root | +| ``rcu_node`` structure, which means that there are still CPUs | +| subordinate to that structure on which the current grace period is | +| waiting. Given that, how is it possible that the rightmost drive to | +| root ended the grace period? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Good analysis! It is in fact impossible in the absence of bugs in | +| RCU. But this diagram is complex enough as it is, so simplicity | +| overrode accuracy. You can think of it as poetic license, or you can | +| think of it as misdirection that is resolved in the | +| `stitched-together diagram <#Putting%20It%20All%20Together>`__. | ++-----------------------------------------------------------------------+ + +Grace-Period Cleanup +^^^^^^^^^^^^^^^^^^^^ + +Grace-period cleanup first scans the ``rcu_node`` tree breadth-first +advancing all the ``->gp_seq`` fields, then it advances the +``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are +shown below: + +.. kernel-figure:: TreeRCU-gp-cleanup.svg + +As indicated by the oval at the bottom of the diagram, once grace-period +cleanup is complete, the next grace period can begin. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But when precisely does the grace period end? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| There is no useful single point at which the grace period can be said | +| to end. The earliest reasonable candidate is as soon as the last CPU | +| has reported its quiescent state, but it may be some milliseconds | +| before RCU becomes aware of this. The latest reasonable candidate is | +| once the ``rcu_state`` structure's ``->gp_seq`` field has been | +| updated, but it is quite possible that some CPUs have already | +| completed phase two of their updates by that time. In short, if you | +| are going to work with RCU, you need to learn to embrace uncertainty. | ++-----------------------------------------------------------------------+ + +Callback Invocation +^^^^^^^^^^^^^^^^^^^ + +Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has +been updated, that CPU can begin invoking its RCU callbacks that were +waiting for this grace period to end. These callbacks are identified by +``rcu_advance_cbs()``, which is usually invoked by +``__note_gp_changes()``. As shown in the diagram below, this invocation +can be triggered by the scheduling-clock interrupt +(``rcu_sched_clock_irq()`` on the left) or by idle entry +(``rcu_cleanup_after_idle()`` on the right, but only for kernels build +with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is +raised, which results in ``rcu_do_batch()`` invoking the callbacks, +which in turn allows those callbacks to carry out (either directly or +indirectly via wakeup) the needed phase-two processing for each update. + +.. kernel-figure:: TreeRCU-callback-invocation.svg + +Please note that callback invocation can also be prompted by any number +of corner-case code paths, for example, when a CPU notes that it has +excessive numbers of callbacks queued. In all cases, the CPU acquires +its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks, +which preserves the required ordering against the newly completed grace +period. + +However, if the callback function communicates to other CPUs, for +example, doing a wakeup, then it is that function's responsibility to +maintain ordering. For example, if the callback function wakes up a task +that runs on some other CPU, proper ordering must in place in both the +callback function and the task being awakened. To see why this is +important, consider the top half of the `grace-period +cleanup <#Grace-Period%20Cleanup>`__ diagram. The callback might be +running on a CPU corresponding to the leftmost leaf ``rcu_node`` +structure, and awaken a task that is to run on a CPU corresponding to +the rightmost leaf ``rcu_node`` structure, and the grace-period kernel +thread might not yet have reached the rightmost leaf. In this case, the +grace period's memory ordering might not yet have reached that CPU, so +again the callback function and the awakened task must supply proper +ordering. + +Putting It All Together +~~~~~~~~~~~~~~~~~~~~~~~ + +A stitched-together diagram is here: + +.. kernel-figure:: TreeRCU-gp.svg + +Legal Statement +~~~~~~~~~~~~~~~ + +This work represents the view of the author and does not necessarily +represent the view of IBM. + +Linux is a registered trademark of Linus Torvalds. + +Other company, product, and service names may be trademarks or service +marks of others. diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg new file mode 100644 index 000000000..3fcf0c17c --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg @@ -0,0 +1,486 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_sched_clock_irq() + + rcu_cleanup_after_idle() + + rcu_advance_cbs() + + + Leaf + __note_gp_changes() + + + + Phase Two + of Update + + + RCU_SOFTIRQ + rcu_do_batch() + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg new file mode 100644 index 000000000..7ac6f9269 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg @@ -0,0 +1,655 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_accelerate_cbs() + + + + + rcu_prepare_for_idle() + + rcu_accelerate_cbs() + + + + + note_gp_changes() + rcu_advance_cbs() + __note_gp_changes() + + call_rcu() + + Wake up + grace-period + kernel thread + + rcu_accelerate_cbs() + + + + + takedown_cpu() + rcutree_migrate_callbacks() + rcu_migrate_callbacks() + rcu_advance_cbs() + Leaf + Leaf + Leaf + + Phase One + of Update + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg new file mode 100644 index 000000000..423df00c4 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg @@ -0,0 +1,700 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->qsmask &= ~->grpmask + Leaf + dyntick_save_progress_counter() + rcu_implicit_dynticks_qs() + + + + RCU + read-side + critical section + + + + rcu_dynticks_eqs_enter() + atomic_add_return() + + + + rcu_dynticks_eqs_exit() + atomic_add_return() + + + + RCU + read-side + critical section + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg new file mode 100644 index 000000000..bf84fbab2 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg @@ -0,0 +1,1133 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + Root + + + rcu_gp_cleanup() + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + + + Leaf + + + + + + + Root + rcu_seq_end(&rsp->gp_seq) + + + + + + + + + + + + Leaf + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + + + Leaf + + + + + + + Leaf + rcu_seq_end(&rnp->gp_seq) + + + + + + + Leaf + rcu_seq_end(&rnp->gp_seq) + + + + + + + + + + Start of + Next Grace + Period + + + rcu_seq_end(&rnp->gp_seq) + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg new file mode 100644 index 000000000..7ddc094d7 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg @@ -0,0 +1,1309 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_gp_fqs() + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + force_qs_rnp() + dyntick_save_progress_counter() + + + + + + Root + ->qsmask &= ~->grpmask + + rcu_implicit_dynticks_qs() + ->qsmask &= ~->grpmask + + + RCU + read-side + critical section + + + + rcu_dynticks_eqs_enter() + atomic_add_return() + + + + rcu_dynticks_eqs_exit() + atomic_add_return() + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + rcu_report_dead() + rcu_cleanup_dying_idle_cpu() + + + + + ->qsmaskinitnext + Leaf + + + + RCU + read-side + critical section + + + + rcu_cpu_starting() + + + + + ->qsmaskinitnext + Leaf + + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg new file mode 100644 index 000000000..8c2075508 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg @@ -0,0 +1,658 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_seq_start(rsp->gp_seq) + + + + + Root + + + rcu_gp_init() + + + + + + + + + + + + Leaf + + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + + + End of + Last Grace + Period + + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg new file mode 100644 index 000000000..4d956a732 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg @@ -0,0 +1,656 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_gp_init() + + + + + + + + + + + + Leaf + + + + + + + ->qsmaskinit + ->qsmaskinitnext + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmaskinit + + + + + + + + + rcu_init_new_rnp() or + rcu_cleanup_dead_rnp() + (optional) + + ->qsmaskinit + + + + + Root + ->qsmaskinitnext + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg new file mode 100644 index 000000000..d24d7d555 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg @@ -0,0 +1,636 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->gp_seq = rsp->gp_seq + + + + + Root + + + rcu_gp_init() + + + + + + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + + ->gp_seq = rsp->gp_seq + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg new file mode 100644 index 000000000..069f6f837 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg @@ -0,0 +1,5144 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_accelerate_cbs() + + + + + rcu_prepare_for_idle() + + rcu_accelerate_cbs() + + + + + note_gp_changes() + rcu_advance_cbs() + __note_gp_changes() + + call_rcu() + + Wake up + grace-period + kernel thread + + rcu_accelerate_cbs() + + + + + takedown_cpu() + rcutree_migrate_callbacks() + rcu_migrate_callbacks() + rcu_advance_cbs() + Leaf + Leaf + Leaf + + Phase One + of Update + + + + + + + Root + rcu_seq_start(rsp->gp_seq) + + + rcu_gp_init() + + + + + + + + + + + + Leaf + + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + + + End of + Last Grace + Period + + + + Grace-period + kernel thread + awakened + + + + + + + + + + + + + + Leaf + + + + + + + ->qsmaskinit + ->qsmaskinitnext + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmaskinit + + + + + + + + + rcu_init_new_rnp() or + rcu_cleanup_dead_rnp() + (optional) + + ->qsmaskinit + + + + + Root + ->qsmaskinitnext + + + + + + + + Root + ->gp_seq = rsp->gp_seq + + + + + + + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + Leaf + ->gp_seq = rsp->gp_seq + + + + + + + + + + + + + + rcu_gp_fqs() + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + force_qs_rnp() + dyntick_save_progress_counter() + + + + + + Root + ->qsmask &= ~->grpmask + + rcu_implicit_dynticks_qs() + ->qsmask &= ~->grpmask + + + RCU + read-side + critical section + + + + rcu_dynticks_eqs_enter() + atomic_add_return() + + + + rcu_dynticks_eqs_exit() + atomic_add_return() + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + rcu_report_dead() + rcu_cleanup_dying_idle_cpu() + + + + + ->qsmaskinitnext + Leaf + + + + RCU + read-side + critical section + + + + rcu_cpu_starting() + + + + + ->qsmaskinitnext + Leaf + + + + + ->qsmask &= ~->grpmask + + + + + Root + + + rcu_report_rnp() + + + + + + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + + + + + + + + + note_gp_changes() + rdp->gp_seq + __note_gp_changes() + Leaf + + + + rcu_note_context_switch() + + + + rcu_sched_clock_irq() + + + + rcu_core() + rcu_check_quiescent_state() + rcu__report_qs_rdp()) + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + + + Wake up + grace-period + kernel thread + + + + rcu_report_qs_rsp() + + + Grace-period + kernel thread + awakened + + + + + + + + Root + rcu_seq_end(&rnp->gp_seq) + + + rcu_gp_cleanup() + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + + Leaf + + + + + + Root + + + + + + + + + + + Leaf + + + + + + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + + + + + + + + + + + + + rcu_seq_end(&rnp->gp_seq) + + + + + + + Leaf + + + + + + + Leaf + rcu_seq_end(&rnp->gp_seq) + + + + + + + Leaf + rcu_seq_end(&rnp->gp_seq) + + + + + + + + + + Start of + Next Grace + Period + + + + + + rcu_sched_clock_irq() + + rcu_cleanup_after_idle() + rcu_advance_cbs() + + + Leaf + __note_gp_changes() + + + Phase Two + of Update + + + RCU_SOFTIRQ + rcu_do_batch() + rcu_seq_end(&rnp->gp_seq) + rcu_seq_end(&rnp->gp_seq) + rcu_seq_end(&rsp->gp_seq) + ->gp_seq = rsp->gp_seq + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg new file mode 100644 index 000000000..2c9310ba2 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg @@ -0,0 +1,775 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->qsmask &= ~->grpmask + dyntick_save_progress_counter() + rcu_implicit_dynticks_qs() + Leaf + + + + RCU + read-side + critical section + + + + rcu_report_dead() + rcu_cleanup_dying_idle_cpu() + + + + + ->qsmaskinitnext + Leaf + + + + RCU + read-side + critical section + + + + rcu_cpu_starting() + + + + + ->qsmaskinitnext + Leaf + + diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg new file mode 100644 index 000000000..7d6c5f7e5 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg @@ -0,0 +1,1095 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ->qsmask &= ~->grpmask + + + + + Root + + + rcu_report_rnp() + + + + + + + + + + + + Leaf + + + + + + + ->qsmask &= ~->grpmask + + + + + + + Leaf + + + + + + + Leaf + + + + + + + Leaf + ->qsmask &= ~->grpmask + + + + + + + + + + + + + + + + + + note_gp_changes() + rdp->gp_seq + __note_gp_changes() + Leaf + + + + rcu_note_context_switch() + + + + rcu_sched_clock_irq() + + + + rcu_core() + rcu_check_quiescent_state() + rcu__report_qs_rdp()) + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + RCU + read-side + critical section + + + + + + + Wake up + grace-period + kernel thread + + + + rcu_report_qs_rsp() + + diff --git a/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg b/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg new file mode 100644 index 000000000..94c96c595 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg @@ -0,0 +1,229 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + rcu_accelerate_cbs() + + + + + + + rcu_prepare_for_idle() + + diff --git a/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg b/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg new file mode 100644 index 000000000..4b4014fda --- /dev/null +++ b/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg @@ -0,0 +1,374 @@ + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + + synchronize_rcu() + + + + + + + WRITE_ONCE(a, 1); + WRITE_ONCE(b, 1); + r1 = READ_ONCE(a); + WRITE_ONCE(c, 1); + r2 = READ_ONCE(b); + r3 = READ_ONCE(c); + thread0() + thread1() + thread2() + + + + rcu_read_lock(); + rcu_read_lock(); + rcu_read_unlock(); + rcu_read_unlock(); + + QS + + QS + + + QS + + diff --git a/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg b/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg new file mode 100644 index 000000000..48cd1623d --- /dev/null +++ b/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg @@ -0,0 +1,639 @@ + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + + synchronize_rcu() + + + + + + + WRITE_ONCE(a, 1); + WRITE_ONCE(b, 1); + r1 = READ_ONCE(a); + WRITE_ONCE(c, 1); + WRITE_ONCE(d, 1); + r2 = READ_ONCE(c); + thread0() + thread1() + thread2() + + + + rcu_read_lock(); + rcu_read_lock(); + rcu_read_unlock(); + rcu_read_unlock(); + + QS + + QS + + + QS + + + + synchronize_rcu() + + + + + + + r3 = READ_ONCE(d); + WRITE_ONCE(e, 1); + + QS + r4 = READ_ONCE(b); + r5 = READ_ONCE(e); + rcu_read_lock(); + rcu_read_unlock(); + QS + + QS + + QS + + thread3() + thread4() + + diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst new file mode 100644 index 000000000..1ae79a10a --- /dev/null +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -0,0 +1,2680 @@ +================================= +A Tour Through RCU's Requirements +================================= + +Copyright IBM Corporation, 2015 + +Author: Paul E. McKenney + +The initial version of this document appeared in the +`LWN `_ on those articles: +`part 1 `_, +`part 2 `_, and +`part 3 `_. + +Introduction +------------ + +Read-copy update (RCU) is a synchronization mechanism that is often used +as a replacement for reader-writer locking. RCU is unusual in that +updaters do not block readers, which means that RCU's read-side +primitives can be exceedingly fast and scalable. In addition, updaters +can make useful forward progress concurrently with readers. However, all +this concurrency between RCU readers and updaters does raise the +question of exactly what RCU readers are doing, which in turn raises the +question of exactly what RCU's requirements are. + +This document therefore summarizes RCU's requirements, and can be +thought of as an informal, high-level specification for RCU. It is +important to understand that RCU's specification is primarily empirical +in nature; in fact, I learned about many of these requirements the hard +way. This situation might cause some consternation, however, not only +has this learning process been a lot of fun, but it has also been a +great privilege to work with so many people willing to apply +technologies in interesting new ways. + +All that aside, here are the categories of currently known RCU +requirements: + +#. `Fundamental Requirements`_ +#. `Fundamental Non-Requirements`_ +#. `Parallelism Facts of Life`_ +#. `Quality-of-Implementation Requirements`_ +#. `Linux Kernel Complications`_ +#. `Software-Engineering Requirements`_ +#. `Other RCU Flavors`_ +#. `Possible Future Changes`_ + +This is followed by a `summary <#Summary>`__, however, the answers to +each quick quiz immediately follows the quiz. Select the big white space +with your mouse to see the answer. + +Fundamental Requirements +------------------------ + +RCU's fundamental requirements are the closest thing RCU has to hard +mathematical requirements. These are: + +#. `Grace-Period Guarantee`_ +#. `Publish/Subscribe Guarantee`_ +#. `Memory-Barrier Guarantees`_ +#. `RCU Primitives Guaranteed to Execute Unconditionally`_ +#. `Guaranteed Read-to-Write Upgrade`_ + +Grace-Period Guarantee +~~~~~~~~~~~~~~~~~~~~~~ + +RCU's grace-period guarantee is unusual in being premeditated: Jack +Slingwine and I had this guarantee firmly in mind when we started work +on RCU (then called “rclock”) in the early 1990s. That said, the past +two decades of experience with RCU have produced a much more detailed +understanding of this guarantee. + +RCU's grace-period guarantee allows updaters to wait for the completion +of all pre-existing RCU read-side critical sections. An RCU read-side +critical section begins with the marker ``rcu_read_lock()`` and ends +with the marker ``rcu_read_unlock()``. These markers may be nested, and +RCU treats a nested set as one big RCU read-side critical section. +Production-quality implementations of ``rcu_read_lock()`` and +``rcu_read_unlock()`` are extremely lightweight, and in fact have +exactly zero overhead in Linux kernels built for production use with +``CONFIG_PREEMPT=n``. + +This guarantee allows ordering to be enforced with extremely low +overhead to readers, for example: + + :: + + 1 int x, y; + 2 + 3 void thread0(void) + 4 { + 5 rcu_read_lock(); + 6 r1 = READ_ONCE(x); + 7 r2 = READ_ONCE(y); + 8 rcu_read_unlock(); + 9 } + 10 + 11 void thread1(void) + 12 { + 13 WRITE_ONCE(x, 1); + 14 synchronize_rcu(); + 15 WRITE_ONCE(y, 1); + 16 } + +Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing +readers, any instance of ``thread0()`` that loads a value of zero from +``x`` must complete before ``thread1()`` stores to ``y``, so that +instance must also load a value of zero from ``y``. Similarly, any +instance of ``thread0()`` that loads a value of one from ``y`` must have +started after the ``synchronize_rcu()`` started, and must therefore also +load a value of one from ``x``. Therefore, the outcome: + + :: + + (r1 == 0 && r2 == 1) + +cannot happen. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! You said that updaters can make useful forward | +| progress concurrently with readers, but pre-existing readers will | +| block ``synchronize_rcu()``!!! | +| Just who are you trying to fool??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| First, if updaters do not wish to be blocked by readers, they can use | +| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later. | +| Second, even when using ``synchronize_rcu()``, the other update-side | +| code does run concurrently with readers, whether pre-existing or not. | ++-----------------------------------------------------------------------+ + +This scenario resembles one of the first uses of RCU in +`DYNIX/ptx `__, which managed a +distributed lock manager's transition into a state suitable for handling +recovery from node failure, more or less as follows: + + :: + + 1 #define STATE_NORMAL 0 + 2 #define STATE_WANT_RECOVERY 1 + 3 #define STATE_RECOVERING 2 + 4 #define STATE_WANT_NORMAL 3 + 5 + 6 int state = STATE_NORMAL; + 7 + 8 void do_something_dlm(void) + 9 { + 10 int state_snap; + 11 + 12 rcu_read_lock(); + 13 state_snap = READ_ONCE(state); + 14 if (state_snap == STATE_NORMAL) + 15 do_something(); + 16 else + 17 do_something_carefully(); + 18 rcu_read_unlock(); + 19 } + 20 + 21 void start_recovery(void) + 22 { + 23 WRITE_ONCE(state, STATE_WANT_RECOVERY); + 24 synchronize_rcu(); + 25 WRITE_ONCE(state, STATE_RECOVERING); + 26 recovery(); + 27 WRITE_ONCE(state, STATE_WANT_NORMAL); + 28 synchronize_rcu(); + 29 WRITE_ONCE(state, STATE_NORMAL); + 30 } + +The RCU read-side critical section in ``do_something_dlm()`` works with +the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that +``do_something()`` never runs concurrently with ``recovery()``, but with +little or no synchronization overhead in ``do_something_dlm()``. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why is the ``synchronize_rcu()`` on line 28 needed? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Without that extra grace period, memory reordering could result in | +| ``do_something_dlm()`` executing ``do_something()`` concurrently with | +| the last bits of ``recovery()``. | ++-----------------------------------------------------------------------+ + +In order to avoid fatal problems such as deadlocks, an RCU read-side +critical section must not contain calls to ``synchronize_rcu()``. +Similarly, an RCU read-side critical section must not contain anything +that waits, directly or indirectly, on completion of an invocation of +``synchronize_rcu()``. + +Although RCU's grace-period guarantee is useful in and of itself, with +`quite a few use cases `__, it would +be good to be able to use RCU to coordinate read-side access to linked +data structures. For this, the grace-period guarantee is not sufficient, +as can be seen in function ``add_gp_buggy()`` below. We will look at the +reader's code later, but in the meantime, just think of the reader as +locklessly picking up the ``gp`` pointer, and, if the value loaded is +non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields. + + :: + + 1 bool add_gp_buggy(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 p->a = a; + 12 p->b = a; + 13 gp = p; /* ORDERING BUG */ + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +The problem is that both the compiler and weakly ordered CPUs are within +their rights to reorder this code as follows: + + :: + + 1 bool add_gp_buggy_optimized(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 gp = p; /* ORDERING BUG */ + 12 p->a = a; + 13 p->b = a; + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized`` +executes line 11, it will see garbage in the ``->a`` and ``->b`` fields. +And this is but one of many ways in which compiler and hardware +optimizations could cause trouble. Therefore, we clearly need some way +to prevent the compiler and the CPU from reordering in this manner, +which brings us to the publish-subscribe guarantee discussed in the next +section. + +Publish/Subscribe Guarantee +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +RCU's publish-subscribe guarantee allows data to be inserted into a +linked data structure without disrupting RCU readers. The updater uses +``rcu_assign_pointer()`` to insert the new data, and readers use +``rcu_dereference()`` to access data, whether new or old. The following +shows an example of insertion: + + :: + + 1 bool add_gp(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 p->a = a; + 12 p->b = a; + 13 rcu_assign_pointer(gp, p); + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a +simple assignment statement, but also guarantees that its assignment +will happen after the two assignments in lines 11 and 12, similar to the +C11 ``memory_order_release`` store operation. It also prevents any +number of “interesting” compiler optimizations, for example, the use of +``gp`` as a scratch location immediately preceding the assignment. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But ``rcu_assign_pointer()`` does nothing to prevent the two | +| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that | +| also cause problems? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, it cannot. The readers cannot see either of these two fields | +| until the assignment to ``gp``, by which time both fields are fully | +| initialized. So reordering the assignments to ``p->a`` and ``p->b`` | +| cannot possibly cause any problems. | ++-----------------------------------------------------------------------+ + +It is tempting to assume that the reader need not do anything special to +control its accesses to the RCU-protected data, as shown in +``do_something_gp_buggy()`` below: + + :: + + 1 bool do_something_gp_buggy(void) + 2 { + 3 rcu_read_lock(); + 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ + 5 if (p) { + 6 do_something(p->a, p->b); + 7 rcu_read_unlock(); + 8 return true; + 9 } + 10 rcu_read_unlock(); + 11 return false; + 12 } + +However, this temptation must be resisted because there are a +surprisingly large number of ways that the compiler (to say nothing of +`DEC Alpha CPUs `__) +can trip this code up. For but one example, if the compiler were short +of registers, it might choose to refetch from ``gp`` rather than keeping +a separate copy in ``p`` as follows: + + :: + + 1 bool do_something_gp_buggy_optimized(void) + 2 { + 3 rcu_read_lock(); + 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ + 5 do_something(gp->a, gp->b); + 6 rcu_read_unlock(); + 7 return true; + 8 } + 9 rcu_read_unlock(); + 10 return false; + 11 } + +If this function ran concurrently with a series of updates that replaced +the current structure with a new one, the fetches of ``gp->a`` and +``gp->b`` might well come from two different structures, which could +cause serious confusion. To prevent this (and much else besides), +``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``: + + :: + + 1 bool do_something_gp(void) + 2 { + 3 rcu_read_lock(); + 4 p = rcu_dereference(gp); + 5 if (p) { + 6 do_something(p->a, p->b); + 7 rcu_read_unlock(); + 8 return true; + 9 } + 10 rcu_read_unlock(); + 11 return false; + 12 } + +The ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory +barriers in the Linux kernel. Should a `high-quality implementation of +C11 ``memory_order_consume`` +[PDF] `__ +ever appear, then ``rcu_dereference()`` could be implemented as a +``memory_order_consume`` load. Regardless of the exact implementation, a +pointer fetched by ``rcu_dereference()`` may not be used outside of the +outermost RCU read-side critical section containing that +``rcu_dereference()``, unless protection of the corresponding data +element has been passed from RCU to some other synchronization +mechanism, most commonly locking or `reference +counting `__. + +In short, updaters use ``rcu_assign_pointer()`` and readers use +``rcu_dereference()``, and these two RCU API elements work together to +ensure that readers have a consistent view of newly added data elements. + +Of course, it is also necessary to remove elements from RCU-protected +data structures, for example, using the following process: + +#. Remove the data element from the enclosing structure. +#. Wait for all pre-existing RCU read-side critical sections to complete + (because only pre-existing readers can possibly have a reference to + the newly removed data element). +#. At this point, only the updater has a reference to the newly removed + data element, so it can safely reclaim the data element, for example, + by passing it to ``kfree()``. + +This process is implemented by ``remove_gp_synchronous()``: + + :: + + 1 bool remove_gp_synchronous(void) + 2 { + 3 struct foo *p; + 4 + 5 spin_lock(&gp_lock); + 6 p = rcu_access_pointer(gp); + 7 if (!p) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 rcu_assign_pointer(gp, NULL); + 12 spin_unlock(&gp_lock); + 13 synchronize_rcu(); + 14 kfree(p); + 15 return true; + 16 } + +This function is straightforward, with line 13 waiting for a grace +period before line 14 frees the old data element. This waiting ensures +that readers will reach line 7 of ``do_something_gp()`` before the data +element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on +line 6 is similar to ``rcu_dereference()``, except that: + +#. The value returned by ``rcu_access_pointer()`` cannot be + dereferenced. If you want to access the value pointed to as well as + the pointer itself, use ``rcu_dereference()`` instead of + ``rcu_access_pointer()``. +#. The call to ``rcu_access_pointer()`` need not be protected. In + contrast, ``rcu_dereference()`` must either be within an RCU + read-side critical section or in a code segment where the pointer + cannot change, for example, in code protected by the corresponding + update-side lock. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``, | +| what destructive optimizations might the compiler make use of? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Let's start with what happens to ``do_something_gp()`` if it fails to | +| use ``rcu_dereference()``. It could reuse a value formerly fetched | +| from this same pointer. It could also fetch the pointer from ``gp`` | +| in a byte-at-a-time manner, resulting in *load tearing*, in turn | +| resulting a bytewise mash-up of two distinct pointer values. It might | +| even use value-speculation optimizations, where it makes a wrong | +| guess, but by the time it gets around to checking the value, an | +| update has changed the pointer to match the wrong guess. Too bad | +| about any dereferences that returned pre-initialization garbage in | +| the meantime! | +| For ``remove_gp_synchronous()``, as long as all modifications to | +| ``gp`` are carried out while holding ``gp_lock``, the above | +| optimizations are harmless. However, ``sparse`` will complain if you | +| define ``gp`` with ``__rcu`` and then access it without using either | +| ``rcu_access_pointer()`` or ``rcu_dereference()``. | ++-----------------------------------------------------------------------+ + +In short, RCU's publish-subscribe guarantee is provided by the +combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This +guarantee allows data elements to be safely added to RCU-protected +linked data structures without disrupting RCU readers. This guarantee +can be used in combination with the grace-period guarantee to also allow +data elements to be removed from RCU-protected linked data structures, +again without disrupting RCU readers. + +This guarantee was only partially premeditated. DYNIX/ptx used an +explicit memory barrier for publication, but had nothing resembling +``rcu_dereference()`` for subscription, nor did it have anything +resembling the dependency-ordering barrier that was later subsumed +into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The +need for these operations made itself known quite suddenly at a +late-1990s meeting with the DEC Alpha architects, back in the days when +DEC was still a free-standing company. It took the Alpha architects a +good hour to convince me that any sort of barrier would ever be needed, +and it then took me a good *two* hours to convince them that their +documentation did not make this point clear. More recent work with the C +and C++ standards committees have provided much education on tricks and +traps from the compiler. In short, compilers were much less tricky in +the early 1990s, but in 2015, don't even think about omitting +``rcu_dereference()``! + +Memory-Barrier Guarantees +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The previous section's simple linked-data-structure scenario clearly +demonstrates the need for RCU's stringent memory-ordering guarantees on +systems with more than one CPU: + +#. Each CPU that has an RCU read-side critical section that begins + before ``synchronize_rcu()`` starts is guaranteed to execute a full + memory barrier between the time that the RCU read-side critical + section ends and the time that ``synchronize_rcu()`` returns. Without + this guarantee, a pre-existing RCU read-side critical section might + hold a reference to the newly removed ``struct foo`` after the + ``kfree()`` on line 14 of ``remove_gp_synchronous()``. +#. Each CPU that has an RCU read-side critical section that ends after + ``synchronize_rcu()`` returns is guaranteed to execute a full memory + barrier between the time that ``synchronize_rcu()`` begins and the + time that the RCU read-side critical section begins. Without this + guarantee, a later RCU read-side critical section running after the + ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run + ``do_something_gp()`` and find the newly deleted ``struct foo``. +#. If the task invoking ``synchronize_rcu()`` remains on a given CPU, + then that CPU is guaranteed to execute a full memory barrier sometime + during the execution of ``synchronize_rcu()``. This guarantee ensures + that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really + does execute after the removal on line 11. +#. If the task invoking ``synchronize_rcu()`` migrates among a group of + CPUs during that invocation, then each of the CPUs in that group is + guaranteed to execute a full memory barrier sometime during the + execution of ``synchronize_rcu()``. This guarantee also ensures that + the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does + execute after the removal on line 11, but also in the case where the + thread executing the ``synchronize_rcu()`` migrates in the meantime. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Given that multiple CPUs can start RCU read-side critical sections at | +| any time without any ordering whatsoever, how can RCU possibly tell | +| whether or not a given RCU read-side critical section starts before a | +| given instance of ``synchronize_rcu()``? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| If RCU cannot tell whether or not a given RCU read-side critical | +| section starts before a given instance of ``synchronize_rcu()``, then | +| it must assume that the RCU read-side critical section started first. | +| In other words, a given instance of ``synchronize_rcu()`` can avoid | +| waiting on a given RCU read-side critical section only if it can | +| prove that ``synchronize_rcu()`` started first. | +| A related question is “When ``rcu_read_lock()`` doesn't generate any | +| code, why does it matter how it relates to a grace period?” The | +| answer is that it is not the relationship of ``rcu_read_lock()`` | +| itself that is important, but rather the relationship of the code | +| within the enclosed RCU read-side critical section to the code | +| preceding and following the grace period. If we take this viewpoint, | +| then a given RCU read-side critical section begins before a given | +| grace period when some access preceding the grace period observes the | +| effect of some access within the critical section, in which case none | +| of the accesses within the critical section may observe the effects | +| of any access following the grace period. | +| | +| As of late 2016, mathematical models of RCU take this viewpoint, for | +| example, see slides 62 and 63 of the `2016 LinuxCon | +| EU `__ | +| presentation. | ++-----------------------------------------------------------------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| The first and second guarantees require unbelievably strict ordering! | +| Are all these memory barriers *really* required? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Yes, they really are required. To see why the first guarantee is | +| required, consider the following sequence of events: | +| | +| #. CPU 1: ``rcu_read_lock()`` | +| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` | +| #. CPU 0: ``list_del_rcu(p);`` | +| #. CPU 0: ``synchronize_rcu()`` starts. | +| #. CPU 1: ``do_something_with(q->a);`` | +| ``/* No smp_mb(), so might happen after kfree(). */`` | +| #. CPU 1: ``rcu_read_unlock()`` | +| #. CPU 0: ``synchronize_rcu()`` returns. | +| #. CPU 0: ``kfree(p);`` | +| | +| Therefore, there absolutely must be a full memory barrier between the | +| end of the RCU read-side critical section and the end of the grace | +| period. | +| | +| The sequence of events demonstrating the necessity of the second rule | +| is roughly similar: | +| | +| #. CPU 0: ``list_del_rcu(p);`` | +| #. CPU 0: ``synchronize_rcu()`` starts. | +| #. CPU 1: ``rcu_read_lock()`` | +| #. CPU 1: ``q = rcu_dereference(gp);`` | +| ``/* Might return p if no memory barrier. */`` | +| #. CPU 0: ``synchronize_rcu()`` returns. | +| #. CPU 0: ``kfree(p);`` | +| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` | +| #. CPU 1: ``rcu_read_unlock()`` | +| | +| And similarly, without a memory barrier between the beginning of the | +| grace period and the beginning of the RCU read-side critical section, | +| CPU 1 might end up accessing the freelist. | +| | +| The “as if” rule of course applies, so that any implementation that | +| acts as if the appropriate memory barriers were in place is a correct | +| implementation. That said, it is much easier to fool yourself into | +| believing that you have adhered to the as-if rule than it is to | +| actually adhere to it! | ++-----------------------------------------------------------------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | +| absolutely no code in some kernel builds. This means that the | +| compiler might arbitrarily rearrange consecutive RCU read-side | +| critical sections. Given such rearrangement, if a given RCU read-side | +| critical section is done, how can you be sure that all prior RCU | +| read-side critical sections are done? Won't the compiler | +| rearrangements make that impossible to determine? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | +| absolutely no code, RCU infers quiescent states only at special | +| locations, for example, within the scheduler. Because calls to | +| ``schedule()`` had better prevent calling-code accesses to shared | +| variables from being rearranged across the call to ``schedule()``, if | +| RCU detects the end of a given RCU read-side critical section, it | +| will necessarily detect the end of all prior RCU read-side critical | +| sections, no matter how aggressively the compiler scrambles the code. | +| Again, this all assumes that the compiler cannot scramble code across | +| calls to the scheduler, out of interrupt handlers, into the idle | +| loop, into user-mode code, and so on. But if your kernel build allows | +| that sort of scrambling, you have broken far more than just RCU! | ++-----------------------------------------------------------------------+ + +Note that these memory-barrier requirements do not replace the +fundamental RCU requirement that a grace period wait for all +pre-existing readers. On the contrary, the memory barriers called out in +this section must operate in such a way as to *enforce* this fundamental +requirement. Of course, different implementations enforce this +requirement in different ways, but enforce it they must. + +RCU Primitives Guaranteed to Execute Unconditionally +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The common-case RCU primitives are unconditional. They are invoked, they +do their job, and they return, with no possibility of error, and no need +to retry. This is a key RCU design philosophy. + +However, this philosophy is pragmatic rather than pigheaded. If someone +comes up with a good justification for a particular conditional RCU +primitive, it might well be implemented and added. After all, this +guarantee was reverse-engineered, not premeditated. The unconditional +nature of the RCU primitives was initially an accident of +implementation, and later experience with synchronization primitives +with conditional primitives caused me to elevate this accident to a +guarantee. Therefore, the justification for adding a conditional +primitive to RCU would need to be based on detailed and compelling use +cases. + +Guaranteed Read-to-Write Upgrade +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As far as RCU is concerned, it is always possible to carry out an update +within an RCU read-side critical section. For example, that RCU +read-side critical section might search for a given data element, and +then might acquire the update-side spinlock in order to update that +element, all while remaining in that RCU read-side critical section. Of +course, it is necessary to exit the RCU read-side critical section +before invoking ``synchronize_rcu()``, however, this inconvenience can +be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API +members described later in this document. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But how does the upgrade-to-write operation exclude other readers? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| It doesn't, just like normal RCU updates, which also do not exclude | +| RCU readers. | ++-----------------------------------------------------------------------+ + +This guarantee allows lookup code to be shared between read-side and +update-side code, and was premeditated, appearing in the earliest +DYNIX/ptx RCU documentation. + +Fundamental Non-Requirements +---------------------------- + +RCU provides extremely lightweight readers, and its read-side +guarantees, though quite useful, are correspondingly lightweight. It is +therefore all too easy to assume that RCU is guaranteeing more than it +really is. Of course, the list of things that RCU does not guarantee is +infinitely long, however, the following sections list a few +non-guarantees that have caused confusion. Except where otherwise noted, +these non-guarantees were premeditated. + +#. `Readers Impose Minimal Ordering`_ +#. `Readers Do Not Exclude Updaters`_ +#. `Updaters Only Wait For Old Readers`_ +#. `Grace Periods Don't Partition Read-Side Critical Sections`_ +#. `Read-Side Critical Sections Don't Partition Grace Periods`_ + +Readers Impose Minimal Ordering +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Reader-side markers such as ``rcu_read_lock()`` and +``rcu_read_unlock()`` provide absolutely no ordering guarantees except +through their interaction with the grace-period APIs such as +``synchronize_rcu()``. To see this, consider the following pair of +threads: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(x, 1); + 5 rcu_read_unlock(); + 6 rcu_read_lock(); + 7 WRITE_ONCE(y, 1); + 8 rcu_read_unlock(); + 9 } + 10 + 11 void thread1(void) + 12 { + 13 rcu_read_lock(); + 14 r1 = READ_ONCE(y); + 15 rcu_read_unlock(); + 16 rcu_read_lock(); + 17 r2 = READ_ONCE(x); + 18 rcu_read_unlock(); + 19 } + +After ``thread0()`` and ``thread1()`` execute concurrently, it is quite +possible to have + + :: + + (r1 == 1 && r2 == 0) + +(that is, ``y`` appears to have been assigned before ``x``), which would +not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had +much in the way of ordering properties. But they do not, so the CPU is +within its rights to do significant reordering. This is by design: Any +significant ordering constraints would slow down these fast-path APIs. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Can't the compiler also reorder this code? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()`` | +| prevent the compiler from reordering in this particular case. | ++-----------------------------------------------------------------------+ + +Readers Do Not Exclude Updaters +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates. +All they do is to prevent grace periods from ending. The following +example illustrates this: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 r1 = READ_ONCE(y); + 5 if (r1) { + 6 do_something_with_nonzero_x(); + 7 r2 = READ_ONCE(x); + 8 WARN_ON(!r2); /* BUG!!! */ + 9 } + 10 rcu_read_unlock(); + 11 } + 12 + 13 void thread1(void) + 14 { + 15 spin_lock(&my_lock); + 16 WRITE_ONCE(x, 1); + 17 WRITE_ONCE(y, 1); + 18 spin_unlock(&my_lock); + 19 } + +If the ``thread0()`` function's ``rcu_read_lock()`` excluded the +``thread1()`` function's update, the ``WARN_ON()`` could never fire. But +the fact is that ``rcu_read_lock()`` does not exclude much of anything +aside from subsequent grace periods, of which ``thread1()`` has none, so +the ``WARN_ON()`` can and does fire. + +Updaters Only Wait For Old Readers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It might be tempting to assume that after ``synchronize_rcu()`` +completes, there are no readers executing. This temptation must be +avoided because new readers can start immediately after +``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no +obligation to wait for these new readers. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Suppose that synchronize_rcu() did wait until *all* readers had | +| completed instead of waiting only on pre-existing readers. For how | +| long would the updater be able to rely on there being no readers? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| For no time at all. Even if ``synchronize_rcu()`` were to wait until | +| all readers had completed, a new reader might start immediately after | +| ``synchronize_rcu()`` completed. Therefore, the code following | +| ``synchronize_rcu()`` can *never* rely on there being no readers. | ++-----------------------------------------------------------------------+ + +Grace Periods Don't Partition Read-Side Critical Sections +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is tempting to assume that if any part of one RCU read-side critical +section precedes a given grace period, and if any part of another RCU +read-side critical section follows that same grace period, then all of +the first RCU read-side critical section must precede all of the second. +However, this just isn't the case: A single grace period does not +partition the set of RCU read-side critical sections. An example of this +situation can be illustrated as follows, where ``x``, ``y``, and ``z`` +are initially all zero: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 rcu_read_lock(); + 19 r2 = READ_ONCE(b); + 20 r3 = READ_ONCE(c); + 21 rcu_read_unlock(); + 22 } + +It turns out that the outcome: + + :: + + (r1 == 1 && r2 == 0 && r3 == 1) + +is entirely possible. The following figure show how this can happen, +with each circled ``QS`` indicating the point at which RCU recorded a +*quiescent state* for each thread, that is, a state in which RCU knows +that the thread cannot be in the midst of an RCU read-side critical +section that started before the current grace period: + +.. kernel-figure:: GPpartitionReaders1.svg + +If it is necessary to partition RCU read-side critical sections in this +manner, it is necessary to use two grace periods, where the first grace +period is known to end before the second grace period starts: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 r2 = READ_ONCE(c); + 19 synchronize_rcu(); + 20 WRITE_ONCE(d, 1); + 21 } + 22 + 23 void thread3(void) + 24 { + 25 rcu_read_lock(); + 26 r3 = READ_ONCE(b); + 27 r4 = READ_ONCE(d); + 28 rcu_read_unlock(); + 29 } + +Here, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen +before the end of ``thread1()``'s grace period. If in addition +``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after +the beginning of ``thread2()``'s grace period. If it is also the case +that ``(r2 == 1)``, then the end of ``thread1()``'s grace period must +precede the beginning of ``thread2()``'s grace period. This mean that +the two RCU read-side critical sections cannot overlap, guaranteeing +that ``(r3 == 1)``. As a result, the outcome: + + :: + + (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) + +cannot happen. + +This non-requirement was also non-premeditated, but became apparent when +studying RCU's interaction with memory ordering. + +Read-Side Critical Sections Don't Partition Grace Periods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is also tempting to assume that if an RCU read-side critical section +happens between a pair of grace periods, then those grace periods cannot +overlap. However, this temptation leads nowhere good, as can be +illustrated by the following, with all variables initially zero: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 rcu_read_lock(); + 19 WRITE_ONCE(d, 1); + 20 r2 = READ_ONCE(c); + 21 rcu_read_unlock(); + 22 } + 23 + 24 void thread3(void) + 25 { + 26 r3 = READ_ONCE(d); + 27 synchronize_rcu(); + 28 WRITE_ONCE(e, 1); + 29 } + 30 + 31 void thread4(void) + 32 { + 33 rcu_read_lock(); + 34 r4 = READ_ONCE(b); + 35 r5 = READ_ONCE(e); + 36 rcu_read_unlock(); + 37 } + +In this case, the outcome: + + :: + + (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) + +is entirely possible, as illustrated below: + +.. kernel-figure:: ReadersPartitionGP1.svg + +Again, an RCU read-side critical section can overlap almost all of a +given grace period, just so long as it does not overlap the entire grace +period. As a result, an RCU read-side critical section cannot partition +a pair of RCU grace periods. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| How long a sequence of grace periods, each separated by an RCU | +| read-side critical section, would be required to partition the RCU | +| read-side critical sections at the beginning and end of the chain? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In theory, an infinite number. In practice, an unknown number that is | +| sensitive to both implementation details and timing considerations. | +| Therefore, even in practice, RCU users must abide by the theoretical | +| rather than the practical answer. | ++-----------------------------------------------------------------------+ + +Parallelism Facts of Life +------------------------- + +These parallelism facts of life are by no means specific to RCU, but the +RCU implementation must abide by them. They therefore bear repeating: + +#. Any CPU or task may be delayed at any time, and any attempts to avoid + these delays by disabling preemption, interrupts, or whatever are + completely futile. This is most obvious in preemptible user-level + environments and in virtualized environments (where a given guest + OS's VCPUs can be preempted at any time by the underlying + hypervisor), but can also happen in bare-metal environments due to + ECC errors, NMIs, and other hardware events. Although a delay of more + than about 20 seconds can result in splats, the RCU implementation is + obligated to use algorithms that can tolerate extremely long delays, + but where “extremely long” is not long enough to allow wrap-around + when incrementing a 64-bit counter. +#. Both the compiler and the CPU can reorder memory accesses. Where it + matters, RCU must use compiler directives and memory-barrier + instructions to preserve ordering. +#. Conflicting writes to memory locations in any given cache line will + result in expensive cache misses. Greater numbers of concurrent + writes and more-frequent concurrent writes will result in more + dramatic slowdowns. RCU is therefore obligated to use algorithms that + have sufficient locality to avoid significant performance and + scalability problems. +#. As a rough rule of thumb, only one CPU's worth of processing may be + carried out under the protection of any given exclusive lock. RCU + must therefore use scalable locking designs. +#. Counters are finite, especially on 32-bit systems. RCU's use of + counters must therefore tolerate counter wrap, or be designed such + that counter wrap would take way more time than a single system is + likely to run. An uptime of ten years is quite possible, a runtime of + a century much less so. As an example of the latter, RCU's + dyntick-idle nesting counter allows 54 bits for interrupt nesting + level (this counter is 64 bits even on a 32-bit system). Overflowing + this counter requires 2\ :sup:`54` half-interrupts on a given CPU + without that CPU ever going idle. If a half-interrupt happened every + microsecond, it would take 570 years of runtime to overflow this + counter, which is currently believed to be an acceptably long time. +#. Linux systems can have thousands of CPUs running a single Linux + kernel in a single shared-memory environment. RCU must therefore pay + close attention to high-end scalability. + +This last parallelism fact of life means that RCU must pay special +attention to the preceding facts of life. The idea that Linux might +scale to systems with thousands of CPUs would have been met with some +skepticism in the 1990s, but these requirements would have otherwise +have been unsurprising, even in the early 1990s. + +Quality-of-Implementation Requirements +-------------------------------------- + +These sections list quality-of-implementation requirements. Although an +RCU implementation that ignores these requirements could still be used, +it would likely be subject to limitations that would make it +inappropriate for industrial-strength production use. Classes of +quality-of-implementation requirements are as follows: + +#. `Specialization`_ +#. `Performance and Scalability`_ +#. `Forward Progress`_ +#. `Composability`_ +#. `Corner Cases`_ + +These classes is covered in the following sections. + +Specialization +~~~~~~~~~~~~~~ + +RCU is and always has been intended primarily for read-mostly +situations, which means that RCU's read-side primitives are optimized, +often at the expense of its update-side primitives. Experience thus far +is captured by the following list of situations: + +#. Read-mostly data, where stale and inconsistent data is not a problem: + RCU works great! +#. Read-mostly data, where data must be consistent: RCU works well. +#. Read-write data, where data must be consistent: RCU *might* work OK. + Or not. +#. Write-mostly data, where data must be consistent: RCU is very + unlikely to be the right tool for the job, with the following + exceptions, where RCU can provide: + + a. Existence guarantees for update-friendly mechanisms. + b. Wait-free read-side primitives for real-time use. + +This focus on read-mostly situations means that RCU must interoperate +with other synchronization primitives. For example, the ``add_gp()`` and +``remove_gp_synchronous()`` examples discussed earlier use RCU to +protect readers and locking to coordinate updaters. However, the need +extends much farther, requiring that a variety of synchronization +primitives be legal within RCU read-side critical sections, including +spinlocks, sequence locks, atomic operations, reference counters, and +memory barriers. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What about sleeping locks? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| These are forbidden within Linux-kernel RCU read-side critical | +| sections because it is not legal to place a quiescent state (in this | +| case, voluntary context switch) within an RCU read-side critical | +| section. However, sleeping locks may be used within userspace RCU | +| read-side critical sections, and also within Linux-kernel sleepable | +| RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In | +| addition, the -rt patchset turns spinlocks into a sleeping locks so | +| that the corresponding critical sections can be preempted, which also | +| means that these sleeplockified spinlocks (but not other sleeping | +| locks!) may be acquire within -rt-Linux-kernel RCU read-side critical | +| sections. | +| Note that it *is* legal for a normal RCU read-side critical section | +| to conditionally acquire a sleeping locks (as in | +| ``mutex_trylock()``), but only as long as it does not loop | +| indefinitely attempting to conditionally acquire that sleeping locks. | +| The key point is that things like ``mutex_trylock()`` either return | +| with the mutex held, or return an error indication if the mutex was | +| not immediately available. Either way, ``mutex_trylock()`` returns | +| immediately without sleeping. | ++-----------------------------------------------------------------------+ + +It often comes as a surprise that many algorithms do not require a +consistent view of data, but many can function in that mode, with +network routing being the poster child. Internet routing algorithms take +significant time to propagate updates, so that by the time an update +arrives at a given system, that system has been sending network traffic +the wrong way for a considerable length of time. Having a few threads +continue to send traffic the wrong way for a few more milliseconds is +clearly not a problem: In the worst case, TCP retransmissions will +eventually get the data where it needs to go. In general, when tracking +the state of the universe outside of the computer, some level of +inconsistency must be tolerated due to speed-of-light delays if nothing +else. + +Furthermore, uncertainty about external state is inherent in many cases. +For example, a pair of veterinarians might use heartbeat to determine +whether or not a given cat was alive. But how long should they wait +after the last heartbeat to decide that the cat is in fact dead? Waiting +less than 400 milliseconds makes no sense because this would mean that a +relaxed cat would be considered to cycle between death and life more +than 100 times per minute. Moreover, just as with human beings, a cat's +heart might stop for some period of time, so the exact wait period is a +judgment call. One of our pair of veterinarians might wait 30 seconds +before pronouncing the cat dead, while the other might insist on waiting +a full minute. The two veterinarians would then disagree on the state of +the cat during the final 30 seconds of the minute following the last +heartbeat. + +Interestingly enough, this same situation applies to hardware. When push +comes to shove, how do we tell whether or not some external server has +failed? We send messages to it periodically, and declare it failed if we +don't receive a response within a given period of time. Policy decisions +can usually tolerate short periods of inconsistency. The policy was +decided some time ago, and is only now being put into effect, so a few +milliseconds of delay is normally inconsequential. + +However, there are algorithms that absolutely must see consistent data. +For example, the translation between a user-level SystemV semaphore ID +to the corresponding in-kernel data structure is protected by RCU, but +it is absolutely forbidden to update a semaphore that has just been +removed. In the Linux kernel, this need for consistency is accommodated +by acquiring spinlocks located in the in-kernel data structure from +within the RCU read-side critical section, and this is indicated by the +green box in the figure above. Many other techniques may be used, and +are in fact used within the Linux kernel. + +In short, RCU is not required to maintain consistency, and other +mechanisms may be used in concert with RCU when consistency is required. +RCU's specialization allows it to do its job extremely well, and its +ability to interoperate with other synchronization mechanisms allows the +right mix of synchronization tools to be used for a given job. + +Performance and Scalability +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Energy efficiency is a critical component of performance today, and +Linux-kernel RCU implementations must therefore avoid unnecessarily +awakening idle CPUs. I cannot claim that this requirement was +premeditated. In fact, I learned of it during a telephone conversation +in which I was given “frank and open” feedback on the importance of +energy efficiency in battery-powered systems and on specific +energy-efficiency shortcomings of the Linux-kernel RCU implementation. +In my experience, the battery-powered embedded community will consider +any unnecessary wakeups to be extremely unfriendly acts. So much so that +mere Linux-kernel-mailing-list posts are insufficient to vent their ire. + +Memory consumption is not particularly important for in most situations, +and has become decreasingly so as memory sizes have expanded and memory +costs have plummeted. However, as I learned from Matt Mackall's +`bloatwatch `__ efforts, memory +footprint is critically important on single-CPU systems with +non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny +RCU `__ +was born. Josh Triplett has since taken over the small-memory banner +with his `Linux kernel tinification `__ +project, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional +for those kernels not needing it. + +The remaining performance requirements are, for the most part, +unsurprising. For example, in keeping with RCU's read-side +specialization, ``rcu_dereference()`` should have negligible overhead +(for example, suppression of a few minor compiler optimizations). +Similarly, in non-preemptible environments, ``rcu_read_lock()`` and +``rcu_read_unlock()`` should have exactly zero overhead. + +In preemptible environments, in the case where the RCU read-side +critical section was not preempted (as will be the case for the +highest-priority real-time process), ``rcu_read_lock()`` and +``rcu_read_unlock()`` should have minimal overhead. In particular, they +should not contain atomic read-modify-write operations, memory-barrier +instructions, preemption disabling, interrupt disabling, or backwards +branches. However, in the case where the RCU read-side critical section +was preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable +interrupts. This is why it is better to nest an RCU read-side critical +section within a preempt-disable region than vice versa, at least in +cases where that critical section is short enough to avoid unduly +degrading real-time latencies. + +The ``synchronize_rcu()`` grace-period-wait primitive is optimized for +throughput. It may therefore incur several milliseconds of latency in +addition to the duration of the longest RCU read-side critical section. +On the other hand, multiple concurrent invocations of +``synchronize_rcu()`` are required to use batching optimizations so that +they can be satisfied by a single underlying grace-period-wait +operation. For example, in the Linux kernel, it is not unusual for a +single grace-period-wait operation to serve more than `1,000 separate +invocations `__ +of ``synchronize_rcu()``, thus amortizing the per-invocation overhead +down to nearly zero. However, the grace-period optimization is also +required to avoid measurable degradation of real-time scheduling and +interrupt latencies. + +In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are +unacceptable. In these cases, ``synchronize_rcu_expedited()`` may be +used instead, reducing the grace-period latency down to a few tens of +microseconds on small systems, at least in cases where the RCU read-side +critical sections are short. There are currently no special latency +requirements for ``synchronize_rcu_expedited()`` on large systems, but, +consistent with the empirical nature of the RCU specification, that is +subject to change. However, there most definitely are scalability +requirements: A storm of ``synchronize_rcu_expedited()`` invocations on +4096 CPUs should at least make reasonable forward progress. In return +for its shorter latencies, ``synchronize_rcu_expedited()`` is permitted +to impose modest degradation of real-time latency on non-idle online +CPUs. Here, “modest” means roughly the same latency degradation as a +scheduling-clock interrupt. + +There are a number of situations where even +``synchronize_rcu_expedited()``'s reduced grace-period latency is +unacceptable. In these situations, the asynchronous ``call_rcu()`` can +be used in place of ``synchronize_rcu()`` as follows: + + :: + + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 static void remove_gp_cb(struct rcu_head *rhp) + 8 { + 9 struct foo *p = container_of(rhp, struct foo, rh); + 10 + 11 kfree(p); + 12 } + 13 + 14 bool remove_gp_asynchronous(void) + 15 { + 16 struct foo *p; + 17 + 18 spin_lock(&gp_lock); + 19 p = rcu_access_pointer(gp); + 20 if (!p) { + 21 spin_unlock(&gp_lock); + 22 return false; + 23 } + 24 rcu_assign_pointer(gp, NULL); + 25 call_rcu(&p->rh, remove_gp_cb); + 26 spin_unlock(&gp_lock); + 27 return true; + 28 } + +A definition of ``struct foo`` is finally needed, and appears on +lines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()`` +on line 25, and will be invoked after the end of a subsequent grace +period. This gets the same effect as ``remove_gp_synchronous()``, but +without forcing the updater to wait for a grace period to elapse. The +``call_rcu()`` function may be used in a number of situations where +neither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would +be legal, including within preempt-disable code, ``local_bh_disable()`` +code, interrupt-disable code, and interrupt handlers. However, even +``call_rcu()`` is illegal within NMI handlers and from idle and offline +CPUs. The callback function (``remove_gp_cb()`` in this case) will be +executed within softirq (software interrupt) environment within the +Linux kernel, either within a real softirq handler or under the +protection of ``local_bh_disable()``. In both the Linux kernel and in +userspace, it is bad practice to write an RCU callback function that +takes too long. Long-running operations should be relegated to separate +threads or (in the Linux kernel) workqueues. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why does line 19 use ``rcu_access_pointer()``? After all, | +| ``call_rcu()`` on line 25 stores into the structure, which would | +| interact badly with concurrent insertions. Doesn't this mean that | +| ``rcu_dereference()`` is required? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Presumably the ``->gp_lock`` acquired on line 18 excludes any | +| changes, including any insertions that ``rcu_dereference()`` would | +| protect against. Therefore, any insertions will be delayed until | +| after ``->gp_lock`` is released on line 25, which in turn means that | +| ``rcu_access_pointer()`` suffices. | ++-----------------------------------------------------------------------+ + +However, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on +the data element. This is a common idiom, and is supported by +``kfree_rcu()``, which allows “fire and forget” operation as shown +below: + + :: + + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 bool remove_gp_faf(void) + 8 { + 9 struct foo *p; + 10 + 11 spin_lock(&gp_lock); + 12 p = rcu_dereference(gp); + 13 if (!p) { + 14 spin_unlock(&gp_lock); + 15 return false; + 16 } + 17 rcu_assign_pointer(gp, NULL); + 18 kfree_rcu(p, rh); + 19 spin_unlock(&gp_lock); + 20 return true; + 21 } + +Note that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and +proceeds, without any need to pay any further attention to the +subsequent grace period and ``kfree()``. It is permissible to invoke +``kfree_rcu()`` from the same environments as for ``call_rcu()``. +Interestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()`` +and ``kfree_rcu()``, but not ``synchronize_rcu()``. This was due to the +fact that RCU was not heavily used within DYNIX/ptx, so the very few +places that needed something like ``synchronize_rcu()`` simply +open-coded it. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()`` | +| allowed updaters to avoid being blocked by readers. But how can that | +| be correct, given that the invocation of the callback and the freeing | +| of the memory (respectively) must still wait for a grace period to | +| elapse? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| We could define things this way, but keep in mind that this sort of | +| definition would say that updates in garbage-collected languages | +| cannot complete until the next time the garbage collector runs, which | +| does not seem at all reasonable. The key point is that in most cases, | +| an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed | +| to the next update as soon as it has invoked ``call_rcu()`` or | +| ``kfree_rcu()``, without having to wait for a subsequent grace | +| period. | ++-----------------------------------------------------------------------+ + +But what if the updater must wait for the completion of code to be +executed after the end of the grace period, but has other tasks that can +be carried out in the meantime? The polling-style +``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions +may be used for this purpose, as shown below: + + :: + + 1 bool remove_gp_poll(void) + 2 { + 3 struct foo *p; + 4 unsigned long s; + 5 + 6 spin_lock(&gp_lock); + 7 p = rcu_access_pointer(gp); + 8 if (!p) { + 9 spin_unlock(&gp_lock); + 10 return false; + 11 } + 12 rcu_assign_pointer(gp, NULL); + 13 spin_unlock(&gp_lock); + 14 s = get_state_synchronize_rcu(); + 15 do_something_while_waiting(); + 16 cond_synchronize_rcu(s); + 17 kfree(p); + 18 return true; + 19 } + +On line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU, +then line 15 carries out other tasks, and finally, line 16 returns +immediately if a grace period has elapsed in the meantime, but otherwise +waits as required. The need for ``get_state_synchronize_rcu`` and +``cond_synchronize_rcu()`` has appeared quite recently, so it is too +early to tell whether they will stand the test of time. + +RCU thus provides a range of tools to allow updaters to strike the +required tradeoff between latency, flexibility and CPU overhead. + +Forward Progress +~~~~~~~~~~~~~~~~ + +In theory, delaying grace-period completion and callback invocation is +harmless. In practice, not only are memory sizes finite but also +callbacks sometimes do wakeups, and sufficiently deferred wakeups can be +difficult to distinguish from system hangs. Therefore, RCU must provide +a number of mechanisms to promote forward progress. + +These mechanisms are not foolproof, nor can they be. For one simple +example, an infinite loop in an RCU read-side critical section must by +definition prevent later grace periods from ever completing. For a more +involved example, consider a 64-CPU system built with +``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where +CPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``. Even +if these tight loops also contain calls to ``cond_resched()`` (thus +allowing grace periods to complete), CPU 0 simply will not be able to +invoke callbacks as fast as the other 63 CPUs can register them, at +least not until the system runs out of memory. In both of these +examples, the Spiderman principle applies: With great power comes great +responsibility. However, short of this level of abuse, RCU is required +to ensure timely completion of grace periods and timely invocation of +callbacks. + +RCU takes the following steps to encourage timely completion of grace +periods: + +#. If a grace period fails to complete within 100 milliseconds, RCU + causes future invocations of ``cond_resched()`` on the holdout CPUs + to provide an RCU quiescent state. RCU also causes those CPUs' + ``need_resched()`` invocations to return ``true``, but only after the + corresponding CPU's next scheduling-clock. +#. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run + indefinitely in the kernel without scheduling-clock interrupts, which + defeats the above ``need_resched()`` strategem. RCU will therefore + invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out + after 109 milliseconds. +#. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that + has been preempted within an RCU read-side critical section is + holding out for more than 500 milliseconds, RCU will resort to + priority boosting. +#. If a CPU is still holding out 10 seconds into the grace period, RCU + will invoke ``resched_cpu()`` on it regardless of its ``nohz_full`` + state. + +The above values are defaults for systems running with ``HZ=1000``. They +will vary as the value of ``HZ`` varies, and can also be changed using +the relevant Kconfig options and kernel boot parameters. RCU currently +does not do much sanity checking of these parameters, so please use +caution when changing them. Note that these forward-progress measures +are provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks +RCU <#Tasks%20RCU>`__. + +RCU takes the following steps in ``call_rcu()`` to encourage timely +invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has +10,000 callbacks, or has 10,000 more callbacks than it had the last time +encouragement was provided: + +#. Starts a grace period, if one is not already in progress. +#. Forces immediate checking for quiescent states, rather than waiting + for three milliseconds to have elapsed since the beginning of the + grace period. +#. Immediately tags the CPU's callbacks with their grace period + completion numbers, rather than waiting for the ``RCU_SOFTIRQ`` + handler to get around to it. +#. Lifts callback-execution batch limits, which speeds up callback + invocation at the expense of degrading realtime response. + +Again, these are default values when running at ``HZ=1000``, and can be +overridden. Again, these forward-progress measures are provided only for +RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks +RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward +progress for ``rcu_nocbs`` CPUs is much less well-developed, in part +because workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke +``call_rcu()`` relatively infrequently. If workloads emerge that need +both ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then +additional forward-progress work will be required. + +Composability +~~~~~~~~~~~~~ + +Composability has received much attention in recent years, perhaps in +part due to the collision of multicore hardware with object-oriented +techniques designed in single-threaded environments for single-threaded +use. And in theory, RCU read-side critical sections may be composed, and +in fact may be nested arbitrarily deeply. In practice, as with all +real-world implementations of composable constructs, there are +limitations. + +Implementations of RCU for which ``rcu_read_lock()`` and +``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when +``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. After all, there +is no overhead. Except that if all these instances of +``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the +compiler, compilation will eventually fail due to exhausting memory, +mass storage, or user patience, whichever comes first. If the nesting is +not visible to the compiler, as is the case with mutually recursive +functions each in its own translation unit, stack overflow will result. +If the nesting takes the form of loops, perhaps in the guise of tail +recursion, either the control variable will overflow or (in the Linux +kernel) you will get an RCU CPU stall warning. Nevertheless, this class +of RCU implementations is one of the most composable constructs in +existence. + +RCU implementations that explicitly track nesting depth are limited by +the nesting-depth counter. For example, the Linux kernel's preemptible +RCU limits nesting to ``INT_MAX``. This should suffice for almost all +practical purposes. That said, a consecutive pair of RCU read-side +critical sections between which there is an operation that waits for a +grace period cannot be enclosed in another RCU read-side critical +section. This is because it is not legal to wait for a grace period +within an RCU read-side critical section: To do so would result either +in deadlock or in RCU implicitly splitting the enclosing RCU read-side +critical section, neither of which is conducive to a long-lived and +prosperous kernel. + +It is worth noting that RCU is not alone in limiting composability. For +example, many transactional-memory implementations prohibit composing a +pair of transactions separated by an irrevocable operation (for example, +a network receive operation). For another example, lock-based critical +sections can be composed surprisingly freely, but only if deadlock is +avoided. + +In short, although RCU read-side critical sections are highly +composable, care is required in some situations, just as is the case for +any other composable synchronization mechanism. + +Corner Cases +~~~~~~~~~~~~ + +A given RCU workload might have an endless and intense stream of RCU +read-side critical sections, perhaps even so intense that there was +never a point in time during which there was not at least one RCU +read-side critical section in flight. RCU cannot allow this situation to +block grace periods: As long as all the RCU read-side critical sections +are finite, grace periods must also be finite. + +That said, preemptible RCU implementations could potentially result in +RCU read-side critical sections being preempted for long durations, +which has the effect of creating a long-duration RCU read-side critical +section. This situation can arise only in heavily loaded systems, but +systems using real-time priorities are of course more vulnerable. +Therefore, RCU priority boosting is provided to help deal with this +case. That said, the exact requirements on RCU priority boosting will +likely evolve as more experience accumulates. + +Other workloads might have very high update rates. Although one can +argue that such workloads should instead use something other than RCU, +the fact remains that RCU must handle such workloads gracefully. This +requirement is another factor driving batching of grace periods, but it +is also the driving force behind the checks for large numbers of queued +RCU callbacks in the ``call_rcu()`` code path. Finally, high update +rates should not delay RCU read-side critical sections, although some +small read-side delays can occur when using +``synchronize_rcu_expedited()``, courtesy of this function's use of +``smp_call_function_single()``. + +Although all three of these corner cases were understood in the early +1990s, a simple user-level test consisting of ``close(open(path))`` in a +tight loop in the early 2000s suddenly provided a much deeper +appreciation of the high-update-rate corner case. This test also +motivated addition of some RCU code to react to high update rates, for +example, if a given CPU finds itself with more than 10,000 RCU callbacks +queued, it will cause RCU to take evasive action by more aggressively +starting grace periods and more aggressively forcing completion of +grace-period processing. This evasive action causes the grace period to +complete more quickly, but at the cost of restricting RCU's batching +optimizations, thus increasing the CPU overhead incurred by that grace +period. + +Software-Engineering Requirements +--------------------------------- + +Between Murphy's Law and “To err is human”, it is necessary to guard +against mishaps and misuse: + +#. It is all too easy to forget to use ``rcu_read_lock()`` everywhere + that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will + splat if ``rcu_dereference()`` is used outside of an RCU read-side + critical section. Update-side code can use + ``rcu_dereference_protected()``, which takes a `lockdep + expression `__ to indicate what is + providing the protection. If the indicated protection is not + provided, a lockdep splat is emitted. + Code shared between readers and updaters can use + ``rcu_dereference_check()``, which also takes a lockdep expression, + and emits a lockdep splat if neither ``rcu_read_lock()`` nor the + indicated protection is in place. In addition, + ``rcu_dereference_raw()`` is used in those (hopefully rare) cases + where the required protection cannot be easily described. Finally, + ``rcu_read_lock_held()`` is provided to allow a function to verify + that it has been invoked within an RCU read-side critical section. I + was made aware of this set of requirements shortly after Thomas + Gleixner audited a number of RCU uses. +#. A given function might wish to check for RCU-related preconditions + upon entry, before using any other RCU API. The + ``rcu_lockdep_assert()`` does this job, asserting the expression in + kernels having lockdep enabled and doing nothing otherwise. +#. It is also easy to forget to use ``rcu_assign_pointer()`` and + ``rcu_dereference()``, perhaps (incorrectly) substituting a simple + assignment. To catch this sort of error, a given RCU-protected + pointer may be tagged with ``__rcu``, after which sparse will + complain about simple-assignment accesses to that pointer. Arnd + Bergmann made me aware of this requirement, and also supplied the + needed `patch series `__. +#. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if + a data element is passed to ``call_rcu()`` twice in a row, without a + grace period in between. (This error is similar to a double free.) + The corresponding ``rcu_head`` structures that are dynamically + allocated are automatically tracked, but ``rcu_head`` structures + allocated on the stack must be initialized with + ``init_rcu_head_on_stack()`` and cleaned up with + ``destroy_rcu_head_on_stack()``. Similarly, statically allocated + non-stack ``rcu_head`` structures must be initialized with + ``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``. + Mathieu Desnoyers made me aware of this requirement, and also + supplied the needed + `patch `__. +#. An infinite loop in an RCU read-side critical section will eventually + trigger an RCU CPU stall warning splat, with the duration of + “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT`` + ``Kconfig`` option, or, alternatively, by the + ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU + is not obligated to produce this splat unless there is a grace period + waiting on that particular RCU read-side critical section. + + Some extreme workloads might intentionally delay RCU grace periods, + and systems running those workloads can be booted with + ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This + kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU + stall warnings are counter-productive during sysrq dumps and during + panics. RCU therefore supplies the ``rcu_sysrq_start()`` and + ``rcu_sysrq_end()`` API members to be called before and after long + sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is + automatically invoked at the beginning of a panic to suppress further + RCU CPU stall warnings. + + This requirement made itself known in the early 1990s, pretty much + the first time that it was necessary to debug a CPU stall. That said, + the initial implementation in DYNIX/ptx was quite generic in + comparison with that of Linux. + +#. Although it would be very good to detect pointers leaking out of RCU + read-side critical sections, there is currently no good way of doing + this. One complication is the need to distinguish between pointers + leaking and pointers that have been handed off from RCU to some other + synchronization mechanism, for example, reference counting. +#. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information + is provided via event tracing. +#. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()`` + to create typical linked data structures can be surprisingly + error-prone. Therefore, RCU-protected `linked + lists `__ and, + more recently, RCU-protected `hash + tables `__ are available. Many + other special-purpose RCU-protected data structures are available in + the Linux kernel and the userspace RCU library. +#. Some linked structures are created at compile time, but still require + ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves + this purpose. +#. It is not necessary to use ``rcu_assign_pointer()`` when creating + linked structures that are to be published via a single external + pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task + and also for assigning ``NULL`` pointers at runtime. + +This not a hard-and-fast list: RCU's diagnostic capabilities will +continue to be guided by the number and type of usage bugs found in +real-world RCU usage. + +Linux Kernel Complications +-------------------------- + +The Linux kernel provides an interesting environment for all kinds of +software, including RCU. Some of the relevant points of interest are as +follows: + +#. `Configuration`_ +#. `Firmware Interface`_ +#. `Early Boot`_ +#. `Interrupts and NMIs`_ +#. `Loadable Modules`_ +#. `Hotplug CPU`_ +#. `Scheduler and RCU`_ +#. `Tracing and RCU`_ +#. `Accesses to User Memory and RCU`_ +#. `Energy Efficiency`_ +#. `Scheduling-Clock Interrupts and RCU`_ +#. `Memory Efficiency`_ +#. `Performance, Scalability, Response Time, and Reliability`_ + +This list is probably incomplete, but it does give a feel for the most +notable Linux-kernel complications. Each of the following sections +covers one of the above topics. + +Configuration +~~~~~~~~~~~~~ + +RCU's goal is automatic configuration, so that almost nobody needs to +worry about RCU's ``Kconfig`` options. And for almost all users, RCU +does in fact work well “out of the box.” + +However, there are specialized use cases that are handled by kernel boot +parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig`` +system will explicitly ask users about new ``Kconfig`` options, which +requires almost all of them be hidden behind a ``CONFIG_RCU_EXPERT`` +``Kconfig`` option. + +This all should be quite obvious, but the fact remains that Linus +Torvalds recently had to +`remind `__ +me of this requirement. + +Firmware Interface +~~~~~~~~~~~~~~~~~~ + +In many cases, kernel obtains information about the system from the +firmware, and sometimes things are lost in translation. Or the +translation is accurate, but the original message is bogus. + +For example, some systems' firmware overreports the number of CPUs, +sometimes by a large factor. If RCU naively believed the firmware, as it +used to do, it would create too many per-CPU kthreads. Although the +resulting system will still run correctly, the extra kthreads needlessly +consume memory and can cause confusion when they show up in ``ps`` +listings. + +RCU must therefore wait for a given CPU to actually come online before +it can allow itself to believe that the CPU actually exists. The +resulting “ghost CPUs” (which are never going to come online) cause a +number of `interesting +complications `__. + +Early Boot +~~~~~~~~~~ + +The Linux kernel's boot sequence is an interesting process, and RCU is +used early, even before ``rcu_init()`` is invoked. In fact, a number of +RCU's primitives can be used as soon as the initial task's +``task_struct`` is available and the boot CPU's per-CPU variables are +set up. The read-side primitives (``rcu_read_lock()``, +``rcu_read_unlock()``, ``rcu_dereference()``, and +``rcu_access_pointer()``) will operate normally very early on, as will +``rcu_assign_pointer()``. + +Although ``call_rcu()`` may be invoked at any time during boot, +callbacks are not guaranteed to be invoked until after all of RCU's +kthreads have been spawned, which occurs at ``early_initcall()`` time. +This delay in callback invocation is due to the fact that RCU does not +invoke callbacks until it is fully initialized, and this full +initialization cannot occur until after the scheduler has initialized +itself to the point where RCU can spawn and run its kthreads. In theory, +it would be possible to invoke callbacks earlier, however, this is not a +panacea because there would be severe restrictions on what operations +those callbacks could invoke. + +Perhaps surprisingly, ``synchronize_rcu()`` and +``synchronize_rcu_expedited()``, will operate normally during very early +boot, the reason being that there is only one CPU and preemption is +disabled. This means that the call ``synchronize_rcu()`` (or friends) +itself is a quiescent state and thus a grace period, so the early-boot +implementation can be a no-op. + +However, once the scheduler has spawned its first kthread, this early +boot trick fails for ``synchronize_rcu()`` (as well as for +``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The +reason is that an RCU read-side critical section might be preempted, +which means that a subsequent ``synchronize_rcu()`` really does have to +wait for something, as opposed to simply returning immediately. +Unfortunately, ``synchronize_rcu()`` can't do this until all of its +kthreads are spawned, which doesn't happen until some time during +``early_initcalls()`` time. But this is no excuse: RCU is nevertheless +required to correctly handle synchronous grace periods during this time +period. Once all of its kthreads are up and running, RCU starts running +normally. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| How can RCU possibly handle grace periods before all of its kthreads | +| have been spawned??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Very carefully! | +| During the “dead zone” between the time that the scheduler spawns the | +| first task and the time that all of RCU's kthreads have been spawned, | +| all synchronous grace periods are handled by the expedited | +| grace-period mechanism. At runtime, this expedited mechanism relies | +| on workqueues, but during the dead zone the requesting task itself | +| drives the desired expedited grace period. Because dead-zone | +| execution takes place within task context, everything works. Once the | +| dead zone ends, expedited grace periods go back to using workqueues, | +| as is required to avoid problems that would otherwise occur when a | +| user task received a POSIX signal while driving an expedited grace | +| period. | +| | +| And yes, this does mean that it is unhelpful to send POSIX signals to | +| random tasks between the time that the scheduler spawns its first | +| kthread and the time that RCU's kthreads have all been spawned. If | +| there ever turns out to be a good reason for sending POSIX signals | +| during that time, appropriate adjustments will be made. (If it turns | +| out that POSIX signals are sent during this time for no good reason, | +| other adjustments will be made, appropriate or otherwise.) | ++-----------------------------------------------------------------------+ + +I learned of these boot-time requirements as a result of a series of +system hangs. + +Interrupts and NMIs +~~~~~~~~~~~~~~~~~~~ + +The Linux kernel has interrupts, and RCU read-side critical sections are +legal within interrupt handlers and within interrupt-disabled regions of +code, as are invocations of ``call_rcu()``. + +Some Linux-kernel architectures can enter an interrupt handler from +non-idle process context, and then just never leave it, instead +stealthily transitioning back to process context. This trick is +sometimes used to invoke system calls from inside the kernel. These +“half-interrupts” mean that RCU has to be very careful about how it +counts interrupt nesting levels. I learned of this requirement the hard +way during a rewrite of RCU's dyntick-idle code. + +The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side +critical sections are legal within NMI handlers. Thankfully, RCU +update-side primitives, including ``call_rcu()``, are prohibited within +NMI handlers. + +The name notwithstanding, some Linux-kernel architectures can have +nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised +me `__ +with this requirement; he also kindly surprised me with `an +algorithm `__ +that meets this requirement. + +Furthermore, NMI handlers can be interrupted by what appear to RCU to be +normal interrupts. One way that this can happen is for code that +directly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called +from an NMI handler. This astonishing fact of life prompted the current +code structure, which has ``rcu_irq_enter()`` invoking +``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``. +And yes, I also learned of this requirement the hard way. + +Loadable Modules +~~~~~~~~~~~~~~~~ + +The Linux kernel has loadable modules, and these modules can also be +unloaded. After a given module has been unloaded, any attempt to call +one of its functions results in a segmentation fault. The module-unload +functions must therefore cancel any delayed calls to loadable-module +functions, for example, any outstanding ``mod_timer()`` must be dealt +with via ``del_timer_sync()`` or similar. + +Unfortunately, there is no way to cancel an RCU callback; once you +invoke ``call_rcu()``, the callback function is eventually going to be +invoked, unless the system goes down first. Because it is normally +considered socially irresponsible to crash the system in response to a +module unload request, we need some other way to deal with in-flight RCU +callbacks. + +RCU therefore provides ``rcu_barrier()``, which waits until all +in-flight RCU callbacks have been invoked. If a module uses +``call_rcu()``, its exit function should therefore prevent any future +invocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory, +the underlying module-unload code could invoke ``rcu_barrier()`` +unconditionally, but in practice this would incur unacceptable +latencies. + +Nikita Danilov noted this requirement for an analogous +filesystem-unmount situation, and Dipankar Sarma incorporated +``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module +unloading became apparent later. + +.. important:: + + The ``rcu_barrier()`` function is not, repeat, + *not*, obligated to wait for a grace period. It is instead only required + to wait for RCU callbacks that have already been posted. Therefore, if + there are no RCU callbacks posted anywhere in the system, + ``rcu_barrier()`` is within its rights to return immediately. Even if + there are callbacks posted, ``rcu_barrier()`` does not necessarily need + to wait for a grace period. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! Each RCU callbacks must wait for a grace period to | +| complete, and ``rcu_barrier()`` must wait for each pre-existing | +| callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to | +| wait for a full grace period if there is even one callback posted | +| anywhere in the system? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Absolutely not!!! | +| Yes, each RCU callbacks must wait for a grace period to complete, but | +| it might well be partly (or even completely) finished waiting by the | +| time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()`` | +| need only wait for the remaining portion of the grace period to | +| elapse. So even if there are quite a few callbacks posted, | +| ``rcu_barrier()`` might well return quite quickly. | +| | +| So if you need to wait for a grace period as well as for all | +| pre-existing callbacks, you will need to invoke both | +| ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, | +| you can always use workqueues to invoke them concurrently. | ++-----------------------------------------------------------------------+ + +Hotplug CPU +~~~~~~~~~~~ + +The Linux kernel supports CPU hotplug, which means that CPUs can come +and go. It is of course illegal to use any RCU API member from an +offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side +critical sections. This requirement was present from day one in +DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug +implementation is “interesting.” + +The Linux-kernel CPU-hotplug implementation has notifiers that are used +to allow the various kernel subsystems (including RCU) to respond +appropriately to a given CPU-hotplug operation. Most RCU operations may +be invoked from CPU-hotplug notifiers, including even synchronous +grace-period operations such as ``synchronize_rcu()`` and +``synchronize_rcu_expedited()``. + +However, all-callback-wait operations such as ``rcu_barrier()`` are also +not supported, due to the fact that there are phases of CPU-hotplug +operations where the outgoing CPU's callbacks will not be invoked until +after the CPU-hotplug operation ends, which could also result in +deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations +during its execution, which results in another type of deadlock when +invoked from a CPU-hotplug notifier. + +Scheduler and RCU +~~~~~~~~~~~~~~~~~ + +RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time +accumulation by these kthreads. This requirement was no surprise, but +RCU's violation of it when running context-switch-heavy workloads when +built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise +[PDF] `__. +RCU has made good progress towards meeting this requirement, even for +context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is +room for further improvement. + +There is no longer any prohibition against holding any of +scheduler's runqueue or priority-inheritance spinlocks across an +``rcu_read_unlock()``, even if interrupts and preemption were enabled +somewhere within the corresponding RCU read-side critical section. +Therefore, it is now perfectly legal to execute ``rcu_read_lock()`` +with preemption enabled, acquire one of the scheduler locks, and hold +that lock across the matching ``rcu_read_unlock()``. + +Similarly, the RCU flavor consolidation has removed the need for negative +nesting. The fact that interrupt-disabled regions of code act as RCU +read-side critical sections implicitly avoids earlier issues that used +to result in destructive recursion via interrupt handler's use of RCU. + +Tracing and RCU +~~~~~~~~~~~~~~~ + +It is possible to use tracing on RCU code, but tracing itself uses RCU. +For this reason, ``rcu_dereference_raw_check()`` is provided for use +by tracing, which avoids the destructive recursion that could otherwise +ensue. This API is also used by virtualization in some architectures, +where RCU readers execute in environments in which tracing cannot be +used. The tracing folks both located the requirement and provided the +needed fix, so this surprise requirement was relatively painless. + +Accesses to User Memory and RCU +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The kernel needs to access user-space memory, for example, to access data +referenced by system-call parameters. The ``get_user()`` macro does this job. + +However, user-space memory might well be paged out, which means that +``get_user()`` might well page-fault and thus block while waiting for the +resulting I/O to complete. It would be a very bad thing for the compiler to +reorder a ``get_user()`` invocation into an RCU read-side critical section. + +For example, suppose that the source code looked like this: + + :: + + 1 rcu_read_lock(); + 2 p = rcu_dereference(gp); + 3 v = p->value; + 4 rcu_read_unlock(); + 5 get_user(user_v, user_p); + 6 do_something_with(v, user_v); + +The compiler must not be permitted to transform this source code into +the following: + + :: + + 1 rcu_read_lock(); + 2 p = rcu_dereference(gp); + 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!! + 4 v = p->value; + 5 rcu_read_unlock(); + 6 do_something_with(v, user_v); + +If the compiler did make this transformation in a ``CONFIG_PREEMPT=n`` kernel +build, and if ``get_user()`` did page fault, the result would be a quiescent +state in the middle of an RCU read-side critical section. This misplaced +quiescent state could result in line 4 being a use-after-free access, +which could be bad for your kernel's actuarial statistics. Similar examples +can be constructed with the call to ``get_user()`` preceding the +``rcu_read_lock()``. + +Unfortunately, ``get_user()`` doesn't have any particular ordering properties, +and in some architectures the underlying ``asm`` isn't even marked +``volatile``. And even if it was marked ``volatile``, the above access to +``p->value`` is not volatile, so the compiler would not have any reason to keep +those two accesses in order. + +Therefore, the Linux-kernel definitions of ``rcu_read_lock()`` and +``rcu_read_unlock()`` must act as compiler barriers, at least for outermost +instances of ``rcu_read_lock()`` and ``rcu_read_unlock()`` within a nested set +of RCU read-side critical sections. + +Energy Efficiency +~~~~~~~~~~~~~~~~~ + +Interrupting idle CPUs is considered socially unacceptable, especially +by people with battery-powered embedded systems. RCU therefore conserves +energy by detecting which CPUs are idle, including tracking CPUs that +have been interrupted from idle. This is a large part of the +energy-efficiency requirement, so I learned of this via an irate phone +call. + +Because RCU avoids interrupting idle CPUs, it is illegal to execute an +RCU read-side critical section on an idle CPU. (Kernels built with +``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()`` +macro and ``_rcuidle`` event tracing is provided to work around this +restriction. In addition, ``rcu_is_watching()`` may be used to test +whether or not it is currently legal to run RCU read-side critical +sections on this CPU. I learned of the need for diagnostics on the one +hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code. +Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite +heavily in the idle loop. However, there are some restrictions on the +code placed within ``RCU_NONIDLE()``: + +#. Blocking is prohibited. In practice, this is not a serious + restriction given that idle tasks are prohibited from blocking to + begin with. +#. Although nesting ``RCU_NONIDLE()`` is permitted, they cannot nest + indefinitely deeply. However, given that they can be nested on the + order of a million deep, even on 32-bit systems, this should not be a + serious restriction. This nesting limit would probably be reached + long after the compiler OOMed or the stack overflowed. +#. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that + same ``RCU_NONIDLE()``. For example, the following is grossly + illegal: + + :: + + 1 RCU_NONIDLE({ + 2 do_something(); + 3 goto bad_idea; /* BUG!!! */ + 4 do_something_else();}); + 5 bad_idea: + + + It is just as illegal to transfer control into the middle of + ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in + as long as you also transferred out, but in practice you could also + expect to get sharply worded review comments. + +It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU +running in userspace. RCU must therefore track ``nohz_full`` userspace +execution. RCU must therefore be able to sample state at two points in +time, and be able to determine whether or not some other CPU spent any +time idle and/or executing in userspace. + +These energy-efficiency requirements have proven quite difficult to +understand and to meet, for example, there have been more than five +clean-sheet rewrites of RCU's energy-efficiency code, the last of which +was finally able to demonstrate `real energy savings running on real +hardware +[PDF] `__. +As noted earlier, I learned of many of these requirements via angry +phone calls: Flaming me on the Linux-kernel mailing list was apparently +not sufficient to fully vent their ire at RCU's energy-efficiency bugs! + +Scheduling-Clock Interrupts and RCU +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The kernel transitions between in-kernel non-idle execution, userspace +execution, and the idle loop. Depending on kernel configuration, RCU +handles these states differently: + ++-----------------+------------------+------------------+-----------------+ +| ``HZ`` Kconfig | In-Kernel | Usermode | Idle | ++=================+==================+==================+=================+ +| ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on | +| | scheduling-clock | scheduling-clock | RCU's | +| | interrupt. | interrupt and | dyntick-idle | +| | | its detection | detection. | +| | | of interrupt | | +| | | from usermode. | | ++-----------------+------------------+------------------+-----------------+ +| ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on | +| | scheduling-clock | scheduling-clock | RCU's | +| | interrupt. | interrupt and | dyntick-idle | +| | | its detection | detection. | +| | | of interrupt | | +| | | from usermode. | | ++-----------------+------------------+------------------+-----------------+ +| ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on | +| | sometimes rely | RCU's | RCU's | +| | on | dyntick-idle | dyntick-idle | +| | scheduling-clock | detection. | detection. | +| | interrupt. In | | | +| | other cases, it | | | +| | is necessary to | | | +| | bound kernel | | | +| | execution times | | | +| | and/or use | | | +| | IPIs. | | | ++-----------------+------------------+------------------+-----------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why can't ``NO_HZ_FULL`` in-kernel execution rely on the | +| scheduling-clock interrupt, just like ``HZ_PERIODIC`` and | +| ``NO_HZ_IDLE`` do? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because, as a performance optimization, ``NO_HZ_FULL`` does not | +| necessarily re-enable the scheduling-clock interrupt on entry to each | +| and every system call. | ++-----------------------------------------------------------------------+ + +However, RCU must be reliably informed as to whether any given CPU is +currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that +CPU is executing in usermode, as discussed +`earlier <#Energy%20Efficiency>`__. It also requires that the +scheduling-clock interrupt be enabled when RCU needs it to be: + +#. If a CPU is either idle or executing in usermode, and RCU believes it + is non-idle, the scheduling-clock tick had better be running. + Otherwise, you will get RCU CPU stall warnings. Or at best, very long + (11-second) grace periods, with a pointless IPI waking the CPU from + time to time. +#. If a CPU is in a portion of the kernel that executes RCU read-side + critical sections, and RCU believes this CPU to be idle, you will get + random memory corruption. **DON'T DO THIS!!!** + This is one reason to test with lockdep, which will complain about + this sort of thing. +#. If a CPU is in a portion of the kernel that is absolutely positively + no-joking guaranteed to never execute any RCU read-side critical + sections, and RCU believes this CPU to be idle, no problem. This + sort of thing is used by some architectures for light-weight + exception handlers, which can then avoid the overhead of + ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and + exit, respectively. Some go further and avoid the entireties of + ``irq_enter()`` and ``irq_exit()``. + Just make very sure you are running some of your tests with + ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in + fact joking about not doing RCU read-side critical sections. +#. If a CPU is executing in the kernel with the scheduling-clock + interrupt disabled and RCU believes this CPU to be non-idle, and if + the CPU goes idle (from an RCU perspective) every few jiffies, no + problem. It is usually OK for there to be the occasional gap between + idle periods of up to a second or so. + If the gap grows too long, you get RCU CPU stall warnings. +#. If a CPU is either idle or executing in usermode, and RCU believes it + to be idle, of course no problem. +#. If a CPU is executing in the kernel, the kernel code path is passing + through quiescent states at a reasonable frequency (preferably about + once per few jiffies, but the occasional excursion to a second or so + is usually OK) and the scheduling-clock interrupt is enabled, of + course no problem. + If the gap between a successive pair of quiescent states grows too + long, you get RCU CPU stall warnings. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But what if my driver has a hardware interrupt handler that can run | +| for many seconds? I cannot invoke ``schedule()`` from an hardware | +| interrupt handler, after all! | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| One approach is to do ``rcu_irq_exit();rcu_irq_enter();`` every so | +| often. But given that long-running interrupt handlers can cause other | +| problems, not least for response time, shouldn't you work to keep | +| your interrupt handler's runtime within reasonable bounds? | ++-----------------------------------------------------------------------+ + +But as long as RCU is properly informed of kernel state transitions +between in-kernel execution, usermode execution, and idle, and as long +as the scheduling-clock interrupt is enabled when RCU needs it to be, +you can rest assured that the bugs you encounter will be in some other +part of RCU or some other part of the kernel! + +Memory Efficiency +~~~~~~~~~~~~~~~~~ + +Although small-memory non-realtime systems can simply use Tiny RCU, code +size is only one aspect of memory efficiency. Another aspect is the size +of the ``rcu_head`` structure used by ``call_rcu()`` and +``kfree_rcu()``. Although this structure contains nothing more than a +pair of pointers, it does appear in many RCU-protected data structures, +including some that are size critical. The ``page`` structure is a case +in point, as evidenced by the many occurrences of the ``union`` keyword +within that structure. + +This need for memory efficiency is one reason that RCU uses hand-crafted +singly linked lists to track the ``rcu_head`` structures that are +waiting for a grace period to elapse. It is also the reason why +``rcu_head`` structures do not contain debug information, such as fields +tracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that +posted them. Although this information might appear in debug-only kernel +builds at some point, in the meantime, the ``->func`` field will often +provide the needed debug information. + +However, in some cases, the need for memory efficiency leads to even +more extreme measures. Returning to the ``page`` structure, the +``rcu_head`` field shares storage with a great many other structures +that are used at various points in the corresponding page's lifetime. In +order to correctly resolve certain `race +conditions `__, +the Linux kernel's memory-management subsystem needs a particular bit to +remain zero during all phases of grace-period processing, and that bit +happens to map to the bottom bit of the ``rcu_head`` structure's +``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is +used to post the callback, as opposed to ``kfree_rcu()`` or some future +“lazy” variant of ``call_rcu()`` that might one day be created for +energy-efficiency purposes. + +That said, there are limits. RCU requires that the ``rcu_head`` +structure be aligned to a two-byte boundary, and passing a misaligned +``rcu_head`` structure to one of the ``call_rcu()`` family of functions +will result in a splat. It is therefore necessary to exercise caution +when packing structures containing fields of type ``rcu_head``. Why not +a four-byte or even eight-byte alignment requirement? Because the m68k +architecture provides only two-byte alignment, and thus acts as +alignment's least common denominator. + +The reason for reserving the bottom bit of pointers to ``rcu_head`` +structures is to leave the door open to “lazy” callbacks whose +invocations can safely be deferred. Deferring invocation could +potentially have energy-efficiency benefits, but only if the rate of +non-lazy callbacks decreases significantly for some important workload. +In the meantime, reserving the bottom bit keeps this option open in case +it one day becomes useful. + +Performance, Scalability, Response Time, and Reliability +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Expanding on the `earlier +discussion <#Performance%20and%20Scalability>`__, RCU is used heavily by +hot code paths in performance-critical portions of the Linux kernel's +networking, security, virtualization, and scheduling code paths. RCU +must therefore use efficient implementations, especially in its +read-side primitives. To that end, it would be good if preemptible RCU's +implementation of ``rcu_read_lock()`` could be inlined, however, doing +this requires resolving ``#include`` issues with the ``task_struct`` +structure. + +The Linux kernel supports hardware configurations with up to 4096 CPUs, +which means that RCU must be extremely scalable. Algorithms that involve +frequent acquisitions of global locks or frequent atomic operations on +global variables simply cannot be tolerated within the RCU +implementation. RCU therefore makes heavy use of a combining tree based +on the ``rcu_node`` structure. RCU is required to tolerate all CPUs +continuously invoking any combination of RCU's runtime primitives with +minimal per-operation overhead. In fact, in many cases, increasing load +must *decrease* the per-operation overhead, witness the batching +optimizations for ``synchronize_rcu()``, ``call_rcu()``, +``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general +rule, RCU must cheerfully accept whatever the rest of the Linux kernel +decides to throw at it. + +The Linux kernel is used for real-time workloads, especially in +conjunction with the `-rt +patchset `__. The +real-time-latency response requirements are such that the traditional +approach of disabling preemption across RCU read-side critical sections +is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use +an RCU implementation that allows RCU read-side critical sections to be +preempted. This requirement made its presence known after users made it +clear that an earlier `real-time +patch `__ did not meet their needs, in +conjunction with some `RCU +issues `__ +encountered by a very early version of the -rt patchset. + +In addition, RCU must make do with a sub-100-microsecond real-time +latency budget. In fact, on smaller systems with the -rt patchset, the +Linux kernel provides sub-20-microsecond real-time latencies for the +whole kernel, including RCU. RCU's scalability and latency must +therefore be sufficient for these sorts of configurations. To my +surprise, the sub-100-microsecond real-time latency budget `applies to +even the largest systems +[PDF] `__, +up to and including systems with 4096 CPUs. This real-time requirement +motivated the grace-period kthread, which also simplified handling of a +number of race conditions. + +RCU must avoid degrading real-time response for CPU-bound threads, +whether executing in usermode (which is one use case for +``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in +the kernel must execute ``cond_resched()`` at least once per few tens of +milliseconds in order to avoid receiving an IPI from RCU. + +Finally, RCU's status as a synchronization primitive means that any RCU +failure can result in arbitrary memory corruption that can be extremely +difficult to debug. This means that RCU must be extremely reliable, +which in practice also means that RCU must have an aggressive +stress-test suite. This stress-test suite is called ``rcutorture``. + +Although the need for ``rcutorture`` was no surprise, the current +immense popularity of the Linux kernel is posing interesting—and perhaps +unprecedented—validation challenges. To see this, keep in mind that +there are well over one billion instances of the Linux kernel running +today, given Android smartphones, Linux-powered televisions, and +servers. This number can be expected to increase sharply with the advent +of the celebrated Internet of Things. + +Suppose that RCU contains a race condition that manifests on average +once per million years of runtime. This bug will be occurring about +three times per *day* across the installed base. RCU could simply hide +behind hardware error rates, given that no one should really expect +their smartphone to last for a million years. However, anyone taking too +much comfort from this thought should consider the fact that in most +jurisdictions, a successful multi-year test of a given mechanism, which +might include a Linux kernel, suffices for a number of types of +safety-critical certifications. In fact, rumor has it that the Linux +kernel is already being used in production for safety-critical +applications. I don't know about you, but I would feel quite bad if a +bug in RCU killed someone. Which might explain my recent focus on +validation and verification. + +Other RCU Flavors +----------------- + +One of the more surprising things about RCU is that there are now no +fewer than five *flavors*, or API families. In addition, the primary +flavor that has been the sole focus up to this point has two different +implementations, non-preemptible and preemptible. The other four flavors +are listed below, with requirements for each described in a separate +section. + +#. `Bottom-Half Flavor (Historical)`_ +#. `Sched Flavor (Historical)`_ +#. `Sleepable RCU`_ +#. `Tasks RCU`_ + +Bottom-Half Flavor (Historical) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The RCU-bh flavor of RCU has since been expressed in terms of the other +RCU flavors as part of a consolidation of the three flavors into a +single flavor. The read-side API remains, and continues to disable +softirq and to be accounted for by lockdep. Much of the material in this +section is therefore strictly historical in nature. + +The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations) +flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a +flavor of RCU that could withstand the network-based denial-of-service +attacks researched by Robert Olsson. These attacks placed so much +networking load on the system that some of the CPUs never exited softirq +execution, which in turn prevented those CPUs from ever executing a +context switch, which, in the RCU implementation of that time, prevented +grace periods from ever ending. The result was an out-of-memory +condition and a system hang. + +The solution was the creation of RCU-bh, which does +``local_bh_disable()`` across its read-side critical sections, and which +uses the transition from one type of softirq processing to another as a +quiescent state in addition to context switch, idle, user mode, and +offline. This means that RCU-bh grace periods can complete even when +some of the CPUs execute in softirq indefinitely, thus allowing +algorithms based on RCU-bh to withstand network-based denial-of-service +attacks. + +Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and +re-enable softirq handlers, any attempt to start a softirq handlers +during the RCU-bh read-side critical section will be deferred. In this +case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can +take considerable time. One can of course argue that this softirq +overhead should be associated with the code following the RCU-bh +read-side critical section rather than ``rcu_read_unlock_bh()``, but the +fact is that most profiling tools cannot be expected to make this sort +of fine distinction. For example, suppose that a three-millisecond-long +RCU-bh read-side critical section executes during a time of heavy +networking load. There will very likely be an attempt to invoke at least +one softirq handler during that three milliseconds, but any such +invocation will be delayed until the time of the +``rcu_read_unlock_bh()``. This can of course make it appear at first +glance as if ``rcu_read_unlock_bh()`` was executing very slowly. + +The `RCU-bh +API `__ +includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``, +``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``, +``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``, +``call_rcu_bh()``, ``rcu_barrier_bh()``, and +``rcu_read_lock_bh_held()``. However, the update-side APIs are now +simple wrappers for other RCU flavors, namely RCU-sched in +CONFIG_PREEMPT=n kernels and RCU-preempt otherwise. + +Sched Flavor (Historical) +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The RCU-sched flavor of RCU has since been expressed in terms of the +other RCU flavors as part of a consolidation of the three flavors into a +single flavor. The read-side API remains, and continues to disable +preemption and to be accounted for by lockdep. Much of the material in +this section is therefore strictly historical in nature. + +Before preemptible RCU, waiting for an RCU grace period had the side +effect of also waiting for all pre-existing interrupt and NMI handlers. +However, there are legitimate preemptible-RCU implementations that do +not have this property, given that any point in the code outside of an +RCU read-side critical section can be a quiescent state. Therefore, +*RCU-sched* was created, which follows “classic” RCU in that an +RCU-sched grace period waits for pre-existing interrupt and NMI +handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and +RCU-sched APIs have identical implementations, while kernels built with +``CONFIG_PREEMPT=y`` provide a separate implementation for each. + +Note well that in ``CONFIG_PREEMPT=y`` kernels, +``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and +re-enable preemption, respectively. This means that if there was a +preemption attempt during the RCU-sched read-side critical section, +``rcu_read_unlock_sched()`` will enter the scheduler, with all the +latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``, +this can make it look as if ``rcu_read_unlock_sched()`` was executing +very slowly. However, the highest-priority task won't be preempted, so +that task will enjoy low-overhead ``rcu_read_unlock_sched()`` +invocations. + +The `RCU-sched +API `__ +includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``, +``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``, +``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``, +``synchronize_sched()``, ``synchronize_rcu_sched_expedited()``, +``call_rcu_sched()``, ``rcu_barrier_sched()``, and +``rcu_read_lock_sched_held()``. However, anything that disables +preemption also marks an RCU-sched read-side critical section, including +``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and +``local_irq_restore()``, and so on. + +Sleepable RCU +~~~~~~~~~~~~~ + +For well over a decade, someone saying “I need to block within an RCU +read-side critical section” was a reliable indication that this someone +did not understand RCU. After all, if you are always blocking in an RCU +read-side critical section, you can probably afford to use a +higher-overhead synchronization mechanism. However, that changed with +the advent of the Linux kernel's notifiers, whose RCU read-side critical +sections almost never sleep, but sometimes need to. This resulted in the +introduction of `sleepable RCU `__, or +*SRCU*. + +SRCU allows different domains to be defined, with each such domain +defined by an instance of an ``srcu_struct`` structure. A pointer to +this structure must be passed in to each SRCU function, for example, +``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct`` +structure. The key benefit of these domains is that a slow SRCU reader +in one domain does not delay an SRCU grace period in some other domain. +That said, one consequence of these domains is that read-side code must +pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for +example, as follows: + + :: + + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 srcu_read_unlock(&ss, idx); + +As noted above, it is legal to block within SRCU read-side critical +sections, however, with great power comes great responsibility. If you +block forever in one of a given domain's SRCU read-side critical +sections, then that domain's grace periods will also be blocked forever. +Of course, one good way to block forever is to deadlock, which can +happen if any operation in a given domain's SRCU read-side critical +section can wait, either directly or indirectly, for that domain's grace +period to elapse. For example, this results in a self-deadlock: + + :: + + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 synchronize_srcu(&ss); + 6 srcu_read_unlock(&ss, idx); + +However, if line 5 acquired a mutex that was held across a +``synchronize_srcu()`` for domain ``ss``, deadlock would still be +possible. Furthermore, if line 5 acquired a mutex that was held across a +``synchronize_srcu()`` for some other domain ``ss1``, and if an +``ss1``-domain SRCU read-side critical section acquired another mutex +that was held across as ``ss``-domain ``synchronize_srcu()``, deadlock +would again be possible. Such a deadlock cycle could extend across an +arbitrarily large number of different SRCU domains. Again, with great +power comes great responsibility. + +Unlike the other RCU flavors, SRCU read-side critical sections can run +on idle and even offline CPUs. This ability requires that +``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers, +which means that SRCU readers will run a bit slower than would RCU +readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API, +which, in combination with ``srcu_read_unlock()``, guarantees a full +memory barrier. + +Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be +invoked from CPU-hotplug notifiers, due to the fact that SRCU grace +periods make use of timers and the possibility of timers being +temporarily “stranded” on the outgoing CPU. This stranding of timers +means that timers posted to the outgoing CPU will not fire until late in +the CPU-hotplug process. The problem is that if a notifier is waiting on +an SRCU grace period, that grace period is waiting on a timer, and that +timer is stranded on the outgoing CPU, then the notifier will never be +awakened, in other words, deadlock has occurred. This same situation of +course also prohibits ``srcu_barrier()`` from being invoked from +CPU-hotplug notifiers. + +SRCU also differs from other RCU flavors in that SRCU's expedited and +non-expedited grace periods are implemented by the same mechanism. This +means that in the current SRCU implementation, expediting a future grace +period has the side effect of expediting all prior grace periods that +have not yet completed. (But please note that this is a property of the +current implementation, not necessarily of future implementations.) In +addition, if SRCU has been idle for longer than the interval specified +by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds +by default), and if a ``synchronize_srcu()`` invocation ends this idle +period, that invocation will be automatically expedited. + +As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a +locking bottleneck present in prior kernel versions. Although this will +allow users to put much heavier stress on ``call_srcu()``, it is +important to note that SRCU does not yet take any special steps to deal +with callback flooding. So if you are posting (say) 10,000 SRCU +callbacks per second per CPU, you are probably totally OK, but if you +intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please +run some tests first. SRCU just might need a few adjustment to deal with +that sort of load. Of course, your mileage may vary based on the speed +of your CPUs and the size of your memory. + +The `SRCU +API `__ +includes ``srcu_read_lock()``, ``srcu_read_unlock()``, +``srcu_dereference()``, ``srcu_dereference_check()``, +``synchronize_srcu()``, ``synchronize_srcu_expedited()``, +``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It +also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and +``init_srcu_struct()`` APIs for defining and initializing +``srcu_struct`` structures. + +Tasks RCU +~~~~~~~~~ + +Some forms of tracing use “trampolines” to handle the binary rewriting +required to install different types of probes. It would be good to be +able to free old trampolines, which sounds like a job for some form of +RCU. However, because it is necessary to be able to install a trace +anywhere in the code, it is not possible to use read-side markers such +as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does +not work to have these markers in the trampoline itself, because there +would need to be instructions following ``rcu_read_unlock()``. Although +``synchronize_rcu()`` would guarantee that execution reached the +``rcu_read_unlock()``, it would not be able to guarantee that execution +had completely left the trampoline. Worse yet, in some situations +the trampoline's protection must extend a few instructions *prior* to +execution reaching the trampoline. For example, these few instructions +might calculate the address of the trampoline, so that entering the +trampoline would be pre-ordained a surprisingly long time before execution +actually reached the trampoline itself. + +The solution, in the form of `Tasks +RCU `__, is to have implicit read-side +critical sections that are delimited by voluntary context switches, that +is, calls to ``schedule()``, ``cond_resched()``, and +``synchronize_rcu_tasks()``. In addition, transitions to and from +userspace execution also delimit tasks-RCU read-side critical sections. + +The tasks-RCU API is quite compact, consisting only of +``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and +``rcu_barrier_tasks()``. In ``CONFIG_PREEMPT=n`` kernels, trampolines +cannot be preempted, so these APIs map to ``call_rcu()``, +``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In +``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these +three APIs are therefore implemented by separate functions that check +for voluntary context switches. + +Possible Future Changes +----------------------- + +One of the tricks that RCU uses to attain update-side scalability is to +increase grace-period latency with increasing numbers of CPUs. If this +becomes a serious problem, it will be necessary to rework the +grace-period state machine so as to avoid the need for the additional +latency. + +RCU disables CPU hotplug in a few places, perhaps most notably in the +``rcu_barrier()`` operations. If there is a strong reason to use +``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to +avoid disabling CPU hotplug. This would introduce some complexity, so +there had better be a *very* good reason. + +The tradeoff between grace-period latency on the one hand and +interruptions of other CPUs on the other hand may need to be +re-examined. The desire is of course for zero grace-period latency as +well as zero interprocessor interrupts undertaken during an expedited +grace period operation. While this ideal is unlikely to be achievable, +it is quite possible that further improvements can be made. + +The multiprocessor implementations of RCU use a combining tree that +groups CPUs so as to reduce lock contention and increase cache locality. +However, this combining tree does not spread its memory across NUMA +nodes nor does it align the CPU groups with hardware features such as +sockets or cores. Such spreading and alignment is currently believed to +be unnecessary because the hotpath read-side primitives do not access +the combining tree, nor does ``call_rcu()`` in the common case. If you +believe that your architecture needs such spreading and alignment, then +your architecture should also benefit from the +``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the +number of CPUs in a socket, NUMA node, or whatever. If the number of +CPUs is too large, use a fraction of the number of CPUs. If the number +of CPUs is a large prime number, well, that certainly is an +“interesting” architectural choice! More flexible arrangements might be +considered, but only if ``rcutree.rcu_fanout_leaf`` has proven +inadequate, and only if the inadequacy has been demonstrated by a +carefully run and realistic system-level workload. + +Please note that arrangements that require RCU to remap CPU numbers will +require extremely good demonstration of need and full exploration of +alternatives. + +RCU's various kthreads are reasonably recent additions. It is quite +likely that adjustments will be required to more gracefully handle +extreme loads. It might also be necessary to be able to relate CPU +utilization by RCU's kthreads and softirq handlers to the code that +instigated this CPU utilization. For example, RCU callback overhead +might be charged back to the originating ``call_rcu()`` instance, though +probably not in production kernels. + +Additional work may be required to provide reasonable forward-progress +guarantees under heavy load for grace periods and for callback +invocation. + +Summary +------- + +This document has presented more than two decade's worth of RCU +requirements. Given that the requirements keep changing, this will not +be the last word on this subject, but at least it serves to get an +important subset of the requirements set forth. + +Acknowledgments +--------------- + +I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg +Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy +Lutomirski for their help in rendering this article human readable, and +to Michelle Rankin for her support of this effort. Other contributions +are acknowledged in the Linux kernel's git archive. -- cgit v1.2.3