From 5d1646d90e1f2cceb9f0828f4b28318cd0ec7744 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sat, 27 Apr 2024 12:05:51 +0200
Subject: Adding upstream version 5.10.209.

Signed-off-by: Daniel Baumann
---
 .../Memory-Ordering/Tree-RCU-Memory-Ordering.rst   |  624 +++
 .../TreeRCU-callback-invocation.svg                |  486 ++
 .../Memory-Ordering/TreeRCU-callback-registry.svg  |  655 +++
 .../RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg |  700 +++
 .../Design/Memory-Ordering/TreeRCU-gp-cleanup.svg  | 1133 +++++
 .../RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg  | 1309 +++++
 .../Design/Memory-Ordering/TreeRCU-gp-init-1.svg   |  658 +++
 .../Design/Memory-Ordering/TreeRCU-gp-init-2.svg   |  656 +++
 .../Design/Memory-Ordering/TreeRCU-gp-init-3.svg   |  636 +++
 .../RCU/Design/Memory-Ordering/TreeRCU-gp.svg      | 5144 ++++++++++++++++++++
 .../RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg |  775 +++
 .../RCU/Design/Memory-Ordering/TreeRCU-qs.svg      | 1095 +++++
 .../RCU/Design/Memory-Ordering/rcu_node-lock.svg   |  229 +
 13 files changed, 14100 insertions(+)
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
 create mode 100644 Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg

diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
new file mode 100644
index 000000000..83ae3b79a
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
@@ -0,0 +1,624 @@
+======================================================
+A Tour Through TREE_RCU's Grace-Period Memory Ordering
+======================================================
+
+August 8, 2017
+
+This article was contributed by Paul E. McKenney
+
+Introduction
+============
+
+This document gives a rough visual overview of how Tree RCU's
+grace-period memory ordering guarantee is provided.
+
+What Is Tree RCU's Grace Period Memory Ordering Guarantee?
+==========================================================
+
+RCU grace periods provide extremely strong memory-ordering guarantees
+for non-idle non-offline code.
+Any code that happens after the end of a given RCU grace period is guaranteed
+to see the effects of all accesses prior to the beginning of that grace
+period that are within RCU read-side critical sections.
+Similarly, any code that happens before the beginning of a given RCU grace
+period is guaranteed to not see the effects of all accesses following the end
+of that grace period that are within RCU read-side critical sections.
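+
+As a rough illustration of how updaters rely on this guarantee, consider
+the following sketch. It is not taken from the kernel source:
+``gp``, ``gp_lock``, ``reader()``, ``remove_gp()``, and
+``do_something_with()`` are all hypothetical names. ::
+
+    struct foo {
+        int a;
+    };
+    struct foo __rcu *gp;     /* Hypothetical RCU-protected pointer. */
+    DEFINE_SPINLOCK(gp_lock); /* Serializes hypothetical updaters. */
+
+    void reader(void)
+    {
+        struct foo *p;
+
+        rcu_read_lock();                 /* Begin read-side critical section. */
+        p = rcu_dereference(gp);
+        if (p)
+            do_something_with(p->a);     /* p cannot yet have been freed. */
+        rcu_read_unlock();               /* End read-side critical section. */
+    }
+
+    void remove_gp(void)
+    {
+        struct foo *p;
+
+        spin_lock(&gp_lock);
+        p = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
+        rcu_assign_pointer(gp, NULL);    /* Phase one: unlink. */
+        spin_unlock(&gp_lock);
+        synchronize_rcu();               /* Wait for pre-existing readers. */
+        kfree(p);                        /* Phase two: free. */
+    }
+
+Because ``synchronize_rcu()`` does not return until all pre-existing RCU
+read-side critical sections have completed, no reader can still be
+referencing the old structure when ``kfree()`` executes.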
+
+Note well that RCU-sched read-side critical sections include any region
+of code for which preemption is disabled.
+Given that each individual machine instruction can be thought of as
+an extremely small region of preemption-disabled code, one can think of
+``synchronize_rcu()`` as ``smp_mb()`` on steroids.
+
+RCU updaters use this guarantee by splitting their updates into
+two phases, one of which is executed before the grace period and
+the other of which is executed after the grace period.
+In the most common use case, phase one removes an element from
+a linked RCU-protected data structure, and phase two frees that element.
+For this to work, any readers that have witnessed state prior to the
+phase-one update (in the common case, removal) must not witness state
+following the phase-two update (in the common case, freeing).
+
+The RCU implementation provides this guarantee using a network
+of lock-based critical sections, memory barriers, and per-CPU
+processing, as is described in the following sections.
+
+Tree RCU Grace Period Memory Ordering Building Blocks
+=====================================================
+
+The workhorse for RCU's grace-period memory ordering is the
+critical section for the ``rcu_node`` structure's
+``->lock``. These critical sections use helper functions for lock
+acquisition, including ``raw_spin_lock_rcu_node()``,
+``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``.
+Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``,
+``raw_spin_unlock_irq_rcu_node()``, and
+``raw_spin_unlock_irqrestore_rcu_node()``, respectively.
+For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided.
+The key point is that the lock-acquisition functions, including
+``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()``
+immediately after successful acquisition of the lock.
+
+Therefore, for any given ``rcu_node`` structure, any access
+happening before one of the above lock-release functions will be seen
+by all CPUs as happening before any access happening after a later
+one of the above lock-acquisition functions.
+Furthermore, any access happening before one of the
+above lock-release functions on any given CPU will be seen by all
+CPUs as happening before any access happening after a later one
+of the above lock-acquisition functions executing on that same CPU,
+even if the lock-release and lock-acquisition functions are operating
+on different ``rcu_node`` structures.
+Tree RCU uses these two ordering guarantees to form an ordering
+network among all CPUs that were in any way involved in the grace
+period, including any CPUs that came online or went offline during
+the grace period in question.
+
+The following litmus test exhibits the ordering effects of these
+lock-acquisition and lock-release functions::
+
+    1 int x, y, z;
+    2
+    3 void task0(void)
+    4 {
+    5   raw_spin_lock_rcu_node(rnp);
+    6   WRITE_ONCE(x, 1);
+    7   r1 = READ_ONCE(y);
+    8   raw_spin_unlock_rcu_node(rnp);
+    9 }
+   10
+   11 void task1(void)
+   12 {
+   13   raw_spin_lock_rcu_node(rnp);
+   14   WRITE_ONCE(y, 1);
+   15   r2 = READ_ONCE(z);
+   16   raw_spin_unlock_rcu_node(rnp);
+   17 }
+   18
+   19 void task2(void)
+   20 {
+   21   WRITE_ONCE(z, 1);
+   22   smp_mb();
+   23   r3 = READ_ONCE(x);
+   24 }
+   25
+   26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
+
+The ``WARN_ON()`` is evaluated at "the end of time",
+after all changes have propagated throughout the system.
+Without the ``smp_mb__after_unlock_lock()`` provided by the
+acquisition functions, this ``WARN_ON()`` could trigger, for example
+on PowerPC.
+The ``smp_mb__after_unlock_lock()`` invocations prevent this
+``WARN_ON()`` from triggering.
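+
+To see where that barrier comes from, note that the helpers are thin
+wrappers around the underlying raw spinlock operations. For example,
+``raw_spin_lock_rcu_node()`` is defined in kernel/rcu/rcu.h roughly as
+follows (a lightly abridged sketch of the v5.10 macro)::
+
+    #define raw_spin_lock_rcu_node(p)                    \
+    do {                                                 \
+        raw_spin_lock(&ACCESS_PRIVATE(p, lock));         \
+        smp_mb__after_unlock_lock();                     \
+    } while (0)
+
+On architectures whose lock acquisition already provides the needed
+ordering, ``smp_mb__after_unlock_lock()`` compiles to nothing; elsewhere
+(for example, on PowerPC) it emits the full memory barrier that the
+litmus test above relies on.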
+
+This approach must be extended to include idle CPUs, which need
+RCU's grace-period memory ordering guarantee to extend to any
+RCU read-side critical sections preceding and following the current
+idle sojourn.
+This case is handled by calls to the strongly ordered
+``atomic_add_return()`` read-modify-write atomic operation that
+is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry
+time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time.
+The grace-period kthread invokes ``rcu_dynticks_snap()`` and
+``rcu_dynticks_in_eqs_since()`` (both of which invoke
+an ``atomic_add_return()`` of zero) to detect idle CPUs.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But what about CPUs that remain offline for the entire grace period?  |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| Such CPUs will be offline at the beginning of the grace period, so    |
+| the grace period won't expect quiescent states from them. Races       |
+| between grace-period start and CPU-hotplug operations are mediated    |
+| by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described    |
+| above.                                                                |
++-----------------------------------------------------------------------+
+
+The approach must be extended to handle one final case, that of waking a
+task blocked in ``synchronize_rcu()``. This task might be affined to
+a CPU that is not yet aware that the grace period has ended, and thus
+might not yet be subject to the grace period's memory ordering.
+Therefore, there is an ``smp_mb()`` after the return from
+``wait_for_completion()`` in the ``synchronize_rcu()`` code path.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| What? Where??? I don't see any ``smp_mb()`` after the return from     |
+| ``wait_for_completion()``!!!                                          |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| That would be because I spotted the need for that ``smp_mb()`` during |
+| the creation of this documentation, and it is therefore unlikely to   |
+| hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter     |
+| Zijlstra, and Jonathan Cameron for asking questions that sensitized   |
+| me to the rather elaborate sequence of events that demonstrate the    |
+| need for this memory barrier.                                         |
++-----------------------------------------------------------------------+
+
+Tree RCU's grace-period memory-ordering guarantees rely most heavily on
+the ``rcu_node`` structure's ``->lock`` field, so much so that it is
+necessary to abbreviate this pattern in the diagrams in the next
+section.
+For example, consider the ``rcu_prepare_for_idle()`` function
+shown below, which is one of several functions that enforce ordering of
+newly arrived RCU callbacks against future grace periods:
+
+::
+
+    1 static void rcu_prepare_for_idle(void)
+    2 {
+    3   bool needwake;
+    4   struct rcu_data *rdp;
+    5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+    6   struct rcu_node *rnp;
+    7   struct rcu_state *rsp;
+    8   int tne;
+    9
+   10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
+   11       rcu_is_nocb_cpu(smp_processor_id()))
+   12     return;
+   13   tne = READ_ONCE(tick_nohz_active);
+   14   if (tne != rdtp->tick_nohz_enabled_snap) {
+   15     if (rcu_cpu_has_callbacks(NULL))
+   16       invoke_rcu_core();
+   17     rdtp->tick_nohz_enabled_snap = tne;
+   18     return;
+   19   }
+   20   if (!tne)
+   21     return;
+   22   if (rdtp->all_lazy &&
+   23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
+   24     rdtp->all_lazy = false;
+   25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
+   26     invoke_rcu_core();
+   27     return;
+   28   }
+   29   if (rdtp->last_accelerate == jiffies)
+   30     return;
+   31   rdtp->last_accelerate = jiffies;
+   32   for_each_rcu_flavor(rsp) {
+   33     rdp = this_cpu_ptr(rsp->rda);
+   34     if (!rcu_segcblist_pend_cbs(&rdp->cblist))
+   35       continue;
+   36     rnp = rdp->mynode;
+   37     raw_spin_lock_rcu_node(rnp);
+   38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+   39     raw_spin_unlock_rcu_node(rnp);
+   40     if (needwake)
+   41       rcu_gp_kthread_wake(rsp);
+   42   }
+   43 }
+
+But the only part of ``rcu_prepare_for_idle()`` that really matters for
+this discussion is lines 37–39. We will therefore abbreviate this
+function as follows:
+
+.. kernel-figure:: rcu_node-lock.svg
+
+The box represents the ``rcu_node`` structure's ``->lock`` critical
+section, with the double line on top representing the additional
+``smp_mb__after_unlock_lock()``.
+
+Tree RCU Grace Period Memory Ordering Components
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Tree RCU's grace-period memory-ordering guarantee is provided by a
+number of RCU components:
+
+#. `Callback Registry`_
+#. `Grace-Period Initialization`_
+#. `Self-Reported Quiescent States`_
+#. `Dynamic Tick Interface`_
+#. `CPU-Hotplug Interface`_
+#. `Forcing Quiescent States`_
+#. `Grace-Period Cleanup`_
+#. `Callback Invocation`_
+
+Each of the following sections looks at the corresponding component in
+detail.
+
+Callback Registry
+^^^^^^^^^^^^^^^^^
+
+If RCU's grace-period guarantee is to mean anything at all, any access
+that happens before a given invocation of ``call_rcu()`` must also
+happen before the corresponding grace period. The implementation of this
+portion of RCU's grace-period guarantee is shown in the following
+figure:
+
+.. kernel-figure:: TreeRCU-callback-registry.svg
+
+Because ``call_rcu()`` normally acts only on CPU-local state, it
+provides no ordering guarantees, either for itself or for phase one of
+the update (which again will usually be removal of an element from an
+RCU-protected data structure). It simply enqueues the ``rcu_head``
+structure on a per-CPU list, which cannot become associated with a grace
+period until a later call to ``rcu_accelerate_cbs()``, as shown in the
+diagram above.
+
+One set of code paths shown on the left invokes ``rcu_accelerate_cbs()``
+via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the
+current CPU is inundated with queued ``rcu_head`` structures) or more
+likely from an ``RCU_SOFTIRQ`` handler.
+Another code path in the middle
+is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which
+invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The
+final code path on the right is taken only in kernels built with
+``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via
+``rcu_advance_cbs()``, ``rcu_migrate_callbacks()``,
+``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn
+is invoked on a surviving CPU after the outgoing CPU has been completely
+offlined.
+
+There are a few other code paths within grace-period processing that
+opportunistically invoke ``rcu_accelerate_cbs()``. However, either way,
+all of the CPU's recently queued ``rcu_head`` structures are associated
+with a future grace-period number under the protection of the CPU's leaf
+``rcu_node`` structure's ``->lock``. In all cases, there is full
+ordering against any prior critical section for that same ``rcu_node``
+structure's ``->lock``, and also full ordering against any of the
+current task's or CPU's prior critical sections for any ``rcu_node``
+structure's ``->lock``.
+
+The next section will show how this ordering ensures that any accesses
+prior to the ``call_rcu()`` (particularly including phase one of the
+update) happen before the start of the corresponding grace period.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But what about ``synchronize_rcu()``?                                 |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, |
+| which invokes it. So either way, it eventually comes down to          |
+| ``call_rcu()``.                                                       |
++-----------------------------------------------------------------------+
+
+Grace-Period Initialization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Grace-period initialization is carried out by the grace-period kernel
+thread, which makes several passes over the ``rcu_node`` tree within the
+``rcu_gp_init()`` function. This means that showing the full flow of
+ordering through the grace-period computation will require duplicating
+this tree. If you find this confusing, please note that the state of the
+``rcu_node`` changes over time, just like Heraclitus's river. However,
+to keep the ``rcu_node`` river tractable, the grace-period kernel
+thread's traversals are presented in multiple parts, starting in this
+section with the various phases of grace-period initialization.
+
+The first ordering-related grace-period initialization action is to
+advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number
+counter, as shown below:
+
+.. kernel-figure:: TreeRCU-gp-init-1.svg
+
+The actual increment is carried out using ``smp_store_release()``, which
+helps reject false-positive RCU CPU stall detection. Note that only the
+root ``rcu_node`` structure is touched.
+
+The first pass through the ``rcu_node`` tree updates bitmasks based on
+CPUs having come online or gone offline since the start of the previous
+grace period. In the common case where the number of online CPUs for
+this ``rcu_node`` structure has not transitioned to or from zero, this
+pass will scan only the leaf ``rcu_node`` structures.
+However, if the
+number of online CPUs for a given leaf ``rcu_node`` structure has
+transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the
+first incoming CPU. Similarly, if the number of online CPUs for a given
+leaf ``rcu_node`` structure has transitioned to zero,
+``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU.
+The diagram below shows the path of ordering if the leftmost
+``rcu_node`` structure onlines its first CPU and if the next
+``rcu_node`` structure has no online CPUs (or, alternatively, if the
+leftmost ``rcu_node`` structure offlines its last CPU and if the next
+``rcu_node`` structure has no online CPUs).
+
+.. kernel-figure:: TreeRCU-gp-init-2.svg
+
+The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses
+breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field
+to the newly advanced value from the ``rcu_state`` structure, as shown
+in the following diagram.
+
+.. kernel-figure:: TreeRCU-gp-init-3.svg
+
+This change will also cause each CPU's next call to
+``__note_gp_changes()`` to notice that a new grace period has started,
+as described in the next section. But because the grace-period kthread
+started the grace period at the root (with the advancing of the
+``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf
+``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of
+the start of the grace period will happen after the actual start of the
+grace period.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But what about the CPU that started the grace period? Why wouldn't it |
+| see the start of the grace period right when it started that grace    |
+| period?                                                               |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| In some deep philosophical and overly anthropomorphized sense, yes,   |
+| the CPU starting the grace period is immediately aware of having done |
+| so. However, if we instead assume that RCU is not self-aware, then    |
+| even the CPU starting the grace period does not really become aware   |
+| of the start of this grace period until its first call to             |
+| ``__note_gp_changes()``. On the other hand, this CPU potentially gets |
+| early notification because it invokes ``__note_gp_changes()`` during  |
+| its last ``rcu_gp_init()`` pass through its leaf ``rcu_node``         |
+| structure.                                                            |
++-----------------------------------------------------------------------+
+
+Self-Reported Quiescent States
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When all entities that might block the grace period have reported
+quiescent states (or as described in a later section, had quiescent
+states reported on their behalf), the grace period can end. Online
+non-idle CPUs report their own quiescent states, as shown in the
+following diagram:
+
+.. kernel-figure:: TreeRCU-qs.svg
+
+This is for the last CPU to report a quiescent state, which signals the
+end of the grace period. Earlier quiescent states would push up the
+``rcu_node`` tree only until they encountered an ``rcu_node`` structure
+that is waiting for additional quiescent states. However, ordering is
+nevertheless preserved because some later quiescent state will acquire
+that ``rcu_node`` structure's ``->lock``.
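+
+The way this lock-based ordering chains from CPU to CPU can be seen in
+the shape of the reporting loop itself. The following is a greatly
+simplified sketch modeled on v5.10's ``rcu_report_qs_rnp()``; checks of
+the grace-period sequence number, handling of blocked readers, and
+diagnostics are all omitted, and ``report_qs_rnp_sketch()`` itself is a
+made-up name::
+
+    /* Entered holding the leaf rcu_node structure's ->lock. */
+    static void report_qs_rnp_sketch(unsigned long mask,
+                                     struct rcu_node *rnp,
+                                     unsigned long flags)
+    {
+        for (;;) {
+            rnp->qsmask &= ~mask;  /* Report quiescent state(s). */
+            if (rnp->qsmask) {
+                /* Other CPUs still owe this rcu_node structure
+                 * quiescent states; a later reporter will acquire
+                 * this ->lock and thus be ordered after everything
+                 * done so far. */
+                raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+                return;
+            }
+            mask = rnp->grpmask;   /* This whole subtree is quiescent. */
+            if (!rnp->parent)
+                break;             /* At the root: grace period may end. */
+            raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+            rnp = rnp->parent;
+            /* This acquisition includes the all-important
+             * smp_mb__after_unlock_lock(), extending the ordering
+             * network one level up the tree. */
+            raw_spin_lock_irqsave_rcu_node(rnp, flags);
+        }
+        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+        /* Upstream, rcu_report_qs_rsp() then awakens the
+         * grace-period kernel thread. */
+    }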
+
+Any number of events can lead up to a CPU invoking ``note_gp_changes()``
+(or alternatively, directly invoking ``__note_gp_changes()``), at which
+point that CPU will notice the start of a new grace period while holding
+its leaf ``rcu_node`` lock. Therefore, all execution shown in this
+diagram happens after the start of the grace period. In addition, this
+CPU will consider any RCU read-side critical section that started before
+the invocation of ``__note_gp_changes()`` to have started before the
+grace period, and thus a critical section that the grace period must
+wait on.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But an RCU read-side critical section might have started after the    |
+| beginning of the grace period (the advancing of ``->gp_seq`` from     |
+| earlier), so why should the grace period wait on such a critical      |
+| section?                                                              |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| It is indeed not necessary for the grace period to wait on such a     |
+| critical section. However, it is permissible to wait on it. And it is |
+| furthermore important to wait on it, as this lazy approach is far     |
+| more scalable than a "big bang" all-at-once grace-period start could  |
+| possibly be.                                                          |
++-----------------------------------------------------------------------+
+
+If the CPU does a context switch, a quiescent state will be noted by
+``rcu_note_context_switch()`` on the left. On the other hand, if the CPU
+takes a scheduler-clock interrupt while executing in usermode, a
+quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right.
+Either way, the passage through a quiescent state will be noted in a
+per-CPU variable.
+
+The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for
+example, after the next scheduler-clock interrupt), ``rcu_core()`` will
+invoke ``rcu_check_quiescent_state()``, which will notice the recorded
+quiescent state, and invoke ``rcu_report_qs_rdp()``. If
+``rcu_report_qs_rdp()`` verifies that the quiescent state really does
+apply to the current grace period, it invokes ``rcu_report_qs_rnp()``,
+which traverses up the ``rcu_node`` tree as shown at the bottom of the
+diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask``
+field, and propagating up the tree when the result is zero.
+
+Note that traversal passes upwards out of a given ``rcu_node`` structure
+only if the current CPU is reporting the last quiescent state for the
+subtree headed by that ``rcu_node`` structure. A key point is that if a
+CPU's traversal stops at a given ``rcu_node`` structure, then there will
+be a later traversal by another CPU (or perhaps the same one) that
+proceeds upwards from that point, and the ``rcu_node`` ``->lock``
+guarantees that the first CPU's quiescent state happens before the
+remainder of the second CPU's traversal. Applying this line of thought
+repeatedly shows that all CPUs' quiescent states happen before the last
+CPU traverses through the root ``rcu_node`` structure, the "last CPU"
+being the one that clears the last bit in the root ``rcu_node``
+structure's ``->qsmask`` field.
+
+Dynamic Tick Interface
+^^^^^^^^^^^^^^^^^^^^^^
+
+Due to energy-efficiency considerations, RCU is forbidden from
+disturbing idle CPUs. CPUs are therefore required to notify RCU when
+entering or leaving idle state, which they do via fully ordered
+value-returning atomic operations on a per-CPU variable.
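+
+As a rough model of this notification (hypothetical code, not the
+kernel's actual implementation, which differs in its counter layout and
+flag handling), think of a per-CPU counter that is odd while the CPU is
+idle and even otherwise, always updated with fully ordered atomics::
+
+    /* Hypothetical model of the dyntick-idle counter. */
+    static DEFINE_PER_CPU(atomic_t, dynticks_model);
+
+    static void model_eqs_enter(void)  /* Idle entry. */
+    {
+        /* Fully ordered: whoever sees the new (odd) value also sees
+         * all of this CPU's prior read-side critical sections. */
+        int seq = atomic_add_return(1, this_cpu_ptr(&dynticks_model));
+
+        WARN_ON_ONCE(!(seq & 1));      /* Odd now means idle. */
+    }
+
+    static void model_eqs_exit(void)   /* Idle exit. */
+    {
+        /* Fully ordered: this CPU's later read-side critical sections
+         * happen after anything done by whoever saw the odd value. */
+        int seq = atomic_add_return(1, this_cpu_ptr(&dynticks_model));
+
+        WARN_ON_ONCE(seq & 1);         /* Even now means non-idle. */
+    }
+
+    static int model_snap(int cpu)     /* Grace-period kthread sampling. */
+    {
+        /* An atomic_add_return() of zero: a fully ordered read. */
+        return atomic_add_return(0, per_cpu_ptr(&dynticks_model, cpu));
+    }
+
+If a sampled value is odd, or if a later sample differs from an earlier
+one, the sampled CPU passed through an idle period, which is a quiescent
+state.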
+The ordering effects are as shown below:
+
+.. kernel-figure:: TreeRCU-dyntick.svg
+
+The RCU grace-period kernel thread samples the per-CPU idleness variable
+while holding the corresponding CPU's leaf ``rcu_node`` structure's
+``->lock``. This means that any RCU read-side critical sections that
+precede the idle period (the oval near the top of the diagram above)
+will happen before the end of the current grace period. Similarly, the
+beginning of the current grace period will happen before any RCU
+read-side critical sections that follow the idle period (the oval near
+the bottom of the diagram above).
+
+Plumbing this into the full grace-period execution is described
+below in `Forcing Quiescent States`_.
+
+CPU-Hotplug Interface
+^^^^^^^^^^^^^^^^^^^^^
+
+RCU is also forbidden from disturbing offline CPUs, which might well be
+powered off and removed from the system completely. CPUs are therefore
+required to notify RCU of their comings and goings as part of the
+corresponding CPU hotplug operations. The ordering effects are shown
+below:
+
+.. kernel-figure:: TreeRCU-hotplug.svg
+
+Because CPU hotplug operations are much less frequent than idle
+transitions, they are heavier weight, and thus acquire the CPU's leaf
+``rcu_node`` structure's ``->lock`` and update this structure's
+``->qsmaskinitnext``. The RCU grace-period kernel thread samples this
+mask to detect CPUs having gone offline since the beginning of this
+grace period.
+
+Plumbing this into the full grace-period execution is described
+below in `Forcing Quiescent States`_.
+
+Forcing Quiescent States
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+As noted above, idle and offline CPUs cannot report their own quiescent
+states, and therefore the grace-period kernel thread must do the
+reporting on their behalf. This process is called "forcing quiescent
+states", it is repeated every few jiffies, and its ordering effects are
+shown below:
+
+.. kernel-figure:: TreeRCU-gp-fqs.svg
+
+Each pass of quiescent state forcing is guaranteed to traverse the leaf
+``rcu_node`` structures, and if there are no new quiescent states due to
+recently idled and/or offlined CPUs, then only the leaves are traversed.
+However, if there is a newly offlined CPU as illustrated on the left or
+a newly idled CPU as illustrated on the right, the corresponding
+quiescent state will be driven up towards the root. As with
+self-reported quiescent states, the upwards driving stops once it
+reaches an ``rcu_node`` structure that has quiescent states outstanding
+from other CPUs.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| The leftmost drive to root stopped before it reached the root         |
+| ``rcu_node`` structure, which means that there are still CPUs         |
+| subordinate to that structure on which the current grace period is    |
+| waiting. Given that, how is it possible that the rightmost drive to   |
+| root ended the grace period?                                          |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| Good analysis! It is in fact impossible in the absence of bugs in     |
+| RCU. But this diagram is complex enough as it is, so simplicity       |
+| overrode accuracy. You can think of it as poetic license, or you can  |
+| think of it as misdirection that is resolved in the stitched-together |
+| diagram in `Putting It All Together`_.                                |
++-----------------------------------------------------------------------+
+
+Grace-Period Cleanup
+^^^^^^^^^^^^^^^^^^^^
+
+Grace-period cleanup first scans the ``rcu_node`` tree breadth-first
+advancing all the ``->gp_seq`` fields, then it advances the
+``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are
+shown below:
+
+.. kernel-figure:: TreeRCU-gp-cleanup.svg
+
+As indicated by the oval at the bottom of the diagram, once grace-period
+cleanup is complete, the next grace period can begin.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| But when precisely does the grace period end?                         |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| There is no useful single point at which the grace period can be said |
+| to end. The earliest reasonable candidate is as soon as the last CPU  |
+| has reported its quiescent state, but it may be some milliseconds     |
+| before RCU becomes aware of this. The latest reasonable candidate is  |
+| once the ``rcu_state`` structure's ``->gp_seq`` field has been        |
+| updated, but it is quite possible that some CPUs have already         |
+| completed phase two of their updates by that time. In short, if you   |
+| are going to work with RCU, you need to learn to embrace uncertainty. |
++-----------------------------------------------------------------------+
+
+Callback Invocation
+^^^^^^^^^^^^^^^^^^^
+
+Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has
+been updated, that CPU can begin invoking its RCU callbacks that were
+waiting for this grace period to end. These callbacks are identified by
+``rcu_advance_cbs()``, which is usually invoked by
+``__note_gp_changes()``. As shown in the diagram below, this invocation
+can be triggered by the scheduling-clock interrupt
+(``rcu_sched_clock_irq()`` on the left) or by idle entry
+(``rcu_cleanup_after_idle()`` on the right, but only for kernels built
+with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is
+raised, which results in ``rcu_do_batch()`` invoking the callbacks,
+which in turn allows those callbacks to carry out (either directly or
+indirectly via wakeup) the needed phase-two processing for each update.
+
+.. kernel-figure:: TreeRCU-callback-invocation.svg
+
+Please note that callback invocation can also be prompted by any number
+of corner-case code paths, for example, when a CPU notes that it has
+excessive numbers of callbacks queued. In all cases, the CPU acquires
+its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks,
+which preserves the required ordering against the newly completed grace
+period.
+
+However, if the callback function communicates to other CPUs, for
+example, doing a wakeup, then it is that function's responsibility to
+maintain ordering. For example, if the callback function wakes up a task
+that runs on some other CPU, proper ordering must be in place in both
+the callback function and the task being awakened. To see why this is
+important, consider the top half of the `Grace-Period Cleanup`_ diagram.
+The callback might be running on a CPU corresponding to the leftmost
+leaf ``rcu_node`` structure, and awaken a task that is to run on a CPU
+corresponding to the rightmost leaf ``rcu_node`` structure, and the
+grace-period kernel thread might not yet have reached the rightmost
+leaf. In this case, the grace period's memory ordering might not yet
+have reached that CPU, so again the callback function and the awakened
+task must supply proper ordering.
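+
+As a concrete (and entirely hypothetical) illustration, an RCU callback
+that hands phase-two results to a task on another CPU might pair
+``smp_store_release()`` with ``smp_load_acquire()`` to supply that
+ordering; every name below is made up for the example::
+
+    static int phase_two_data;
+    static int phase_two_ready;
+
+    static void my_rcu_callback(struct rcu_head *rhp)
+    {
+        phase_two_data = 42;                    /* Phase-two update. */
+        smp_store_release(&phase_two_ready, 1); /* Publish to the waiter. */
+    }
+
+    static void waiter(void)  /* Runs on some other CPU. */
+    {
+        while (!smp_load_acquire(&phase_two_ready))
+            cpu_relax();
+        WARN_ON_ONCE(phase_two_data != 42);     /* Guaranteed to see 42. */
+    }
+
+In real code, a wakeup would normally stand in for the busy-wait loop,
+and the wakeup/wait pair would then need to provide comparable ordering.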
+
+Putting It All Together
+~~~~~~~~~~~~~~~~~~~~~~~
+
+A stitched-together diagram is here:
+
+.. kernel-figure:: TreeRCU-gp.svg
+
+Legal Statement
+~~~~~~~~~~~~~~~
+
+This work represents the view of the author and does not necessarily
+represent the view of IBM.
+
+Linux is a registered trademark of Linus Torvalds.
+
+Other company, product, and service names may be trademarks or service
+marks of others.
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg
new file mode 100644
index 000000000..3fcf0c17c
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg
@@ -0,0 +1,486 @@
   [486 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: rcu_sched_clock_irq(), rcu_cleanup_after_idle(),
   rcu_advance_cbs(), Leaf, __note_gp_changes(), Phase Two of Update,
   RCU_SOFTIRQ, rcu_do_batch()]
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg
new file mode 100644
index 000000000..7ac6f9269
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg
@@ -0,0 +1,655 @@
   [655 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: rcu_accelerate_cbs(), rcu_prepare_for_idle(),
   note_gp_changes(), rcu_advance_cbs(), __note_gp_changes(), call_rcu(),
   Wake up grace-period kernel thread, takedown_cpu(),
   rcutree_migrate_callbacks(), rcu_migrate_callbacks(), Leaf (x3),
   Phase One of Update]
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg
new file mode 100644
index 000000000..423df00c4
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg
@@ -0,0 +1,700 @@
   [700 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: ->qsmask &= ~->grpmask, Leaf,
   dyntick_save_progress_counter(), rcu_implicit_dynticks_qs(),
   RCU read-side critical section (x2), rcu_dynticks_eqs_enter(),
   rcu_dynticks_eqs_exit(), atomic_add_return() (x2)]
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg
new file mode 100644
index 000000000..bf84fbab2
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg
@@ -0,0 +1,1133 @@
   [1133 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: rcu_gp_cleanup(), Root (x2), rcu_seq_end(&rnp->gp_seq)
   (many), rcu_seq_end(&rsp->gp_seq), Leaf (several),
   Start of Next Grace Period]
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg
new file mode 100644
index 000000000..7ddc094d7
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg
@@ -0,0 +1,1309 @@
   [1309 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: rcu_gp_fqs(), force_qs_rnp(),
   dyntick_save_progress_counter(), rcu_implicit_dynticks_qs(), Root,
   Leaf (x4), ->qsmask &= ~->grpmask (several), RCU read-side critical
   section (x4), rcu_dynticks_eqs_enter(), rcu_dynticks_eqs_exit(),
   atomic_add_return(), rcu_report_dead(), rcu_cleanup_dying_idle_cpu(),
   ->qsmaskinitnext, rcu_cpu_starting()]
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg
new file mode 100644
index 000000000..8c2075508
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg
@@ -0,0 +1,658 @@
   [658 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: rcu_gp_init(), rcu_seq_start(rsp->gp_seq), Root,
   Leaf (x4), End of Last Grace Period]
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg
new file mode 100644
index 000000000..4d956a732
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg
@@ -0,0 +1,656 @@
   [656 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: rcu_gp_init(), Root, Leaf (x4), ->qsmaskinit,
   ->qsmaskinitnext, rcu_init_new_rnp() or rcu_cleanup_dead_rnp()
   (optional)]
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg
new file mode 100644
index 000000000..d24d7d555
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg
@@ -0,0 +1,636 @@
   [636 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: rcu_gp_init(), Root, Leaf (x4),
   ->gp_seq = rsp->gp_seq (several)]
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
new file mode 100644
index 000000000..069f6f837
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
@@ -0,0 +1,5144 @@
   [5144 lines of SVG markup, not recoverable from this extraction. This
   is the stitched-together diagram combining the callback-registry,
   grace-period-initialization, quiescent-state, dyntick, hotplug,
   forcing-quiescent-states, grace-period-cleanup, and callback-invocation
   figures. Additional text labels: Grace-period kernel thread awakened,
   rcu_report_rnp(), rcu_report_qs_rsp(), rcu_note_context_switch(),
   rcu_sched_clock_irq(), rcu_core(), rcu_check_quiescent_state(),
   rcu_report_qs_rdp(), note_gp_changes(), rdp->gp_seq,
   __note_gp_changes(), rcu_gp_cleanup(), rcu_do_batch(),
   Start of Next Grace Period]
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg
new file mode 100644
index 000000000..2c9310ba2
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg
@@ -0,0 +1,775 @@
   [775 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: ->qsmask &= ~->grpmask, dyntick_save_progress_counter(),
   rcu_implicit_dynticks_qs(), Leaf (x3), RCU read-side critical section
   (x2), rcu_report_dead(), rcu_cleanup_dying_idle_cpu(),
   ->qsmaskinitnext (x2), rcu_cpu_starting()]
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
new file mode 100644
index 000000000..7d6c5f7e5
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
@@ -0,0 +1,1095 @@
   [1095 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: rcu_report_rnp(), Root, Leaf (x5), ->qsmask &= ~->grpmask
   (x3), note_gp_changes(), rdp->gp_seq, __note_gp_changes(),
   rcu_note_context_switch(), rcu_sched_clock_irq(), rcu_core(),
   rcu_check_quiescent_state(), rcu_report_qs_rdp(), RCU read-side
   critical section (x4), Wake up grace-period kernel thread,
   rcu_report_qs_rsp()]
diff --git a/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg b/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg
new file mode 100644
index 000000000..94c96c595
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg
@@ -0,0 +1,229 @@
   [229 lines of SVG markup, not recoverable from this extraction. Figure
   text labels: rcu_accelerate_cbs(), rcu_prepare_for_idle()]
--
cgit v1.2.3