author    | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-06 01:02:30 +0000
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-06 01:02:30 +0000
commit    | 76cb841cb886eef6b3bee341a2266c76578724ad (patch)
tree      | f5892e5ba6cc11949952a6ce4ecbe6d516d6ce58 /Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
parent    | Initial commit. (diff)
Adding upstream version 4.19.249.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat
-rw-r--r-- | Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html | 667
1 file changed, 667 insertions, 0 deletions
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
new file mode 100644
index 000000000..7394f034b
--- /dev/null
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
@@ -0,0 +1,667 @@

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>A Tour Through TREE_RCU's Expedited Grace Periods</title>
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

<h2>Introduction</h2>

This document describes RCU's expedited grace periods.
Unlike RCU's normal grace periods, which accept long latencies to attain
high efficiency and minimal disturbance, expedited grace periods accept
lower efficiency and significant disturbance to attain shorter latencies.

<p>
There are three flavors of RCU (RCU-bh, RCU-preempt, and RCU-sched),
but only two flavors of expedited grace periods because the RCU-bh
expedited grace period maps onto the RCU-sched expedited grace period.
Each of the remaining two implementations is covered in its own section.

<ol>
<li>	<a href="#Expedited Grace Period Design">
	Expedited Grace Period Design</a>
<li>	<a href="#RCU-preempt Expedited Grace Periods">
	RCU-preempt Expedited Grace Periods</a>
<li>	<a href="#RCU-sched Expedited Grace Periods">
	RCU-sched Expedited Grace Periods</a>
<li>	<a href="#Expedited Grace Period and CPU Hotplug">
	Expedited Grace Period and CPU Hotplug</a>
<li>	<a href="#Expedited Grace Period Refinements">
	Expedited Grace Period Refinements</a>
</ol>

<h2><a name="Expedited Grace Period Design">
Expedited Grace Period Design</a></h2>

<p>
The expedited RCU grace periods cannot be accused of being subtle,
given that they for all intents and purposes hammer every CPU that
has not yet provided a quiescent state for the current expedited
grace period.
The one saving grace is that the hammer has grown a bit smaller
over time:  The old call to <tt>try_stop_cpus()</tt> has been
replaced with a set of calls to <tt>smp_call_function_single()</tt>,
each of which results in an IPI to the target CPU.
The corresponding handler function checks the CPU's state, motivating
a faster quiescent state where possible, and triggering a report
of that quiescent state.
As always for RCU, once everything has spent some time in a quiescent
state, the expedited grace period has completed.

<p>
The details of the <tt>smp_call_function_single()</tt> handler's
operation depend on the RCU flavor, as described in the following
sections.
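<p>
As a rough illustration of this hammer, the following user-space sketch
tracks the CPUs that still owe a quiescent state in a bitmask and delivers
a simulated IPI to each of them, with the handler reporting that state.
The <tt>exp_mask</tt> and <tt>fake_ipi_handler()</tt> names are invented
for this sketch and are not kernel symbols; <tt>fake_ipi_handler()</tt>
merely stands in for the real <tt>smp_call_function_single()</tt> handler.

<pre>
/*
 * Toy user-space model of the expedited grace-period "hammer":
 * every CPU that has not yet reported a quiescent state receives a
 * simulated IPI whose handler reports that state.  Conceptual sketch
 * only; none of these names are kernel APIs.
 */
#include <stdio.h>

#define NR_CPUS 8

static unsigned long exp_mask;	/* bit N set: CPU N still blocks the GP */

static void fake_ipi_handler(int cpu)
{
	/* A real handler would first try to motivate a quiescent state. */
	exp_mask &= ~(1UL << cpu);	/* report the quiescent state */
}

int main(void)
{
	exp_mask = (1UL << NR_CPUS) - 1;	/* assume no CPU is idle or offline */

	while (exp_mask) {
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			if (exp_mask & (1UL << cpu))
				fake_ipi_handler(cpu);	/* stands in for smp_call_function_single() */
	}
	printf("expedited grace period complete\n");
	return 0;
}
</pre>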
<h2><a name="RCU-preempt Expedited Grace Periods">
RCU-preempt Expedited Grace Periods</a></h2>

<p>
The overall flow of the handling of a given CPU by an RCU-preempt
expedited grace period is shown in the following diagram:

<p><img src="ExpRCUFlow.svg" alt="ExpRCUFlow.svg" width="55%">

<p>
The solid arrows denote direct action, for example, a function call.
The dotted arrows denote indirect action, for example, an IPI
or a state that is reached after some time.

<p>
If a given CPU is offline or idle, <tt>synchronize_rcu_expedited()</tt>
will ignore it because idle and offline CPUs are already residing
in quiescent states.
Otherwise, the expedited grace period will use
<tt>smp_call_function_single()</tt> to send the CPU an IPI, which
is handled by <tt>sync_rcu_exp_handler()</tt>.

<p>
However, because this is preemptible RCU, <tt>sync_rcu_exp_handler()</tt>
can check to see if the CPU is currently running in an RCU read-side
critical section.
If not, the handler can immediately report a quiescent state.
Otherwise, it sets flags so that the outermost <tt>rcu_read_unlock()</tt>
invocation will provide the needed quiescent-state report.
This flag-setting avoids the previous forced preemption of all
CPUs that might have RCU read-side critical sections.
In addition, this flag-setting is done so as to avoid increasing
the overhead of the common-case fastpath through the scheduler.

<p>
Again because this is preemptible RCU, an RCU read-side critical section
can be preempted.
When that happens, RCU will enqueue the task, which will then continue to
block the current expedited grace period until it resumes and finds its
outermost <tt>rcu_read_unlock()</tt>.
The CPU will report a quiescent state just after enqueuing the task because
the CPU is no longer blocking the grace period.
It is instead the preempted task doing the blocking.
The list of blocked tasks is managed by <tt>rcu_preempt_ctxt_queue()</tt>,
which is called from <tt>rcu_preempt_note_context_switch()</tt>, which
in turn is called from <tt>rcu_note_context_switch()</tt>, which in
turn is called from the scheduler.
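<p>
The decision made by <tt>sync_rcu_exp_handler()</tt> can be modeled by the
following single-threaded user-space sketch, which keeps a read-side
nesting count and a deferred-report flag: the simulated handler reports a
quiescent state immediately when the nesting count is zero, and otherwise
defers the report to the outermost <tt>rcu_read_unlock()</tt>.
All of the <tt>toy_*</tt> names and the flat variables are invented for
illustration and gloss over the real code's per-task state and locking.

<pre>
/*
 * Minimal sketch of the RCU-preempt IPI handler's decision: report a
 * quiescent state immediately if the interrupted task is not in a
 * read-side critical section, otherwise defer the report to the
 * outermost rcu_read_unlock().  Names are invented for this sketch.
 */
#include <stdio.h>
#include <stdbool.h>

static int read_nesting;	/* depth of rcu_read_lock() nesting */
static bool deferred_qs;	/* handler asked for a deferred report */

static void report_exp_qs(const char *who)
{
	printf("quiescent state reported by %s\n", who);
}

static void toy_rcu_read_lock(void) { read_nesting++; }

static void toy_rcu_read_unlock(void)
{
	if (--read_nesting == 0 && deferred_qs) {
		deferred_qs = false;
		report_exp_qs("outermost rcu_read_unlock()");
	}
}

static void toy_sync_rcu_exp_handler(void)
{
	if (read_nesting == 0)
		report_exp_qs("IPI handler");	/* CPU not in a critical section */
	else
		deferred_qs = true;		/* let rcu_read_unlock() do it */
}

int main(void)
{
	toy_sync_rcu_exp_handler();	/* immediate report */
	toy_rcu_read_lock();
	toy_sync_rcu_exp_handler();	/* deferred ... */
	toy_rcu_read_unlock();		/* ... and reported here */
	return 0;
}
</pre>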
<table>
<tr><th> </th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Why not just have the expedited grace period check the
	state of all the CPUs?
	After all, that would avoid all those real-time-unfriendly IPIs.
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Because we want the RCU read-side critical sections to run fast,
	which means no memory barriers.
	Therefore, it is not possible to safely check the state from some
	other CPU.
	And even if it were possible to safely check the state, it would
	still be necessary to IPI the CPU to safely interact with the
	upcoming <tt>rcu_read_unlock()</tt> invocation, which means that
	the remote state testing would not help the worst-case
	latency that real-time applications care about.

	<p><font color="ffffff">One way to prevent your real-time
	application from getting hit with these IPIs is to
	build your kernel with <tt>CONFIG_NO_HZ_FULL=y</tt>.
	RCU would then perceive the CPU running your application
	as being idle, and it would be able to safely detect that
	state without needing to IPI the CPU.
</font></td></tr>
<tr><td> </td></tr>
</table>

<p>
Please note that this is just the overall flow:
Additional complications can arise due to races with CPUs going idle
or offline, among other things.

<h2><a name="RCU-sched Expedited Grace Periods">
RCU-sched Expedited Grace Periods</a></h2>

<p>
The overall flow of the handling of a given CPU by an RCU-sched
expedited grace period is shown in the following diagram:

<p><img src="ExpSchedFlow.svg" alt="ExpSchedFlow.svg" width="55%">

<p>
As with RCU-preempt's <tt>synchronize_rcu_expedited()</tt>,
<tt>synchronize_sched_expedited()</tt> ignores offline and
idle CPUs, again because they are in remotely detectable
quiescent states.
However, the <tt>synchronize_sched_expedited()</tt> handler
is <tt>sync_sched_exp_handler()</tt>, and because
<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt>
leave no trace of their invocation, in general it is not possible to tell
whether or not the current CPU is in an RCU read-side critical section.
The best that <tt>sync_sched_exp_handler()</tt> can do is to check
for idle, on the off-chance that the CPU went idle while the IPI
was in flight.
If the CPU is idle, then <tt>sync_sched_exp_handler()</tt> reports
the quiescent state.

<p>
Otherwise, the handler invokes <tt>resched_cpu()</tt>, which forces
a future context switch.
At the time of the context switch, the CPU reports the quiescent state.
Should the CPU go offline first, it will report the quiescent state
at that time.
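<p>
The corresponding RCU-sched decision can be sketched as follows: the
simulated handler reports a quiescent state only if the CPU is idle, and
otherwise just requests a reschedule, with the eventual context switch
doing the reporting.
Again, the <tt>toy_*</tt> names are invented; this models only the choice
described above, not the real handler.

<pre>
/*
 * Sketch of the RCU-sched choice: the IPI handler can report a
 * quiescent state only if the CPU is idle; otherwise it asks for a
 * reschedule and the eventual context switch reports the state.
 * All names are illustrative, not kernel symbols.
 */
#include <stdio.h>
#include <stdbool.h>

struct toy_cpu {
	bool idle;
	bool need_resched;
};

static void toy_report_qs(int cpu) { printf("CPU %d: quiescent state\n", cpu); }

static void toy_sync_sched_exp_handler(struct toy_cpu *c, int cpu)
{
	if (c->idle)
		toy_report_qs(cpu);	/* the off-chance the CPU went idle */
	else
		c->need_resched = true;	/* stands in for resched_cpu() */
}

static void toy_context_switch(struct toy_cpu *c, int cpu)
{
	if (c->need_resched) {
		c->need_resched = false;
		toy_report_qs(cpu);	/* reported at the context switch */
	}
}

int main(void)
{
	struct toy_cpu c = { .idle = false, .need_resched = false };

	toy_sync_sched_exp_handler(&c, 0);	/* busy: just sets need_resched */
	toy_context_switch(&c, 0);		/* quiescent state reported here */
	return 0;
}
</pre>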
<h2><a name="Expedited Grace Period and CPU Hotplug">
Expedited Grace Period and CPU Hotplug</a></h2>

<p>
The expedited nature of expedited grace periods requires a much tighter
interaction with CPU hotplug operations than is required for normal
grace periods.
In addition, attempting to IPI offline CPUs will result in splats, but
failing to IPI online CPUs can result in too-short grace periods.
Neither option is acceptable in production kernels.

<p>
The interaction between expedited grace periods and CPU hotplug operations
is carried out at several levels:

<ol>
<li>	The number of CPUs that have ever been online is tracked
	by the <tt>rcu_state</tt> structure's <tt>->ncpus</tt>
	field.
	The <tt>rcu_state</tt> structure's <tt>->ncpus_snap</tt>
	field tracks the number of CPUs that have ever been online
	at the beginning of an RCU expedited grace period.
	Note that this number never decreases, at least in the absence
	of a time machine.
<li>	The identities of the CPUs that have ever been online are
	tracked by the <tt>rcu_node</tt> structure's
	<tt>->expmaskinitnext</tt> field.
	The <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt>
	field tracks the identities of the CPUs that were online
	at least once at the beginning of the most recent RCU
	expedited grace period.
	The <tt>rcu_state</tt> structure's <tt>->ncpus</tt> and
	<tt>->ncpus_snap</tt> fields are used to detect when
	new CPUs have come online for the first time, that is,
	when the <tt>rcu_node</tt> structure's <tt>->expmaskinitnext</tt>
	field has changed since the beginning of the last RCU
	expedited grace period, which triggers an update of each
	<tt>rcu_node</tt> structure's <tt>->expmaskinit</tt>
	field from its <tt>->expmaskinitnext</tt> field
	(see the sketch following this list).
<li>	Each <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt>
	field is used to initialize that structure's
	<tt>->expmask</tt> at the beginning of each RCU
	expedited grace period.
	This means that only those CPUs that have been online at least
	once will be considered for a given grace period.
<li>	Any CPU that goes offline will clear its bit in its leaf
	<tt>rcu_node</tt> structure's <tt>->qsmaskinitnext</tt>
	field, so any CPU with that bit clear can safely be ignored.
	However, it is possible for a CPU coming online or going offline
	to have this bit set for some time while <tt>cpu_online</tt>
	returns <tt>false</tt>.
<li>	For each non-idle CPU that RCU believes is currently online, the grace
	period invokes <tt>smp_call_function_single()</tt>.
	If this succeeds, the CPU was fully online.
	Failure indicates that the CPU is in the process of coming online
	or going offline, in which case it is necessary to wait for a
	short time period and try again.
	The purpose of this wait (or series of waits, as the case may be)
	is to permit a concurrent CPU-hotplug operation to complete.
<li>	In the case of RCU-sched, one of the last acts of an outgoing CPU
	is to invoke <tt>rcu_report_dead()</tt>, which
	reports a quiescent state for that CPU.
	However, this is likely paranoia-induced redundancy. <!-- @@@ -->
</ol>
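<p>
The sketch below models this bookkeeping for a single leaf node, using a
flat structure in place of the <tt>rcu_node</tt> tree.
The field names mirror those in the list above, but the <tt>toy_*</tt>
functions and the non-tree layout are invented for illustration:
<tt>toy_cpu_online()</tt> records a CPU's first appearance, and
<tt>toy_exp_gp_start()</tt> propagates <tt>->expmaskinitnext</tt> into
<tt>->expmaskinit</tt> only when the <tt>->ncpus</tt>/<tt>->ncpus_snap</tt>
comparison shows that new CPUs have appeared.

<pre>
/*
 * Sketch of the lazy CPU-hotplug bookkeeping described above, using a
 * single leaf node for simplicity.  Field names mirror the text; the
 * functions and flat (non-tree) layout are invented for illustration.
 */
#include <stdio.h>

struct toy_state {
	int ncpus;			/* CPUs that have ever been online */
	int ncpus_snap;			/* value of ->ncpus at last GP start */
	unsigned long expmaskinitnext;	/* ever-online CPUs, updated at hotplug time */
	unsigned long expmaskinit;	/* snapshot used to seed each GP */
	unsigned long expmask;		/* CPUs still blocking the current GP */
};

static void toy_cpu_online(struct toy_state *s, int cpu)
{
	s->expmaskinitnext |= 1UL << cpu;
	s->ncpus++;
}

static void toy_exp_gp_start(struct toy_state *s)
{
	if (s->ncpus != s->ncpus_snap) {	/* new CPUs since last GP? */
		s->expmaskinit = s->expmaskinitnext;
		s->ncpus_snap = s->ncpus;
	}
	s->expmask = s->expmaskinit;		/* only ever-online CPUs considered */
}

int main(void)
{
	struct toy_state s = { 0 };

	toy_cpu_online(&s, 0);
	toy_cpu_online(&s, 1);
	toy_exp_gp_start(&s);
	printf("expmask after first GP start: %#lx\n", s.expmask);	/* 0x3 */

	toy_cpu_online(&s, 2);
	toy_exp_gp_start(&s);
	printf("expmask after second GP start: %#lx\n", s.expmask);	/* 0x7 */
	return 0;
}
</pre>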
<table>
<tr><th> </th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Why all the dancing around with multiple counters and masks
	tracking CPUs that were once online?
	Why not just have a single set of masks tracking the currently
	online CPUs and be done with it?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Maintaining a single set of masks tracking the online CPUs <i>sounds</i>
	easier, at least until you try working out all the race conditions
	between grace-period initialization and CPU-hotplug operations.
	For example, suppose initialization is progressing down the
	tree while a CPU-offline operation is progressing up the tree.
	This situation can result in bits set at the top of the tree
	that have no counterparts at the bottom of the tree.
	Those bits will never be cleared, which will result in
	grace-period hangs.
	In short, that way lies madness, to say nothing of a great many
	bugs, hangs, and deadlocks.

	<p><font color="ffffff">
	In contrast, the current multi-mask multi-counter scheme ensures
	that grace-period initialization will always see consistent masks
	up and down the tree, which brings significant simplifications
	over the single-mask method.

	<p><font color="ffffff">
	This is an instance of
	<a href="http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1992/cucs-039-92.ps.gz"><font color="ffffff">
	deferring work in order to avoid synchronization</a>.
	Lazily recording CPU-hotplug events at the beginning of the next
	grace period greatly simplifies maintenance of the CPU-tracking
	bitmasks in the <tt>rcu_node</tt> tree.
</font></td></tr>
<tr><td> </td></tr>
</table>

<h2><a name="Expedited Grace Period Refinements">
Expedited Grace Period Refinements</a></h2>

<ol>
<li>	<a href="#Idle-CPU Checks">Idle-CPU checks</a>.
<li>	<a href="#Batching via Sequence Counter">
	Batching via sequence counter</a>.
<li>	<a href="#Funnel Locking and Wait/Wakeup">
	Funnel locking and wait/wakeup</a>.
<li>	<a href="#Use of Workqueues">Use of Workqueues</a>.
<li>	<a href="#Stall Warnings">Stall warnings</a>.
<li>	<a href="#Mid-Boot Operation">Mid-boot operation</a>.
</ol>

<h3><a name="Idle-CPU Checks">Idle-CPU Checks</a></h3>

<p>
Each expedited grace period checks for idle CPUs when initially forming
the mask of CPUs to be IPIed and again just before IPIing a CPU
(both checks are carried out by <tt>sync_rcu_exp_select_cpus()</tt>).
If the CPU is idle at any time between those two times, the CPU will
not be IPIed.
Instead, the task pushing the grace period forward will include the
idle CPUs in the mask passed to <tt>rcu_report_exp_cpu_mult()</tt>.

<p>
For RCU-sched, there is an additional check for idle in the IPI
handler, <tt>sync_sched_exp_handler()</tt>.
If the IPI has interrupted the idle loop, then
<tt>sync_sched_exp_handler()</tt> invokes <tt>rcu_report_exp_rdp()</tt>
to report the corresponding quiescent state.

<p>
For RCU-preempt, there is no specific check for idle in the
IPI handler (<tt>sync_rcu_exp_handler()</tt>), but because
RCU read-side critical sections are not permitted within the
idle loop, if <tt>sync_rcu_exp_handler()</tt> sees that the CPU is within
an RCU read-side critical section, the CPU cannot possibly be idle.
Otherwise, <tt>sync_rcu_exp_handler()</tt> invokes
<tt>rcu_report_exp_rdp()</tt> to report the corresponding quiescent
state, regardless of whether or not that quiescent state was due to
the CPU being idle.

<p>
In summary, RCU expedited grace periods check for idle when building
the bitmask of CPUs that must be IPIed, just before sending each IPI,
and (either explicitly or implicitly) within the IPI handler.

<h3><a name="Batching via Sequence Counter">
Batching via Sequence Counter</a></h3>

<p>
If each grace-period request were carried out separately, expedited
grace periods would have abysmal scalability and
problematic high-load characteristics.
Because each grace-period operation can serve an unlimited number of
updates, it is important to <i>batch</i> requests, so that a single
expedited grace-period operation will cover all requests in the
corresponding batch.

<p>
This batching is controlled by a sequence counter named
<tt>->expedited_sequence</tt> in the <tt>rcu_state</tt> structure.
This counter has an odd value when there is an expedited grace period
in progress and an even value otherwise, so that dividing the counter
value by two gives the number of completed grace periods.
During any given update request, the counter must transition from
even to odd and then back to even, thus indicating that a grace
period has elapsed.
Therefore, if the initial value of the counter is <tt>s</tt>,
the updater must wait until the counter reaches at least the
value <tt>(s+3)&~0x1</tt>.
This counter is managed by the following access functions:

<ol>
<li>	<tt>rcu_exp_gp_seq_start()</tt>, which marks the start of
	an expedited grace period.
<li>	<tt>rcu_exp_gp_seq_end()</tt>, which marks the end of an
	expedited grace period.
<li>	<tt>rcu_exp_gp_seq_snap()</tt>, which obtains a snapshot of
	the counter.
<li>	<tt>rcu_exp_gp_seq_done()</tt>, which returns <tt>true</tt>
	if a full expedited grace period has elapsed since the
	corresponding call to <tt>rcu_exp_gp_seq_snap()</tt>.
</ol>
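<p>
The following user-space sketch models these four functions and the
<tt>(s+3)&~0x1</tt> computation.
Only that computation and the odd/even convention are taken from the
description above; the <tt>toy_*</tt> names and the plain non-atomic
counter are invented for illustration, whereas the kernel must of course
use appropriately ordered accesses.

<pre>
/*
 * User-space sketch of the expedited sequence counter: odd while a
 * grace period is in progress, even otherwise, with the snapshot
 * arithmetic from the text above.
 */
#include <stdio.h>
#include <stdbool.h>

static unsigned long expedited_sequence;	/* even: idle, odd: GP in progress */

static void toy_exp_gp_seq_start(void) { expedited_sequence++; }	/* even -> odd */
static void toy_exp_gp_seq_end(void)   { expedited_sequence++; }	/* odd -> even */

static unsigned long toy_exp_gp_seq_snap(void)
{
	return (expedited_sequence + 3) & ~0x1UL;	/* first value meaning "done" */
}

static bool toy_exp_gp_seq_done(unsigned long snap)
{
	return expedited_sequence >= snap;
}

int main(void)
{
	unsigned long snap = toy_exp_gp_seq_snap();	/* request made while idle */

	printf("done before GP?  %d\n", toy_exp_gp_seq_done(snap));	/* 0 */
	toy_exp_gp_seq_start();
	printf("done during GP?  %d\n", toy_exp_gp_seq_done(snap));	/* 0 */
	toy_exp_gp_seq_end();
	printf("done after GP?   %d\n", toy_exp_gp_seq_done(snap));	/* 1 */
	return 0;
}
</pre>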
<p>
Again, only one request in a given batch need actually carry out
a grace-period operation, which means there must be an efficient
way to identify which of many concurrent requests will initiate
the grace period, and an efficient way for the remaining requests
to wait for that grace period to complete.
However, that is the topic of the next section.

<h3><a name="Funnel Locking and Wait/Wakeup">
Funnel Locking and Wait/Wakeup</a></h3>

<p>
The natural way to sort out which of a batch of updaters will initiate
the expedited grace period is to use the <tt>rcu_node</tt> combining
tree, as implemented by the <tt>exp_funnel_lock()</tt> function.
The first updater corresponding to a given grace period arriving
at a given <tt>rcu_node</tt> structure records its desired grace-period
sequence number in the <tt>->exp_seq_rq</tt> field and moves up
to the next level in the tree.
Otherwise, if the <tt>->exp_seq_rq</tt> field already contains
the sequence number for the desired grace period or some later one,
the updater blocks on one of four wait queues in the
<tt>->exp_wq[]</tt> array, using the second-from-bottom
and third-from-bottom bits as an index.
An <tt>->exp_lock</tt> field in the <tt>rcu_node</tt> structure
synchronizes access to these fields.
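<p>
The two ingredients just described, recording the desired sequence number
at a node unless an equal or later number is already present, and deriving
the waitqueue index from the second- and third-from-bottom bits, can be
sketched as follows.
The <tt>toy_*</tt> names and the single-node, lock-free layout are invented
for illustration; the real code walks the tree under <tt>->exp_lock</tt>.

<pre>
/*
 * Sketch of the funnel-lock decision at one node plus the wait-queue
 * index computation described in the text above.
 */
#include <stdio.h>
#include <stdbool.h>

struct toy_node {
	unsigned long exp_seq_rq;	/* highest sequence number requested here */
};

/* Return true if the caller must keep pushing the request up the tree. */
static bool toy_record_request(struct toy_node *np, unsigned long s)
{
	if (np->exp_seq_rq >= s)
		return false;		/* someone is already handling this GP: wait */
	np->exp_seq_rq = s;
	return true;			/* move up to the next level */
}

static unsigned int toy_exp_wq_index(unsigned long s)
{
	return (s >> 1) & 0x3;		/* second- and third-from-bottom bits */
}

int main(void)
{
	struct toy_node leaf = { .exp_seq_rq = 0 };

	printf("first updater moves up: %d\n", toy_record_request(&leaf, 2));	/* 1 */
	printf("second updater waits:   %d\n", !toy_record_request(&leaf, 2));	/* 1 */
	printf("sequence 2 waits on exp_wq[%u]\n", toy_exp_wq_index(2));	/* 1 */
	printf("sequence 4 waits on exp_wq[%u]\n", toy_exp_wq_index(4));	/* 2 */
	return 0;
}
</pre>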
<p>
An empty <tt>rcu_node</tt> tree is shown in the following diagram,
with the white cells representing the <tt>->exp_seq_rq</tt> field
and the red cells representing the elements of the
<tt>->exp_wq[]</tt> array.

<p><img src="Funnel0.svg" alt="Funnel0.svg" width="75%">

<p>
The next diagram shows the situation after the arrival of Task A
and Task B at the leftmost and rightmost leaf <tt>rcu_node</tt>
structures, respectively.
The current value of the <tt>rcu_state</tt> structure's
<tt>->expedited_sequence</tt> field is zero, so adding three and
clearing the bottom bit results in the value two, which both tasks
record in the <tt>->exp_seq_rq</tt> field of their respective
<tt>rcu_node</tt> structures:

<p><img src="Funnel1.svg" alt="Funnel1.svg" width="75%">

<p>
Each of Tasks A and B will move up to the root
<tt>rcu_node</tt> structure.
Suppose that Task A wins, recording its desired grace-period sequence
number and resulting in the state shown below:

<p><img src="Funnel2.svg" alt="Funnel2.svg" width="75%">

<p>
Task A now advances to initiate a new grace period, while Task B
moves up to the root <tt>rcu_node</tt> structure, and, seeing that
its desired sequence number is already recorded, blocks on
<tt>->exp_wq[1]</tt>.

<table>
<tr><th> </th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Why <tt>->exp_wq[1]</tt>?
	Given that the value of these tasks' desired sequence number is
	two, shouldn't they instead block on <tt>->exp_wq[2]</tt>?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	No.

	<p><font color="ffffff">
	Recall that the bottom bit of the desired sequence number indicates
	whether or not a grace period is currently in progress.
	It is therefore necessary to shift the sequence number right one
	bit position to obtain the number of the grace period.
	This results in <tt>->exp_wq[1]</tt>.
</font></td></tr>
<tr><td> </td></tr>
</table>

<p>
If Tasks C and D also arrive at this point, they will compute the
same desired grace-period sequence number, and see that both leaf
<tt>rcu_node</tt> structures already have that value recorded.
They will therefore block on their respective <tt>rcu_node</tt>
structures' <tt>->exp_wq[1]</tt> fields, as shown below:

<p><img src="Funnel3.svg" alt="Funnel3.svg" width="75%">

<p>
Task A now acquires the <tt>rcu_state</tt> structure's
<tt>->exp_mutex</tt> and initiates the grace period, which
increments <tt>->expedited_sequence</tt>.
Therefore, if Tasks E and F arrive, they will compute
a desired sequence number of 4 and will record this value as
shown below:

<p><img src="Funnel4.svg" alt="Funnel4.svg" width="75%">

<p>
Tasks E and F will propagate up the <tt>rcu_node</tt>
combining tree, with Task F blocking on the root <tt>rcu_node</tt>
structure and Task E waiting for Task A to finish so that
it can start the next grace period.
The resulting state is as shown below:

<p><img src="Funnel5.svg" alt="Funnel5.svg" width="75%">

<p>
Once the grace period completes, Task A
starts waking up the tasks waiting for this grace period to complete,
increments the <tt>->expedited_sequence</tt>,
acquires the <tt>->exp_wake_mutex</tt> and then releases the
<tt>->exp_mutex</tt>.
This results in the following state:

<p><img src="Funnel6.svg" alt="Funnel6.svg" width="75%">

<p>
Task E can then acquire <tt>->exp_mutex</tt> and increment
<tt>->expedited_sequence</tt> to the value three.
If new tasks G and H arrive and move up the combining tree at the
same time, the state will be as follows:

<p><img src="Funnel7.svg" alt="Funnel7.svg" width="75%">

<p>
Note that three of the root <tt>rcu_node</tt> structure's
waitqueues are now occupied.
However, at some point, Task A will wake up the
tasks blocked on the <tt>->exp_wq</tt> waitqueues, resulting
in the following state:

<p><img src="Funnel8.svg" alt="Funnel8.svg" width="75%">

<p>
Execution will continue with Tasks E and H completing
their grace periods and carrying out their wakeups.

<table>
<tr><th> </th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	What happens if Task A takes so long to do its wakeups
	that Task E's grace period completes?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Then Task E will block on the <tt>->exp_wake_mutex</tt>,
	which will also prevent it from releasing <tt>->exp_mutex</tt>,
	which in turn will prevent the next grace period from starting.
	This last is important in preventing overflow of the
	<tt>->exp_wq[]</tt> array.
</font></td></tr>
<tr><td> </td></tr>
</table>

<h3><a name="Use of Workqueues">Use of Workqueues</a></h3>

<p>
In earlier implementations, the task requesting the expedited
grace period also drove it to completion.
This straightforward approach had the disadvantage of needing to
account for POSIX signals sent to user tasks,
so more recent implementations use the Linux kernel's
<a href="https://www.kernel.org/doc/Documentation/core-api/workqueue.rst">workqueues</a>.

<p>
The requesting task still does counter snapshotting and funnel-lock
processing, but the task reaching the top of the funnel lock
does a <tt>schedule_work()</tt> (from <tt>_synchronize_rcu_expedited()</tt>)
so that a workqueue kthread does the actual grace-period processing.
Because workqueue kthreads do not accept POSIX signals, grace-period-wait
processing need not allow for POSIX signals.

<p>
In addition, this approach allows wakeups for the previous expedited
grace period to be overlapped with processing for the next expedited
grace period.
Because there are only four sets of waitqueues, it is necessary to
ensure that the previous grace period's wakeups complete before the
next grace period's wakeups start.
This is handled by having the <tt>->exp_mutex</tt>
guard expedited grace-period processing and the
<tt>->exp_wake_mutex</tt> guard wakeups.
The key point is that the <tt>->exp_mutex</tt> is not released
until the first wakeup is complete, which means that the
<tt>->exp_wake_mutex</tt> has already been acquired at that point.
This approach ensures that the previous grace period's wakeups can
be carried out while the current grace period is in process, but
that these wakeups will complete before the next grace period starts.
This means that only three waitqueues are required, guaranteeing that
the four that are provided are sufficient.
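<p>
This mutex handoff can be sketched in user space with POSIX threads
standing in for workqueue kthreads: each simulated grace period does its
processing under <tt>exp_mutex</tt>, acquires <tt>exp_wake_mutex</tt>
before releasing <tt>exp_mutex</tt>, and only then performs its wakeups.
That ordering lets the next grace period's processing overlap the previous
grace period's wakeups while still serializing the wakeup phases.
The <tt>toy_exp_worker()</tt> function and the use of plain pthreads are
illustrative only.

<pre>
/*
 * Sketch of the ->exp_mutex/->exp_wake_mutex handoff described above,
 * with pthreads in place of workqueues.  Build with -lpthread.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t exp_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t exp_wake_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *toy_exp_worker(void *arg)
{
	long gp = (long)arg;

	pthread_mutex_lock(&exp_mutex);
	printf("GP %ld: processing\n", gp);	/* grace-period work */
	pthread_mutex_lock(&exp_wake_mutex);	/* claim the wakeup phase... */
	pthread_mutex_unlock(&exp_mutex);	/* ...before letting the next GP start */
	printf("GP %ld: waking waiters\n", gp);
	usleep(1000);				/* pretend the wakeups take a while */
	pthread_mutex_unlock(&exp_wake_mutex);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, toy_exp_worker, (void *)1L);
	pthread_create(&t2, NULL, toy_exp_worker, (void *)2L);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}
</pre>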
<h3><a name="Stall Warnings">Stall Warnings</a></h3>

<p>
Expediting grace periods does nothing to speed things up when RCU
readers take too long, and therefore expedited grace periods check
for stalls just as normal grace periods do.

<table>
<tr><th> </th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But why not just let the normal grace-period machinery
	detect the stalls, given that a given reader must block
	both normal and expedited grace periods?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Because it is quite possible that at a given time there
	is no normal grace period in progress, in which case the
	normal grace period cannot emit a stall warning.
</font></td></tr>
<tr><td> </td></tr>
</table>

<p>
The <tt>synchronize_sched_expedited_wait()</tt> function loops waiting
for the expedited grace period to end, but with a timeout set to the
current RCU CPU stall-warning time.
If this time is exceeded, any CPUs or <tt>rcu_node</tt> structures
blocking the current grace period are printed.
Each stall warning results in another pass through the loop, but the
second and subsequent passes use longer stall times.

<h3><a name="Mid-Boot Operation">Mid-boot operation</a></h3>

<p>
The use of workqueues has the advantage that the expedited
grace-period code need not worry about POSIX signals.
Unfortunately, it has the
corresponding disadvantage that workqueues cannot be used until
they are initialized, which does not happen until some time after
the scheduler spawns the first task.
Given that there are parts of the kernel that really do want to
execute grace periods during this mid-boot “dead zone”,
expedited grace periods must do something else during this time.

<p>
What they do is to fall back to the old practice of requiring that the
requesting task drive the expedited grace period, as was the case
before the use of workqueues.
However, the requesting task is only required to drive the grace period
during the mid-boot dead zone.
Before mid-boot, a synchronous grace period is a no-op.
Some time after mid-boot, workqueues are used.

<p>
Non-expedited non-SRCU synchronous grace periods must also operate
normally during mid-boot.
This is handled by causing non-expedited grace periods to take the
expedited code path during mid-boot.

<p>
The current code assumes that there are no POSIX signals during
the mid-boot dead zone.
However, if an overwhelming need for POSIX signals somehow arises,
appropriate adjustments can be made to the expedited stall-warning code.
One such adjustment would reinstate the pre-workqueue stall-warning
checks, but only during the mid-boot dead zone.

<p>
With this refinement, synchronous grace periods can now be used from
task context pretty much any time during the life of the kernel.
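<p>
The resulting three-phase behavior can be sketched as follows, with an
explicit state variable standing in for the kernel's boot-time checks.
The <tt>toy_*</tt> names and the <tt>enum</tt> are invented for
illustration; only the early-boot no-op, the mid-boot caller-driven
fallback, and the later hand-off to a workqueue come from the
description above.

<pre>
/*
 * Sketch of the mid-boot fallback: before the scheduler starts, an
 * expedited grace period is a no-op; during the mid-boot dead zone
 * the requesting context drives the grace period itself; afterwards
 * the work is handed to a workqueue.
 */
#include <stdio.h>

enum toy_boot_state { TOY_EARLY_BOOT, TOY_MID_BOOT, TOY_WQ_READY };

static enum toy_boot_state boot_state = TOY_MID_BOOT;

static void toy_drive_exp_gp(void)   { printf("caller drives the grace period\n"); }
static void toy_queue_exp_work(void) { printf("workqueue drives the grace period\n"); }

static void toy_synchronize_expedited(void)
{
	switch (boot_state) {
	case TOY_EARLY_BOOT:
		return;			/* single task, nothing to wait for */
	case TOY_MID_BOOT:
		toy_drive_exp_gp();	/* workqueues not yet available */
		return;
	case TOY_WQ_READY:
		toy_queue_exp_work();	/* normal operation */
		return;
	}
}

int main(void)
{
	toy_synchronize_expedited();	/* mid-boot: driven by the caller */
	boot_state = TOY_WQ_READY;
	toy_synchronize_expedited();	/* later: handed to a workqueue */
	return 0;
}
</pre>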
<h3><a name="Summary">
Summary</a></h3>

<p>
Expedited grace periods use a sequence-number approach to promote
batching, so that a single grace-period operation can serve numerous
requests.
A funnel lock is used to efficiently identify the one task out of
a concurrent group that will request the grace period.
All members of the group will block on waitqueues provided in
the <tt>rcu_node</tt> structure.
The actual grace-period processing is carried out by a workqueue.

<p>
CPU-hotplug operations are noted lazily in order to prevent the need
for tight synchronization between expedited grace periods and
CPU-hotplug operations.
The dyntick-idle counters are used to avoid sending IPIs to idle CPUs,
at least in the common case.
RCU-preempt and RCU-sched use different IPI handlers and different
code to respond to the state changes carried out by those handlers,
but otherwise use common code.

<p>
Quiescent states are tracked using the <tt>rcu_node</tt> tree,
and once all necessary quiescent states have been reported,
all tasks waiting on this expedited grace period are awakened.
A pair of mutexes are used to allow one grace period's wakeups
to proceed concurrently with the next grace period's processing.

<p>
This combination of mechanisms allows expedited grace periods to
run reasonably efficiently.
However, for non-time-critical tasks, normal grace periods should be
used instead because their longer duration permits much higher
degrees of batching, and thus much lower per-request overheads.

</body></html>