From ace9429bb58fd418f0c81d4c2835699bddf6bde6 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Thu, 11 Apr 2024 10:27:49 +0200
Subject: Adding upstream version 6.6.15.

Signed-off-by: Daniel Baumann
---
 Documentation/RCU/checklist.rst | 533 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 533 insertions(+)
 create mode 100644 Documentation/RCU/checklist.rst

diff --git a/Documentation/RCU/checklist.rst b/Documentation/RCU/checklist.rst
new file mode 100644
index 0000000000..bd3c58c44b
--- /dev/null
+++ b/Documentation/RCU/checklist.rst
@@ -0,0 +1,533 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Review Checklist for RCU Patches
+================================
+
+
+This document contains a checklist for producing and reviewing patches
+that make use of RCU.  Violating any of the rules listed below will
+result in the same sorts of problems that leaving out a locking primitive
+would cause.  This list is based on experiences reviewing such patches
+over a rather long period of time, but improvements are always welcome!
+
+0.	Is RCU being applied to a read-mostly situation?  If the data
+	structure is updated more than about 10% of the time, then you
+	should strongly consider some other approach, unless detailed
+	performance measurements show that RCU is nonetheless the right
+	tool for the job.  Yes, RCU does reduce read-side overhead by
+	increasing write-side overhead, which is exactly why normal uses
+	of RCU will do much more reading than updating.
+
+	Another exception is where performance is not an issue, and RCU
+	provides a simpler implementation.  An example of this situation
+	is the dynamic NMI code in the Linux 2.6 kernel, at least on
+	architectures where NMIs are rare.
+
+	Yet another exception is where the low real-time latency of RCU's
+	read-side primitives is critically important.
+
+	One final exception is where RCU readers are used to prevent
+	the ABA problem (https://en.wikipedia.org/wiki/ABA_problem)
+	for lockless updates.  This does result in the mildly
+	counter-intuitive situation where rcu_read_lock() and
+	rcu_read_unlock() are used to protect updates; however, this
+	approach can provide the same simplifications to certain types
+	of lockless algorithms that garbage collectors do.
+
+1.	Does the update code have proper mutual exclusion?
+
+	RCU does allow *readers* to run (almost) naked, but *writers* must
+	still use some sort of mutual exclusion, such as:
+
+	a.	locking,
+	b.	atomic operations, or
+	c.	restricting updates to a single task.
+
+	If you choose #b, be prepared to describe how you have handled
+	memory barriers on weakly ordered machines (pretty much all of
+	them -- even x86 allows later loads to be reordered to precede
+	earlier stores), and be prepared to explain why this added
+	complexity is worthwhile.  If you choose #c, be prepared to
+	explain how this single task does not become a major bottleneck
+	on large systems (for example, if the task is updating information
+	relating to itself that other tasks can read, there by definition
+	can be no bottleneck).  Note that the definition of "large" has
+	changed significantly:  Eight CPUs was "large" in the year 2000,
+	but a hundred CPUs was unremarkable in 2017.
+
+2.	Do the RCU read-side critical sections make proper use of
+	rcu_read_lock() and friends?
+	These primitives are needed
+	to prevent grace periods from ending prematurely, which
+	could result in data being unceremoniously freed out from
+	under your read-side code, which can greatly increase the
+	actuarial risk of your kernel.
+
+	As a rough rule of thumb, any dereference of an RCU-protected
+	pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(),
+	rcu_read_lock_sched(), or by the appropriate update-side lock.
+	Explicit disabling of preemption (preempt_disable(), for example)
+	can serve as rcu_read_lock_sched(), but is less readable and
+	prevents lockdep from detecting locking issues.
+
+	Please note that you *cannot* rely on code known to be built
+	only in non-preemptible kernels.  Such code can and will break,
+	especially in kernels built with CONFIG_PREEMPT_COUNT=y.
+
+	Letting RCU-protected pointers "leak" out of an RCU read-side
+	critical section is every bit as bad as letting them leak out
+	from under a lock, unless, of course, you have arranged some
+	other means of protection, such as a lock or a reference count
+	acquired *before* letting them out of the RCU read-side critical
+	section.
+
+3.	Does the update code tolerate concurrent accesses?
+
+	The whole point of RCU is to permit readers to run without
+	any locks or atomic operations.  This means that readers will
+	be running while updates are in progress.  There are a number
+	of ways to handle this concurrency, depending on the situation:
+
+	a.	Use the RCU variants of the list and hlist update
+		primitives to add, remove, and replace elements on
+		an RCU-protected list.  Alternatively, use the other
+		RCU-protected data structures that have been added to
+		the Linux kernel.
+
+		This is almost always the best approach.
+
+	b.	Proceed as in (a) above, but also maintain per-element
+		locks (that are acquired by both readers and writers)
+		that guard per-element state.  Fields that the readers
+		refrain from accessing can be guarded by some other lock
+		acquired only by updaters, if desired.
+
+		This also works quite well.
+
+	c.	Make updates appear atomic to readers.  For example,
+		pointer updates to properly aligned fields will
+		appear atomic, as will individual atomic primitives.
+		Sequences of operations performed under a lock will *not*
+		appear to be atomic to RCU readers, nor will sequences
+		of multiple atomic primitives.  One alternative is to
+		move multiple individual fields to a separate structure,
+		thus solving the multiple-field problem by imposing an
+		additional level of indirection.
+
+		This can work, but is starting to get a bit tricky.
+
+	d.	Carefully order the updates and the reads so that readers
+		see valid data at all phases of the update.  This is often
+		more difficult than it sounds, especially given modern
+		CPUs' tendency to reorder memory references.  One must
+		usually liberally sprinkle memory-ordering operations
+		through the code, making it difficult to understand and
+		to test.  Where it works, it is better to use things
+		like smp_store_release() and smp_load_acquire(), but in
+		some cases the smp_mb() full memory barrier is required.
+
+		As noted earlier, it is usually better to group the
+		changing data into a separate structure, so that the
+		change may be made to appear atomic by updating a pointer
+		to reference a new structure containing updated values.
+
+4.	Weakly ordered CPUs pose special challenges.  Almost all CPUs
+	are weakly ordered -- even x86 CPUs allow later loads to be
+	reordered to precede earlier stores.
+	RCU code must take all of
+	the following measures to prevent memory-corruption problems:
+
+	a.	Readers must maintain proper ordering of their memory
+		accesses.  The rcu_dereference() primitive ensures that
+		the CPU picks up the pointer before it picks up the data
+		that the pointer points to.  This really is necessary
+		on Alpha CPUs.
+
+		The rcu_dereference() primitive is also an excellent
+		documentation aid, letting the person reading the
+		code know exactly which pointers are protected by RCU.
+		Please note that compilers can also reorder code, and
+		they are becoming increasingly aggressive about doing
+		just that.  The rcu_dereference() primitive therefore also
+		prevents destructive compiler optimizations.  However,
+		with a bit of devious creativity, it is possible to
+		mishandle the return value from rcu_dereference().
+		Please see rcu_dereference.rst for more information.
+
+		The rcu_dereference() primitive is used by the
+		various "_rcu()" list-traversal primitives, such
+		as the list_for_each_entry_rcu().  Note that it is
+		perfectly legal (if redundant) for update-side code to
+		use rcu_dereference() and the "_rcu()" list-traversal
+		primitives.  This is particularly useful in code that
+		is common to readers and updaters.  However, lockdep
+		will complain if you access rcu_dereference() outside
+		of an RCU read-side critical section.  See lockdep.rst
+		to learn what to do about this.
+
+		Of course, neither rcu_dereference() nor the "_rcu()"
+		list-traversal primitives can substitute for a good
+		concurrency design coordinating among multiple updaters.
+
+	b.	If the list macros are being used, the list_add_tail_rcu()
+		and list_add_rcu() primitives must be used in order
+		to prevent weakly ordered machines from misordering
+		structure initialization and pointer planting.
+		Similarly, if the hlist macros are being used, the
+		hlist_add_head_rcu() primitive is required.
+
+	c.	If the list macros are being used, the list_del_rcu()
+		primitive must be used to keep list_del()'s pointer
+		poisoning from inflicting toxic effects on concurrent
+		readers.  Similarly, if the hlist macros are being used,
+		the hlist_del_rcu() primitive is required.
+
+		The list_replace_rcu() and hlist_replace_rcu() primitives
+		may be used to replace an old structure with a new one
+		in their respective types of RCU-protected lists.
+
+	d.	Rules similar to (4b) and (4c) apply to the "hlist_nulls"
+		type of RCU-protected linked lists.
+
+	e.	Updates must ensure that initialization of a given
+		structure happens before pointers to that structure are
+		publicized.  Use the rcu_assign_pointer() primitive
+		when publicizing a pointer to a structure that can
+		be traversed by an RCU read-side critical section.
+
+5.	If any of call_rcu(), call_srcu(), call_rcu_tasks(),
+	call_rcu_tasks_rude(), or call_rcu_tasks_trace() is used,
+	the callback function may be invoked from softirq context,
+	and in any case with bottom halves disabled.  In particular,
+	this callback function cannot block.  If you need the callback
+	to block, run that code in a workqueue handler scheduled from
+	the callback.  The queue_rcu_work() function does this for you
+	in the case of call_rcu().
+
+6.	Since synchronize_rcu() can block, it cannot be called
+	from any sort of irq context.  The same rule applies
+	for synchronize_srcu(), synchronize_rcu_expedited(),
+	synchronize_srcu_expedited(), synchronize_rcu_tasks(),
+	synchronize_rcu_tasks_rude(), and synchronize_rcu_tasks_trace().
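+
+	For example, here is a minimal sketch of an updater that is
+	safe precisely because it runs in process context and can
+	therefore block in synchronize_rcu().  The "struct foo",
+	"foo_list", "foo_mutex", and foo_del() names are hypothetical,
+	not taken from this document::
+
+		#include <linux/mutex.h>
+		#include <linux/rculist.h>
+		#include <linux/rcupdate.h>
+		#include <linux/slab.h>
+
+		struct foo {
+			struct list_head list;
+			int data;
+		};
+
+		static LIST_HEAD(foo_list);
+		/* A mutex, not a spinlock: synchronize_rcu() can block. */
+		static DEFINE_MUTEX(foo_mutex);
+
+		void foo_del(struct foo *fp)
+		{
+			mutex_lock(&foo_mutex);
+			list_del_rcu(&fp->list);
+			mutex_unlock(&foo_mutex);
+			synchronize_rcu();	/* Wait for pre-existing readers. */
+			kfree(fp);		/* No reader can now reach fp. */
+		}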
+
+	The expedited forms of these primitives have the same semantics
+	as the non-expedited forms, but expediting is more CPU intensive.
+	Use of the expedited primitives should be restricted to rare
+	configuration-change operations that would not normally be
+	undertaken while a real-time workload is running.  Note that
+	IPI-sensitive real-time workloads can use the rcupdate.rcu_normal
+	kernel boot parameter to completely disable expedited grace
+	periods, though this might have performance implications.
+
+	In particular, if you find yourself invoking one of the expedited
+	primitives repeatedly in a loop, please do everyone a favor:
+	Restructure your code so that it batches the updates, allowing
+	a single non-expedited primitive to cover the entire batch.
+	This will very likely be faster than the loop containing the
+	expedited primitive, and will be much, much easier on the rest
+	of the system, especially on any real-time workloads running
+	there.  Alternatively, use asynchronous primitives such as
+	call_rcu().
+
+7.	As of v4.20, a given kernel implements only one RCU flavor, which
+	is RCU-sched for PREEMPTION=n and RCU-preempt for PREEMPTION=y.
+	If the updater uses call_rcu() or synchronize_rcu(), then
+	the corresponding readers may use: (1) rcu_read_lock() and
+	rcu_read_unlock(), (2) any pair of primitives that disables
+	and re-enables softirq, for example, rcu_read_lock_bh() and
+	rcu_read_unlock_bh(), or (3) any pair of primitives that disables
+	and re-enables preemption, for example, rcu_read_lock_sched() and
+	rcu_read_unlock_sched().  If the updater uses synchronize_srcu()
+	or call_srcu(), then the corresponding readers must use
+	srcu_read_lock() and srcu_read_unlock(), with the same
+	srcu_struct.  The rules for the expedited RCU grace-period-wait
+	primitives are the same as for their non-expedited counterparts.
+
+	If the updater uses call_rcu_tasks() or synchronize_rcu_tasks(),
+	then the readers must refrain from executing voluntary
+	context switches, that is, from blocking.  If the updater uses
+	call_rcu_tasks_trace() or synchronize_rcu_tasks_trace(), then
+	the corresponding readers must use rcu_read_lock_trace() and
+	rcu_read_unlock_trace().  If an updater uses call_rcu_tasks_rude()
+	or synchronize_rcu_tasks_rude(), then the corresponding readers
+	must use anything that disables preemption, for example,
+	preempt_disable() and preempt_enable().
+
+	Mixing things up will result in confusion and broken kernels, and
+	has even resulted in an exploitable security issue.  Therefore,
+	when using non-obvious pairs of primitives, commenting is
+	of course a must.  One example of non-obvious pairing is
+	the XDP feature in networking, which calls BPF programs from
+	network-driver NAPI (softirq) context.  BPF relies heavily on RCU
+	protection for its data structures, but because the BPF program
+	invocation happens entirely within a single local_bh_disable()
+	section in a NAPI poll cycle, this usage is safe.  The reason
+	that this usage is safe is that readers can use anything that
+	disables BH when updaters use call_rcu() or synchronize_rcu().
+
+8.	Although synchronize_rcu() is slower than is call_rcu(),
+	it usually results in simpler code.  So, unless update
+	performance is critically important, the updaters cannot block,
+	or the latency of synchronize_rcu() is visible from userspace,
+	synchronize_rcu() should be used in preference to call_rcu().
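+
+	As a rough illustration, here is the same removal sketched both
+	ways; the "struct foo" and its helpers are hypothetical, not
+	taken from this document, and both functions assume the caller
+	provides update-side mutual exclusion::
+
+		#include <linux/rculist.h>
+		#include <linux/rcupdate.h>
+		#include <linux/slab.h>
+
+		struct foo {
+			struct list_head list;
+			struct rcu_head rcu;
+		};
+
+		/* Simple blocking form, preferred when the updater may block. */
+		void foo_del_sync(struct foo *fp)
+		{
+			list_del_rcu(&fp->list);
+			synchronize_rcu();	/* Wait for pre-existing readers. */
+			kfree(fp);
+		}
+
+		static void foo_reclaim(struct rcu_head *rhp)
+		{
+			kfree(container_of(rhp, struct foo, rcu));
+		}
+
+		/* Asynchronous form for updaters that cannot block. */
+		void foo_del_async(struct foo *fp)
+		{
+			list_del_rcu(&fp->list);
+			call_rcu(&fp->rcu, foo_reclaim);
+		}
+
+	Because foo_reclaim() does nothing but invoke kfree(), the
+	asynchronous form could instead use kfree_rcu(fp, rcu), as
+	discussed next.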
+
+	Furthermore, kfree_rcu() and kvfree_rcu() usually result
+	in even simpler code than does synchronize_rcu() without
+	synchronize_rcu()'s multi-millisecond latency.  So please take
+	advantage of kfree_rcu()'s and kvfree_rcu()'s "fire and forget"
+	memory-freeing capabilities where they apply.
+
+	An especially important property of the synchronize_rcu()
+	primitive is that it automatically self-limits: if grace periods
+	are delayed for whatever reason, then the synchronize_rcu()
+	primitive will correspondingly delay updates.  In contrast,
+	code using call_rcu() should explicitly limit update rate in
+	cases where grace periods are delayed, as failing to do so can
+	result in excessive realtime latencies or even OOM conditions.
+
+	Ways of gaining this self-limiting property when using call_rcu(),
+	kfree_rcu(), or kvfree_rcu() include:
+
+	a.	Keeping a count of the number of data-structure elements
+		used by the RCU-protected data structure, including
+		those waiting for a grace period to elapse.  Enforce a
+		limit on this number, stalling updates as needed to allow
+		previously deferred frees to complete.  Alternatively,
+		limit only the number awaiting deferred free rather than
+		the total number of elements.
+
+		One way to stall the updates is to acquire the update-side
+		mutex.  (Don't try this with a spinlock -- other CPUs
+		spinning on the lock could prevent the grace period
+		from ever ending.)  Another way to stall the updates
+		is for the updates to use a wrapper function around
+		the memory allocator, so that this wrapper function
+		simulates OOM when there is too much memory awaiting an
+		RCU grace period.  There are, of course, many other
+		variations on this theme.
+
+	b.	Limiting update rate.  For example, if updates occur only
+		once per hour, then no explicit rate limiting is
+		required, unless your system is already badly broken.
+		Older versions of the dcache subsystem took this approach,
+		guarding updates with a global lock, limiting their rate.
+
+	c.	Trusted update -- if updates can only be done manually by
+		superuser or some other trusted user, then it might not
+		be necessary to automatically limit them.  The theory
+		here is that superuser already has lots of ways to crash
+		the machine.
+
+	d.	Periodically invoke rcu_barrier(), permitting a limited
+		number of updates per grace period.
+
+	The same cautions apply to call_srcu(), call_rcu_tasks(),
+	call_rcu_tasks_rude(), and call_rcu_tasks_trace().  This is
+	why there are srcu_barrier(), rcu_barrier_tasks(),
+	rcu_barrier_tasks_rude(), and rcu_barrier_tasks_trace(),
+	respectively.
+
+	Note that although these primitives do take action to avoid
+	memory exhaustion when any given CPU has too many callbacks,
+	a determined user or administrator can still exhaust memory.
+	This is especially the case if a system with a large number of
+	CPUs has been configured to offload all of its RCU callbacks onto
+	a single CPU, or if the system has relatively little free memory.
+
+9.	All RCU list-traversal primitives, which include
+	rcu_dereference(), list_for_each_entry_rcu(), and
+	list_for_each_safe_rcu(), must be either within an RCU read-side
+	critical section or must be protected by appropriate update-side
+	locks.  RCU read-side critical sections are delimited by
+	rcu_read_lock() and rcu_read_unlock(), or by similar primitives
+	such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which
+	case the matching rcu_dereference() variant must be used in
+	order to keep lockdep happy, in this case rcu_dereference_bh().
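+
+	For example, here is a minimal sketch of a BH-flavored reader
+	whose markers and dereference match; the "struct foo" and
+	"gbl_foo" names are hypothetical, not taken from this document::
+
+		#include <linux/rcupdate.h>
+
+		struct foo {
+			int data;
+		};
+
+		static struct foo __rcu *gbl_foo;
+
+		int foo_get_data(void)
+		{
+			struct foo *fp;
+			int ret = -1;
+
+			rcu_read_lock_bh();
+			fp = rcu_dereference_bh(gbl_foo);  /* Matches the _bh markers. */
+			if (fp)
+				ret = fp->data;
+			rcu_read_unlock_bh();
+			return ret;
+		}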
+
+	The reason that it is permissible to use RCU list-traversal
+	primitives when the update-side lock is held is that doing so
+	can be quite helpful in reducing code bloat when common code is
+	shared between readers and updaters.  Additional primitives
+	are provided for this case, as discussed in lockdep.rst.
+
+	One exception to this rule is when data is only ever added to
+	the linked data structure, and is never removed during any
+	time that readers might be accessing that structure.  In such
+	cases, READ_ONCE() may be used in place of rcu_dereference()
+	and the read-side markers (rcu_read_lock() and rcu_read_unlock(),
+	for example) may be omitted.
+
+10.	Conversely, if you are in an RCU read-side critical section,
+	and you don't hold the appropriate update-side lock, you *must*
+	use the "_rcu()" variants of the list macros.  Failing to do so
+	will break Alpha, cause aggressive compilers to generate bad code,
+	and confuse people trying to understand your code.
+
+11.	Any lock acquired by an RCU callback must be acquired elsewhere
+	with softirq disabled, e.g., via spin_lock_bh().  Failing to
+	disable softirq on a given acquisition of that lock will result
+	in deadlock as soon as the RCU softirq handler happens to run
+	your RCU callback while interrupting that acquisition's critical
+	section.
+
+12.	RCU callbacks can be and are executed in parallel.  In many cases,
+	the callback code is simply a wrapper around kfree(), so that this
+	is not an issue (or, more accurately, to the extent that it is
+	an issue, the memory-allocator locking handles it).  However,
+	if the callbacks do manipulate a shared data structure, they
+	must use whatever locking or other synchronization is required
+	to safely access and/or modify that data structure.
+
+	Do not assume that RCU callbacks will be executed on the same
+	CPU that executed the corresponding call_rcu() or call_srcu().
+	For example, if a given CPU goes offline while having an RCU
+	callback pending, then that RCU callback will execute on some
+	surviving CPU.  (If this were not the case, a self-spawning RCU
+	callback would prevent the victim CPU from ever going offline.)
+	Furthermore, CPUs designated by rcu_nocbs= might well *always*
+	have their RCU callbacks executed on some other CPUs.  In fact,
+	for some real-time workloads, this is the whole point of using
+	the rcu_nocbs= kernel boot parameter.
+
+	In addition, do not assume that callbacks queued in a given order
+	will be invoked in that order, even if they all are queued on the
+	same CPU.  Furthermore, do not assume that same-CPU callbacks will
+	be invoked serially.  For example, in recent kernels, CPUs can be
+	switched between offloaded and de-offloaded callback invocation,
+	and while a given CPU is undergoing such a switch, its callbacks
+	might be concurrently invoked by that CPU's softirq handler and
+	that CPU's rcuo kthread.  At such times, that CPU's callbacks
+	might be executed both concurrently and out of order.
+
+13.	Unlike most flavors of RCU, it *is* permissible to block in an
+	SRCU read-side critical section (delimited by srcu_read_lock()
+	and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
+	Please note that if you don't need to sleep in read-side critical
+	sections, you should be using RCU rather than SRCU, because RCU
+	is almost always faster and easier to use than is SRCU.
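+
+	For example, here is a minimal sketch of a sleepable SRCU
+	reader; the "my_srcu" domain, "struct foo", "gbl_foo", and
+	do_report() names are hypothetical, not taken from this
+	document::
+
+		#include <linux/srcu.h>
+
+		struct foo {
+			int data;
+		};
+
+		DEFINE_STATIC_SRCU(my_srcu);
+		static struct foo __rcu *gbl_foo;
+
+		/* Hypothetical helper that may block, for example, on I/O. */
+		void do_report(struct foo *fp);
+
+		void foo_report(void)
+		{
+			struct foo *fp;
+			int idx;
+
+			idx = srcu_read_lock(&my_srcu);
+			fp = srcu_dereference(gbl_foo, &my_srcu);
+			if (fp)
+				do_report(fp);	/* This call is permitted to sleep. */
+			srcu_read_unlock(&my_srcu, idx);
+		}
+
+	Note the use of srcu_dereference() rather than rcu_dereference(),
+	for the lockdep reasons covered at the end of this item.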
+
+	Also unlike other forms of RCU, explicit initialization and
+	cleanup are required either at build time via DEFINE_SRCU()
+	or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct()
+	and cleanup_srcu_struct().  These last two are passed a
+	"struct srcu_struct" that defines the scope of a given
+	SRCU domain.  Once initialized, the srcu_struct is passed
+	to srcu_read_lock(), srcu_read_unlock(), synchronize_srcu(),
+	synchronize_srcu_expedited(), and call_srcu().  A given
+	synchronize_srcu() waits only for SRCU read-side critical
+	sections governed by srcu_read_lock() and srcu_read_unlock()
+	calls that have been passed the same srcu_struct.  This property
+	is what makes sleeping read-side critical sections tolerable --
+	a given subsystem delays only its own updates, not those of other
+	subsystems using SRCU.  Therefore, SRCU is less prone to OOM the
+	system than RCU would be if RCU's read-side critical sections
+	were permitted to sleep.
+
+	The ability to sleep in read-side critical sections does not
+	come for free.  First, corresponding srcu_read_lock() and
+	srcu_read_unlock() calls must be passed the same srcu_struct.
+	Second, grace-period-detection overhead is amortized only
+	over those updates sharing a given srcu_struct, rather than
+	being globally amortized as they are for other forms of RCU.
+	Therefore, SRCU should be used in preference to rw_semaphore
+	only in extremely read-intensive situations, or in situations
+	requiring SRCU's read-side deadlock immunity or low read-side
+	realtime latency.  You should also consider percpu_rw_semaphore
+	when you need lightweight readers.
+
+	SRCU's expedited primitive (synchronize_srcu_expedited())
+	never sends IPIs to other CPUs, so it is easier on
+	real-time workloads than is synchronize_rcu_expedited().
+
+	It is also permissible to sleep in RCU Tasks Trace read-side
+	critical sections, which are delimited by rcu_read_lock_trace()
+	and rcu_read_unlock_trace().  However, this is a specialized
+	flavor of RCU, and you should not use it without first checking
+	with its current users.  In most cases, you should instead
+	use SRCU.
+
+	Note that rcu_assign_pointer() relates to SRCU just as it does to
+	other forms of RCU, but instead of rcu_dereference() you should
+	use srcu_dereference() in order to avoid lockdep splats.
+
+14.	The whole point of call_rcu(), synchronize_rcu(), and friends
+	is to wait until all pre-existing readers have finished before
+	carrying out some otherwise-destructive operation.  It is
+	therefore critically important to *first* remove any path
+	that readers can follow that could be affected by the
+	destructive operation, and *only then* invoke call_rcu(),
+	synchronize_rcu(), or friends.
+
+	Because these primitives only wait for pre-existing readers, it
+	is the caller's responsibility to guarantee that any subsequent
+	readers will execute safely.
+
+15.	The various RCU read-side primitives do *not* necessarily contain
+	memory barriers.  You should therefore plan for the CPU
+	and the compiler to freely reorder code into and out of RCU
+	read-side critical sections.  It is the responsibility of the
+	RCU update-side primitives to deal with this.
+
+	For SRCU readers, you can use smp_mb__after_srcu_read_unlock()
+	immediately after an srcu_read_unlock() to get a full barrier.
+
+16.	Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the
+	__rcu sparse checks to validate your RCU code.
+	These can help
+	find problems as follows:
+
+	CONFIG_PROVE_LOCKING:
+		check that accesses to RCU-protected data structures
+		are carried out under the proper RCU read-side critical
+		section, while holding the right combination of locks,
+		or whatever other conditions are appropriate.
+
+	CONFIG_DEBUG_OBJECTS_RCU_HEAD:
+		check that you don't pass the same object to call_rcu()
+		(or friends) before an RCU grace period has elapsed
+		since the last time that you passed that same object to
+		call_rcu() (or friends).
+
+	__rcu sparse checks:
+		tag the pointer to the RCU-protected data structure
+		with __rcu, and sparse will warn you if you access that
+		pointer without the services of one of the variants
+		of rcu_dereference().
+
+	These debugging aids can help you find problems that are
+	otherwise extremely difficult to spot.
+
+17.	If you pass a callback function defined within a module to one of
+	call_rcu(), call_srcu(), call_rcu_tasks(), call_rcu_tasks_rude(),
+	or call_rcu_tasks_trace(), then it is necessary to wait for all
+	pending callbacks to be invoked before unloading that module.
+	Note that it is absolutely *not* sufficient to wait for a grace
+	period!  For example, the synchronize_rcu() implementation is
+	*not* guaranteed to wait for callbacks registered on other CPUs
+	via call_rcu(), or even on the current CPU if that CPU recently
+	went offline and came back online.
+
+	You instead need to use one of the barrier functions:
+
+	-	call_rcu() -> rcu_barrier()
+	-	call_srcu() -> srcu_barrier()
+	-	call_rcu_tasks() -> rcu_barrier_tasks()
+	-	call_rcu_tasks_rude() -> rcu_barrier_tasks_rude()
+	-	call_rcu_tasks_trace() -> rcu_barrier_tasks_trace()
+
+	However, these barrier functions are absolutely *not* guaranteed
+	to wait for a grace period.  For example, if there are no
+	call_rcu() callbacks queued anywhere in the system, rcu_barrier()
+	can and will return immediately.
+
+	So if you need to wait for both a grace period and for all
+	pre-existing callbacks, you will need to invoke both functions,
+	with the pair depending on the flavor of RCU:
+
+	-	Either synchronize_rcu() or synchronize_rcu_expedited(),
+		together with rcu_barrier()
+	-	Either synchronize_srcu() or synchronize_srcu_expedited(),
+		together with srcu_barrier()
+	-	synchronize_rcu_tasks() and rcu_barrier_tasks()
+	-	synchronize_rcu_tasks_rude() and rcu_barrier_tasks_rude()
+	-	synchronize_rcu_tasks_trace() and rcu_barrier_tasks_trace()
+
+	If necessary, you can use something like workqueues to execute
+	the requisite pair of functions concurrently.
+
+	See rcubarrier.rst for more information.