========================== Real-Time group scheduling ========================== .. CONTENTS 0. WARNING 1. Overview 1.1 The problem 1.2 The solution 2. The interface 2.1 System-wide settings 2.2 Default behaviour 2.3 Basis for grouping tasks 3. Future plans 0. WARNING ========== Fiddling with these settings can result in an unstable system, the knobs are root only and assumes root knows what he is doing. Most notable: * very small values in sched_rt_period_us can result in an unstable system when the period is smaller than either the available hrtimer resolution, or the time it takes to handle the budget refresh itself. * very small values in sched_rt_runtime_us can result in an unstable system when the runtime is so small the system has difficulty making forward progress (NOTE: the migration thread and kstopmachine both are real-time processes). 1. Overview =========== 1.1 The problem --------------- Real-time scheduling is all about determinism, a group has to be able to rely on the amount of bandwidth (eg. CPU time) being constant. In order to schedule multiple groups of real-time tasks, each group must be assigned a fixed portion of the CPU time available. Without a minimum guarantee a real-time group can obviously fall short. A fuzzy upper limit is of no use since it cannot be relied upon. Which leaves us with just the single fixed portion. 1.2 The solution ---------------- CPU time is divided by means of specifying how much time can be spent running in a given period. We allocate this "run time" for each real-time group which the other real-time groups will not be permitted to use. Any time not allocated to a real-time group will be used to run normal priority tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by SCHED_OTHER. Let's consider an example: a frame fixed real-time renderer must deliver 25 frames a second, which yields a period of 0.04s per frame. Now say it will also have to play some music and respond to input, leaving it with around 80% CPU time dedicated for the graphics. We can then give this group a run time of 0.8 * 0.04s = 0.032s. This way the graphics group will have a 0.04s period with a 0.032s run time limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s = 0.00015s. So this group can be scheduled with a period of 0.005s and a run time of 0.00015s. The remaining CPU time will be used for user input and other tasks. Because real-time tasks have explicitly allocated the CPU time they need to perform their tasks, buffer underruns in the graphics or audio can be eliminated. NOTE: the above example is not fully implemented yet. We still lack an EDF scheduler to make non-uniform periods usable. 2. The Interface ================ 2.1 System wide settings ------------------------ The system wide settings are configured under the /proc virtual file system: /proc/sys/kernel/sched_rt_period_us: The scheduling period that is equivalent to 100% CPU bandwidth. /proc/sys/kernel/sched_rt_runtime_us: A global limit on how much time real-time scheduling may use. This is always less or equal to the period_us, as it denotes the time allocated from the period_us for the real-time tasks. Even without CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to real-time processes. With CONFIG_RT_GROUP_SCHED=y it signifies the total bandwidth available to all real-time groups. * Time is specified in us because the interface is s32. This gives an operating range from 1us to about 35 minutes. * sched_rt_period_us takes values from 1 to INT_MAX. * sched_rt_runtime_us takes values from -1 to sched_rt_period_us. * A run time of -1 specifies runtime == period, ie. no limit. 2.2 Default behaviour --------------------- The default values for sched_rt_period_us (1000000 or 1s) and sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away real-time tasks will not lock up the machine but leave a little time to recover it. By setting runtime to -1 you'd get the old behaviour back. By default all bandwidth is assigned to the root group and new groups get the period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you want to assign bandwidth to another group, reduce the root group's bandwidth and assign some or all of the difference to another group. Real-time group scheduling means you have to assign a portion of total CPU bandwidth to the group before it will accept real-time tasks. Therefore you will not be able to run real-time tasks as any user other than root until you have done that, even if the user has the rights to run processes with real-time priority! 2.3 Basis for grouping tasks ---------------------------- Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real CPU bandwidth to task groups. This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us" to control the CPU time reserved for each control group. For more information on working with control groups, you should read Documentation/admin-guide/cgroup-v1/cgroups.rst as well. Group settings are checked against the following limits in order to keep the configuration schedulable: \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period For now, this can be simplified to just the following (but see Future plans): \Sum_{i} runtime_{i} <= global_runtime 3. Future plans =============== There is work in progress to make the scheduling period for each group ("<cgroup>/cpu.rt_period_us") configurable as well. The constraint on the period is that a subgroup must have a smaller or equal period to its parent. But realistically its not very useful _yet_ as its prone to starvation without deadline scheduling. Consider two sibling groups A and B; both have 50% bandwidth, but A's period is twice the length of B's. * group A: period=100000us, runtime=50000us - this runs for 0.05s once every 0.1s * group B: period= 50000us, runtime=25000us - this runs for 0.025s twice every 0.1s (or once every 0.05 sec). This means that currently a while (1) loop in A will run for the full period of B and can starve B's tasks (assuming they are of lower priority) for a whole period. The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring full deadline scheduling to the linux kernel. Deadline scheduling the above groups and treating end of the period as a deadline will ensure that they both get their allocated time. Implementing SCHED_EDF might take a while to complete. Priority Inheritance is the biggest challenge as the current linux PI infrastructure is geared towards the limited static priority levels 0-99. With deadline scheduling you need to do deadline inheritance (since priority is inversely proportional to the deadline delta (deadline - now)). This means the whole PI machinery will have to be reworked - and that is one of the most complex pieces of code we have.