diff options
Diffstat (limited to 'man/man7/sched.7')
-rw-r--r-- | man/man7/sched.7 | 992 |
1 files changed, 992 insertions, 0 deletions
diff --git a/man/man7/sched.7 b/man/man7/sched.7 new file mode 100644 index 0000000..607b919 --- /dev/null +++ b/man/man7/sched.7 @@ -0,0 +1,992 @@ +.\" Copyright (C) 2014 Michael Kerrisk <mtk.manpages@gmail.com> +.\" and Copyright (C) 2014 Peter Zijlstra <peterz@infradead.org> +.\" and Copyright (C) 2014 Juri Lelli <juri.lelli@gmail.com> +.\" Various pieces from the old sched_setscheduler(2) page +.\" Copyright (C) Tom Bjorkholm, Markus Kuhn & David A. Wheeler 1996-1999 +.\" and Copyright (C) 2007 Carsten Emde <Carsten.Emde@osadl.org> +.\" and Copyright (C) 2008 Michael Kerrisk <mtk.manpages@gmail.com> +.\" +.\" SPDX-License-Identifier: GPL-2.0-or-later +.\" +.\" Worth looking at: http://rt.wiki.kernel.org/index.php +.\" +.TH sched 7 2024-05-02 "Linux man-pages (unreleased)" +.SH NAME +sched \- overview of CPU scheduling +.SH DESCRIPTION +Since Linux 2.6.23, the default scheduler is CFS, +the "Completely Fair Scheduler". +The CFS scheduler replaced the earlier "O(1)" scheduler. +.\" +.SS API summary +Linux provides the following system calls for controlling +the CPU scheduling behavior, policy, and priority of processes +(or, more precisely, threads). +.TP +.BR nice (2) +Set a new nice value for the calling thread, +and return the new nice value. +.TP +.BR getpriority (2) +Return the nice value of a thread, a process group, +or the set of threads owned by a specified user. +.TP +.BR setpriority (2) +Set the nice value of a thread, a process group, +or the set of threads owned by a specified user. +.TP +.BR sched_setscheduler (2) +Set the scheduling policy and parameters of a specified thread. +.TP +.BR sched_getscheduler (2) +Return the scheduling policy of a specified thread. +.TP +.BR sched_setparam (2) +Set the scheduling parameters of a specified thread. +.TP +.BR sched_getparam (2) +Fetch the scheduling parameters of a specified thread. +.TP +.BR sched_get_priority_max (2) +Return the maximum priority available in a specified scheduling policy. +.TP +.BR sched_get_priority_min (2) +Return the minimum priority available in a specified scheduling policy. +.TP +.BR sched_rr_get_interval (2) +Fetch the quantum used for threads that are scheduled under +the "round-robin" scheduling policy. +.TP +.BR sched_yield (2) +Cause the caller to relinquish the CPU, +so that some other thread be executed. +.TP +.BR sched_setaffinity (2) +(Linux-specific) +Set the CPU affinity of a specified thread. +.TP +.BR sched_getaffinity (2) +(Linux-specific) +Get the CPU affinity of a specified thread. +.TP +.BR sched_setattr (2) +Set the scheduling policy and parameters of a specified thread. +This (Linux-specific) system call provides a superset of the functionality of +.BR sched_setscheduler (2) +and +.BR sched_setparam (2). +.TP +.BR sched_getattr (2) +Fetch the scheduling policy and parameters of a specified thread. +This (Linux-specific) system call provides a superset of the functionality of +.BR sched_getscheduler (2) +and +.BR sched_getparam (2). +.\" +.SS Scheduling policies +The scheduler is the kernel component that decides which runnable thread +will be executed by the CPU next. +Each thread has an associated scheduling policy and a \fIstatic\fP +scheduling priority, +.IR sched_priority . +The scheduler makes its decisions based on knowledge of the scheduling +policy and static priority of all threads on the system. +.P +For threads scheduled under one of the normal scheduling policies +(\fBSCHED_OTHER\fP, \fBSCHED_IDLE\fP, \fBSCHED_BATCH\fP), +\fIsched_priority\fP is not used in scheduling +decisions (it must be specified as 0). +.P +Processes scheduled under one of the real-time policies +(\fBSCHED_FIFO\fP, \fBSCHED_RR\fP) have a +\fIsched_priority\fP value in the range 1 (low) to 99 (high). +(As the numbers imply, real-time threads always have higher priority +than normal threads.) +Note well: POSIX.1 requires an implementation to support only a +minimum 32 distinct priority levels for the real-time policies, +and some systems supply just this minimum. +Portable programs should use +.BR sched_get_priority_min (2) +and +.BR sched_get_priority_max (2) +to find the range of priorities supported for a particular policy. +.P +Conceptually, the scheduler maintains a list of runnable +threads for each possible \fIsched_priority\fP value. +In order to determine which thread runs next, the scheduler looks for +the nonempty list with the highest static priority and selects the +thread at the head of this list. +.P +A thread's scheduling policy determines +where it will be inserted into the list of threads +with equal static priority and how it will move inside this list. +.P +All scheduling is preemptive: if a thread with a higher static +priority becomes ready to run, the currently running thread +will be preempted and +returned to the wait list for its static priority level. +The scheduling policy determines the +ordering only within the list of runnable threads with equal static +priority. +.SS SCHED_FIFO: First in-first out scheduling +\fBSCHED_FIFO\fP can be used only with static priorities higher than +0, which means that when a \fBSCHED_FIFO\fP thread becomes runnable, +it will always immediately preempt any currently running +\fBSCHED_OTHER\fP, \fBSCHED_BATCH\fP, or \fBSCHED_IDLE\fP thread. +\fBSCHED_FIFO\fP is a simple scheduling +algorithm without time slicing. +For threads scheduled under the +\fBSCHED_FIFO\fP policy, the following rules apply: +.IP \[bu] 3 +A running \fBSCHED_FIFO\fP thread that has been preempted by another thread of +higher priority will stay at the head of the list for its priority and +will resume execution as soon as all threads of higher priority are +blocked again. +.IP \[bu] +When a blocked \fBSCHED_FIFO\fP thread becomes runnable, it +will be inserted at the end of the list for its priority. +.IP \[bu] +If a call to +.BR sched_setscheduler (2), +.BR sched_setparam (2), +.BR sched_setattr (2), +.BR pthread_setschedparam (3), +or +.BR pthread_setschedprio (3) +changes the priority of the running or runnable +.B SCHED_FIFO +thread identified by +.I pid +the effect on the thread's position in the list depends on +the direction of the change to the thread's priority: +.RS +.IP (a) 5 +If the thread's priority is raised, +it is placed at the end of the list for its new priority. +As a consequence, +it may preempt a currently running thread with the same priority. +.IP (b) +If the thread's priority is unchanged, +its position in the run list is unchanged. +.IP (c) +If the thread's priority is lowered, +it is placed at the front of the list for its new priority. +.RE +.IP +According to POSIX.1-2008, +changes to a thread's priority (or policy) using any mechanism other than +.BR pthread_setschedprio (3) +should result in the thread being placed at the end of +the list for its priority. +.\" In Linux 2.2.x and Linux 2.4.x, the thread is placed at the front of the queue +.\" In Linux 2.0.x, the Right Thing happened: the thread went to the back -- MTK +.IP \[bu] +A thread calling +.BR sched_yield (2) +will be put at the end of the list. +.P +No other events will move a thread +scheduled under the \fBSCHED_FIFO\fP policy in the wait list of +runnable threads with equal static priority. +.P +A \fBSCHED_FIFO\fP +thread runs until either it is blocked by an I/O request, it is +preempted by a higher priority thread, or it calls +.BR sched_yield (2). +.SS SCHED_RR: Round-robin scheduling +\fBSCHED_RR\fP is a simple enhancement of \fBSCHED_FIFO\fP. +Everything +described above for \fBSCHED_FIFO\fP also applies to \fBSCHED_RR\fP, +except that each thread is allowed to run only for a maximum time +quantum. +If a \fBSCHED_RR\fP thread has been running for a time +period equal to or longer than the time quantum, it will be put at the +end of the list for its priority. +A \fBSCHED_RR\fP thread that has +been preempted by a higher priority thread and subsequently resumes +execution as a running thread will complete the unexpired portion of +its round-robin time quantum. +The length of the time quantum can be +retrieved using +.BR sched_rr_get_interval (2). +.\" On Linux 2.4, the length of the RR interval is influenced +.\" by the process nice value -- MTK +.\" +.SS SCHED_DEADLINE: Sporadic task model deadline scheduling +Since Linux 3.14, Linux provides a deadline scheduling policy +.RB ( SCHED_DEADLINE ). +This policy is currently implemented using +GEDF (Global Earliest Deadline First) +in conjunction with CBS (Constant Bandwidth Server). +To set and fetch this policy and associated attributes, +one must use the Linux-specific +.BR sched_setattr (2) +and +.BR sched_getattr (2) +system calls. +.P +A sporadic task is one that has a sequence of jobs, where each +job is activated at most once per period. +Each job also has a +.IR "relative deadline" , +before which it should finish execution, and a +.IR "computation time" , +which is the CPU time necessary for executing the job. +The moment when a task wakes up +because a new job has to be executed is called the +.I arrival time +(also referred to as the request time or release time). +The +.I start time +is the time at which a task starts its execution. +The +.I absolute deadline +is thus obtained by adding the relative deadline to the arrival time. +.P +The following diagram clarifies these terms: +.P +.in +4n +.EX +arrival/wakeup absolute deadline + | start time | + | | | + v v v +-----x--------xooooooooooooooooo--------x--------x--- + |<- comp. time ->| + |<------- relative deadline ------>| + |<-------------- period ------------------->| +.EE +.in +.P +When setting a +.B SCHED_DEADLINE +policy for a thread using +.BR sched_setattr (2), +one can specify three parameters: +.IR Runtime , +.IR Deadline , +and +.IR Period . +These parameters do not necessarily correspond to the aforementioned terms: +usual practice is to set Runtime to something bigger than the average +computation time (or worst-case execution time for hard real-time tasks), +Deadline to the relative deadline, and Period to the period of the task. +Thus, for +.B SCHED_DEADLINE +scheduling, we have: +.P +.in +4n +.EX +arrival/wakeup absolute deadline + | start time | + | | | + v v v +-----x--------xooooooooooooooooo--------x--------x--- + |<-- Runtime ------->| + |<----------- Deadline ----------->| + |<-------------- Period ------------------->| +.EE +.in +.P +The three deadline-scheduling parameters correspond to the +.IR sched_runtime , +.IR sched_deadline , +and +.I sched_period +fields of the +.I sched_attr +structure; see +.BR sched_setattr (2). +These fields express values in nanoseconds. +.\" FIXME It looks as though specifying sched_period as 0 means +.\" "make sched_period the same as sched_deadline". +.\" This needs to be documented. +If +.I sched_period +is specified as 0, then it is made the same as +.IR sched_deadline . +.P +The kernel requires that: +.P +.in +4n +.EX +sched_runtime <= sched_deadline <= sched_period +.EE +.in +.P +.\" See __checkparam_dl in kernel/sched/core.c +In addition, under the current implementation, +all of the parameter values must be at least 1024 +(i.e., just over one microsecond, +which is the resolution of the implementation), and less than 2\[ha]63. +If any of these checks fails, +.BR sched_setattr (2) +fails with the error +.BR EINVAL . +.P +The CBS guarantees non-interference between tasks, by throttling +threads that attempt to over-run their specified Runtime. +.P +To ensure deadline scheduling guarantees, +the kernel must prevent situations where the set of +.B SCHED_DEADLINE +threads is not feasible (schedulable) within the given constraints. +The kernel thus performs an admittance test when setting or changing +.B SCHED_DEADLINE +policy and attributes. +This admission test calculates whether the change is feasible; +if it is not, +.BR sched_setattr (2) +fails with the error +.BR EBUSY . +.P +For example, it is required (but not necessarily sufficient) for +the total utilization to be less than or equal to the total number of +CPUs available, where, since each thread can maximally run for +Runtime per Period, that thread's utilization is its +Runtime divided by its Period. +.P +In order to fulfill the guarantees that are made when +a thread is admitted to the +.B SCHED_DEADLINE +policy, +.B SCHED_DEADLINE +threads are the highest priority (user controllable) threads in the +system; if any +.B SCHED_DEADLINE +thread is runnable, +it will preempt any thread scheduled under one of the other policies. +.P +A call to +.BR fork (2) +by a thread scheduled under the +.B SCHED_DEADLINE +policy fails with the error +.BR EAGAIN , +unless the thread has its reset-on-fork flag set (see below). +.P +A +.B SCHED_DEADLINE +thread that calls +.BR sched_yield (2) +will yield the current job and wait for a new period to begin. +.\" +.\" FIXME Calling sched_getparam() on a SCHED_DEADLINE thread +.\" fails with EINVAL, but sched_getscheduler() succeeds. +.\" Is that intended? (Why?) +.\" +.SS SCHED_OTHER: Default Linux time-sharing scheduling +\fBSCHED_OTHER\fP can be used at only static priority 0 +(i.e., threads under real-time policies always have priority over +.B SCHED_OTHER +processes). +\fBSCHED_OTHER\fP is the standard Linux time-sharing scheduler that is +intended for all threads that do not require the special +real-time mechanisms. +.P +The thread to run is chosen from the static +priority 0 list based on a \fIdynamic\fP priority that is determined only +inside this list. +The dynamic priority is based on the nice value (see below) +and is increased for each time quantum the thread is ready to run, +but denied to run by the scheduler. +This ensures fair progress among all \fBSCHED_OTHER\fP threads. +.P +In the Linux kernel source code, the +.B SCHED_OTHER +policy is actually named +.BR SCHED_NORMAL . +.\" +.SS The nice value +The nice value is an attribute +that can be used to influence the CPU scheduler to +favor or disfavor a process in scheduling decisions. +It affects the scheduling of +.B SCHED_OTHER +and +.B SCHED_BATCH +(see below) processes. +The nice value can be modified using +.BR nice (2), +.BR setpriority (2), +or +.BR sched_setattr (2). +.P +According to POSIX.1, the nice value is a per-process attribute; +that is, the threads in a process should share a nice value. +However, on Linux, the nice value is a per-thread attribute: +different threads in the same process may have different nice values. +.P +The range of the nice value +varies across UNIX systems. +On modern Linux, the range is \-20 (high priority) to +19 (low priority). +On some other systems, the range is \-20..20. +Very early Linux kernels (before Linux 2.0) had the range \-infinity..15. +.\" Linux before 1.3.36 had \-infinity..15. +.\" Since Linux 1.3.43, Linux has the range \-20..19. +.P +The degree to which the nice value affects the relative scheduling of +.B SCHED_OTHER +processes likewise varies across UNIX systems and +across Linux kernel versions. +.P +With the advent of the CFS scheduler in Linux 2.6.23, +Linux adopted an algorithm that causes +relative differences in nice values to have a much stronger effect. +In the current implementation, each unit of difference in the +nice values of two processes results in a factor of 1.25 +in the degree to which the scheduler favors the higher priority process. +This causes very low nice values (+19) to truly provide little CPU +to a process whenever there is any other +higher priority load on the system, +and makes high nice values (\-20) deliver most of the CPU to applications +that require it (e.g., some audio applications). +.P +On Linux, the +.B RLIMIT_NICE +resource limit can be used to define a limit to which +an unprivileged process's nice value can be raised; see +.BR setrlimit (2) +for details. +.P +For further details on the nice value, see the subsections on +the autogroup feature and group scheduling, below. +.\" +.SS SCHED_BATCH: Scheduling batch processes +(Since Linux 2.6.16.) +\fBSCHED_BATCH\fP can be used only at static priority 0. +This policy is similar to \fBSCHED_OTHER\fP in that it schedules +the thread according to its dynamic priority +(based on the nice value). +The difference is that this policy +will cause the scheduler to always assume +that the thread is CPU-intensive. +Consequently, the scheduler will apply a small scheduling +penalty with respect to wakeup behavior, +so that this thread is mildly disfavored in scheduling decisions. +.P +.\" The following paragraph is drawn largely from the text that +.\" accompanied Ingo Molnar's patch for the implementation of +.\" SCHED_BATCH. +.\" commit b0a9499c3dd50d333e2aedb7e894873c58da3785 +This policy is useful for workloads that are noninteractive, +but do not want to lower their nice value, +and for workloads that want a deterministic scheduling policy without +interactivity causing extra preemptions (between the workload's tasks). +.\" +.SS SCHED_IDLE: Scheduling very low priority jobs +(Since Linux 2.6.23.) +\fBSCHED_IDLE\fP can be used only at static priority 0; +the process nice value has no influence for this policy. +.P +This policy is intended for running jobs at extremely low +priority (lower even than a +19 nice value with the +.B SCHED_OTHER +or +.B SCHED_BATCH +policies). +.\" +.SS Resetting scheduling policy for child processes +Each thread has a reset-on-fork scheduling flag. +When this flag is set, children created by +.BR fork (2) +do not inherit privileged scheduling policies. +The reset-on-fork flag can be set by either: +.IP \[bu] 3 +ORing the +.B SCHED_RESET_ON_FORK +flag into the +.I policy +argument when calling +.BR sched_setscheduler (2) +(since Linux 2.6.32); +or +.IP \[bu] +specifying the +.B SCHED_FLAG_RESET_ON_FORK +flag in +.I attr.sched_flags +when calling +.BR sched_setattr (2). +.P +Note that the constants used with these two APIs have different names. +The state of the reset-on-fork flag can analogously be retrieved using +.BR sched_getscheduler (2) +and +.BR sched_getattr (2). +.P +The reset-on-fork feature is intended for media-playback applications, +and can be used to prevent applications evading the +.B RLIMIT_RTTIME +resource limit (see +.BR getrlimit (2)) +by creating multiple child processes. +.P +More precisely, if the reset-on-fork flag is set, +the following rules apply for subsequently created children: +.IP \[bu] 3 +If the calling thread has a scheduling policy of +.B SCHED_FIFO +or +.BR SCHED_RR , +the policy is reset to +.B SCHED_OTHER +in child processes. +.IP \[bu] +If the calling process has a negative nice value, +the nice value is reset to zero in child processes. +.P +After the reset-on-fork flag has been enabled, +it can be reset only if the thread has the +.B CAP_SYS_NICE +capability. +This flag is disabled in child processes created by +.BR fork (2). +.\" +.SS Privileges and resource limits +Before Linux 2.6.12, only privileged +.RB ( CAP_SYS_NICE ) +threads can set a nonzero static priority (i.e., set a real-time +scheduling policy). +The only change that an unprivileged thread can make is to set the +.B SCHED_OTHER +policy, and this can be done only if the effective user ID of the caller +matches the real or effective user ID of the target thread +(i.e., the thread specified by +.IR pid ) +whose policy is being changed. +.P +A thread must be privileged +.RB ( CAP_SYS_NICE ) +in order to set or modify a +.B SCHED_DEADLINE +policy. +.P +Since Linux 2.6.12, the +.B RLIMIT_RTPRIO +resource limit defines a ceiling on an unprivileged thread's +static priority for the +.B SCHED_RR +and +.B SCHED_FIFO +policies. +The rules for changing scheduling policy and priority are as follows: +.IP \[bu] 3 +If an unprivileged thread has a nonzero +.B RLIMIT_RTPRIO +soft limit, then it can change its scheduling policy and priority, +subject to the restriction that the priority cannot be set to a +value higher than the maximum of its current priority and its +.B RLIMIT_RTPRIO +soft limit. +.IP \[bu] +If the +.B RLIMIT_RTPRIO +soft limit is 0, then the only permitted changes are to lower the priority, +or to switch to a non-real-time policy. +.IP \[bu] +Subject to the same rules, +another unprivileged thread can also make these changes, +as long as the effective user ID of the thread making the change +matches the real or effective user ID of the target thread. +.IP \[bu] +Special rules apply for the +.B SCHED_IDLE +policy. +Before Linux 2.6.39, +an unprivileged thread operating under this policy cannot +change its policy, regardless of the value of its +.B RLIMIT_RTPRIO +resource limit. +Since Linux 2.6.39, +.\" commit c02aa73b1d18e43cfd79c2f193b225e84ca497c8 +an unprivileged thread can switch to either the +.B SCHED_BATCH +or the +.B SCHED_OTHER +policy so long as its nice value falls within the range permitted by its +.B RLIMIT_NICE +resource limit (see +.BR getrlimit (2)). +.P +Privileged +.RB ( CAP_SYS_NICE ) +threads ignore the +.B RLIMIT_RTPRIO +limit; as with older kernels, +they can make arbitrary changes to scheduling policy and priority. +See +.BR getrlimit (2) +for further information on +.BR RLIMIT_RTPRIO . +.SS Limiting the CPU usage of real-time and deadline processes +A nonblocking infinite loop in a thread scheduled under the +.BR SCHED_FIFO , +.BR SCHED_RR , +or +.B SCHED_DEADLINE +policy can potentially block all other threads from accessing +the CPU forever. +Before Linux 2.6.25, the only way of preventing a runaway real-time +process from freezing the system was to run (at the console) +a shell scheduled under a higher static priority than the tested application. +This allows an emergency kill of tested +real-time applications that do not block or terminate as expected. +.P +Since Linux 2.6.25, there are other techniques for dealing with runaway +real-time and deadline processes. +One of these is to use the +.B RLIMIT_RTTIME +resource limit to set a ceiling on the CPU time that +a real-time process may consume. +See +.BR getrlimit (2) +for details. +.P +Since Linux 2.6.25, Linux also provides two +.I /proc +files that can be used to reserve a certain amount of CPU time +to be used by non-real-time processes. +Reserving CPU time in this fashion allows some CPU time to be +allocated to (say) a root shell that can be used to kill a runaway process. +Both of these files specify time values in microseconds: +.TP +.I /proc/sys/kernel/sched_rt_period_us +This file specifies a scheduling period that is equivalent to +100% CPU bandwidth. +The value in this file can range from 1 to +.BR INT_MAX , +giving an operating range of 1 microsecond to around 35 minutes. +The default value in this file is 1,000,000 (1 second). +.TP +.I /proc/sys/kernel/sched_rt_runtime_us +The value in this file specifies how much of the "period" time +can be used by all real-time and deadline scheduled processes +on the system. +The value in this file can range from \-1 to +.BR INT_MAX \-1. +Specifying \-1 makes the run time the same as the period; +that is, no CPU time is set aside for non-real-time processes +(which was the behavior before Linux 2.6.25). +The default value in this file is 950,000 (0.95 seconds), +meaning that 5% of the CPU time is reserved for processes that +don't run under a real-time or deadline scheduling policy. +.SS Response time +A blocked high priority thread waiting for I/O has a certain +response time before it is scheduled again. +The device driver writer +can greatly reduce this response time by using a "slow interrupt" +interrupt handler. +.\" as described in +.\" .BR request_irq (9). +.SS Miscellaneous +Child processes inherit the scheduling policy and parameters across a +.BR fork (2). +The scheduling policy and parameters are preserved across +.BR execve (2). +.P +Memory locking is usually needed for real-time processes to avoid +paging delays; this can be done with +.BR mlock (2) +or +.BR mlockall (2). +.\" +.SS The autogroup feature +.\" commit 5091faa449ee0b7d73bc296a93bca9540fc51d0a +Since Linux 2.6.38, +the kernel provides a feature known as autogrouping to improve interactive +desktop performance in the face of multiprocess, CPU-intensive +workloads such as building the Linux kernel with large numbers of +parallel build processes (i.e., the +.BR make (1) +.B \-j +flag). +.P +This feature operates in conjunction with the +CFS scheduler and requires a kernel that is configured with +.BR CONFIG_SCHED_AUTOGROUP . +On a running system, this feature is enabled or disabled via the file +.IR /proc/sys/kernel/sched_autogroup_enabled ; +a value of 0 disables the feature, while a value of 1 enables it. +The default value in this file is 1, unless the kernel was booted with the +.I noautogroup +parameter. +.P +A new autogroup is created when a new session is created via +.BR setsid (2); +this happens, for example, when a new terminal window is started. +A new process created by +.BR fork (2) +inherits its parent's autogroup membership. +Thus, all of the processes in a session are members of the same autogroup. +An autogroup is automatically destroyed when the last process +in the group terminates. +.P +When autogrouping is enabled, all of the members of an autogroup +are placed in the same kernel scheduler "task group". +The CFS scheduler employs an algorithm that equalizes the +distribution of CPU cycles across task groups. +The benefits of this for interactive desktop performance +can be described via the following example. +.P +Suppose that there are two autogroups competing for the same CPU +(i.e., presume either a single CPU system or the use of +.BR taskset (1) +to confine all the processes to the same CPU on an SMP system). +The first group contains ten CPU-bound processes from +a kernel build started with +.IR "make\~\-j10" . +The other contains a single CPU-bound process: a video player. +The effect of autogrouping is that the two groups will +each receive half of the CPU cycles. +That is, the video player will receive 50% of the CPU cycles, +rather than just 9% of the cycles, +which would likely lead to degraded video playback. +The situation on an SMP system is more complex, +.\" Mike Galbraith, 25 Nov 2016: +.\" I'd say something more wishy-washy here, like cycles are +.\" distributed fairly across groups and leave it at that, as your +.\" detailed example is incorrect due to SMP fairness (which I don't +.\" like much because [very unlikely] worst case scenario +.\" renders a box sized group incapable of utilizing more that +.\" a single CPU total). For example, if a group of NR_CPUS +.\" size competes with a singleton, load balancing will try to give +.\" the singleton a full CPU of its very own. If groups intersect for +.\" whatever reason on say my quad lappy, distribution is 80/20 in +.\" favor of the singleton. +but the general effect is the same: +the scheduler distributes CPU cycles across task groups such that +an autogroup that contains a large number of CPU-bound processes +does not end up hogging CPU cycles at the expense of the other +jobs on the system. +.P +A process's autogroup (task group) membership can be viewed via the file +.IR /proc/ pid /autogroup : +.P +.in +4n +.EX +$ \fBcat /proc/1/autogroup\fP +/autogroup\-1 nice 0 +.EE +.in +.P +This file can also be used to modify the CPU bandwidth allocated +to an autogroup. +This is done by writing a number in the "nice" range to the file +to set the autogroup's nice value. +The allowed range is from +19 (low priority) to \-20 (high priority). +(Writing values outside of this range causes +.BR write (2) +to fail with the error +.BR EINVAL .) +.\" FIXME . +.\" Because of a bug introduced in Linux 4.7 +.\" (commit 2159197d66770ec01f75c93fb11dc66df81fd45b made changes +.\" that exposed the fact that autogroup didn't call scale_load()), +.\" it happened that *all* values in this range caused a task group +.\" to be further disfavored by the scheduler, with \-20 resulting +.\" in the scheduler mildly disfavoring the task group and +19 greatly +.\" disfavoring it. +.\" +.\" A patch was posted on 23 Nov 2016 +.\" ("sched/autogroup: Fix 64bit kernel nice adjustment"; +.\" check later to see in which kernel version it lands. +.P +The autogroup nice setting has the same meaning as the process nice value, +but applies to distribution of CPU cycles to the autogroup as a whole, +based on the relative nice values of other autogroups. +For a process inside an autogroup, the CPU cycles that it receives +will be a product of the autogroup's nice value +(compared to other autogroups) +and the process's nice value +(compared to other processes in the same autogroup. +.P +The use of the +.BR cgroups (7) +CPU controller to place processes in cgroups other than the +root CPU cgroup overrides the effect of autogrouping. +.P +The autogroup feature groups only processes scheduled under +non-real-time policies +.RB ( SCHED_OTHER , +.BR SCHED_BATCH , +and +.BR SCHED_IDLE ). +It does not group processes scheduled under real-time and +deadline policies. +Those processes are scheduled according to the rules described earlier. +.\" +.SS The nice value and group scheduling +When scheduling non-real-time processes (i.e., those scheduled under the +.BR SCHED_OTHER , +.BR SCHED_BATCH , +and +.B SCHED_IDLE +policies), the CFS scheduler employs a technique known as "group scheduling", +if the kernel was configured with the +.B CONFIG_FAIR_GROUP_SCHED +option (which is typical). +.P +Under group scheduling, threads are scheduled in "task groups". +Task groups have a hierarchical relationship, +rooted under the initial task group on the system, +known as the "root task group". +Task groups are formed in the following circumstances: +.IP \[bu] 3 +All of the threads in a CPU cgroup form a task group. +The parent of this task group is the task group of the +corresponding parent cgroup. +.IP \[bu] +If autogrouping is enabled, +then all of the threads that are (implicitly) placed in an autogroup +(i.e., the same session, as created by +.BR setsid (2)) +form a task group. +Each new autogroup is thus a separate task group. +The root task group is the parent of all such autogroups. +.IP \[bu] +If autogrouping is enabled, then the root task group consists of +all processes in the root CPU cgroup that were not +otherwise implicitly placed into a new autogroup. +.IP \[bu] +If autogrouping is disabled, then the root task group consists of +all processes in the root CPU cgroup. +.IP \[bu] +If group scheduling was disabled (i.e., the kernel was configured without +.BR CONFIG_FAIR_GROUP_SCHED ), +then all of the processes on the system are notionally placed +in a single task group. +.P +Under group scheduling, +a thread's nice value has an effect for scheduling decisions +.IR "only relative to other threads in the same task group" . +This has some surprising consequences in terms of the traditional semantics +of the nice value on UNIX systems. +In particular, if autogrouping +is enabled (which is the default in various distributions), then employing +.BR setpriority (2) +or +.BR nice (1) +on a process has an effect only for scheduling relative +to other processes executed in the same session +(typically: the same terminal window). +.P +Conversely, for two processes that are (for example) +the sole CPU-bound processes in different sessions +(e.g., different terminal windows, +each of whose jobs are tied to different autogroups), +.I modifying the nice value of the process in one of the sessions +.I has no effect +in terms of the scheduler's decisions relative to the +process in the other session. +.\" More succinctly: the nice(1) command is in many cases a no-op since +.\" Linux 2.6.38. +.\" +A possibly useful workaround here is to use a command such as +the following to modify the autogroup nice value for +.I all +of the processes in a terminal session: +.P +.in +4n +.EX +$ \fBecho 10 > /proc/self/autogroup\fP +.EE +.in +.SS Real-time features in the mainline Linux kernel +.\" FIXME . Probably this text will need some minor tweaking +.\" ask Carsten Emde about this. +Since Linux 2.6.18, Linux is gradually +becoming equipped with real-time capabilities, +most of which are derived from the former +.I realtime\-preempt +patch set. +Until the patches have been completely merged into the +mainline kernel, +they must be installed to achieve the best real-time performance. +These patches are named: +.P +.in +4n +.EX +patch\-\fIkernelversion\fP\-rt\fIpatchversion\fP +.EE +.in +.P +and can be downloaded from +.UR http://www.kernel.org\:/pub\:/linux\:/kernel\:/projects\:/rt/ +.UE . +.P +Without the patches and prior to their full inclusion into the mainline +kernel, the kernel configuration offers only the three preemption classes +.BR CONFIG_PREEMPT_NONE , +.BR CONFIG_PREEMPT_VOLUNTARY , +and +.B CONFIG_PREEMPT_DESKTOP +which respectively provide no, some, and considerable +reduction of the worst-case scheduling latency. +.P +With the patches applied or after their full inclusion into the mainline +kernel, the additional configuration item +.B CONFIG_PREEMPT_RT +becomes available. +If this is selected, Linux is transformed into a regular +real-time operating system. +The FIFO and RR scheduling policies are then used to run a thread +with true real-time priority and a minimum worst-case scheduling latency. +.SH NOTES +The +.BR cgroups (7) +CPU controller can be used to limit the CPU consumption of +groups of processes. +.P +Originally, Standard Linux was intended as a general-purpose operating +system being able to handle background processes, interactive +applications, and less demanding real-time applications (applications that +need to usually meet timing deadlines). +Although the Linux 2.6 +allowed for kernel preemption and the newly introduced O(1) scheduler +ensures that the time needed to schedule is fixed and deterministic +irrespective of the number of active tasks, true real-time computing +was not possible up to Linux 2.6.17. +.SH SEE ALSO +.ad l +.nh +.BR chcpu (1), +.BR chrt (1), +.BR lscpu (1), +.BR ps (1), +.BR taskset (1), +.BR top (1), +.BR getpriority (2), +.BR mlock (2), +.BR mlockall (2), +.BR munlock (2), +.BR munlockall (2), +.BR nice (2), +.BR sched_get_priority_max (2), +.BR sched_get_priority_min (2), +.BR sched_getaffinity (2), +.BR sched_getparam (2), +.BR sched_getscheduler (2), +.BR sched_rr_get_interval (2), +.BR sched_setaffinity (2), +.BR sched_setparam (2), +.BR sched_setscheduler (2), +.BR sched_yield (2), +.BR setpriority (2), +.BR pthread_getaffinity_np (3), +.BR pthread_getschedparam (3), +.BR pthread_setaffinity_np (3), +.BR sched_getcpu (3), +.BR capabilities (7), +.BR cpuset (7) +.ad +.P +.I Programming for the real world \- POSIX.4 +by Bill O.\& Gallmeister, O'Reilly & Associates, Inc., ISBN 1-56592-074-0. +.P +The Linux kernel source files +.IR \%Documentation/\:scheduler/\:sched\-deadline\:.txt , +.IR \%Documentation/\:scheduler/\:sched\-rt\-group\:.txt , +.IR \%Documentation/\:scheduler/\:sched\-design\-CFS\:.txt , +and +.I \%Documentation/\:scheduler/\:sched\-nice\-design\:.txt |