Diffstat (limited to 'man7/cpuset.7')
-rw-r--r-- | man7/cpuset.7 | 1504 |
1 files changed, 0 insertions, 1504 deletions
diff --git a/man7/cpuset.7 b/man7/cpuset.7 deleted file mode 100644 index 2db2bfc..0000000 --- a/man7/cpuset.7 +++ /dev/null @@ -1,1504 +0,0 @@ -.\" Copyright (c) 2008 Silicon Graphics, Inc. -.\" -.\" Author: Paul Jackson (http://oss.sgi.com/projects/cpusets) -.\" -.\" SPDX-License-Identifier: GPL-2.0-only -.\" -.TH cpuset 7 2023-10-31 "Linux man-pages 6.7" -.SH NAME -cpuset \- confine processes to processor and memory node subsets -.SH DESCRIPTION -The cpuset filesystem is a pseudo-filesystem interface -to the kernel cpuset mechanism, -which is used to control the processor placement -and memory placement of processes. -It is commonly mounted at -.IR /dev/cpuset . -.P -On systems with kernels compiled with built in support for cpusets, -all processes are attached to a cpuset, and cpusets are always present. -If a system supports cpusets, then it will have the entry -.B nodev cpuset -in the file -.IR /proc/filesystems . -By mounting the cpuset filesystem (see the -.B EXAMPLES -section below), -the administrator can configure the cpusets on a system -to control the processor and memory placement of processes -on that system. -By default, if the cpuset configuration -on a system is not modified or if the cpuset filesystem -is not even mounted, then the cpuset mechanism, -though present, has no effect on the system's behavior. -.P -A cpuset defines a list of CPUs and memory nodes. -.P -The CPUs of a system include all the logical processing -units on which a process can execute, including, if present, -multiple processor cores within a package and Hyper-Threads -within a processor core. -Memory nodes include all distinct -banks of main memory; small and SMP systems typically have -just one memory node that contains all the system's main memory, -while NUMA (non-uniform memory access) systems have multiple memory nodes. 
-.P -Cpusets are represented as directories in a hierarchical -pseudo-filesystem, where the top directory in the hierarchy -.RI ( /dev/cpuset ) -represents the entire system (all online CPUs and memory nodes) -and any cpuset that is the child (descendant) of -another parent cpuset contains a subset of that parent's -CPUs and memory nodes. -The directories and files representing cpusets have normal -filesystem permissions. -.P -Every process in the system belongs to exactly one cpuset. -A process is confined to run only on the CPUs in -the cpuset it belongs to, and to allocate memory only -on the memory nodes in that cpuset. -When a process -.BR fork (2)s, -the child process is placed in the same cpuset as its parent. -With sufficient privilege, a process may be moved from one -cpuset to another and the allowed CPUs and memory nodes -of an existing cpuset may be changed. -.P -When the system begins booting, a single cpuset is -defined that includes all CPUs and memory nodes on the -system, and all processes are in that cpuset. -During the boot process, or later during normal system operation, -other cpusets may be created, as subdirectories of this top cpuset, -under the control of the system administrator, -and processes may be placed in these other cpusets. -.P -Cpusets are integrated with the -.BR sched_setaffinity (2) -scheduling affinity mechanism and the -.BR mbind (2) -and -.BR set_mempolicy (2) -memory-placement mechanisms in the kernel. -Neither of these mechanisms lets a process make use -of a CPU or memory node that is not allowed by that process's cpuset. -If changes to a process's cpuset placement conflict with these -other mechanisms, then cpuset placement is enforced -even if it means overriding these other mechanisms. -The kernel accomplishes this overriding by silently -restricting the CPUs and memory nodes requested by -these other mechanisms to those allowed by the -invoking process's cpuset. 
-This can result in these -other calls returning an error, if for example, such -a call ends up requesting an empty set of CPUs or -memory nodes, after that request is restricted to -the invoking process's cpuset. -.P -Typically, a cpuset is used to manage -the CPU and memory-node confinement for a set of -cooperating processes such as a batch scheduler job, and these -other mechanisms are used to manage the placement of -individual processes or memory regions within that set or job. -.SH FILES -Each directory below -.I /dev/cpuset -represents a cpuset and contains a fixed set of pseudo-files -describing the state of that cpuset. -.P -New cpusets are created using the -.BR mkdir (2) -system call or the -.BR mkdir (1) -command. -The properties of a cpuset, such as its flags, allowed -CPUs and memory nodes, and attached processes, are queried and modified -by reading or writing to the appropriate file in that cpuset's directory, -as listed below. -.P -The pseudo-files in each cpuset directory are automatically created when -the cpuset is created, as a result of the -.BR mkdir (2) -invocation. -It is not possible to directly add or remove these pseudo-files. -.P -A cpuset directory that contains no child cpuset directories, -and has no attached processes, can be removed using -.BR rmdir (2) -or -.BR rmdir (1). -It is not necessary, or possible, -to remove the pseudo-files inside the directory before removing it. -.P -The pseudo-files in each cpuset directory are -small text files that may be read and -written using traditional shell utilities such as -.BR cat (1), -and -.BR echo (1), -or from a program by using file I/O library functions or system calls, -such as -.BR open (2), -.BR read (2), -.BR write (2), -and -.BR close (2). -.P -The pseudo-files in a cpuset directory represent internal kernel -state and do not have any persistent image on disk. -Each of these per-cpuset files is listed and described below. 
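The directory and pseudo-file operations described above can be sketched in Python. This is a minimal illustration, not part of the original page: the helper names (create_cpuset, read_cpuset_file, write_cpuset_file, remove_cpuset) are hypothetical, and the pseudo-file name is passed in by the caller because it depends on how the filesystem is mounted. On an ordinary directory the calls simply create regular files, which makes the sketch safe to try outside a cpuset mount.

```python
import os


def create_cpuset(parent_dir, name):
    """Create a child cpuset directory under parent_dir and return its path."""
    path = os.path.join(parent_dir, name)
    # On a mounted cpuset filesystem the kernel populates the pseudo-files.
    os.mkdir(path)
    return path


def read_cpuset_file(cpuset_dir, filename):
    """Return the contents of one pseudo-file with the trailing newline removed."""
    with open(os.path.join(cpuset_dir, filename)) as f:
        return f.read().rstrip("\n")


def write_cpuset_file(cpuset_dir, filename, value):
    """Write one value to a pseudo-file.

    If the underlying write(2) fails, the error surfaces as OSError
    when the buffered data is flushed on close.
    """
    with open(os.path.join(cpuset_dir, filename), "w") as f:
        f.write(f"{value}\n")


def remove_cpuset(cpuset_dir):
    """Remove a cpuset that has no child cpusets and no attached processes."""
    os.rmdir(cpuset_dir)
```

A caller with sufficient privilege might run create_cpuset("/dev/cpuset", "job1") and then write a CPU list and a memory-node list into the new directory before attaching any process to it.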
-.\" ====================== tasks ====================== -.TP -.I tasks -List of the process IDs (PIDs) of the processes in that cpuset. -The list is formatted as a series of ASCII -decimal numbers, each followed by a newline. -A process may be added to a cpuset (automatically removing -it from the cpuset that previously contained it) by writing its -PID to that cpuset's -.I tasks -file (with or without a trailing newline). -.IP -.B Warning: -only one PID may be written to the -.I tasks -file at a time. -If a string is written that contains more -than one PID, only the first one will be used. -.\" =================== notify_on_release =================== -.TP -.I notify_on_release -Flag (0 or 1). -If set (1), that cpuset will receive special handling -after it is released, that is, after all processes cease using -it (i.e., terminate or are moved to a different cpuset) -and all child cpuset directories have been removed. -See the \fBNotify On Release\fR section, below. -.\" ====================== cpus ====================== -.TP -.I cpuset.cpus -List of the physical numbers of the CPUs on which processes -in that cpuset are allowed to execute. -See \fBList Format\fR below for a description of the -format of -.IR cpus . -.IP -The CPUs allowed to a cpuset may be changed by -writing a new list to its -.I cpus -file. -.\" ==================== cpu_exclusive ==================== -.TP -.I cpuset.cpu_exclusive -Flag (0 or 1). -If set (1), the cpuset has exclusive use of -its CPUs (no sibling or cousin cpuset may overlap CPUs). -By default, this is off (0). -Newly created cpusets also initially default this to off (0). -.IP -Two cpusets are -.I sibling -cpusets if they share the same parent cpuset in the -.I /dev/cpuset -hierarchy. -Two cpusets are -.I cousin -cpusets if neither is the ancestor of the other. 
-Regardless of the -.I cpu_exclusive -setting, if one cpuset is the ancestor of another, -and if both of these cpusets have nonempty -.IR cpus , -then their -.I cpus -must overlap, because the -.I cpus -of any cpuset are always a subset of the -.I cpus -of its parent cpuset. -.\" ====================== mems ====================== -.TP -.I cpuset.mems -List of memory nodes on which processes in this cpuset are -allowed to allocate memory. -See \fBList Format\fR below for a description of the -format of -.IR mems . -.\" ==================== mem_exclusive ==================== -.TP -.I cpuset.mem_exclusive -Flag (0 or 1). -If set (1), the cpuset has exclusive use of -its memory nodes (no sibling or cousin may overlap). -Also if set (1), the cpuset is a \fBHardwall\fR cpuset (see below). -By default, this is off (0). -Newly created cpusets also initially default this to off (0). -.IP -Regardless of the -.I mem_exclusive -setting, if one cpuset is the ancestor of another, -then their memory nodes must overlap, because the memory -nodes of any cpuset are always a subset of the memory nodes -of that cpuset's parent cpuset. -.\" ==================== mem_hardwall ==================== -.TP -.IR cpuset.mem_hardwall " (since Linux 2.6.26)" -Flag (0 or 1). -If set (1), the cpuset is a \fBHardwall\fR cpuset (see below). -Unlike \fBmem_exclusive\fR, -there is no constraint on whether cpusets -marked \fBmem_hardwall\fR may have overlapping -memory nodes with sibling or cousin cpusets. -By default, this is off (0). -Newly created cpusets also initially default this to off (0). -.\" ==================== memory_migrate ==================== -.TP -.IR cpuset.memory_migrate " (since Linux 2.6.16)" -Flag (0 or 1). -If set (1), then memory migration is enabled. -By default, this is off (0). -See the \fBMemory Migration\fR section, below. 
-.\" ==================== memory_pressure ==================== -.TP -.IR cpuset.memory_pressure " (since Linux 2.6.16)" -A measure of how much memory pressure the processes in this -cpuset are causing. -See the \fBMemory Pressure\fR section, below. -Unless -.I memory_pressure_enabled -is enabled, this file always has the value zero (0). -This file is read-only. -See the -.B WARNINGS -section, below. -.\" ================= memory_pressure_enabled ================= -.TP -.IR cpuset.memory_pressure_enabled " (since Linux 2.6.16)" -Flag (0 or 1). -This file is present only in the root cpuset, normally -.IR /dev/cpuset . -If set (1), the -.I memory_pressure -calculations are enabled for all cpusets in the system. -By default, this is off (0). -See the -\fBMemory Pressure\fR section, below. -.\" ================== memory_spread_page ================== -.TP -.IR cpuset.memory_spread_page " (since Linux 2.6.17)" -Flag (0 or 1). -If set (1), pages in the kernel page cache -(filesystem buffers) are uniformly spread across the cpuset. -By default, this is off (0) in the top cpuset, -and inherited from the parent cpuset in -newly created cpusets. -See the \fBMemory Spread\fR section, below. -.\" ================== memory_spread_slab ================== -.TP -.IR cpuset.memory_spread_slab " (since Linux 2.6.17)" -Flag (0 or 1). -If set (1), the kernel slab caches -for file I/O (directory and inode structures) are -uniformly spread across the cpuset. -By default, this is off (0) in the top cpuset, -and inherited from the parent cpuset in -newly created cpusets. -See the \fBMemory Spread\fR section, below. -.\" ================== sched_load_balance ================== -.TP -.IR cpuset.sched_load_balance " (since Linux 2.6.24)" -Flag (0 or 1). -If set (1, the default), the kernel will -automatically load balance processes in that cpuset over -the allowed CPUs in that cpuset. 
-If cleared (0) the -kernel will avoid load balancing processes in this cpuset, -.I unless -some other cpuset with overlapping CPUs has its -.I sched_load_balance -flag set. -See \fBScheduler Load Balancing\fR, below, for further details. -.\" ================== sched_relax_domain_level ================== -.TP -.IR cpuset.sched_relax_domain_level " (since Linux 2.6.26)" -Integer, between \-1 and a small positive value. -The -.I sched_relax_domain_level -controls the width of the range of CPUs over which the kernel scheduler -performs immediate rebalancing of runnable tasks across CPUs. -If -.I sched_load_balance -is disabled, then the setting of -.I sched_relax_domain_level -does not matter, as no such load balancing is done. -If -.I sched_load_balance -is enabled, then the higher the value of the -.IR sched_relax_domain_level , -the wider -the range of CPUs over which immediate load balancing is attempted. -See \fBScheduler Relax Domain Level\fR, below, for further details. -.\" ================== proc cpuset ================== -.P -In addition to the above pseudo-files in each directory below -.IR /dev/cpuset , -each process has a pseudo-file, -.IR /proc/ pid /cpuset , -that displays the path of the process's cpuset directory -relative to the root of the cpuset filesystem. -.\" ================== proc status ================== -.P -Also the -.IR /proc/ pid /status -file for each process has four added lines, -displaying the process's -.I Cpus_allowed -(on which CPUs it may be scheduled) and -.I Mems_allowed -(on which memory nodes it may obtain memory), -in the two formats \fBMask Format\fR and \fBList Format\fR (see below) -as shown in the following example: -.P -.in +4n -.EX -Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff -Cpus_allowed_list: 0\-127 -Mems_allowed: ffffffff,ffffffff -Mems_allowed_list: 0\-63 -.EE -.in -.P -The "allowed" fields were added in Linux 2.6.24; -the "allowed_list" fields were added in Linux 2.6.26. 
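As an illustration of how the status lines shown above might be consumed, the following Python sketch extracts the two "allowed_list" fields and expands them into lists of numbers. The function names and the SAMPLE_STATUS text are assumptions made for this example; the parser handles only the comma-separated numbers and ranges that this page describes as List Format.

```python
def parse_list_format(text):
    """Expand a List Format string such as "0-4,9" into a sorted list of ints."""
    numbers = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty set is written as an empty string
        if "-" in part:
            first, last = part.split("-", 1)
            numbers.update(range(int(first), int(last) + 1))
        else:
            numbers.add(int(part))
    return sorted(numbers)


def allowed_from_status(status_text):
    """Return (cpus, mems) parsed from the *_allowed_list lines of a status file."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return (parse_list_format(fields["Cpus_allowed_list"]),
            parse_list_format(fields["Mems_allowed_list"]))


# Sample text modelled on the example above (hypothetical values).
SAMPLE_STATUS = """\
Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list:      0-127
Mems_allowed:   ffffffff,ffffffff
Mems_allowed_list:      0-63
"""

if __name__ == "__main__":
    cpus, mems = allowed_from_status(SAMPLE_STATUS)
    print(len(cpus), "CPUs allowed;", len(mems), "memory nodes allowed")
```

For a live process, the same function could be given the text read from that process's status file under /proc.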
-.\" ================== EXTENDED CAPABILITIES ================== -.SH EXTENDED CAPABILITIES -In addition to controlling which -.I cpus -and -.I mems -a process is allowed to use, cpusets provide the following -extended capabilities. -.\" ================== Exclusive Cpusets ================== -.SS Exclusive cpusets -If a cpuset is marked -.I cpu_exclusive -or -.IR mem_exclusive , -no other cpuset, other than a direct ancestor or descendant, -may share any of the same CPUs or memory nodes. -.P -A cpuset that is -.I mem_exclusive -restricts kernel allocations for -buffer cache pages and other internal kernel data pages -commonly shared by the kernel across -multiple users. -All cpusets, whether -.I mem_exclusive -or not, restrict allocations of memory for user space. -This enables configuring a -system so that several independent jobs can share common kernel data, -while isolating each job's user allocation in -its own cpuset. -To do this, construct a large -.I mem_exclusive -cpuset to hold all the jobs, and construct child, -.RI non- mem_exclusive -cpusets for each individual job. -Only a small amount of kernel memory, -such as requests from interrupt handlers, is allowed to be -placed on memory nodes -outside even a -.I mem_exclusive -cpuset. -.\" ================== Hardwall ================== -.SS Hardwall -A cpuset that has -.I mem_exclusive -or -.I mem_hardwall -set is a -.I hardwall -cpuset. -A -.I hardwall -cpuset restricts kernel allocations for page, buffer, -and other data commonly shared by the kernel across multiple users. -All cpusets, whether -.I hardwall -or not, restrict allocations of memory for user space. -.P -This enables configuring a system so that several independent -jobs can share common kernel data, such as filesystem pages, -while isolating each job's user allocation in its own cpuset. 
-To do this, construct a large -.I hardwall -cpuset to hold -all the jobs, and construct child cpusets for each individual -job which are not -.I hardwall -cpusets. -.P -Only a small amount of kernel memory, such as requests from -interrupt handlers, is allowed to be taken outside even a -.I hardwall -cpuset. -.\" ================== Notify On Release ================== -.SS Notify on release -If the -.I notify_on_release -flag is enabled (1) in a cpuset, -then whenever the last process in the cpuset leaves -(exits or attaches to some other cpuset) -and the last child cpuset of that cpuset is removed, -the kernel will run the command -.IR /sbin/cpuset_release_agent , -supplying the pathname (relative to the mount point of the -cpuset filesystem) of the abandoned cpuset. -This enables automatic removal of abandoned cpusets. -.P -The default value of -.I notify_on_release -in the root cpuset at system boot is disabled (0). -The default value of other cpusets at creation -is the current value of their parent's -.I notify_on_release -setting. -.P -The command -.I /sbin/cpuset_release_agent -is invoked, with the name -.RI ( /dev/cpuset -relative path) -of the to-be-released cpuset in -.IR argv[1] . -.P -The usual contents of the command -.I /sbin/cpuset_release_agent -is simply the shell script: -.P -.in +4n -.EX -#!/bin/sh -rmdir /dev/cpuset/$1 -.EE -.in -.P -As with other flag values below, this flag can -be changed by writing an ASCII -number 0 or 1 (with optional trailing newline) -into the file, to clear or set the flag, respectively. -.\" ================== Memory Pressure ================== -.SS Memory pressure -The -.I memory_pressure -of a cpuset provides a simple per-cpuset running average of -the rate that the processes in a cpuset are attempting to free up in-use -memory on the nodes of the cpuset to satisfy additional memory requests. 
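A monitoring tool might read this value roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the default directory is the usual /dev/cpuset mount point, and the scaling relies on this page's statement that the file holds the recent reclaim rate per second multiplied by 1000.

```python
def reclaims_per_second(raw):
    """Convert the text of a memory_pressure file to reclaims per second.

    The file holds one integer: the recent rate of direct reclaim
    entries per second, times 1000.
    """
    return int(raw.strip()) / 1000.0


def read_memory_pressure(cpuset_dir="/dev/cpuset",
                         filename="cpuset.memory_pressure"):
    """Read and convert the memory pressure of one cpuset (hypothetical helper)."""
    with open(f"{cpuset_dir}/{filename}") as f:
        return reclaims_per_second(f.read())
```

Because the value is already a running average, one read per polling interval is enough; the caller does not need to accumulate samples itself.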
-.P -This enables batch managers that are monitoring jobs running in dedicated -cpusets to efficiently detect what level of memory pressure that job -is causing. -.P -This is useful both on tightly managed systems running a wide mix of -submitted jobs, which may choose to terminate or reprioritize jobs that -are trying to use more memory than allowed on the nodes assigned them, -and with tightly coupled, long-running, massively parallel scientific -computing jobs that will dramatically fail to meet required performance -goals if they start to use more memory than allowed to them. -.P -This mechanism provides a very economical way for the batch manager -to monitor a cpuset for signs of memory pressure. -It's up to the batch manager or other user code to decide -what action to take if it detects signs of memory pressure. -.P -Unless memory pressure calculation is enabled by setting the pseudo-file -.IR /dev/cpuset/cpuset.memory_pressure_enabled , -it is not computed for any cpuset, and reads from any -.I memory_pressure -always return zero, as represented by the ASCII string "0\en". -See the \fBWARNINGS\fR section, below. -.P -A per-cpuset, running average is employed for the following reasons: -.IP \[bu] 3 -Because this meter is per-cpuset rather than per-process or per virtual -memory region, the system load imposed by a batch scheduler monitoring -this metric is sharply reduced on large systems, because a scan of -the tasklist can be avoided on each set of queries. -.IP \[bu] -Because this meter is a running average rather than an accumulating -counter, a batch scheduler can detect memory pressure with a -single read, instead of having to read and accumulate results -for a period of time. 
-.IP \[bu] -Because this meter is per-cpuset rather than per-process, -the batch scheduler can obtain the key information\[em]memory -pressure in a cpuset\[em]with a single read, rather than having to -query and accumulate results over all the (dynamically changing) -set of processes in the cpuset. -.P -The -.I memory_pressure -of a cpuset is calculated using a per-cpuset simple digital filter -that is kept within the kernel. -For each cpuset, this filter tracks -the recent rate at which processes attached to that cpuset enter the -kernel direct reclaim code. -.P -The kernel direct reclaim code is entered whenever a process has to -satisfy a memory page request by first finding some other page to -repurpose, due to lack of any readily available already free pages. -Dirty filesystem pages are repurposed by first writing them -to disk. -Unmodified filesystem buffer pages are repurposed -by simply dropping them, though if that page is needed again, it -will have to be reread from disk. -.P -The -.I cpuset.memory_pressure -file provides an integer number representing the recent (half-life of -10 seconds) rate of entries to the direct reclaim code caused by any -process in the cpuset, in units of reclaims attempted per second, -times 1000. -.\" ================== Memory Spread ================== -.SS Memory spread -There are two Boolean flag files per cpuset that control where the -kernel allocates pages for the filesystem buffers and related -in-kernel data structures. -They are called -.I cpuset.memory_spread_page -and -.IR cpuset.memory_spread_slab . -.P -If the per-cpuset Boolean flag file -.I cpuset.memory_spread_page -is set, then -the kernel will spread the filesystem buffers (page cache) evenly -over all the nodes that the faulting process is allowed to use, instead -of preferring to put those pages on the node where the process is running. 
-.P -If the per-cpuset Boolean flag file -.I cpuset.memory_spread_slab -is set, -then the kernel will spread some filesystem-related slab caches, -such as those for inodes and directory entries, evenly over all the nodes -that the faulting process is allowed to use, instead of preferring to -put those pages on the node where the process is running. -.P -The setting of these flags does not affect the data segment -(see -.BR brk (2)) -or stack segment pages of a process. -.P -By default, both kinds of memory spreading are off and the kernel -prefers to allocate memory pages on the node local to where the -requesting process is running. -If that node is not allowed by the -process's NUMA memory policy or cpuset configuration or if there are -insufficient free memory pages on that node, then the kernel looks -for the nearest node that is allowed and has sufficient free memory. -.P -When new cpusets are created, they inherit the memory spread settings -of their parent. -.P -Setting memory spreading causes allocations for the affected page or -slab caches to ignore the process's NUMA memory policy and be spread -instead. -However, the effect of these changes in memory placement -caused by cpuset-specified memory spreading is hidden from the -.BR mbind (2) -or -.BR set_mempolicy (2) -calls. -These two NUMA memory policy calls always appear to behave as if -no cpuset-specified memory spreading is in effect, even if it is. -If cpuset memory spreading is subsequently turned off, the NUMA -memory policy most recently specified by these calls is automatically -reapplied. -.P -Both -.I cpuset.memory_spread_page -and -.I cpuset.memory_spread_slab -are Boolean flag files. -By default, they contain "0", meaning that the feature is off -for that cpuset. -If a "1" is written to that file, that turns the named feature on. -.P -Cpuset-specified memory spreading behaves similarly to what is known -(in other contexts) as round-robin or interleave memory placement. 
-.P -Cpuset-specified memory spreading can provide substantial performance -improvements for jobs that: -.IP \[bu] 3 -need to place thread-local data on -memory nodes close to the CPUs which are running the threads that most -frequently access that data; but also -.IP \[bu] -need to access large filesystem data sets that must be spread -across the several nodes in the job's cpuset in order to fit. -.P -Without this policy, -the memory allocation across the nodes in the job's cpuset -can become very uneven, -especially for jobs that might have just a single -thread initializing or reading in the data set. -.\" ================== Memory Migration ================== -.SS Memory migration -Normally, under the default setting (disabled) of -.IR cpuset.memory_migrate , -once a page is allocated (given a physical page -of main memory), then that page stays on whatever node it -was allocated on, so long as it remains allocated, even if the -cpuset's memory-placement policy -.I mems -subsequently changes. -.P -When memory migration is enabled in a cpuset, if the -.I mems -setting of the cpuset is changed, then any memory page in use by any -process in the cpuset that is on a memory node that is no longer -allowed will be migrated to a memory node that is allowed. -.P -Furthermore, if a process is moved into a cpuset with -.I memory_migrate -enabled, any memory pages it uses that were on memory nodes allowed -in its previous cpuset, but which are not allowed in its new cpuset, -will be migrated to a memory node allowed in the new cpuset. -.P -The relative placement of a migrated page within -the cpuset is preserved during these migration operations if possible. -For example, -if the page was on the second valid node of the prior cpuset, -then the page will be placed on the second valid node of the new cpuset, -if possible. 
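The "same relative position" rule can be modelled with a few lines of Python. This is only a sketch of the behavior described in the text, not the kernel's implementation: the function name remap_node is hypothetical, and wrapping around when the new cpuset has fewer nodes than the old one is an assumption made here to give "if possible" a concrete fallback.

```python
def remap_node(node, old_mems, new_mems):
    """Pick the destination node for a page currently on `node`.

    A page on a node that is still allowed stays where it is.
    Otherwise the page goes to the node holding the same ordinal
    position in the new cpuset as `node` held in the old one.
    """
    old_sorted = sorted(old_mems)
    new_sorted = sorted(new_mems)
    if node in new_sorted:
        return node  # still allowed, nothing to migrate
    position = old_sorted.index(node)
    # Assumption: wrap around if the new cpuset has fewer nodes.
    return new_sorted[position % len(new_sorted)]


if __name__ == "__main__":
    # The second node of the old cpuset maps to the second node of the new one.
    print(remap_node(1, [0, 1, 2, 3], [4, 5, 6, 7]))
```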
-.\" ================== Scheduler Load Balancing ================== -.SS Scheduler load balancing -The kernel scheduler automatically load balances processes. -If one CPU is underutilized, -the kernel will look for processes on other more -overloaded CPUs and move those processes to the underutilized CPU, -within the constraints of such placement mechanisms as cpusets and -.BR sched_setaffinity (2). -.P -The algorithmic cost of load balancing and its impact on key shared -kernel data structures such as the process list increases more than -linearly with the number of CPUs being balanced. -For example, it -costs more to load balance across one large set of CPUs than it does -to balance across two smaller sets of CPUs, each of half the size -of the larger set. -(The precise relationship between the number of CPUs being balanced -and the cost of load balancing depends -on implementation details of the kernel process scheduler, which is -subject to change over time, as improved kernel scheduler algorithms -are implemented.) -.P -The per-cpuset flag -.I sched_load_balance -provides a mechanism to suppress this automatic scheduler load -balancing in cases where it is not needed and suppressing it would have -worthwhile performance benefits. -.P -By default, load balancing is done across all CPUs, except those -marked isolated using the kernel boot time "isolcpus=" argument. -(See \fBScheduler Relax Domain Level\fR, below, to change this default.) -.P -This default load balancing across all CPUs is not well suited to -the following two situations: -.IP \[bu] 3 -On large systems, load balancing across many CPUs is expensive. -If the system is managed using cpusets to place independent jobs -on separate sets of CPUs, full load balancing is unnecessary. -.IP \[bu] -Systems supporting real-time on some CPUs need to minimize -system overhead on those CPUs, including avoiding process load -balancing if that is not needed. 
-.P -When the per-cpuset flag -.I sched_load_balance -is enabled (the default setting), -it requests load balancing across -all the CPUs in that cpuset's allowed CPUs, -ensuring that load balancing can move a process (not otherwise pinned, -as by -.BR sched_setaffinity (2)) -from any CPU in that cpuset to any other. -.P -When the per-cpuset flag -.I sched_load_balance -is disabled, then the -scheduler will avoid load balancing across the CPUs in that cpuset, -\fIexcept\fR in so far as is necessary because some overlapping cpuset -has -.I sched_load_balance -enabled. -.P -So, for example, if the top cpuset has the flag -.I sched_load_balance -enabled, then the scheduler will load balance across all -CPUs, and the setting of the -.I sched_load_balance -flag in other cpusets has no effect, -as we're already fully load balancing. -.P -Therefore in the above two situations, the flag -.I sched_load_balance -should be disabled in the top cpuset, and only some of the smaller, -child cpusets would have this flag enabled. -.P -When doing this, you don't usually want to leave any unpinned processes in -the top cpuset that might use nontrivial amounts of CPU, as such processes -may be artificially constrained to some subset of CPUs, depending on -the particulars of this flag setting in descendant cpusets. -Even if such a process could use spare CPU cycles in some other CPUs, -the kernel scheduler might not consider the possibility of -load balancing that process to the underused CPU. -.P -Of course, processes pinned to a particular CPU can be left in a cpuset -that disables -.I sched_load_balance -as those processes aren't going anywhere else anyway. -.\" ================== Scheduler Relax Domain Level ================== -.SS Scheduler relax domain level -The kernel scheduler performs immediate load balancing whenever -a CPU becomes free or another task becomes runnable. -This load -balancing works to ensure that as many CPUs as possible are usefully -employed running tasks. 
-The kernel also performs periodic load -balancing off the software clock described in -.BR time (7). -The setting of -.I sched_relax_domain_level -applies only to immediate load balancing. -Regardless of the -.I sched_relax_domain_level -setting, periodic load balancing is attempted over all CPUs -(unless disabled by turning off -.IR sched_load_balance .) -In any case, of course, tasks will be scheduled to run only on -CPUs allowed by their cpuset, as modified by -.BR sched_setaffinity (2) -system calls. -.P -On small systems, such as those with just a few CPUs, immediate load -balancing is useful to improve system interactivity and to minimize -wasteful idle CPU cycles. -But on large systems, attempting immediate -load balancing across a large number of CPUs can be more costly than -it is worth, depending on the particular performance characteristics -of the job mix and the hardware. -.P -The exact meaning of the small integer values of -.I sched_relax_domain_level -will depend on internal -implementation details of the kernel scheduler code and on the -non-uniform architecture of the hardware. -Both of these will evolve -over time and vary by system architecture and kernel version. -.P -As of this writing, when this capability was introduced in Linux -2.6.26, on certain popular architectures, the positive values of -.I sched_relax_domain_level -have the following meanings. -.P -.PD 0 -.TP -.B 1 -Perform immediate load balancing across Hyper-Thread -siblings on the same core. -.TP -.B 2 -Perform immediate load balancing across other cores in the same package. -.TP -.B 3 -Perform immediate load balancing across other CPUs -on the same node or blade. -.TP -.B 4 -Perform immediate load balancing across several -(implementation detail) nodes [On NUMA systems]. -.TP -.B 5 -Perform immediate load balancing across all CPUs -in the system [On NUMA systems]. 
-.PD -.P -The -.I sched_relax_domain_level -value of zero (0) always means -don't perform immediate load balancing, -hence that load balancing is done only periodically, -not immediately when a CPU becomes available or another task becomes -runnable. -.P -The -.I sched_relax_domain_level -value of minus one (\-1) -always means use the system default value. -The system default value can vary by architecture and kernel version. -This system default value can be changed by kernel -boot-time "relax_domain_level=" argument. -.P -In the case of multiple overlapping cpusets which have conflicting -.I sched_relax_domain_level -values, then the highest such value -applies to all CPUs in any of the overlapping cpusets. -In such cases, -.B \-1 -is the lowest value, -overridden by any other value, -and -.B 0 -is the next lowest value. -.SH FORMATS -The following formats are used to represent sets of -CPUs and memory nodes. -.\" ================== Mask Format ================== -.SS Mask format -The \fBMask Format\fR is used to represent CPU and memory-node bit masks -in the -.IR /proc/ pid /status -file. -.P -This format displays each 32-bit -word in hexadecimal (using ASCII characters "0" - "9" and "a" - "f"); -words are filled with leading zeros, if required. -For masks longer than one word, a comma separator is used between words. -Words are displayed in big-endian -order, which has the most significant bit first. -The hex digits within a word are also in big-endian order. -.P -The number of 32-bit words displayed is the minimum number needed to -display all bits of the bit mask, based on the size of the bit mask. 
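The Mask Format rules just described (32-bit hexadecimal words, zero-filled, comma-separated, most significant word first) can be expressed as a short Python sketch. The function names are hypothetical, and the encoder follows this page's description rather than any particular kernel's output.

```python
def parse_mask_format(text):
    """Return the sorted bit numbers that are set in a Mask Format string."""
    value = 0
    for word in text.strip().split(","):
        # Words arrive most significant first, so shift before merging.
        value = (value << 32) | int(word, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def format_mask(bits, nbits):
    """Render bit numbers as Mask Format for a mask that is nbits wide."""
    value = 0
    for bit in bits:
        value |= 1 << bit
    nwords = max(1, (nbits + 31) // 32)  # minimum words covering nbits
    words = [(value >> (32 * i)) & 0xFFFFFFFF for i in range(nwords)]
    return ",".join(f"{word:08x}" for word in reversed(words))


if __name__ == "__main__":
    print(parse_mask_format("000000ff,00000000"))
    print(format_mask([0, 1, 2, 4, 8, 16, 32, 64], 96))
```

Treating the whole mask as one arbitrary-precision integer keeps the word order handling in a single place: the parser shifts left by 32 for each word, and the encoder slices words back out from the low end.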
-.P -Examples of the \fBMask Format\fR: -.P -.in +4n -.EX -00000001 # just bit 0 set -40000000,00000000,00000000 # just bit 94 set -00000001,00000000,00000000 # just bit 64 set -000000ff,00000000 # bits 32\-39 set -00000000,000e3862 # 1,5,6,11\-13,17\-19 set -.EE -.in -.P -A mask with bits 0, 1, 2, 4, 8, 16, 32, and 64 set displays as: -.P -.in +4n -.EX -00000001,00000001,00010117 -.EE -.in -.P -The first "1" is for bit 64, the -second for bit 32, the third for bit 16, the fourth for bit 8, the -fifth for bit 4, and the "7" is for bits 2, 1, and 0. -.\" ================== List Format ================== -.SS List format -The \fBList Format\fR for -.I cpus -and -.I mems -is a comma-separated list of CPU or memory-node -numbers and ranges of numbers, in ASCII decimal. -.P -Examples of the \fBList Format\fR: -.P -.in +4n -.EX -0\-4,9 # bits 0, 1, 2, 3, 4, and 9 set -0\-2,7,12\-14 # bits 0, 1, 2, 7, 12, 13, and 14 set -.EE -.in -.\" ================== RULES ================== -.SH RULES -The following rules apply to each cpuset: -.IP \[bu] 3 -Its CPUs and memory nodes must be a (possibly equal) -subset of its parent's. -.IP \[bu] -It can be marked -.I cpu_exclusive -only if its parent is. -.IP \[bu] -It can be marked -.I mem_exclusive -only if its parent is. -.IP \[bu] -If it is -.IR cpu_exclusive , -its CPUs may not overlap any sibling. -.IP \[bu] -If it is -.IR mem_exclusive , -its memory nodes may not overlap any sibling. -.\" ================== PERMISSIONS ================== -.SH PERMISSIONS -The permissions of a cpuset are determined by the permissions -of the directories and pseudo-files in the cpuset filesystem, -normally mounted at -.IR /dev/cpuset . -.P -For instance, a process can put itself in some other cpuset (than -its current one) if it can write the -.I tasks -file for that cpuset. -This requires execute permission on the encompassing directories -and write permission on the -.I tasks -file. 
-.P -An additional constraint is applied to requests to place some -other process in a cpuset. -One process may not attach another to -a cpuset unless it would have permission to send that process -a signal (see -.BR kill (2)). -.P -A process may create a child cpuset if it can access and write the -parent cpuset directory. -It can modify the CPUs or memory nodes -in a cpuset if it can access that cpuset's directory (execute -permission on each of the parent directories) and write the -corresponding -.I cpus -or -.I mems -file. -.P -There is one minor difference between the manner in which these -permissions are evaluated and the manner in which normal filesystem -operation permissions are evaluated. -The kernel interprets -relative pathnames starting at a process's current working directory. -Even if one is operating on a cpuset file, relative pathnames -are interpreted relative to the process's current working directory, -not relative to the process's current cpuset. -The only ways that -cpuset paths relative to a process's current cpuset can be used are -if either the process's current working directory is its cpuset -(it first did a -.B cd -or -.BR chdir (2) -to its cpuset directory beneath -.IR /dev/cpuset , -which is a bit unusual) -or if some user code converts the relative cpuset path to a -full filesystem path. -.P -In theory, this means that user code should specify cpusets -using absolute pathnames, which requires knowing the mount point of -the cpuset filesystem (usually, but not necessarily, -.IR /dev/cpuset ). -In practice, all user-level code that this author is aware of -simply assumes that if the cpuset filesystem is mounted, then -it is mounted at -.IR /dev/cpuset . -Furthermore, it is common practice for carefully written -user code to verify the presence of the pseudo-file -.I /dev/cpuset/tasks -in order to verify that the cpuset pseudo-filesystem -is currently mounted.
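The mount check and the path conversion described above might be sketched in shell as follows. Both function names are hypothetical, and /dev/cpuset is only the conventional default mount point; pass a different one as an argument if the filesystem is mounted elsewhere.

```shell
# cpuset_mounted [MOUNTPOINT]
# Succeed if MOUNTPOINT/tasks exists (default mount point: /dev/cpuset).
# Hypothetical helper used for illustration.
cpuset_mounted() {
    [ -e "${1:-/dev/cpuset}/tasks" ]
}

# cpuset_abs_path CPUSET [MOUNTPOINT]
# Turn a cpuset path such as "/Charlie" (the form shown by
# /proc/self/cpuset) into a full filesystem path. Hypothetical helper.
cpuset_abs_path() {
    printf '%s%s\n' "${2:-/dev/cpuset}" "$1"
}

if cpuset_mounted; then
    # Directory of the cpuset that the current shell belongs to.
    cpuset_abs_path "$(cat /proc/self/cpuset)"
else
    echo "cpuset filesystem not found at /dev/cpuset" >&2
fi
```

For example, `cpuset_abs_path /Charlie` prints `/dev/cpuset/Charlie`.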
-.\" ================== WARNINGS ================== -.SH WARNINGS -.SS Enabling memory_pressure -By default, the per-cpuset file -.I cpuset.memory_pressure -always contains zero (0). -Unless this feature is enabled by writing "1" to the pseudo-file -.IR /dev/cpuset/cpuset.memory_pressure_enabled , -the kernel does -not compute per-cpuset -.IR memory_pressure . -.SS Using the echo command -When using the -.B echo -command at the shell prompt to change the values of cpuset files, -beware that the built-in -.B echo -command in some shells does not display an error message if the -.BR write (2) -system call fails. -.\" Gack! csh(1)'s echo does this -For example, if the command: -.P -.in +4n -.EX -echo 19 > cpuset.mems -.EE -.in -.P -failed because memory node 19 was not allowed (perhaps -the current system does not have a memory node 19), then the -.B echo -command might not display any error. -It is better to use the -.B /bin/echo -external command to change cpuset file settings, as this -command will display -.BR write (2) -errors, as in the example: -.P -.in +4n -.EX -/bin/echo 19 > cpuset.mems -/bin/echo: write error: Invalid argument -.EE -.in -.\" ================== EXCEPTIONS ================== -.SH EXCEPTIONS -.SS Memory placement -Not all allocations of system memory are constrained by cpusets, -for the following reasons. -.P -If hot-plug functionality is used to remove all the CPUs that are -currently assigned to a cpuset, then the kernel will automatically -update the -.I cpus_allowed -of all processes attached to CPUs in that cpuset -to allow all CPUs. -When memory hot-plug functionality for removing -memory nodes is available, a similar exception is expected to apply -there as well. -In general, the kernel prefers to violate cpuset placement, -rather than starving a process that has had all its allowed CPUs or -memory nodes taken offline. 
-User code should reconfigure cpusets to refer only to online CPUs -and memory nodes when using hot-plug to add or remove such resources. -.P -A few kernel-critical, internal memory-allocation requests, marked -GFP_ATOMIC, must be satisfied immediately. -The kernel may drop some -requests or malfunction if one of these allocations fails. -If such a request cannot be satisfied within the current process's cpuset, -then the kernel relaxes the cpuset and looks for memory wherever it can find it. -It's better to violate the cpuset than stress the kernel. -.P -Allocations of memory requested by kernel drivers while processing -an interrupt lack any relevant process context, and are not confined -by cpusets. -.SS Renaming cpusets -You can use the -.BR rename (2) -system call to rename cpusets. -Only simple renaming is supported; that is, changing the name of a cpuset -directory is permitted, but moving a directory into -a different directory is not permitted. -.\" ================== ERRORS ================== -.SH ERRORS -The Linux kernel implementation of cpusets sets -.I errno -to specify the reason for a failed system call affecting cpusets. -.P -The possible -.I errno -settings and their meanings when set on -a failed cpuset call are as listed below. -.TP -.B E2BIG -Attempted a -.BR write (2) -on a special cpuset file -with a length larger than some kernel-determined upper -limit on the length of such writes. -.TP -.B EACCES -Attempted to -.BR write (2) -the process ID (PID) of a process to a cpuset -.I tasks -file when one lacks permission to move that process. -.TP -.B EACCES -Attempted to add, using -.BR write (2), -a CPU or memory node to a cpuset, when that CPU or memory node was -not already in its parent. -.TP -.B EACCES -Attempted to set, using -.BR write (2), -.I cpuset.cpu_exclusive -or -.I cpuset.mem_exclusive -on a cpuset whose parent lacks the same setting. -.TP -.B EACCES -Attempted to -.BR write (2) -a -.I cpuset.memory_pressure -file.
-.TP -.B EACCES -Attempted to create a file in a cpuset directory. -.TP -.B EBUSY -Attempted to remove, using -.BR rmdir (2), -a cpuset with attached processes. -.TP -.B EBUSY -Attempted to remove, using -.BR rmdir (2), -a cpuset with child cpusets. -.TP -.B EBUSY -Attempted to remove -a CPU or memory node from a cpuset -that is also in a child of that cpuset. -.TP -.B EEXIST -Attempted to create, using -.BR mkdir (2), -a cpuset that already exists. -.TP -.B EEXIST -Attempted to -.BR rename (2) -a cpuset to a name that already exists. -.TP -.B EFAULT -Attempted to -.BR read (2) -or -.BR write (2) -a cpuset file using -a buffer that is outside the writing process's accessible address space. -.TP -.B EINVAL -Attempted to change a cpuset, using -.BR write (2), -in a way that would violate a -.I cpu_exclusive -or -.I mem_exclusive -attribute of that cpuset or any of its siblings. -.TP -.B EINVAL -Attempted to -.BR write (2) -an empty -.I cpuset.cpus -or -.I cpuset.mems -list to a cpuset which has attached processes or child cpusets. -.TP -.B EINVAL -Attempted to -.BR write (2) -a -.I cpuset.cpus -or -.I cpuset.mems -list which included a range with the second number smaller than -the first number. -.TP -.B EINVAL -Attempted to -.BR write (2) -a -.I cpuset.cpus -or -.I cpuset.mems -list which included an invalid character in the string. -.TP -.B EINVAL -Attempted to -.BR write (2) -a list to a -.I cpuset.cpus -file that did not include any online CPUs. -.TP -.B EINVAL -Attempted to -.BR write (2) -a list to a -.I cpuset.mems -file that did not include any online memory nodes. -.TP -.B EINVAL -Attempted to -.BR write (2) -a list to a -.I cpuset.mems -file that included a node that held no memory. -.TP -.B EIO -Attempted to -.BR write (2) -a string to a cpuset -.I tasks -file that -does not begin with an ASCII decimal integer. -.TP -.B EIO -Attempted to -.BR rename (2) -a cpuset into a different directory.
-.TP -.B ENAMETOOLONG -Attempted to -.BR read (2) -a -.IR /proc/ pid /cpuset -file for a cpuset path that is longer than the kernel page size. -.TP -.B ENAMETOOLONG -Attempted to create, using -.BR mkdir (2), -a cpuset whose base directory name is longer than 255 characters. -.TP -.B ENAMETOOLONG -Attempted to create, using -.BR mkdir (2), -a cpuset whose full pathname, -including the mount point (typically "/dev/cpuset/") prefix, -is longer than 4095 characters. -.TP -.B ENODEV -The cpuset was removed by another process at the same time as a -.BR write (2) -was attempted on one of the pseudo-files in the cpuset directory. -.TP -.B ENOENT -Attempted to create, using -.BR mkdir (2), -a cpuset in a parent cpuset that doesn't exist. -.TP -.B ENOENT -Attempted to -.BR access (2) -or -.BR open (2) -a nonexistent file in a cpuset directory. -.TP -.B ENOMEM -Insufficient memory is available within the kernel; can occur -on a variety of system calls affecting cpusets, but only if the -system is extremely short of memory. -.TP -.B ENOSPC -Attempted to -.BR write (2) -the process ID (PID) -of a process to a cpuset -.I tasks -file when the cpuset had an empty -.I cpuset.cpus -or empty -.I cpuset.mems -setting. -.TP -.B ENOSPC -Attempted to -.BR write (2) -an empty -.I cpuset.cpus -or -.I cpuset.mems -setting to a cpuset that -has tasks attached. -.TP -.B ENOTDIR -Attempted to -.BR rename (2) -a nonexistent cpuset. -.TP -.B EPERM -Attempted to remove a file from a cpuset directory. -.TP -.B ERANGE -Specified a -.I cpuset.cpus -or -.I cpuset.mems -list to the kernel which included a number too large for the kernel -to set in its bit masks. -.TP -.B ESRCH -Attempted to -.BR write (2) -the process ID (PID) of a nonexistent process to a cpuset -.I tasks -file. -.\" ================== VERSIONS ================== -.SH VERSIONS -Cpusets appeared in Linux 2.6.12. 
-.\" ================== NOTES ================== -.SH NOTES -Despite its name, the -.I pid -parameter is actually a thread ID, -and each thread in a threaded group can be attached to a different -cpuset. -The value returned from a call to -.BR gettid (2) -can be passed in the argument -.IR pid . -.\" ================== BUGS ================== -.SH BUGS -.I cpuset.memory_pressure -cpuset files can be opened -for writing, creation, or truncation, but then the -.BR write (2) -fails with -.I errno -set to -.BR EACCES , -and the creation and truncation options on -.BR open (2) -have no effect. -.\" ================== EXAMPLES ================== -.SH EXAMPLES -The following examples demonstrate querying and setting cpuset -options using shell commands. -.SS Creating and attaching to a cpuset. -To create a new cpuset and attach the current command shell to it, -the steps are: -.P -.PD 0 -.IP (1) 5 -mkdir /dev/cpuset (if not already done) -.IP (2) -mount \-t cpuset none /dev/cpuset (if not already done) -.IP (3) -Create the new cpuset using -.BR mkdir (1). -.IP (4) -Assign CPUs and memory nodes to the new cpuset. -.IP (5) -Attach the shell to the new cpuset. -.PD -.P -For example, the following sequence of commands will set up a cpuset -named "Charlie", containing just CPUs 2 and 3, and memory node 1, -and then attach the current shell to that cpuset. -.P -.in +4n -.EX -.RB "$" " mkdir /dev/cpuset" -.RB "$" " mount \-t cpuset cpuset /dev/cpuset" -.RB "$" " cd /dev/cpuset" -.RB "$" " mkdir Charlie" -.RB "$" " cd Charlie" -.RB "$" " /bin/echo 2\-3 > cpuset.cpus" -.RB "$" " /bin/echo 1 > cpuset.mems" -.RB "$" " /bin/echo $$ > tasks" -# The current shell is now running in cpuset Charlie -# The next line should display \[aq]/Charlie\[aq] -.RB "$" " cat /proc/self/cpuset" -.EE -.in -.\" -.SS Migrating a job to different memory nodes. 
-To migrate a job (the set of processes attached to a cpuset) -to different CPUs and memory nodes in the system, including moving -the memory pages currently allocated to that job, -perform the following steps. -.P -.PD 0 -.IP (1) 5 -Let's say we want to move the job in cpuset -.I alpha -(CPUs 4\[en]7 and memory nodes 2\[en]3) to a new cpuset -.I beta -(CPUs 16\[en]19 and memory nodes 8\[en]9). -.IP (2) -First create the new cpuset -.IR beta . -.IP (3) -Then allow CPUs 16\[en]19 and memory nodes 8\[en]9 in -.IR beta . -.IP (4) -Then enable -.I memory_migration -in -.IR beta . -.IP (5) -Then move each process from -.I alpha -to -.IR beta . -.PD -.P -The following sequence of commands accomplishes this. -.P -.in +4n -.EX -.RB "$" " cd /dev/cpuset" -.RB "$" " mkdir beta" -.RB "$" " cd beta" -.RB "$" " /bin/echo 16\-19 > cpuset.cpus" -.RB "$" " /bin/echo 8\-9 > cpuset.mems" -.RB "$" " /bin/echo 1 > cpuset.memory_migrate" -.RB "$" " while read i; do /bin/echo $i; done < ../alpha/tasks > tasks" -.EE -.in -.P -The above should move any processes in -.I alpha -to -.IR beta , -and any memory held by these processes on memory nodes 2\[en]3 to memory -nodes 8\[en]9, respectively. -.P -Notice that the last step of the above sequence did not do: -.P -.in +4n -.EX -.RB "$" " cp ../alpha/tasks tasks" -.EE -.in -.P -The -.I while -loop, rather than the seemingly easier use of the -.BR cp (1) -command, was necessary because -only one process PID at a time may be written to the -.I tasks -file. 
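The loop above can be wrapped in a small function that also counts the PIDs that could not be moved. This is a sketch only: move_tasks is a hypothetical name, and it relies on the external /bin/echo returning a nonzero status when its write(2) fails, as described in the WARNINGS section. A write can fail, for example, if a process exits between the read and the write.

```shell
# move_tasks SRC_DIR DST_DIR
# Write each PID listed in SRC_DIR/tasks to DST_DIR/tasks, one write(2)
# per PID. move_tasks is a hypothetical helper used for illustration.
move_tasks() {
    src=$1
    dst=$2
    failed=0
    if ! [ -r "$src/tasks" ] || ! [ -w "$dst/tasks" ]; then
        echo "move_tasks: cannot access tasks files" >&2
        return 2
    fi
    while read -r pid; do
        # Each /bin/echo is a separate process, hence a separate write(2).
        /bin/echo "$pid" || failed=$((failed + 1))
    done < "$src/tasks" > "$dst/tasks"
    if [ "$failed" -gt 0 ]; then
        echo "move_tasks: $failed PID(s) could not be moved" >&2
        return 1
    fi
}

# Example invocation, using the cpuset names from the steps above:
#   move_tasks /dev/cpuset/alpha /dev/cpuset/beta
```

The function returns 0 only if every write succeeded, so a caller can decide whether to retry or to inspect the source cpuset for stragglers.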
-.P -The same effect (writing one PID at a time) as the -.I while -loop can be accomplished more efficiently, in fewer keystrokes and in -syntax that works on any shell, but alas more obscurely, by using the -.B \-u -(unbuffered) option of -.BR sed (1): -.P -.in +4n -.EX -.RB "$" " sed \-un p < ../alpha/tasks > tasks" -.EE -.in -.\" ================== SEE ALSO ================== -.SH SEE ALSO -.BR taskset (1), -.BR get_mempolicy (2), -.BR getcpu (2), -.BR mbind (2), -.BR sched_getaffinity (2), -.BR sched_setaffinity (2), -.BR sched_setscheduler (2), -.BR set_mempolicy (2), -.BR CPU_SET (3), -.BR proc (5), -.BR cgroups (7), -.BR numa (7), -.BR sched (7), -.BR migratepages (8), -.BR numactl (8) -.P -.I Documentation/admin\-guide/cgroup\-v1/cpusets.rst -in the Linux kernel source tree -.\" commit 45ce80fb6b6f9594d1396d44dd7e7c02d596fef8 -(or -.I Documentation/cgroup\-v1/cpusets.txt -before Linux 4.18, and -.I Documentation/cpusets.txt -before Linux 2.6.29)