diff options
Diffstat (limited to 'man7/cpuset.7')
-rw-r--r-- | man7/cpuset.7 | 1504 |
1 files changed, 1504 insertions, 0 deletions
diff --git a/man7/cpuset.7 b/man7/cpuset.7 new file mode 100644 index 0000000..800e4da --- /dev/null +++ b/man7/cpuset.7 @@ -0,0 +1,1504 @@ +.\" Copyright (c) 2008 Silicon Graphics, Inc. +.\" +.\" Author: Paul Jackson (http://oss.sgi.com/projects/cpusets) +.\" +.\" SPDX-License-Identifier: GPL-2.0-only +.\" +.TH cpuset 7 2023-07-18 "Linux man-pages 6.05.01" +.SH NAME +cpuset \- confine processes to processor and memory node subsets +.SH DESCRIPTION +The cpuset filesystem is a pseudo-filesystem interface +to the kernel cpuset mechanism, +which is used to control the processor placement +and memory placement of processes. +It is commonly mounted at +.IR /dev/cpuset . +.PP +On systems with kernels compiled with built in support for cpusets, +all processes are attached to a cpuset, and cpusets are always present. +If a system supports cpusets, then it will have the entry +.B nodev cpuset +in the file +.IR /proc/filesystems . +By mounting the cpuset filesystem (see the +.B EXAMPLES +section below), +the administrator can configure the cpusets on a system +to control the processor and memory placement of processes +on that system. +By default, if the cpuset configuration +on a system is not modified or if the cpuset filesystem +is not even mounted, then the cpuset mechanism, +though present, has no effect on the system's behavior. +.PP +A cpuset defines a list of CPUs and memory nodes. +.PP +The CPUs of a system include all the logical processing +units on which a process can execute, including, if present, +multiple processor cores within a package and Hyper-Threads +within a processor core. +Memory nodes include all distinct +banks of main memory; small and SMP systems typically have +just one memory node that contains all the system's main memory, +while NUMA (non-uniform memory access) systems have multiple memory nodes. +.PP +Cpusets are represented as directories in a hierarchical +pseudo-filesystem, where the top directory in the hierarchy +.RI ( /dev/cpuset ) +represents the entire system (all online CPUs and memory nodes) +and any cpuset that is the child (descendant) of +another parent cpuset contains a subset of that parent's +CPUs and memory nodes. +The directories and files representing cpusets have normal +filesystem permissions. +.PP +Every process in the system belongs to exactly one cpuset. +A process is confined to run only on the CPUs in +the cpuset it belongs to, and to allocate memory only +on the memory nodes in that cpuset. +When a process +.BR fork (2)s, +the child process is placed in the same cpuset as its parent. +With sufficient privilege, a process may be moved from one +cpuset to another and the allowed CPUs and memory nodes +of an existing cpuset may be changed. +.PP +When the system begins booting, a single cpuset is +defined that includes all CPUs and memory nodes on the +system, and all processes are in that cpuset. +During the boot process, or later during normal system operation, +other cpusets may be created, as subdirectories of this top cpuset, +under the control of the system administrator, +and processes may be placed in these other cpusets. +.PP +Cpusets are integrated with the +.BR sched_setaffinity (2) +scheduling affinity mechanism and the +.BR mbind (2) +and +.BR set_mempolicy (2) +memory-placement mechanisms in the kernel. +Neither of these mechanisms let a process make use +of a CPU or memory node that is not allowed by that process's cpuset. +If changes to a process's cpuset placement conflict with these +other mechanisms, then cpuset placement is enforced +even if it means overriding these other mechanisms. +The kernel accomplishes this overriding by silently +restricting the CPUs and memory nodes requested by +these other mechanisms to those allowed by the +invoking process's cpuset. +This can result in these +other calls returning an error, if for example, such +a call ends up requesting an empty set of CPUs or +memory nodes, after that request is restricted to +the invoking process's cpuset. +.PP +Typically, a cpuset is used to manage +the CPU and memory-node confinement for a set of +cooperating processes such as a batch scheduler job, and these +other mechanisms are used to manage the placement of +individual processes or memory regions within that set or job. +.SH FILES +Each directory below +.I /dev/cpuset +represents a cpuset and contains a fixed set of pseudo-files +describing the state of that cpuset. +.PP +New cpusets are created using the +.BR mkdir (2) +system call or the +.BR mkdir (1) +command. +The properties of a cpuset, such as its flags, allowed +CPUs and memory nodes, and attached processes, are queried and modified +by reading or writing to the appropriate file in that cpuset's directory, +as listed below. +.PP +The pseudo-files in each cpuset directory are automatically created when +the cpuset is created, as a result of the +.BR mkdir (2) +invocation. +It is not possible to directly add or remove these pseudo-files. +.PP +A cpuset directory that contains no child cpuset directories, +and has no attached processes, can be removed using +.BR rmdir (2) +or +.BR rmdir (1). +It is not necessary, or possible, +to remove the pseudo-files inside the directory before removing it. +.PP +The pseudo-files in each cpuset directory are +small text files that may be read and +written using traditional shell utilities such as +.BR cat (1), +and +.BR echo (1), +or from a program by using file I/O library functions or system calls, +such as +.BR open (2), +.BR read (2), +.BR write (2), +and +.BR close (2). +.PP +The pseudo-files in a cpuset directory represent internal kernel +state and do not have any persistent image on disk. +Each of these per-cpuset files is listed and described below. +.\" ====================== tasks ====================== +.TP +.I tasks +List of the process IDs (PIDs) of the processes in that cpuset. +The list is formatted as a series of ASCII +decimal numbers, each followed by a newline. +A process may be added to a cpuset (automatically removing +it from the cpuset that previously contained it) by writing its +PID to that cpuset's +.I tasks +file (with or without a trailing newline). +.IP +.B Warning: +only one PID may be written to the +.I tasks +file at a time. +If a string is written that contains more +than one PID, only the first one will be used. +.\" =================== notify_on_release =================== +.TP +.I notify_on_release +Flag (0 or 1). +If set (1), that cpuset will receive special handling +after it is released, that is, after all processes cease using +it (i.e., terminate or are moved to a different cpuset) +and all child cpuset directories have been removed. +See the \fBNotify On Release\fR section, below. +.\" ====================== cpus ====================== +.TP +.I cpuset.cpus +List of the physical numbers of the CPUs on which processes +in that cpuset are allowed to execute. +See \fBList Format\fR below for a description of the +format of +.IR cpus . +.IP +The CPUs allowed to a cpuset may be changed by +writing a new list to its +.I cpus +file. +.\" ==================== cpu_exclusive ==================== +.TP +.I cpuset.cpu_exclusive +Flag (0 or 1). +If set (1), the cpuset has exclusive use of +its CPUs (no sibling or cousin cpuset may overlap CPUs). +By default, this is off (0). +Newly created cpusets also initially default this to off (0). +.IP +Two cpusets are +.I sibling +cpusets if they share the same parent cpuset in the +.I /dev/cpuset +hierarchy. +Two cpusets are +.I cousin +cpusets if neither is the ancestor of the other. +Regardless of the +.I cpu_exclusive +setting, if one cpuset is the ancestor of another, +and if both of these cpusets have nonempty +.IR cpus , +then their +.I cpus +must overlap, because the +.I cpus +of any cpuset are always a subset of the +.I cpus +of its parent cpuset. +.\" ====================== mems ====================== +.TP +.I cpuset.mems +List of memory nodes on which processes in this cpuset are +allowed to allocate memory. +See \fBList Format\fR below for a description of the +format of +.IR mems . +.\" ==================== mem_exclusive ==================== +.TP +.I cpuset.mem_exclusive +Flag (0 or 1). +If set (1), the cpuset has exclusive use of +its memory nodes (no sibling or cousin may overlap). +Also if set (1), the cpuset is a \fBHardwall\fR cpuset (see below). +By default, this is off (0). +Newly created cpusets also initially default this to off (0). +.IP +Regardless of the +.I mem_exclusive +setting, if one cpuset is the ancestor of another, +then their memory nodes must overlap, because the memory +nodes of any cpuset are always a subset of the memory nodes +of that cpuset's parent cpuset. +.\" ==================== mem_hardwall ==================== +.TP +.IR cpuset.mem_hardwall " (since Linux 2.6.26)" +Flag (0 or 1). +If set (1), the cpuset is a \fBHardwall\fR cpuset (see below). +Unlike \fBmem_exclusive\fR, +there is no constraint on whether cpusets +marked \fBmem_hardwall\fR may have overlapping +memory nodes with sibling or cousin cpusets. +By default, this is off (0). +Newly created cpusets also initially default this to off (0). +.\" ==================== memory_migrate ==================== +.TP +.IR cpuset.memory_migrate " (since Linux 2.6.16)" +Flag (0 or 1). +If set (1), then memory migration is enabled. +By default, this is off (0). +See the \fBMemory Migration\fR section, below. +.\" ==================== memory_pressure ==================== +.TP +.IR cpuset.memory_pressure " (since Linux 2.6.16)" +A measure of how much memory pressure the processes in this +cpuset are causing. +See the \fBMemory Pressure\fR section, below. +Unless +.I memory_pressure_enabled +is enabled, always has value zero (0). +This file is read-only. +See the +.B WARNINGS +section, below. +.\" ================= memory_pressure_enabled ================= +.TP +.IR cpuset.memory_pressure_enabled " (since Linux 2.6.16)" +Flag (0 or 1). +This file is present only in the root cpuset, normally +.IR /dev/cpuset . +If set (1), the +.I memory_pressure +calculations are enabled for all cpusets in the system. +By default, this is off (0). +See the +\fBMemory Pressure\fR section, below. +.\" ================== memory_spread_page ================== +.TP +.IR cpuset.memory_spread_page " (since Linux 2.6.17)" +Flag (0 or 1). +If set (1), pages in the kernel page cache +(filesystem buffers) are uniformly spread across the cpuset. +By default, this is off (0) in the top cpuset, +and inherited from the parent cpuset in +newly created cpusets. +See the \fBMemory Spread\fR section, below. +.\" ================== memory_spread_slab ================== +.TP +.IR cpuset.memory_spread_slab " (since Linux 2.6.17)" +Flag (0 or 1). +If set (1), the kernel slab caches +for file I/O (directory and inode structures) are +uniformly spread across the cpuset. +By default, is off (0) in the top cpuset, +and inherited from the parent cpuset in +newly created cpusets. +See the \fBMemory Spread\fR section, below. +.\" ================== sched_load_balance ================== +.TP +.IR cpuset.sched_load_balance " (since Linux 2.6.24)" +Flag (0 or 1). +If set (1, the default) the kernel will +automatically load balance processes in that cpuset over +the allowed CPUs in that cpuset. +If cleared (0) the +kernel will avoid load balancing processes in this cpuset, +.I unless +some other cpuset with overlapping CPUs has its +.I sched_load_balance +flag set. +See \fBScheduler Load Balancing\fR, below, for further details. +.\" ================== sched_relax_domain_level ================== +.TP +.IR cpuset.sched_relax_domain_level " (since Linux 2.6.26)" +Integer, between \-1 and a small positive value. +The +.I sched_relax_domain_level +controls the width of the range of CPUs over which the kernel scheduler +performs immediate rebalancing of runnable tasks across CPUs. +If +.I sched_load_balance +is disabled, then the setting of +.I sched_relax_domain_level +does not matter, as no such load balancing is done. +If +.I sched_load_balance +is enabled, then the higher the value of the +.IR sched_relax_domain_level , +the wider +the range of CPUs over which immediate load balancing is attempted. +See \fBScheduler Relax Domain Level\fR, below, for further details. +.\" ================== proc cpuset ================== +.PP +In addition to the above pseudo-files in each directory below +.IR /dev/cpuset , +each process has a pseudo-file, +.IR /proc/ pid /cpuset , +that displays the path of the process's cpuset directory +relative to the root of the cpuset filesystem. +.\" ================== proc status ================== +.PP +Also the +.IR /proc/ pid /status +file for each process has four added lines, +displaying the process's +.I Cpus_allowed +(on which CPUs it may be scheduled) and +.I Mems_allowed +(on which memory nodes it may obtain memory), +in the two formats \fBMask Format\fR and \fBList Format\fR (see below) +as shown in the following example: +.PP +.in +4n +.EX +Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff +Cpus_allowed_list: 0\-127 +Mems_allowed: ffffffff,ffffffff +Mems_allowed_list: 0\-63 +.EE +.in +.PP +The "allowed" fields were added in Linux 2.6.24; +the "allowed_list" fields were added in Linux 2.6.26. +.\" ================== EXTENDED CAPABILITIES ================== +.SH EXTENDED CAPABILITIES +In addition to controlling which +.I cpus +and +.I mems +a process is allowed to use, cpusets provide the following +extended capabilities. +.\" ================== Exclusive Cpusets ================== +.SS Exclusive cpusets +If a cpuset is marked +.I cpu_exclusive +or +.IR mem_exclusive , +no other cpuset, other than a direct ancestor or descendant, +may share any of the same CPUs or memory nodes. +.PP +A cpuset that is +.I mem_exclusive +restricts kernel allocations for +buffer cache pages and other internal kernel data pages +commonly shared by the kernel across +multiple users. +All cpusets, whether +.I mem_exclusive +or not, restrict allocations of memory for user space. +This enables configuring a +system so that several independent jobs can share common kernel data, +while isolating each job's user allocation in +its own cpuset. +To do this, construct a large +.I mem_exclusive +cpuset to hold all the jobs, and construct child, +.RI non- mem_exclusive +cpusets for each individual job. +Only a small amount of kernel memory, +such as requests from interrupt handlers, is allowed to be +placed on memory nodes +outside even a +.I mem_exclusive +cpuset. +.\" ================== Hardwall ================== +.SS Hardwall +A cpuset that has +.I mem_exclusive +or +.I mem_hardwall +set is a +.I hardwall +cpuset. +A +.I hardwall +cpuset restricts kernel allocations for page, buffer, +and other data commonly shared by the kernel across multiple users. +All cpusets, whether +.I hardwall +or not, restrict allocations of memory for user space. +.PP +This enables configuring a system so that several independent +jobs can share common kernel data, such as filesystem pages, +while isolating each job's user allocation in its own cpuset. +To do this, construct a large +.I hardwall +cpuset to hold +all the jobs, and construct child cpusets for each individual +job which are not +.I hardwall +cpusets. +.PP +Only a small amount of kernel memory, such as requests from +interrupt handlers, is allowed to be taken outside even a +.I hardwall +cpuset. +.\" ================== Notify On Release ================== +.SS Notify on release +If the +.I notify_on_release +flag is enabled (1) in a cpuset, +then whenever the last process in the cpuset leaves +(exits or attaches to some other cpuset) +and the last child cpuset of that cpuset is removed, +the kernel will run the command +.IR /sbin/cpuset_release_agent , +supplying the pathname (relative to the mount point of the +cpuset filesystem) of the abandoned cpuset. +This enables automatic removal of abandoned cpusets. +.PP +The default value of +.I notify_on_release +in the root cpuset at system boot is disabled (0). +The default value of other cpusets at creation +is the current value of their parent's +.I notify_on_release +setting. +.PP +The command +.I /sbin/cpuset_release_agent +is invoked, with the name +.RI ( /dev/cpuset +relative path) +of the to-be-released cpuset in +.IR argv[1] . +.PP +The usual contents of the command +.I /sbin/cpuset_release_agent +is simply the shell script: +.PP +.in +4n +.EX +#!/bin/sh +rmdir /dev/cpuset/$1 +.EE +.in +.PP +As with other flag values below, this flag can +be changed by writing an ASCII +number 0 or 1 (with optional trailing newline) +into the file, to clear or set the flag, respectively. +.\" ================== Memory Pressure ================== +.SS Memory pressure +The +.I memory_pressure +of a cpuset provides a simple per-cpuset running average of +the rate that the processes in a cpuset are attempting to free up in-use +memory on the nodes of the cpuset to satisfy additional memory requests. +.PP +This enables batch managers that are monitoring jobs running in dedicated +cpusets to efficiently detect what level of memory pressure that job +is causing. +.PP +This is useful both on tightly managed systems running a wide mix of +submitted jobs, which may choose to terminate or reprioritize jobs that +are trying to use more memory than allowed on the nodes assigned them, +and with tightly coupled, long-running, massively parallel scientific +computing jobs that will dramatically fail to meet required performance +goals if they start to use more memory than allowed to them. +.PP +This mechanism provides a very economical way for the batch manager +to monitor a cpuset for signs of memory pressure. +It's up to the batch manager or other user code to decide +what action to take if it detects signs of memory pressure. +.PP +Unless memory pressure calculation is enabled by setting the pseudo-file +.IR /dev/cpuset/cpuset.memory_pressure_enabled , +it is not computed for any cpuset, and reads from any +.I memory_pressure +always return zero, as represented by the ASCII string "0\en". +See the \fBWARNINGS\fR section, below. +.PP +A per-cpuset, running average is employed for the following reasons: +.IP \[bu] 3 +Because this meter is per-cpuset rather than per-process or per virtual +memory region, the system load imposed by a batch scheduler monitoring +this metric is sharply reduced on large systems, because a scan of +the tasklist can be avoided on each set of queries. +.IP \[bu] +Because this meter is a running average rather than an accumulating +counter, a batch scheduler can detect memory pressure with a +single read, instead of having to read and accumulate results +for a period of time. +.IP \[bu] +Because this meter is per-cpuset rather than per-process, +the batch scheduler can obtain the key information\[em]memory +pressure in a cpuset\[em]with a single read, rather than having to +query and accumulate results over all the (dynamically changing) +set of processes in the cpuset. +.PP +The +.I memory_pressure +of a cpuset is calculated using a per-cpuset simple digital filter +that is kept within the kernel. +For each cpuset, this filter tracks +the recent rate at which processes attached to that cpuset enter the +kernel direct reclaim code. +.PP +The kernel direct reclaim code is entered whenever a process has to +satisfy a memory page request by first finding some other page to +repurpose, due to lack of any readily available already free pages. +Dirty filesystem pages are repurposed by first writing them +to disk. +Unmodified filesystem buffer pages are repurposed +by simply dropping them, though if that page is needed again, it +will have to be reread from disk. +.PP +The +.I cpuset.memory_pressure +file provides an integer number representing the recent (half-life of +10 seconds) rate of entries to the direct reclaim code caused by any +process in the cpuset, in units of reclaims attempted per second, +times 1000. +.\" ================== Memory Spread ================== +.SS Memory spread +There are two Boolean flag files per cpuset that control where the +kernel allocates pages for the filesystem buffers and related +in-kernel data structures. +They are called +.I cpuset.memory_spread_page +and +.IR cpuset.memory_spread_slab . +.PP +If the per-cpuset Boolean flag file +.I cpuset.memory_spread_page +is set, then +the kernel will spread the filesystem buffers (page cache) evenly +over all the nodes that the faulting process is allowed to use, instead +of preferring to put those pages on the node where the process is running. +.PP +If the per-cpuset Boolean flag file +.I cpuset.memory_spread_slab +is set, +then the kernel will spread some filesystem-related slab caches, +such as those for inodes and directory entries, evenly over all the nodes +that the faulting process is allowed to use, instead of preferring to +put those pages on the node where the process is running. +.PP +The setting of these flags does not affect the data segment +(see +.BR brk (2)) +or stack segment pages of a process. +.PP +By default, both kinds of memory spreading are off and the kernel +prefers to allocate memory pages on the node local to where the +requesting process is running. +If that node is not allowed by the +process's NUMA memory policy or cpuset configuration or if there are +insufficient free memory pages on that node, then the kernel looks +for the nearest node that is allowed and has sufficient free memory. +.PP +When new cpusets are created, they inherit the memory spread settings +of their parent. +.PP +Setting memory spreading causes allocations for the affected page or +slab caches to ignore the process's NUMA memory policy and be spread +instead. +However, the effect of these changes in memory placement +caused by cpuset-specified memory spreading is hidden from the +.BR mbind (2) +or +.BR set_mempolicy (2) +calls. +These two NUMA memory policy calls always appear to behave as if +no cpuset-specified memory spreading is in effect, even if it is. +If cpuset memory spreading is subsequently turned off, the NUMA +memory policy most recently specified by these calls is automatically +reapplied. +.PP +Both +.I cpuset.memory_spread_page +and +.I cpuset.memory_spread_slab +are Boolean flag files. +By default, they contain "0", meaning that the feature is off +for that cpuset. +If a "1" is written to that file, that turns the named feature on. +.PP +Cpuset-specified memory spreading behaves similarly to what is known +(in other contexts) as round-robin or interleave memory placement. +.PP +Cpuset-specified memory spreading can provide substantial performance +improvements for jobs that: +.IP \[bu] 3 +need to place thread-local data on +memory nodes close to the CPUs which are running the threads that most +frequently access that data; but also +.IP \[bu] +need to access large filesystem data sets that must to be spread +across the several nodes in the job's cpuset in order to fit. +.PP +Without this policy, +the memory allocation across the nodes in the job's cpuset +can become very uneven, +especially for jobs that might have just a single +thread initializing or reading in the data set. +.\" ================== Memory Migration ================== +.SS Memory migration +Normally, under the default setting (disabled) of +.IR cpuset.memory_migrate , +once a page is allocated (given a physical page +of main memory), then that page stays on whatever node it +was allocated, so long as it remains allocated, even if the +cpuset's memory-placement policy +.I mems +subsequently changes. +.PP +When memory migration is enabled in a cpuset, if the +.I mems +setting of the cpuset is changed, then any memory page in use by any +process in the cpuset that is on a memory node that is no longer +allowed will be migrated to a memory node that is allowed. +.PP +Furthermore, if a process is moved into a cpuset with +.I memory_migrate +enabled, any memory pages it uses that were on memory nodes allowed +in its previous cpuset, but which are not allowed in its new cpuset, +will be migrated to a memory node allowed in the new cpuset. +.PP +The relative placement of a migrated page within +the cpuset is preserved during these migration operations if possible. +For example, +if the page was on the second valid node of the prior cpuset, +then the page will be placed on the second valid node of the new cpuset, +if possible. +.\" ================== Scheduler Load Balancing ================== +.SS Scheduler load balancing +The kernel scheduler automatically load balances processes. +If one CPU is underutilized, +the kernel will look for processes on other more +overloaded CPUs and move those processes to the underutilized CPU, +within the constraints of such placement mechanisms as cpusets and +.BR sched_setaffinity (2). +.PP +The algorithmic cost of load balancing and its impact on key shared +kernel data structures such as the process list increases more than +linearly with the number of CPUs being balanced. +For example, it +costs more to load balance across one large set of CPUs than it does +to balance across two smaller sets of CPUs, each of half the size +of the larger set. +(The precise relationship between the number of CPUs being balanced +and the cost of load balancing depends +on implementation details of the kernel process scheduler, which is +subject to change over time, as improved kernel scheduler algorithms +are implemented.) +.PP +The per-cpuset flag +.I sched_load_balance +provides a mechanism to suppress this automatic scheduler load +balancing in cases where it is not needed and suppressing it would have +worthwhile performance benefits. +.PP +By default, load balancing is done across all CPUs, except those +marked isolated using the kernel boot time "isolcpus=" argument. +(See \fBScheduler Relax Domain Level\fR, below, to change this default.) +.PP +This default load balancing across all CPUs is not well suited to +the following two situations: +.IP \[bu] 3 +On large systems, load balancing across many CPUs is expensive. +If the system is managed using cpusets to place independent jobs +on separate sets of CPUs, full load balancing is unnecessary. +.IP \[bu] +Systems supporting real-time on some CPUs need to minimize +system overhead on those CPUs, including avoiding process load +balancing if that is not needed. +.PP +When the per-cpuset flag +.I sched_load_balance +is enabled (the default setting), +it requests load balancing across +all the CPUs in that cpuset's allowed CPUs, +ensuring that load balancing can move a process (not otherwise pinned, +as by +.BR sched_setaffinity (2)) +from any CPU in that cpuset to any other. +.PP +When the per-cpuset flag +.I sched_load_balance +is disabled, then the +scheduler will avoid load balancing across the CPUs in that cpuset, +\fIexcept\fR in so far as is necessary because some overlapping cpuset +has +.I sched_load_balance +enabled. +.PP +So, for example, if the top cpuset has the flag +.I sched_load_balance +enabled, then the scheduler will load balance across all +CPUs, and the setting of the +.I sched_load_balance +flag in other cpusets has no effect, +as we're already fully load balancing. +.PP +Therefore in the above two situations, the flag +.I sched_load_balance +should be disabled in the top cpuset, and only some of the smaller, +child cpusets would have this flag enabled. +.PP +When doing this, you don't usually want to leave any unpinned processes in +the top cpuset that might use nontrivial amounts of CPU, as such processes +may be artificially constrained to some subset of CPUs, depending on +the particulars of this flag setting in descendant cpusets. +Even if such a process could use spare CPU cycles in some other CPUs, +the kernel scheduler might not consider the possibility of +load balancing that process to the underused CPU. +.PP +Of course, processes pinned to a particular CPU can be left in a cpuset +that disables +.I sched_load_balance +as those processes aren't going anywhere else anyway. +.\" ================== Scheduler Relax Domain Level ================== +.SS Scheduler relax domain level +The kernel scheduler performs immediate load balancing whenever +a CPU becomes free or another task becomes runnable. +This load +balancing works to ensure that as many CPUs as possible are usefully +employed running tasks. +The kernel also performs periodic load +balancing off the software clock described in +.BR time (7). +The setting of +.I sched_relax_domain_level +applies only to immediate load balancing. +Regardless of the +.I sched_relax_domain_level +setting, periodic load balancing is attempted over all CPUs +(unless disabled by turning off +.IR sched_load_balance .) +In any case, of course, tasks will be scheduled to run only on +CPUs allowed by their cpuset, as modified by +.BR sched_setaffinity (2) +system calls. +.PP +On small systems, such as those with just a few CPUs, immediate load +balancing is useful to improve system interactivity and to minimize +wasteful idle CPU cycles. +But on large systems, attempting immediate +load balancing across a large number of CPUs can be more costly than +it is worth, depending on the particular performance characteristics +of the job mix and the hardware. +.PP +The exact meaning of the small integer values of +.I sched_relax_domain_level +will depend on internal +implementation details of the kernel scheduler code and on the +non-uniform architecture of the hardware. +Both of these will evolve +over time and vary by system architecture and kernel version. +.PP +As of this writing, when this capability was introduced in Linux +2.6.26, on certain popular architectures, the positive values of +.I sched_relax_domain_level +have the following meanings. +.PP +.PD 0 +.TP +.B 1 +Perform immediate load balancing across Hyper-Thread +siblings on the same core. +.TP +.B 2 +Perform immediate load balancing across other cores in the same package. +.TP +.B 3 +Perform immediate load balancing across other CPUs +on the same node or blade. +.TP +.B 4 +Perform immediate load balancing across over several +(implementation detail) nodes [On NUMA systems]. +.TP +.B 5 +Perform immediate load balancing across over all CPUs +in system [On NUMA systems]. +.PD +.PP +The +.I sched_relax_domain_level +value of zero (0) always means +don't perform immediate load balancing, +hence that load balancing is done only periodically, +not immediately when a CPU becomes available or another task becomes +runnable. +.PP +The +.I sched_relax_domain_level +value of minus one (\-1) +always means use the system default value. +The system default value can vary by architecture and kernel version. +This system default value can be changed by kernel +boot-time "relax_domain_level=" argument. +.PP +In the case of multiple overlapping cpusets which have conflicting +.I sched_relax_domain_level +values, then the highest such value +applies to all CPUs in any of the overlapping cpusets. +In such cases, +.B \-1 +is the lowest value, +overridden by any other value, +and +.B 0 +is the next lowest value. +.SH FORMATS +The following formats are used to represent sets of +CPUs and memory nodes. +.\" ================== Mask Format ================== +.SS Mask format +The \fBMask Format\fR is used to represent CPU and memory-node bit masks +in the +.IR /proc/ pid /status +file. +.PP +This format displays each 32-bit +word in hexadecimal (using ASCII characters "0" - "9" and "a" - "f"); +words are filled with leading zeros, if required. +For masks longer than one word, a comma separator is used between words. +Words are displayed in big-endian +order, which has the most significant bit first. +The hex digits within a word are also in big-endian order. +.PP +The number of 32-bit words displayed is the minimum number needed to +display all bits of the bit mask, based on the size of the bit mask. +.PP +Examples of the \fBMask Format\fR: +.PP +.in +4n +.EX +00000001 # just bit 0 set +40000000,00000000,00000000 # just bit 94 set +00000001,00000000,00000000 # just bit 64 set +000000ff,00000000 # bits 32\-39 set +00000000,000e3862 # 1,5,6,11\-13,17\-19 set +.EE +.in +.PP +A mask with bits 0, 1, 2, 4, 8, 16, 32, and 64 set displays as: +.PP +.in +4n +.EX +00000001,00000001,00010117 +.EE +.in +.PP +The first "1" is for bit 64, the +second for bit 32, the third for bit 16, the fourth for bit 8, the +fifth for bit 4, and the "7" is for bits 2, 1, and 0. +.\" ================== List Format ================== +.SS List format +The \fBList Format\fR for +.I cpus +and +.I mems +is a comma-separated list of CPU or memory-node +numbers and ranges of numbers, in ASCII decimal. +.PP +Examples of the \fBList Format\fR: +.PP +.in +4n +.EX +0\-4,9 # bits 0, 1, 2, 3, 4, and 9 set +0\-2,7,12\-14 # bits 0, 1, 2, 7, 12, 13, and 14 set +.EE +.in +.\" ================== RULES ================== +.SH RULES +The following rules apply to each cpuset: +.IP \[bu] 3 +Its CPUs and memory nodes must be a (possibly equal) +subset of its parent's. +.IP \[bu] +It can be marked +.I cpu_exclusive +only if its parent is. +.IP \[bu] +It can be marked +.I mem_exclusive +only if its parent is. +.IP \[bu] +If it is +.IR cpu_exclusive , +its CPUs may not overlap any sibling. +.IP \[bu] +If it is +.IR mem_exclusive , +its memory nodes may not overlap any sibling. +.\" ================== PERMISSIONS ================== +.SH PERMISSIONS +The permissions of a cpuset are determined by the permissions +of the directories and pseudo-files in the cpuset filesystem, +normally mounted at +.IR /dev/cpuset . +.PP +For instance, a process can put itself in some other cpuset (than +its current one) if it can write the +.I tasks +file for that cpuset. +This requires execute permission on the encompassing directories +and write permission on the +.I tasks +file. +.PP +An additional constraint is applied to requests to place some +other process in a cpuset. +One process may not attach another to +a cpuset unless it would have permission to send that process +a signal (see +.BR kill (2)). +.PP +A process may create a child cpuset if it can access and write the +parent cpuset directory. +It can modify the CPUs or memory nodes +in a cpuset if it can access that cpuset's directory (execute +permissions on the each of the parent directories) and write the +corresponding +.I cpus +or +.I mems +file. +.PP +There is one minor difference between the manner in which these +permissions are evaluated and the manner in which normal filesystem +operation permissions are evaluated. +The kernel interprets +relative pathnames starting at a process's current working directory. +Even if one is operating on a cpuset file, relative pathnames +are interpreted relative to the process's current working directory, +not relative to the process's current cpuset. +The only ways that +cpuset paths relative to a process's current cpuset can be used are +if either the process's current working directory is its cpuset +(it first did a +.B cd +or +.BR chdir (2) +to its cpuset directory beneath +.IR /dev/cpuset , +which is a bit unusual) +or if some user code converts the relative cpuset path to a +full filesystem path. +.PP +In theory, this means that user code should specify cpusets +using absolute pathnames, which requires knowing the mount point of +the cpuset filesystem (usually, but not necessarily, +.IR /dev/cpuset ). +In practice, all user level code that this author is aware of +simply assumes that if the cpuset filesystem is mounted, then +it is mounted at +.IR /dev/cpuset . +Furthermore, it is common practice for carefully written +user code to verify the presence of the pseudo-file +.I /dev/cpuset/tasks +in order to verify that the cpuset pseudo-filesystem +is currently mounted. +.\" ================== WARNINGS ================== +.SH WARNINGS +.SS Enabling memory_pressure +By default, the per-cpuset file +.I cpuset.memory_pressure +always contains zero (0). +Unless this feature is enabled by writing "1" to the pseudo-file +.IR /dev/cpuset/cpuset.memory_pressure_enabled , +the kernel does +not compute per-cpuset +.IR memory_pressure . +.SS Using the echo command +When using the +.B echo +command at the shell prompt to change the values of cpuset files, +beware that the built-in +.B echo +command in some shells does not display an error message if the +.BR write (2) +system call fails. +.\" Gack! csh(1)'s echo does this +For example, if the command: +.PP +.in +4n +.EX +echo 19 > cpuset.mems +.EE +.in +.PP +failed because memory node 19 was not allowed (perhaps +the current system does not have a memory node 19), then the +.B echo +command might not display any error. +It is better to use the +.B /bin/echo +external command to change cpuset file settings, as this +command will display +.BR write (2) +errors, as in the example: +.PP +.in +4n +.EX +/bin/echo 19 > cpuset.mems +/bin/echo: write error: Invalid argument +.EE +.in +.\" ================== EXCEPTIONS ================== +.SH EXCEPTIONS +.SS Memory placement +Not all allocations of system memory are constrained by cpusets, +for the following reasons. +.PP +If hot-plug functionality is used to remove all the CPUs that are +currently assigned to a cpuset, then the kernel will automatically +update the +.I cpus_allowed +of all processes attached to CPUs in that cpuset +to allow all CPUs. +When memory hot-plug functionality for removing +memory nodes is available, a similar exception is expected to apply +there as well. +In general, the kernel prefers to violate cpuset placement, +rather than starving a process that has had all its allowed CPUs or +memory nodes taken offline. +User code should reconfigure cpusets to refer only to online CPUs +and memory nodes when using hot-plug to add or remove such resources. +.PP +A few kernel-critical, internal memory-allocation requests, marked +GFP_ATOMIC, must be satisfied immediately. +The kernel may drop some +request or malfunction if one of these allocations fail. +If such a request cannot be satisfied within the current process's cpuset, +then we relax the cpuset, and look for memory anywhere we can find it. +It's better to violate the cpuset than stress the kernel. +.PP +Allocations of memory requested by kernel drivers while processing +an interrupt lack any relevant process context, and are not confined +by cpusets. +.SS Renaming cpusets +You can use the +.BR rename (2) +system call to rename cpusets. +Only simple renaming is supported; that is, changing the name of a cpuset +directory is permitted, but moving a directory into +a different directory is not permitted. +.\" ================== ERRORS ================== +.SH ERRORS +The Linux kernel implementation of cpusets sets +.I errno +to specify the reason for a failed system call affecting cpusets. +.PP +The possible +.I errno +settings and their meaning when set on +a failed cpuset call are as listed below. +.TP +.B E2BIG +Attempted a +.BR write (2) +on a special cpuset file +with a length larger than some kernel-determined upper +limit on the length of such writes. +.TP +.B EACCES +Attempted to +.BR write (2) +the process ID (PID) of a process to a cpuset +.I tasks +file when one lacks permission to move that process. +.TP +.B EACCES +Attempted to add, using +.BR write (2), +a CPU or memory node to a cpuset, when that CPU or memory node was +not already in its parent. +.TP +.B EACCES +Attempted to set, using +.BR write (2), +.I cpuset.cpu_exclusive +or +.I cpuset.mem_exclusive +on a cpuset whose parent lacks the same setting. +.TP +.B EACCES +Attempted to +.BR write (2) +a +.I cpuset.memory_pressure +file. +.TP +.B EACCES +Attempted to create a file in a cpuset directory. +.TP +.B EBUSY +Attempted to remove, using +.BR rmdir (2), +a cpuset with attached processes. +.TP +.B EBUSY +Attempted to remove, using +.BR rmdir (2), +a cpuset with child cpusets. +.TP +.B EBUSY +Attempted to remove +a CPU or memory node from a cpuset +that is also in a child of that cpuset. +.TP +.B EEXIST +Attempted to create, using +.BR mkdir (2), +a cpuset that already exists. +.TP +.B EEXIST +Attempted to +.BR rename (2) +a cpuset to a name that already exists. +.TP +.B EFAULT +Attempted to +.BR read (2) +or +.BR write (2) +a cpuset file using +a buffer that is outside the writing processes accessible address space. +.TP +.B EINVAL +Attempted to change a cpuset, using +.BR write (2), +in a way that would violate a +.I cpu_exclusive +or +.I mem_exclusive +attribute of that cpuset or any of its siblings. +.TP +.B EINVAL +Attempted to +.BR write (2) +an empty +.I cpuset.cpus +or +.I cpuset.mems +list to a cpuset which has attached processes or child cpusets. +.TP +.B EINVAL +Attempted to +.BR write (2) +a +.I cpuset.cpus +or +.I cpuset.mems +list which included a range with the second number smaller than +the first number. +.TP +.B EINVAL +Attempted to +.BR write (2) +a +.I cpuset.cpus +or +.I cpuset.mems +list which included an invalid character in the string. +.TP +.B EINVAL +Attempted to +.BR write (2) +a list to a +.I cpuset.cpus +file that did not include any online CPUs. +.TP +.B EINVAL +Attempted to +.BR write (2) +a list to a +.I cpuset.mems +file that did not include any online memory nodes. +.TP +.B EINVAL +Attempted to +.BR write (2) +a list to a +.I cpuset.mems +file that included a node that held no memory. +.TP +.B EIO +Attempted to +.BR write (2) +a string to a cpuset +.I tasks +file that +does not begin with an ASCII decimal integer. +.TP +.B EIO +Attempted to +.BR rename (2) +a cpuset into a different directory. +.TP +.B ENAMETOOLONG +Attempted to +.BR read (2) +a +.IR /proc/ pid /cpuset +file for a cpuset path that is longer than the kernel page size. +.TP +.B ENAMETOOLONG +Attempted to create, using +.BR mkdir (2), +a cpuset whose base directory name is longer than 255 characters. +.TP +.B ENAMETOOLONG +Attempted to create, using +.BR mkdir (2), +a cpuset whose full pathname, +including the mount point (typically "/dev/cpuset/") prefix, +is longer than 4095 characters. +.TP +.B ENODEV +The cpuset was removed by another process at the same time as a +.BR write (2) +was attempted on one of the pseudo-files in the cpuset directory. +.TP +.B ENOENT +Attempted to create, using +.BR mkdir (2), +a cpuset in a parent cpuset that doesn't exist. +.TP +.B ENOENT +Attempted to +.BR access (2) +or +.BR open (2) +a nonexistent file in a cpuset directory. +.TP +.B ENOMEM +Insufficient memory is available within the kernel; can occur +on a variety of system calls affecting cpusets, but only if the +system is extremely short of memory. +.TP +.B ENOSPC +Attempted to +.BR write (2) +the process ID (PID) +of a process to a cpuset +.I tasks +file when the cpuset had an empty +.I cpuset.cpus +or empty +.I cpuset.mems +setting. +.TP +.B ENOSPC +Attempted to +.BR write (2) +an empty +.I cpuset.cpus +or +.I cpuset.mems +setting to a cpuset that +has tasks attached. +.TP +.B ENOTDIR +Attempted to +.BR rename (2) +a nonexistent cpuset. +.TP +.B EPERM +Attempted to remove a file from a cpuset directory. +.TP +.B ERANGE +Specified a +.I cpuset.cpus +or +.I cpuset.mems +list to the kernel which included a number too large for the kernel +to set in its bit masks. +.TP +.B ESRCH +Attempted to +.BR write (2) +the process ID (PID) of a nonexistent process to a cpuset +.I tasks +file. +.\" ================== VERSIONS ================== +.SH VERSIONS +Cpusets appeared in Linux 2.6.12. +.\" ================== NOTES ================== +.SH NOTES +Despite its name, the +.I pid +parameter is actually a thread ID, +and each thread in a threaded group can be attached to a different +cpuset. +The value returned from a call to +.BR gettid (2) +can be passed in the argument +.IR pid . +.\" ================== BUGS ================== +.SH BUGS +.I cpuset.memory_pressure +cpuset files can be opened +for writing, creation, or truncation, but then the +.BR write (2) +fails with +.I errno +set to +.BR EACCES , +and the creation and truncation options on +.BR open (2) +have no effect. +.\" ================== EXAMPLES ================== +.SH EXAMPLES +The following examples demonstrate querying and setting cpuset +options using shell commands. +.SS Creating and attaching to a cpuset. +To create a new cpuset and attach the current command shell to it, +the steps are: +.PP +.PD 0 +.IP (1) 5 +mkdir /dev/cpuset (if not already done) +.IP (2) +mount \-t cpuset none /dev/cpuset (if not already done) +.IP (3) +Create the new cpuset using +.BR mkdir (1). +.IP (4) +Assign CPUs and memory nodes to the new cpuset. +.IP (5) +Attach the shell to the new cpuset. +.PD +.PP +For example, the following sequence of commands will set up a cpuset +named "Charlie", containing just CPUs 2 and 3, and memory node 1, +and then attach the current shell to that cpuset. +.PP +.in +4n +.EX +.RB "$" " mkdir /dev/cpuset" +.RB "$" " mount \-t cpuset cpuset /dev/cpuset" +.RB "$" " cd /dev/cpuset" +.RB "$" " mkdir Charlie" +.RB "$" " cd Charlie" +.RB "$" " /bin/echo 2\-3 > cpuset.cpus" +.RB "$" " /bin/echo 1 > cpuset.mems" +.RB "$" " /bin/echo $$ > tasks" +# The current shell is now running in cpuset Charlie +# The next line should display \[aq]/Charlie\[aq] +.RB "$" " cat /proc/self/cpuset" +.EE +.in +.\" +.SS Migrating a job to different memory nodes. +To migrate a job (the set of processes attached to a cpuset) +to different CPUs and memory nodes in the system, including moving +the memory pages currently allocated to that job, +perform the following steps. +.PP +.PD 0 +.IP (1) 5 +Let's say we want to move the job in cpuset +.I alpha +(CPUs 4\[en]7 and memory nodes 2\[en]3) to a new cpuset +.I beta +(CPUs 16\[en]19 and memory nodes 8\[en]9). +.IP (2) +First create the new cpuset +.IR beta . +.IP (3) +Then allow CPUs 16\[en]19 and memory nodes 8\[en]9 in +.IR beta . +.IP (4) +Then enable +.I memory_migration +in +.IR beta . +.IP (5) +Then move each process from +.I alpha +to +.IR beta . +.PD +.PP +The following sequence of commands accomplishes this. +.PP +.in +4n +.EX +.RB "$" " cd /dev/cpuset" +.RB "$" " mkdir beta" +.RB "$" " cd beta" +.RB "$" " /bin/echo 16\-19 > cpuset.cpus" +.RB "$" " /bin/echo 8\-9 > cpuset.mems" +.RB "$" " /bin/echo 1 > cpuset.memory_migrate" +.RB "$" " while read i; do /bin/echo $i; done < ../alpha/tasks > tasks" +.EE +.in +.PP +The above should move any processes in +.I alpha +to +.IR beta , +and any memory held by these processes on memory nodes 2\[en]3 to memory +nodes 8\[en]9, respectively. +.PP +Notice that the last step of the above sequence did not do: +.PP +.in +4n +.EX +.RB "$" " cp ../alpha/tasks tasks" +.EE +.in +.PP +The +.I while +loop, rather than the seemingly easier use of the +.BR cp (1) +command, was necessary because +only one process PID at a time may be written to the +.I tasks +file. +.PP +The same effect (writing one PID at a time) as the +.I while +loop can be accomplished more efficiently, in fewer keystrokes and in +syntax that works on any shell, but alas more obscurely, by using the +.B \-u +(unbuffered) option of +.BR sed (1): +.PP +.in +4n +.EX +.RB "$" " sed \-un p < ../alpha/tasks > tasks" +.EE +.in +.\" ================== SEE ALSO ================== +.SH SEE ALSO +.BR taskset (1), +.BR get_mempolicy (2), +.BR getcpu (2), +.BR mbind (2), +.BR sched_getaffinity (2), +.BR sched_setaffinity (2), +.BR sched_setscheduler (2), +.BR set_mempolicy (2), +.BR CPU_SET (3), +.BR proc (5), +.BR cgroups (7), +.BR numa (7), +.BR sched (7), +.BR migratepages (8), +.BR numactl (8) +.PP +.I Documentation/admin\-guide/cgroup\-v1/cpusets.rst +in the Linux kernel source tree +.\" commit 45ce80fb6b6f9594d1396d44dd7e7c02d596fef8 +(or +.I Documentation/cgroup\-v1/cpusets.txt +before Linux 4.18, and +.I Documentation/cpusets.txt +before Linux 2.6.29) |