Diffstat (limited to 'man7/cpuset.7')
-rw-r--r--  man7/cpuset.7  1504
1 file changed, 1504 insertions(+), 0 deletions(-)
diff --git a/man7/cpuset.7 b/man7/cpuset.7
new file mode 100644
index 0000000..800e4da
--- /dev/null
+++ b/man7/cpuset.7
@@ -0,0 +1,1504 @@
+.\" Copyright (c) 2008 Silicon Graphics, Inc.
+.\"
+.\" Author: Paul Jackson (http://oss.sgi.com/projects/cpusets)
+.\"
+.\" SPDX-License-Identifier: GPL-2.0-only
+.\"
+.TH cpuset 7 2023-07-18 "Linux man-pages 6.05.01"
+.SH NAME
+cpuset \- confine processes to processor and memory node subsets
+.SH DESCRIPTION
+The cpuset filesystem is a pseudo-filesystem interface
+to the kernel cpuset mechanism,
+which is used to control the processor placement
+and memory placement of processes.
+It is commonly mounted at
+.IR /dev/cpuset .
+.PP
+On systems with kernels compiled with built-in support for cpusets,
+all processes are attached to a cpuset, and cpusets are always present.
+If a system supports cpusets, then it will have the entry
+.B nodev cpuset
+in the file
+.IR /proc/filesystems .
+By mounting the cpuset filesystem (see the
+.B EXAMPLES
+section below),
+the administrator can configure the cpusets on a system
+to control the processor and memory placement of processes
+on that system.
+By default, if the cpuset configuration
+on a system is not modified or if the cpuset filesystem
+is not even mounted, then the cpuset mechanism,
+though present, has no effect on the system's behavior.
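Since the EXAMPLES section is not reproduced in this excerpt, a minimal mounting sketch follows. It assumes the conventional mount point named above; the function name is illustrative, and the commands require root privileges:

```shell
# Sketch (requires root): mount the cpuset pseudo-filesystem at its
# conventional location so that cpusets can be configured.
mount_cpuset_fs() {
    mkdir -p /dev/cpuset &&
    mount -t cpuset cpuset /dev/cpuset
}
```

Once mounted, the directories and pseudo-files described below appear under /dev/cpuset.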
+.PP
+A cpuset defines a list of CPUs and memory nodes.
+.PP
+The CPUs of a system include all the logical processing
+units on which a process can execute, including, if present,
+multiple processor cores within a package and Hyper-Threads
+within a processor core.
+Memory nodes include all distinct
+banks of main memory; small and SMP systems typically have
+just one memory node that contains all the system's main memory,
+while NUMA (non-uniform memory access) systems have multiple memory nodes.
+.PP
+Cpusets are represented as directories in a hierarchical
+pseudo-filesystem, where the top directory in the hierarchy
+.RI ( /dev/cpuset )
+represents the entire system (all online CPUs and memory nodes)
+and any cpuset that is the child (descendant) of
+another parent cpuset contains a subset of that parent's
+CPUs and memory nodes.
+The directories and files representing cpusets have normal
+filesystem permissions.
+.PP
+Every process in the system belongs to exactly one cpuset.
+A process is confined to run only on the CPUs in
+the cpuset it belongs to, and to allocate memory only
+on the memory nodes in that cpuset.
+When a process
+.BR fork (2)s,
+the child process is placed in the same cpuset as its parent.
+With sufficient privilege, a process may be moved from one
+cpuset to another and the allowed CPUs and memory nodes
+of an existing cpuset may be changed.
+.PP
+When the system begins booting, a single cpuset is
+defined that includes all CPUs and memory nodes on the
+system, and all processes are in that cpuset.
+During the boot process, or later during normal system operation,
+other cpusets may be created, as subdirectories of this top cpuset,
+under the control of the system administrator,
+and processes may be placed in these other cpusets.
+.PP
+Cpusets are integrated with the
+.BR sched_setaffinity (2)
+scheduling affinity mechanism and the
+.BR mbind (2)
+and
+.BR set_mempolicy (2)
+memory-placement mechanisms in the kernel.
+None of these mechanisms lets a process make use
+of a CPU or memory node that is not allowed by that process's cpuset.
+If changes to a process's cpuset placement conflict with these
+other mechanisms, then cpuset placement is enforced
+even if it means overriding these other mechanisms.
+The kernel accomplishes this overriding by silently
+restricting the CPUs and memory nodes requested by
+these other mechanisms to those allowed by the
+invoking process's cpuset.
+This can result in these
+other calls returning an error if, for example, such
+a call ends up requesting an empty set of CPUs or
+memory nodes after that request is restricted to
+the invoking process's cpuset.
+.PP
+Typically, a cpuset is used to manage
+the CPU and memory-node confinement for a set of
+cooperating processes such as a batch scheduler job, and these
+other mechanisms are used to manage the placement of
+individual processes or memory regions within that set or job.
+.SH FILES
+Each directory below
+.I /dev/cpuset
+represents a cpuset and contains a fixed set of pseudo-files
+describing the state of that cpuset.
+.PP
+New cpusets are created using the
+.BR mkdir (2)
+system call or the
+.BR mkdir (1)
+command.
+The properties of a cpuset, such as its flags, allowed
+CPUs and memory nodes, and attached processes, are queried and modified
+by reading or writing to the appropriate file in that cpuset's directory,
+as listed below.
+.PP
+The pseudo-files in each cpuset directory are automatically created when
+the cpuset is created, as a result of the
+.BR mkdir (2)
+invocation.
+It is not possible to directly add or remove these pseudo-files.
+.PP
+A cpuset directory that contains no child cpuset directories,
+and has no attached processes, can be removed using
+.BR rmdir (2)
+or
+.BR rmdir (1).
+It is not necessary, or possible,
+to remove the pseudo-files inside the directory before removing it.
+.PP
+The pseudo-files in each cpuset directory are
+small text files that may be read and
+written using traditional shell utilities such as
+.BR cat (1),
+and
+.BR echo (1),
+or from a program by using file I/O library functions or system calls,
+such as
+.BR open (2),
+.BR read (2),
+.BR write (2),
+and
+.BR close (2).
+.PP
+The pseudo-files in a cpuset directory represent internal kernel
+state and do not have any persistent image on disk.
+Each of these per-cpuset files is listed and described below.
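The mkdir-then-write workflow just described can be sketched as a small shell helper. The function name and the ROOT parameter are illustrative; the pseudo-file names are the ones documented below:

```shell
# make_cpuset ROOT NAME CPUS MEMS: create a cpuset under ROOT (where
# the cpuset filesystem is mounted) and set its allowed CPUs and
# memory nodes by writing to the pseudo-files created by mkdir.
make_cpuset() {
    mkdir -p "$1/$2" &&
    echo "$3" > "$1/$2/cpuset.cpus" &&
    echo "$4" > "$1/$2/cpuset.mems"
}
```

For example, `make_cpuset /dev/cpuset myjob 0-3 0` would create a cpuset allowed CPUs 0-3 and memory node 0; writing a PID to myjob/tasks then moves a process into it.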
+.\" ====================== tasks ======================
+.TP
+.I tasks
+List of the process IDs (PIDs) of the processes in that cpuset.
+The list is formatted as a series of ASCII
+decimal numbers, each followed by a newline.
+A process may be added to a cpuset (automatically removing
+it from the cpuset that previously contained it) by writing its
+PID to that cpuset's
+.I tasks
+file (with or without a trailing newline).
+.IP
+.B Warning:
+only one PID may be written to the
+.I tasks
+file at a time.
+If a string is written that contains more
+than one PID, only the first one will be used.
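The one-PID-per-write rule above means that moving many processes requires a loop. A sketch (the function name is illustrative; SRC and DST are cpuset directory paths):

```shell
# move_all SRC DST: move every process listed in cpuset SRC into
# cpuset DST, issuing one write(2) per PID as the kernel requires.
move_all() {
    local pid
    while read -r pid; do
        echo "$pid" > "$2/tasks"
    done < "$1/tasks"
}
```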
+.\" =================== notify_on_release ===================
+.TP
+.I notify_on_release
+Flag (0 or 1).
+If set (1), that cpuset will receive special handling
+after it is released, that is, after all processes cease using
+it (i.e., terminate or are moved to a different cpuset)
+and all child cpuset directories have been removed.
+See the \fBNotify On Release\fR section, below.
+.\" ====================== cpus ======================
+.TP
+.I cpuset.cpus
+List of the physical numbers of the CPUs on which processes
+in that cpuset are allowed to execute.
+See \fBList Format\fR below for a description of the
+format of
+.IR cpus .
+.IP
+The CPUs allowed to a cpuset may be changed by
+writing a new list to its
+.I cpus
+file.
+.\" ==================== cpu_exclusive ====================
+.TP
+.I cpuset.cpu_exclusive
+Flag (0 or 1).
+If set (1), the cpuset has exclusive use of
+its CPUs (no sibling or cousin cpuset may overlap CPUs).
+By default, this is off (0),
+and newly created cpusets also start with it off (0).
+.IP
+Two cpusets are
+.I sibling
+cpusets if they share the same parent cpuset in the
+.I /dev/cpuset
+hierarchy.
+Two cpusets are
+.I cousin
+cpusets if neither is the ancestor of the other.
+Regardless of the
+.I cpu_exclusive
+setting, if one cpuset is the ancestor of another,
+and if both of these cpusets have nonempty
+.IR cpus ,
+then their
+.I cpus
+must overlap, because the
+.I cpus
+of any cpuset are always a subset of the
+.I cpus
+of its parent cpuset.
+.\" ====================== mems ======================
+.TP
+.I cpuset.mems
+List of memory nodes on which processes in this cpuset are
+allowed to allocate memory.
+See \fBList Format\fR below for a description of the
+format of
+.IR mems .
+.\" ==================== mem_exclusive ====================
+.TP
+.I cpuset.mem_exclusive
+Flag (0 or 1).
+If set (1), the cpuset has exclusive use of
+its memory nodes (no sibling or cousin may overlap).
+Also if set (1), the cpuset is a \fBHardwall\fR cpuset (see below).
+By default, this is off (0),
+and newly created cpusets also start with it off (0).
+.IP
+Regardless of the
+.I mem_exclusive
+setting, if one cpuset is the ancestor of another,
+then their memory nodes must overlap, because the memory
+nodes of any cpuset are always a subset of the memory nodes
+of that cpuset's parent cpuset.
+.\" ==================== mem_hardwall ====================
+.TP
+.IR cpuset.mem_hardwall " (since Linux 2.6.26)"
+Flag (0 or 1).
+If set (1), the cpuset is a \fBHardwall\fR cpuset (see below).
+Unlike \fBmem_exclusive\fR,
+there is no constraint on whether cpusets
+marked \fBmem_hardwall\fR may have overlapping
+memory nodes with sibling or cousin cpusets.
+By default, this is off (0),
+and newly created cpusets also start with it off (0).
+.\" ==================== memory_migrate ====================
+.TP
+.IR cpuset.memory_migrate " (since Linux 2.6.16)"
+Flag (0 or 1).
+If set (1), then memory migration is enabled.
+By default, this is off (0).
+See the \fBMemory Migration\fR section, below.
+.\" ==================== memory_pressure ====================
+.TP
+.IR cpuset.memory_pressure " (since Linux 2.6.16)"
+A measure of how much memory pressure the processes in this
+cpuset are causing.
+See the \fBMemory Pressure\fR section, below.
+Unless
+.I memory_pressure_enabled
+is enabled, always has value zero (0).
+This file is read-only.
+See the
+.B WARNINGS
+section, below.
+.\" ================= memory_pressure_enabled =================
+.TP
+.IR cpuset.memory_pressure_enabled " (since Linux 2.6.16)"
+Flag (0 or 1).
+This file is present only in the root cpuset, normally
+.IR /dev/cpuset .
+If set (1), the
+.I memory_pressure
+calculations are enabled for all cpusets in the system.
+By default, this is off (0).
+See the
+\fBMemory Pressure\fR section, below.
+.\" ================== memory_spread_page ==================
+.TP
+.IR cpuset.memory_spread_page " (since Linux 2.6.17)"
+Flag (0 or 1).
+If set (1), pages in the kernel page cache
+(filesystem buffers) are uniformly spread across the cpuset.
+By default, this is off (0) in the top cpuset,
+and inherited from the parent cpuset in
+newly created cpusets.
+See the \fBMemory Spread\fR section, below.
+.\" ================== memory_spread_slab ==================
+.TP
+.IR cpuset.memory_spread_slab " (since Linux 2.6.17)"
+Flag (0 or 1).
+If set (1), the kernel slab caches
+for file I/O (directory and inode structures) are
+uniformly spread across the cpuset.
+By default, this is off (0) in the top cpuset,
+and inherited from the parent cpuset in
+newly created cpusets.
+See the \fBMemory Spread\fR section, below.
+.\" ================== sched_load_balance ==================
+.TP
+.IR cpuset.sched_load_balance " (since Linux 2.6.24)"
+Flag (0 or 1).
+If set (1, the default), the kernel will
+automatically load balance processes in that cpuset over
+the allowed CPUs in that cpuset.
+If cleared (0), the
+kernel will avoid load balancing processes in this cpuset,
+.I unless
+some other cpuset with overlapping CPUs has its
+.I sched_load_balance
+flag set.
+See \fBScheduler Load Balancing\fR, below, for further details.
+.\" ================== sched_relax_domain_level ==================
+.TP
+.IR cpuset.sched_relax_domain_level " (since Linux 2.6.26)"
+Integer, between \-1 and a small positive value.
+The
+.I sched_relax_domain_level
+controls the width of the range of CPUs over which the kernel scheduler
+performs immediate rebalancing of runnable tasks across CPUs.
+If
+.I sched_load_balance
+is disabled, then the setting of
+.I sched_relax_domain_level
+does not matter, as no such load balancing is done.
+If
+.I sched_load_balance
+is enabled, then the higher the value of the
+.IR sched_relax_domain_level ,
+the wider
+the range of CPUs over which immediate load balancing is attempted.
+See \fBScheduler Relax Domain Level\fR, below, for further details.
+.\" ================== proc cpuset ==================
+.PP
+In addition to the above pseudo-files in each directory below
+.IR /dev/cpuset ,
+each process has a pseudo-file,
+.IR /proc/ pid /cpuset ,
+that displays the path of the process's cpuset directory
+relative to the root of the cpuset filesystem.
+.\" ================== proc status ==================
+.PP
+Also, the
+.IR /proc/ pid /status
+file for each process has four added lines,
+displaying the process's
+.I Cpus_allowed
+(on which CPUs it may be scheduled) and
+.I Mems_allowed
+(on which memory nodes it may obtain memory),
+in the two formats \fBMask Format\fR and \fBList Format\fR (see below)
+as shown in the following example:
+.PP
+.in +4n
+.EX
+Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
+Cpus_allowed_list: 0\-127
+Mems_allowed: ffffffff,ffffffff
+Mems_allowed_list: 0\-63
+.EE
+.in
+.PP
+The "allowed" fields were added in Linux 2.6.24;
+the "allowed_list" fields were added in Linux 2.6.26.
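A script can pull the list-format affinity out of a status file with a one-line filter. A sketch, assuming bash and awk; the function name is illustrative, and the path (for example /proc/self/status on a live system) is passed as a parameter:

```shell
# cpus_allowed_list STATUSFILE: print the List Format CPU affinity
# recorded in a /proc/pid/status-style file.
cpus_allowed_list() {
    awk -F':[ \t]*' '$1 == "Cpus_allowed_list" { print $2 }' "$1"
}
```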
+.\" ================== EXTENDED CAPABILITIES ==================
+.SH EXTENDED CAPABILITIES
+In addition to controlling which
+.I cpus
+and
+.I mems
+a process is allowed to use, cpusets provide the following
+extended capabilities.
+.\" ================== Exclusive Cpusets ==================
+.SS Exclusive cpusets
+If a cpuset is marked
+.I cpu_exclusive
+or
+.IR mem_exclusive ,
+no other cpuset, other than a direct ancestor or descendant,
+may share any of the same CPUs or memory nodes.
+.PP
+A cpuset that is
+.I mem_exclusive
+restricts kernel allocations for
+buffer cache pages and other internal kernel data pages
+commonly shared by the kernel across
+multiple users.
+All cpusets, whether
+.I mem_exclusive
+or not, restrict allocations of memory for user space.
+This enables configuring a
+system so that several independent jobs can share common kernel data,
+while isolating each job's user allocation in
+its own cpuset.
+To do this, construct a large
+.I mem_exclusive
+cpuset to hold all the jobs, and construct child,
+.RI non- mem_exclusive
+cpusets for each individual job.
+Only a small amount of kernel memory,
+such as requests from interrupt handlers, is allowed to be
+placed on memory nodes
+outside even a
+.I mem_exclusive
+cpuset.
+.\" ================== Hardwall ==================
+.SS Hardwall
+A cpuset that has
+.I mem_exclusive
+or
+.I mem_hardwall
+set is a
+.I hardwall
+cpuset.
+A
+.I hardwall
+cpuset restricts kernel allocations for page, buffer,
+and other data commonly shared by the kernel across multiple users.
+All cpusets, whether
+.I hardwall
+or not, restrict allocations of memory for user space.
+.PP
+This enables configuring a system so that several independent
+jobs can share common kernel data, such as filesystem pages,
+while isolating each job's user allocation in its own cpuset.
+To do this, construct a large
+.I hardwall
+cpuset to hold
+all the jobs, and construct child cpusets for each individual
+job which are not
+.I hardwall
+cpusets.
+.PP
+Only a small amount of kernel memory, such as requests from
+interrupt handlers, is allowed to be taken outside even a
+.I hardwall
+cpuset.
+.\" ================== Notify On Release ==================
+.SS Notify on release
+If the
+.I notify_on_release
+flag is enabled (1) in a cpuset,
+then whenever the last process in the cpuset leaves
+(exits or attaches to some other cpuset)
+and the last child cpuset of that cpuset is removed,
+the kernel will run the command
+.IR /sbin/cpuset_release_agent ,
+supplying the pathname (relative to the mount point of the
+cpuset filesystem) of the abandoned cpuset.
+This enables automatic removal of abandoned cpusets.
+.PP
+The default value of
+.I notify_on_release
+in the root cpuset at system boot is disabled (0).
+The default value of other cpusets at creation
+is the current value of their parent's
+.I notify_on_release
+setting.
+.PP
+The command
+.I /sbin/cpuset_release_agent
+is invoked, with the name
+.RI ( /dev/cpuset
+relative path)
+of the to-be-released cpuset in
+.IR argv[1] .
+.PP
+The usual contents of the command
+.I /sbin/cpuset_release_agent
+is simply the shell script:
+.PP
+.in +4n
+.EX
+#!/bin/sh
+rmdir /dev/cpuset/$1
+.EE
+.in
+.PP
+As with other flag values below, this flag can
+be changed by writing an ASCII
+number 0 or 1 (with optional trailing newline)
+into the file, to clear or set the flag, respectively.
+.\" ================== Memory Pressure ==================
+.SS Memory pressure
+The
+.I memory_pressure
+of a cpuset provides a simple per-cpuset running average of
+the rate that the processes in a cpuset are attempting to free up in-use
+memory on the nodes of the cpuset to satisfy additional memory requests.
+.PP
+This enables batch managers that are monitoring jobs running in dedicated
+cpusets to efficiently detect what level of memory pressure that job
+is causing.
+.PP
+This is useful both on tightly managed systems running a wide mix of
+submitted jobs, which may choose to terminate or reprioritize jobs that
+are trying to use more memory than allowed on the nodes assigned them,
+and with tightly coupled, long-running, massively parallel scientific
+computing jobs that will dramatically fail to meet required performance
+goals if they start to use more memory than allowed to them.
+.PP
+This mechanism provides a very economical way for the batch manager
+to monitor a cpuset for signs of memory pressure.
+It's up to the batch manager or other user code to decide
+what action to take if it detects signs of memory pressure.
+.PP
+Unless memory pressure calculation is enabled by setting the pseudo-file
+.IR /dev/cpuset/cpuset.memory_pressure_enabled ,
+it is not computed for any cpuset, and reads from any
+.I memory_pressure
+always return zero, as represented by the ASCII string "0\en".
+See the \fBWARNINGS\fR section, below.
+.PP
+A per-cpuset running average is employed for the following reasons:
+.IP \[bu] 3
+Because this meter is per-cpuset rather than per-process or per virtual
+memory region, the system load imposed by a batch scheduler monitoring
+this metric is sharply reduced on large systems, because a scan of
+the tasklist can be avoided on each set of queries.
+.IP \[bu]
+Because this meter is a running average rather than an accumulating
+counter, a batch scheduler can detect memory pressure with a
+single read, instead of having to read and accumulate results
+for a period of time.
+.IP \[bu]
+Because this meter is per-cpuset rather than per-process,
+the batch scheduler can obtain the key information\[em]memory
+pressure in a cpuset\[em]with a single read, rather than having to
+query and accumulate results over all the (dynamically changing)
+set of processes in the cpuset.
+.PP
+The
+.I memory_pressure
+of a cpuset is calculated using a per-cpuset simple digital filter
+that is kept within the kernel.
+For each cpuset, this filter tracks
+the recent rate at which processes attached to that cpuset enter the
+kernel direct reclaim code.
+.PP
+The kernel direct reclaim code is entered whenever a process has to
+satisfy a memory page request by first finding some other page to
+repurpose, for lack of any readily available free pages.
+Dirty filesystem pages are repurposed by first writing them
+to disk.
+Unmodified filesystem buffer pages are repurposed
+by simply dropping them, though if that page is needed again, it
+will have to be reread from disk.
+.PP
+The
+.I cpuset.memory_pressure
+file provides an integer number representing the recent (half-life of
+10 seconds) rate of entries to the direct reclaim code caused by any
+process in the cpuset, in units of reclaims attempted per second,
+times 1000.
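Given the units just stated (reclaims attempted per second, times 1000), converting a raw reading back to reclaims per second is a single division. A sketch; the function name is illustrative:

```shell
# pressure_per_sec RAW: convert a raw cpuset.memory_pressure reading
# (attempted reclaims per second, scaled by 1000) back to
# reclaims per second.
pressure_per_sec() {
    awk -v raw="$1" 'BEGIN { printf "%.3f\n", raw / 1000 }'
}
```

For example, a reading of 4500 corresponds to 4.5 attempted reclaims per second.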
+.\" ================== Memory Spread ==================
+.SS Memory spread
+There are two Boolean flag files per cpuset that control where the
+kernel allocates pages for the filesystem buffers and related
+in-kernel data structures.
+They are called
+.I cpuset.memory_spread_page
+and
+.IR cpuset.memory_spread_slab .
+.PP
+If the per-cpuset Boolean flag file
+.I cpuset.memory_spread_page
+is set, then
+the kernel will spread the filesystem buffers (page cache) evenly
+over all the nodes that the faulting process is allowed to use, instead
+of preferring to put those pages on the node where the process is running.
+.PP
+If the per-cpuset Boolean flag file
+.I cpuset.memory_spread_slab
+is set,
+then the kernel will spread some filesystem-related slab caches,
+such as those for inodes and directory entries, evenly over all the nodes
+that the faulting process is allowed to use, instead of preferring to
+put those pages on the node where the process is running.
+.PP
+The setting of these flags does not affect the data segment
+(see
+.BR brk (2))
+or stack segment pages of a process.
+.PP
+By default, both kinds of memory spreading are off and the kernel
+prefers to allocate memory pages on the node local to where the
+requesting process is running.
+If that node is not allowed by the
+process's NUMA memory policy or cpuset configuration or if there are
+insufficient free memory pages on that node, then the kernel looks
+for the nearest node that is allowed and has sufficient free memory.
+.PP
+When new cpusets are created, they inherit the memory spread settings
+of their parent.
+.PP
+Setting memory spreading causes allocations for the affected page or
+slab caches to ignore the process's NUMA memory policy and be spread
+instead.
+However, the effect of these changes in memory placement
+caused by cpuset-specified memory spreading is hidden from the
+.BR mbind (2)
+or
+.BR set_mempolicy (2)
+calls.
+These two NUMA memory policy calls always appear to behave as if
+no cpuset-specified memory spreading is in effect, even if it is.
+If cpuset memory spreading is subsequently turned off, the NUMA
+memory policy most recently specified by these calls is automatically
+reapplied.
+.PP
+Both
+.I cpuset.memory_spread_page
+and
+.I cpuset.memory_spread_slab
+are Boolean flag files.
+By default, they contain "0", meaning that the feature is off
+for that cpuset.
+If a "1" is written to that file, that turns the named feature on.
+.PP
+Cpuset-specified memory spreading behaves similarly to what is known
+(in other contexts) as round-robin or interleave memory placement.
+.PP
+Cpuset-specified memory spreading can provide substantial performance
+improvements for jobs that:
+.IP \[bu] 3
+need to place thread-local data on
+memory nodes close to the CPUs which are running the threads that most
+frequently access that data; but also
+.IP \[bu]
+need to access large filesystem data sets that must be spread
+across the several nodes in the job's cpuset in order to fit.
+.PP
+Without this policy,
+the memory allocation across the nodes in the job's cpuset
+can become very uneven,
+especially for jobs that might have just a single
+thread initializing or reading in the data set.
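Enabling both kinds of spreading for a cpuset is two flag writes. A sketch (the function name is illustrative; the argument is a cpuset directory path):

```shell
# spread_memory CPUSET: enable both page-cache and slab-cache
# spreading for a cpuset by setting its two Boolean flag files.
spread_memory() {
    echo 1 > "$1/cpuset.memory_spread_page" &&
    echo 1 > "$1/cpuset.memory_spread_slab"
}
```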
+.\" ================== Memory Migration ==================
+.SS Memory migration
+Normally, under the default setting (disabled) of
+.IR cpuset.memory_migrate ,
+once a page is allocated (given a physical page
+of main memory), then that page stays on whatever node it
+was allocated, so long as it remains allocated, even if the
+cpuset's memory-placement policy
+.I mems
+subsequently changes.
+.PP
+When memory migration is enabled in a cpuset, if the
+.I mems
+setting of the cpuset is changed, then any memory page in use by any
+process in the cpuset that is on a memory node that is no longer
+allowed will be migrated to a memory node that is allowed.
+.PP
+Furthermore, if a process is moved into a cpuset with
+.I memory_migrate
+enabled, any memory pages it uses that were on memory nodes allowed
+in its previous cpuset, but which are not allowed in its new cpuset,
+will be migrated to a memory node allowed in the new cpuset.
+.PP
+The relative placement of a migrated page within
+the cpuset is preserved during these migration operations if possible.
+For example,
+if the page was on the second valid node of the prior cpuset,
+then the page will be placed on the second valid node of the new cpuset,
+if possible.
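The migration behavior described above suggests a two-step idiom: enable the flag, then rewrite the allowed nodes. A sketch (the function name is illustrative; the argument is a cpuset directory path):

```shell
# retarget_mems CPUSET MEMS: enable memory migration for a cpuset,
# then rewrite its allowed memory nodes so that pages on
# now-disallowed nodes are migrated to allowed ones.
retarget_mems() {
    echo 1 > "$1/cpuset.memory_migrate" &&
    echo "$2" > "$1/cpuset.mems"
}
```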
+.\" ================== Scheduler Load Balancing ==================
+.SS Scheduler load balancing
+The kernel scheduler automatically load balances processes.
+If one CPU is underutilized,
+the kernel will look for processes on other more
+overloaded CPUs and move those processes to the underutilized CPU,
+within the constraints of such placement mechanisms as cpusets and
+.BR sched_setaffinity (2).
+.PP
+The algorithmic cost of load balancing and its impact on key shared
+kernel data structures such as the process list increases more than
+linearly with the number of CPUs being balanced.
+For example, it
+costs more to load balance across one large set of CPUs than it does
+to balance across two smaller sets of CPUs, each of half the size
+of the larger set.
+(The precise relationship between the number of CPUs being balanced
+and the cost of load balancing depends
+on implementation details of the kernel process scheduler, which is
+subject to change over time, as improved kernel scheduler algorithms
+are implemented.)
+.PP
+The per-cpuset flag
+.I sched_load_balance
+provides a mechanism to suppress this automatic scheduler load
+balancing in cases where it is not needed and suppressing it would have
+worthwhile performance benefits.
+.PP
+By default, load balancing is done across all CPUs, except those
+marked isolated using the kernel boot-time "isolcpus=" argument.
+(See \fBScheduler Relax Domain Level\fR, below, to change this default.)
+.PP
+This default load balancing across all CPUs is not well suited to
+the following two situations:
+.IP \[bu] 3
+On large systems, load balancing across many CPUs is expensive.
+If the system is managed using cpusets to place independent jobs
+on separate sets of CPUs, full load balancing is unnecessary.
+.IP \[bu]
+Systems supporting real-time on some CPUs need to minimize
+system overhead on those CPUs, including avoiding process load
+balancing if that is not needed.
+.PP
+When the per-cpuset flag
+.I sched_load_balance
+is enabled (the default setting),
+it requests load balancing across
+all the CPUs in that cpuset's allowed CPUs,
+ensuring that load balancing can move a process (not otherwise pinned,
+as by
+.BR sched_setaffinity (2))
+from any CPU in that cpuset to any other.
+.PP
+When the per-cpuset flag
+.I sched_load_balance
+is disabled, then the
+scheduler will avoid load balancing across the CPUs in that cpuset,
+\fIexcept\fR in so far as is necessary because some overlapping cpuset
+has
+.I sched_load_balance
+enabled.
+.PP
+So, for example, if the top cpuset has the flag
+.I sched_load_balance
+enabled, then the scheduler will load balance across all
+CPUs, and the setting of the
+.I sched_load_balance
+flag in other cpusets has no effect,
+as we're already fully load balancing.
+.PP
+Therefore in the above two situations, the flag
+.I sched_load_balance
+should be disabled in the top cpuset, and only some of the smaller,
+child cpusets would have this flag enabled.
+.PP
+When doing this, you don't usually want to leave any unpinned processes in
+the top cpuset that might use nontrivial amounts of CPU, as such processes
+may be artificially constrained to some subset of CPUs, depending on
+the particulars of this flag setting in descendant cpusets.
+Even if such a process could use spare CPU cycles in some other CPUs,
+the kernel scheduler might not consider the possibility of
+load balancing that process to the underused CPU.
+.PP
+Of course, processes pinned to a particular CPU can be left in a cpuset
+that disables
+.I sched_load_balance
+as those processes aren't going anywhere else anyway.
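The recommended configuration above (flag disabled in the top cpuset, enabled only in selected children) can be sketched as follows; the function name is illustrative, ROOT is the cpuset mount point, and the job cpusets are assumed to exist already:

```shell
# isolate_balancing ROOT JOB...: turn off global scheduler load
# balancing in the top cpuset, then re-enable it only inside the
# named child cpusets, so balancing stays within each job.
isolate_balancing() {
    local root=$1 job
    shift
    echo 0 > "$root/cpuset.sched_load_balance"
    for job in "$@"; do
        echo 1 > "$root/$job/cpuset.sched_load_balance"
    done
}
```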
+.\" ================== Scheduler Relax Domain Level ==================
+.SS Scheduler relax domain level
+The kernel scheduler performs immediate load balancing whenever
+a CPU becomes free or another task becomes runnable.
+This load
+balancing works to ensure that as many CPUs as possible are usefully
+employed running tasks.
+The kernel also performs periodic load
+balancing off the software clock described in
+.BR time (7).
+The setting of
+.I sched_relax_domain_level
+applies only to immediate load balancing.
+Regardless of the
+.I sched_relax_domain_level
+setting, periodic load balancing is attempted over all CPUs
+(unless disabled by turning off
+.IR sched_load_balance ).
+In any case, of course, tasks will be scheduled to run only on
+CPUs allowed by their cpuset, as modified by
+.BR sched_setaffinity (2)
+system calls.
+.PP
+On small systems, such as those with just a few CPUs, immediate load
+balancing is useful to improve system interactivity and to minimize
+wasteful idle CPU cycles.
+But on large systems, attempting immediate
+load balancing across a large number of CPUs can be more costly than
+it is worth, depending on the particular performance characteristics
+of the job mix and the hardware.
+.PP
+The exact meaning of the small integer values of
+.I sched_relax_domain_level
+will depend on internal
+implementation details of the kernel scheduler code and on the
+non-uniform architecture of the hardware.
+Both of these will evolve
+over time and vary by system architecture and kernel version.
+.PP
+As of this writing, when this capability was introduced in Linux
+2.6.26, on certain popular architectures, the positive values of
+.I sched_relax_domain_level
+have the following meanings.
+.PP
+.PD 0
+.TP
+.B 1
+Perform immediate load balancing across Hyper-Thread
+siblings on the same core.
+.TP
+.B 2
+Perform immediate load balancing across other cores in the same package.
+.TP
+.B 3
+Perform immediate load balancing across other CPUs
+on the same node or blade.
+.TP
+.B 4
+Perform immediate load balancing across several
+(implementation-defined) nodes [on NUMA systems].
+.TP
+.B 5
+Perform immediate load balancing across all CPUs
+in the system [on NUMA systems].
+.PD
+.PP
+The
+.I sched_relax_domain_level
+value of zero (0) always means
+don't perform immediate load balancing,
+hence that load balancing is done only periodically,
+not immediately when a CPU becomes available or another task becomes
+runnable.
+.PP
+The
+.I sched_relax_domain_level
+value of minus one (\-1)
+always means use the system default value.
+The system default value can vary by architecture and kernel version.
+This system default value can be changed via the kernel
+boot-time "relax_domain_level=" argument.
+.PP
+If multiple overlapping cpusets have conflicting
+.I sched_relax_domain_level
+values, then the highest such value
+applies to all CPUs in any of the overlapping cpusets.
+In such cases,
+.B \-1
+is the lowest value,
+overridden by any other value,
+and
+.B 0
+is the next lowest value.
+.SH FORMATS
+The following formats are used to represent sets of
+CPUs and memory nodes.
+.\" ================== Mask Format ==================
+.SS Mask format
+The \fBMask Format\fR is used to represent CPU and memory-node bit masks
+in the
+.IR /proc/ pid /status
+file.
+.PP
+This format displays each 32-bit
+word in hexadecimal (using ASCII characters "0" - "9" and "a" - "f");
+words are filled with leading zeros, if required.
+For masks longer than one word, a comma separator is used between words.
+Words are displayed in big-endian
+order, which has the most significant bit first.
+The hex digits within a word are also in big-endian order.
+.PP
+The number of 32-bit words displayed is the minimum number needed to
+display all bits of the bit mask, based on the size of the bit mask.
+.PP
+Examples of the \fBMask Format\fR:
+.PP
+.in +4n
+.EX
+00000001 # just bit 0 set
+40000000,00000000,00000000 # just bit 94 set
+00000001,00000000,00000000 # just bit 64 set
+000000ff,00000000 # bits 32\-39 set
+00000000,000e3862 # 1,5,6,11\-13,17\-19 set
+.EE
+.in
+.PP
+A mask with bits 0, 1, 2, 4, 8, 16, 32, and 64 set displays as:
+.PP
+.in +4n
+.EX
+00000001,00000001,00010117
+.EE
+.in
+.PP
+The first "1" is for bit 64, the
+second for bit 32, the third for bit 16, the fourth for bit 8, the
+fifth for bit 4, and the "7" is for bits 2, 1, and 0.
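As a concrete illustration of the decoding rules above, the following shell sketch converts a list of set bit numbers into the Mask Format. The mask_format function is a hypothetical helper written for this page, not part of any kernel or cpuset interface:

```shell
# Hypothetical helper: print the Mask Format representation of the
# given set bit numbers (most significant word first, comma-separated,
# each word zero-filled to eight hex digits).
mask_format() {
    max=0
    for bit in "$@"; do
        if [ "$bit" -gt "$max" ]; then max="$bit"; fi
    done
    w=$(( max / 32 ))          # index of the most significant word
    out=""
    while [ "$w" -ge 0 ]; do
        v=0
        for bit in "$@"; do    # gather the bits belonging to word w
            if [ $(( bit / 32 )) -eq "$w" ]; then
                v=$(( v | (1 << (bit % 32)) ))
            fi
        done
        out="$out$(printf '%08x' "$v")"
        if [ "$w" -gt 0 ]; then out="$out,"; fi
        w=$(( w - 1 ))
    done
    printf '%s\n' "$out"
}

mask_format 0 1 2 4 8 16 32 64    # prints 00000001,00000001,00010117
```

Note that this sketch always emits the minimum number of words needed for the bits it is given, matching the rule stated above.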
+.\" ================== List Format ==================
+.SS List format
+The \fBList Format\fR for
+.I cpus
+and
+.I mems
+is a comma-separated list of CPU or memory-node
+numbers and ranges of numbers, in ASCII decimal.
+.PP
+Examples of the \fBList Format\fR:
+.PP
+.in +4n
+.EX
+0\-4,9 # bits 0, 1, 2, 3, 4, and 9 set
+0\-2,7,12\-14 # bits 0, 1, 2, 7, 12, 13, and 14 set
+.EE
+.in
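Expanding a List Format string into the individual numbers it names can be sketched in shell as follows; expand_list is a hypothetical helper, not a kernel interface:

```shell
# Hypothetical helper: print each CPU or memory-node number named by a
# List Format string (as accepted by cpuset.cpus and cpuset.mems),
# one per line.
expand_list() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r item; do
        case $item in
        *-*) seq "${item%-*}" "${item#*-}" ;;   # a range such as 12-14
        *)   printf '%s\n' "$item" ;;           # a single number
        esac
    done
}

expand_list 0-2,7,12-14    # prints 0 1 2 7 12 13 14, one per line
```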
+.\" ================== RULES ==================
+.SH RULES
+The following rules apply to each cpuset:
+.IP \[bu] 3
+Its CPUs and memory nodes must be a (possibly equal)
+subset of its parent's.
+.IP \[bu]
+It can be marked
+.I cpu_exclusive
+only if its parent is.
+.IP \[bu]
+It can be marked
+.I mem_exclusive
+only if its parent is.
+.IP \[bu]
+If it is
+.IR cpu_exclusive ,
+its CPUs may not overlap any sibling.
+.IP \[bu]
+If it is
+.IR mem_exclusive ,
+its memory nodes may not overlap any sibling.
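The first rule can be checked mechanically. The sketch below uses hypothetical helpers (assuming the List Format described above) to test whether one list is a subset of another:

```shell
# Hypothetical helper: expand a List Format string, one number per line.
expand() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r item; do
        case $item in
        *-*) seq "${item%-*}" "${item#*-}" ;;
        *)   printf '%s\n' "$item" ;;
        esac
    done
}

# Hypothetical helper: print "yes" if every number in the first list
# also appears in the second (e.g. a child cpuset's cpus versus its
# parent's), else "no".
list_subset() {
    parent=$(expand "$2")
    for n in $(expand "$1"); do
        printf '%s\n' "$parent" | grep -qx "$n" || { printf 'no\n'; return 1; }
    done
    printf 'yes\n'
}

list_subset 2-3 0-7    # prints yes
```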
+.\" ================== PERMISSIONS ==================
+.SH PERMISSIONS
+The permissions of a cpuset are determined by the permissions
+of the directories and pseudo-files in the cpuset filesystem,
+normally mounted at
+.IR /dev/cpuset .
+.PP
+For instance, a process can put itself in some other cpuset (than
+its current one) if it can write the
+.I tasks
+file for that cpuset.
+This requires execute permission on the encompassing directories
+and write permission on the
+.I tasks
+file.
+.PP
+An additional constraint is applied to requests to place some
+other process in a cpuset.
+One process may not attach another to
+a cpuset unless it would have permission to send that process
+a signal (see
+.BR kill (2)).
+.PP
+A process may create a child cpuset if it can access and write the
+parent cpuset directory.
+It can modify the CPUs or memory nodes
+in a cpuset if it can access that cpuset's directory (execute
+permission on each of the parent directories) and write the
+corresponding
+.I cpus
+or
+.I mems
+file.
+.PP
+There is one minor difference between the manner in which these
+permissions are evaluated and the manner in which normal filesystem
+operation permissions are evaluated.
+The kernel interprets
+relative pathnames starting at a process's current working directory.
+Even if one is operating on a cpuset file, relative pathnames
+are interpreted relative to the process's current working directory,
+not relative to the process's current cpuset.
+The only ways that
+cpuset paths relative to a process's current cpuset can be used are
+if either the process's current working directory is its cpuset
+(it first did a
+.B cd
+or
+.BR chdir (2)
+to its cpuset directory beneath
+.IR /dev/cpuset ,
+which is a bit unusual)
+or if some user code converts the relative cpuset path to a
+full filesystem path.
+.PP
+In theory, this means that user code should specify cpusets
+using absolute pathnames, which requires knowing the mount point of
+the cpuset filesystem (usually, but not necessarily,
+.IR /dev/cpuset ).
+In practice, all user level code that this author is aware of
+simply assumes that if the cpuset filesystem is mounted, then
+it is mounted at
+.IR /dev/cpuset .
+Furthermore, it is common practice for carefully written
+user code to verify the presence of the pseudo-file
+.I /dev/cpuset/tasks
+in order to verify that the cpuset pseudo-filesystem
+is currently mounted.
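Such a check might look like the following sketch. The mount point is passed as a parameter precisely because /dev/cpuset is a convention, not a guarantee; cpuset_mounted is a hypothetical helper:

```shell
# Hypothetical helper: report whether the cpuset pseudo-filesystem
# appears to be mounted at the given point, judged by the presence of
# the tasks pseudo-file.
cpuset_mounted() {
    if [ -f "$1/tasks" ]; then
        printf 'mounted\n'
    else
        printf 'not mounted\n'
    fi
}

cpuset_mounted /dev/cpuset
```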
+.\" ================== WARNINGS ==================
+.SH WARNINGS
+.SS Enabling memory_pressure
+By default, the per-cpuset file
+.I cpuset.memory_pressure
+always contains zero (0).
+Unless this feature is enabled by writing "1" to the pseudo-file
+.IR /dev/cpuset/cpuset.memory_pressure_enabled ,
+the kernel does
+not compute per-cpuset
+.IR memory_pressure .
+.SS Using the echo command
+When using the
+.B echo
+command at the shell prompt to change the values of cpuset files,
+beware that the built-in
+.B echo
+command in some shells does not display an error message if the
+.BR write (2)
+system call fails.
+.\" Gack! csh(1)'s echo does this
+For example, if the command:
+.PP
+.in +4n
+.EX
+echo 19 > cpuset.mems
+.EE
+.in
+.PP
+failed because memory node 19 was not allowed (perhaps
+the current system does not have a memory node 19), then the
+.B echo
+command might not display any error.
+It is better to use the
+.B /bin/echo
+external command to change cpuset file settings, as this
+command will display
+.BR write (2)
+errors, as in the example:
+.PP
+.in +4n
+.EX
+/bin/echo 19 > cpuset.mems
+/bin/echo: write error: Invalid argument
+.EE
+.in
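One way to make such failures visible in a script is to wrap the write and check its exit status. The sketch below uses a hypothetical helper built on printf(1), whose exit status reliably reflects a failed write(2) in POSIX shells:

```shell
# Hypothetical helper: write VALUE to FILE, printing "ok" on success
# and "failed" (plus a diagnostic on stderr) if the write fails, for
# example because the value names a nonexistent memory node.
set_cpuset_value() {    # set_cpuset_value FILE VALUE
    if printf '%s\n' "$2" 2>/dev/null > "$1"; then
        printf 'ok\n'
    else
        printf 'write to %s failed\n' "$1" >&2
        printf 'failed\n'
    fi
}
```

Unlike a bare built-in echo, this pattern never silently discards the error.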
+.\" ================== EXCEPTIONS ==================
+.SH EXCEPTIONS
+.SS Memory placement
+Not all allocations of system memory are constrained by cpusets,
+for the following reasons.
+.PP
+If hot-plug functionality is used to remove all the CPUs that are
+currently assigned to a cpuset, then the kernel will automatically
+update the
+.I cpus_allowed
+of all processes attached to CPUs in that cpuset
+to allow all CPUs.
+When memory hot-plug functionality for removing
+memory nodes is available, a similar exception is expected to apply
+there as well.
+In general, the kernel prefers to violate cpuset placement,
+rather than starving a process that has had all its allowed CPUs or
+memory nodes taken offline.
+User code should reconfigure cpusets to refer only to online CPUs
+and memory nodes when using hot-plug to add or remove such resources.
+.PP
+A few kernel-critical, internal memory-allocation requests, marked
+GFP_ATOMIC, must be satisfied immediately.
+The kernel may drop some
+request or malfunction if one of these allocations fails.
+If such a request cannot be satisfied within the current process's cpuset,
+then the kernel relaxes the cpuset constraint and looks for memory
+anywhere it can find it.
+It is better to violate the cpuset than to stress the kernel.
+.PP
+Allocations of memory requested by kernel drivers while processing
+an interrupt lack any relevant process context, and are not confined
+by cpusets.
+.SS Renaming cpusets
+You can use the
+.BR rename (2)
+system call to rename cpusets.
+Only simple renaming is supported; that is, changing the name of a cpuset
+directory is permitted, but moving a directory into
+a different directory is not permitted.
+.\" ================== ERRORS ==================
+.SH ERRORS
+The Linux kernel implementation of cpusets sets
+.I errno
+to specify the reason for a failed system call affecting cpusets.
+.PP
+The possible
+.I errno
+settings and their meaning when set on
+a failed cpuset call are listed below.
+.TP
+.B E2BIG
+Attempted a
+.BR write (2)
+on a special cpuset file
+with a length larger than some kernel-determined upper
+limit on the length of such writes.
+.TP
+.B EACCES
+Attempted to
+.BR write (2)
+the process ID (PID) of a process to a cpuset
+.I tasks
+file when one lacks permission to move that process.
+.TP
+.B EACCES
+Attempted to add, using
+.BR write (2),
+a CPU or memory node to a cpuset, when that CPU or memory node was
+not already in its parent.
+.TP
+.B EACCES
+Attempted to set, using
+.BR write (2),
+.I cpuset.cpu_exclusive
+or
+.I cpuset.mem_exclusive
+on a cpuset whose parent lacks the same setting.
+.TP
+.B EACCES
+Attempted to
+.BR write (2)
+a
+.I cpuset.memory_pressure
+file.
+.TP
+.B EACCES
+Attempted to create a file in a cpuset directory.
+.TP
+.B EBUSY
+Attempted to remove, using
+.BR rmdir (2),
+a cpuset with attached processes.
+.TP
+.B EBUSY
+Attempted to remove, using
+.BR rmdir (2),
+a cpuset with child cpusets.
+.TP
+.B EBUSY
+Attempted to remove
+a CPU or memory node from a cpuset
+that is also in a child of that cpuset.
+.TP
+.B EEXIST
+Attempted to create, using
+.BR mkdir (2),
+a cpuset that already exists.
+.TP
+.B EEXIST
+Attempted to
+.BR rename (2)
+a cpuset to a name that already exists.
+.TP
+.B EFAULT
+Attempted to
+.BR read (2)
+or
+.BR write (2)
+a cpuset file using
+a buffer that is outside the writing process's accessible address space.
+.TP
+.B EINVAL
+Attempted to change a cpuset, using
+.BR write (2),
+in a way that would violate a
+.I cpu_exclusive
+or
+.I mem_exclusive
+attribute of that cpuset or any of its siblings.
+.TP
+.B EINVAL
+Attempted to
+.BR write (2)
+an empty
+.I cpuset.cpus
+or
+.I cpuset.mems
+list to a cpuset which has attached processes or child cpusets.
+.TP
+.B EINVAL
+Attempted to
+.BR write (2)
+a
+.I cpuset.cpus
+or
+.I cpuset.mems
+list which included a range with the second number smaller than
+the first number.
+.TP
+.B EINVAL
+Attempted to
+.BR write (2)
+a
+.I cpuset.cpus
+or
+.I cpuset.mems
+list which included an invalid character in the string.
+.TP
+.B EINVAL
+Attempted to
+.BR write (2)
+a list to a
+.I cpuset.cpus
+file that did not include any online CPUs.
+.TP
+.B EINVAL
+Attempted to
+.BR write (2)
+a list to a
+.I cpuset.mems
+file that did not include any online memory nodes.
+.TP
+.B EINVAL
+Attempted to
+.BR write (2)
+a list to a
+.I cpuset.mems
+file that included a node that held no memory.
+.TP
+.B EIO
+Attempted to
+.BR write (2)
+a string to a cpuset
+.I tasks
+file that
+does not begin with an ASCII decimal integer.
+.TP
+.B EIO
+Attempted to
+.BR rename (2)
+a cpuset into a different directory.
+.TP
+.B ENAMETOOLONG
+Attempted to
+.BR read (2)
+a
+.IR /proc/ pid /cpuset
+file for a cpuset path that is longer than the kernel page size.
+.TP
+.B ENAMETOOLONG
+Attempted to create, using
+.BR mkdir (2),
+a cpuset whose base directory name is longer than 255 characters.
+.TP
+.B ENAMETOOLONG
+Attempted to create, using
+.BR mkdir (2),
+a cpuset whose full pathname,
+including the mount point (typically "/dev/cpuset/") prefix,
+is longer than 4095 characters.
+.TP
+.B ENODEV
+The cpuset was removed by another process at the same time as a
+.BR write (2)
+was attempted on one of the pseudo-files in the cpuset directory.
+.TP
+.B ENOENT
+Attempted to create, using
+.BR mkdir (2),
+a cpuset in a parent cpuset that doesn't exist.
+.TP
+.B ENOENT
+Attempted to
+.BR access (2)
+or
+.BR open (2)
+a nonexistent file in a cpuset directory.
+.TP
+.B ENOMEM
+Insufficient memory is available within the kernel; can occur
+on a variety of system calls affecting cpusets, but only if the
+system is extremely short of memory.
+.TP
+.B ENOSPC
+Attempted to
+.BR write (2)
+the process ID (PID)
+of a process to a cpuset
+.I tasks
+file when the cpuset had an empty
+.I cpuset.cpus
+or empty
+.I cpuset.mems
+setting.
+.TP
+.B ENOSPC
+Attempted to
+.BR write (2)
+an empty
+.I cpuset.cpus
+or
+.I cpuset.mems
+setting to a cpuset that
+has tasks attached.
+.TP
+.B ENOTDIR
+Attempted to
+.BR rename (2)
+a nonexistent cpuset.
+.TP
+.B EPERM
+Attempted to remove a file from a cpuset directory.
+.TP
+.B ERANGE
+Specified a
+.I cpuset.cpus
+or
+.I cpuset.mems
+list to the kernel which included a number too large for the kernel
+to set in its bit masks.
+.TP
+.B ESRCH
+Attempted to
+.BR write (2)
+the process ID (PID) of a nonexistent process to a cpuset
+.I tasks
+file.
+.\" ================== VERSIONS ==================
+.SH VERSIONS
+Cpusets appeared in Linux 2.6.12.
+.\" ================== NOTES ==================
+.SH NOTES
+Despite its name, the
+.I pid
+parameter is actually a thread ID,
+and each thread in a thread group can be attached to a different
+cpuset.
+The value returned from a call to
+.BR gettid (2)
+can be passed in the argument
+.IR pid .
+.\" ================== BUGS ==================
+.SH BUGS
+.I cpuset.memory_pressure
+cpuset files can be opened
+for writing, creation, or truncation, but then the
+.BR write (2)
+fails with
+.I errno
+set to
+.BR EACCES ,
+and the creation and truncation options on
+.BR open (2)
+have no effect.
+.\" ================== EXAMPLES ==================
+.SH EXAMPLES
+The following examples demonstrate querying and setting cpuset
+options using shell commands.
+.SS Creating and attaching to a cpuset
+To create a new cpuset and attach the current command shell to it,
+the steps are:
+.PP
+.PD 0
+.IP (1) 5
+mkdir /dev/cpuset (if not already done)
+.IP (2)
+mount \-t cpuset none /dev/cpuset (if not already done)
+.IP (3)
+Create the new cpuset using
+.BR mkdir (1).
+.IP (4)
+Assign CPUs and memory nodes to the new cpuset.
+.IP (5)
+Attach the shell to the new cpuset.
+.PD
+.PP
+For example, the following sequence of commands will set up a cpuset
+named "Charlie", containing just CPUs 2 and 3, and memory node 1,
+and then attach the current shell to that cpuset.
+.PP
+.in +4n
+.EX
+.RB "$" " mkdir /dev/cpuset"
+.RB "$" " mount \-t cpuset cpuset /dev/cpuset"
+.RB "$" " cd /dev/cpuset"
+.RB "$" " mkdir Charlie"
+.RB "$" " cd Charlie"
+.RB "$" " /bin/echo 2\-3 > cpuset.cpus"
+.RB "$" " /bin/echo 1 > cpuset.mems"
+.RB "$" " /bin/echo $$ > tasks"
+# The current shell is now running in cpuset Charlie
+# The next line should display \[aq]/Charlie\[aq]
+.RB "$" " cat /proc/self/cpuset"
+.EE
+.in
+.\"
+.SS Migrating a job to different memory nodes
+To migrate a job (the set of processes attached to a cpuset)
+to different CPUs and memory nodes in the system, including moving
+the memory pages currently allocated to that job,
+perform the following steps.
+.PP
+.PD 0
+.IP (1) 5
+Let's say we want to move the job in cpuset
+.I alpha
+(CPUs 4\[en]7 and memory nodes 2\[en]3) to a new cpuset
+.I beta
+(CPUs 16\[en]19 and memory nodes 8\[en]9).
+.IP (2)
+First create the new cpuset
+.IR beta .
+.IP (3)
+Then allow CPUs 16\[en]19 and memory nodes 8\[en]9 in
+.IR beta .
+.IP (4)
+Then enable
+.I memory_migration
+in
+.IR beta .
+.IP (5)
+Then move each process from
+.I alpha
+to
+.IR beta .
+.PD
+.PP
+The following sequence of commands accomplishes this.
+.PP
+.in +4n
+.EX
+.RB "$" " cd /dev/cpuset"
+.RB "$" " mkdir beta"
+.RB "$" " cd beta"
+.RB "$" " /bin/echo 16\-19 > cpuset.cpus"
+.RB "$" " /bin/echo 8\-9 > cpuset.mems"
+.RB "$" " /bin/echo 1 > cpuset.memory_migrate"
+.RB "$" " while read i; do /bin/echo $i; done < ../alpha/tasks > tasks"
+.EE
+.in
+.PP
+The above should move any processes in
+.I alpha
+to
+.IR beta ,
+and any memory held by these processes on memory nodes 2\[en]3 to memory
+nodes 8\[en]9, respectively.
+.PP
+Notice that the last step of the above sequence did not do:
+.PP
+.in +4n
+.EX
+.RB "$" " cp ../alpha/tasks tasks"
+.EE
+.in
+.PP
+The
+.I while
+loop, rather than the seemingly easier use of the
+.BR cp (1)
+command, was necessary because
+only one process PID at a time may be written to the
+.I tasks
+file.
+.PP
+The same effect (writing one PID at a time) as the
+.I while
+loop can be accomplished more efficiently, in fewer keystrokes and in
+syntax that works on any shell, but alas more obscurely, by using the
+.B \-u
+(unbuffered) option of
+.BR sed (1):
+.PP
+.in +4n
+.EX
+.RB "$" " sed \-un p < ../alpha/tasks > tasks"
+.EE
+.in
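That both the loop and the sed command deliver exactly one line per write(2) can be demonstrated without a cpuset filesystem at all, using ordinary temporary files as stand-ins for the tasks files (a sketch; the PIDs are invented):

```shell
src=$(mktemp)
dst1=$(mktemp)
dst2=$(mktemp)
printf '%s\n' 101 102 103 > "$src"    # stand-ins for real PIDs

# One write(2) per PID, as the real tasks file requires:
while read -r i; do /bin/echo "$i"; done < "$src" > "$dst1"

# The same effect with unbuffered sed:
sed -un p < "$src" > "$dst2"

cmp -s "$dst1" "$dst2" && echo identical    # prints identical
```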
+.\" ================== SEE ALSO ==================
+.SH SEE ALSO
+.BR taskset (1),
+.BR get_mempolicy (2),
+.BR getcpu (2),
+.BR mbind (2),
+.BR sched_getaffinity (2),
+.BR sched_setaffinity (2),
+.BR sched_setscheduler (2),
+.BR set_mempolicy (2),
+.BR CPU_SET (3),
+.BR proc (5),
+.BR cgroups (7),
+.BR numa (7),
+.BR sched (7),
+.BR migratepages (8),
+.BR numactl (8)
+.PP
+.I Documentation/admin\-guide/cgroup\-v1/cpusets.rst
+in the Linux kernel source tree
+.\" commit 45ce80fb6b6f9594d1396d44dd7e7c02d596fef8
+(or
+.I Documentation/cgroup\-v1/cpusets.txt
+before Linux 4.18, and
+.I Documentation/cpusets.txt
+before Linux 2.6.29)